The study presents a methodology for the ‘Nuanced Arabic Dialect Identification (NADI) Shared Task 2023’. It focuses on country-level dialect identification, which is crucial for Natural Language Processing (NLP) tasks such as speech recognition and translation. The authors treat the task as a multiclass classification problem over the Twitter dataset (TWT-2023), which covers 18 dialects. They fine-tune several transformer-based models, pre-trained on Arabic text, on the provided dataset and combine them with an ensembling method to improve system performance. The approach achieved an F1-score of 76.65.
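The summary does not say which ensembling method the authors used. As an illustration only, the sketch below shows one common option, soft voting, in which each fine-tuned model's class probabilities are averaged and the highest-scoring class is selected. The function names, the three hypothetical models, and the three-class logits are assumptions made for brevity; they are not taken from the paper, whose setting has 18 dialect classes.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_model_logits):
    """Average class probabilities across models.

    per_model_logits: one list of class logits per model, all for the same input.
    Returns (index of the winning class, averaged probabilities).
    """
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg


# Hypothetical logits from three fine-tuned models for a single tweet.
# Three classes are used here for readability; the task itself has 18.
example_logits = [
    [2.0, 1.0, 0.1],  # model A favours class 0
    [0.5, 2.5, 0.3],  # model B favours class 1
    [0.2, 2.0, 0.1],  # model C favours class 1
]
label, averaged = soft_vote(example_logits)
```

Here two of the three hypothetical models favour class 1, so the averaged distribution does as well. Hard (majority) voting or weighted averaging would be equally plausible readings of "an ensembling method".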
Publication date: 1 Dec 2023
Project Page: unavailable
Paper: https://arxiv.org/pdf/2311.18739