TY  - CONF
T1  - AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
T2 - 31st International Conference on Computational Linguistics, COLING 2025
AU - Mousi, Basel
AU - Durrani, Nadir
AU - Ahmad, Fatema
AU - Hasan, Md Arid
AU - Hasanain, Maram
AU - Kabbani, Tameem
AU - Dalvi, Fahim
AU - Chowdhury, Shammur Absar
AU - Alam, Firoj
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
AB - Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ≈45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We have released the dialectal translation models and benchmarks developed in this study.
UR - http://www.scopus.com/inward/record.url?scp=85218505285&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85218505285
T3 - Proceedings - International Conference on Computational Linguistics, COLING
SP - 4186
EP - 4218
BT - Main Conference
A2 - Rambow, Owen
A2 - Wanner, Leo
A2 - Apidianaki, Marianna
A2 - Al-Khalifa, Hend
A2 - Di Eugenio, Barbara
A2 - Schockaert, Steven
PB - Association for Computational Linguistics (ACL)
Y2 - 19 January 2025 through 24 January 2025
ER -