AraBench: Benchmarking Dialectal Arabic-English Machine Translation

Hassan Sajjad, Ahmed Abdelali, Nadir Durrani, Fahim Dalvi

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

22 Citations (Scopus)

Abstract

Low-resource machine translation suffers from the scarcity of training data and the unavailability of standard evaluation sets. While a number of research efforts target the former, the unavailability of evaluation benchmarks remain a major hindrance in tracking the progress in low-resource machine translation. In this paper, we introduce AraBench, an evaluation suite for dialectal Arabic to English machine translation. Compared to Modern Standard Arabic, Arabic dialects are challenging due to their spoken nature, non-standard orthography, and a large variation in dialectness. To this end, we pool together already available Dialectal Arabic-English resources and additionally build novel test sets. AraBench offers 4 coarse, 15 fine-grained and 25 city-level dialect categories, belonging to diverse genres, such as media, chat, religion and travel with varying level of dialectness. We report strong baselines using several training settings: fine-tuning, back-translation and data augmentation. The evaluation suite opens a wide range of research frontiers to push efforts in low-resource machine translation, particularly Arabic dialect translation. The evaluation suite1 and the dialectal system2 are publicly available for research purposes.

Original languageEnglish
Title of host publicationCOLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference
EditorsDonia Scott, Nuria Bel, Chengqing Zong
PublisherAssociation for Computational Linguistics (ACL)
Pages5094-5107
Number of pages14
ISBN (Electronic)9781952148279
Publication statusPublished - 2020
Event28th International Conference on Computational Linguistics, COLING 2020 - Virtual, Online, Spain
Duration: 8 Dec 202013 Dec 2020

Publication series

NameCOLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference

Conference

Conference28th International Conference on Computational Linguistics, COLING 2020
Country/TerritorySpain
CityVirtual, Online
Period8/12/2013/12/20

Fingerprint

Dive into the research topics of 'AraBench: Benchmarking Dialectal Arabic-English Machine Translation'. Together they form a unique fingerprint.

Cite this