Verifiably effective Arabic dialect identification

Kareem Darwish, Hassan Sajjad, Hamdy Mubarak

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

47 Citations (Scopus)

Abstract

Several recent papers on Arabic dialect identification have hinted that using a word unigram model is sufficient and effective for the task. However, most previous work was done on a standard, fairly homogeneous dataset of dialectal user comments. In this paper, we show that training on the standard dataset does not generalize, because a unigram model may be tuned to topics in the comments and does not capture the distinguishing features of dialects. We show that effective dialect identification requires that we account for the distinguishing lexical, morphological, and phonological phenomena of dialects. We show that accounting for such phenomena can improve dialect detection accuracy by nearly 10% absolute.

Original language: English
Title of host publication: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 1465-1468
Number of pages: 4
ISBN (Electronic): 9781937284961
Publication status: Published - 2014
Event: 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014 - Doha, Qatar
Duration: 25 Oct 2014 to 29 Oct 2014

Publication series

Name: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference: 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014
Country/Territory: Qatar
City: Doha
Period: 25/10/14 to 29/10/14
