A fine-grained annotated multi-dialectal Arabic corpus

Anis Charfi, Syed Hassan Mehdi, Wajdi Zaghouani, Esraa Mohamed

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Citations (Scopus)

Abstract

We present ARAP-Tweet 2.0, a corpus of 5 million dialectal Arabic tweets and 50 million words from about 3000 Twitter users in 17 Arab countries. Compared to the first version, the new corpus offers significant improvements in data volume and annotation quality. It is fully balanced with respect to dialect, gender, and three age groups: under 25 years, between 25 and 34, and 35 years and above. This paper describes the process of creating the corpus, from gathering the dialectal phrases used to find the users to annotating their accounts and retrieving their tweets. We also report on the evaluation of the annotation quality using inter-annotator agreement measures, which were applied to the whole corpus and not just a subset. The obtained agreement was substantial, with average Cohen's Kappa values of 0.99, 0.92, and 0.88 for the annotation of gender, dialect, and age respectively. We also discuss some challenges encountered when developing this corpus.

Original language: English
Title of host publication: International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings
Editors: Galia Angelova, Ruslan Mitkov, Ivelina Nikolova, Irina Temnikova
Publisher: Incoma Ltd
Pages: 198-204
Number of pages: 7
ISBN (Electronic): 9789544520557
Publication status: Published - 2019
Event: 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019 - Varna, Bulgaria
Duration: 2 Sept 2019 - 4 Sept 2019

Publication series

Name: International Conference Recent Advances in Natural Language Processing, RANLP
Volume: 2019-September
ISSN (Print): 1313-8502

Conference

Conference: 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019
Country/Territory: Bulgaria
City: Varna
Period: 2/09/19 - 4/09/19
