TY - GEN
T1 - A fine-grained annotated multi-dialectal Arabic corpus
AU - Charfi, Anis
AU - Mehdi, Syed Hassan
AU - Zaghouani, Wajdi
AU - Mohamed, Esraa
N1 - Publisher Copyright:
© 2019 Association for Computational Linguistics (ACL). All rights reserved.
PY - 2019
Y1 - 2019
N2 - We present ARAP-Tweet 2.0, a corpus of 5 million dialectal Arabic tweets and 50 million words of about 3000 Twitter users from 17 Arab countries. Compared to the first version, the new corpus has significant improvements in terms of the data volume and the annotation quality. It is fully balanced with respect to dialect, gender, and three age groups: under 25 years, between 25 and 34, and 35 years and above. This paper describes the process of creating the corpus starting from gathering the dialectal phrases to find the users, to annotating their accounts and retrieving their tweets. We also report on the evaluation of the annotation quality using the inter-annotator agreement measures which were applied to the whole corpus and not just a subset. The obtained results were substantial with average Cohen's Kappa values of 0.99, 0.92, and 0.88 for the annotation of gender, dialect, and age respectively. We also discuss some challenges encountered when developing this corpus.
AB - We present ARAP-Tweet 2.0, a corpus of 5 million dialectal Arabic tweets and 50 million words of about 3000 Twitter users from 17 Arab countries. Compared to the first version, the new corpus has significant improvements in terms of the data volume and the annotation quality. It is fully balanced with respect to dialect, gender, and three age groups: under 25 years, between 25 and 34, and 35 years and above. This paper describes the process of creating the corpus starting from gathering the dialectal phrases to find the users, to annotating their accounts and retrieving their tweets. We also report on the evaluation of the annotation quality using the inter-annotator agreement measures which were applied to the whole corpus and not just a subset. The obtained results were substantial with average Cohen's Kappa values of 0.99, 0.92, and 0.88 for the annotation of gender, dialect, and age respectively. We also discuss some challenges encountered when developing this corpus.
UR - http://www.scopus.com/inward/record.url?scp=85076459028&partnerID=8YFLogxK
U2 - 10.26615/978-954-452-056-4_023
DO - 10.26615/978-954-452-056-4_023
M3 - Conference contribution
AN - SCOPUS:85076459028
T3 - International Conference Recent Advances in Natural Language Processing, RANLP
SP - 198
EP - 204
BT - International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings
A2 - Angelova, Galia
A2 - Mitkov, Ruslan
A2 - Nikolova, Ivelina
A2 - Temnikova, Irina
PB - Incoma Ltd
T2 - 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019
Y2 - 2 September 2019 through 4 September 2019
ER -