A pilot study on Arabic multi-genre corpus diacritization annotation

Houda Bouamor, Wajdi Zaghouani, Mona Diab, Ossama Obeid, Kemal Oflazer, Mahmoud Ghoneim, Abdelati Hawwari

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

18 Citations (Scopus)

Abstract

Arabic script is typically underspecified for short vowels and other markup, collectively referred to as diacritics. Apart from the lexical ambiguity found in words, similar to that exhibited in other languages, the lack of diacritics in written Arabic adds another layer of ambiguity that is an artifact of the orthography. Diacritization of written text has a significant impact on Arabic NLP applications. In this paper, we present a pilot study on building a diacritized multi-genre corpus in Arabic. We annotate a sample of undiacritized words extracted from five text genres. We explore three annotation strategies: Basic, where we present only the bare undiacritized forms to the annotators; Intermediate (Basic forms plus their POS tags); and Advanced (automatically diacritized words). We present the impact of the annotation strategy on annotation quality. Moreover, we study different diacritization schemes in the process.
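The orthographic ambiguity described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `strip_diacritics`, the chosen set of code points, and the three example words are assumptions made here for illustration. The sketch removes the common Arabic diacritic marks (Unicode U+064B to U+0652, plus the superscript alef U+0670) and shows that three differently diacritized words collapse to the same bare form, which is the kind of form an annotator would see under the Basic strategy.

```python
# Illustrative sketch only; names and the example words are assumptions,
# not part of the study described in the abstract.

# Common Arabic diacritics: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Return the text with the diacritic marks listed above removed."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


# Three diacritized words built on the same letters (kaf, teh, beh),
# written with escapes so the combining marks are explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kutiba, "it was written"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"

# All three reduce to one undiacritized string, so the bare form is ambiguous.
bare_forms = {strip_diacritics(word) for word in (KATABA, KUTIBA, KUTUB)}
print(len(bare_forms))
```

Because the three words share a single bare form, an annotator or an automatic diacritizer has to rely on context (or, in the Intermediate strategy, on the POS tag) to choose among them.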

Original language: English
Title of host publication: 2nd Workshop on Arabic Natural Language Processing, ANLP 2015 - held at 53rd Annual Meeting of the Association for Computational Linguistics, ACL 2015 - Proceedings
Editors: Nizar Habash, Stephan Vogel, Kareem Darwish
Publisher: Association for Computational Linguistics (ACL)
Pages: 80-88
Number of pages: 9
ISBN (Electronic): 9781941643587
Publication status: Published - 2015
Externally published: Yes
Event: 2nd Workshop on Arabic Natural Language Processing, ANLP 2015 - Beijing, China
Duration: 30 Jul 2015 → …

Publication series

Name: 2nd Workshop on Arabic Natural Language Processing, ANLP 2015 - held at 53rd Annual Meeting of the Association for Computational Linguistics, ACL 2015 - Proceedings

Conference

Conference: 2nd Workshop on Arabic Natural Language Processing, ANLP 2015
Country/Territory: China
City: Beijing
Period: 30/07/15 → …
