Part-of-speech tagging for Arabic Gulf dialect using Bi-LSTM

Randah Alharbi, Walid Magdy, Kareem Darwish, Ahmed AbdelAli, Hamdy Mubarak

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

20 Citations (Scopus)

Abstract

Part-of-speech (POS) tagging is one of the most important and widely addressed areas in natural language processing (NLP). Effective POS taggers exist for many languages, including Arabic. However, POS research for Arabic has focused mainly on Modern Standard Arabic (MSA), while less attention has been directed towards Dialectal Arabic (DA). MSA is the formal variant, found mainly in news and formal textbooks, while DA is the informal spoken Arabic that varies across regions of the Arab world. DA is heavily used online owing to the spread of social media, which has increased research interest in building NLP tools for DA. Most research on DA focuses on Egyptian and Levantine, while much less attention is given to the Gulf dialect. In this paper, we present a POS tagger for the Arabic Gulf dialect that is more effective than currently available Arabic POS taggers. Our work includes preparing a POS tagging dataset, engineering multiple sets of features, and applying two machine learning methods, namely a Support Vector Machine (SVM) classifier and a bi-directional Long Short-Term Memory (Bi-LSTM) network for sequence modeling. We improve POS tagging accuracy for the Gulf dialect from 75% using a state-of-the-art MSA POS tagger to over 91% using a Bi-LSTM labeler.
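
To put the reported accuracy figures in perspective, the following is a minimal sketch of how token-level tagging accuracy and relative error reduction are commonly computed. It is not the authors' evaluation code: the function names `token_accuracy` and `relative_error_reduction` and the toy tag sequences are hypothetical, and the inputs 0.75 and 0.91 are the rounded figures from the abstract, with 0.91 treated as a lower bound.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Share of the baseline's errors that the new system removes."""
    baseline_err = 1.0 - baseline_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (new_acc - baseline_acc) / baseline_err


# Hypothetical toy tags, not taken from the paper's dataset.
gold = ["NOUN", "VERB", "PREP", "NOUN"]
pred = ["NOUN", "VERB", "NOUN", "NOUN"]
toy_accuracy = token_accuracy(gold, pred)  # 3 of 4 tags match

# Rounded figures from the abstract: 75% baseline, 91% as a lower bound.
reduction = relative_error_reduction(0.75, 0.91)
```

Under these rounded inputs, moving from 75% to 91% accuracy removes roughly 64% of the baseline tagger's errors; since the abstract reports "over 91%", the actual reduction would be somewhat higher.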

Original language: English
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 3925-3932
Number of pages: 8
ISBN (Electronic): 9791095546009
Publication status: Published - 2019
Externally published: Yes
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018

Publication series

Name: LREC 2018 - 11th International Conference on Language Resources and Evaluation

Conference

Conference: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country/Territory: Japan
City: Miyazaki
Period: 7/05/18 to 12/05/18

Keywords

  • Bidirectional Long Short Term Memory (Bi-LSTM)
  • Dialectal Arabic (DA)
  • Gulf Arabic (GA)
  • Part-of-Speech (POS)
