A single-model approach for Arabic segmentation, POS tagging, and named entity recognition

Abed Alhakim Freihat, Gabor Bella, Hamdy Mubarak, Fausto Giunchiglia

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Citations (Scopus)

Abstract

This paper presents an entirely new, one-million-word annotated corpus for comprehensive, machine-learning-based preprocessing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve the NLP tasks of word segmentation, POS tagging and named entity recognition as a single sequence labeling task. This single-component configuration results in faster operation and is able to provide state-of-the-art precision and recall according to our evaluations. The fine-grained tag set produced by our annotator greatly simplifies downstream tasks such as lemmatization. Provided as a trained OpenNLP component, the annotator is free for research purposes.

Original language: English
Title of host publication: 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-8
Number of pages: 8
ISBN (Electronic): 9781538645437
Publication status: Published - 6 Jun 2018
Event: 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018 - Algiers, Algeria
Duration: 25 Apr 2018 to 26 Apr 2018

Publication series

Name: 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018

Conference

Conference: 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018
Country/Territory: Algeria
City: Algiers
Period: 25/04/18 to 26/04/18

Keywords

  • Lemmatization
  • Machine learning
  • NLP
  • Named entity recognition
  • POS tagging
  • Segmentation
