Improving Arabic text categorization using transformer training diversification

Shammur A. Chowdhury, Ahmed Abdelali, Kareem Darwish, Joni Salminen, Soon-Gyo Jung, Bernard James Jansen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Automatic categorization of short texts, such as news headlines and social media posts, has many applications ranging from content analysis to recommendation systems. In this paper, we use such text categorization, i.e., labeling social media posts with categories such as ‘sports’, ‘politics’, and ‘human-rights’, to showcase the efficacy of models across different sources and varieties of Arabic. In doing so, we show that diversifying the training data, whether by using diverse training data for the specific task (an increase of 21% macro F1) or by using diverse data to pre-train a BERT model (an increase of 26% macro F1), leads to overall improvements in classification effectiveness. We also introduce two new Arabic text categorization datasets: the first is composed of social media posts from a popular Arabic news channel, covering Twitter, Facebook, and YouTube, and the second is composed of tweets from popular Arabic accounts. The posts in the former are almost exclusively authored in Modern Standard Arabic (MSA), while the tweets in the latter contain both MSA and dialectal Arabic.
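The macro F1 score used to report the gains above averages per-class F1 scores, weighting every category equally regardless of how many examples it has, which matters when some categories (e.g. ‘sports’) are far more frequent than others. A minimal sketch of the metric in plain Python (the label values here are illustrative, not taken from the paper's datasets):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: mean of per-class F1 scores,
    with each class contributing equally to the average."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        # Count true positives, false positives, and false negatives for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Illustrative labels for short social media posts:
gold = ["sports", "politics", "sports", "human-rights"]
pred = ["sports", "sports", "sports", "human-rights"]
print(macro_f1(gold, pred))  # 0.6
```

Because rare categories count as much as common ones, a model that improves on minority classes raises macro F1 even if overall accuracy barely moves, which is why the paper's reported 21% and 26% improvements reflect more balanced coverage across categories.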
Original language: English
Title of host publication: Proceedings of the Fifth Arabic Natural Language Processing Workshop
Number of pages: 11
Publication status: Published - 20 Dec 2020
