ADI17: A Fine-Grained Arabic Dialect Identification Dataset

Suwon Shon, Ahmed Ali, Younes Samih, Hamdy Mubarak, James Glass

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

41 Citations (Scopus)

Abstract

In this paper, we describe a method to collect dialectal speech from YouTube videos to create a large-scale Dialect Identification (DID) dataset. Using this method, we collected dialectal Arabic from known YouTube channels from 17 Arabic-speaking countries in the Middle East and North Africa. After a refinement process, a total of 3,000 hours of speech was available for training DID systems, with an additional 57 hours of speech for development and testing. For detailed evaluations, the DID data was divided into three sub-categories based on segment duration: short (less than 5s), medium (5-20s), and long (over 20s). We compare state-of-the-art DID techniques on these data and also analyze a DID system trained on them. Since the training and test data share the same channel domain, we also used the Multi-Genre Broadcast 3 (MGB-3) test set to evaluate under a domain-mismatched condition.
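
The duration split described in the abstract can be illustrated with a minimal sketch. The function name `duration_category` is hypothetical and is not taken from the paper or its released tooling; the handling of the exact 5s and 20s boundaries is an assumption read off the abstract's wording ("less than 5s", "5-20s", "over 20s").

```python
def duration_category(seconds: float) -> str:
    """Map a segment duration in seconds to a duration sub-category.

    Assumed boundaries (from the abstract's wording, not from released code):
    short is < 5s, medium is 5s to 20s inclusive, long is > 20s.
    """
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5.0:
        return "short"
    if seconds <= 20.0:
        return "medium"
    return "long"


# Example: bucket a few hypothetical segment durations.
durations = [2.3, 5.0, 12.7, 20.0, 45.1]
categories = [duration_category(d) for d in durations]
```

Reporting results separately for each bucket makes it possible to see how identification accuracy changes with the amount of speech available per segment.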

Original language: English
Title of host publication: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 8244-8248
Number of pages: 5
ISBN (Electronic): 9781509066315
Publication status: Published - May 2020
Event: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain
Duration: 4 May 2020 to 8 May 2020

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2020-May
ISSN (Print): 1520-6149

Conference

Conference: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Country/Territory: Spain
City: Barcelona
Period: 4/05/20 to 8/05/20

Keywords

  • Arabic dialect
  • Dataset
  • Dialect Identification
  • Language Identification
  • Large-scale
