MIT-QCRI Arabic dialect identification system for the 2017 multi-genre broadcast challenge

Suwon Shon, Ahmed Ali, James Glass

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

25 Citations (Scopus)

Abstract

To successfully annotate the Arabic speech content found in open-domain media broadcasts, it is essential to be able to process a diverse set of Arabic dialects. The 2017 Multi-Genre Broadcast challenge (MGB-3) offered two tasks: Arabic speech recognition and Arabic Dialect Identification (ADI). In this paper, we describe our efforts to create an ADI system for the MGB-3 challenge, with the goal of distinguishing among four major Arabic dialects as well as Modern Standard Arabic. Our research focused on dialect variability and on domain mismatches between the training and test domains. To achieve a robust ADI system, we explored both Siamese neural network models, to learn similarities and dissimilarities among Arabic dialects, and i-vector post-processing, to compensate for domain mismatches. Both acoustic and linguistic features were used for the final MGB-3 submissions, with the best primary system achieving 75% accuracy on the official 10-hour test set.

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 374-380
Number of pages: 7
ISBN (Electronic): 9781509047888
Publication status: Published - 2 Jul 2017
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration: 16 Dec 2017 → 20 Dec 2017

Publication series

Name: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Volume: 2018-January

Conference

Conference: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Country/Territory: Japan
City: Okinawa
Period: 16/12/17 → 20/12/17

Keywords

  • Arabic
  • Dialect Recognition
  • Domain Adaptation
  • MGB challenge
  • Siamese Network
