TY - GEN
T1 - ADI17
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
AU - Shon, Suwon
AU - Ali, Ahmed
AU - Samih, Younes
AU - Mubarak, Hamdy
AU - Glass, James
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - In this paper, we describe a method to collect dialectal speech from YouTube videos to create a large-scale Dialect Identification (DID) dataset. Using this method, we collected dialectal Arabic from known YouTube channels from 17 Arabic speaking countries in the Middle East and Northern Africa. After a refinement process, a total of 3,000 hours of speech was available for training DID systems, with an additional 57 hours of speech for development and testing. For detailed evaluations, the DID data was divided into three sub-categories based on the segment duration: short (less than 5s), medium (5-20s), and long (over 20s). We compare state-of-the-art DID techniques on these data, and also analyze a DID system trained on these data. Since the training and test data share the same channel domain, we also used the Multi-Genre Broadcast 3 (MGB-3) test set to evaluate on domain mismatched condition.
KW - Arabic dialect
KW - Dataset
KW - Dialect Identification
KW - Language Identification
KW - Large-scale
UR - http://www.scopus.com/inward/record.url?scp=85089241712&partnerID=8YFLogxK
U2 - 10.1109/ICASSP40776.2020.9052982
DO - 10.1109/ICASSP40776.2020.9052982
M3 - Conference contribution
AN - SCOPUS:85089241712
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8244
EP - 8248
BT - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 4 May 2020 through 8 May 2020
ER -