TY - JOUR
T1 - Arabic speech recognition by end-to-end, modular systems and human
AU - Hussein, Amir
AU - Watanabe, Shinji
AU - Ali, Ahmed
N1 - Publisher Copyright: © 2021
PY - 2022/1
Y1 - 2022/1
AB - Recent advances in automatic speech recognition (ASR) have achieved accuracy levels comparable to human transcribers, which led researchers to debate whether the machine has reached human performance. Previous work focused on the English language and modular hidden Markov model-deep neural network (HMM–DNN) systems. In this paper, we perform a comprehensive benchmarking of end-to-end transformer ASR, modular HMM–DNN ASR, and human speech recognition (HSR) on the Arabic language and its dialects. For the HSR, we evaluate linguist performance and lay-native speaker performance on a new dataset collected as a part of this study. For ASR, the end-to-end work led to 12.5%, 27.5%, and 33.8% WER, a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively. Our results suggest that human performance in the Arabic language is still considerably better than that of the machine, with an absolute WER gap of 3.5% on average.
KW - Dialectal Arabic
KW - End-to-end speech recognition
KW - Human speech recognition
KW - Modern Standard Arabic
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85112817071&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2021.101272
DO - 10.1016/j.csl.2021.101272
M3 - Article
AN - SCOPUS:85112817071
SN - 0885-2308
VL - 71
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101272
ER -