TY - GEN
T1 - WERD: Using social text spelling variants for evaluating dialectal speech recognition
T2 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
AU - Ali, Ahmed
AU - Nakov, Preslav
AU - Bell, Peter
AU - Renals, Steve
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/2
Y1 - 2017/7/2
AB - We study the problem of evaluating automatic speech recognition (ASR) systems that target dialectal speech input. A major challenge in this case is that the orthography of dialects is typically not standardized. From an ASR evaluation perspective, this means that there is no clear gold standard for the expected output, and several possible outputs could be considered correct according to different human annotators, which makes standard word error rate (WER) inadequate as an evaluation metric. Such a situation is typical for machine translation (MT), and thus we borrow ideas from an MT evaluation metric, namely TERp, an extension of translation error rate that is closely related to WER. In particular, in the process of comparing a hypothesis to a reference, we make use of spelling variants for words and phrases, which we mine from Twitter in an unsupervised fashion. Our experiments with evaluating ASR output for Egyptian Arabic, and further manual analysis, show that the resulting WERd (i.e., WER for dialects) metric, a variant of TERp, is more adequate than WER for evaluating dialectal ASR.
KW - ASR evaluation
KW - Automatic speech recognition
KW - dialectal ASR
KW - multi-reference WER
KW - word error rate
UR - http://www.scopus.com/inward/record.url?scp=85050551927&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2017.8268928
DO - 10.1109/ASRU.2017.8268928
M3 - Conference contribution
AN - SCOPUS:85050551927
T3 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
SP - 141
EP - 148
BT - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 December 2017 through 20 December 2017
ER -
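
Reading note: the abstract describes WERd as a WER variant that, when aligning a hypothesis against a reference, accepts known dialectal spelling variants as matches. The Python sketch below is a minimal illustration of that core idea only, under stated assumptions: it assumes a precomputed word-to-variants table, and it omits what the paper actually adds on top, namely the TERp machinery (word shifts, phrase-level variants) and the unsupervised mining of variants from Twitter. It is not the authors' implementation, and the example tokens and variant table are hypothetical.

# Illustrative sketch of a WERd-style metric: standard word-level edit
# distance, except that substituting a reference word with one of its
# known spelling variants costs nothing (treated as a match).
# NOT the paper's implementation; see the hedging note above.

def werd(reference, hypothesis, variants):
    """WER where known spelling variants count as correct matches.

    reference, hypothesis: lists of word tokens
    variants: dict mapping a reference word to a set of accepted spellings
    """
    def match(r, h):
        return h == r or h in variants.get(r, set())

    n, m = len(reference), len(hypothesis)
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if match(reference[i - 1], hypothesis[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m] / max(n, 1)

# Hypothetical romanized Egyptian Arabic example and variant table.
variants = {"keda": {"kda", "kedah"}}
ref = "ana mesh 3aref keda".split()
hyp = "ana mesh 3aref kda".split()
print(werd(ref, hyp, {}))        # plain WER: 0.25 (one substitution)
print(werd(ref, hyp, variants))  # WERd-style: 0.0 (variant accepted)

The design point the sketch isolates is that the only change relative to plain WER is the match predicate: once equivalence classes of spellings are available, the alignment itself is unchanged, which is why the paper can frame WERd as a lightweight variant of TERp/WER rather than a new metric.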