TY - GEN
T1 - Findings of the First Shared Task on Machine Translation Robustness
AU - Li, Xian
AU - Michel, Paul
AU - Anastasopoulos, Antonios
AU - Belinkov, Yonatan
AU - Durrani, Nadir
AU - Firat, Orhan
AU - Koehn, Philipp
AU - Neubig, Graham
AU - Pino, Juan
AU - Sajjad, Hassan
N1 - Publisher Copyright:
© 2019 Association for Computational Linguistics
PY - 2019
Y1 - 2019
AB - We share the findings of the first shared task on improving robustness of Machine Translation (MT). The task provides a testbed representing challenges facing MT models deployed in the real world, and facilitates new approaches to improve models' robustness to noisy input and domain mismatch. We focus on two language pairs (English-French and English-Japanese), and the submitted systems are evaluated on a blind test set consisting of noisy comments on Reddit and professionally sourced translations. As a new task, we received 23 submissions from 11 participating teams from universities, companies, national labs, etc. All submitted systems achieved large improvements over baselines, with the best improvement being +22.33 BLEU. We evaluated submissions by both human judgment and automatic evaluation (BLEU), which show high correlation (Pearson's r = 0.94 and 0.95). Furthermore, we conducted a qualitative analysis of the submitted systems using compare-mt, which revealed their salient differences in handling challenges in this task. Such analysis provides additional insights when there is occasional disagreement between human judgment and BLEU, e.g., systems better at producing colloquial expressions received higher scores from human judgment.
UR - http://www.scopus.com/inward/record.url?scp=85120937843&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85120937843
T3 - WMT 2019 - 4th Conference on Machine Translation, Proceedings of the Conference
SP - 91
EP - 102
BT - Shared Task Papers, Day 1
PB - Association for Computational Linguistics (ACL)
T2 - 4th Conference on Machine Translation, WMT 2019, held at the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
Y2 - 1 August 2019 through 2 August 2019
ER -