TY - GEN
T1 - Creating a Multilingual Dataset in Arabic and Croatian from Sports Videos Through a Data Processing Pipeline Combining ASR and MT
AU - Zaghouani, Wajdi
AU - Seljan, Sanja
AU - Dunđer, Ivan
AU - Yahiaoui, Rashid
AU - Al-Adwan, Amer
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
AB - Machine translation is attracting growing attention in the research community that deals with natural language processing, language resources and language technologies. It is considered one of the most important disruptive technologies, with immense implications and benefits for society. Closely related is the field of speech technologies, which enable tasks such as automatic speech recognition and speech generation. Both machine translation and automatic speech recognition are explored in this research. The main goal of this paper is to examine the possibilities and obstacles of combining automatic speech recognition with machine translation in a web-based audio-video environment and in a real-time setting, in the sports domain covering football matches, for the purpose of creating a multilingual dataset. The research is performed for two language pairs, English-Arabic and English-Croatian. Captions for videos containing live sports commentary were generated automatically by an automatic speech recognition approach, then machine-translated by a popular online machine translation service, and afterwards edited in three distinct processing phases that considered different aspects of human involvement. Quality was evaluated by native speakers with regard to the criterion of usability, and by applying BLEU, a widely used automatic machine translation quality metric.
KW - Automatic Speech Recognition
KW - BLEU
KW - Data Processing Pipeline
KW - Human Evaluation
KW - Machine Translation
KW - Multilingual Dataset Creation
KW - Post-Editing
KW - Quality Evaluation
KW - Terminology
UR - http://www.scopus.com/inward/record.url?scp=85219200550&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-80438-0_14
DO - 10.1007/978-3-031-80438-0_14
M3 - Conference contribution
AN - SCOPUS:85219200550
SN - 9783031804373
T3 - Communications in Computer and Information Science
SP - 182
EP - 195
BT - Arabic Language Processing
A2 - Hdioud, Boutaina
A2 - Aouragh, Si Lhoussain
PB - Springer Science and Business Media Deutschland GmbH
T2 - 8th International Conference on Arabic Language Processing, ICALP 2023
Y2 - 19 April 2024 through 20 April 2024
ER -