Creating a Multilingual Dataset in Arabic and Croatian from Sports Videos Through a Data Processing Pipeline Combining ASR and MT

Wajdi Zaghouani*, Sanja Seljan, Ivan Dunđer, Rashid Yahiaoui, Amer Al-Adwan

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Machine translation is attracting growing attention in the research community that deals with natural language processing, language resources and language technologies. It is considered one of the most important disruptive technologies, with immense implications and benefits for mankind. Closely related is the field of speech technologies, which enable tasks such as automatic speech recognition and speech generation. Both machine translation and automatic speech recognition are explored in this research. The main goal of this paper is to examine the possibilities and obstacles of combining automatic speech recognition with machine translation in a web-based audio-video environment and in a real-time setting, in the sports domain covering football matches, for the purpose of creating a multilingual dataset. The research is performed for two language pairs, English-Arabic and English-Croatian. Captions from videos that contain live sports commentary were automatically generated by an automatic speech recognition approach, then machine-translated by a popular online machine translation service, and afterwards edited in three distinct processing phases that considered different aspects of human involvement. Quality evaluations are performed by native speakers with regard to the criterion of usability, and by applying BLEU, the most prominent automatic machine translation quality metric today.
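The abstract names BLEU as the automatic quality metric. As a rough illustration of what that metric computes, the following is a minimal sketch of unsmoothed sentence-level BLEU against a single reference. It is not the authors' evaluation setup: the function names, whitespace tokenization, and the English example sentences are all assumptions made for illustration, and a real evaluation of Arabic or Croatian output would normally rely on an established tool with language-appropriate tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference (illustrative only)."""
    # Assumption: plain whitespace tokenization, chosen for simplicity.
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: a zero precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical post-edited reference and machine-translated hypothesis.
    reference = "the striker scores a late goal"
    hypothesis = "the striker scores a goal"
    print(f"BLEU = {sentence_bleu(reference, hypothesis):.3f}")
```

Because this sketch applies no smoothing, any hypothesis that shares no 4-gram with the reference scores zero, which is one reason BLEU is usually reported at corpus level rather than per sentence.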

Original language: English
Title of host publication: Arabic Language Processing
Subtitle of host publication: From Theory to Practice - 8th International Conference, ICALP 2023, Proceedings
Editors: Boutaina Hdioud, Si Lhoussain Aouragh
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 182-195
Number of pages: 14
ISBN (Print): 9783031804373
Publication status: Published - 2025
Externally published: Yes
Event: 8th International Conference on Arabic Language Processing, ICALP 2023 - Rabat, Morocco
Duration: 19 Apr 2024 → 20 Apr 2024

Publication series

Name: Communications in Computer and Information Science
Volume: 2340 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 8th International Conference on Arabic Language Processing, ICALP 2023
Country/Territory: Morocco
City: Rabat
Period: 19/04/24 → 20/04/24

Keywords

  • Automatic Speech Recognition
  • BLEU
  • Data Processing Pipeline
  • Human Evaluation
  • Machine Translation
  • Multilingual Dataset Creation
  • Post-Editing
  • Quality Evaluation
  • Terminology
