Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Training LLMs for low-resource languages usually relies on data augmentation with machine translation (MT) from English. However, translation brings a number of challenges: translating and curating huge amounts of content with high-end machine translation solutions is costly; the translated content carries over cultural biases; and if the translation is not faithful and accurate, data quality degrades, causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models with 1M-33M parameters on this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models on a small dataset of high-quality synthetic stories generated in Arabic by a capable LLM, representing 1% of the original training data. We show, using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the translation pitfalls. We illustrate the improvement through case studies of linguistic and cultural bias issues.

Original language: English
Title of host publication: ArabicNLP 2024 - 2nd Arabic Natural Language Processing Conference, Proceedings of the Conference
Editors: Nizar Habash, Houda Bouamor, Ramy Eskander, Nadi Tomeh, Ibrahim Abu Farha, Ahmed Abdelali, Samia Touileb, Injy Hamed, Yaser Onaizan, Bashar Alhafni, Wissam Antoun, Salam Khalifa, Hatem Haddad, Imed Zitouni, Badr AlKhamissi, Rawan Almatham, Khalil Mrini
Publisher: Association for Computational Linguistics (ACL)
Pages: 73-88
Number of pages: 16
ISBN (Electronic): 9798891761322
Publication status: Published - 2024
Event: 2nd Arabic Natural Language Processing Conference, ArabicNLP 2024 - Bangkok, Thailand
Duration: 16 Aug 2024 → …

Publication series

Name: ArabicNLP 2024 - 2nd Arabic Natural Language Processing Conference, Proceedings of the Conference

Conference

Conference: 2nd Arabic Natural Language Processing Conference, ArabicNLP 2024
Country/Territory: Thailand
City: Bangkok
Period: 16/08/24 → …
