QMorphVec: A Morphologically-Aware Embedding of Quranic Vocabulary

Doratossadat Dastgheib, Alireza Sahebi, Ehsan Khadangi, Ehsaneddin Asgari

Research output: Contribution to conference › Paper › peer-review

Abstract

Developing effective word representations that incorporate linguistic features and capture contextual information is an essential step in natural language processing (NLP) tasks. When working with a text corpus from a specific domain with profound meanings, such as the Holy Quran, deriving word representations based on domain-specific textual contexts is particularly valuable. In this research, we employ a context-masking approach to generate separate embedding spaces for Quranic roots, lemmas, and surface forms, and then project them into a common space through linear mapping. We demonstrate that our in-domain embeddings, trained solely on Quranic text and its morphological contexts, perform comparably to, and in some cases better than, OpenAI’s large embeddings while surpassing the multilingual XLM-R embeddings. Additionally, through qualitative analysis, we illustrate their utility in Quranic word analogy tasks. The code and the embeddings are available at: https://github.com/language-ml/QMorphVec.
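
The projection of one embedding space into another through a linear mapping can be illustrated with a minimal sketch. The abstract does not say how the mapping is fitted, so the least-squares fit below is an assumption rather than the authors' procedure. The vectors are made-up toy numbers, and the names `fit_linear_map`, `root_vecs`, and `lemma_vecs` are hypothetical.

```python
# Toy sketch: learn a linear map W so that source_vectors @ W ~= target_vectors.
# Assumption: ordinary least squares via the normal equations (not confirmed
# by the abstract). All vectors are illustrative, not QMorphVec embeddings.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ w = b for w by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; add more anchor pairs")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_linear_map(source, target):
    """Least-squares W from the normal equations (X^T X) W = X^T Y."""
    xt = transpose(source)
    return solve(matmul(xt, source), matmul(xt, target))

# Hypothetical paired vectors: a 2-d "root" space and a 3-d "lemma" space.
root_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
true_map = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
lemma_vecs = matmul(root_vecs, true_map)  # noise-free targets for the demo

W = fit_linear_map(root_vecs, lemma_vecs)

# Project an unseen root vector into the lemma space.
projected = matmul([[3.0, 2.0]], W)[0]
print(projected)
```

Because the toy targets are generated without noise, the fitted `W` recovers `true_map` up to floating-point error, and the unseen vector `[3.0, 2.0]` lands at approximately `[3, 8, 6]`. With real embeddings the fit would only be approximate, and the choice of paired items to align on is a design decision this sketch does not address.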
Original language: English
Publication status: Published - 1 Dec 2024
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: 10 Dec 2024 → 15 Dec 2024

Conference

Conference: 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Country/Territory: Canada
City: Vancouver
Period: 10/12/24 → 15/12/24
