Abstract
Developing effective word representations that incorporate linguistic features and capture contextual information is an essential step in natural language processing (NLP) tasks. When working with a text corpus from a specific domain with profound meanings, such as the Holy Quran, deriving word representations based on domain-specific textual contexts is particularly valuable. In this research, we employ a context-masking approach to generate separate embedding spaces for Quranic roots, lemmas, and surface forms, and then project them into a common space through linear mapping. We demonstrate that our in-domain embeddings, trained solely on Quranic text and its morphological contexts, perform comparably to, and in some cases better than, OpenAI's large embeddings, while surpassing the multilingual XLM-R embeddings. Additionally, through qualitative analysis, we illustrate their utility in Quranic word analogy tasks. The code and the embeddings are available at: https://github.com/language-ml/QMorphVec.
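The abstract's projection step, learning a linear map that carries one embedding space (e.g., root embeddings) into another (e.g., lemma embeddings), can be sketched as a least-squares problem over anchor pairs. The sketch below uses synthetic vectors and hypothetical dimensions; the paper's actual training setup and mapping objective may differ.

```python
import numpy as np

# Hypothetical illustration (not the paper's exact method): given paired
# vectors from a source space X and a target space Y, learn a linear map W
# minimizing ||XW - Y||_F, then use W to project new source vectors into
# the common space. Dimensions and data here are synthetic placeholders.
rng = np.random.default_rng(0)
d_src, d_tgt, n_pairs = 64, 64, 500

X = rng.standard_normal((n_pairs, d_src))     # source-space anchor vectors
W_true = rng.standard_normal((d_src, d_tgt))  # unknown ground-truth map
Y = X @ W_true                                # target-space anchor vectors

# Closed-form least-squares solution for W.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Project an unseen source vector into the common (target) space.
v = rng.standard_normal(d_src)
projected = v @ W
```

With more anchor pairs than dimensions and an exact linear relation, `lstsq` recovers the map, so `X @ W` reproduces `Y` up to numerical precision; with real embeddings the fit is approximate.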
| Original language | English |
|---|---|
| Publication status | Published - 1 Dec 2024 |
| Event | 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada. Duration: 10 Dec 2024 → 15 Dec 2024 |
Conference

| Conference | 38th Conference on Neural Information Processing Systems, NeurIPS 2024 |
|---|---|
| Country/Territory | Canada |
| City | Vancouver |
| Period | 10/12/24 → 15/12/24 |