CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval

Mohammad Mahdi Abootorabi, Ehsaneddin Asgari*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP’s audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval methods that rely on transcribing speech into text for subsequent text retrieval, especially in specific scenarios.
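
The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the conventional definitions (HITS@1 as the fraction of queries whose correct item ranks first, MRR as the mean reciprocal rank, and meanR as the mean rank of the correct item) and a similarity matrix in which the correct candidate for query i sits at index i. The function name `retrieval_metrics` and the toy scores are hypothetical and are not taken from the paper.

```python
def retrieval_metrics(similarity):
    """Compute HITS@1, MRR and mean rank from a square similarity matrix.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank of the correct candidate: 1 + number of candidates scoring
        # strictly higher (ties are resolved in the query's favour).
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)

    n = len(ranks)
    return {
        "hits@1": sum(1 for r in ranks if r == 1) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
        "mean_rank": sum(ranks) / n,
    }


# Toy scores (made up for illustration): correct items land at ranks 1, 2, 3.
toy = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.4, 0.6, 0.1],
]
metrics = retrieval_metrics(toy)
print(metrics)
```

Higher is better for HITS@1 and MRR, while lower is better for mean rank, so the three numbers are usually reported together.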

Original language: English
Title of host publication: Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Proceedings
Editors: Claudia Hauff, Craig Macdonald, Dietmar Jannach, Gabriella Kazai, Franco Maria Nardini, Fabio Pinelli, Fabrizio Silvestri, Nicola Tonellotto
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 10-20
Number of pages: 11
ISBN (Print): 9783031887161
Publication status: Published - 3 Apr 2025
Event: 47th European Conference on Information Retrieval, ECIR 2025 - Lucca, Italy
Duration: 6 Apr 2025 to 10 Apr 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 15575 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 47th European Conference on Information Retrieval, ECIR 2025
Country/Territory: Italy
City: Lucca
Period: 6/04/25 to 10/04/25

Keywords

  • Contrastive Learning
  • Multimodal IR
  • Speech Retrieval
