TY - GEN
T1 - Speech Representation Analysis Based on Inter- and Intra-Model Similarities
AU - El Kheir, Yassine
AU - Ali, Ahmed
AU - Chowdhury, Shammur Absar
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representations of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine different SSL models that vary in their training paradigm - contrastive (Wav2Vec2.0) and predictive (HuBERT) - and in model size (base and large). We explore these models at different levels of localization/distributivity of information, including (i) individual neurons; (ii) layer representations; (iii) attention weights; and (iv) a comparison of the representations with their fine-tuned counterparts. Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts. We have publicly released our code to facilitate further research.
AB - Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representations of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine different SSL models that vary in their training paradigm - contrastive (Wav2Vec2.0) and predictive (HuBERT) - and in model size (base and large). We explore these models at different levels of localization/distributivity of information, including (i) individual neurons; (ii) layer representations; (iii) attention weights; and (iv) a comparison of the representations with their fine-tuned counterparts. Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts. We have publicly released our code to facilitate further research.
KW - Inter- and Intra- Similarities
KW - Self-Supervised Learning
KW - Speech Models
UR - http://www.scopus.com/inward/record.url?scp=85202281466&partnerID=8YFLogxK
U2 - 10.1109/ICASSPW62465.2024.10669908
DO - 10.1109/ICASSPW62465.2024.10669908
M3 - Conference contribution
AN - SCOPUS:85202281466
SN - 979-8-3503-7452-0
T3 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
SP - 848
EP - 852
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 49th IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Y2 - 14 April 2024 through 19 April 2024
ER -