TY - GEN
T1 - What does an end-to-end dialect identification model learn about non-dialectal information?
AU - Chowdhury, Shammur A.
AU - Ali, Ahmed
AU - Shon, Suwon
AU - Glass, James
N1 - Publisher Copyright:
© 2020 ISCA
PY - 2020
Y1 - 2020
AB - An end-to-end dialect identification system generates the likelihood of each dialect given a speech utterance. Its performance relies on its ability to discriminate the acoustic properties of the different dialects, even though the input signal contains non-dialectal information such as speaker and channel. In this work, we study how non-dialectal information is encoded inside an end-to-end dialect identification model. We design several proxy tasks to understand the model's ability to represent speech input for differentiating non-dialectal information, such as (a) gender and voice identity of speakers, (b) languages, and (c) channel (recording and transmission) quality, and compare them with dialectal information (i.e., predicting the geographic region of the dialects). By analyzing non-dialectal representations from the layers of an end-to-end Arabic dialect identification (ADI) model, we observe that the model retains gender and channel information throughout the network while learning a speaker-invariant representation. Our findings also suggest that the CNN layers of the end-to-end model mirror feature extractors capturing voice-specific information, while the fully-connected layers encode more dialectal information.
KW - Dialect identification
KW - End-to-end model
KW - Interpretability
KW - Language identification
KW - Speaker information
UR - http://www.scopus.com/inward/record.url?scp=85098138928&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2235
DO - 10.21437/Interspeech.2020-2235
M3 - Conference contribution
AN - SCOPUS:85098138928
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 462
EP - 466
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -