TY - JOUR
T1 - On the effect of dropping layers of pre-trained transformer models
AU - Sajjad, Hassan
AU - Dalvi, Fahim
AU - Durrani, Nadir
AU - Nakov, Preslav
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/1
Y1 - 2022/1
N2 - Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by recent work on pruning and distilling pre-trained models, we explore strategies to drop layers from pre-trained models and observe the effect of pruning on downstream GLUE tasks. We were able to prune BERT, RoBERTa, and XLNet models by up to 40% while maintaining up to 98% of their original performance. Additionally, we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations, such as: (i) the lower layers are the most critical for maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to the dropping of layers, and (iii) models trained using different objective functions exhibit different learning patterns with respect to layer dropping.
AB - Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by recent work on pruning and distilling pre-trained models, we explore strategies to drop layers from pre-trained models and observe the effect of pruning on downstream GLUE tasks. We were able to prune BERT, RoBERTa, and XLNet models by up to 40% while maintaining up to 98% of their original performance. Additionally, we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations, such as: (i) the lower layers are the most critical for maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to the dropping of layers, and (iii) models trained using different objective functions exhibit different learning patterns with respect to layer dropping.
KW - Efficient transfer learning
KW - Interpretation and analysis
KW - Pre-trained transformer models
UR - http://www.scopus.com/inward/record.url?scp=85135933730&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2022.101429
DO - 10.1016/j.csl.2022.101429
M3 - Article
AN - SCOPUS:85135933730
SN - 0885-2308
VL - 77
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101429
ER -
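
Note: the abstract describes dropping layers from a pre-trained transformer before fine-tuning. Below is a minimal Python sketch of one such strategy (top-layer dropping) using the Hugging Face transformers library; it is an illustrative assumption of how such pruning could be done, not the paper's exact procedure. The helper name drop_top_layers and the choice keep=6 are hypothetical; the paper itself compares several layer-dropping strategies across BERT, RoBERTa, and XLNet.

import torch
from transformers import BertModel

def drop_top_layers(model: BertModel, keep: int) -> BertModel:
    # Keep only the lowest `keep` encoder layers of the pre-trained model.
    # The abstract reports that lower layers are the most critical, which
    # motivates dropping from the top.
    model.encoder.layer = torch.nn.ModuleList(model.encoder.layer[:keep])
    model.config.num_hidden_layers = keep
    return model

model = BertModel.from_pretrained("bert-base-uncased")   # 12 encoder layers
pruned = drop_top_layers(model, keep=6)                  # drop the top 6 layers,
                                                         # roughly the ~40% parameter
                                                         # reduction the abstract mentions
print(pruned.config.num_hidden_layers)                   # 6

The truncated model can then be passed to a standard GLUE fine-tuning loop in place of the full network.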