Pre-trained transformers: an empirical comparison

Silvia Casola, Ivano Lauriola, Alberto Lavelli
2022-01-01

Abstract

Pre-trained transformers have rapidly become very popular in the Natural Language Processing (NLP) community, surpassing the previous state of the art in a wide variety of tasks. While their effectiveness is indisputable, these methods are expensive to fine-tune on the target domain due to the high number of hyper-parameters; this aspect significantly affects the model selection phase and the reliability of the experimental assessment. This paper serves a double purpose: we first describe five popular transformer models and survey their typical use in previous literature, focusing on reproducibility; then, we compare them in a controlled environment over a wide range of NLP tasks. Our analysis reveals that only a minority of recent NLP papers that use pre-trained transformers report multiple runs (20%), standard deviation or statistical significance (10%), and other crucial information, seriously hurting replicability and reproducibility. Through a vast empirical comparison on real-world datasets and benchmarks, we also show how hyper-parameters and the initial seed impact results, and highlight the models' low robustness.
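
Since the abstract's central methodological point is that single-run numbers hide seed sensitivity, the sketch below illustrates the kind of multi-seed reporting it advocates. This is a minimal Python sketch, not the paper's actual protocol: fine_tune_and_evaluate is a hypothetical placeholder that returns simulated seed-dependent scores instead of actually fine-tuning a transformer, and the seed list is arbitrary.

    import random
    import statistics

    def fine_tune_and_evaluate(seed: int) -> float:
        # Hypothetical stand-in: in a real experiment this would fix all
        # sources of randomness with the given seed, fine-tune the
        # pre-trained transformer on the target task, and return a scalar
        # test metric (e.g. accuracy or F1). Simulated here for illustration.
        rng = random.Random(seed)
        return 0.85 + rng.gauss(0.0, 0.01)  # pretend seed-dependent score

    seeds = [13, 42, 87, 100, 12345]
    scores = [fine_tune_and_evaluate(s) for s in seeds]

    # Reporting the mean and standard deviation over several seeds exposes
    # run-to-run variability, instead of a single (possibly lucky) number.
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    print(f"score over {len(seeds)} seeds: {mean:.3f} +/- {std:.3f}")

Reporting results this way (and, where feasible, a significance test between systems) is exactly the information the survey found missing from roughly 80-90% of the papers examined.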

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/332930