Pre-trained transformers: an empirical comparison
Silvia Casola, Ivano Lauriola, Alberto Lavelli
2022-01-01
Abstract
Pre-trained transformers have rapidly become very popular in the Natural Language Processing (NLP) community, surpassing the previous state of the art in a wide variety of tasks. While their effectiveness is indisputable, these methods are expensive to fine-tune on a target domain due to the large number of hyper-parameters; this significantly affects the model selection phase and the reliability of the experimental assessment. This paper serves a double purpose: we first describe five popular transformer models and survey their typical use in previous literature, focusing on reproducibility; we then compare them in a controlled environment over a wide range of NLP tasks. Our analysis reveals that only a minority of recent NLP papers using pre-trained transformers report multiple runs (20%), standard deviation or statistical significance (10%), and other crucial information, seriously hurting replicability and reproducibility. Through an extensive empirical comparison on real-world datasets and benchmarks, we also show how hyper-parameters and the initial seed impact results, and highlight the models' low robustness.
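The abstract argues that results depend heavily on the initial seed and hyper-parameters, yet few papers report multiple runs or standard deviations. As a minimal sketch of the reporting practice the paper advocates (not the authors' actual code; `fine_tune_and_evaluate` is a hypothetical placeholder, and `torch`/`numpy` are assumed available), the snippet below repeats fine-tuning across several fixed seeds and reports the mean and standard deviation of the resulting scores.

```python
import random
import statistics

import numpy as np
import torch


def fine_tune_and_evaluate(model_name: str, seed: int, lr: float) -> float:
    """Hypothetical placeholder: fine-tune `model_name` on the target task
    with the given seed and learning rate, then return a test score."""
    # Fix all relevant sources of randomness before training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # ... actual fine-tuning and evaluation would happen here ...
    return 0.0  # placeholder score


# Repeat the experiment across several seeds and report mean and standard
# deviation -- the information the survey found missing in most papers.
seeds = [0, 1, 2, 3, 4]
scores = [fine_tune_and_evaluate("bert-base-uncased", s, lr=2e-5) for s in seeds]
print(f"mean={statistics.mean(scores):.3f}  std={statistics.stdev(scores):.3f}")
```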