Exploring the structure of BERT through Kernel Learning

Ivano Lauriola; Alberto Lavelli
2021

Abstract

Combining the internal representations of a pre-trained Transformer model, such as the popular BERT, is an interesting and challenging task. These representations are usually combined through simple heuristics, such as concatenating or averaging a subset of layers, which introduce additional hyper-parameters that must be calibrated during fine-tuning. Inspired by the recent literature, we propose a principled approach that optimally combines the internal representations of a Transformer model via Multiple Kernel Learning (MKL). Broadly speaking, the proposed system consists of two components: a canonical Transformer model fine-tuned on the target task, and a Multiple Kernel Learning algorithm that extracts the representations developed in the internal layers of the Transformer, combines them, and performs the final prediction. Most importantly, we use the system as a tool to inspect the information encoded in the Transformer network, highlighting the limits of state-of-the-art models.
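The abstract does not give implementation details; the following is a minimal sketch of the general idea, assuming a HuggingFace `bert-base-uncased` checkpoint and scikit-learn (both assumptions, not the authors' code). The canonical MKL combination is k(x, z) = Σ_r μ_r k_r(x, z) with μ_r ≥ 0; here uniform weights stand in for the coefficients that an MKL solver (e.g., EasyMKL) would learn from data.

```python
# Sketch: build one linear kernel per BERT layer from the [CLS]
# representations and combine them as k = sum_r mu_r * k_r.
# Uniform mu is a placeholder for MKL-learned weights.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.svm import SVC

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_embeddings(texts):
    """Return an (n_layers, n_samples, hidden) array of [CLS] vectors."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # out.hidden_states: tuple of (embedding layer + 12 encoder layers),
    # each of shape (n_samples, seq_len, hidden)
    return np.stack([h[:, 0, :].numpy() for h in out.hidden_states])

train_texts = ["a positive example", "a negative example"]  # toy data
y_train = np.array([1, 0])

reps = layer_embeddings(train_texts)                 # (L, n, d)
kernels = np.einsum("lnd,lmd->lnm", reps, reps)      # linear Gram matrix per layer
mu = np.ones(len(kernels)) / len(kernels)            # uniform weights (MKL would learn these)
K = np.tensordot(mu, kernels, axes=1)                # combined kernel, (n, n)

clf = SVC(kernel="precomputed").fit(K, y_train)
```

At prediction time, the same weights μ would be applied to the test-versus-train Gram matrices before calling `clf.predict`; inspecting the learned μ per layer is what enables the kind of probing of BERT's internal structure described above.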
Use this identifier to cite or link to this document: https://hdl.handle.net/11582/330947