Exploring the structure of BERT through Kernel Learning
Ivano Lauriola; Alberto Lavelli
2021-01-01
Abstract
Combining the internal representations of a pre-trained Transformer model, such as the popular BERT, is an interesting and challenging task. Usually, internal representations are combined through simple heuristics, e.g., concatenating or averaging a subset of layers, which in turn requires calibrating multiple hyper-parameters during the fine-tuning phase. Inspired by the recent literature, we propose a principled approach to optimally combine the internal representations of a Transformer model via Multiple Kernel Learning strategies. Broadly speaking, the proposed system consists of two elements. The first is a canonical Transformer model fine-tuned on the target task. The second is a Multiple Kernel Learning algorithm that extracts and combines the representations developed in the internal layers of the Transformer and performs predictions. Most importantly, we use the system as a powerful tool to inspect the information encoded in the Transformer network, emphasizing the limits of state-of-the-art models.
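
To make the two-stage pipeline described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the HuggingFace transformers and scikit-learn libraries, uses a hypothetical toy dataset, and replaces the paper's learned Multiple Kernel Learning combination with a uniform average of per-layer kernels as a simple stand-in.

# Minimal sketch of the two-stage pipeline: (1) extract per-layer [CLS]
# representations from a (possibly fine-tuned) BERT model, (2) build one
# kernel per layer, combine them, and train a kernel classifier.
# NOTE: the uniform average below is a simple stand-in; the paper proposes
# learning the combination weights with a Multiple Kernel Learning algorithm.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.svm import SVC

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_representations(texts):
    """Return one [CLS] embedding matrix per internal layer."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # hidden_states is a tuple of (num_layers + 1) tensors,
    # starting from the embedding layer; take the [CLS] position of each.
    return [h[:, 0, :].numpy() for h in out.hidden_states]

def linear_kernels(reps_a, reps_b):
    """One linear kernel matrix per layer."""
    return [Xa @ Xb.T for Xa, Xb in zip(reps_a, reps_b)]

# Hypothetical toy data, for illustration only.
train_texts = ["a positive example", "a negative example"]
train_labels = np.array([1, 0])

train_reps = layer_representations(train_texts)
K_list = linear_kernels(train_reps, train_reps)

# Combine the per-layer kernels. Uniform weights are used here;
# an MKL algorithm would learn these weights from the data.
K_combined = sum(K_list) / len(K_list)

clf = SVC(kernel="precomputed").fit(K_combined, train_labels)

In the paper's setting, the uniform weighting above would be replaced by a principled MKL combination (e.g., an algorithm such as EasyMKL), and the learned per-layer weights themselves serve as the inspection tool: they indicate how much each internal layer of the Transformer contributes to the target task.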