Neural end-to-end architectures have been recently proposed for spoken language translation (SLT), following the state-of-the-art results obtained in machine translation (MT) and speech recognition (ASR). Motivated by this contiguity, we propose an SLT adaptation of Transformer (the state-of-the-art architecture in MT), which exploits the integration of ASR solutions to cope with long input sequences featuring low information density. Long audio representations hinder the training of large models due to Transformer's quadratic memory complexity. Moreover, for the sake of translation quality, handling such sequences requires capturing both short- and long-range dependencies between bidimensional features. Focusing on Transformer's encoder, our adaptation is based on: i) downsampling the input with convolutional neural networks, which enables model training on non-cutting-edge GPUs, ii) modeling the bidimensional nature of the audio spectrogram with 2D components, and iii) adding a distance penalty to the attention, which is able to bias it towards short-range dependencies. Our experiments show that our SLT-adapted Transformer outperforms the RNN-based baseline both in translation quality and training time, setting state-of-the-art performance on six language directions.
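
Two of the ideas in the abstract can be illustrated with a minimal sketch: why shortening the input eases the quadratic memory cost, and how a distance penalty biases attention towards nearby positions. The abstract does not give the form of the penalty or the downsampling factor, so the logarithmic penalty, the factor of 4, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def log_distance_penalty(i, j):
    # Assumed penalty shape: zero for identical or adjacent positions,
    # then growing with the logarithm of the distance.
    d = abs(i - j)
    return math.log(d) if d > 0 else 0.0


def penalized_attention_weights(scores, penalty=log_distance_penalty):
    """Subtract a distance penalty from raw scores, then normalise each row."""
    return [
        softmax([s - penalty(i, j) for j, s in enumerate(row)])
        for i, row in enumerate(scores)
    ]


def attention_matrix_entries(num_frames, downsampling_factor=1):
    """Entries in one self-attention matrix after shortening the input."""
    length = math.ceil(num_frames / downsampling_factor)
    return length * length


# With uniform raw scores, any difference in the weights comes from the penalty.
uniform_scores = [[0.0] * 6 for _ in range(6)]
weights = penalized_attention_weights(uniform_scores)

# Hypothetical 1000-frame input, with and without 4x downsampling.
full = attention_matrix_entries(1000)
reduced = attention_matrix_entries(1000, downsampling_factor=4)
```

In this sketch the first query position gives more weight to its neighbours than to distant frames, and reducing the sequence length by a factor of 4 shrinks the attention matrix by a factor of 16.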
|Title:||Enhancing Transformer for End-to-end Speech-to-Text Translation|
|Publication date:||2019|
|Appears in categories:||4.1 Contribution in conference proceedings|