Enhancing Transformer for End-to-end Speech-to-Text Translation
Mattia Antonino Di Gangi, Matteo Negri, Roldano Cattoni, Marco Turchi
2019-01-01
Abstract
Neural end-to-end architectures have recently been proposed for spoken language translation (SLT), following the state-of-the-art results obtained in machine translation (MT) and speech recognition (ASR). Motivated by this contiguity, we propose an SLT adaptation of Transformer (the state-of-the-art architecture in MT), which exploits the integration of ASR solutions to cope with long input sequences featuring low information density. Long audio representations hinder the training of large models due to Transformer's quadratic memory complexity. Moreover, for the sake of translation quality, handling such sequences requires capturing both short- and long-range dependencies between bidimensional features. Focusing on Transformer's encoder, our adaptation is based on: i) downsampling the input with convolutional neural networks, which enables model training on non-cutting-edge GPUs, ii) modeling the bidimensional nature of the audio spectrogram with 2D components, and iii) adding a distance penalty to the attention, which biases it towards short-range dependencies. Our experiments show that our SLT-adapted Transformer outperforms the RNN-based baseline both in translation quality and training time, setting state-of-the-art performance on six language directions.
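To make the two encoder adaptations named in the abstract concrete, here is a minimal PyTorch sketch of (a) strided 2D convolutional downsampling of the spectrogram and (b) self-attention with a distance penalty on the logits. All names (`ConvDownsampler`, `distance_penalized_attention`), the channel count, and the logarithmic shape of the penalty are illustrative assumptions, not the paper's exact configuration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvDownsampler(nn.Module):
    """Two stride-2 2D convolutions over the (time, frequency) spectrogram.
    Each layer halves the time axis, so the encoder sees a ~4x shorter
    sequence, taming Transformer's quadratic attention memory cost.
    Layer sizes here are illustrative, not the paper's exact values."""
    def __init__(self, channels=16):
        super().__init__()
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, spec):               # spec: (batch, time, freq)
        x = spec.unsqueeze(1)              # -> (batch, 1, time, freq)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))          # -> (batch, C, ~time/4, ~freq/4)
        b, c, t, f = x.shape
        # Flatten channel and frequency axes into one feature vector per frame.
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)

def distance_penalized_attention(q, k, v):
    """Scaled dot-product attention with a distance penalty on the logits:
    logits[i, j] -= log(1 + |i - j|), biasing each position toward its
    neighbours (short-range dependencies) before the softmax. The
    logarithmic shape is one plausible choice, assumed for illustration."""
    d_k = q.size(-1)
    logits = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., T, T)
    t = logits.size(-1)
    idx = torch.arange(t, device=q.device)
    penalty = torch.log1p((idx[None, :] - idx[:, None]).abs().float())
    return F.softmax(logits - penalty, dim=-1) @ v
```

With 40-dimensional filterbank features and a 1,000-frame utterance, the downsampler emits roughly 250 frames, so the attention matrix shrinks by about 16x relative to operating on raw frames.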
File | Type | License | Size | Format
---|---|---|---|---
W19-6603.pdf (open access) | Post-print | Creative Commons | 455.16 kB | Adobe PDF