Enhancing Transformer for End-to-end Speech-to-Text Translation

Mattia Antonino Di Gangi;Matteo Negri;Roldano Cattoni;Marco Turchi
2019

Abstract

Neural end-to-end architectures have been recently proposed for spoken language translation (SLT), following the state-of-the-art results obtained in machine translation (MT) and speech recognition (ASR). Motivated by this contiguity, we propose an SLT adaptation of Transformer (the state-of-the-art architecture in MT), which exploits the integration of ASR solutions to cope with long input sequences featuring low information density. Long audio representations hinder the training of large models due to Transformer's quadratic memory complexity. Moreover, for the sake of translation quality, handling such sequences requires capturing both short- and long-range dependencies between bidimensional features. Focusing on Transformer's encoder, our adaptation is based on: i) downsampling the input with convolutional neural networks, which enables model training on non cutting-edge GPUs, ii) modeling the bidimensional nature of the audio spectrogram with 2D components, and iii) adding a distance penalty to the attention, which is able to bias it towards short-range dependencies. Our experiments show that our SLT-adapted Transformer outperforms the RNN-based baseline both in translation quality and training time, setting the state-of-the-art performance on six language directions.
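
As a rough illustration of point iii), the sketch below shows one way a distance penalty could bias self-attention towards nearby positions. The abstract does not state the form of the penalty, so this assumes a logarithmic penalty subtracted from the attention logits before the softmax. The function names (`softmax`, `log_distance_penalty`, `penalized_attention_weights`) are hypothetical and are not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_distance_penalty(i, j):
    """Assumed penalty: 0 for the same position, log of the distance otherwise."""
    d = abs(i - j)
    return math.log(d) if d > 0 else 0.0


def penalized_attention_weights(scores):
    """scores[i][j] is the raw attention logit of query i for key j.

    The penalty is subtracted from each logit before the softmax, so
    distant keys receive less weight than nearby ones for equal logits.
    """
    return [
        softmax([s - log_distance_penalty(i, j) for j, s in enumerate(row)])
        for i, row in enumerate(scores)
    ]


# With uniform logits, any difference in weight comes from the penalty alone.
weights = penalized_attention_weights([[0.0] * 5 for _ in range(5)])
print([round(w, 3) for w in weights[0]])
```

With uniform logits, the first query places its largest weights on positions 0 and 1 and progressively smaller weights on more distant positions. Because the penalty is added to the logits rather than applied as a hard mask, long-range dependencies remain reachable when the raw score is high enough.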
Files in this item:
W19-6603.pdf (open access)
Type: Post-print document
License: Creative Commons
Size: 455.16 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: http://hdl.handle.net/11582/319648