Adapting Transformer to End-to-End Spoken Language Translation

Mattia A. Di Gangi;Matteo Negri;Marco Turchi
2019

Abstract

Neural end-to-end architectures for sequence-to-sequence learning represent the state of the art in machine translation (MT) and automatic speech recognition (ASR). Their use is also promising for end-to-end spoken language translation (SLT), which combines the main challenges of ASR and MT. Exploiting existing neural architectures, however, requires task-specific adaptations. Transformer is a network that has obtained state-of-the-art results in MT with reduced training time. However, its direct application to speech input is hindered by two limitations of the self-attention network on which it is based: memory complexity that is quadratic in the input length, and the lack of explicit modeling of short-range dependencies between input features. High memory complexity constrains the size of the models that can be trained on a GPU, while the inadequate modeling of local dependencies harms final translation quality. This paper presents an adaptation of Transformer to end-to-end SLT that consists of: i) downsampling the input with convolutional neural networks to make the training process feasible on GPUs, ii) modeling the bidimensional nature of a spectrogram, and iii) adding a distance penalty to the attention, so as to bias it towards local context. SLT experiments on 8 language directions show that, with our adaptation, Transformer outperforms a strong RNN-based baseline with a significant reduction in training time.
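
The two limitations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the form of the distance penalty, so a logarithmic penalty on |i - j| is assumed here, and the function names, the penalty_weight parameter, and the omission of learned query/key projections are all illustrative choices. The last helper only counts attention-matrix entries to show why shortening the input (which the paper does with convolutional layers) reduces memory quadratically.

```python
import math


def distance_penalty(i, j):
    """Assumed penalty: log|i - j|, defined as 0 when the positions coincide."""
    d = abs(i - j)
    return math.log(d) if d > 0 else 0.0


def penalized_attention_weights(x, penalty_weight=1.0):
    """Row-wise softmax of scaled dot-product scores minus the distance penalty.

    x is a list of equal-length feature vectors, one per time step. Learned
    query/key projections are omitted to keep the sketch short.
    """
    dim = len(x[0])
    weights = []
    for i, query in enumerate(x):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            - penalty_weight * distance_penalty(i, j)
            for j, key in enumerate(x)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def attention_matrix_entries(length, downsampling=1):
    """Number of entries in one self-attention matrix after shortening the input."""
    shortened = math.ceil(length / downsampling)
    return shortened * shortened
```

With identical feature vectors at every position, the dot-product scores are all equal, so the resulting weights depend only on the penalty and decay with distance from the query position. Under the same assumptions, reducing a 1000-frame input by a factor of 4 shrinks the attention matrix from 1,000,000 to 62,500 entries.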

Use this identifier to cite or link to this document: http://hdl.handle.net/11582/319654