IRIS Institutional Research Information System

Neural end-to-end architectures for sequence-to-sequence learning represent the state of the art in machine translation (MT) and speech recognition (ASR). Their use is also promising for end-to-end spoken language translation (SLT), which combines the main challenges of ASR and MT. Exploiting existing neural architectures, however, requires task-specific adaptations. A network that has obtained state-of-the-art results in MT with reduced training time is Transformer. However, its direct application to speech input is hindered by two limitations of the self-attention network on which it is based: quadratic memory complexity and no explicit modeling of short-range dependencies between input features. High memory complexity poses constraints to the size of models trainable with a GPU, while the inadequate modeling of local dependencies harms final translation quality. This paper presents an adaptation of Transformer to end-to-end SLT that consists in: i) downsampling the input with convolutional neural networks to make the training process feasible on GPUs, ii) modeling the bidimensional nature of a spectrogram, and iii) adding a distance penalty to the attention, so to bias it towards local context. SLT experiments on 8 language directions show that, with our adaptation, Transformer outperforms a strong RNN-based baseline with a significant reduction in training time.

Adapting Transformer to End-to-End Spoken Language Translation

Mattia A. Di Gangi;Matteo Negri;Marco Turchi

2019-01-01

Abstract

Neural end-to-end architectures for sequence-to-sequence learning represent the state of the art in machine translation (MT) and speech recognition (ASR). Their use is also promising for end-to-end spoken language translation (SLT), which combines the main challenges of ASR and MT. Exploiting existing neural architectures, however, requires task-specific adaptations. A network that has obtained state-of-the-art results in MT with reduced training time is Transformer. However, its direct application to speech input is hindered by two limitations of the self-attention network on which it is based: quadratic memory complexity and no explicit modeling of short-range dependencies between input features. High memory complexity poses constraints to the size of models trainable with a GPU, while the inadequate modeling of local dependencies harms final translation quality. This paper presents an adaptation of Transformer to end-to-end SLT that consists in: i) downsampling the input with convolutional neural networks to make the training process feasible on GPUs, ii) modeling the bidimensional nature of a spectrogram, and iii) adding a distance penalty to the attention, so to bias it towards local context. SLT experiments on 8 language directions show that, with our adaptation, Transformer outperforms a strong RNN-based baseline with a significant reduction in training time.

Scheda breve

Scheda completa

Scheda completa (DC)

Anno

2019

Appare nelle tipologie:

4.1 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
3045.pdf accesso aperto Tipologia: Documento in Post-print Licenza: Creative commons Dimensione 291.92 kB Formato Adobe PDF Visualizza/Apri	291.92 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11582/319654

Citazioni

ND

social impact