Adapting Transformer to End-to-End Spoken Language Translation
Mattia A. Di Gangi, Matteo Negri, Marco Turchi
2019-01-01
Abstract
Neural end-to-end architectures for sequence-to-sequence learning represent the state of the art in machine translation (MT) and speech recognition (ASR). Their use is also promising for end-to-end spoken language translation (SLT), which combines the main challenges of ASR and MT. Exploiting existing neural architectures, however, requires task-specific adaptations. A network that has obtained state-of-the-art results in MT with reduced training time is Transformer. However, its direct application to speech input is hindered by two limitations of the self-attention network on which it is based: quadratic memory complexity and no explicit modeling of short-range dependencies between input features. The high memory complexity poses constraints on the size of models trainable with a GPU, while the inadequate modeling of local dependencies harms final translation quality. This paper presents an adaptation of Transformer to end-to-end SLT that consists of: i) downsampling the input with convolutional neural networks to make the training process feasible on GPUs, ii) modeling the two-dimensional nature of a spectrogram, and iii) adding a distance penalty to the attention so as to bias it towards local context. SLT experiments on 8 language directions show that, with our adaptation, Transformer outperforms a strong RNN-based baseline with a significant reduction in training time.
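The sketch below illustrates, in PyTorch, the two kinds of adaptation the abstract describes: a 2D convolutional front-end that downsamples the spectrogram along time and frequency before the Transformer encoder, and a self-attention layer whose logits are penalized by the distance between positions to bias it towards local context. Layer sizes, strides, and the logarithmic form of the penalty are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (assumed hyperparameters): CNN downsampling of a spectrogram
# plus distance-penalized self-attention, as motivated in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Conv2dDownsampler(nn.Module):
    """Downsample a (batch, time, freq) spectrogram ~4x in time with 2D CNNs."""

    def __init__(self, d_model: int, freq_bins: int = 40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # time/2, freq/2
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),  # time/4, freq/4
            nn.ReLU(),
        )
        self.proj = nn.Linear(16 * ((freq_bins + 3) // 4), d_model)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        x = self.conv(spec.unsqueeze(1))                  # (B, 16, T/4, F/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)    # flatten channels x freq
        return self.proj(x)                               # (B, T/4, d_model)


class LocalBiasedSelfAttention(nn.Module):
    """Single-head self-attention with a distance penalty on the logits."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = torch.matmul(q, k.transpose(-2, -1)) * self.scale   # (B, T, T)
        t = x.size(1)
        pos = torch.arange(t, device=x.device)
        dist = (pos[None, :] - pos[:, None]).abs()
        logits = logits - torch.log1p(dist.float())       # penalize distant positions
        return torch.matmul(F.softmax(logits, dim=-1), v)


if __name__ == "__main__":
    spec = torch.randn(2, 800, 40)                        # toy 8-second utterance
    h = Conv2dDownsampler(d_model=512)(spec)              # (2, 200, 512)
    out = LocalBiasedSelfAttention(d_model=512)(h)        # (2, 200, 512)
    print(out.shape)
```

The downsampling step is what keeps the quadratic memory cost of self-attention manageable: reducing the sequence length by a factor of 4 cuts the attention matrix by a factor of 16, while the distance penalty addresses the lack of explicit short-range modeling.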