
An Evaluation of Two Vocabulary Reduction Methods for Neural Machine Translation

Duygu Ataman; Marcello Federico
2018-01-01

Abstract

Neural machine translation (NMT) models are conventionally trained with fixed-size vocabularies to control the computational complexity and the quality of the learned word representations. This, however, limits the accuracy and the generalization capability of the models, especially for morphologically-rich languages, which usually have very sparse vocabularies containing rare inflected or derived word forms. Some studies have tried to overcome this problem by segmenting words into subword-level representations and modeling translation at this level. However, recent findings have shown that if these methods interrupt the word structure during segmentation, they may cause semantic or syntactic losses and lead to inaccurate translations. In order to investigate this phenomenon, we present an extensive evaluation of two unsupervised vocabulary reduction methods in NMT. The first is the well-known byte-pair encoding (BPE), a statistical subword segmentation method, whereas the second is linguistically-motivated vocabulary reduction (LMVR), a segmentation method which also considers the morphological properties of subwords. We compare both approaches on ten translation directions involving English and five other languages (Arabic, Czech, German, Italian and Turkish), each representing a distinct language family and morphological typology. LMVR obtains significantly better performance in most languages, showing gains proportional to the sparseness of the vocabulary and the morphological complexity of the tested language.
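For readers unfamiliar with the byte-pair encoding method the abstract refers to, the following is a minimal illustrative sketch of the BPE merge procedure (iteratively merging the most frequent adjacent symbol pair in a word-frequency table). The toy vocabulary and all function names are illustrative and not taken from the paper.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count frequencies of adjacent symbol pairs across the vocabulary.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    spaced, joined = " ".join(pair), "".join(pair)
    return {word.replace(spaced, joined): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    # Iteratively merge the most frequent adjacent pair, recording each merge.
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus: words pre-split into characters, "</w>" marks the word boundary.
toy_vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, final_vocab = learn_bpe(toy_vocab, 10)
```

Note that this purely frequency-based procedure merges character pairs regardless of morpheme boundaries; methods such as LMVR instead constrain segmentation using morphological properties, which is the distinction the paper evaluates.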

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/316268