An Evaluation of Two Vocabulary Reduction Methods for Neural Machine Translation
Duygu Ataman
Marcello Federico
2018-01-01
Abstract
Neural machine translation (NMT) models are conventionally trained with fixed-size vocabularies to control the computational complexity and the quality of the learned word representations. This, however, limits the accuracy and the generalization capability of the models, especially for morphologically-rich languages, which usually have very sparse vocabularies containing rare inflected or derived word forms. Some studies have tried to overcome this problem by segmenting words into subword-level representations and modeling translation at this level. However, recent findings have shown that if these methods disrupt the word structure during segmentation, they might cause semantic or syntactic losses and lead to inaccurate translations. In order to investigate this phenomenon, we present an extensive evaluation of two unsupervised vocabulary reduction methods in NMT. The first is the well-known byte-pair-encoding (BPE), a statistical subword segmentation method, whereas the second is linguistically-motivated vocabulary reduction (LMVR), a segmentation method which also considers morphological properties of subwords. We compare both approaches on ten translation directions involving English and five other languages (Arabic, Czech, German, Italian and Turkish), each representing a distinct language family and morphological typology. LMVR obtains significantly better performance in most languages, showing gains proportional to the sparseness of the vocabulary and the morphological complexity of the tested language.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.