Learning Machine Translation from In-domain and Out-of-domain Data

Turchi, Marco
2012-01-01

Abstract

The performance of Phrase-Based Statistical Machine Translation (PBSMT) systems mostly depends on training data. Many papers have investigated how to create new resources in order to increase the size of the training corpus in an attempt to improve PBSMT performance. In this work, we analyse and characterize the way in which the in-domain and out-of-domain performance of PBSMT is impacted when the amount of training data increases. Two different PBSMT systems, Moses and Portage, two of the largest parallel corpora, the Giga (French-English) and UN (Chinese-English) datasets, and several in- and out-of-domain test sets were used to build high-quality learning curves showing consistent logarithmic growth in performance. These results are stable across language pairs, PBSMT systems and domains. We also analyse the respective impact of additional training data for estimating the language and translation models. Our proposed model approximates learning curves very well and indicates that the translation model contributes about 30% more to the performance gain than the language model.
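
The logarithmic growth described above can be illustrated with a short curve-fitting sketch. The Python snippet below is a minimal illustration, not the paper's actual model or data: it assumes a curve of the form BLEU(n) = a + b * ln(n) and fits it to invented (corpus size, BLEU) points with NumPy's least-squares polyfit.

# Illustrative sketch (hypothetical data, assumed functional form):
# fit a logarithmic learning curve BLEU(n) = a + b * ln(n).
import numpy as np

# Hypothetical (corpus size in sentence pairs, BLEU score) observations.
sizes = np.array([10_000, 50_000, 100_000, 500_000, 1_000_000, 5_000_000])
bleu = np.array([18.2, 22.5, 24.1, 27.8, 29.3, 32.9])

# Least-squares fit of BLEU against ln(size): slope b, intercept a.
b, a = np.polyfit(np.log(sizes), bleu, deg=1)
print(f"fitted curve: BLEU(n) ~= {a:.2f} + {b:.2f} * ln(n)")

# Extrapolate to a larger corpus under the fitted model.
n_new = 10_000_000
print(f"predicted BLEU at n = {n_new:,}: {a + b * np.log(n_new):.1f}")

Under such a fit, each doubling of the training corpus adds a roughly constant increment (b * ln 2) to the score, which is one way to read the consistent logarithmic growth reported in the abstract.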


Use this identifier to cite or link to this document: https://hdl.handle.net/11582/307943