
Fill-up versus interpolation methods for phrase-based SMT adaptation

Bisazza, Arianna;Federico, Marcello
2011-01-01

Abstract

This paper compares techniques to combine diverse parallel corpora for domain-specific phrase-based SMT system training. We address a common scenario where little in-domain data is available for the task, but where large background models exist for the same language pair. In particular, we focus on phrase table fill-up: a method that effectively exploits background knowledge to improve model coverage, while preserving the more reliable information coming from the in-domain corpus. We present experiments on an emerging transcribed speech translation task – the TED talks. While performing similarly in terms of BLEU and NIST scores to the popular log-linear and linear interpolation techniques, filled-up translation models are more compact and easy to tune by minimum error training.
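The fill-up idea sketched in the abstract can be illustrated with a minimal merge over phrase tables. This is a hypothetical sketch, not the paper's implementation or the Moses phrase-table format: phrase tables are modeled as dicts keyed on (source, target) phrase pairs, and background entries are flagged with an assumed `backoff` provenance marker so a downstream model could weight them differently.

```python
def fill_up(in_domain, background):
    """Merge two phrase tables keyed on (source, target) phrase pairs.

    In-domain entries are kept unchanged; background pairs are added only
    when the pair is absent in-domain, flagged as back-off so their lower
    reliability can be accounted for (e.g. via a provenance feature).
    """
    # In-domain knowledge always wins: copy it first, unflagged.
    merged = {pair: {"scores": s, "backoff": False}
              for pair, s in in_domain.items()}
    # Background fills coverage gaps only; it never overrides in-domain.
    for pair, s in background.items():
        if pair not in merged:
            merged[pair] = {"scores": s, "backoff": True}
    return merged


# Toy example (invented phrase pairs and scores, for illustration only).
in_domain = {("talk", "conferenza"): [0.8]}
background = {("talk", "discorso"): [0.6],
              ("speech", "discorso"): [0.7]}
table = fill_up(in_domain, background)
```

Here the in-domain pair survives untouched, while the two background pairs enter the merged table marked as back-off entries, which mirrors how fill-up improves coverage without diluting the more reliable in-domain statistics.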


Use this identifier to cite or link to this document: https://hdl.handle.net/11582/70200