This paper presents WAGS (Word Alignment Gold Standard), a novel benchmark which allows extensive evaluation of WA tools on out-of-vocabulary (OOV) and rare words. WAGS is a subset of the Common Test section of the Europarl English-Italian parallel corpus, and is specifically tailored to OOV and rare words. WAGS is composed of 6,715 sentence pairs containing 11,958 occurrences of OOV and rare words up to frequency 15 in the Europarl Training set (5,080 English words and 6,878 Italian words), representing almost 3% of the whole text. Since WAGS is focused on OOV/rare words, manual alignments are provided for these words only, and not for the whole sentences. Two off-the-shelf word aligners have been evaluated on WAGS, and results have been compared to those obtained on an existing benchmark tailored to full text alignment. The results obtained confirm that WAGS is a valuable resource, which allows a statistically sound evaluation of WA systems’ performance on OOV and rare words, as well as extensive data analyses. WAGS is publicly released under a Creative Commons Attribution license.

WAGS: A Beautiful English-Italian Benchmark Supporting Word Alignment Evaluation on Rare Words

Bentivogli Luisa;Cettolo Mauro;Farajian Mohammad;Federico Marcello
2016

Abstract

This paper presents WAGS (Word Alignment Gold Standard), a novel benchmark which allows extensive evaluation of WA tools on out-of-vocabulary (OOV) and rare words. WAGS is a subset of the Common Test section of the Europarl English-Italian parallel corpus, and is specifically tailored to OOV and rare words. WAGS is composed of 6,715 sentence pairs containing 11,958 occurrences of OOV and rare words up to frequency 15 in the Europarl Training set (5,080 English words and 6,878 Italian words), representing almost 3% of the whole text. Since WAGS is focused on OOV/rare words, manual alignments are provided for these words only, and not for the whole sentences. Two off-the-shelf word aligners have been evaluated on WAGS, and results have been compared to those obtained on an existing benchmark tailored to full text alignment. The results obtained confirm that WAGS is a valuable resource, which allows a statistically sound evaluation of WA systems’ performance on OOV and rare words, as well as extensive data analyses. WAGS is publicly released under a Creative Commons Attribution license.
978-2-9517408-9-1
File in questo prodotto:
File Dimensione Formato  
607_Paper.pdf

accesso aperto

Descrizione: Articolo completo
Tipologia: Documento in Post-print
Licenza: Creative commons
Dimensione 311.52 kB
Formato Adobe PDF
311.52 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11582/306281
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
social impact