Getting Expert Quality from the Crowd for Machine Translation Evaluation

Bentivogli, Luisa; Federico, Marcello; Moretti, Giovanni; Paul, M.

This paper addresses the manual evaluation of Machine Translation (MT) quality by means of crowdsourcing. To this end, we replicated the ranking evaluation of the Arabic-English BTEC task proposed at the IWSLT 2010 Workshop by hiring non-experts through the CrowdFlower interface to Amazon’s Mechanical Turk. In particular, we investigated the effectiveness of “gold units” offered by CrowdFlower as the main quality control mechanism. The analysis of the collected data shows that agreement rates for non-experts are comparable to those obtained for experts, and that the crowd-based system ranking has a very strong correlation with the expert-based ranking. Our results confirm that crowdsourcing is an effective way to reduce the costs of MT evaluation without sacrificing quality, and demonstrate that exploiting the CrowdFlower control mechanism alone is enough to approximate expert-level data quality.
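
As an illustration of what comparing a crowd-based and an expert-based system ranking can look like, the following minimal sketch computes a rank correlation between two rankings of the same MT systems. The abstract does not name the coefficient used, so Spearman's rho is an assumption here; the function name `spearman_rho`, the system names, and the ranks are all hypothetical and are not data from the paper.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two rankings of the same items.

    Both arguments map an item name to its rank (1 = best).
    This closed form is only valid when neither ranking contains ties.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same systems")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two systems to correlate")
    # Sum of squared rank differences over all systems.
    d_squared = sum((rank_a[s] - rank_b[s]) ** 2 for s in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks for four made-up systems (not results from the paper).
expert_ranking = {"sys_A": 1, "sys_B": 2, "sys_C": 3, "sys_D": 4}
crowd_ranking = {"sys_A": 1, "sys_B": 3, "sys_C": 2, "sys_D": 4}

rho = spearman_rho(expert_ranking, crowd_ranking)
print(f"Spearman's rho: {rho:.2f}")
```

A value close to 1 means the two groups order the systems almost identically; a value near 0 means the orderings are unrelated.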