Crowd-based MT Evaluation for non-English Target Languages
Bentivogli, Luisa; Federico, Marcello
2012-01-01
Abstract
This paper investigates the feasibility of using crowd-sourcing services for the human assessment of machine translation quality of translations into non-English target languages. Non-expert graders are hired through the CrowdFlower interface to Amazon’s Mechanical Turk in order to carry out a ranking-based MT evaluation of utterances taken from the travel conversation domain for 10 Indo-European and Asian languages. The collected human assessments are analyzed for their worker characteristics, evaluation costs, and quality of the evaluations in terms of the agreement between non-expert graders and expert/oracle judgments. Moreover, data quality control mechanisms including “locale qualification”, “qualification testing”, and “on-the-fly verification” are investigated in order to increase the reliability of the crowd-based evaluation results.