Crowd-based MT Evaluation for non-English Target Languages
Bentivogli, Luisa; Federico, Marcello
2012-01-01
Abstract
This paper investigates the feasibility of using crowd-sourcing services for the human assessment of machine translation quality of translations into non-English target languages. Non-expert graders are hired through the CrowdFlower interface to Amazon’s Mechanical Turk in order to carry out a ranking-based MT evaluation of utterances taken from the travel conversation domain for 10 Indo-European and Asian languages. The collected human assessments are analyzed for their worker characteristics, evaluation costs, and quality of the evaluations in terms of the agreement between non-expert graders and expert/oracle judgments. Moreover, data quality control mechanisms including “locale qualification”, “qualification testing”, and “on-the-fly verification” are investigated in order to increase the reliability of the crowd-based evaluation results.