While using machine-translated data for supervised training can alleviate data sparseness problems when dealing with less-resourced languages, it is important that the source data are not only correctly translated, but also follow the same annotation scheme and possibly class balance as the smaller dataset in the target language. We therefore present an evaluation of hate speech detection in Italian using machine-translated data from English and comparing three settings, in order to understand the impact of training size, class distribution and annotation scheme.

Hate Speech Detection with Machine-Translated Data: The Role of Annotation Scheme, Class Imbalance and Undersampling

Camilla Casula;Sara Tonelli
2020-01-01

Abstract

While using machine-translated data for supervised training can alleviate data sparseness problems when dealing with less-resourced languages, it is important that the source data are not only correctly translated, but also follow the same annotation scheme and possibly class balance as the smaller dataset in the target language. We therefore present an evaluation of hate speech detection in Italian using machine-translated data from English and comparing three settings, in order to understand the impact of training size, class distribution and annotation scheme.
File in questo prodotto:
File Dimensione Formato  
paper_39.pdf

accesso aperto

Tipologia: Documento in Post-print
Licenza: Creative commons
Dimensione 220.46 kB
Formato Adobe PDF
220.46 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11582/325141
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
social impact