Grammatical Error Correction for L2 Speech Using Publicly Available Data
Stefano Bannò;Michela Rais;Marco Matassoni
2023-01-01
Abstract
Over the past decades, the demand for learning English as a second language (L2) has grown steadily as English has become the lingua franca of business, culture, entertainment, and academia. This growth has increased the demand for automatic feedback systems in Computer-Assisted Language Learning. Mastering grammar is a key element of L2 speaking proficiency. In this paper, we present a cascaded approach to spoken grammatical error correction (GEC) that uses only publicly available training data. Specifically, we start from learners' utterances, investigate disfluency detection, and finally explore GEC. We test this pipeline on NICT-JLE, a publicly available L2 corpus, and on TLT-GEC, a private dataset that is being prepared for release. We obtain promising results that outperform previous studies that used large proprietary datasets, and we set a potential baseline for future experiments on spoken GEC.