IRIS Institutional Research Information System


End-to-End Low Resource Keyword Spotting Through Character Recognition and Beam-Search Re-Scoring

Mekonnen, Ephrem Tibebe;Brutti, Alessio;Falavigna, Daniele

2022-01-01

Abstract

This paper describes an end-to-end approach to keyword spotting based on a pre-trained acoustic model that uses recurrent neural networks and a connectionist temporal classification (CTC) loss. The approach is specifically designed for low-resource keyword spotting tasks, where only extremely small amounts of in-domain data are available to train the system. The pre-trained model, widely used in ASR tasks, is fine-tuned on in-domain audio recordings. At inference time, the model output is matched against the set of predefined keywords using a beam-search re-scoring based on the edit distance. We demonstrate that this approach significantly outperforms the best state-of-the-art systems on a well-known keyword spotting benchmark, Google Speech Commands. Moreover, compared with state-of-the-art methods, the proposed approach is extremely robust when in-domain training material is limited: only a very small performance reduction is observed when fine-tuning with a very small fraction (around 5%) of the training set. We report an extensive set of experiments on two keyword spotting tasks, varying the training set size and correlating keyword classification accuracy with the character error rates produced by the system. We also report an ablation study to assess the contribution of the out-of-domain pre-training and of the beam-search re-scoring.
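The matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so everything below is an assumption made for illustration: the function names `edit_distance` and `rescore`, the weight `alpha`, and the way each hypothesis's log-probability is combined with the edit-distance penalty. The beam is assumed to be a list of (character string, log-probability) pairs already produced by a CTC beam-search decoder.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def rescore(
    beam: Sequence[Tuple[str, float]],
    keywords: List[str],
    alpha: float = 1.0,
) -> Tuple[str, float]:
    """Pick the keyword that best matches any hypothesis in the beam.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    score = hypothesis log-probability - alpha * edit distance to the keyword.
    """
    best_keyword, best_score = keywords[0], float("-inf")
    for text, log_prob in beam:
        for keyword in keywords:
            score = log_prob - alpha * edit_distance(text, keyword)
            if score > best_score:
                best_keyword, best_score = keyword, score
    return best_keyword, best_score


if __name__ == "__main__":
    # A misrecognized top hypothesis ("stap") is still mapped to "stop".
    beam = [("stap", -0.5), ("up", -2.0)]
    print(rescore(beam, ["stop", "up", "go"]))
```

Under this assumed rule, a hypothesis that is one character away from a keyword can still win over a lower-probability exact match, which is the intuition behind re-scoring the whole beam rather than only its top entry.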


	Year
	
				2022
			
	ISBN
	
				978-1-6654-0540-9
			
	Appears in types:
	
				4.1 Contribution in conference proceedings

Files in this item:

File	Type	License	Size	Format
icassp2022.pdf	Pre-print	Non-public (private/restricted access, authorized users only)	248.98 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/332374
