End-to-End Low Resource Keyword Spotting Through Character Recognition and Beam-Search Re-Scoring

Brutti, Alessio; Falavigna, Daniele
2022

Abstract

This paper describes an end-to-end approach to keyword spotting with a pre-trained acoustic model that uses recurrent neural networks and a connectionist temporal classification (CTC) loss. Our approach is specifically designed for low-resource keyword spotting tasks, where only extremely small amounts of in-domain data are available to train the system. The pre-trained model, of a kind widely used in ASR tasks, is fine-tuned on in-domain audio recordings. At inference time, the model output is matched against the set of predefined keywords using a beam-search re-scoring based on the edit distance. We demonstrate that this approach significantly outperforms the best state-of-the-art systems on a well-known keyword spotting benchmark, namely Google Speech Commands. Moreover, compared against state-of-the-art methods, our proposed approach is extremely robust when in-domain training material is limited: only a very small performance reduction is observed when fine-tuning with a very small fraction (around 5%) of the training set. We report an extensive set of experiments on two keyword spotting tasks, varying the training set size and correlating keyword classification accuracy with the character error rates produced by the system. We also report an ablation study to assess the contribution of the out-of-domain pre-training and of the beam-search re-scoring.
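
The matching step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: the function names `edit_distance` and `rescore`, the `penalty` weight, and the rule of subtracting a weighted edit distance from each beam hypothesis's log-probability are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rescore(hypotheses: List[Tuple[str, float]],
            keywords: List[str],
            penalty: float = 1.0) -> str:
    """Pick the keyword with the best combined score over all beam hypotheses.

    Assumed scoring rule (not from the paper):
        score = log_prob - penalty * edit_distance(hypothesis, keyword)
    """
    if not hypotheses or not keywords:
        raise ValueError("need at least one hypothesis and one keyword")
    best_keyword, best_score = keywords[0], float("-inf")
    for text, log_prob in hypotheses:
        for keyword in keywords:
            score = log_prob - penalty * edit_distance(text, keyword)
            if score > best_score:
                best_keyword, best_score = keyword, score
    return best_keyword


# Hypothetical beam output: (character string, log-probability) pairs.
beam = [("stob", -0.5), ("stop", -0.9), ("top", -2.0)]
print(rescore(beam, ["stop", "go", "up"]))
```

In this toy beam, the top-ranked hypothesis is a misspelling, and the re-scoring still selects a valid keyword because a lower-ranked hypothesis matches it exactly. How the actual system weights acoustic scores against edit distance is described in the paper itself, not in this abstract.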
ISBN: 978-1-6654-0540-9

Type: Pre-print document
License: Not public (private/restricted access)

Use this identifier to cite or link to this document: http://hdl.handle.net/11582/332374