This paper describes an end-to-end approach to perform keyword spotting with a pre-trained acoustic model that uses recurrent neural networks and connectionist temporal classification loss. Our approach is specifically designed for low-resource keyword spotting tasks where extremely small amounts of in-domain data are available to train the system. The pre-trained model, largely used in ASR tasks, is fine-tuned on in-domain audio recordings. In inference the model output is matched against the set of predefined keywords using a beam-search re-scoring based on the edit distance.We demonstrate that this approach significantly outperforms the best state-of-the art systems on a well known keyword spotting benchmark, namely "google speech commands". Moreover, com-pared against state-of-the-art methods, our proposed approach is extremely robust in case of limited in domain training material. We show that a very small performance reduction is observed when fine tuning with a very small fraction (around 5%) of the training set.We report an extensive set of experiments on two keyword spotting tasks, varying training sizes and correlating keyword classification accuracy with character error rates provided by the system. We also report an ablation study to assess on the contribution of the out-of-domain pre-training and of the beam-search re-scoring.
End-to-End Low Resource Keyword Spotting Through Character Recognition and Beam-Search Re-Scoring
Brutti, Alessio;Falavigna, Daniele
2022-01-01
Abstract
This paper describes an end-to-end approach to perform keyword spotting with a pre-trained acoustic model that uses recurrent neural networks and connectionist temporal classification loss. Our approach is specifically designed for low-resource keyword spotting tasks where extremely small amounts of in-domain data are available to train the system. The pre-trained model, largely used in ASR tasks, is fine-tuned on in-domain audio recordings. In inference the model output is matched against the set of predefined keywords using a beam-search re-scoring based on the edit distance.We demonstrate that this approach significantly outperforms the best state-of-the art systems on a well known keyword spotting benchmark, namely "google speech commands". Moreover, com-pared against state-of-the-art methods, our proposed approach is extremely robust in case of limited in domain training material. We show that a very small performance reduction is observed when fine tuning with a very small fraction (around 5%) of the training set.We report an extensive set of experiments on two keyword spotting tasks, varying training sizes and correlating keyword classification accuracy with character error rates provided by the system. We also report an ablation study to assess on the contribution of the out-of-domain pre-training and of the beam-search re-scoring.File | Dimensione | Formato | |
---|---|---|---|
icassp2022.pdf
solo utenti autorizzati
Tipologia:
Documento in Pre-print
Licenza:
NON PUBBLICO - Accesso privato/ristretto
Dimensione
248.98 kB
Formato
Adobe PDF
|
248.98 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.