DNN adaptation for recognition of children speech through automatic utterance selection
Matassoni, Marco;Falavigna, Giuseppe Daniele;Giuliani, Diego
2016-01-01
Abstract
This paper describes an approach for adapting a DNN trained on adult speech to children’s voices. The method extends a previous one, based on the Kullback-Leibler divergence between the output distribution of the original (adult) DNN and the target one, by accounting for the quality of the supervision of the adaptation utterances. In addition, starting from the observation that gradually removing the sentences with higher WERs from the adaptation set yields significant performance improvements, we also investigate the automatic selection of adaptation utterances. To determine transcription quality, we investigate the use of confidence estimates of the recognized hypotheses. We present experiments and results obtained on an Italian data set of children’s speech. We show that the proposed DNN adaptation approach significantly reduces the WER on a given test set from 14.2% (obtained with the non-adapted DNN, trained on adult speech) to 10.6%. Notably, the latter result was achieved without using any training data specific to children’s speech.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
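The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names, not the authors' implementation. It assumes the commonly used form of KL-divergence-regularized adaptation, in which each frame's training target interpolates the (automatically obtained) hard label with the posterior of the original adult DNN, and it assumes that utterance selection is a simple threshold on a per-utterance confidence score. The names `kld_regularized_targets`, `select_utterances`, `rho`, and `min_confidence`, as well as all numbers, are hypothetical.

```python
import numpy as np


def kld_regularized_targets(labels, si_posteriors, rho):
    """Interpolate one-hot labels with the original DNN's posteriors.

    labels:        (frames,) integer class index per frame (assumed to come
                   from the automatic transcription of the utterance)
    si_posteriors: (frames, classes) output of the original (adult) DNN
    rho:           regularization weight in [0, 1]; 0 trusts the labels
                   only, 1 keeps the original DNN's output distribution
    """
    num_classes = si_posteriors.shape[1]
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - rho) * one_hot + rho * si_posteriors


def select_utterances(utterances, min_confidence):
    """Keep utterances whose confidence reaches the (assumed) threshold."""
    return [u for u in utterances if u["confidence"] >= min_confidence]


# Toy adaptation set: confidence values are invented for illustration.
utterances = [
    {"id": "utt1", "confidence": 0.95},
    {"id": "utt2", "confidence": 0.40},
    {"id": "utt3", "confidence": 0.80},
]
kept = select_utterances(utterances, min_confidence=0.7)

# Two frames, three output classes.
labels = np.array([0, 2])
si_posteriors = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.2, 0.6]])
targets = kld_regularized_targets(labels, si_posteriors, rho=0.5)

print([u["id"] for u in kept])
print(targets)
```

Training the adapted DNN with cross-entropy against `targets` rather than against the one-hot labels is what keeps its outputs close to those of the original network. One way to account for supervision quality, again only an assumption here, would be to raise `rho` for utterances with low confidence.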