Speaker Adaptive Acoustic Modeling with Mixture of Adult and Children's Speech
Gerosa, Matteo;Giuliani, Diego;Brugnara, Fabio
2005-01-01
Abstract
In this paper, speaker adaptive acoustic modeling is investigated in the context of large vocabulary speech recognition by training acoustic models with adult speech, children's speech and a mixture of adult and children's speech. By exploiting a limited amount (9 hours) of children's speech and a more significant amount (57 hours) of adult speech, group-specific acoustic models for children and adults, respectively, were trained using several methods for speaker adaptive acoustic modeling. In addition, age-independent acoustic models were trained by exploiting adult and children's speech. Recognition experiments were performed on three speech corpora, two consisting of children's speech and one of adult speech, using 64k-word and 11k-word trigram language models. Methods for speaker adaptive acoustic modeling proved to be effective, in particular for training acoustic models on a mixture of adult and children's speech, ensuring recognition performance aligned with that achieved with group-specific models for adults and children. A 10.2% word error rate was achieved on speech collected from children in the age range 8-12, compared with the 8.2% word error rate achieved for adults uttering the same texts.