Improved Automatic Speech Recognition through Speaker Normalization
Giuliani, Diego;Gerosa, Matteo;Brugnara, Fabio
2003-01-01
Abstract
In this paper, speaker adaptive acoustic modeling is investigated using a novel method for speaker normalization and a well-known vocal tract length normalization method. With the novel normalization method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations, with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated with the goal of reducing the mismatch between the acoustic data of the speaker and a set of target hidden Markov models. This transformation is estimated through constrained maximum likelihood linear regression and then applied to map the acoustic observations of the speaker into the normalized acoustic space. Recognition experiments made use of two corpora, the first consisting of adults' speech and the second of children's speech. Performing training and recognition with normalized data resulted in a consistent reduction of the word error rate with respect to the baseline systems trained on unnormalized data. In addition, the novel method always performed better than the reference vocal tract length normalization method adopted in this work. However, it was found that when unsupervised static speaker adaptation is applied in combination with speaker normalization, recognition performance tends to be similar regardless of the speaker normalization method adopted.
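The mapping described in the abstract can be sketched as follows: each speaker's observation vectors are sent into the normalized acoustic space by a speaker-specific affine transformation of the form o' = A o + b. The sketch below is a minimal, hypothetical illustration of applying such a transform to a matrix of feature frames; the matrix `A` and offset `b` are placeholders (in the paper they would be estimated via constrained maximum likelihood linear regression against the target HMMs, which is not reproduced here).

```python
import numpy as np

def normalize_features(obs, A, b):
    """Map each row o of obs into the normalized space as o' = A o + b.

    obs : (n_frames, dim) array of acoustic observations (e.g. MFCC frames)
    A   : (dim, dim) speaker-specific transformation matrix
    b   : (dim,) speaker-specific offset vector
    """
    return obs @ A.T + b

# Placeholder data: 100 frames of 13-dimensional features for one speaker.
rng = np.random.default_rng(0)
dim = 13
frames = rng.standard_normal((100, dim))

# Placeholder transform parameters (NOT CMLLR-estimated; illustration only).
A = np.eye(dim) * 0.9
b = np.full(dim, 0.1)

normalized = normalize_features(frames, A, b)
print(normalized.shape)  # (100, 13)
```

In the scheme the abstract describes, one such (A, b) pair would be estimated per training and per testing speaker, and both acoustic model training and recognition would then operate on the transformed features.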