In this paper, a novel speaker normalization method is presented and compared to a well known vocal tract length normalization method. With this method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated with the goal of reducing the mismatch between the acoustic data of the speaker and a set of target hidden Markov models. This transformation is estimated through constrained maximum likelihood linear regression and then applied to map the acoustic observations of the speaker into the normalized acoustic space. Recognition experiments made use of two corpora, the first one consisting of adults` speech, the second one consisting of children`s speech. Performing training and recognition with normalized data resulted in a consistent reduction of the word error rate with respect to the baseline systems trained on unnormalized data. In addition, the novel method always performed better than the reference vocal tract length normalization method
Speaker Normalization through Constrained MLLR Based Transforms
Giuliani, Diego;Gerosa, Matteo;Brugnara, Fabio
2004-01-01
Abstract
In this paper, a novel speaker normalization method is presented and compared to a well known vocal tract length normalization method. With this method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated with the goal of reducing the mismatch between the acoustic data of the speaker and a set of target hidden Markov models. This transformation is estimated through constrained maximum likelihood linear regression and then applied to map the acoustic observations of the speaker into the normalized acoustic space. Recognition experiments made use of two corpora, the first one consisting of adults` speech, the second one consisting of children`s speech. Performing training and recognition with normalized data resulted in a consistent reduction of the word error rate with respect to the baseline systems trained on unnormalized data. In addition, the novel method always performed better than the reference vocal tract length normalization methodI documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.