Mixtures of Recurrent Neural Networks for Speaker Adaptation

Trentin, Edmondo; Giuliani, Diego

This work introduces a multiple connectionist architecture, based on a mixture of Recurrent Neural Networks, to approach the problem of speaker adaptation in the acoustic feature domain. Adaptation in the feature space is accomplished by means of a suitable acoustic feature transformation. The aim is to reduce the differences between the acoustic space of a new speaker and the training acoustic space of a given recognizer, in order to increase recognition performance. The transformation has to be estimated from a small amount of speech sampled from the new speaker, and it is applied at the recognition stage to preprocess the input speech data before they are fed into the recognizer. In this work, recognition experiments with continuous speech and a large vocabulary have been carried out using speaker-dependent (SD) and speaker-independent (SI) recognition systems based on continuous density hidden Markov models. Different connectionist approaches to speaker adaptation are discussed. First, an extended Multi-Layer Perceptron (MLP) is trained to realize the required multivariate non-linear regression. Various directions along which the feed-forward model can be extended by introducing recurrent connections are then discussed. Experiments with the SD recognizers show a remarkable 56% reduction of the word error rate with respect to the baseline (speech recognition without adaptation) when the recurrent network is used as an acoustic front-end to the recognizer, outperforming the standard linear regression approach. In the SI case, on the other hand, it is more difficult to reach a significant recognition improvement with respect to the hidden Markov model alone. This leads to the development of a more effective regression technique based on combined neural networks. Both a mixture of MLPs and a technique for combining recurrent nets are used. Experimental results show that the proposed architecture consistently improves recognition performance, yielding a 21% reduction of the word error rate.
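The idea of combining several regressors into a single acoustic front-end can be illustrated with a minimal sketch. The abstract does not specify how the networks are combined, so everything here is an assumption made for illustration: the experts are simple affine maps standing in for trained MLPs or recurrent nets, the combination is a softmax-gated convex blend, and the names `affine`, `mixture_transform`, and `gate_weights` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias):
    """Return a function mapping a feature vector x to W x + b.

    Stand-in for a trained network (assumption, not the paper's model).
    """
    def apply(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return apply


def mixture_transform(x, experts, gate_weights):
    """Blend expert outputs with softmax gates computed linearly from x."""
    scores = [sum(g * xi for g, xi in zip(row, x)) for row in gate_weights]
    gates = softmax(scores)
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, outputs))
            for d in range(dim)]


# Two hypothetical experts acting on 2-dimensional acoustic frames.
identity = affine([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
shift = affine([[1.0, 0.0], [0.0, 1.0]], [2.0, -2.0])

# Zero gate weights give equal gates, so the output is the plain average.
gate_weights = [[0.0, 0.0], [0.0, 0.0]]

frame = [1.0, 3.0]                      # one frame from the new speaker
adapted = mixture_transform(frame, [identity, shift], gate_weights)
print(adapted)                          # [2.0, 2.0]
```

In a real system the adapted frames, rather than the raw ones, would be passed to the recognizer, and both the experts and the gate would be estimated from the small amount of adaptation speech available for the new speaker.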
