An on-line incremental speaker adaptation technique for audio stream transcription
Giuliani, Diego; Brugnara, Fabio
2013-01-01
Abstract
This paper proposes a novel on-line incremental speaker adaptation technique for real-time transcription applications such as automatic closed-captioning of live TV programs. Unlike previously proposed methods, the technique does not operate at the utterance level; instead, speaker change detection, speaker clustering, and speaker adaptation all take place over a short chunk of the incoming audio signal. Incremental adaptation based on feature-space maximum likelihood linear regression (fMLLR) is conducted with respect to a Gaussian mixture model (GMM) that models the acoustic training data. Individual speakers are represented by fMLLR transforms, which are used both for speaker clustering and for speaker adaptation. Speech recognition experiments show that the proposed technique is effective: embedded in an on-line system for transcribing television news broadcasts, it yields a 6% relative reduction in word error rate (WER) with respect to a non-adaptive baseline system.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
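As a rough illustration of the idea in the abstract, the sketch below scores a short chunk of feature vectors against several per-speaker affine transforms under one shared GMM and picks the best-matching speaker. It is not the authors' algorithm: the abstract does not say how the transforms are estimated or compared, so the diagonal transform (a simplification of a full fMLLR matrix), the likelihood-based selection rule, and every name and value here (`gmm_loglik`, `chunk_score`, `best_speaker`, the toy model) are assumptions made purely for illustration.

```python
import math


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        comp.append(ll)
    m = max(comp)  # log-sum-exp over components for numerical stability
    return m + math.log(sum(math.exp(c - m) for c in comp))


def chunk_score(chunk, scale, bias, gmm):
    """Average per-frame log-likelihood of a chunk after x' = scale * x + bias.

    Assumption: the transform is diagonal, so the Jacobian term log|det A|
    reduces to the sum of log|scale_i|.
    """
    log_det = sum(math.log(abs(s)) for s in scale)
    total = 0.0
    for x in chunk:
        y = [s * xi + b for s, xi, b in zip(scale, x, bias)]
        total += gmm_loglik(y, *gmm) + log_det
    return total / len(chunk)


def best_speaker(chunk, transforms, gmm):
    """Return the name of the transform that best explains the chunk."""
    return max(transforms, key=lambda name: chunk_score(chunk, *transforms[name], gmm))


# Toy two-component GMM in two dimensions: (weights, means, variances).
gmm = ([0.5, 0.5], [[-1.0, 0.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])

# Hypothetical per-speaker transforms: (scale, bias).
transforms = {
    "identity": ([1.0, 1.0], [0.0, 0.0]),
    "shifted": ([1.0, 1.0], [-5.0, 0.0]),
}

# A short chunk of frames centred far from the model; "shifted" maps it back.
chunk = [[4.0, 0.0], [6.0, 0.0], [5.0, 0.5]]
print(best_speaker(chunk, transforms, gmm))
```

In an incremental setting, one would presumably update the chosen speaker's transform with statistics from each new chunk; the abstract does not specify that update, so it is left out here.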