Blind source extraction for robust speech recognition in multisource noisy environments
Nesta, Francesco; Matassoni, Marco
2013-01-01
Abstract
This paper proposes and describes a complete system for Blind Source Extraction (BSE). The goal is to extract a target source signal in order to recognize spoken commands that are uttered in reverberant and noisy environments and acquired by a microphone array. The BSE architecture is based on multiple stages: (a) TDOA estimation, (b) identification of the mixing system for the target source, (c) on-line semi-blind source separation, and (d) source extraction. The stages are combined so that the target signal is estimated with limited distortion. Although the BSE framework is described in general terms, the proposed system is evaluated here on the data provided for the CHiME Pascal 2011 competition, i.e. binaural recordings made in a real-world domestic environment. The CHiME mixtures are processed with the BSE, and the recovered target signal is fed to a recognizer that uses noise-robust features based on Gammatone Frequency Cepstral Coefficients (GFCC). Moreover, acoustic model adaptation is applied to further reduce the mismatch between training and testing data and to improve overall performance. A detailed comparison between different models and algorithmic settings is reported, showing that the approach is promising and that the resulting system yields a significant reduction in error rate.
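The abstract does not say which estimator is used for stage (a), TDOA estimation. As an illustrative assumption only, the sketch below uses GCC-PHAT, a widely used time-difference-of-arrival estimator for a single microphone pair such as the binaural CHiME recordings. The function name `gcc_phat_tdoa` and its parameters are hypothetical and are not taken from the paper; this is a minimal sketch, not the authors' implementation.

```python
import numpy as np


def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Estimate the delay tau (seconds) such that x2(t) ~= x1(t - tau).

    Hypothetical helper: GCC-PHAT is assumed here for illustration only.
    A positive result means x2 lags x1.
    """
    # Zero-pad to the full linear-correlation length to avoid circular wrap-around.
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)

    # Cross-power spectrum; PHAT weighting keeps only the phase (unit magnitude).
    R = X2 * np.conj(X1)
    R /= np.maximum(np.abs(R), 1e-12)
    cc = np.fft.irfft(R, n=n)

    # Restrict the search to |lag| <= max_shift samples (at least one sample).
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(max_tau * fs), n // 2)
    max_shift = max(1, max_shift)

    # Reorder so index i corresponds to lag (i - max_shift): negative lags first.
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return float(shift) / fs
```

For a binaural pair, `max_tau` can be set to roughly the inter-microphone distance divided by the speed of sound (for an assumed spacing of about 0.2 m, 0.2 / 343 is about 0.6 ms), which restricts the peak search to physically plausible lags and makes the estimate less sensitive to spurious correlation peaks.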