Distant-speech recognition represents a technology of fundamental importance for future development of assistive applications characterized by flexible and unobtrusive interaction in home environments. State-of-the-art speech recognition still exhibits lack of robustness, and an unacceptable performance variability, due to environmental noise, reverberation effects, and speaker position. In the past, multi-condition training and contamination methods were explored to reduce the mismatch between training and test conditions. However, the performance evaluation can be biased by factors as limited number of positions of speaker and microphones, adopted set of impulse responses, vocabulary and grammars defining the recognition task. The purpose of this paper is to investigate in more detail some critical aspects that characterize such experimental context. To this purpose, our work addressed a microphone network distributed over different rooms of an apartment and a related set of speaker-microphone pairs leading to a very large set of impulse responses. Besides simulations, the experiments also tackled real speech interactions. The performance evaluation was based on a phone-loop task, in order to minimize the influence of linguistic constraints. The experimental results show how less critical is an accurate selection of impulse responses, if compared to other factors as the signal-to-noise ratio introduced by additive background noise.
On the selection of the impulse responses for distant-speech recognition based on contaminated speech training
Ravanelli, Mirco;Omologo, Maurizio
2014-01-01
Abstract
Distant-speech recognition represents a technology of fundamental importance for future development of assistive applications characterized by flexible and unobtrusive interaction in home environments. State-of-the-art speech recognition still exhibits lack of robustness, and an unacceptable performance variability, due to environmental noise, reverberation effects, and speaker position. In the past, multi-condition training and contamination methods were explored to reduce the mismatch between training and test conditions. However, the performance evaluation can be biased by factors as limited number of positions of speaker and microphones, adopted set of impulse responses, vocabulary and grammars defining the recognition task. The purpose of this paper is to investigate in more detail some critical aspects that characterize such experimental context. To this purpose, our work addressed a microphone network distributed over different rooms of an apartment and a related set of speaker-microphone pairs leading to a very large set of impulse responses. Besides simulations, the experiments also tackled real speech interactions. The performance evaluation was based on a phone-loop task, in order to minimize the influence of linguistic constraints. The experimental results show how less critical is an accurate selection of impulse responses, if compared to other factors as the signal-to-noise ratio introduced by additive background noise.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.