The DIRHA-GRID corpus: baseline and tools for multi-room distant speech recognition using distributed microphones
Matassoni, Marco; Ravanelli, Mirco
2014-01-01
Abstract
Distant speech recognition in real-world environments remains a challenging problem, and a particularly interesting topic is multi-channel processing with distributed microphones in home environments. This paper presents an initiative to address the challenges of such a scenario: an experimental recognition framework, comprising a multi-room, multi-channel corpus and the accompanying evaluation tools, is made publicly available. The overall goal is to provide a common platform for comparing state-of-the-art algorithms, sharing ideas across research communities, and integrating several components of a realistic distant-talking recognition chain, e.g., voice activity detection, speech/feature enhancement, channel selection and fusion, and model compensation. The recordings include spoken commands (derived from the well-known GRID corpus) mixed with other acoustic events occurring in different rooms of a real apartment. The work provides a detailed description of the data, tasks, and baseline results, discusses the potential and limits of the approach, and highlights the impact of individual modules on recognition performance.