A Speech Enhancement Front-End for Intent Classification in Noisy Environments
Ali, Mohamed Nabih; Schmalz, Veronica Juliana; Brutti, Alessio; Falavigna, Daniele
2021-01-01
Abstract
Recently, several neural time-domain speech denoising and speech separation approaches have been investigated in the literature, considerably advancing the state of the art in the field. Among these methods, Wave-U-Net is particularly appealing because it allows integrated modelling of the phase information and can handle large temporal contexts. In this paper, we present an evolution of the original Wave-U-Net architecture that features a deeper model with a dilation rate that increases exponentially from layer to layer in the downsampling blocks. Experiments on a contaminated version of Librispeech show that the proposed architecture outperforms the original one in terms of intelligibility metrics. In addition, we evaluate the performance of the proposed enhancement scheme on a simple intent classification task based on a noisy version of the Fluent Speech Commands dataset. Results show that, in this case as well, the proposed method outperforms the baseline and substantially improves the classification accuracy in noisy conditions.
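The abstract does not detail the exact layer configuration, so the following PyTorch sketch only illustrates the core idea: a Wave-U-Net-style downsampling path in which the dilation rate doubles from block to block. The kernel size, channel progression, and number of blocks are placeholder assumptions, not values taken from the paper.

```python
# Minimal sketch of a downsampling path whose dilation rate grows
# exponentially per block (2^0, 2^1, 2^2, ...). Hyperparameters below
# are illustrative assumptions, not the authors' configuration.
import torch
import torch.nn as nn


class DownBlock(nn.Module):
    """1-D convolutional downsampling block with a configurable dilation rate."""

    def __init__(self, in_ch, out_ch, kernel_size=15, dilation=1):
        super().__init__()
        # "Same" padding for a dilated 1-D convolution with an odd kernel size.
        pad = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size,
                              padding=pad, dilation=dilation)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        y = self.act(self.conv(x))
        # Decimate by 2 along time, as in the original Wave-U-Net downsampling.
        return y[:, :, ::2]


class DilatedEncoder(nn.Module):
    """Stack of downsampling blocks whose dilation doubles at every layer."""

    def __init__(self, num_blocks=6, base_ch=24):
        super().__init__()
        blocks, in_ch = [], 1  # mono time-domain waveform input
        for i in range(num_blocks):
            out_ch = base_ch * (i + 1)
            blocks.append(DownBlock(in_ch, out_ch, dilation=2 ** i))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        skips = []
        for block in self.blocks:
            x = block(x)
            skips.append(x)  # skip connections for the upsampling path
        return x, skips


if __name__ == "__main__":
    enc = DilatedEncoder()
    wav = torch.randn(1, 1, 16384)  # one ~1 s waveform at 16 kHz
    out, skips = enc(wav)
    print(out.shape, len(skips))
```

The intent of the exponential schedule is that deeper blocks cover progressively larger temporal contexts without increasing the kernel size, which is consistent with the abstract's emphasis on handling large temporal contexts in the time domain.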