Enhancing Embeddings for Speech Classification in Noisy Conditions

Ali, Mohamed Nabih; Brutti, Alessio; Falavigna, Daniele
2022-01-01

Abstract

Robustness against noise is critical for several speech applications in real-world environments. To improve robustness, a speech enhancement front-end is typically integrated as a preprocessing stage, often jointly trained with the network back-end to reduce the impact of distortions and artifacts on performance. Recently, speech representations computed with models pre-trained on large amounts of data, such as Wav2Vec, have proved effective in a variety of speech processing and classification tasks. However, the performance of these models, although very robust, deteriorates in the presence of environmental noise. In this paper, we investigate how enhancement can be applied in neural speech classification architectures that employ pre-trained speech embeddings. We investigate two approaches: one applies time-domain enhancement prior to extracting the embeddings; the other employs a convolutional neural network to map the noisy embeddings to the corresponding clean ones. Exhaustive experiments on the Fluent Speech Commands and Google Speech Commands corpora, contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, shed light on the most promising enhancement training approaches.
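
As an illustration of what contaminating clean speech with noise can look like in practice, the following is a minimal sketch of mixing a noise signal into a speech signal at a chosen signal-to-noise ratio (SNR). The abstract does not describe the mixing procedure or the SNR levels used, so the function names (`mix_at_snr`, `measured_snr_db`), the 5 dB target, and the synthetic sine and Gaussian signals are assumptions made purely for illustration, not the authors' setup.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it.

    Hypothetical helper: the source does not specify how noise was mixed.
    """
    if not speech:
        raise ValueError("speech must not be empty")
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Choose the gain g such that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of `noisy`, treating (noisy - clean) as the noise component."""
    residual = [y - x for x, y in zip(clean, noisy)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_residual = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(p_clean / p_residual)


# Synthetic stand-ins for one second of 16 kHz audio (assumed values).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

In this sketch, `noisy` would be the input to either approach described in the abstract: a time-domain enhancer applied before embedding extraction, or an embedding extractor followed by a network that maps noisy embeddings to clean ones.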

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/335803