Direct enhancement of pre-trained speech embeddings for speech processing in noisy conditions

Mohamed Nabih Ali;Alessio Brutti;Daniele Falavigna
2023-01-01

Abstract

Lately, the development of deep learning algorithms has marked milestones in the field of speech processing. In particular, the release of pre-trained feature extraction models has considerably simplified the development of speech classification and recognition algorithms. However, environmental noise and reverberation still degrade overall performance, making robustness in noisy conditions mandatory for real-world applications. One way to mitigate the effect of noise is to integrate a speech enhancement front-end that removes artifacts from the desired speech signals. Unlike state-of-the-art enhancement approaches, which operate either on speech spectrograms or directly on time-domain signals, in this paper we study how enhancement can be applied directly to the speech embeddings extracted with Wav2Vec and WavLM models. Moreover, we investigate a variety of training approaches, considering different flavors of joint and disjoint training of the speech enhancement front-end with the classification/recognition back-end. We perform exhaustive experiments on the Fluent Speech Commands and Google Speech Commands datasets contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, as well as on the LibriSpeech dataset contaminated with noises from the MUSAN dataset, considering intent classification, keyword spotting, and speech recognition tasks, respectively. Results show that directly enhancing the speech embeddings is a viable, computationally effective approach, and they provide insights into the most promising training approaches.
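As a loose illustration of the core idea, not the authors' actual architecture, an embedding-level enhancement front-end can be seen as a small trainable map from noisy embeddings to their clean counterparts, optimized with an MSE loss between the two. The sketch below uses a single linear layer trained by gradient descent in NumPy; the embedding dimension, noise model, and all variable names are hypothetical stand-ins for real Wav2Vec/WavLM features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: pairs of clean embeddings and their noisy
# versions (real Wav2Vec 2.0 / WavLM embeddings would be 768-dim;
# small sizes here keep the demo fast).
dim, n = 16, 256
E_clean = rng.standard_normal((n, dim))
E_noisy = E_clean + 0.3 * rng.standard_normal((n, dim))

# Linear enhancement front-end: E_hat = E_noisy @ W + b,
# initialized near the identity so it starts as a pass-through.
W = np.eye(dim)
b = np.zeros(dim)

lr = 0.05
for _ in range(200):
    E_hat = E_noisy @ W + b
    err = E_hat - E_clean              # (n, dim) residual
    # Gradient descent on the mean squared error ||E_hat - E_clean||^2
    W -= lr * E_noisy.T @ err / n
    b -= lr * err.mean(axis=0)

mse_before = float(np.mean((E_noisy - E_clean) ** 2))
mse_after = float(np.mean((E_noisy @ W + b - E_clean) ** 2))
assert mse_after < mse_before  # front-end reduces embedding distortion
```

In the disjoint-training flavor described in the abstract, such a front-end would be trained on (noisy, clean) embedding pairs alone; in the joint flavor, its parameters would instead (or additionally) be updated through the downstream classification/recognition loss.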


Use this identifier to cite or link to this document: https://hdl.handle.net/11582/338727