
Using Seq2seq voice conversion with pre-trained representations for audio anonymization: experimental insights

Costante, Marco;Matassoni, Marco;Brutti, Alessio
2022-01-01

Abstract

The recording and processing of vocal data raises increasing privacy concerns among end users and service providers alike. These privacy issues severely limit smart-city scenarios, where public spaces can be monitored by audio-visual sensors to improve the quality of the services offered to citizens. One way to address these problems is to move the processing onto edge devices close to the events, so that potentially identifiable information does not travel over the internet. However, this is often not possible due to hardware limitations of edge devices. An intriguing alternative is the development of voice anonymization techniques that remove individual speaker characteristics while preserving the linguistic and acoustic information in the data. In this paper we explore a state-of-the-art sequence-to-sequence voice conversion approach, originally based on x-vectors and ASR bottleneck features, and investigate the use of different pre-trained speech and speaker representations to decouple the two kinds of information. In addition, we analyse different strategies for selecting the target voice representation. Results on public datasets, in terms of equal error rate and word error rate, show that good privacy preservation is achieved with limited impact on the quality of the voice-converted speech, compared with the original method.
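The two metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline: the function names `word_error_rate` and `equal_error_rate`, the whitespace tokenization, and the threshold sweep are assumptions made here for illustration. In anonymization evaluations, a higher equal error rate of a speaker verifier is usually read as better privacy, and a lower word error rate of an ASR system as better preserved content.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance, keeping only the previous row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def equal_error_rate(target_scores, nontarget_scores) -> float:
    """Approximate EER: sweep thresholds over the observed scores and return
    the error rate where false acceptance and false rejection are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Here `target_scores` are verification scores for same-speaker trials and `nontarget_scores` for different-speaker trials. With few trials the sweep only approximates the crossing point, so a full evaluation would typically interpolate between thresholds.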
2022
978-1-6654-8561-6

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/335805