TinyVocos: Neural Vocoders on MCUs
Ciapponi, Stefano; Paissan, Francesco; Ancilotto, Alberto; Farella, Elisabetta
2024-01-01
Abstract
Neural vocoders convert time-frequency representations, such as mel-spectrograms, into the corresponding time-domain waveforms. Vocoders are essential for generative audio applications (e.g., text-to-speech and text-to-audio). This paper presents a scalable vocoder architecture for small-footprint edge devices, inspired by Vocos and adapted with XiNets and PhiNets. We evaluate the developed models qualitatively and quantitatively on single-speaker and multi-speaker datasets, and benchmark inference speed and memory consumption on four microcontrollers. Additionally, we study power consumption on an ARM Cortex-M7-powered board. Our results demonstrate the feasibility of deploying neural vocoders on resource-constrained edge devices, potentially enabling new applications in Internet of Sounds (IoS) and embedded-audio scenarios. Our best-performing model achieves a MOS of 3.95/5 while using 1.5 MiB of FLASH and 517 KiB of RAM, and consumes 252 mW during inference on a 1 s audio clip.
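The abstract's core operation — mapping a mel-spectrogram of shape [n_mels, n_frames] to a waveform of n_frames × hop_length samples — can be sketched as the following input/output contract. This is a minimal illustration only: the function names, sample rate, mel-band count, and hop size below are assumptions for the sketch, not the paper's actual API or hyperparameters, and the placeholder returns silence where a real Vocos-style network would predict audio.

```python
import numpy as np

# Assumed (hypothetical) front-end parameters -- not taken from the paper.
SAMPLE_RATE = 22050   # sampling rate of the synthesized audio
N_MELS = 80           # number of mel bands in the input spectrogram
HOP_LENGTH = 256      # STFT hop size: one frame covers this many samples

def vocode(mel: np.ndarray, hop_length: int = HOP_LENGTH) -> np.ndarray:
    """Placeholder vocoder illustrating only the shape relationship.

    A real model would map the mel frames to audio samples; here we
    return silence of the correct length: n_frames * hop_length samples.
    """
    n_mels, n_frames = mel.shape
    return np.zeros(n_frames * hop_length, dtype=np.float32)

# One second of audio corresponds to SAMPLE_RATE / HOP_LENGTH frames.
n_frames = SAMPLE_RATE // HOP_LENGTH   # 86 frames for 1 s at 22.05 kHz
mel = np.zeros((N_MELS, n_frames), dtype=np.float32)
wav = vocode(mel)
```

Under these assumed parameters, a "1 s audio clip inference" as benchmarked in the paper amounts to producing roughly 22050 output samples from about 86 input mel frames in a single forward pass.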