Optimizing PhiNet architectures for the detection of urban sounds on low-end devices

Alessio Brutti; Francesco Paissan; Alberto Ancilotto; Elisabetta Farella
2022-01-01

Abstract

Sound Event Detection (SED) pipelines identify and classify relevant events in audio streams. With typical applications in the smart city domain (e.g., crowd counting, alarm triggering), SED is an asset for municipalities and law enforcement agencies. Given the large size of the areas to be monitored and the amount of data generated by IoT sensors, large models running on centralised servers are not suitable for real-time applications. Conversely, performing SED directly on pervasive embedded devices is very attractive in terms of energy consumption, bandwidth requirements and privacy preservation. In a previous manuscript, we proposed scalable backbones from the PhiNets family of architectures for real-time sound event detection on microcontrollers. In this paper, we extend our analysis by investigating how PhiNets’ scaling parameters affect model performance in the SED task, while searching for the best configuration given the computational constraints. Experimental analysis on UrbanSound8K shows that, when training the model from scratch, only the total number of parameters matters (i.e., performance is independent of the scaling parameter configuration), whereas knowledge distillation is more effective with specific scaling configurations.
Files in this record:
File: 0001121.pdf
Access: authorized users only
Description: final article
Type: post-print document
License: not public (private/restricted access)
Size: 456.77 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/333907