Optimizing PhiNet architectures for the detection of urban sounds on low-end devices
Alessio Brutti; Francesco Paissan; Alberto Ancilotto; Elisabetta Farella
2022-01-01
Abstract
Sound Event Detection (SED) pipelines identify and classify relevant events in audio streams. With typical applications in the smart city domain (e.g., crowd counting, alarm triggering), SED is an asset for municipalities and law enforcement agencies. Given the large size of the areas to be monitored and the amount of data generated by the IoT sensors, large models running on centralised servers are not suitable for real-time applications. Conversely, performing SED directly on pervasive embedded devices is very attractive in terms of energy consumption, bandwidth requirements and privacy preservation. In a previous manuscript, we proposed scalable backbones from the PhiNets family of architectures for real-time sound event detection on microcontrollers. In this paper, we extend our analysis by investigating how PhiNets' scaling parameters affect model performance in the SED task, while searching for the best configuration given the computational constraints. Experimental analysis on UrbanSound8K shows that, while only the total number of parameters matters when training the model from scratch (i.e., performance is independent of the scaling parameter configuration), knowledge distillation is more effective with specific scaling configurations.

File | Size | Format
---|---|---
0001121.pdf (authorized users only). Description: final article. Type: post-print document. License: not public, private/restricted access. | 456.77 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.