XiNet-pose: Extremely lightweight pose detection for microcontrollers

Alberto Ancilotto;Francesco Paissan;Elisabetta Farella
2024-01-01

Abstract

Accurate human body keypoint detection is crucial in fields like medicine, entertainment, and VR. However, it often demands complex neural networks best suited for high-compute environments. This work instead presents a keypoint detection approach targeting embedded devices with very low computational resources, such as microcontrollers. The proposed end-to-end solution is based on the development and optimization of each component of a neural network specifically designed for highly constrained devices. Our methodology works top-down, from object to keypoint detection, unlike alternative bottom-up approaches relying instead on complex decoding algorithms or additional processing steps. The proposed network is optimized to ensure maximum compatibility with different embedded runtimes by making use of commonly used operators. We demonstrate the viability of our approach using an STM32H7 microcontroller with 2 MB of Flash and 1 MB of RAM. We achieve a maximum mAP of 57.9 without relying on external RAM, and good detection performance at latencies down to 133 ms per frame.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/345429