Linear Transformers beat YOLO for Embedded Object Detection
Ancilotto, Alberto; Castagnini, Edoardo; Farella, Elisabetta
2025-01-01
Abstract
Vision transformers (ViTs) have recently become the go-to standard for solving various computer vision tasks due to their superior performance and generalization capabilities. However, these architectures are difficult to deploy on embedded and heavily resource-constrained devices for two main reasons: their high memory requirements and their use of complex operators seldom supported by embedded inference pipelines. Meanwhile, in embedded environments it is still common to rely on older architectures that offer lower performance but reduced memory consumption and higher compatibility with restricted embedded runtimes, which typically support only a small set of operators. In this paper, we present a neural architecture based on a novel linear transformer block that bridges the gap between the performance achieved by modern computer vision models and the broad operator support offered by architectures currently used in embedded environments. We also propose a solution for one-shot scaling of our architecture, called Hardware-Aware Scaling. This approach allows us to tailor the architecture to embedded devices with different computational resources without a lengthy neural architecture search or manual architecture tuning. We tested our architecture on an object detection task and achieved performance comparable to recent versions of YOLO, with lower latency and parameter count while maximizing compatibility.
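
The abstract does not describe the internals of the proposed block, but for context the sketch below illustrates the generic linear-attention mechanism that linear transformer blocks build on: the quadratic softmax attention is replaced by a kernel feature map so that the whole operation reduces to matrix multiplications and elementwise functions, operators that embedded runtimes commonly support. The feature map, shapes, and function names here are illustrative assumptions, not the paper's design.

```python
# Illustrative sketch of generic linear attention (NOT the paper's exact block).
# Standard attention computes softmax(Q K^T) V, which is O(N^2) in the number of
# tokens N. With a positive feature map phi, attention can be rewritten as
# phi(Q) (phi(K)^T V) with a per-query normalizer, which is O(N) and uses only
# matmuls and elementwise operations.
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: arrays of shape (N, d). Returns an (N, d) array."""
    # Simple positive feature map, elu(x) + 1, a common choice in the
    # linear-transformer literature; the paper may use a different map.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    q, k = phi(q), phi(k)

    kv = k.T @ v            # (d, d): keys and values aggregated once
    z = q @ k.sum(axis=0)   # (N,): per-query normalizer
    return (q @ kv) / (z[:, None] + eps)

# Example: 256 tokens with 64 channels, as might come from a small feature map.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 64)).astype(np.float32)
out = linear_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)  # (256, 64)
```

Because the cost grows linearly with the number of tokens and the intermediate (d, d) state is small, this formulation avoids the large attention matrices and softmax kernels that make standard ViT blocks expensive on memory-limited devices.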
