Linear Transformers beat YOLO for Embedded Object Detection
Ancilotto, Alberto; Castagnini, Edoardo; Farella, Elisabetta
2025-01-01
Abstract
Vision transformers (ViTs) have recently become the go-to standard for solving various computer vision tasks due to their superior performance and generalization capabilities. However, these architectures are difficult to deploy on embedded and heavily resource-constrained devices for two main reasons: their high memory requirements and their use of complex operators seldom supported by embedded inference pipelines. Meanwhile, in embedded environments it is still common to rely on older architectures that offer lower performance but reduced memory consumption and higher compatibility with restricted embedded runtimes, which typically support only a small set of operators. In this paper, we present a neural architecture based on a novel linear transformer block that bridges the gap between the performance achieved by modern computer vision models and the broad operator support offered by architectures currently used in embedded environments. We also propose a solution for one-shot scaling of our architecture, called Hardware-Aware Scaling. This approach allows us to tailor the architecture to embedded devices with different computational resources without a lengthy neural architecture search or manual architecture tuning. We tested our architecture on an object detection task and achieved performance comparable to recent versions of YOLO, with lower latency and parameter count while maximizing compatibility.
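
The abstract does not describe the internals of the proposed block, but for context the sketch below illustrates the generic linear-attention mechanism that linear transformer blocks build on: the quadratic softmax attention is replaced by a kernel feature map so that the whole operation reduces to matrix multiplications and elementwise functions, operators that embedded runtimes commonly support. The feature map, shapes, and function names here are illustrative assumptions, not the paper's design.

```python
# Illustrative sketch of generic linear attention (NOT the paper's exact block).
# Standard attention computes softmax(Q K^T) V, which is O(N^2) in the number of
# tokens N. With a positive feature map phi, attention can be rewritten as
# phi(Q) (phi(K)^T V) with a per-query normalizer, which is O(N) and uses only
# matmuls and elementwise operations.
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: arrays of shape (N, d). Returns an (N, d) array."""
    # Simple positive feature map, elu(x) + 1, a common choice in the
    # linear-transformer literature; the paper may use a different map.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    q, k = phi(q), phi(k)

    kv = k.T @ v            # (d, d): keys and values aggregated once
    z = q @ k.sum(axis=0)   # (N,): per-query normalizer
    return (q @ kv) / (z[:, None] + eps)

# Example: 256 tokens with 64 channels, as might come from a small feature map.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 64)).astype(np.float32)
out = linear_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)  # (256, 64)
```

Because the cost grows linearly with the number of tokens and the intermediate (d, d) state is small, this formulation avoids the large attention matrices and softmax kernels that make standard ViT blocks expensive on memory-limited devices.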
