Top-down attention recurrent VLAD encoding for action recognition in videos

Sudhakaran, Swathikiran; Lanz, Oswald
2019-01-01

Abstract

Most recent approaches for action recognition from video leverage deep architectures to encode the video clip into a fixed-length representation vector that is then used for classification. For this to be successful, the network must be capable of suppressing irrelevant scene background and extracting the representation from the most discriminative part of the video. Our contribution builds on the observation that the spatio-temporal patterns characterizing actions in videos are highly correlated with objects and their locations in the video. We propose the Top-down Attention Recurrent VLAD Encoder (TA-VLAD), a deep recurrent neural architecture with built-in spatial attention that performs temporally aggregated VLAD encoding for action recognition from videos. We adopt a top-down approach to attention, using class-specific activation maps obtained from a deep Convolutional Neural Network pre-trained for generic image recognition to weight appearance features before encoding them into a fixed-length video descriptor with a Gated Recurrent Unit. Our method achieves state-of-the-art recognition accuracy on the HMDB51 and UCF101 benchmarks.
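The record itself contains no code; the following is a minimal PyTorch sketch of the mechanism the abstract outlines: class activation maps act as top-down spatial attention over per-frame CNN features, which are then soft-assigned to learned cluster centres (a NetVLAD-style encoding, assumed here) and aggregated over time by a GRU. All module names, dimensions, and design details below are illustrative assumptions, not the authors' TA-VLAD implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopDownAttentionVLADSketch(nn.Module):
        """Illustrative sketch: CAM-weighted frame features, soft-assigned
        to K cluster centres, with residuals aggregated over time by a GRU."""

        def __init__(self, feat_dim=512, num_clusters=32, hidden_dim=512):
            super().__init__()
            # 1x1 conv produces soft-assignment logits over K clusters
            self.assign = nn.Conv2d(feat_dim, num_clusters, kernel_size=1)
            self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))
            # GRU aggregates per-frame VLAD vectors into a clip descriptor
            self.gru = nn.GRU(num_clusters * feat_dim, hidden_dim, batch_first=True)

        def forward(self, feats, cam):
            # feats: (B, T, C, H, W) frame features from a pre-trained CNN
            # cam:   (B, T, H, W) class activation maps used as top-down attention
            B, T, C, H, W = feats.shape
            x = feats.view(B * T, C, H, W)
            attn = cam.view(B * T, 1, H * W)
            attn = F.softmax(attn, dim=-1).view(B * T, 1, H, W)
            x = x * attn  # attention-weighted appearance features
            # Soft-assignment weights over clusters at each spatial location
            a = F.softmax(self.assign(x), dim=1)       # (B*T, K, H, W)
            x_flat = x.flatten(2)                      # (B*T, C, H*W)
            a_flat = a.flatten(2)                      # (B*T, K, H*W)
            # VLAD residuals: sum_n a_kn * (x_n - c_k), per cluster k
            vlad = torch.einsum('bkn,bcn->bkc', a_flat, x_flat)
            vlad = vlad - a_flat.sum(-1).unsqueeze(-1) * self.centroids
            vlad = F.normalize(vlad.flatten(1), dim=-1)  # L2-normalise descriptor
            # Temporal aggregation with a GRU; last hidden state = clip code
            _, h = self.gru(vlad.view(B, T, -1))
            return h[-1]                               # (B, hidden_dim)

    # Usage with dummy tensors standing in for CNN features and CAMs:
    model = TopDownAttentionVLADSketch()
    feats = torch.randn(2, 8, 512, 7, 7)  # e.g. ResNet conv5 features, 8 frames
    cam = torch.rand(2, 8, 7, 7)          # e.g. CAM/Grad-CAM maps per frame
    clip_descriptor = model(feats, cam)   # (2, 512), fed to a linear classifier

The softmax over spatial positions is one plausible way to turn a raw activation map into attention weights; the paper may normalize differently.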
Files in this record:
File: IA190021.pdf (authorized users only)
Type: Pre-print
License: NOT PUBLIC - Private/restricted access
Size: 573.02 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/320734