IRIS Institutional Research Information System

Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3-D particle filter and greedy data association to combine visual observations, color-based, and acoustic likelihoods to track the position of multiple simultaneous speakers. We compare the proposed multimodal 3-D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3-D and on the image plane.

Multi-speaker tracking from an audio-visual sensing device

Xinyuan Qian;Alessio Brutti;Oswald Lanz;Maurizio Omologo;Andrea Cavallaro

2019-01-01

Abstract

Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3-D particle filter and greedy data association to combine visual observations, color-based, and acoustic likelihoods to track the position of multiple simultaneous speakers. We compare the proposed multimodal 3-D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3-D and on the image plane.

Scheda breve

Scheda completa

Scheda completa (DC)

Anno

2019

Appare nelle tipologie:

1.1 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
Multi_speaker_tracking_from_an_audio_visual_sensing_device.pdf non disponibili Tipologia: Documento in Post-print Licenza: NON PUBBLICO - Accesso privato/ristretto Dimensione 5.63 MB Formato Adobe PDF Visualizza/Apri Richiedi una copia	5.63 MB	Adobe PDF	Visualizza/Apri Richiedi una copia
Multi-speaker tracking from an.pdf accesso aperto Licenza: PUBBLICO - Pubblico con Copyright Dimensione 1.99 MB Formato Adobe PDF Visualizza/Apri	1.99 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11582/317577

Citazioni

ND

social impact