Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3-D particle filter and greedy data association to combine visual observations, color-based, and acoustic likelihoods to track the position of multiple simultaneous speakers. We compare the proposed multimodal 3-D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3-D and on the image plane.

Multi-speaker tracking from an audio-visual sensing device

Xinyuan Qian
;
Alessio Brutti
;
Oswald Lanz
;
Maurizio Omologo
;
2019

Abstract

Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3-D particle filter and greedy data association to combine visual observations, color-based, and acoustic likelihoods to track the position of multiple simultaneous speakers. We compare the proposed multimodal 3-D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3-D and on the image plane.
File in questo prodotto:
File Dimensione Formato  
Multi_speaker_tracking_from_an_audio_visual_sensing_device.pdf

non disponibili

Tipologia: Documento in Post-print
Licenza: NON PUBBLICO - Accesso privato/ristretto
Dimensione 5.63 MB
Formato Adobe PDF
5.63 MB Adobe PDF   Visualizza/Apri   Richiedi una copia
Multi-speaker tracking from an.pdf

accesso aperto

Licenza: PUBBLICO - Pubblico con Copyright
Dimensione 1.99 MB
Formato Adobe PDF
1.99 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11582/317577
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
social impact