Multi-speaker tracking from an audio-visual sensing device