Robust Latent Representations Via Cross-Modal Translation and Alignment

Rajan, Vandana; Brutti, Alessio
2021

Abstract

Multi-modal learning relates information across observation modalities of the same physical phenomenon to leverage complementary information. Most multi-modal machine learning methods require that all the modalities used for training are also available at test time. This is a limitation when signals from some modalities are unavailable or severely degraded. To address it, we aim to improve the test performance of uni-modal systems by using multiple modalities during training only. The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment to improve the representations of a worse-performing (or weaker) modality. Translating from the weaker to the better-performing (or stronger) modality generates a multi-modal intermediate encoding that is representative of both modalities. This encoding is then correlated with the stronger modality representation in a shared latent space. We validate the proposed framework on the AVEC 2016 dataset (RECOLA) for continuous emotion recognition and show that it achieves state-of-the-art uni-modal performance for weaker modalities.
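
The correlation step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the exact correlation objective, so a simple per-dimension Pearson correlation over a batch is assumed here as a stand-in. The function name `correlation_alignment_loss`, the array shapes, and the synthetic data are all hypothetical.

```python
import numpy as np


def correlation_alignment_loss(z_a, z_b, eps=1e-8):
    """Return 1 minus the mean per-dimension Pearson correlation.

    z_a, z_b: arrays of shape (batch, dim), e.g. the intermediate
    encoding of the weaker modality and the latent representation of
    the stronger modality (both assumed, not taken from the paper).
    """
    # Centre each latent dimension over the batch.
    a = z_a - z_a.mean(axis=0, keepdims=True)
    b = z_b - z_b.mean(axis=0, keepdims=True)
    # Pearson correlation per latent dimension.
    cov = (a * b).sum(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + eps
    corr = cov / denom
    # Perfectly correlated latents give a loss close to 0.
    return 1.0 - corr.mean()


# Synthetic stand-ins for latent representations (assumption).
rng = np.random.default_rng(0)
z_strong = rng.normal(size=(64, 8))
z_aligned = 2.0 * z_strong + 1.0          # linearly related to z_strong
z_unrelated = rng.normal(size=(64, 8))    # independent of z_strong

loss_aligned = correlation_alignment_loss(z_aligned, z_strong)
loss_unrelated = correlation_alignment_loss(z_unrelated, z_strong)
```

Under these assumptions, minimising such a loss would push the weaker-modality encoding towards a representation that varies together with the stronger one. A trainable version would need an automatic-differentiation framework rather than plain NumPy.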
ISBN: 978-1-7281-7605-5
Files in this record:
ICASSP2021_RECOLA.pdf (not available)
Type: Pre-print document
License: NOT PUBLIC - Private/restricted access
Size: 209.1 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/326867