Cross-Modal Variational Alignment of Latent Spaces

Abstract

In this paper, we propose a novel cross-modal variational alignment method for processing and relating information across different modalities. The proposed approach consists of two variational autoencoder (VAE) networks that generate and model the latent space of each modality. The first network is a multimodal variational autoencoder that maps one modality directly to the other, while the second is a single-modal variational autoencoder. To associate the two spaces, we apply variational alignment, which acts as a translation mechanism that projects the latent space of the first VAE onto that of the single-modal VAE through an intermediate distribution. Experimental results on four well-known datasets, covering two different application domains (food image analysis and 3D hand pose estimation), show the generality of the proposed method and its superiority over a number of state-of-the-art approaches.
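The sketch below illustrates the two-VAE setup described in the abstract: a cross-modal VAE, a single-modal VAE, and an alignment module that translates the first latent space towards the second through an intermediate Gaussian. It is a minimal PyTorch sketch under stated assumptions; all class names, dimensions, and the specific KL-based alignment loss are illustrative and not the authors' implementation.

```python
# Minimal sketch of the two-VAE + variational alignment idea (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VAE(nn.Module):
    """Simple fully connected VAE: x -> (mu, logvar) -> z -> reconstruction."""

    def __init__(self, in_dim, out_dim, latent_dim, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar


def kl_between_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ), summed over latent dims."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    kl = 0.5 * (logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    return kl.sum(dim=1).mean()


class Aligner(nn.Module):
    """Projects the cross-modal latent onto the single-modal latent space
    through an intermediate Gaussian (a hypothetical 'translation' module)."""

    def __init__(self, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, z):
        h = self.net(z)
        return self.mu(h), self.logvar(h)


if __name__ == "__main__":
    # Toy dimensions: modality A (e.g. image features) and modality B (e.g. hand pose).
    vae_cross = VAE(in_dim=512, out_dim=63, latent_dim=32)   # maps A to B
    vae_single = VAE(in_dim=63, out_dim=63, latent_dim=32)   # models B alone
    aligner = Aligner(latent_dim=32)

    x_a = torch.randn(8, 512)   # batch of modality-A samples
    x_b = torch.randn(8, 63)    # paired modality-B samples

    # Forward passes through both VAEs.
    recon_ab, mu_a, logvar_a = vae_cross(x_a)
    recon_b, mu_b, logvar_b = vae_single(x_b)

    # Alignment: translate the cross-modal posterior and pull it towards the
    # single-modal posterior with a KL term (one plausible alignment loss).
    z_a = vae_cross.reparameterize(mu_a, logvar_a)
    mu_t, logvar_t = aligner(z_a)
    align_loss = kl_between_gaussians(mu_t, logvar_t, mu_b.detach(), logvar_b.detach())

    recon_loss = F.mse_loss(recon_ab, x_b) + F.mse_loss(recon_b, x_b)
    print(float(align_loss), float(recon_loss))
```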

Cite

Text

Theodoridis et al. "Cross-Modal Variational Alignment of Latent Spaces." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020. doi:10.1109/CVPRW50498.2020.00488

Markdown

[Theodoridis et al. "Cross-Modal Variational Alignment of Latent Spaces." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.](https://mlanthology.org/cvprw/2020/theodoridis2020cvprw-crossmodal/) doi:10.1109/CVPRW50498.2020.00488

BibTeX

@inproceedings{theodoridis2020cvprw-crossmodal,
  title     = {{Cross-Modal Variational Alignment of Latent Spaces}},
  author    = {Theodoridis, Thomas and Chatzis, Theocharis and Solachidis, Vassilios and Dimitropoulos, Kosmas and Daras, Petros},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2020},
  pages     = {4127--4136},
  doi       = {10.1109/CVPRW50498.2020.00488},
  url       = {https://mlanthology.org/cvprw/2020/theodoridis2020cvprw-crossmodal/}
}