Improved Multimodal Deep Learning with Variation of Information
Abstract
Deep learning has been successfully applied to multimodal representation learning problems, a common strategy being to learn joint representations shared across multiple modalities on top of layers of modality-specific networks. Nonetheless, the question remains of how to learn a good association between data modalities; in particular, a good generative model of multimodal data should be able to reason about a missing data modality given the rest. In this paper, we propose a novel multimodal representation learning framework that explicitly aims at this goal. Rather than learning with maximum likelihood, we train the model to minimize the variation of information. We provide a theoretical insight into why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply our method to restricted Boltzmann machines and introduce learning methods based on contrastive divergence and multi-prediction training. In addition, we extend our method to deep networks with a recurrent encoding structure to finetune the whole network. In experiments, we demonstrate state-of-the-art visual recognition performance on the MIR-Flickr and PASCAL VOC 2007 databases with and without text features.
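For context, the abstract's learning objective can be read off from the standard information-theoretic definition of variation of information, which decomposes into the two conditional entropies; minimizing it over model parameters amounts to maximizing both conditional log-likelihoods under the data distribution. The sketch below uses our own notation (P_D for the data distribution, P_theta for the model) and is a reading of the standard definition, not an excerpt from the paper.

```latex
% Variation of information between modalities X and Y (standard definition):
%   VI(X, Y) = H(X | Y) + H(Y | X)
% Minimizing it over model parameters \theta is equivalent (up to data-entropy
% constants) to maximizing the sum of conditional log-likelihoods in both directions.
\begin{align}
\mathrm{VI}(X, Y) &= H(X \mid Y) + H(Y \mid X) \\
\min_{\theta}\, \mathrm{VI}_{\theta}(X, Y)
  \;&\equiv\; \max_{\theta}\;
     \mathbb{E}_{P_D(x,\, y)}\!\left[ \log P_{\theta}(x \mid y) + \log P_{\theta}(y \mid x) \right]
\end{align}
```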
Cite
Text
Sohn et al. "Improved Multimodal Deep Learning with Variation of Information." Neural Information Processing Systems, 2014.
Markdown
[Sohn et al. "Improved Multimodal Deep Learning with Variation of Information." Neural Information Processing Systems, 2014.](https://mlanthology.org/neurips/2014/sohn2014neurips-improved/)
BibTeX
@inproceedings{sohn2014neurips-improved,
title = {{Improved Multimodal Deep Learning with Variation of Information}},
author = {Sohn, Kihyuk and Shang, Wenling and Lee, Honglak},
booktitle = {Neural Information Processing Systems},
year = {2014},
pages = {2141-2149},
url = {https://mlanthology.org/neurips/2014/sohn2014neurips-improved/}
}