Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality

Singh, Kunal Pratap; Garjani, Ali; Singh, Rishubh; Khattak, Muhammad Uzair; Tarhan, Efe; Toskov, Jason; Atanov, Andrei; Kar, Oğuzhan Fatih; Zamir, Amir

ICLR 2026

Abstract

Cross-modal learning, i.e., learning to predict one modality from another, is a fundamental mechanism for self-supervision that leverages multimodality. Many practical applications, e.g., deploying a household robot, involve devices equipped with a rich set of sensors that enable multimodal sensing in their test environment. This presents an opportunity to apply cross-modal learning to the multimodal data sensed by these devices to learn representations. Findings in developmental psychology also suggest that biological agents leverage cross-modal learning to build an effective representation of their surroundings. To study this, we propose a sandbox in which we restrict a user device to a single given test environment. This results in a specialization setup in which we attempt to develop a performant model for this specific test environment. Under this setup, we develop Test-Space Training (TST), which collects multimodal data in the test environment and performs self-supervised pre-training on it. We evaluate the resulting models on various downstream tasks in the same environment. We find several interesting insights; for example, by collecting rich multimodal data only from the test environment and leveraging cross-modal learning, we can achieve results competitive with generalist models (Oquab et al., 2023; Radford et al., 2021) pre-trained on large-scale internet-based datasets. This enables an alternative scenario in which the need for external Internet-scale datasets for pre-training models is reduced. We also present a set of analyses and ablations that raise intriguing points on substituting data with (multi)modality, and on how varying the pre-training data enables a tradeoff between a model’s ability to specialize to a test environment and to generalize to held-out spaces.

Cite

Text

Singh et al. "Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality." International Conference on Learning Representations, 2026.

Markdown

[Singh et al. "Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/singh2026iclr-multimodality/)

BibTeX

@inproceedings{singh2026iclr-multimodality,
  title     = {{Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality}},
  author    = {Singh, Kunal Pratap and Garjani, Ali and Singh, Rishubh and Khattak, Muhammad Uzair and Tarhan, Efe and Toskov, Jason and Atanov, Andrei and Kar, Oğuzhan Fatih and Zamir, Amir},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/singh2026iclr-multimodality/}
}