Pre-Trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

Abstract

Vision- and language-guided embodied AI requires a fine-grained understanding of the physical world through language and visual inputs. Such capabilities are difficult to learn solely from task-specific data, which has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations, such as those learned by CLIP, have been shown to fail to provide embodied agents with a sufficiently fine-grained scene understanding, a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and, as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations, which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks.
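
As a rough illustration of the idea the abstract describes, the sketch below extracts a text-conditioned intermediate U-Net activation from Stable Diffusion (via the Hugging Face diffusers library) to serve as a policy input. The checkpoint, the choice of the mid-block layer, the noise timestep, and the helper name stable_control_representation are illustrative assumptions for this sketch, not the paper's exact recipe.

import torch
from diffusers import StableDiffusionPipeline

# Sketch only: layer/timestep choices are assumptions, not the paper's method.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.to("cuda")  # assumes a CUDA device is available

features = {}

def hook(module, inputs, output):
    # Cache the mid-block activation produced during the forward pass.
    features["mid"] = output

pipe.unet.mid_block.register_forward_hook(hook)

@torch.no_grad()
def stable_control_representation(image, prompt, t=100):
    # image: float tensor of shape (1, 3, 512, 512), scaled to [-1, 1].
    # Encode the observation into the VAE latent space.
    latents = pipe.vae.encode(image).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor
    # Lightly noise the latent so it matches the U-Net's input distribution.
    noise = torch.randn_like(latents)
    timestep = torch.tensor([t], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, noise, timestep)
    # Encode the task instruction as the text condition.
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        return_tensors="pt",
    ).input_ids.to(latents.device)
    text_emb = pipe.text_encoder(tokens)[0]
    # One denoising forward pass; the hook captures the mid-block features.
    pipe.unet(noisy, timestep, encoder_hidden_states=text_emb)
    return features["mid"].flatten(start_dim=1)  # flat feature vector for a policy

The returned feature vector would then be fed to a downstream control policy in place of, e.g., a CLIP image embedding; in practice one would also tune which U-Net layer and diffusion timestep to read features from.
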

Cite

Text

Gupta et al. "Pre-Trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control." ICLR 2024 Workshops: R2-FM, 2024.

Markdown

[Gupta et al. "Pre-Trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/gupta2024iclrw-pretrained/)

BibTeX

@inproceedings{gupta2024iclrw-pretrained,
  title     = {{Pre-Trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control}},
  author    = {Gupta, Gunshi and Yadav, Karmesh and Gal, Yarin and Batra, Dhruv and Kira, Zsolt and Lu, Cong and Rudner, Tim G. J.},
  booktitle = {ICLR 2024 Workshops: R2-FM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/gupta2024iclrw-pretrained/}
}