CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Abstract

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects and parts, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. We find that the proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.

Cite

Text

Didolkar et al. "CTRL-O: Language-Controllable Object-Centric Visual Representation Learning." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02749

Markdown

[Didolkar et al. "CTRL-O: Language-Controllable Object-Centric Visual Representation Learning." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/didolkar2025cvpr-ctrlo/) doi:10.1109/CVPR52734.2025.02749

BibTeX

@inproceedings{didolkar2025cvpr-ctrlo,
  title     = {{CTRL-O: Language-Controllable Object-Centric Visual Representation Learning}},
  author    = {Didolkar, Aniket and Zadaianchuk, Andrii and Awal, Rabiul and Seitzer, Maximilian and Gavves, Efstratios and Agrawal, Aishwarya},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29523--29533},
  doi       = {10.1109/CVPR52734.2025.02749},
  url       = {https://mlanthology.org/cvpr/2025/didolkar2025cvpr-ctrlo/}
}