Object-Centric Latent Action Learning

Abstract

Leveraging vast amounts of internet video data for Embodied AI is currently bottlenecked by the lack of action annotations and the presence of action-correlated distractors. We propose a novel object-centric latent action learning approach, based on VideoSAUR and LAPO, that employs self-supervised decomposition of scenes into object representations and annotates video data with proxy-action labels. This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation of latent action learning approaches caused by distractors. Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improves the quality of inferred latent actions by **x2.7** and the efficiency of downstream fine-tuning with a small set of labeled actions, increasing returns by **x2.6** on average.
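
To make the pretraining idea concrete, below is a minimal sketch of LAPO-style latent action learning on top of object-centric (slot) features: an inverse dynamics model infers a latent proxy-action from consecutive slot representations, and a forward dynamics model is trained to predict the next slots from the current slots and that latent action. All module names, dimensions, and the use of precomputed slot features from a VideoSAUR-style encoder are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: latent action pretraining on object-centric slot features.
# Slot features are assumed to come from a VideoSAUR-style encoder (not shown).
import torch
import torch.nn as nn

SLOTS, SLOT_DIM, ACT_DIM = 7, 64, 16  # illustrative sizes


class InverseDynamics(nn.Module):
    """Infer a latent action from two consecutive object-centric observations."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * SLOTS * SLOT_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM),
        )

    def forward(self, slots_t, slots_t1):
        x = torch.cat([slots_t.flatten(1), slots_t1.flatten(1)], dim=-1)
        return self.net(x)


class ForwardDynamics(nn.Module):
    """Predict next-step slots from current slots and the latent action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SLOTS * SLOT_DIM + ACT_DIM, 256), nn.ReLU(),
            nn.Linear(256, SLOTS * SLOT_DIM),
        )

    def forward(self, slots_t, latent_action):
        x = torch.cat([slots_t.flatten(1), latent_action], dim=-1)
        return self.net(x).view(-1, SLOTS, SLOT_DIM)


idm, fdm = InverseDynamics(), ForwardDynamics()
opt = torch.optim.Adam(list(idm.parameters()) + list(fdm.parameters()), lr=3e-4)

# Stand-in batch of slot features for consecutive frames (random placeholders).
slots_t = torch.randn(32, SLOTS, SLOT_DIM)
slots_t1 = torch.randn(32, SLOTS, SLOT_DIM)

latent_action = idm(slots_t, slots_t1)        # inferred proxy-action label
pred_t1 = fdm(slots_t, latent_action)         # forward prediction from that action
loss = nn.functional.mse_loss(pred_t1, slots_t1)
opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch the inferred latent actions serve as proxy-action labels for unlabeled video; downstream, a small set of ground-truth actions would be used to fine-tune a decoder from latent to real actions.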

Cite

Text

Klepach et al. "Object-Centric Latent Action Learning." ICLR 2025 Workshops: WRL, 2025.

Markdown

[Klepach et al. "Object-Centric Latent Action Learning." ICLR 2025 Workshops: WRL, 2025.](https://mlanthology.org/iclrw/2025/klepach2025iclrw-objectcentric/)

BibTeX

@inproceedings{klepach2025iclrw-objectcentric,
  title     = {{Object-Centric Latent Action Learning}},
  author    = {Klepach, Albina and Nikulin, Alexander and Zisman, Ilya and Tarasov, Denis and Derevyagin, Alexander and Polubarov, Andrei and Lyubaykin, Nikita and Kurenkov, Vladislav},
  booktitle = {ICLR 2025 Workshops: WRL},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/klepach2025iclrw-objectcentric/}
}