Efficient Object-Centric Representation Learning Using Masked Generative Modeling
Abstract
Learning object-centric representations from visual inputs in an unsupervised manner has drawn attention as a foundation for more complex tasks, such as reasoning and reinforcement learning. However, current state-of-the-art methods, which rely on autoregressive transformers or diffusion models to generate scenes from object-centric representations, are computationally inefficient due to their sequential or iterative nature. This bottleneck limits their practical application and hinders scaling to more complex downstream tasks. To overcome this, we propose MOGENT, an efficient object-centric learning framework based on masked generative modeling. MOGENT conditions a masked bidirectional transformer on learned object slots and employs a parallel iterative decoding scheme to generate scenes, enabling efficient compositional generation. Experiments show that MOGENT significantly improves computational efficiency, accelerating generation by up to 67x over autoregressive models and up to 17x over diffusion-based models. Importantly, this efficiency is attained while maintaining strong or competitive performance on object segmentation and compositional generation tasks.
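The parallel iterative decoding the abstract refers to can be sketched in MaskGIT style: start with every token masked, predict all positions in parallel, commit the most confident predictions, and re-mask the rest according to a decaying schedule. The sketch below is illustrative only; the dummy predictor, the cosine schedule, and all function names are assumptions standing in for the paper's slot-conditioned bidirectional transformer.

```python
import numpy as np

def cosine_schedule(t, T):
    # Fraction of tokens still masked after step t (a common MaskGIT-style choice).
    return np.cos(np.pi / 2 * t / T)

def parallel_iterative_decode(predict_fn, num_tokens, vocab_size, T=8):
    """Decode all tokens in T parallel steps instead of num_tokens sequential ones.

    predict_fn(tokens, masked) -> (num_tokens, vocab_size) logits; here a
    stand-in for a slot-conditioned bidirectional transformer.
    """
    tokens = np.zeros(num_tokens, dtype=int)
    masked = np.ones(num_tokens, dtype=bool)
    for t in range(1, T + 1):
        logits = predict_fn(tokens, masked)
        # Softmax over the vocabulary to get per-position confidences.
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)
        conf = probs.max(-1)
        # Commit predictions at all masked positions in parallel ...
        tokens[masked] = pred[masked]
        # ... then re-mask the least confident ones for the next step.
        n_keep_masked = int(np.floor(cosine_schedule(t, T) * num_tokens))
        conf = np.where(masked, conf, np.inf)  # already-fixed tokens stay fixed
        new_masked = np.zeros(num_tokens, dtype=bool)
        if n_keep_masked > 0:
            new_masked[np.argsort(conf)[:n_keep_masked]] = True
        masked = new_masked & masked
    return tokens
```

With T much smaller than the number of tokens, this replaces one forward pass per token (autoregressive) or many denoising steps (diffusion) with a handful of parallel passes, which is the source of the speedups the abstract reports.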
Cite
Text
Nakano et al. "Efficient Object-Centric Representation Learning Using Masked Generative Modeling." Transactions on Machine Learning Research, 2025.
Markdown
[Nakano et al. "Efficient Object-Centric Representation Learning Using Masked Generative Modeling." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/nakano2025tmlr-efficient/)
BibTeX
@article{nakano2025tmlr-efficient,
title = {{Efficient Object-Centric Representation Learning Using Masked Generative Modeling}},
author = {Nakano, Akihiro and Suzuki, Masahiro and Matsuo, Yutaka},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/nakano2025tmlr-efficient/}
}