ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Abstract

Embodied cognition argues that intelligence arises from continuous sensorimotor interaction with the world. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? To investigate this, we introduce ENACT, a benchmark that probes this question through world modeling from egocentric interaction. Grounded in a partially observable Markov decision process (POMDP) framework, ENACT comprises two complementary sequence reordering tasks: forward world modeling (predicting an ordered sequence of future states from actions) and inverse world modeling (inferring an ordered sequence of actions from state changes). Correctly solving these tasks indicates that the model has a solid understanding of how the environment will evolve given one's actions. Our scalable dataset contains 8,972 QA pairs derived from diverse, long-horizon household activities in the BEHAVIOR simulator. Experiments reveal a significant performance gap between state-of-the-art VLMs and humans, which widens dramatically as interaction horizons lengthen. We find that models consistently solve the inverse problem better than the forward one and exhibit strong embodied biases, showing a preference for right-handed actions and performance degradation with camera perspectives that deviate from those of human vision.
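As an illustration of the sequence reordering format described above, the sketch below scores a predicted ordering against a reference ordering. It is a minimal sketch under stated assumptions: the function names `exact_match` and `pairwise_accuracy`, the integer labels, and the example item are hypothetical, and the abstract does not specify which metrics ENACT actually reports.

```python
from itertools import combinations
from typing import Sequence


def exact_match(predicted: Sequence[int], reference: Sequence[int]) -> bool:
    """True only if the whole predicted ordering equals the reference."""
    return list(predicted) == list(reference)


def pairwise_accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of item pairs placed in the same relative order as the reference.

    Assumes the labels are distinct (an assumption of this sketch).
    """
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted must be a permutation of reference")
    if len(reference) < 2:
        return 1.0
    # Map each label to its position in the predicted ordering.
    position = {item: index for index, item in enumerate(predicted)}
    # Every pair (a, b) is listed with a before b in the reference.
    pairs = list(combinations(reference, 2))
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


# Hypothetical forward world modeling item: given the action sequence, the model
# must put four shuffled candidate states (labelled 0..3) into temporal order.
reference_order = [2, 0, 3, 1]
predicted_order = [2, 3, 0, 1]

print(exact_match(predicted_order, reference_order))
print(pairwise_accuracy(predicted_order, reference_order))
```

The same scoring would apply to a hypothetical inverse world modeling item, where the labels stand for shuffled actions rather than states. Exact match is strict, so a partial-credit measure such as pairwise accuracy may be more informative on long horizons, where a single swap would otherwise zero out the score.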

Cite

Text

Wang et al. "ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-enact/)

BibTeX

@inproceedings{wang2026iclr-enact,
  title     = {{ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction}},
  author    = {Wang, Qineng and Huang, Wenlong and Zhou, Yu and Yin, Hang and Bao, Tianwei and Lyu, Jianwen and Liu, Weiyu and Zhang, Ruohan and Wu, Jiajun and Fei-Fei, Li and Li, Manling},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-enact/}
}