ActionEQA: Action Interface for Embodied Question Answering

Bao, Tianwei; Wang, Qineng; Wang, Kangrui; Deng, Mingkai; Liu, Guangyi; Mao, Jiayuan; Birnbaum, Lawrence; Hu, Zhiting; Xing, Eric P.; Wang, Zhaoran; Li, Manling

TMLR 2026

Abstract

While Vision-Language Models (VLMs) are increasingly integral to embodied intelligence, a significant action-understanding bottleneck persists in translating high-level semantic instructions into precise low-level physical actions. Current benchmarks for embodied agents focus primarily on high-level perception and planning and fail to capture the depth and nature of this semantic-to-physical gap. To address this, we introduce ActionEQA, the first Embodied Question Answering (EQA) benchmark designed to methodically evaluate the ability of VLMs to bridge this critical yet underexplored semantic-physical divide. Grounded in real-world robotics data, ActionEQA thoroughly analyzes VLMs' grasp of the action interface using a two-part design: (1) a Three-Tiered Action Hierarchy for pinpointing the depth at which VLMs' action reasoning collapses, and (2) Bidirectional Reasoning Tasks for testing whether VLMs struggle more to predict action outcomes or to infer the actions that led to them. Our key findings are: (1) the primary bottleneck in action understanding occurs at the mid-level, arising from the challenge of grounding compositional language in 3D physical geometry; (2) VLMs are more adept at inferring past actions than at predicting their future outcomes; (3) richer visual inputs require greater spatial reasoning from VLMs to map actions to physical geometry; and (4) within the action hierarchy, model failures shift from predominantly perceptual errors at the high level to flawed geometric and physical reasoning at the low level.

Cite

Text

Bao et al. "ActionEQA: Action Interface for Embodied Question Answering." Transactions on Machine Learning Research, 2026.

Markdown

[Bao et al. "ActionEQA: Action Interface for Embodied Question Answering." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/bao2026tmlr-actioneqa/)

BibTeX

@article{bao2026tmlr-actioneqa,
  title     = {{ActionEQA: Action Interface for Embodied Question Answering}},
  author    = {Bao, Tianwei and Wang, Qineng and Wang, Kangrui and Deng, Mingkai and Liu, Guangyi and Mao, Jiayuan and Birnbaum, Lawrence and Hu, Zhiting and Xing, Eric P. and Wang, Zhaoran and Li, Manling},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/bao2026tmlr-actioneqa/}
}