Exploring the Limits of Vision-Language-Action Manipulation in Cross-Task Generalization
Abstract
The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce **AGNOSTOS**, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, all distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose **Cross-Task In-Context Manipulation (X-ICM)**, a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a **dynamics-guided sample selection** strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs, with gains of 6.0% over $\pi_0$ and 7.9% over VoxPoser. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
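As an illustration only, the in-context idea described in the abstract (pick relevant demonstrations from seen tasks, then condition an LLM on them) can be sketched as below. The abstract does not say how the dynamics features are computed or how prompts are formatted, so this sketch assumes each demonstration already carries a precomputed feature vector, swaps in plain cosine similarity as the relevance score, and uses a made-up prompt layout. All names (`select_demonstrations`, `build_prompt`, the dictionary keys) are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_demonstrations(query_feature: Sequence[float],
                          demos: List[Dict],
                          k: int = 2) -> List[Dict]:
    """Return the k seen-task demonstrations most similar to the query.

    Assumption: 'feature' stands in for whatever dynamics representation
    the method uses; here it is just a list of floats.
    """
    ranked = sorted(
        demos,
        key=lambda d: cosine_similarity(query_feature, d["feature"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction: str, selected: List[Dict]) -> str:
    """Assemble a few-shot prompt: selected demonstrations first, then the unseen task."""
    lines = []
    for demo in selected:
        lines.append(f"Task: {demo['instruction']}")
        lines.append(f"Actions: {demo['actions']}")
    lines.append(f"Task: {instruction}")
    lines.append("Actions:")  # left open for the LLM to complete
    return "\n".join(lines)


# Hypothetical seen-task demonstrations with toy 2-D features.
seen_demos = [
    {"instruction": "close the drawer", "actions": "[a1, a2]", "feature": [1.0, 0.1]},
    {"instruction": "stack the blocks", "actions": "[b1, b2]", "feature": [0.0, 1.0]},
    {"instruction": "close the box", "actions": "[c1, c2]", "feature": [0.9, 0.2]},
]

chosen = select_demonstrations([1.0, 0.0], seen_demos, k=2)
prompt = build_prompt("close the microwave", chosen)
```

In this toy setup the two "close" demonstrations are ranked ahead of the stacking one, and the resulting prompt ends with an open `Actions:` line for the model to fill in. A real system would replace the toy features and the string-encoded actions with the representations the method actually uses.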
Cite
Text
Zhou et al. "Exploring the Limits of Vision-Language-Action Manipulation in Cross-Task Generalization." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zhou et al. "Exploring the Limits of Vision-Language-Action Manipulation in Cross-Task Generalization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhou2025neurips-exploring-a/)
BibTeX
@inproceedings{zhou2025neurips-exploring-a,
title = {{Exploring the Limits of Vision-Language-Action Manipulation in Cross-Task Generalization}},
author = {Zhou, Jiaming and Ye, Ke and Liu, Jiayi and Ma, Teli and Wang, Zifan and Qiu, Ronghe and Lin, Kun-Yu and Zhao, Zhilin and Liang, Junwei},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/zhou2025neurips-exploring-a/}
}