Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory

Abstract

Vision-Language-Action (VLA) models are crucial for autonomous decision-making in embodied systems. While current methods have advanced single-skill abilities, their short-horizon capability limits their applicability in real-world scenarios. To address this challenge, we propose MindExplore, a general hierarchical VLA system with cross-skill capability for long-horizon tasks in highly dynamic sandy environments. The key insight is to iteratively align the knowledge domains of task planning and action execution, so that the resulting task-oriented actions generalize across a wide range of real-world scenarios. In the reasoning layer, task-specific chains of thought (CoT) are designed to plan long-horizon task sequences and provide meta-action signals. In the acting layer, a simple but powerful Mixture of Policy Experts strategy, driven by these signals and multimodal inputs, adaptively selects skill experts and generates closed-loop action sequences. It also integrates a lightweight Multimodal Diffusion Policy (MMDP) that enhances spatial perception by fusing features from multiple visual modalities. In addition, a memory mechanism establishes feedback between the reasoning and acting layers, facilitating adaptive execution of long-horizon tasks and real-time replanning. We also create SandGo-1k and SandThink-21k, the first expert-level multimodal embodied dataset and CoT dataset tailored for sandy environments. At a high execution frequency of 30 FPS, MindExplore is 3.01 times more successful than existing methods in unstructured and dynamic environments.
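
The control flow described above (a reasoning layer that emits meta-action signals, an acting layer that routes each signal to a skill expert, and a memory that feeds execution outcomes back for replanning) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Memory`, `plan`, `run`, `demo`, the skill names, and the "dig after a failed grasp" rule) is a hypothetical placeholder, and a plain dictionary lookup stands in for the Mixture of Policy Experts selection, which in the paper also depends on multimodal inputs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Memory:
    """Hypothetical shared log of (meta_action, succeeded) pairs."""
    events: List[Tuple[str, bool]] = field(default_factory=list)

    def record(self, meta_action: str, succeeded: bool) -> None:
        self.events.append((meta_action, succeeded))

    def failures(self, meta_action: str) -> int:
        return sum(1 for m, ok in self.events if m == meta_action and not ok)


def plan(task: str, memory: Memory) -> List[str]:
    """Toy stand-in for the reasoning layer: returns meta-action signals.

    Assumption: a real system would derive this sequence from a
    vision-language model using chain-of-thought reasoning.
    """
    if memory.failures("grasp") > 0:
        # Assumed replanning rule: a failed grasp means the object is buried.
        return ["navigate", "dig", "grasp", "place"]
    return ["navigate", "grasp", "place"]


Expert = Callable[[dict], bool]  # takes an observation, reports success


def run(task: str, experts: Dict[str, Expert],
        observe: Callable[[], dict], max_replans: int = 3) -> Tuple[bool, Memory]:
    """Plan, execute step by step, and replan from memory on failure."""
    memory = Memory()
    for _ in range(max_replans + 1):
        for meta_action in plan(task, memory):
            expert = experts[meta_action]  # lookup stands in for learned gating
            ok = expert(observe())
            memory.record(meta_action, ok)
            if not ok:
                break  # abandon this plan and go back to the reasoning layer
        else:
            return True, memory  # every step of the plan succeeded
    return False, memory


def demo() -> Tuple[bool, Memory]:
    state = {"dug": False}

    def dig(obs: dict) -> bool:
        state["dug"] = True
        return True

    experts: Dict[str, Expert] = {
        "navigate": lambda obs: True,
        "dig": dig,
        "grasp": lambda obs: obs["dug"],  # grasp only works once uncovered
        "place": lambda obs: True,
    }
    return run("move the rock", experts, observe=lambda: dict(state))
```

In this toy run the first plan fails at the grasp step, the failure is written to memory, and the second plan inserts a dig step before grasping. The point of the sketch is only the feedback path from execution back to planning; learned policies, diffusion-based action generation, and perception are all out of scope.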

Cite

Text

Li et al. "Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory." International Conference on Computer Vision, 2025.

Markdown

[Li et al. "Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/li2025iccv-longhorizon/)

BibTeX

@inproceedings{li2025iccv-longhorizon,
  title     = {{Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory}},
  author    = {Li, Daixun and Zhang, Yusi and Cao, Mingxiang and Liu, Donglai and Xie, Weiying and Hui, Tianlin and Lin, Lunkai and Xie, Zhiqiang and Li, Yunsong},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {6839-6848},
  url       = {https://mlanthology.org/iccv/2025/li2025iccv-longhorizon/}
}