RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks
Abstract
To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can handle. Typically, the VLM planner needs fine-tuning to learn to decompose a new task, which requires target-task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, without prior knowledge, the heuristically segmented sub-tasks can deviate significantly from the visuomotor policy's training data, thereby degrading task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes video demonstrations into prior-informed sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. RDD outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at https://rdd-neurips.github.io.
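The sketch below illustrates the retrieval-based alignment idea described in the abstract: score candidate sub-task intervals of a demonstration by how well their visual features match nearest neighbors in the low-level policy's training feature bank, and pick boundaries that maximize that score. It is a minimal, hypothetical illustration (the helper names, the greedy boundary search, and the cosine-similarity scoring are simplifying assumptions), not the authors' released implementation.

```python
# Hypothetical sketch of retrieval-based sub-task decomposition (not the paper's code).
import numpy as np

def segment_score(demo_feats, bank_feats):
    """Mean cosine similarity between an interval's frame features and their
    nearest neighbors in the low-level policy's training feature bank."""
    demo = demo_feats / np.linalg.norm(demo_feats, axis=1, keepdims=True)
    bank = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sims = demo @ bank.T                      # (interval_len, bank_size)
    return float(sims.max(axis=1).mean())     # retrieve the best match per frame

def decompose(demo_feats, bank_feats, num_subtasks, min_len=5):
    """Greedily refine sub-task boundaries to maximize the summed retrieval score.
    Illustrative only; an exhaustive or dynamic-programming search could be used."""
    T = len(demo_feats)
    bounds = np.linspace(0, T, num_subtasks + 1, dtype=int)  # start from a uniform split
    for i in range(1, num_subtasks):                          # refine each interior boundary
        best_b, best_s = bounds[i], -np.inf
        for b in range(bounds[i - 1] + min_len, bounds[i + 1] - min_len + 1):
            s = (segment_score(demo_feats[bounds[i - 1]:b], bank_feats)
                 + segment_score(demo_feats[b:bounds[i + 1]], bank_feats))
            if s > best_s:
                best_b, best_s = b, s
        bounds[i] = best_b
    return [(int(bounds[i]), int(bounds[i + 1])) for i in range(num_subtasks)]

# Toy usage: a 60-frame demo and a 200-clip feature bank, split into 3 sub-tasks.
rng = np.random.default_rng(0)
demo = rng.standard_normal((60, 128))
bank = rng.standard_normal((200, 128))
print(decompose(demo, bank, num_subtasks=3))
```

In this sketch the feature bank stands in for visual features extracted from the visuomotor policy's training clips, so intervals that resemble that data score higher and the resulting sub-tasks stay close to what the low-level policy has seen.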
Cite
Text
Yan et al. "RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks." Advances in Neural Information Processing Systems, 2025.
Markdown
[Yan et al. "RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yan2025neurips-rdd/)
BibTeX
@inproceedings{yan2025neurips-rdd,
title = {{RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks}},
author = {Yan, Mingxuan and Wang, Yuping and Liu, Zechun and Li, Jiachen},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/yan2025neurips-rdd/}
}