RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

Abstract

To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can handle. Typically, the VLM planner needs fine-tuning to learn to decompose a new task, which requires target-task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, without prior knowledge, the heuristic sub-tasks can deviate significantly from the visuomotor policy's training data, thereby degrading task performance. To address this issue, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes video demonstrations into sub-tasks informed by this prior, aligning the visual features of the decomposed sub-task intervals with those of the low-level visuomotor policies' training data. RDD outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at https://rdd-neurips.github.io
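The abstract only describes the alignment idea at a high level, so the sketch below is an illustrative reading rather than the paper's actual algorithm: it segments a demonstration by choosing cut points whose mean-pooled interval features are most similar (by cosine similarity) to clips retrieved from the low-level policy's training data. The mean pooling, nearest-neighbor retrieval bank, dynamic program, and the `min_len`/`max_len` parameters are all assumptions introduced for this example.

```python
import numpy as np

def segment_feature(frame_feats, start, end):
    """Mean-pool per-frame visual features over [start, end) as the interval descriptor (assumed pooling)."""
    return frame_feats[start:end].mean(axis=0)

def best_match_score(interval_feat, bank):
    """Cosine similarity to the closest sub-task clip feature in the retrieval bank."""
    a = interval_feat / (np.linalg.norm(interval_feat) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return float((b @ a).max())

def decompose(frame_feats, bank, min_len=8, max_len=64):
    """Hypothetical dynamic program over cut points: pick sub-task boundaries whose
    pooled features best match clips drawn from the low-level policy's training data."""
    T = len(frame_feats)
    score = np.full(T + 1, -np.inf)   # best total score for a segmentation of frames [0, t)
    score[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t - min_len + 1):
            if score[s] == -np.inf:
                continue
            cand = score[s] + best_match_score(segment_feature(frame_feats, s, t), bank)
            if cand > score[t]:
                score[t], back[t] = cand, s
    # Recover sub-task intervals by walking the backpointers.
    cuts, t = [], T
    while t > 0:
        cuts.append((back[t], t))
        t = back[t]
    return cuts[::-1]

# Toy usage with random vectors standing in for visual embeddings.
demo = np.random.randn(120, 512).astype(np.float32)   # per-frame demonstration features
bank = np.random.randn(200, 512).astype(np.float32)   # sub-task clip features from policy training data
print(decompose(demo, bank))
```

In this reading, the retrieval bank plays the role of the prior: any candidate segmentation is scored only by how well its intervals resemble sub-tasks the visuomotor policy has already seen.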

Cite

Text

Yan et al. "RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks." Advances in Neural Information Processing Systems, 2025.

Markdown

[Yan et al. "RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yan2025neurips-rdd/)

BibTeX

@inproceedings{yan2025neurips-rdd,
  title     = {{RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks}},
  author    = {Yan, Mingxuan and Wang, Yuping and Liu, Zechun and Li, Jiachen},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/yan2025neurips-rdd/}
}