Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation
Abstract
Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
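The abstract outlines a closed loop of observe, predict a visual plan, extract end-effector poses, and execute. The Python sketch below is only an illustrative reading of that description under stated assumptions, not the authors' released code: all interface names (`VideoForesightModel.predict`, `PoseEstimator.estimate`, `Robot.move_to`, `Robot.task_done`) are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence
import numpy as np


@dataclass
class Pose:
    """End-effector pose: position (xyz), orientation (quaternion), gripper state."""
    position: np.ndarray
    orientation: np.ndarray
    gripper_open: bool


class VideoForesightModel(Protocol):
    """Generative model predicting future RGB-D frames from one RGB view and a task prompt."""
    def predict(self, rgb: np.ndarray, task: str) -> Sequence[np.ndarray]: ...


class PoseEstimator(Protocol):
    """Task-agnostic model recovering end-effector poses from predicted frames."""
    def estimate(self, frame: np.ndarray) -> Pose: ...


class Robot(Protocol):
    """Low-level controller interface (hypothetical)."""
    def capture_rgb(self) -> np.ndarray: ...
    def move_to(self, pose: Pose) -> None: ...
    def task_done(self, task: str) -> bool: ...


def run_closed_loop(robot: Robot,
                    foresight: VideoForesightModel,
                    pose_model: PoseEstimator,
                    task: str,
                    max_iters: int = 50) -> None:
    """Iterate: observe -> predict visual plan -> extract poses -> execute -> re-observe."""
    for _ in range(max_iters):
        rgb = robot.capture_rgb()                       # current side-view observation
        plan = foresight.predict(rgb, task)             # predicted future RGB-D frames
        for frame in plan:
            robot.move_to(pose_model.estimate(frame))   # follow the visual plan step by step
        if robot.task_done(task):                       # otherwise replan from the new observation
            break
```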
Cite
Text
Zhang et al. "Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation." Proceedings of The 9th Conference on Robot Learning, 2025.
Markdown
[Zhang et al. "Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation." Proceedings of The 9th Conference on Robot Learning, 2025.](https://mlanthology.org/corl/2025/zhang2025corl-generative/)
BibTeX
@inproceedings{zhang2025corl-generative,
title = {{Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation}},
author = {Zhang, Chuye and Zhang, Xiaoxiong and Zheng, Linfang and Pan, Wei and Zhang, Wei},
booktitle = {Proceedings of The 9th Conference on Robot Learning},
year = {2025},
pages = {2823-2846},
volume = {305},
url = {https://mlanthology.org/corl/2025/zhang2025corl-generative/}
}