Training-Free Generation of Temporally Consistent Rewards from VLMs
Abstract
Recent advances in vision-language models (VLMs) have significantly improved performance on embodied tasks such as goal decomposition and visual comprehension. However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging, both because pre-training datasets lack domain-specific robotic knowledge and because high computational costs hinder real-time applicability. To address this, we propose T2-VLM, a novel training-free, temporally consistent framework that generates accurate rewards by tracking status changes in VLM-derived subgoals. Specifically, our method first queries the VLM to establish spatially aware subgoals and an initial completion estimate before each round of interaction. We then employ a Bayesian tracking algorithm to update the goal completion status dynamically, using subgoal hidden states to generate structured rewards for reinforcement learning (RL) agents. This approach enhances long-horizon decision-making and improves the failure recovery capabilities of RL agents. Extensive experiments indicate that T2-VLM achieves state-of-the-art performance on two robot manipulation benchmarks, demonstrating superior reward accuracy at reduced computational cost. We believe our approach not only advances reward generation techniques but also contributes to the broader field of embodied AI. Project website: https://t2-vlm.github.io/.
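To make the idea of tracking subgoal completion concrete, the following is a minimal sketch of a generic two-state Bayes filter, not the algorithm used in T2-VLM. It assumes each subgoal has a binary hidden state (complete or not), that some detector supplies a noisy boolean observation per subgoal at every step, and that the reward is the change in total expected completion. The class name `SubgoalTracker` and all probabilities (`p_complete`, `p_undo`, `p_hit`, `p_false`) are hypothetical illustration values; the initial beliefs stand in for the VLM's initial completion estimate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubgoalTracker:
    """Hypothetical two-state Bayes filter over subgoal completion."""

    beliefs: List[float]      # P(subgoal i is complete), e.g. an initial estimate
    p_complete: float = 0.1   # assumed P(incomplete -> complete) per step
    p_undo: float = 0.05      # assumed P(complete -> incomplete), i.e. a failure
    p_hit: float = 0.9        # assumed P(observe "done" | complete)
    p_false: float = 0.1      # assumed P(observe "done" | incomplete)

    def step(self, observations: List[bool]) -> float:
        """Update every belief from one observation each; return the reward."""
        previous_total = sum(self.beliefs)
        updated = []
        for belief, seen_done in zip(self.beliefs, observations):
            # Predict: propagate the belief through the transition model.
            prior = belief * (1 - self.p_undo) + (1 - belief) * self.p_complete
            # Update: weight the prior by the observation likelihoods.
            like_done = self.p_hit if seen_done else 1 - self.p_hit
            like_not = self.p_false if seen_done else 1 - self.p_false
            evidence = like_done * prior + like_not * (1 - prior)
            updated.append(like_done * prior / evidence)
        self.beliefs = updated
        # Reward: change in the expected number of completed subgoals.
        return sum(self.beliefs) - previous_total


tracker = SubgoalTracker(beliefs=[0.5, 0.5])
progress_reward = tracker.step([True, False])   # first subgoal looks done
setback_reward = tracker.step([False, False])   # first subgoal looks undone
```

In this sketch the reward is positive when the evidence suggests progress and negative when a previously completed subgoal appears to be undone, which is one simple way a tracker could signal failures to an RL agent.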
Cite
Text
Zhao et al. "Training-Free Generation of Temporally Consistent Rewards from VLMs." International Conference on Computer Vision, 2025.
Markdown
[Zhao et al. "Training-Free Generation of Temporally Consistent Rewards from VLMs." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhao2025iccv-trainingfree/)
BibTeX
@inproceedings{zhao2025iccv-trainingfree,
title = {{Training-Free Generation of Temporally Consistent Rewards from VLMs}},
author = {Zhao, Yinuo and Yuan, Jiale and Xu, Zhiyuan and Hao, Xiaoshuai and Zhang, Xinyi and Wu, Kun and Che, Zhengping and Liu, Chi Harold and Tang, Jian},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {8133--8143},
url = {https://mlanthology.org/iccv/2025/zhao2025iccv-trainingfree/}
}