Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL

Abstract

Embodied visual tracking requires an agent to follow a target object in dynamic 3D environments using its egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFMs) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as “Tracking Anything”, to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust policy within an hour on a consumer-level GPU, e.g., an NVIDIA RTX 3090. We evaluate our agent in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned agent from virtual environments to a real-world robot.

Project website: https://sites.google.com/view/offline-evt
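The abstract names Conservative Q-Learning (CQL) as one offline RL option. As a rough illustration of what that objective looks like, the sketch below computes a CQL-style loss for a batch of logged transitions. It assumes a discrete action space and the common log-sum-exp form of the conservative penalty; the abstract does not specify the action space, the CQL variant, or any hyperparameters, so the function name `cql_loss`, the `alpha` weight, and the input layout are all illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_loss(
    q_values: Sequence[Sequence[float]],  # Q(s, a) for every discrete action, per state
    actions: Sequence[int],               # index of the action stored in the dataset
    td_targets: Sequence[float],          # bootstrapped targets, assumed computed elsewhere
    alpha: float = 1.0,                   # hypothetical weight on the conservative term
) -> float:
    """Mean TD error plus a penalty that lowers Q-values of unseen actions."""
    n = len(q_values)
    td_error = 0.0
    penalty = 0.0
    for q, a, y in zip(q_values, actions, td_targets):
        # Standard squared Bellman error on the logged action.
        td_error += 0.5 * (q[a] - y) ** 2
        # Conservative term: log-sum-exp over all actions minus the logged action's value.
        penalty += logsumexp(q) - q[a]
    return td_error / n + alpha * penalty / n


# Toy batch: two states, three discrete actions each.
batch_q = [[0.2, 1.0, -0.5], [0.0, 0.0, 0.0]]
batch_actions = [1, 2]
batch_targets = [0.8, 0.1]
print(cql_loss(batch_q, batch_actions, batch_targets, alpha=0.5))
```

The penalty is never negative, because the log-sum-exp over all actions is at least as large as any single Q-value. Minimizing it pushes down the values of actions that do not appear in the demonstrations, which is what lets the policy be trained without online interaction.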

Cite

Text

Zhong et al. "Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73464-9_9

Markdown

[Zhong et al. "Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhong2024eccv-empowering/) doi:10.1007/978-3-031-73464-9_9

BibTeX

@inproceedings{zhong2024eccv-empowering,
  title     = {{Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL}},
  author    = {Zhong, Fangwei and Wu, Kui and Ci, Hai and Wang, Chu-ran and Chen, Hao},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73464-9_9},
  url       = {https://mlanthology.org/eccv/2024/zhong2024eccv-empowering/}
}