4D Visual Pre-Training for Robot Learning

Hou, Chengkai; Ze, Yanjie; Fu, Yankai; Gao, Zeyu; Hu, Songbo; Yu, Yue; Zhang, Shanghang; Xu, Huazhe

ICCV 2025 pp. 8451-8461

Abstract

General visual representations learned from web-scale datasets have achieved great success in robotics in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we seek a general visual pre-training framework that could improve all 3D representations. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the predictor as a diffusion model, and pre-trains the model directly on larger public datasets. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance among imitation learning methods. Moreover, the efficacy of FVP holds across various point cloud encoders and datasets. Finally, we apply FVP to RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visual-pretraining.github.io/.
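
To make the pre-training objective in the abstract concrete, here is a minimal sketch of a denoising loss for next-point-cloud prediction conditioned on the current point cloud. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the architecture, so the toy max-pooling encoder, the linear denoiser, the linear noise schedule, and all function and parameter names (`encode`, `denoise`, `next_cloud_diffusion_loss`, `w_enc`, `w_out`) are hypothetical stand-ins, and a standard DDPM-style noise-prediction loss is assumed.

```python
import numpy as np


def make_alpha_bars(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def encode(points, w_enc):
    """Toy stand-in for a point cloud encoder: per-point linear map, ReLU, max-pool."""
    feats = np.maximum(points @ w_enc, 0.0)  # (N, D)
    return feats.max(axis=0)                 # (D,)


def denoise(noisy_next, latent, t_frac, w_out):
    """Toy stand-in for the denoiser: predicts per-point noise from the noisy
    next cloud, the latent of the current cloud, and the normalized timestep."""
    n = noisy_next.shape[0]
    cond = np.broadcast_to(latent, (n, latent.shape[0]))
    t_col = np.full((n, 1), t_frac)
    inputs = np.concatenate([noisy_next, cond, t_col], axis=1)  # (N, 3 + D + 1)
    return inputs @ w_out                                       # (N, 3)


def next_cloud_diffusion_loss(current, nxt, w_enc, w_out, alpha_bars, rng):
    """Noise the next-frame cloud, then score the denoiser's noise prediction
    when it is conditioned on the encoding of the current-frame cloud."""
    t = rng.integers(0, len(alpha_bars))
    noise = rng.standard_normal(nxt.shape)
    a = alpha_bars[t]
    noisy_next = np.sqrt(a) * nxt + np.sqrt(1.0 - a) * noise
    latent = encode(current, w_enc)
    pred = denoise(noisy_next, latent, t / len(alpha_bars), w_out)
    return float(np.mean((pred - noise) ** 2))


# Synthetic data only: two consecutive "frames" of a 128-point cloud.
rng = np.random.default_rng(0)
current = rng.standard_normal((128, 3))
nxt = current + 0.01 * rng.standard_normal((128, 3))
w_enc = 0.1 * rng.standard_normal((3, 16))
w_out = 0.1 * rng.standard_normal((3 + 16 + 1, 3))
alpha_bars = make_alpha_bars()
loss = next_cloud_diffusion_loss(current, nxt, w_enc, w_out, alpha_bars, rng)
print(f"denoising loss on one frame pair: {loss:.4f}")
```

In a setup like this, minimizing the loss would update the encoder through the conditioning path, so the encoder weights could afterwards be reused to initialize a downstream policy; whether FVP transfers weights in exactly this way is not stated in the abstract.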

Cite

Text

Hou et al. "4D Visual Pre-Training for Robot Learning." International Conference on Computer Vision, 2025.

Markdown

[Hou et al. "4D Visual Pre-Training for Robot Learning." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/hou2025iccv-4d/)

BibTeX

@inproceedings{hou2025iccv-4d,
  title     = {{4D Visual Pre-Training for Robot Learning}},
  author    = {Hou, Chengkai and Ze, Yanjie and Fu, Yankai and Gao, Zeyu and Hu, Songbo and Yu, Yue and Zhang, Shanghang and Xu, Huazhe},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {8451--8461},
  url       = {https://mlanthology.org/iccv/2025/hou2025iccv-4d/}
}