Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

Abstract

This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding. Code and more results are available at: https://davidyao99.github.io/uni4d.
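The abstract describes a staged pipeline rather than a single network: pretrained foundation models supply per-frame depth, long-range point tracks, and dynamic-object masks, and the 4D quantities (camera poses, static geometry, dynamic 3D trajectories) are then recovered by optimization on top of those cues. The sketch below is a minimal, self-contained illustration of that data flow, not the authors' implementation: the foundation-model outputs are synthetic stand-ins, and a closed-form Kabsch alignment on static points stands in for the paper's multi-stage optimization.

import numpy as np

rng = np.random.default_rng(0)
T, N = 8, 60                                       # frames, tracked points
world = rng.normal(size=(N, 3)) + [0.0, 0.0, 5.0]  # latent scene points
is_dyn = rng.random(N) < 0.2                       # stand-in "segmentation" mask

def kabsch(src, dst):
    # Closed-form rigid fit dst ~ R @ src + t (stands in for pose optimization).
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                       # resolve reflection ambiguity
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

# Stand-in for "video depth + motion tracking": camera-frame 3D observations
# of every tracked point, with dynamic points drifting over time.
obs = []
for t in range(T):
    ang = 0.05 * t                                 # camera yaw per frame
    R_cam = np.array([[np.cos(ang), 0, np.sin(ang)],
                      [0, 1, 0],
                      [-np.sin(ang), 0, np.cos(ang)]])
    t_cam = np.array([0.1 * t, 0.0, 0.0])          # camera translation per frame
    pts = world.copy()
    pts[is_dyn] += [0.0, 0.2 * t, 0.0]             # independent object motion
    obs.append(pts @ R_cam.T + t_cam)

# Stage 1: camera pose per frame from STATIC points only (frame 0 = world).
poses = [kabsch(obs[0][~is_dyn], obs[t][~is_dyn]) for t in range(T)]

# Stages 2-3: lift all observations back to world coordinates; static points
# fuse into one reconstruction, dynamic points trace dense 3D trajectories.
traj = np.stack([(obs[t] - tr) @ R for t, (R, tr) in enumerate(poses)])
print("static point scatter (~0):", traj[:, ~is_dyn].std(axis=0).max())
print("dynamic 3D path length:",
      np.linalg.norm(np.diff(traj[:, is_dyn], axis=0), axis=-1).sum())

In the actual method, the synthetic stand-ins would be replaced by real foundation-model predictions, and the closed-form alignment by the paper's multi-stage optimization over poses, geometry, and motion.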

Cite

Text

Yao et al. "Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00112

Markdown

[Yao et al. "Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/yao2025cvpr-uni4d/) doi:10.1109/CVPR52734.2025.00112

BibTeX

@inproceedings{yao2025cvpr-uni4d,
  title     = {{Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video}},
  author    = {Yao, David Yifan and Zhai, Albert J. and Wang, Shenlong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {1116--1126},
  doi       = {10.1109/CVPR52734.2025.00112},
  url       = {https://mlanthology.org/cvpr/2025/yao2025cvpr-uni4d/}
}