Building 3D Representations and Generating Motions from a Single Image via Video-Generation
Abstract
Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as \emph{DepthAnything}. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments. We demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.
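The abstract mentions a multi-scale noise approach for training an implicit representation of the scene from the dense point cloud. As an illustration only (the paper's exact formulation is not given here), one common way to realize this idea is to perturb surface points with Gaussian noise at several scales, yielding off-surface samples that supervise an implicit distance-style model: small scales refine geometry near the surface, larger scales cover free space. The function name and scale values below are assumptions, not from the paper.

```python
import numpy as np

def multiscale_noise_samples(points, scales=(0.01, 0.05, 0.2), seed=0):
    """Perturb each surface point with zero-mean Gaussian noise at
    several standard deviations, stacking the results.

    points: (N, 3) array from a dense point cloud.
    Returns an (N * len(scales), 3) array of off-surface samples.
    Illustrative sketch; scales are hypothetical defaults.
    """
    rng = np.random.default_rng(seed)
    samples = [points + rng.normal(scale=s, size=points.shape)
               for s in scales]
    return np.concatenate(samples, axis=0)

# Toy stand-in for the foundation-model point cloud: points on a unit sphere.
pts = np.random.default_rng(1).normal(size=(100, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

off_surface = multiscale_noise_samples(pts)
print(off_surface.shape)  # (300, 3)
```

In such schemes, the perturbed samples (paired with their known displacement from the surface) provide supervision targets for the implicit model, which the motion generator can then query for collision avoidance.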
Cite
Text
Zhi et al. "Building 3D Representations and Generating Motions from a Single Image via Video-Generation." Advances in Neural Information Processing Systems, 2025.

Markdown
[Zhi et al. "Building 3D Representations and Generating Motions from a Single Image via Video-Generation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhi2025neurips-building/)

BibTeX
@inproceedings{zhi2025neurips-building,
  title = {{Building 3D Representations and Generating Motions from a Single Image via Video-Generation}},
  author = {Zhi, Weiming and Ma, Ziyong and Zhang, Tianyi and Johnson-Roberson, Matthew},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/zhi2025neurips-building/}
}