ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

Abstract

Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework has three key components based on large-scale pre-trained 3D models: (i) temporal attention that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) time-aware point sampling and 4D latent anchoring, which promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.
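
To make component (iii) concrete, the following is a minimal sketch of what sharing noise across frames could look like when initializing per-frame latents for a diffusion-style sampler. The abstract does not specify how the noise is shared, so everything here is an assumption for illustration: the function names (`shared_noise`, `independent_noise`, `max_frame_difference`), the latent layout (frames × tokens × channels), and the choice to reuse one Gaussian sample verbatim for every frame are hypothetical and are not taken from the paper.

```python
import random


def shared_noise(num_frames, num_tokens, latent_dim, seed=0):
    """Draw one Gaussian sample and reuse it for every frame (assumed scheme)."""
    rng = random.Random(seed)
    base = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            for _ in range(num_tokens)]
    # Each frame receives its own copy of the same values, so every frame
    # starts denoising from an identical initial latent.
    return [[row[:] for row in base] for _ in range(num_frames)]


def independent_noise(num_frames, num_tokens, latent_dim, seed=0):
    """Comparison case: every frame draws its own Gaussian sample."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
             for _ in range(num_tokens)]
            for _ in range(num_frames)]


def max_frame_difference(noise):
    """Largest absolute difference between frame 0 and any later frame."""
    first = noise[0]
    return max(
        (abs(a - b)
         for frame in noise[1:]
         for row, row0 in zip(frame, first)
         for a, b in zip(row, row0)),
        default=0.0,
    )


if __name__ == "__main__":
    print(max_frame_difference(shared_noise(4, 8, 3)))
    print(max_frame_difference(independent_noise(4, 8, 3)))
```

With shared noise, the initial latents are identical across frames, so any frame-to-frame variation in the output has to come from the video conditioning rather than from sampling randomness. That is one plausible reading of why noise sharing would enhance temporal stability.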

Cite

Text

Yenphraphai et al. "ShapeGen4D: Towards High Quality 4D Shape Generation from Videos." International Conference on Learning Representations, 2026.

Markdown

[Yenphraphai et al. "ShapeGen4D: Towards High Quality 4D Shape Generation from Videos." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yenphraphai2026iclr-shapegen4d/)

BibTeX

@inproceedings{yenphraphai2026iclr-shapegen4d,
  title     = {{ShapeGen4D: Towards High Quality 4D Shape Generation from Videos}},
  author    = {Yenphraphai, Jiraphon and Mirzaei, Ashkan and Chen, Jianqi and Zou, Jiaxu and Tulyakov, Sergey and Yeh, Raymond A. and Wonka, Peter and Wang, Chaoyang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yenphraphai2026iclr-shapegen4d/}
}