Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Abstract

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model’s intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view–conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over baseline methods.
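
The two alignment objectives named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract only states that Angular Alignment uses cosine similarity and that Scale Alignment regresses unnormalized geometric features from normalized diffusion representations. The exact loss forms (one minus cosine similarity, mean squared error), the linear regression head `W`, `b`, the feature shapes, and all function names below are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each feature vector (last axis) to approximately unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def angular_alignment_loss(diff_feats, geo_feats):
    """Assumed form: mean of (1 - cosine similarity) over all tokens.

    Only the direction of the features matters here; their magnitude is
    discarded by the normalization.
    """
    cos = np.sum(l2_normalize(diff_feats) * l2_normalize(geo_feats), axis=-1)
    return float(np.mean(1.0 - cos))


def scale_alignment_loss(diff_feats, geo_feats, W, b):
    """Assumed form: MSE between the unnormalized geometric features and a
    linear prediction computed from the normalized diffusion features.

    W and b stand in for a hypothetical learnable regression head.
    """
    pred = l2_normalize(diff_feats) @ W + b
    return float(np.mean((pred - geo_feats) ** 2))


# Toy features with hypothetical shape (frames, tokens, channels).
rng = np.random.default_rng(0)
geo = rng.normal(size=(4, 16, 8))   # stand-in for geometric foundation model features
diff = 3.0 * geo                    # same directions as geo, different magnitude

W = np.eye(8)                       # untrained identity head, for illustration only
b = np.zeros(8)

ang = angular_alignment_loss(diff, geo)
scl = scale_alignment_loss(diff, geo, W, b)
ang_flipped = angular_alignment_loss(-geo, geo)
```

In this toy setup the angular term is close to zero because the two feature sets point in the same directions, while the scale term stays positive because the untrained head cannot yet recover the magnitudes that normalization removed. That gap is the kind of scale-related information the second objective is described as preserving.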

Cite

Text

Wu et al. "Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling." International Conference on Learning Representations, 2026.

Markdown

[Wu et al. "Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-geometry/)

BibTeX

@inproceedings{wu2026iclr-geometry,
  title     = {{Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling}},
  author    = {Wu, Haoyu and Wu, Diankun and He, Tianyu and Guo, Junliang and Ye, Yang and Duan, Yueqi and Bian, Jiang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wu2026iclr-geometry/}
}