Text-to-3D by Stitching a Multi-View Reconstruction Network to a Video Generator

Abstract

The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce **VIST3A**, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit *model stitching*, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt *direct reward finetuning*, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
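To make the stitching step concrete, here is a minimal sketch of one common way to choose a stitching layer: fit a linear map from the generator's latents to each candidate decoder layer's activations, then keep the layer with the smallest fitting error. This is an illustration only, not the authors' implementation. The function names (`fit_linear_stitch`, `pick_stitch_layer`), the layer names, the least-squares criterion, and the toy data are all assumptions. As in the abstract, only paired activations are needed, not labels.

```python
import numpy as np

def fit_linear_stitch(latents, activations):
    """Least-squares fit of activations ~ [latents, 1] @ W.

    Returns the fitted map W and the relative squared error of the fit.
    Hypothetical helper for illustration only.
    """
    # Append a constant column so the linear map includes a bias term.
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, activations, rcond=None)
    residual = activations - X @ W
    rel_err = float(np.sum(residual ** 2) / np.sum(activations ** 2))
    return W, rel_err

def pick_stitch_layer(latents, layer_activations):
    """Fit one linear map per candidate layer and return the best match."""
    maps, errors = {}, {}
    for name, acts in layer_activations.items():
        maps[name], errors[name] = fit_linear_stitch(latents, acts)
    best = min(errors, key=errors.get)
    return best, maps[best], errors

# Toy stand-ins (assumed): rows are samples, columns are feature dimensions.
rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 16))    # plays the role of generator latents
true_map = rng.normal(size=(16, 32))
layer_activations = {
    # Unrelated to the latents, so a linear map fits poorly.
    "layer_1": rng.normal(size=(256, 32)),
    # Almost exactly linear in the latents, so a linear map fits well.
    "layer_2": latents @ true_map + 0.01 * rng.normal(size=(256, 32)),
    # Nonlinear in the latents, so a linear map fits only partially.
    "layer_3": np.tanh(latents @ true_map),
}

best_layer, W, errors = pick_stitch_layer(latents, layer_activations)
print(best_layer, {name: round(err, 4) for name, err in errors.items()})
```

In a real pipeline, the activations would come from running the 3D reconstruction network on the same views whose latents the video model's encoder produces, and the fitted map would initialize a stitching layer placed in front of the remaining decoder layers.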

Cite

Text

Go et al. "Text-to-3D by Stitching a Multi-View Reconstruction Network to a Video Generator." International Conference on Learning Representations, 2026.

Markdown

[Go et al. "Text-to-3D by Stitching a Multi-View Reconstruction Network to a Video Generator." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/go2026iclr-textto3d/)

BibTeX

@inproceedings{go2026iclr-textto3d,
  title     = {{Text-to-3D by Stitching a Multi-View Reconstruction Network to a Video Generator}},
  author    = {Go, Hyojun and Narnhofer, Dominik and Bhat, Goutam and Truong, Prune and Tombari, Federico and Schindler, Konrad},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/go2026iclr-textto3d/}
}