Representing 3D Shapes with 64 Latent Vectors for 3D Diffusion Models

Abstract

Constructing a compressed latent space through a variational autoencoder (VAE) is key to efficient 3D diffusion models. This paper introduces COD-VAE, which encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from the latent vectors instead of directly decoding neural fields, significantly reducing the computational overhead of neural field decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions, improving decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16x compression compared to the baseline while maintaining quality. This enables a 20.8x speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation.
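The uncertainty-guided token pruning described above can be illustrated with a minimal sketch: given per-token uncertainty scores, only the most uncertain tokens pass through the heavy decoder computation, while the rest are bypassed. This is a hypothetical NumPy illustration of the general idea, not the paper's implementation; the `keep_ratio` parameter and the random uncertainty scores are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_tokens(tokens, uncertainty, keep_ratio=0.25):
    """Keep the most uncertain tokens for full computation.

    Returns the kept token features plus the kept/skipped index sets,
    so downstream code can run the expensive block only on `kept` and
    reuse cheaper features for the skipped (simpler) regions.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-uncertainty)      # sort by descending uncertainty
    keep_idx = np.sort(order[:n_keep])    # tokens that get full computation
    skip_idx = np.sort(order[n_keep:])    # tokens whose computation is skipped
    return tokens[keep_idx], keep_idx, skip_idx

tokens = rng.normal(size=(256, 32))       # 256 tokens, 32-dim features
uncertainty = rng.uniform(size=256)       # hypothetical per-token scores
kept, keep_idx, skip_idx = prune_tokens(tokens, uncertainty)
print(kept.shape)  # (64, 32): only a quarter of the tokens are processed
```

With a 0.25 keep ratio, roughly 75% of tokens bypass the expensive block, which is the kind of adaptive compute allocation the abstract attributes to the pruning step.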

Cite

Text

Cho et al. "Representing 3D Shapes with 64 Latent Vectors for 3D Diffusion Models." International Conference on Computer Vision, 2025.

Markdown

[Cho et al. "Representing 3D Shapes with 64 Latent Vectors for 3D Diffusion Models." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/cho2025iccv-representing/)

BibTeX

@inproceedings{cho2025iccv-representing,
  title     = {{Representing 3D Shapes with 64 Latent Vectors for 3D Diffusion Models}},
  author    = {Cho, In and Yoo, Youngbeom and Jeon, Subin and Kim, Seon Joo},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {28556--28566},
  url       = {https://mlanthology.org/iccv/2025/cho2025iccv-representing/}
}