A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

Abstract

Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions (removing up to two-thirds of the embedding space), we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer-based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.
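The two measurements named in the abstract, angular similarity between conditional embeddings and pruning of low-magnitude dimensions, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, "angular similarity" is taken to mean cosine similarity, pruning is taken to mean zeroing the smallest-magnitude entries of each vector, and the toy embeddings are synthetic rather than extracted from any trained model. The paper's exact procedure may differ.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings."""
    sims = [
        cosine_similarity(embeddings[i], embeddings[j])
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
    ]
    return sum(sims) / len(sims)


def prune_low_magnitude(vec, keep_fraction):
    """Keep the largest-magnitude entries of vec and zero the rest.

    Assumption: pruning is modeled as zeroing entries per vector;
    the paper may define it differently.
    """
    k = max(1, round(len(vec) * keep_fraction))
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(vec)]


# Toy data (not from the paper): a large shared component plus small
# class-specific noise, which produces highly aligned embeddings.
rng = random.Random(0)
dim, num_classes = 48, 10
shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
embeddings = [
    [s + rng.gauss(0.0, 0.05) for s in shared] for _ in range(num_classes)
]

mean_sim = mean_pairwise_cosine(embeddings)
# Keeping one third of the dimensions corresponds to removing two-thirds.
pruned = [prune_low_magnitude(e, keep_fraction=1 / 3) for e in embeddings]
nonzero_per_vec = [sum(1 for x in p if x != 0.0) for p in pruned]

print(f"mean pairwise cosine similarity: {mean_sim:.4f}")
print(f"nonzero dims after pruning: {nonzero_per_vec[0]} of {dim}")
```

In this toy setting the high similarity is built in by construction, so the sketch only shows how such quantities could be computed, not that the reported values hold. Assessing the effect of pruning on generation quality would require running a trained model, which is outside the scope of this example.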

Cite

Text

Pham et al. "A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers." International Conference on Learning Representations, 2026.

Markdown

[Pham et al. "A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/pham2026iclr-hidden/)

BibTeX

@inproceedings{pham2026iclr-hidden,
  title     = {{A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers}},
  author    = {Pham, Trung X. and Zhang, Kang and Hong, Ji Woo and Yoo, Chang D.},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/pham2026iclr-hidden/}
}