DS-VLM: Diffusion Supervision Vision Language Model

Abstract

Vision-Language Models (VLMs) face two critical limitations in visual representation learning: degraded supervision caused by information loss during gradient propagation, and the inherent semantic sparsity of textual supervision relative to visual data. We propose the Diffusion Supervision Vision-Language Model (DS-VLM), a plug-and-play framework that introduces diffusion-based direct supervision for vision-language alignment. By reconstructing input images with a diffusion model conditioned on the outputs of the visual encoder and the connector, our method establishes a short-path gradient channel from pixel space to visual features. The approach preserves high-level semantic alignment through conventional text supervision while enhancing visual feature quality via pixel-level reconstruction constraints. Extensive experiments across various visual encoders and LLMs of different scales demonstrate the effectiveness of our approach.
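
The abstract describes two supervision paths that train jointly: the usual autoregressive text loss through the LLM, and a diffusion reconstruction loss that feeds gradients from pixel space directly back into the visual encoder and connector. The sketch below illustrates how such a combined objective could be wired up in PyTorch. It is a minimal sketch only: the module interfaces, the DDPM-style epsilon-prediction loss, the linear noise schedule, and the weighting factor lambda_diff are all assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DSVLMSketch(nn.Module):
    """Hypothetical dual-supervision wrapper; submodule interfaces are assumed."""

    def __init__(self, vision_encoder, connector, llm, denoiser,
                 num_steps=1000, lambda_diff=0.1):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP-style ViT
        self.connector = connector            # projects visual tokens into LLM space
        self.llm = llm                        # autoregressive LM returning a CE loss
        self.denoiser = denoiser              # diffusion network conditioned on visual features
        self.num_steps = num_steps
        self.lambda_diff = lambda_diff        # assumed loss weight, not from the paper
        # Linear beta schedule; precompute \bar{alpha}_t for q(x_t | x_0).
        betas = torch.linspace(1e-4, 2e-2, num_steps)
        self.register_buffer("alpha_bar", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, images, text_ids, text_labels):
        # Text supervision path (long gradient path through the LLM).
        vis_feats = self.vision_encoder(images)   # (B, N, D_v)
        vis_tokens = self.connector(vis_feats)    # (B, N, D_llm)
        text_loss = self.llm(vis_tokens, text_ids, labels=text_labels)

        # Diffusion supervision path (short gradient path from pixels to features).
        b = images.size(0)
        t = torch.randint(0, self.num_steps, (b,), device=images.device)
        noise = torch.randn_like(images)
        a_bar = self.alpha_bar[t].view(b, 1, 1, 1)
        noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise
        # The denoiser sees both encoder and connector outputs as conditioning,
        # so reconstruction gradients reach both modules directly.
        pred_noise = self.denoiser(noisy, t, cond=(vis_feats, vis_tokens))
        diff_loss = F.mse_loss(pred_noise, noise)

        return text_loss + self.lambda_diff * diff_loss

Because the denoiser is used only as a training-time supervision signal, a wrapper like this could be dropped at inference, which is consistent with the "plug-and-play" framing in the abstract.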

Cite

Text

Sun et al. "DS-VLM: Diffusion Supervision Vision Language Model." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Sun et al. "DS-VLM: Diffusion Supervision Vision Language Model." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/sun2025icml-dsvlm/)

BibTeX

@inproceedings{sun2025icml-dsvlm,
  title     = {{DS-VLM: Diffusion Supervision Vision Language Model}},
  author    = {Sun, Zhen and Shen, Yunhang and Li, Jie and Sun, Xing and Dai, Pingyang and Cao, Liujuan and Ji, Rongrong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {57667--57679},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/sun2025icml-dsvlm/}
}