DS-VLM: Diffusion Supervision Vision Language Model

Abstract

Vision-Language Models (VLMs) face two critical limitations in visual representation learning: degraded supervision caused by information loss during gradient propagation, and the inherent semantic sparsity of textual supervision relative to visual data. We propose the Diffusion Supervision Vision-Language Model (DS-VLM), a plug-and-play framework that introduces diffusion-based direct supervision for vision-language alignment. By reconstructing input images with a diffusion model conditioned on the outputs of the visual encoder and the connector, our method establishes a short-path gradient channel from pixel space to visual features. The approach preserves high-level semantic alignment through conventional text supervision while enhancing visual feature quality via pixel-level reconstruction constraints. Extensive experiments across various visual encoders and LLMs of different scales demonstrate the effectiveness of our approach.
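
The abstract describes two supervision paths that train jointly: the usual autoregressive text loss through the LLM, and a diffusion reconstruction loss that feeds gradients from pixel space directly back into the visual encoder and connector. The sketch below illustrates how such a combined objective could be wired up in PyTorch. It is a minimal sketch only: the module interfaces, the DDPM-style epsilon-prediction loss, the linear noise schedule, and the weighting factor lambda_diff are all assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DSVLMSketch(nn.Module):
    """Hypothetical dual-supervision wrapper; submodule interfaces are assumed."""

    def __init__(self, vision_encoder, connector, llm, denoiser,
                 num_steps=1000, lambda_diff=0.1):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP-style ViT
        self.connector = connector            # projects visual tokens into LLM space
        self.llm = llm                        # autoregressive LM returning a CE loss
        self.denoiser = denoiser              # diffusion network conditioned on visual features
        self.num_steps = num_steps
        self.lambda_diff = lambda_diff        # assumed loss weight, not from the paper
        # Linear beta schedule; precompute \bar{alpha}_t for q(x_t | x_0).
        betas = torch.linspace(1e-4, 2e-2, num_steps)
        self.register_buffer("alpha_bar", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, images, text_ids, text_labels):
        # Text supervision path (long gradient path through the LLM).
        vis_feats = self.vision_encoder(images)   # (B, N, D_v)
        vis_tokens = self.connector(vis_feats)    # (B, N, D_llm)
        text_loss = self.llm(vis_tokens, text_ids, labels=text_labels)

        # Diffusion supervision path (short gradient path from pixels to features).
        b = images.size(0)
        t = torch.randint(0, self.num_steps, (b,), device=images.device)
        noise = torch.randn_like(images)
        a_bar = self.alpha_bar[t].view(b, 1, 1, 1)
        noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise
        # The denoiser sees both encoder and connector outputs as conditioning,
        # so reconstruction gradients reach both modules directly.
        pred_noise = self.denoiser(noisy, t, cond=(vis_feats, vis_tokens))
        diff_loss = F.mse_loss(pred_noise, noise)

        return text_loss + self.lambda_diff * diff_loss

Because the denoiser is used only as a training-time supervision signal, a wrapper like this could be dropped at inference, which is consistent with the "plug-and-play" framing in the abstract.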

Cite

Text

Sun et al. "DS-VLM: Diffusion Supervision Vision Language Model." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Sun et al. "DS-VLM: Diffusion Supervision Vision Language Model." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/sun2025icml-dsvlm/)

BibTeX

@inproceedings{sun2025icml-dsvlm,
  title     = {{DS-VLM: Diffusion Supervision Vision Language Model}},
  author    = {Sun, Zhen and Shen, Yunhang and Li, Jie and Sun, Xing and Dai, Pingyang and Cao, Liujuan and Ji, Rongrong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {57667--57679},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/sun2025icml-dsvlm/}
}