Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant "foreground bias", where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when trained on larger and more diverse datasets. Our code is publicly available at https://github.com/HVision-NKU/DenseVLM.
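The retrieval-and-decoupling step described above can be sketched in a few lines. This is a minimal illustrative example, not the authors' implementation: the function name, shapes, and the foreground mask are all hypothetical, and cosine similarity against category text embeddings stands in for the full CLIP-based retrieval.

```python
import numpy as np

def retrieve_region_categories(region_feats, text_embeds, fg_mask):
    """Assign each unlabeled region a category by cosine similarity,
    then flag it as foreground or background so the two groups can be
    supervised separately (the "decoupling" idea, sketched loosely).

    region_feats: (R, D) pooled region features
    text_embeds:  (C, D) text embeddings of candidate category names
    fg_mask:      (C,) bool, True for foreground categories
    """
    # L2-normalize so dot products become cosine similarities
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = r @ t.T                    # (R, C) region-to-category scores
    labels = sims.argmax(axis=1)      # retrieved category per region
    is_fg = fg_mask[labels]           # decouple foreground vs. background
    return labels, is_fg

# Toy usage with random features and three categories (last = background)
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
texts = rng.normal(size=(3, 8))
mask = np.array([True, True, False])
labels, is_fg = retrieve_region_categories(feats, texts, mask)
```

In the actual method these pseudo-labels would supervise region-language alignment during self-distillation; the sketch only shows the nearest-category retrieval and the foreground/background split.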

Cite

Text

Li et al. "Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction." International Conference on Computer Vision, 2025.

Markdown

[Li et al. "Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/li2025iccv-unbiased/)

BibTeX

@inproceedings{li2025iccv-unbiased,
  title     = {{Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction}},
  author    = {Li, Yunheng and Li, Yuxuan and Zeng, Quan-Sheng and Wang, Wenhai and Hou, Qibin and Cheng, Ming-Ming},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {23795--23805},
  url       = {https://mlanthology.org/iccv/2025/li2025iccv-unbiased/}
}