Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation

Abstract

Deep learning models often suffer from performance degradation in unseen domains, posing a risk for safety-critical applications such as autonomous driving. To tackle this problem, recent studies have leveraged pre-trained Visual Foundation Models (VFMs) to enhance generalization. However, existing works mainly focus on designing intricate networks for VFMs, neglecting their inherent strong generalization potential. Moreover, these methods typically perform inference on low-resolution images. The resulting loss of detail hinders accurate predictions in unseen domains, especially for small objects. In this paper, we argue that simply fine-tuning VFMs and leveraging high-resolution images is enough to unleash the power of VFMs for generalizable semantic segmentation. Therefore, we design a VFM-based segmentation network (VFMNet) that adapts VFMs to this task with minimal fine-tuning, preserving their generalizable knowledge. Then, to fully utilize high-resolution images, we train a Mask-guided Refinement Network (MGRNet) to refine VFMNet's predictions by combining them with detailed image features. Furthermore, we adopt a two-stage coarse-to-fine inference approach, in which MGRNet refines the low-confidence regions predicted by VFMNet to obtain fine-grained results. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art methods by 3.3% in average mIoU on synthetic-to-real domain generalization.
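
The two-stage coarse-to-fine inference described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `coarse_to_fine`, the `refine_fn` callback standing in for MGRNet, the use of the maximum class probability as the confidence score, and the 0.9 threshold are all assumptions made for illustration. The abstract states only that low-confidence regions of VFMNet's prediction are refined by MGRNet.

```python
from typing import Callable, List, Tuple

LabelMap = List[List[int]]          # [H][W] class indices
MaskMap = List[List[bool]]          # [H][W] True where confidence is low
ProbMap = List[List[List[float]]]   # [H][W][C] per-pixel class probabilities


def coarse_to_fine(
    coarse_probs: ProbMap,
    refine_fn: Callable[[LabelMap, MaskMap], LabelMap],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> Tuple[LabelMap, MaskMap]:
    """Keep confident coarse labels and replace the rest with refined labels."""
    coarse_labels: LabelMap = []
    low_conf: MaskMap = []
    for row in coarse_probs:
        label_row, mask_row = [], []
        for probs in row:
            # Coarse label = most probable class; confidence = its probability.
            best = max(range(len(probs)), key=probs.__getitem__)
            label_row.append(best)
            mask_row.append(probs[best] < threshold)
        coarse_labels.append(label_row)
        low_conf.append(mask_row)

    # Hypothetical stand-in for the refinement network: it receives the coarse
    # labels and the low-confidence mask and returns a full label map.
    refined = refine_fn(coarse_labels, low_conf)

    # Only low-confidence pixels take the refined label.
    merged = [
        [refined[y][x] if low_conf[y][x] else coarse_labels[y][x]
         for x in range(len(coarse_labels[y]))]
        for y in range(len(coarse_labels))
    ]
    return merged, low_conf


def toy_refiner(labels: LabelMap, mask: MaskMap) -> LabelMap:
    """Toy refiner for demonstration only: predicts class 1 everywhere."""
    return [[1 for _ in row] for row in labels]


demo_probs = [[[0.95, 0.05], [0.6, 0.4]]]
demo_labels, demo_mask = coarse_to_fine(demo_probs, toy_refiner)
```

In this toy run the first pixel keeps its confident coarse label, while the second pixel falls below the threshold and takes the refiner's output. A real pipeline would operate on tensors and would feed high-resolution image features to the refinement stage, which this sketch omits.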

Cite

Text

Tang et al. "Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I19.34295

Markdown

[Tang et al. "Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/tang2025aaai-unleashing-a/) doi:10.1609/AAAI.V39I19.34295

BibTeX

@inproceedings{tang2025aaai-unleashing-a,
  title     = {{Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation}},
  author    = {Tang, PeiYuan and Zhang, Xiaodong and Yang, Chunze and Yuan, Haoran and Sun, Jun and Shan, Danfeng and Yang, Zijiang James},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {20823-20831},
  doi       = {10.1609/AAAI.V39I19.34295},
  url       = {https://mlanthology.org/aaai/2025/tang2025aaai-unleashing-a/}
}