Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks

Abstract

Nuanced-concept image classification tasks often require substantial labeled data, and labeling for such problems is time-consuming and labor-intensive. While zero-shot methods such as CLIP, Modeling Collaborator, and AdaptCLIPZS have shown promising results, they generally lack a versatile open-source pipeline for domain-independent, multi-class fine-grained classification. We propose a classification pipeline that combines weak supervision with open-source Vision-Language Models (VLMs) and can be applied to both binary and multi-class nuanced classification problems. The pipeline is domain-independent because it relies on knowledge embedded in the VLMs during pre-training, which eliminates the additional context-specific fine-tuning required by methods such as AdaptCLIPZS. In the pipeline, VLMs serve as weak labelers, while a Weak Supervision (WS) model aggregates their labels and produces a set of pseudo labels (pseudo ground truth) used to train an end classifier. We conducted multiple experiments to validate the pipeline on both binary and multi-class classification tasks. The experimental results show that the pipeline outperforms state-of-the-art zero-shot classification methods on both binary and multi-class problems.
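The label-aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' WS model: it swaps in plain majority voting as the simplest possible aggregator, and the names `aggregate_weak_labels`, `ABSTAIN`, and the toy vote matrix are all hypothetical. Each row stands for one image and each column for the vote of one weak labeler (for example, one VLM or one prompt).

```python
from collections import Counter

# Hypothetical convention: a weak labeler that gives no answer votes ABSTAIN.
ABSTAIN = -1


def aggregate_weak_labels(votes, abstain=ABSTAIN):
    """Majority vote over one image's weak labels.

    Stand-in for a weak supervision model (an assumption, not the paper's
    aggregator). Returns `abstain` when no labeler voted; ties are broken
    by picking the smallest class id.
    """
    counts = Counter(v for v in votes if v != abstain)
    if not counts:
        return abstain
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


# Toy data: 5 images, 3 weak labelers, classes 0..2 (values are made up).
weak_label_matrix = [
    [0, 0, 1],
    [2, 2, 2],
    [1, ABSTAIN, 1],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [0, 1, 2],
]

# Pseudo labels that would then be used to train an end classifier.
pseudo_labels = [aggregate_weak_labels(row) for row in weak_label_matrix]
print(pseudo_labels)
```

In a fuller version, images whose pseudo label is `ABSTAIN` would typically be dropped before training the end classifier, and the majority vote would be replaced by an aggregator that weighs labelers by their estimated accuracy.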

Cite

Text

Tousi et al. "Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Tousi et al. "Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/tousi2025cvprw-combining/)

BibTeX

@inproceedings{tousi2025cvprw-combining,
  title     = {{Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks}},
  author    = {Tousi, Seyed Mohamad Ali and Demby's, Jacket and Farag, Ramy and Omotara, Gbenga and DeSouza, Guilherme N.},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {2142--2151},
  url       = {https://mlanthology.org/cvprw/2025/tousi2025cvprw-combining/}
}