Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks
Abstract
Nuanced-concept image classification tasks often require substantial labeled data, and labeling for such problems is time-consuming and labor-intensive. While zero-shot methods like CLIP, Modeling Collaborator, and AdaptCLIPZS have shown promising results, they generally lack a versatile open-source pipeline for domain-independent, multi-class fine-grained classification. We propose a classification pipeline that combines weak supervision with open-source Vision-Language Models (VLMs) for both binary and multi-class nuanced classification problems. The pipeline is domain-independent because it relies on knowledge embedded in the VLMs during pre-training, which eliminates the additional context-specific fine-tuning required by methods such as AdaptCLIPZS. In the pipeline, VLMs serve as weak labelers, while a Weak Supervision (WS) model aggregates their labels and produces a set of pseudo labels (pseudo ground truth) used to train an end classifier. We conducted multiple experiments to demonstrate the validity of the pipeline in both binary and multi-class classification tasks. The results show that the pipeline can outperform state-of-the-art zero-shot classification methods on both binary and multi-class problems.
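The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes each VLM has already emitted one integer class label per image (or abstained), and it uses a plain majority vote as a stand-in for the Weak Supervision model; the abstract does not specify the actual WS model, so the vote rule, the `ABSTAIN` marker, the `min_confidence` threshold, and the function names are all hypothetical.

```python
from collections import Counter

ABSTAIN = -1  # hypothetical marker: this weak labeler gave no answer


def aggregate_votes(votes, abstain=ABSTAIN):
    """Majority-vote one image's weak labels; return (label, agreement)."""
    counted = Counter(v for v in votes if v != abstain)
    if not counted:
        # Every labeler abstained, so no pseudo label can be formed.
        return abstain, 0.0
    label, count = counted.most_common(1)[0]
    # Agreement is the share of non-abstaining labelers that chose `label`.
    return label, count / sum(counted.values())


def build_pseudo_labels(weak_label_matrix, min_confidence=0.5):
    """Map image index -> pseudo label, dropping low-agreement images."""
    pseudo = {}
    for idx, votes in enumerate(weak_label_matrix):
        label, confidence = aggregate_votes(votes)
        if label != ABSTAIN and confidence >= min_confidence:
            pseudo[idx] = label
    return pseudo


# Rows are images, columns are three hypothetical VLM weak labelers.
weak_labels = [
    [0, 0, 1],                    # two of three agree on class 0
    [2, 2, 2],                    # unanimous class 2
    [1, ABSTAIN, 1],              # one abstention, the rest agree on class 1
    [ABSTAIN, ABSTAIN, ABSTAIN],  # no usable vote
    [0, 1, 2],                    # three-way disagreement, below threshold
]

pseudo_labels = build_pseudo_labels(weak_labels)
print(pseudo_labels)
```

The resulting dictionary would serve as the pseudo ground truth for training an end classifier on the retained images. A learned label model that weights labelers by estimated accuracy could replace the majority vote without changing the surrounding structure.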
Cite
Text
Tousi et al. "Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.
Markdown
[Tousi et al. "Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/tousi2025cvprw-combining/)
BibTeX
@inproceedings{tousi2025cvprw-combining,
title = {{Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks}},
author = {Tousi, Seyed Mohamad Ali and Demby's, Jacket and Farag, Ramy and Omotara, Gbenga and DeSouza, Guilherme N.},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2025},
pages = {2142-2151},
url = {https://mlanthology.org/cvprw/2025/tousi2025cvprw-combining/}
}