Image-Caption Difficulty for Efficient Weakly-Supervised Object Detection from In-the-Wild Data
Abstract
In recent years, we have witnessed the collection of larger and larger multi-modal, image-caption datasets: from hundreds of thousands of such pairs to hundreds of millions. Such datasets allow researchers to build powerful deep learning models, at the cost of requiring intensive computational resources. In this work, we ask: can we use such datasets efficiently without sacrificing performance? We tackle this problem by extracting difficulty scores from each image-caption sample, and by using such scores to make training more effective and efficient. We compare two ways to use difficulty scores to influence training: filtering a representative subset of each dataset and ordering samples through curriculum learning. We analyze and compare difficulty scores extracted from a single modality—captions (i.e., caption length and number of object mentions) or images (i.e., region proposals’ size and number)—or based on alignment of image-caption pairs (i.e., CLIP and concreteness). We focus on Weakly-Supervised Object Detection, where image-level labels are extracted from captions. We discover that (1) combining filtering and curriculum learning can achieve large gains in performance, but not all methods are stable across experimental settings, (2) single-modality scores often outperform alignment-based ones, and (3) alignment scores show the largest gains when training time is limited.
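The caption-side difficulty scores mentioned above (caption length and number of object mentions) can be sketched as follows. This is an illustrative assumption of how such scores might be computed, not the authors' exact implementation; the object-class vocabulary here is hypothetical.

```python
# Illustrative sketch of caption-based difficulty scores: token count and
# number of object-class mentions. The vocabulary below is a made-up stand-in
# for a detector's class list (e.g., COCO categories in a real setup).

OBJECT_CLASSES = {"dog", "cat", "person", "car", "ball"}  # assumed vocabulary


def caption_difficulty(caption: str) -> dict:
    """Return simple single-modality difficulty signals for one caption."""
    tokens = caption.lower().split()
    # Count tokens that match a known object class, ignoring punctuation.
    mentions = sum(1 for t in tokens if t.strip(".,!?") in OBJECT_CLASSES)
    return {"length": len(tokens), "object_mentions": mentions}


# Longer captions with more object mentions would be treated as harder;
# such scores can then drive subset filtering or a curriculum ordering.
score = caption_difficulty("A dog runs after the ball.")
```

In a filtering setup one would keep the samples whose scores fall in a chosen range; in a curriculum setup one would sort the training set by score before batching.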
Cite
Text
Nebbia and Kovashka. "Image-Caption Difficulty for Efficient Weakly-Supervised Object Detection from In-the-Wild Data." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00266
Markdown
[Nebbia and Kovashka. "Image-Caption Difficulty for Efficient Weakly-Supervised Object Detection from In-the-Wild Data." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/nebbia2024cvprw-imagecaption/) doi:10.1109/CVPRW63382.2024.00266
BibTeX
@inproceedings{nebbia2024cvprw-imagecaption,
title = {{Image-Caption Difficulty for Efficient Weakly-Supervised Object Detection from In-the-Wild Data}},
author = {Nebbia, Giacomo and Kovashka, Adriana},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {2596--2605},
doi = {10.1109/CVPRW63382.2024.00266},
url = {https://mlanthology.org/cvprw/2024/nebbia2024cvprw-imagecaption/}
}