Filter like You Test: Data-Driven Data Filtering for CLIP Pretraining
Abstract
We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that *learns* the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks' training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that---like us---use only public resources. Our approach also yields 37.7% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.
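Of the three components, Soft Cap Sampling is the most self-contained: it draws a filtered dataset from per-example probabilities while a repetition penalty discourages any single example from being over-represented. The sketch below illustrates one plausible reading of that idea; the function name, the multiplicative `penalty` parameter, and its default value are assumptions for illustration, not the paper's exact formulation.

```python
import random

def soft_cap_sampling(probs, num_samples, penalty=0.5, seed=0):
    """Illustrative sketch of sampling with a repetition penalty (in the
    spirit of SCS): repeatedly draw an example index proportionally to
    its current weight, then shrink that example's weight so repeated
    draws become progressively less likely.

    `penalty` is a hypothetical multiplicative factor, not taken from
    the paper; smaller values cap repetition more aggressively.
    """
    rng = random.Random(seed)
    weights = [float(p) for p in probs]  # unnormalized sampling weights
    chosen = []
    for _ in range(num_samples):
        # random.choices normalizes the weights internally
        idx = rng.choices(range(len(weights)), weights=weights)[0]
        chosen.append(idx)
        weights[idx] *= penalty  # repetition penalty on the drawn example
    return chosen
```

With a heavily skewed distribution such as `[0.9, 0.05, 0.05]`, plain i.i.d. sampling would select index 0 almost every time, whereas the shrinking weights spread the draws across more examples as the sample grows.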
Cite
Text
Shechter and Carmon. "Filter like You Test: Data-Driven Data Filtering for CLIP Pretraining." Advances in Neural Information Processing Systems, 2025.
Markdown
[Shechter and Carmon. "Filter like You Test: Data-Driven Data Filtering for CLIP Pretraining." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/shechter2025neurips-filter/)
BibTeX
@inproceedings{shechter2025neurips-filter,
title = {{Filter like You Test: Data-Driven Data Filtering for CLIP Pretraining}},
author = {Shechter, Mikey and Carmon, Yair},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/shechter2025neurips-filter/}
}