Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

Abstract

We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that *learns* the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks' training sets. Building on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy that builds a filtered pretraining dataset by sampling examples according to their per-example probabilities while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that, like ours, use only public resources. Our approach also yields 37.7% on average across the 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.
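The abstract describes Soft Cap Sampling only at a high level. The sketch below illustrates one plausible reading of it, assuming examples are drawn in proportion to their learned probabilities and each draw damps that example's remaining weight by a repetition penalty; the function name `soft_cap_sample` and the `penalty` parameter are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def soft_cap_sample(probs, num_samples, penalty=0.5, seed=0):
    """Illustrative sketch of a Soft Cap Sampling-style procedure (an assumed
    reading, not the paper's exact algorithm): repeatedly draw examples in
    proportion to their scores, decaying an example's weight each time it is
    picked so that no single example dominates the filtered dataset."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(probs, dtype=np.float64).copy()
    chosen = []
    for _ in range(num_samples):
        p = weights / weights.sum()      # renormalize the remaining weights
        idx = rng.choice(len(p), p=p)    # draw one example index
        chosen.append(idx)
        weights[idx] *= penalty          # repetition penalty: damp future draws
    return chosen

# Example usage with made-up per-example probabilities from a scoring model.
scores = [0.05, 0.40, 0.30, 0.15, 0.10]
subset = soft_cap_sample(scores, num_samples=8)
print(subset)  # indices of the sampled (possibly repeated) pretraining examples
```

At DataComp scale a batched or vectorized variant would be needed; the explicit loop here only serves to keep the repetition-penalty mechanics visible.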

Cite

Text

Shechter and Carmon. "Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining." Advances in Neural Information Processing Systems, 2025.

Markdown

[Shechter and Carmon. "Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/shechter2025neurips-filter/)

BibTeX

@inproceedings{shechter2025neurips-filter,
  title     = {{Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining}},
  author    = {Shechter, Mikey and Carmon, Yair},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/shechter2025neurips-filter/}
}