Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Abstract

The success of modern multimodal representation learning relies on internet-scale datasets. Because a large fraction of raw web data is of low quality, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Writing $\eta\in(0,1]$ for the fraction of samples with correctly matched modalities among the $n$ paired samples, we use a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded (up to constants) by $\frac{1}{\eta \sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
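The gap between the two rates can be checked numerically. The sketch below is illustrative only: constants are suppressed, and the threshold `eta_threshold` separating the "large $\eta$" and "small $\eta$" regimes is a hypothetical placeholder, not a quantity specified in the abstract.

```python
import math

def unfiltered_bound(eta: float, n: int) -> float:
    """Error rate without filtering: Theta(1 / (eta * sqrt(n))), constants dropped."""
    return 1.0 / (eta * math.sqrt(n))

def filtered_bound(eta: float, n: int, eta_threshold: float = 0.5) -> float:
    """Upper bound with teacher-based filtering: 1/sqrt(eta*n) in the large-eta
    regime and 1/sqrt(n) in the small-eta regime. The regime cutoff
    `eta_threshold` is a hypothetical choice for illustration."""
    if eta >= eta_threshold:
        return 1.0 / math.sqrt(eta * n)
    return 1.0 / math.sqrt(n)

if __name__ == "__main__":
    n = 10_000
    for eta in (0.05, 0.25, 0.64, 1.0):
        u, f = unfiltered_bound(eta, n), filtered_bound(eta, n)
        # Filtering never hurts in these bounds: gain factor is sqrt(eta)
        # (large-eta regime) or eta (small-eta regime), both at most 1.
        assert f <= u
        print(f"eta={eta:.2f}  unfiltered={u:.4f}  filtered={f:.4f}")
```

With $n = 10{,}000$ and $\eta = 0.25$, for instance, the unfiltered bound is $0.04$ while the filtered bound is $0.01$, matching the $\frac{1}{\eta}$-vs-$\frac{1}{\sqrt{\eta}}$ (or constant) improvement stated above.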

Cite

Text

Pareek et al. "Understanding the Gain from Data Filtering in Multimodal Contrastive Learning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Pareek et al. "Understanding the Gain from Data Filtering in Multimodal Contrastive Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/pareek2025neurips-understanding/)

BibTeX

@inproceedings{pareek2025neurips-understanding,
  title     = {{Understanding the Gain from Data Filtering in Multimodal Contrastive Learning}},
  author    = {Pareek, Divyansh and Oh, Sewoong and Du, Simon Shaolei},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/pareek2025neurips-understanding/}
}