Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

Abstract

In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built on three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations improve performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
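
For context, the sketch below illustrates the generic ColBERT-style MaxSim scoring that token-wise late interaction builds on: each query token is matched against its most similar video token, and the per-token maxima are summed into one relevance score. The tensor shapes, the normalization, and the split into spatial and temporal token sets are illustrative assumptions, not the paper's implementation.

import torch

def maxsim_score(query_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
    # ColBERT-style late interaction: for each query token, take its maximum
    # cosine similarity over all video tokens, then sum across query tokens.
    q = torch.nn.functional.normalize(query_tokens, dim=-1)  # (Nq, d)
    v = torch.nn.functional.normalize(video_tokens, dim=-1)  # (Nv, d)
    sim = q @ v.T                                            # (Nq, Nv)
    return sim.max(dim=-1).values.sum()

# Hypothetical usage, mirroring the abstract's "spatial and temporal
# token-wise interaction" (assumed form, not the paper's exact scoring):
# score = maxsim_score(q_tokens, frame_patch_tokens) + maxsim_score(q_tokens, temporal_tokens)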

Cite

Text

Reddy et al. "Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01834

Markdown

[Reddy et al. "Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/reddy2025cvpr-videocolbert/) doi:10.1109/CVPR52734.2025.01834

BibTeX

@inproceedings{reddy2025cvpr-videocolbert,
  title     = {{Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval}},
  author    = {Reddy, Arun and Martin, Alexander and Yang, Eugene and Yates, Andrew and Sanders, Kate and Murray, Kenton and Kriz, Reno and de Melo, Celso M. and Van Durme, Benjamin and Chellappa, Rama},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {19691--19701},
  doi       = {10.1109/CVPR52734.2025.01834},
  url       = {https://mlanthology.org/cvpr/2025/reddy2025cvpr-videocolbert/}
}