CoVR: Learning Composed Video Retrieval from Web Video Captions

Ventura, Lucas; Yang, Antoine; Schmid, Cordelia; Varol, Gül

doi:10.1609/AAAI.V38I6.28334

AAAI 2024 pp. 5270-5279


Abstract

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
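The pair-mining step described in the abstract can be illustrated with a minimal sketch. It assumes that "similar caption" means two captions that differ in exactly one word, which the abstract does not specify. The function names `mine_caption_pairs` and `build_prompt`, the input format, and the prompt wording are all hypothetical and are not taken from the released code.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Find caption pairs that differ in exactly one word.

    `captions` maps a video id to its caption (hypothetical input format).
    Returns a list of (id_a, id_b, word_a, word_b) tuples.
    """
    buckets = defaultdict(list)
    for video_id, caption in captions.items():
        words = caption.lower().split()
        for i, word in enumerate(words):
            # Mask position i: captions sharing this key agree on every other word.
            key = (i, tuple(words[:i]), tuple(words[i + 1:]))
            buckets[key].append((video_id, word))

    pairs = []
    for members in buckets.values():
        for (id_a, word_a), (id_b, word_b) in combinations(members, 2):
            if word_a != word_b:  # skip exact duplicate captions
                pairs.append((id_a, id_b, word_a, word_b))
    return pairs


def build_prompt(caption_a, caption_b):
    """Build an illustrative prompt asking a language model for a modification text.

    The wording is an assumption; no model is called here.
    """
    return (
        "Describe in a short instruction how to change the first scene "
        f"into the second.\nFirst: {caption_a}\nSecond: {caption_b}\nInstruction:"
    )


if __name__ == "__main__":
    toy = {
        "v1": "a dog running on the beach",
        "v2": "a cat running on the beach",
        "v3": "a dog running on the grass",
    }
    for id_a, id_b, word_a, word_b in mine_caption_pairs(toy):
        print(id_a, id_b, word_a, "->", word_b)
        print(build_prompt(toy[id_a], toy[id_b]))
```

Bucketing by masked caption avoids comparing every caption against every other one, which matters at the scale of a collection such as WebVid2M. Each mined pair, together with the generated modification text, would form one query-video, text, target-video triplet.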


Cite

Text

Ventura et al. "CoVR: Learning Composed Video Retrieval from Web Video Captions." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I6.28334

Markdown

[Ventura et al. "CoVR: Learning Composed Video Retrieval from Web Video Captions." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/ventura2024aaai-covr/) doi:10.1609/AAAI.V38I6.28334

BibTeX

@inproceedings{ventura2024aaai-covr,
  title     = {{CoVR: Learning Composed Video Retrieval from Web Video Captions}},
  author    = {Ventura, Lucas and Yang, Antoine and Schmid, Cordelia and Varol, Gül},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {5270-5279},
  doi       = {10.1609/AAAI.V38I6.28334},
  url       = {https://mlanthology.org/aaai/2024/ventura2024aaai-covr/}
}