OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text

Ji, Junyang; Zhang, Shengjun; Li, Da; Luo, Yuxiao; Wang, Yan; Xu, Di; Yang, Biao; Yuan, Wei; Yang, Fan; He, Zhihai; Yang, Wenming

ICLR 2026

Abstract

Composed video retrieval presents a complex challenge: retrieving a target video based on a source video and a textual modification instruction. This task demands fine-grained reasoning over multimodal transformations. However, existing benchmarks predominantly focus on vision–text alignment, largely overlooking the rich semantic signals embedded in audio—such as speech, music, and environmental sounds—which are often decisive for comprehensive video understanding. To bridge this gap, we introduce **OmniCVR**, a large-scale benchmark for omni-composed video retrieval that establishes vision, audio, and text as first-class modalities. OmniCVR is constructed via a scalable, automated pipeline integrating content-aware segmentation, omni-modal annotation, and a rigorous dual-validation protocol involving both large language models and human experts. The benchmark comprises vision-centric, audio-centric, and integrated queries, with the latter forming the majority to accurately reflect real-world multimodal complexity. Furthermore, we propose **AudioVLM2Vec**, an audio-aware extension of VLM2Vec. By incorporating explicit audio semantics, AudioVLM2Vec achieves state-of-the-art performance, highlighting fundamental limitations in the audio reasoning capabilities of current multimodal retrieval systems.

Cite

Text

Ji et al. "OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text." International Conference on Learning Representations, 2026.

Markdown

[Ji et al. "OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ji2026iclr-omnicvr/)

BibTeX

@inproceedings{ji2026iclr-omnicvr,
  title     = {{OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text}},
  author    = {Ji, Junyang and Zhang, Shengjun and Li, Da and Luo, Yuxiao and Wang, Yan and Xu, Di and Yang, Biao and Yuan, Wei and Yang, Fan and He, Zhihai and Yang, Wenming},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ji2026iclr-omnicvr/}
}