Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Abstract
Composed video retrieval is a challenging task that aims to retrieve a target video given a query video and a textual description of the desired modifications. Standard retrieval frameworks typically struggle with the complexity of fine-grained compositional queries and with variations in temporal understanding, which limits their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than that of its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the state of the art by 3.4%, highlighting its efficacy in leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model will be publicly released.
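For readers unfamiliar with the Recall@1 metric reported above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. It is not the authors' evaluation code: the function name `recall_at_k`, the input layout (one row of similarity scores per query), and the toy scores are all illustrative assumptions.

```python
def recall_at_k(similarity, target_index, k=1):
    """Fraction of queries whose ground-truth target ranks in the top k.

    similarity[i][j] is the score between query i and gallery video j;
    target_index[i] is the gallery position of the correct target for query i.
    (Hypothetical helper; the input layout is an assumption for illustration.)
    """
    if not target_index:
        raise ValueError("at least one query is required")
    hits = 0
    for scores, target in zip(similarity, target_index):
        # Rank gallery items by descending score; ties keep gallery order.
        ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
        if target in ranked[:k]:
            hits += 1
    return hits / len(target_index)


# Toy example: 3 queries, 4 gallery videos, made-up scores.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # correct target (index 0) is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct target (index 1) is ranked second
    [0.1, 0.2, 0.3, 0.7],  # correct target (index 3) is ranked first
]
targets = [0, 1, 3]
r1 = recall_at_k(sim, targets, k=1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(sim, targets, k=2)  # all 3 queries hit within rank 2
```

Under this definition, a Recall@1 of 71.3% would mean that the correct target video is the top-ranked result for 71.3% of the queries.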
Cite
Text
Thawakar et al. "Beyond Simple Edits: Composed Video Retrieval with Dense Modifications." International Conference on Computer Vision, 2025.
Markdown
[Thawakar et al. "Beyond Simple Edits: Composed Video Retrieval with Dense Modifications." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/thawakar2025iccv-beyond/)
BibTeX
@inproceedings{thawakar2025iccv-beyond,
title = {{Beyond Simple Edits: Composed Video Retrieval with Dense Modifications}},
author = {Thawakar, Omkar and Demidov, Dmitry and Thawkar, Ritesh and Anwer, Rao Muhammad and Shah, Mubarak and Khan, Fahad Shahbaz and Khan, Salman},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {20435-20444},
url = {https://mlanthology.org/iccv/2025/thawakar2025iccv-beyond/}
}