VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Abstract

We introduce VideoComp, a benchmark and learning framework for advancing compositional understanding in video-text models, aimed at improving the fine-grained temporal alignment of vision-language models (VLMs). Unlike existing benchmarks that focus on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g., ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action-word replacement, partial captioning, and combinations thereof. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundation models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities toward fine-grained, temporally coherent video-text alignment.
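The abstract describes two mechanisms without giving their exact formulation: a hierarchical pairwise preference loss over negatives of increasing disruption severity, and a pretraining strategy that concatenates short video-caption pairs. The PyTorch sketch below is purely illustrative of how such components might look; it assumes a margin-ranking form for the loss (the paper's actual objective may differ), and all function names, the margin schedule, and tensor layouts are hypothetical.

import torch

def hierarchical_preference_loss(pos_sim, neg_sims, base_margin=0.1):
    """Hedged sketch of a hierarchical pairwise preference loss.

    pos_sim:  (B,) similarity of each video with its temporally accurate caption.
    neg_sims: (B, K) similarities with K negatives ordered by increasing
              disruption severity (column 0 = least disrupted).

    Assumed behavior: the positive must beat each negative by a margin that
    grows with severity, and adjacent negative levels must stay ordered so
    that more disrupted captions score strictly lower.
    """
    B, K = neg_sims.shape
    # Margin grows with disruption level (illustrative linear schedule).
    margins = base_margin * torch.arange(1, K + 1, device=pos_sim.device,
                                         dtype=pos_sim.dtype)
    # Hinge: positive vs. each negative level, with level-specific margin.
    pos_vs_neg = torch.relu(margins + neg_sims - pos_sim.unsqueeze(1)).mean()
    # Hinge: each negative vs. the next, gradually penalizing more disruption.
    adj = torch.relu(base_margin + neg_sims[:, 1:] - neg_sims[:, :-1]).mean()
    return pos_vs_neg + adj

def make_multi_event_pair(clip_frames, captions):
    """Hedged sketch of the concatenation pretraining strategy: stitch short
    clips along the time axis and join their captions in order to simulate
    one multi-event video-text sequence."""
    video = torch.cat(clip_frames, dim=0)  # (sum of T_i, C, H, W)
    text = " ".join(captions)
    return video, text

# Example usage: a batch of 4 videos with 3 disruption levels
# (e.g., reordering, action-word replacement, combined).
pos = torch.randn(4)
negs = torch.randn(4, 3)
loss = hierarchical_preference_loss(pos, negs)

The linear margin schedule is one simple way to realize "gradually penalizes increasingly disrupted" pairs; any monotonically increasing schedule would preserve the same ordering constraint.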

Cite

Text

Kim et al. "VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02706

Markdown

[Kim et al. "VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/kim2025cvpr-videocomp/) doi:10.1109/CVPR52734.2025.02706

BibTeX

@inproceedings{kim2025cvpr-videocomp,
  title     = {{VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models}},
  author    = {Kim, Dahun and Piergiovanni, AJ and Mallya, Ganesh and Angelova, Anelia},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29060--29070},
  doi       = {10.1109/CVPR52734.2025.02706},
  url       = {https://mlanthology.org/cvpr/2025/kim2025cvpr-videocomp/}
}