ViCaS: A Dataset for Combining Holistic and Pixel-Level Video Understanding Using Captions with Grounded Segmentation

Abstract

Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: https://ali2500.github.io/vicas-project/
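To make the idea of captions with phrase grounding concrete, below is a minimal, purely illustrative Python sketch. The field names (video_id, objects, the <obj1>…</obj1> tag convention, RLE mask placeholders) are assumptions made for exposition only, not the actual ViCaS annotation schema; the real format is documented on the project page.

import re

# Purely illustrative: a hypothetical record pairing a video caption with
# grounded object masks. None of these field names come from ViCaS itself.
annotation = {
    "video_id": "example_0001",
    # Phrases in the caption are tagged with the object they refer to.
    "caption": "A <obj1>brown dog</obj1> chases a <obj2>red ball</obj2> across the lawn.",
    "objects": {
        "obj1": {"category": "dog",
                 # Per-frame segmentation masks, e.g. run-length encodings keyed by frame index.
                 "masks": {0: "<RLE>", 1: "<RLE>"}},
        "obj2": {"category": "ball",
                 "masks": {0: "<RLE>", 1: "<RLE>"}},
    },
}

# Recover the phrase-to-object links from the tagged caption.
for m in re.finditer(r"<(obj\d+)>(.*?)</\1>", annotation["caption"]):
    print(m.group(1), "->", m.group(2))  # e.g. obj1 -> brown dog

A benchmark over records of this shape can score both sides of the task: the predicted caption against the reference text (holistic understanding), and, for each grounded phrase, the predicted masks against the reference masks, e.g. with a spatio-temporal IoU measure (language-guided, pixel-precise segmentation).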

Cite

Text

Athar et al. "ViCaS: A Dataset for Combining Holistic and Pixel-Level Video Understanding Using Captions with Grounded Segmentation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01772

Markdown

[Athar et al. "ViCaS: A Dataset for Combining Holistic and Pixel-Level Video Understanding Using Captions with Grounded Segmentation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/athar2025cvpr-vicas/) doi:10.1109/CVPR52734.2025.01772

BibTeX

@inproceedings{athar2025cvpr-vicas,
  title     = {{ViCaS: A Dataset for Combining Holistic and Pixel-Level Video Understanding Using Captions with Grounded Segmentation}},
  author    = {Athar, Ali and Deng, Xueqing and Chen, Liang-Chieh},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {19023--19035},
  doi       = {10.1109/CVPR52734.2025.01772},
  url       = {https://mlanthology.org/cvpr/2025/athar2025cvpr-vicas/}
}