Identity-Text Video Corpus Grounding

Abstract

Video corpus grounding (VCG), which aims to retrieve relevant video moments from a video corpus, has attracted significant attention in the multimedia research community. However, the existing VCG setting primarily focuses on matching textual descriptions with videos and ignores the distinct visual identities in the videos, resulting in inaccurate understanding of video content and degraded retrieval performance. To address this limitation, we introduce a novel task, Identity-Text Video Corpus Grounding (ITVCG), which simultaneously utilizes textual descriptions and visual identities as queries. ITVCG thus enables more accurate video corpus grounding with visual identities and gives users the flexibility to locate relevant frames using either textual descriptions alone or textual descriptions together with visual identities. To evaluate the novel ITVCG task, we propose the TVR-IT dataset, comprising 463 identity images from 6 TV shows, with 68,840 out of 72,840 queries containing at least one identity image. Furthermore, we propose Video-Locator, the first model designed for the ITVCG task. Video-Locator integrates video-identity-text alignment and multi-modal fine-grained fusion components, enabling a video large language model (Video LLM) to jointly understand textual descriptions, visual identities, and videos. Experimental results demonstrate the effectiveness of the proposed Video-Locator model and highlight the importance of identity-generalization capability for ITVCG.

Cite

Text

Huang et al. "Identity-Text Video Corpus Grounding." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32375

Markdown

[Huang et al. "Identity-Text Video Corpus Grounding." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/huang2025aaai-identity/) doi:10.1609/AAAI.V39I4.32375

BibTeX

@inproceedings{huang2025aaai-identity,
  title     = {{Identity-Text Video Corpus Grounding}},
  author    = {Huang, Bin and Wang, Xin and Chen, Hong and Chen, Houlun and Wu, Yaofei and Zhu, Wenwu},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {3608--3616},
  doi       = {10.1609/AAAI.V39I4.32375},
  url       = {https://mlanthology.org/aaai/2025/huang2025aaai-identity/}
}