VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Abstract

Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semi-automatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
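
The abstract describes the architecture only at a high level. Below is a minimal PyTorch-style sketch of how the named components could be wired together: a dual (spatial/temporal) vision encoder, V-L adapters into a language model, and an L-V adapter feeding a spatio-temporal mask decoder. All module choices, dimensions, and the placeholder encoders/decoder here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): wiring a dual vision encoder,
# an LLM, and a spatio-temporal mask decoder through V-L and L-V adapters.
import torch
import torch.nn as nn


class VideoGLaMMSketch(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=512, dec_dim=128, vocab=32000):
        super().__init__()
        # Dual vision encoder: placeholder spatial (per-frame) and temporal branches.
        self.spatial_enc = nn.Linear(3 * 14 * 14, vis_dim)    # stands in for an image backbone
        self.temporal_enc = nn.Linear(3 * 14 * 14, vis_dim)   # stands in for a video backbone
        # Tunable V-L adapters project visual tokens into the LLM embedding space.
        self.vl_adapter_spatial = nn.Linear(vis_dim, llm_dim)
        self.vl_adapter_temporal = nn.Linear(vis_dim, llm_dim)
        # Placeholder "LLM": a small Transformer encoder standing in for the language model.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.text_embed = nn.Embedding(vocab, llm_dim)
        # L-V adapter maps the LLM's segmentation-prompt hidden state back to the decoder space.
        self.lv_adapter = nn.Linear(llm_dim, dec_dim)
        # Spatio-temporal decoder: placeholder that turns the prompt embedding into per-frame masks.
        self.mask_decoder = nn.Conv3d(dec_dim, 1, kernel_size=1)

    def forward(self, frames, text_ids):
        # frames: (B, T, 3, 14, 14) toy-resolution clip; text_ids: (B, L) token ids.
        B, T = frames.shape[:2]
        flat = frames.flatten(2)                                          # (B, T, 3*14*14)
        spatial_tok = self.vl_adapter_spatial(self.spatial_enc(flat))     # (B, T, llm_dim)
        temporal_tok = self.vl_adapter_temporal(self.temporal_enc(flat))  # (B, T, llm_dim)
        text_tok = self.text_embed(text_ids)                              # (B, L, llm_dim)
        # Concatenate visual and text tokens and run the placeholder LLM.
        seq = torch.cat([spatial_tok, temporal_tok, text_tok], dim=1)
        hidden = self.llm(seq)                                            # (B, 2T+L, llm_dim)
        # Use the last hidden state as a stand-in for the segmentation prompt token.
        seg_prompt = self.lv_adapter(hidden[:, -1])                       # (B, dec_dim)
        # Broadcast the prompt over a coarse spatio-temporal grid and decode mask logits.
        grid = seg_prompt[:, :, None, None, None].expand(B, -1, T, 14, 14)
        return self.mask_decoder(grid)                                    # (B, 1, T, 14, 14)


if __name__ == "__main__":
    model = VideoGLaMMSketch()
    frames = torch.randn(2, 8, 3, 14, 14)
    text_ids = torch.randint(0, 32000, (2, 16))
    print(model(frames, text_ids).shape)  # torch.Size([2, 1, 8, 14, 14])
```

The sketch only shows the data flow implied by the abstract; the paper's actual backbones, token layout, and decoder are substantially richer.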

Cite

Text

Munasinghe et al. "VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01773

Markdown

[Munasinghe et al. "VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/munasinghe2025cvpr-videoglamm/) doi:10.1109/CVPR52734.2025.01773

BibTeX

@inproceedings{munasinghe2025cvpr-videoglamm,
  title     = {{VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos}},
  author    = {Munasinghe, Shehan and Gani, Hanan and Zhu, Wenqi and Cao, Jiale and Xing, Eric and Khan, Fahad Shahbaz and Khan, Salman},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {19036--19046},
  doi       = {10.1109/CVPR52734.2025.01773},
  url       = {https://mlanthology.org/cvpr/2025/munasinghe2025cvpr-videoglamm/}
}