VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos

Gia, Bao Tran; Le, Khiem; Do, Tien; Mai, Tien-Dung; Ngo, Thanh Duc; Le, Duy-Dinh; Satoh, Shin'ichi

CVPRW 2025 pp. 3689-3698

Abstract

The rapid expansion of video data across various domains has heightened the demand for efficient retrieval and question-answering systems, particularly for long-form videos. Existing Video Question Answering (VQA) approaches struggle with processing extended video sequences due to high computational costs, loss of contextual coherence, and challenges in retrieving relevant information. To tackle these limitations, we introduce VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos, a novel framework that brings a retrieval-augmented generation (RAG) architecture to the video domain. VRAG first retrieves the most relevant video segments and then applies chunking and refinement to identify key sub-segments, enabling precise and focused answer generation. This approach maximizes the effectiveness of the Multimodal Large Language Model (MLLM) by ensuring only the most relevant content is processed. Our experimental evaluation on a benchmark demonstrates significant improvements in retrieval precision and answer quality. These results highlight the effectiveness of retrieval-augmented reasoning for scalable and accurate VQA in long-form video datasets.
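The retrieve, chunk, and refine stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Segment`, `retrieve_segments`, `chunk_segment`, `refine_chunks`, and `select_context` are hypothetical, the embeddings are toy vectors, and cosine similarity over mean-pooled frame embeddings is an assumed stand-in for whatever retrieval and refinement models the paper actually uses. The selected sub-segments would then be passed to an MLLM, which is omitted here.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple


@dataclass
class Segment:
    """A video segment with one toy embedding per sampled frame (hypothetical type)."""
    video_id: str
    frame_embeddings: List[List[float]]


# A chunk is (segment, first frame index, end frame index, exclusive).
Chunk = Tuple[Segment, int, int]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def retrieve_segments(query: Sequence[float], segments: Sequence[Segment],
                      top_k: int) -> List[Segment]:
    """Stage 1 (assumed scoring): rank whole segments against the query."""
    ranked = sorted(
        segments,
        key=lambda s: cosine(query, mean_vector(s.frame_embeddings)),
        reverse=True,
    )
    return ranked[:top_k]


def chunk_segment(segment: Segment, chunk_size: int) -> List[Chunk]:
    """Stage 2: split a segment into consecutive fixed-size frame windows."""
    n = len(segment.frame_embeddings)
    return [(segment, i, min(i + chunk_size, n)) for i in range(0, n, chunk_size)]


def refine_chunks(query: Sequence[float], chunks: Sequence[Chunk],
                  keep: int) -> List[Chunk]:
    """Stage 3 (assumed scoring): keep the chunks most similar to the query."""
    def score(chunk: Chunk) -> float:
        segment, lo, hi = chunk
        return cosine(query, mean_vector(segment.frame_embeddings[lo:hi]))

    return sorted(chunks, key=score, reverse=True)[:keep]


def select_context(query: Sequence[float], segments: Sequence[Segment],
                   top_k: int = 2, chunk_size: int = 2, keep: int = 1) -> List[Chunk]:
    """Run retrieval, chunking, and refinement; the result would feed an MLLM."""
    candidates = retrieve_segments(query, segments, top_k)
    chunks = [c for seg in candidates for c in chunk_segment(seg, chunk_size)]
    return refine_chunks(query, chunks, keep)
```

The point of the two-stage design in this sketch is that coarse segment retrieval narrows the search cheaply, while chunk-level refinement limits how much content the downstream model has to process.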

Cite

Text

Gia et al. "VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Gia et al. "VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/gia2025cvprw-vrag/)

BibTeX

@inproceedings{gia2025cvprw-vrag,
  title     = {{VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos}},
  author    = {Gia, Bao Tran and Le, Khiem and Do, Tien and Mai, Tien-Dung and Ngo, Thanh Duc and Le, Duy-Dinh and Satoh, Shin'ichi},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {3689-3698},
  url       = {https://mlanthology.org/cvprw/2025/gia2025cvprw-vrag/}
}