HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Abstract
Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and is not guided by the task at hand. To address these challenges, we introduce **HierarQ**, a task-aware hierarchical Q-Former-based framework that processes frames sequentially, bypassing the need for frame sampling while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding: the entity stream captures frame-level object information within a short context, and the scene stream identifies their broader interactions over a longer period of time. Each stream is supported by a dedicated memory bank, which enables our proposed **Hierar**chical **Q**uerying transformer (HierarQ) to effectively capture short- and long-term context. Extensive evaluations on **10** video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance on most datasets, indicating its robustness and efficiency for comprehensive video analysis. All code will be made available upon acceptance.
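The idea of processing frames sequentially with bounded short-term and long-term memory can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `ShortTermBank` and `LongTermBank`, the function `process_video`, the bank sizes, and the compression rule (averaging the most similar adjacent pair when the long-term bank is full) are all assumptions made for illustration. The language-guided modulation and the Q-Former itself are omitted. The sketch only shows why memory cost stays fixed regardless of video length.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ShortTermBank:
    """Hypothetical short-context memory: keeps only the latest `size` features."""

    def __init__(self, size):
        self.items = deque(maxlen=size)  # oldest entry is dropped automatically

    def add(self, feat):
        self.items.append(feat)


class LongTermBank:
    """Hypothetical long-context memory with a fixed capacity.

    Assumed compression rule: when the bank overflows, the two most similar
    adjacent entries are replaced by their element-wise average.
    """

    def __init__(self, size):
        self.size = size
        self.items = []

    def add(self, feat):
        self.items.append(feat)
        if len(self.items) > self.size:
            sims = [
                cosine(self.items[i], self.items[i + 1])
                for i in range(len(self.items) - 1)
            ]
            i = max(range(len(sims)), key=sims.__getitem__)
            merged = [(x + y) / 2 for x, y in zip(self.items[i], self.items[i + 1])]
            self.items[i:i + 2] = [merged]


def process_video(frames, short_size=4, long_size=8):
    """Feed per-frame features one at a time; no frame sampling is needed."""
    entity_bank = ShortTermBank(short_size)  # short context (entity stream)
    scene_bank = LongTermBank(long_size)     # long context (scene stream)
    for feat in frames:
        entity_bank.add(feat)
        scene_bank.add(feat)
    return entity_bank, scene_bank
```

Under these assumptions, a video of any length leaves at most `short_size` entries in the short-term bank and `long_size` entries in the long-term bank, so the number of tokens passed downstream does not grow with the number of frames.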
Cite
Text
Azad et al. "HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00799
Markdown
[Azad et al. "HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/azad2025cvpr-hierarq/) doi:10.1109/CVPR52734.2025.00799
BibTeX
@inproceedings{azad2025cvpr-hierarq,
title = {{HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding}},
author = {Azad, Shehreen and Vineet, Vibhav and Rawat, Yogesh Singh},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {8545-8556},
doi = {10.1109/CVPR52734.2025.00799},
url = {https://mlanthology.org/cvpr/2025/azad2025cvpr-hierarq/}
}