OVG-HQ: Online Video Grounding with Hybrid-Modal Queries
Abstract
The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles in scenarios such as streaming video or queries that rely on visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings, and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that uses neural network parameters to dynamically retain past context, and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. In addition, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n, IoU=m and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. We will release our source code and dataset.
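For orientation, the conventional offline R@n, IoU=m metric that the online variants start from can be sketched as follows. This is a minimal illustration of the standard offline definition only; the function names `temporal_iou` and `recall_at_n` are hypothetical, and the timeliness-aware adaptations (oR@n, IoU=m and omAP) are defined in the paper and are not reproduced here.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, e.g. in seconds."""
    # Overlap is clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked_preds, gt, n, iou_threshold):
    """Offline R@n, IoU=m for one query: 1.0 if any of the top-n
    ranked predictions reaches the IoU threshold m, else 0.0."""
    top_n = ranked_preds[:n]
    return float(any(temporal_iou(p, gt) >= iou_threshold for p in top_n))


if __name__ == "__main__":
    # Hypothetical predictions, ranked by confidence, and one ground-truth segment.
    preds = [(0.0, 10.0), (4.0, 14.0)]
    gt = (5.0, 15.0)
    print(recall_at_n(preds, gt, n=1, iou_threshold=0.5))
    print(recall_at_n(preds, gt, n=2, iou_threshold=0.5))
```

Averaging `recall_at_n` over all queries gives the dataset-level score. Note that this offline form ignores when a prediction is emitted, which is the gap the abstract says the online metrics address.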
Cite
Text
Zeng et al. "OVG-HQ: Online Video Grounding with Hybrid-Modal Queries." International Conference on Computer Vision, 2025.
Markdown
[Zeng et al. "OVG-HQ: Online Video Grounding with Hybrid-Modal Queries." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zeng2025iccv-ovghq/)
BibTeX
@inproceedings{zeng2025iccv-ovghq,
title = {{OVG-HQ: Online Video Grounding with Hybrid-Modal Queries}},
author = {Zeng, Runhao and Mao, Jiaqi and Lai, Minghao and Phan, Minh Hieu and Dong, Yanjie and Wang, Wei and Chen, Qi and Hu, Xiping},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {21085-21096},
url = {https://mlanthology.org/iccv/2025/zeng2025iccv-ovghq/}
}