Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Abstract

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their ability to generalize remains limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework that uses RL with verifiable rewards to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore post-training strategies on our curated RL-friendly dataset, training the model to progressively comprehend more difficult samples, which leads to better generalization. (3) TVGBench: we carefully construct a small but comprehensive and balanced benchmark suitable for LVLM evaluation, sourced from available public benchmarks. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using significantly less training data than prior LVLM approaches, while also improving its general video understanding capabilities. Project Page: https://xuboshen.github.io/Time-R1/.
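
The abstract does not specify the form of the verifiable reward. As a rough illustration only, one common way to score a TVG prediction is the temporal intersection-over-union (IoU) between the predicted and annotated segments, which can be computed without a learned judge. The function name `temporal_iou` and the `(start, end)` tuple convention below are assumptions made for this sketch, not the authors' implementation.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds.

    Hypothetical helper for illustration; the paper's actual reward
    may differ. Returns a value in [0, 1].
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt

    # Treat empty or reversed segments as invalid and score them 0.
    if pred_end <= pred_start or gt_end <= gt_start:
        return 0.0

    # Overlap is clipped at 0 for disjoint segments.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union


if __name__ == "__main__":
    # Predicted 0-10 s against an annotated 5-15 s segment.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

A score of this kind is verifiable in the sense that it depends only on the predicted timestamps and the annotation, so it could serve as a scalar reward signal in an RL loop under the assumptions stated above.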

Cite

Text

Wang et al. "Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-timer1/)

BibTeX

@inproceedings{wang2025neurips-timer1,
  title     = {{Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding}},
  author    = {Wang, Ye and Wang, Ziheng and Xu, Boshen and Du, Yang and Lin, Kejun and Xiao, Zihan and Yue, Zihao and Ju, Jianzhong and Zhang, Liang and Yang, Dingyi and Fang, Xiangnan and He, Zewen and Luo, Zhenbo and Wang, Wenxuan and Lin, Junqi and Luan, Jian and Jin, Qin},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-timer1/}
}