DisTime: Distribution-Based Time Representation for Video Large Language Models

Abstract

Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. DisTime is released at https://github.com/josephzpng/DisTime.
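
The abstract describes decoding time from a probability distribution rather than from a single discrete token. The sketch below illustrates that general idea only and is not the authors' implementation. It assumes a hypothetical layout in which the decoder emits logits over evenly spaced bins on a normalized [0, 1] time axis, and it reads out a continuous timestamp as the expected bin center. The names `expected_time` and `decode_span` and the bin layout are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_time(logits, duration):
    """Decode bin logits into one continuous timestamp in seconds.

    Assumption: bins are evenly spaced over the normalized [0, 1] axis,
    and the timestamp is the probability-weighted mean of bin centers.
    """
    probs = softmax(logits)
    k = len(probs)
    centers = [(i + 0.5) / k for i in range(k)]
    normalized = sum(p * c for p, c in zip(probs, centers))
    return normalized * duration


def decode_span(start_logits, end_logits, duration):
    """Decode a (start, end) pair, ordered so that start <= end."""
    start = expected_time(start_logits, duration)
    end = expected_time(end_logits, duration)
    return (min(start, end), max(start, end))
```

Because the readout is an expectation over bins, probability mass split between two neighboring bins yields a timestamp between their centers, which is one way a distribution can soften boundary ambiguity compared with picking a single discrete value.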

Cite

Text

Zeng et al. "DisTime: Distribution-Based Time Representation for Video Large Language Models." International Conference on Computer Vision, 2025.

Markdown

[Zeng et al. "DisTime: Distribution-Based Time Representation for Video Large Language Models." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zeng2025iccv-distime/)

BibTeX

@inproceedings{zeng2025iccv-distime,
  title     = {{DisTime: Distribution-Based Time Representation for Video Large Language Models}},
  author    = {Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {21961--21971},
  url       = {https://mlanthology.org/iccv/2025/zeng2025iccv-distime/}
}