KDA: Knowledge Diffusion Alignment with Enhanced Context for Video Temporal Grounding

ICCV 2025 pp. 23311-23320

Abstract

Video Temporal Grounding (VTG) confronts the challenge of bridging the semantic gap between concise textual queries and the rich complexity of video content, further compounded by the difficulty of capturing discriminative features without explicit target cues. To address these challenges, we propose Knowledge Diffusion Alignment (KDA), a framework that leverages the generative capability of diffusion models. KDA introduces a multi-layer video knowledge extraction module alongside a background residual diffusion model that progressively prunes irrelevant background information from global video features, thereby distilling query-relevant moment knowledge enriched with visual context. Through a three-stage training approach guided by annotated moments, KDA ensures that the extracted moment knowledge incorporates the discriminative features necessary for accurate localization. A knowledge prompt reasoning module enables thorough interaction between moment knowledge and multimodal features. Moreover, we introduce a spans-enhanced decoder that selectively integrates spans from multimodal features, capitalizing on intrinsic alignment cues. Comprehensive experiments on three datasets demonstrate performance that surpasses state-of-the-art methods, attesting to the effectiveness of the proposed framework.

Cite

Text

Ran et al. "KDA: Knowledge Diffusion Alignment with Enhanced Context for Video Temporal Grounding." International Conference on Computer Vision, 2025.

Markdown

[Ran et al. "KDA: Knowledge Diffusion Alignment with Enhanced Context for Video Temporal Grounding." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ran2025iccv-kda/)

BibTeX

@inproceedings{ran2025iccv-kda,
  title     = {{KDA: Knowledge Diffusion Alignment with Enhanced Context for Video Temporal Grounding}},
  author    = {Ran, Ran and Wei, Jiwei and He, Shiyuan and Ma, Zeyu and Zhang, Chaoning and Xie, Ning and Yang, Yang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {23311--23320},
  url       = {https://mlanthology.org/iccv/2025/ran2025iccv-kda/}
}