Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning

Xie, Zhuyang; Yang, Yan; Yu, Yankai; Wang, Jie; Jiang, Yongquan; Wu, Xiao

doi:10.1609/AAAI.V39I8.32948

AAAI 2025 pp. 8771-8779

Abstract

Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level and leverage these concepts to provide temporal event cues; and (2) establish cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, weakly supervised concept detection is performed for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to produce more discriminative concept embeddings. In the captioning network, a cyclic co-learning strategy is proposed, where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator’s event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.

Cite

Text

Xie et al. "Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I8.32948

Markdown

[Xie et al. "Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/xie2025aaai-exploring/) doi:10.1609/AAAI.V39I8.32948

BibTeX

@inproceedings{xie2025aaai-exploring,
  title     = {{Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning}},
  author    = {Xie, Zhuyang and Yang, Yan and Yu, Yankai and Wang, Jie and Jiang, Yongquan and Wu, Xiao},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {8771--8779},
  doi       = {10.1609/AAAI.V39I8.32948},
  url       = {https://mlanthology.org/aaai/2025/xie2025aaai-exploring/}
}