Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning
Abstract
Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level and leverage these concepts to provide temporal event cues; and (2) establish cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, weakly supervised concept detection is performed for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to produce more discriminative concept embeddings. In the captioning network, a cyclic co-learning strategy is proposed, where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator’s event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
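The concept-detection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the architecture, so everything here is an assumption made for illustration: the function `detect_and_fuse`, the linear detector `w_detect`, max-pooling over time to obtain video-level scores (a common choice when only video-level labels are available), and additive fusion of the concept cues into the frame features. It is not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def detect_and_fuse(frames, w_detect, concept_emb):
    """Hypothetical sketch of frame-level concept detection and fusion.

    frames:      T x D frame features
    w_detect:    D x K linear concept detector (assumed form)
    concept_emb: K x D concept embeddings
    """
    # Frame-level concept probabilities, shape T x K.
    probs = [[sigmoid(v) for v in row] for row in matmul(frames, w_detect)]
    # Video-level score per concept: max over time (assumed pooling), so a
    # loss on video-level labels can supervise the frame-level detector.
    video_probs = [max(col) for col in zip(*probs)]
    # Probability-weighted concept embedding for each frame, shape T x D.
    cues = matmul(probs, concept_emb)
    # Additive fusion of concept cues into the frame features (assumed).
    fused = [[f + c for f, c in zip(frame, cue)]
             for frame, cue in zip(frames, cues)]
    return probs, video_probs, fused


random.seed(0)
T, D, K = 4, 3, 2  # frames, feature size, number of concepts (toy values)
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
w_detect = [[random.gauss(0, 1) for _ in range(K)] for _ in range(D)]
concept_emb = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

probs, video_probs, fused = detect_and_fuse(frames, w_detect, concept_emb)
```

In this sketch, frames whose concept probabilities rise or fall together give the fused features a temporal signal that a localizer could use as an event cue; the contrastive objective and the cyclic co-learning between generator and localizer are not modeled here.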
Cite
Text
Xie et al. "Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I8.32948
Markdown
[Xie et al. "Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/xie2025aaai-exploring/) doi:10.1609/AAAI.V39I8.32948
BibTeX
@inproceedings{xie2025aaai-exploring,
title = {{Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning}},
author = {Xie, Zhuyang and Yang, Yan and Yu, Yankai and Wang, Jie and Jiang, Yongquan and Wu, Xiao},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {8771-8779},
doi = {10.1609/AAAI.V39I8.32948},
url = {https://mlanthology.org/aaai/2025/xie2025aaai-exploring/}
}