Event-Equalized Dense Video Captioning

Abstract

Dense video captioning aims to localize and caption all events in arbitrary untrimmed videos. Although previous methods have achieved appealing results, they still suffer from temporal bias, i.e., models tend to focus more on events with certain temporal characteristics. Specifically, 1) the temporal distribution of events in training datasets is uneven, so models trained on these datasets pay less attention to out-of-distribution events; and 2) long-duration events have more frame features than short ones and therefore attract more attention. To address this, we argue that events with varying temporal characteristics should be treated equally in dense video captioning. Intuitively, different events tend to exhibit distinct visual differences due to varied camera views, backgrounds, or subjects. Inspired by this, we utilize visual features to obtain an approximate perception of possible events and to pay equal attention to them. In this paper, we introduce a simple but effective framework, called Event-Equalized Dense Video Captioning (E^2DVC), to overcome the temporal bias and treat all possible events equally. Specifically, an event perception module (EPM) performs uneven clustering on visual frame features to generate pseudo-events. We enforce the model's attention to these pseudo-events through a pseudo-event initialization module (PEI). A novel event-enhanced encoder (EEE) is also devised to strengthen the model's ability to explore frame-frame and frame-event relationships. Experimental results validate the effectiveness of the proposed methods.
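The abstract does not specify how the EPM's uneven clustering works, so the following is only a toy illustration of the general idea, not the paper's method: contiguous frames with similar visual features can be grouped greedily into variable-length (uneven) pseudo-event segments. The threshold `sim_thresh` and the greedy boundary rule are assumptions for illustration.

```python
import numpy as np

def pseudo_events(frames, sim_thresh=0.8):
    """Group contiguous frames into variable-length pseudo-events.

    A new pseudo-event starts whenever the cosine similarity between
    consecutive frame features drops below sim_thresh. Returns a list
    of (start, end) frame-index pairs, both inclusive.
    """
    normed = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)   # consecutive cosine sims
    boundaries = np.where(sims < sim_thresh)[0]     # split after frame i
    starts = np.concatenate(([0], boundaries + 1))
    ends = np.concatenate((boundaries, [len(frames) - 1]))
    return list(zip(starts.tolist(), ends.tolist()))

# Toy example: two visually distinct "shots" of different lengths,
# standing in for a long event followed by a short one.
rng = np.random.default_rng(0)
shot_a = rng.normal(0, 0.01, size=(6, 4)) + np.array([1.0, 0.0, 0.0, 0.0])
shot_b = rng.normal(0, 0.01, size=(3, 4)) + np.array([0.0, 1.0, 0.0, 0.0])
frames = np.vstack([shot_a, shot_b])
print(pseudo_events(frames))  # [(0, 5), (6, 8)]
```

The resulting segments are uneven by construction, so a downstream model can attend to each pseudo-event as a unit rather than letting long events dominate by frame count.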

Cite

Text

Wu et al. "Event-Equalized Dense Video Captioning." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00788

Markdown

[Wu et al. "Event-Equalized Dense Video Captioning." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wu2025cvpr-eventequalized/) doi:10.1109/CVPR52734.2025.00788

BibTeX

@inproceedings{wu2025cvpr-eventequalized,
  title     = {{Event-Equalized Dense Video Captioning}},
  author    = {Wu, Kangyi and Li, Pengna and Fu, Jingwen and Li, Yizhe and Wu, Yang and Liu, Yuhan and Wang, Jinjun and Zhou, Sanping},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {8417--8427},
  doi       = {10.1109/CVPR52734.2025.00788},
  url       = {https://mlanthology.org/cvpr/2025/wu2025cvpr-eventequalized/}
}