Joint Event Detection and Description in Continuous Video Streams
Abstract
Dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net), which solves this task in an end-to-end fashion: it encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and then uses a two-level hierarchical LSTM module with context modeling to transcribe the event proposals into captions. We show the effectiveness of the proposed JEDDi-Net on the large-scale ActivityNet Captions dataset.
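To make the pipeline in the abstract concrete, the PyTorch sketch below wires together the three stages it names: a 3D-convolutional encoder, pooling over variable-length temporal proposals (mean-pooling here stands in for the paper's proposal pooling), and a two-level hierarchical LSTM captioner whose controller carries context across events. All layer sizes, module names, and the fixed toy proposals are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the JEDDi-Net-style pipeline; all sizes are hypothetical.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """3D-convolutional encoder over the input clip (toy layer sizes)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse space, keep time
        )

    def forward(self, clip):                  # clip: (B, 3, T, H, W)
        f = self.conv(clip)                   # (B, D, T', 1, 1)
        return f.flatten(2).transpose(1, 2)   # (B, T', D)

def pool_proposal(features, start, end):
    """Mean-pool the features inside one variable-length temporal proposal
    (a simple stand-in for the paper's proposal pooling)."""
    return features[:, start:end].mean(dim=1)  # (B, D)

class HierarchicalCaptioner(nn.Module):
    """Two-level LSTM: a controller cell tracks cross-event context, and a
    word-level LSTM emits the caption for each event proposal."""
    def __init__(self, feat_dim=256, hidden=256, vocab=1000):
        super().__init__()
        self.controller = nn.LSTMCell(feat_dim, hidden)
        self.word_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab)

    def forward(self, event_feat, state, max_len=10):
        h, c = self.controller(event_feat, state)       # update context state
        steps = h.unsqueeze(1).repeat(1, max_len, 1)    # seed the word LSTM
        out, _ = self.word_lstm(steps)
        return self.word_head(out), (h, c)              # logits: (B, L, V)

if __name__ == "__main__":
    enc, cap = VideoEncoder(), HierarchicalCaptioner()
    clip = torch.randn(1, 3, 16, 64, 64)                # toy video clip
    feats = enc(clip)
    state = (torch.zeros(1, 256), torch.zeros(1, 256))
    for start, end in [(0, 8), (8, 16)]:                # toy proposals
        logits, state = cap(pool_proposal(feats, start, end), state)
        print(logits.shape)                             # (1, 10, 1000)

Because the controller's hidden state is threaded from one proposal to the next, each caption is conditioned on the events already described, which is the role the context-modeling component plays in the full model.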
Cite
Text
Xu et al. "Joint Event Detection and Description in Continuous Video Streams." IEEE/CVF Winter Conference on Applications of Computer Vision, 2019. doi:10.1109/WACV.2019.00048Markdown
[Xu et al. "Joint Event Detection and Description in Continuous Video Streams." IEEE/CVF Winter Conference on Applications of Computer Vision, 2019.](https://mlanthology.org/wacv/2019/xu2019wacv-joint/) doi:10.1109/WACV.2019.00048BibTeX
@inproceedings{xu2019wacv-joint,
title = {{Joint Event Detection and Description in Continuous Video Streams}},
author = {Xu, Huijuan and Li, Boyang and Ramanishka, Vasili and Sigal, Leonid and Saenko, Kate},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
year = {2019},
pages = {396--405},
doi = {10.1109/WACV.2019.00048},
url = {https://mlanthology.org/wacv/2019/xu2019wacv-joint/}
}