Contextually Customized Video Summaries via Natural Language

Abstract

The best summary of a long video differs among people due to its highly subjective nature. Even for the same person, the best summary may change over time or with mood. In this paper, we introduce the task of generating contextually customized video summaries through simple text. We train a deep architecture to effectively learn semantic embeddings of video frames by leveraging abundant image-caption data in a progressive manner, so that our algorithm can select semantically relevant video segments to form a contextually meaningful video summary given a user-specific text description, or even a single sentence. To evaluate our customized video summaries, we conduct experimental comparisons with baseline methods that utilize ground-truth information. Despite these challenging baselines, our method still achieves comparable or even superior performance. We also demonstrate that our method can automatically generate semantically diverse video summaries even without any text input.

Cite

Text

Choi et al. "Contextually Customized Video Summaries via Natural Language." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018. doi:10.1109/WACV.2018.00191

Markdown

[Choi et al. "Contextually Customized Video Summaries via Natural Language." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018.](https://mlanthology.org/wacv/2018/choi2018wacv-contextually/) doi:10.1109/WACV.2018.00191

BibTeX

@inproceedings{choi2018wacv-contextually,
  title     = {{Contextually Customized Video Summaries via Natural Language}},
  author    = {Choi, Jinsoo and Oh, Tae-Hyun and Kweon, In So},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
  year      = {2018},
  pages     = {1718-1726},
  doi       = {10.1109/WACV.2018.00191},
  url       = {https://mlanthology.org/wacv/2018/choi2018wacv-contextually/}
}