Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior

Abstract

Video summarization is a challenging, under-constrained problem because the appropriate summary of a single video depends strongly on users' subjective understanding. Data-driven approaches, such as deep neural networks, can handle the ambiguity inherent in this task to some extent, but acquiring temporal annotations for a large-scale video dataset is extremely expensive. To leverage plentiful web-crawled videos for improving video summarization, we present a generative modelling framework that learns latent semantic video representations to bridge the benchmark data and web data. Specifically, our framework couples two important components: a variational autoencoder for learning the latent semantics from web videos, and an encoder-attention-decoder for saliency estimation of the raw video and summary generation. We introduce a loss term that learns the semantic matching between the generated summaries and web videos, and the overall framework is further formulated as a unified conditional variational encoder-decoder, called the variational encoder-summarizer-decoder (VESD). Experiments on the challenging CoSum and TVSum datasets show that the proposed VESD outperforms existing state-of-the-art methods. The source code of this work can be found at https://github.com/cssjcai/vesd.
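
As a rough illustration of the two building blocks named in the abstract, the sketch below shows a generic variational-autoencoder latent step (reparameterized sampling plus the KL term against a standard normal prior) and a generic dot-product attention that turns per-frame features into saliency weights. It is not the authors' implementation: the function names, the choice of a standard normal prior, and the dot-product scoring are all assumptions made for illustration, and the actual VESD model is in the repository linked above.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (hypothetical helper)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def attention_saliency(frame_feats, query):
    """Softmax over dot-product scores; one weight per frame (assumed scoring)."""
    scores = frame_feats @ query
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.zeros((2, 4))        # toy latent means for two videos
    log_var = np.zeros((2, 4))   # toy log-variances (unit variance)
    z = reparameterize(mu, log_var, rng)
    print("latent sample shape:", z.shape)
    print("KL per video:", kl_to_standard_normal(mu, log_var))

    frames = rng.standard_normal((5, 4))  # five frames, 4-d features
    weights = attention_saliency(frames, z[0])
    print("saliency weights sum:", weights.sum())
```

In this toy setup the saliency weights form a distribution over frames, so a summary could be formed by keeping the highest-weighted frames; how VESD actually selects and decodes summary segments is described in the paper, not here.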

Cite

Text

Cai et al. "Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior." Proceedings of the European Conference on Computer Vision (ECCV), 2018. doi:10.1007/978-3-030-01264-9_12

Markdown

[Cai et al. "Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/cai2018eccv-weaklysupervised-a/) doi:10.1007/978-3-030-01264-9_12

BibTeX

@inproceedings{cai2018eccv-weaklysupervised-a,
  title     = {{Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior}},
  author    = {Cai, Sijia and Zuo, Wangmeng and Davis, Larry S. and Zhang, Lei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2018},
  doi       = {10.1007/978-3-030-01264-9_12},
  url       = {https://mlanthology.org/eccv/2018/cai2018eccv-weaklysupervised-a/}
}