Summarizing First-Person Videos from Third Persons' Points of View

Abstract

Video highlighting and summarization are topics of interest in computer vision, benefiting a variety of applications such as viewing, searching, and storage. However, most existing studies rely on training data from third-person videos, which does not easily generalize to highlighting first-person ones. With the goal of deriving an effective model to summarize first-person videos, we propose a novel deep neural network architecture for describing and discriminating vital spatiotemporal information across videos with different points of view. Our proposed model is realized in a semi-supervised setting, in which fully annotated third-person videos, unlabeled first-person videos, and a small amount of annotated first-person ones are presented during training. In our experiments, qualitative and quantitative evaluations on both benchmarks and our collected first-person video datasets are presented.

Cite

Text

Ho et al. "Summarizing First-Person Videos from Third Persons' Points of View." Proceedings of the European Conference on Computer Vision (ECCV), 2018.

Markdown

[Ho et al. "Summarizing First-Person Videos from Third Persons' Points of View." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/ho2018eccv-summarizing/)

BibTeX

@inproceedings{ho2018eccv-summarizing,
  title     = {{Summarizing First-Person Videos from Third Persons' Points of View}},
  author    = {Ho, Hsuan-I and Chiu, Wei-Chen and Wang, Yu-Chiang Frank},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2018},
  url       = {https://mlanthology.org/eccv/2018/ho2018eccv-summarizing/}
}