Video Summarization by Learning Relationships Between Action and Scene
Abstract
We propose a novel deep architecture for video summarization of untrimmed videos that simultaneously recognizes action and scene classes for every video segment. Our networks accomplish this through a multi-task fusion approach based on two types of attention modules that explore semantic correlations between actions and scenes in a video. The proposed networks consist of feature embedding networks and attention inference networks that stochastically leverage the inferred action and scene feature representations. Additionally, we design a new center loss function that learns the feature representations by enforcing minimal intra-class variation and maximal inter-class variation. Our model achieves a score of 0.8409 for summarization and an accuracy of 0.7294 for action and scene recognition on the test set of the CoVieW'19 dataset, ranking 3rd.
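The abstract does not give the exact form of the modified center loss, only that it minimizes intra-class variation while maximizing inter-class variation. A minimal sketch, assuming it combines the standard center-loss pull toward class centers with a margin-based push between distinct centers, might look like the following; the class count, feature dimension, margin, and inter-class weight are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ContrastiveCenterLoss(nn.Module):
    """Hypothetical center loss: intra-class pull plus inter-class margin push."""

    def __init__(self, num_classes, feat_dim, margin=1.0, inter_weight=0.1):
        super().__init__()
        # Learnable per-class feature centers.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margin = margin
        self.inter_weight = inter_weight

    def forward(self, features, labels):
        # Intra-class term: squared distance of each feature to its class center.
        centers_batch = self.centers[labels]                      # (B, D)
        intra = ((features - centers_batch) ** 2).sum(dim=1).mean()

        # Inter-class term: hinge penalty when two distinct class centers
        # lie closer than the margin, encouraging centers to spread apart.
        dists = torch.cdist(self.centers, self.centers)           # (C, C)
        mask = ~torch.eye(len(self.centers), dtype=torch.bool)
        inter = torch.clamp(self.margin - dists[mask], min=0).mean()

        return intra + self.inter_weight * inter

# Usage: combined with the classification losses during training.
loss_fn = ContrastiveCenterLoss(num_classes=10, feat_dim=128)
feats = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
loss = loss_fn(feats, labels)
loss.backward()
```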
Cite
Text
Park et al. "Video Summarization by Learning Relationships Between Action and Scene." IEEE/CVF International Conference on Computer Vision Workshops, 2019. doi:10.1109/ICCVW.2019.00193
Markdown
[Park et al. "Video Summarization by Learning Relationships Between Action and Scene." IEEE/CVF International Conference on Computer Vision Workshops, 2019.](https://mlanthology.org/iccvw/2019/park2019iccvw-video/) doi:10.1109/ICCVW.2019.00193
BibTeX
@inproceedings{park2019iccvw-video,
title = {{Video Summarization by Learning Relationships Between Action and Scene}},
author = {Park, Jungin and Lee, Jiyoung and Jeon, Sangryul and Sohn, Kwanghoon},
booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
year = {2019},
pages = {1545--1552},
doi = {10.1109/ICCVW.2019.00193},
url = {https://mlanthology.org/iccvw/2019/park2019iccvw-video/}
}