ECO: Efficient Convolutional Network for Online Video Understanding
Abstract
The state of the art in video understanding suffers from two problems: (1) Most of the reasoning is performed locally in the video, thus missing important relationships within actions that span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient and hampers fast video retrieval or online classification of long-term activities. In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with a sampling strategy, which exploits the fact that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of a few hundred frames. The approach achieves competitive performance across all datasets while being 10x to 80x faster than state-of-the-art methods.
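The core idea is compact enough to sketch: split the video into segments, sample one frame per segment (so redundant neighboring frames are skipped), run a shared 2D CNN on each sampled frame, and fuse the stacked per-frame feature maps with a 3D CNN so long-term temporal reasoning happens inside the network. The following is a minimal, hypothetical PyTorch sketch of that structure; the class `EcoLiteSketch` and its placeholder convolution stacks are illustrative stand-ins, not the paper's actual networks (which pair a BN-Inception front end with a 3D-ResNet head).

```python
import torch
import torch.nn as nn


def sample_frame_indices(num_frames, num_segments, training=True):
    """Split the video into equal segments and pick one frame per segment:
    a random frame during training, the segment center at test time."""
    seg_len = num_frames / num_segments
    if training:
        offsets = torch.rand(num_segments) * seg_len
    else:
        offsets = torch.full((num_segments,), seg_len / 2)
    starts = torch.arange(num_segments) * seg_len
    return (starts + offsets).long().clamp(max=num_frames - 1)


class EcoLiteSketch(nn.Module):
    """Hypothetical stand-in: a shared 2D CNN per sampled frame, followed by
    a 3D CNN that fuses the stacked feature maps across time in-network."""

    def __init__(self, num_classes, num_segments=16):
        super().__init__()
        self.num_segments = num_segments
        # Placeholder 2D feature extractor (applied to every sampled frame).
        self.cnn2d = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # Placeholder 3D head for long-term temporal fusion.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(96, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, frames):
        # frames: (batch, num_segments, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.cnn2d(frames.view(b * t, c, h, w))  # per-frame 2D features
        feats = feats.view(b, t, *feats.shape[1:])       # (B, T, C', H', W')
        feats = feats.permute(0, 2, 1, 3, 4)             # channels-first for 3D conv
        fused = self.cnn3d(feats).flatten(1)             # fuse content across all segments
        return self.fc(fused)
```

Because the 2D network runs only on the handful of sampled frames and the fusion is a single forward pass rather than post-hoc averaging, the per-video cost stays nearly constant regardless of video length, which is what enables the reported throughput.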
Cite
Text
Zolfaghari et al. "ECO: Efficient Convolutional Network for Online Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2018. doi:10.1007/978-3-030-01216-8_43
Markdown
[Zolfaghari et al. "ECO: Efficient Convolutional Network for Online Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/zolfaghari2018eccv-eco/) doi:10.1007/978-3-030-01216-8_43
BibTeX
@inproceedings{zolfaghari2018eccv-eco,
  title = {{ECO: Efficient Convolutional Network for Online Video Understanding}},
  author = {Zolfaghari, Mohammadreza and Singh, Kamaljeet and Brox, Thomas},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year = {2018},
  doi = {10.1007/978-3-030-01216-8_43},
  url = {https://mlanthology.org/eccv/2018/zolfaghari2018eccv-eco/}
}