Real-Time Online Video Detection with Temporal Smoothing Transformers
Abstract
Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernels and apply two kinds of temporal smoothing kernels: a box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame, and only requires a constant-time update each frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs 6x faster than equivalent sliding-window-based transformers with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS’14 and EPIC-Kitchen-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all but one prior approach on the THUMOS’14 dataset.
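To make the constant-time update concrete, the following is a minimal sketch of cross-attention smoothed by a Laplace (exponential decay) kernel. It is an illustration under stated assumptions, not the authors' implementation: the queries are assumed to stay fixed across frames, the attention score is assumed to be the exponential of a scaled dot product, and the names `StreamingLaplaceAttention`, `windowed_reference`, and `decay` are hypothetical. The sketch keeps a running numerator and denominator per query, so each new frame costs the same regardless of how long the history is, and a reference function recomputes the same result from the full history for comparison.

```python
import math


class StreamingLaplaceAttention:
    """Exponentially smoothed cross-attention with constant-size state.

    Assumption (not from the abstract): queries are fixed across frames.
    """

    def __init__(self, queries, decay):
        self.queries = queries                      # M query vectors of length d
        self.gamma = math.exp(-decay)               # per-frame down-weighting factor
        self.scale = 1.0 / math.sqrt(len(queries[0]))
        self.num = None                             # running weighted sum of values
        self.den = [0.0] * len(queries)             # running sum of weights

    def update(self, key, value):
        """Fold one frame into the state and return the attention output."""
        if self.num is None:
            self.num = [[0.0] * len(value) for _ in self.queries]
        out = []
        for i, q in enumerate(self.queries):
            score = sum(a * b for a, b in zip(q, key)) * self.scale
            w = math.exp(score)  # unnormalized weight; no overflow guard here
            # Decay the old state by gamma, then add the new frame's term.
            self.num[i] = [self.gamma * n + w * v
                           for n, v in zip(self.num[i], value)]
            self.den[i] = self.gamma * self.den[i] + w
            out.append([n / self.den[i] for n in self.num[i]])
        return out


def windowed_reference(queries, keys, values, decay):
    """Recompute the same smoothed attention from the full history."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    last = len(keys) - 1
    out = []
    for q in queries:
        # Weight of frame t: exp(score) * exp(-decay * age of frame t).
        weights = [
            math.exp(sum(a * b for a, b in zip(q, k)) * scale
                     - decay * (last - t))
            for t, k in enumerate(keys)
        ]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return out
```

Calling `update` once per incoming frame returns one output vector per query. The reference function touches every past frame on each call, whereas the streaming class only touches its two running sums, which is the property the abstract describes as a constant-time update. A production version would likely need extra care for numerical stability, since the exponentials here are not rescaled.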
Cite
Text
Zhao and Krähenbühl. "Real-Time Online Video Detection with Temporal Smoothing Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19830-4
Markdown
[Zhao and Krähenbühl. "Real-Time Online Video Detection with Temporal Smoothing Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/zhao2022eccv-realtime/) doi:10.1007/978-3-031-19830-4
BibTeX
@inproceedings{zhao2022eccv-realtime,
title = {{Real-Time Online Video Detection with Temporal Smoothing Transformers}},
author = {Zhao, Yue and Krähenbühl, Philipp},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-19830-4},
url = {https://mlanthology.org/eccv/2022/zhao2022eccv-realtime/}
}