K-Centered Patch Sampling for Efficient Video Recognition

Abstract

For decades, it has been common practice to choose a subset of video frames to reduce the computational burden of a video understanding model. In this paper, we argue that this popular heuristic might be sub-optimal for recent transformer-based models. Specifically, inspired by the fact that transformers are built upon patches of video frames, we propose to sample patches rather than frames using a greedy K-center search, i.e., the patch farthest from those chosen so far is sampled iteratively. We then show that a transformer trained with the selected video patches can outperform its baseline trained with video frames sampled in the traditional way. Furthermore, by adding a certain spatiotemporal structuredness condition, the proposed K-centered patch sampling can even be applied to recent sophisticated video transformers, boosting their performance further. We demonstrate the superiority of our method on the Something-Something and Kinetics datasets.
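
To make the greedy K-center search concrete, here is a minimal sketch of farthest-point patch selection. The function name, the use of raw flattened pixel patches as the patch representation, and the Euclidean distance are illustrative assumptions; the paper's actual patch features, distance measure, and spatiotemporal structuredness condition are not reproduced here.

```python
import numpy as np

def greedy_k_center(patches: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedy K-center (farthest-point) selection over patch vectors.

    patches: (N, D) array, one row per patch (illustrative representation).
    Returns the indices of the k selected patches.
    """
    n = patches.shape[0]
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(n))]            # arbitrary first center
    # Distance of every patch to its nearest selected center so far.
    dist = np.linalg.norm(patches - patches[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())                 # patch farthest from current centers
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(patches - patches[nxt], axis=1))
    return np.asarray(selected)

# Example: a 16-frame clip cut into 14x14 patches of 16x16x3 pixels each
video_patches = np.random.rand(16 * 14 * 14, 16 * 16 * 3).astype(np.float32)
idx = greedy_k_center(video_patches, k=392)      # keep roughly 1/8 of the patches
print(idx.shape)                                 # (392,)
```

The greedy rule gives the standard 2-approximation to the K-center objective, so the retained patches tend to cover the clip's appearance diversity rather than entire frames.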

Cite

Text

Park et al. "K-Centered Patch Sampling for Efficient Video Recognition." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19833-5_10

Markdown

[Park et al. "K-Centered Patch Sampling for Efficient Video Recognition." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/park2022eccv-kcentered/) doi:10.1007/978-3-031-19833-5_10

BibTeX

@inproceedings{park2022eccv-kcentered,
  title     = {{K-Centered Patch Sampling for Efficient Video Recognition}},
  author    = {Park, Seong Hyeon and Tack, Jihoon and Heo, Byeongho and Ha, Jung-Woo and Shin, Jinwoo},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19833-5_10},
  url       = {https://mlanthology.org/eccv/2022/park2022eccv-kcentered/}
}