Video Frame Interpolation Transformer

Abstract

Existing methods for video interpolation heavily rely on deep convolutional neural networks, and thus suffer from their intrinsic limitations, such as content-agnostic kernel weights and a restricted receptive field. To address these issues, we propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations. To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation and extend it to the spatial-temporal domain. Furthermore, we propose a space-time separation strategy to save memory usage, which also improves performance. In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers. Extensive experiments demonstrate that the proposed model performs favorably against the state-of-the-art methods both quantitatively and qualitatively on a variety of benchmark datasets. The code and models are released at https://github.com/zhshi0816/Video-Frame-Interpolation-Transformer.
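The abstract describes local attention extended to the spatial-temporal domain together with a space-time separation strategy. The PyTorch sketch below is a minimal, hedged illustration of what such separated local attention could look like: the module name SeparatedSpaceTimeAttention, the window size, the tensor layout, and the use of nn.MultiheadAttention are illustrative assumptions and not the authors' implementation; refer to the linked repository for the official code.

import torch
import torch.nn as nn

class SeparatedSpaceTimeAttention(nn.Module):
    """Sketch of space-time separated local attention (illustrative only).

    Spatial attention is computed within local windows of each frame,
    followed by temporal attention across frames at each spatial location,
    instead of one joint attention over the full space-time volume.
    """

    def __init__(self, dim, num_heads=4, window=8):
        super().__init__()
        self.window = window
        # Attention within each window x window patch of every frame.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Attention across the T frames at the same spatial position.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) features of the input frames; H and W are
        # assumed divisible by the window size for simplicity.
        b, t, h, w, c = x.shape
        ws = self.window

        # Spatial branch: partition each frame into ws x ws windows.
        xs = x.reshape(b, t, h // ws, ws, w // ws, ws, c)
        xs = xs.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, ws * ws, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        xs = xs.reshape(b, t, h // ws, w // ws, ws, ws, c)
        xs = xs.permute(0, 1, 2, 4, 3, 5, 6).reshape(b, t, h, w, c)

        # Temporal branch: attend over the T frames at each pixel.
        xt = xs.permute(0, 2, 3, 1, 4).reshape(-1, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        xt = xt.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)
        return xt

if __name__ == "__main__":
    frames = torch.randn(1, 4, 32, 32, 64)  # 4 input frames, 64-dim features
    out = SeparatedSpaceTimeAttention(dim=64)(frames)
    print(out.shape)  # torch.Size([1, 4, 32, 32, 64])

Processing space and time separately keeps the attention sequences short (window-squared tokens spatially, T tokens temporally), which is the memory saving the abstract attributes to the separation strategy.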

Cite

Text

Shi et al. "Video Frame Interpolation Transformer." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01696

Markdown

[Shi et al. "Video Frame Interpolation Transformer." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/shi2022cvpr-video/) doi:10.1109/CVPR52688.2022.01696

BibTeX

@inproceedings{shi2022cvpr-video,
  title     = {{Video Frame Interpolation Transformer}},
  author    = {Shi, Zhihao and Xu, Xiangyu and Liu, Xiaohong and Chen, Jun and Yang, Ming-Hsuan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {17482--17491},
  doi       = {10.1109/CVPR52688.2022.01696},
  url       = {https://mlanthology.org/cvpr/2022/shi2022cvpr-video/}
}