Event-Based Video Reconstruction Using Transformer

Abstract

Event cameras, which output events by detecting spatio-temporal brightness changes, bring a novel paradigm to image sensors with high dynamic range and low latency. Previous works have achieved impressive performance on event-based video reconstruction by introducing convolutional neural networks (CNNs). However, the intrinsic locality of convolutional operations is not capable of modeling long-range dependencies, which are crucial to many vision tasks. In this paper, we present a hybrid CNN-Transformer network for event-based video reconstruction (ET-Net), which combines the fine local information from CNN with the global contexts from Transformer. In addition, we propose a Token Pyramid Aggregation strategy to implement multi-scale token integration for relating internal and intersected semantic concepts in the token space. Experimental results demonstrate that our proposed method achieves superior performance over state-of-the-art methods on multiple real-world event datasets. The code is available at https://github.com/WarranWeng/ET-Net.
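
To make the hybrid CNN-Transformer idea from the abstract concrete, below is a minimal PyTorch sketch: a small convolutional stem extracts local features from an event voxel grid, the feature map is flattened into tokens for a standard Transformer encoder, and average-pooled token pyramids at a few scales are fused. The module names, channel sizes, and pooling scales here are illustrative assumptions, not the ET-Net implementation; refer to the repository linked above for the authors' code.

```python
# Illustrative sketch (not the authors' code) of a hybrid CNN-Transformer block
# with a simple multi-scale token aggregation step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridCNNTransformer(nn.Module):
    def __init__(self, in_ch=5, dim=64, heads=4, depth=2, pyramid_scales=(1, 2, 4)):
        super().__init__()
        # CNN stem: captures fine local structure from the event tensor
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Transformer encoder: models long-range dependencies between tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Token "pyramid": average-pool tokens at several scales, then fuse
        self.pyramid_scales = pyramid_scales
        self.fuse = nn.Linear(dim * len(pyramid_scales), dim)

    def forward(self, x):                      # x: (B, in_ch, H, W) event voxel grid
        f = self.cnn(x)                        # (B, dim, H/4, W/4) local features
        B, C, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W/16, dim) token sequence
        tokens = self.transformer(tokens)      # global context via self-attention
        grid = tokens.transpose(1, 2).reshape(B, C, H, W)
        # Multi-scale aggregation: pool, upsample back, concatenate, fuse
        feats = [F.adaptive_avg_pool2d(grid, (max(H // s, 1), max(W // s, 1)))
                 for s in self.pyramid_scales]
        feats = [F.interpolate(t, size=(H, W), mode="nearest") for t in feats]
        fused = torch.cat(feats, dim=1).flatten(2).transpose(1, 2)  # (B, HW, 3*dim)
        return self.fuse(fused).transpose(1, 2).reshape(B, C, H, W)

# Example: a batch of two 5-bin event voxel grids at 128x128 resolution
out = HybridCNNTransformer()(torch.randn(2, 5, 128, 128))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```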

Cite

Text

Weng et al. "Event-Based Video Reconstruction Using Transformer." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00256

Markdown

[Weng et al. "Event-Based Video Reconstruction Using Transformer." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/weng2021iccv-eventbased/) doi:10.1109/ICCV48922.2021.00256

BibTeX

@inproceedings{weng2021iccv-eventbased,
  title     = {{Event-Based Video Reconstruction Using Transformer}},
  author    = {Weng, Wenming and Zhang, Yueyi and Xiong, Zhiwei},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {2563--2572},
  doi       = {10.1109/ICCV48922.2021.00256},
  url       = {https://mlanthology.org/iccv/2021/weng2021iccv-eventbased/}
}