Discrete Cosin TransFormer: Image Modeling from Frequency Domain

Abstract

In this paper, we propose the Discrete Cosin TransFormer (DCFormer), which directly learns semantics from a DCT-based frequency domain representation. We first show that transformer-based networks are able to learn semantics directly from a frequency domain representation based on the discrete cosine transform (DCT) without compromising performance. To achieve the desired efficiency-effectiveness trade-off, we then apply input information compression in the frequency domain, which highlights the visually significant signals, inspired by JPEG compression. We explore different frequency domain down-sampling strategies and show that it is possible to preserve the semantically meaningful information by strategically dropping the high-frequency components. The proposed DCFormer is tested on various downstream tasks including image classification, object detection, and instance segmentation; it achieves performance comparable to the state of the art with fewer FLOPs, and outperforms commonly used backbones (e.g., Swin) at similar FLOPs. Our ablation results also show that the proposed method generalizes well across different transformer backbones.
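The JPEG-inspired idea described above — transforming an image into DCT coefficients and discarding the high-frequency ones while keeping the visually significant low-frequency signal — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; it assumes SciPy's `scipy.fft.dctn`/`idctn` and applies a whole-image 2D DCT with a simple top-left (low-frequency) crop, whereas the paper explores several down-sampling strategies:

```python
import numpy as np
from scipy.fft import dctn, idctn  # orthonormal 2D DCT and its inverse


def dct_lowfreq_compress(img: np.ndarray, keep: float = 0.5) -> np.ndarray:
    """Transform an image to the DCT domain and keep only the top-left
    (low-frequency) block of coefficients, dropping the high frequencies."""
    coeffs = dctn(img, norm="ortho")
    kh, kw = int(img.shape[0] * keep), int(img.shape[1] * keep)
    return coeffs[:kh, :kw]  # DCT energy compacts into this corner


def reconstruct(kept: np.ndarray, shape: tuple) -> np.ndarray:
    """Zero-pad the kept coefficients back to full size and invert the DCT,
    to inspect how much signal the truncation preserved."""
    full = np.zeros(shape)
    full[: kept.shape[0], : kept.shape[1]] = kept
    return idctn(full, norm="ortho")


# A smooth synthetic image: most of its DCT energy is low-frequency,
# so dropping 75% of the coefficients loses very little information.
img = np.outer(np.linspace(0.0, 1.0, 32), np.linspace(0.0, 1.0, 32))
kept = dct_lowfreq_compress(img, keep=0.5)
rec = reconstruct(kept, img.shape)
```

In DCFormer itself the retained frequency representation is fed to the transformer directly rather than inverted back to pixels; the reconstruction step here only serves to visualize what the truncation discards.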

Cite

Text

Li et al. "Discrete Cosin TransFormer: Image Modeling from Frequency Domain." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Li et al. "Discrete Cosin TransFormer: Image Modeling from Frequency Domain." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/li2023wacv-discrete/)

BibTeX

@inproceedings{li2023wacv-discrete,
  title     = {{Discrete Cosin TransFormer: Image Modeling from Frequency Domain}},
  author    = {Li, Xinyu and Zhang, Yanyi and Yuan, Jianbo and Lu, Hanlin and Zhu, Yibo},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {5468--5478},
  url       = {https://mlanthology.org/wacv/2023/li2023wacv-discrete/}
}