ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

Xue, Mengqi; Huang, Qihan; Zhang, Haofei; Hu, Jingwen; Song, Jie; Song, Mingli; Jin, Canghong

doi:10.24963/ijcai.2024/168

IJCAI 2024 pp. 1516-1524

Abstract

Image super-resolution (ISR) is a classic and challenging problem in computer vision because the degradation patterns introduced during data collection are complex and unknown. Leveraging powerful generative priors, diffusion-based methods have recently established new state-of-the-art ISR performance, but their characteristics in the frequency domain remain underexplored. In this paper, we investigate their frequency-domain behavior from a sampling timestep perspective. Experimentally, we find that current diffusion-based ISR algorithms handle different frequency components inadequately in distinct groups of timesteps during sampling. To address this, we first propose a Timestep Division Controller that adaptively divides the timesteps into groups based on the performance gradient across different components. Next, we design two dedicated modules, the Amplitude and Phase Enhancement Module (APEM) and the High- and Low-Frequency Enhancement Module (HLEM), to regulate the information flow of distinct frequency-domain features. By adaptively enhancing specific frequency components at different stages of the sampling process, the two modules compensate for the insufficient frequency-domain perception of diffusion-based ISR models. Extensive experiments on three benchmark datasets verify the superior ISR performance of our method, e.g., an average 5.40% improvement on CLIP-IQA over the best diffusion-based ISR baseline.

Cite

Text

Xue et al. "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/168

Markdown

[Xue et al. "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/xue2024ijcai-protopformer/) doi:10.24963/ijcai.2024/168

BibTeX

@inproceedings{xue2024ijcai-protopformer,
  title     = {{ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition}},
  author    = {Xue, Mengqi and Huang, Qihan and Zhang, Haofei and Hu, Jingwen and Song, Jie and Song, Mingli and Jin, Canghong},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {1516--1524},
  doi       = {10.24963/ijcai.2024/168},
  url       = {https://mlanthology.org/ijcai/2024/xue2024ijcai-protopformer/}
}