Building Vision Models upon Heat Conduction

Abstract

Visual representation models leveraging attention mechanisms are challenged by significant computational overhead, particularly when pursuing large receptive fields. In this study, we aim to mitigate this challenge by introducing the Heat Conduction Operator (HCO), built upon the physical principle of heat conduction. HCO conceptualizes image patches as heat sources and models their correlations through adaptive thermal energy diffusion, enabling robust visual representations. HCO enjoys a computational complexity of O(N^1.5), as it can be implemented with discrete cosine transform (DCT) operations. HCO is plug-and-play: combined with deep learning backbones, it produces visual representation models (termed vHeat) with global receptive fields. Experiments across vision tasks demonstrate that, beyond stronger performance, vHeat achieves up to 3x higher throughput, 80% less GPU memory allocation, and 35% fewer computational FLOPs compared to the Swin Transformer.
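To make the DCT-based formulation concrete, below is a minimal sketch of a frequency-domain heat-conduction step on a 2D feature map. It is not the paper's implementation; the function name hco_step and the thermal_diffusivity parameter are illustrative assumptions, and it uses SciPy's dctn/idctn rather than the learned, multi-channel operator described in vHeat.

```python
# Minimal sketch: solve one step of the 2D heat equation in the DCT basis,
# where each cosine frequency decays independently. Names and parameters
# (hco_step, thermal_diffusivity) are hypothetical, not from the paper's code.
import numpy as np
from scipy.fft import dctn, idctn

def hco_step(u0: np.ndarray, thermal_diffusivity: float = 1.0, t: float = 1.0) -> np.ndarray:
    """Diffuse a 2D feature map u0 (H x W) for time t via the heat equation.

    In the DCT basis the heat equation decouples per frequency, so a step
    costs two 2D DCTs plus an element-wise decay; realized as row/column
    matrix products, this is O(N^1.5) in the number of patches N.
    """
    H, W = u0.shape
    # Forward 2D DCT (type-II, orthonormal) of the "initial temperature" field.
    U = dctn(u0, type=2, norm="ortho")
    # Squared spatial frequencies of the cosine basis on an H x W grid.
    wy = np.pi * np.arange(H) / H
    wx = np.pi * np.arange(W) / W
    lam = wy[:, None] ** 2 + wx[None, :] ** 2
    # Each frequency component decays exponentially with rate k * |omega|^2.
    U_t = U * np.exp(-thermal_diffusivity * lam * t)
    # Inverse DCT returns the diffused feature map in the spatial domain.
    return idctn(U_t, type=2, norm="ortho")

if __name__ == "__main__":
    # Example: diffuse a random 56 x 56 map of "heat sources" (patch features).
    rng = np.random.default_rng(0)
    patch_features = rng.standard_normal((56, 56))
    diffused = hco_step(patch_features, thermal_diffusivity=0.5, t=1.0)
    print(diffused.shape)  # (56, 56)
```

Because every spatial frequency mixes information from the whole map, a single such step already yields a global receptive field, which is the property the abstract attributes to HCO.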

Cite

Text

Wang et al. "Building Vision Models upon Heat Conduction." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00907

Markdown

[Wang et al. "Building Vision Models upon Heat Conduction." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-building/) doi:10.1109/CVPR52734.2025.00907

BibTeX

@inproceedings{wang2025cvpr-building,
  title     = {{Building Vision Models upon Heat Conduction}},
  author    = {Wang, Zhaozhi and Liu, Yue and Tian, Yunjie and Liu, Yunfan and Wang, Yaowei and Ye, Qixiang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {9707--9717},
  doi       = {10.1109/CVPR52734.2025.00907},
  url       = {https://mlanthology.org/cvpr/2025/wang2025cvpr-building/}
}