Non-Natural Image Understanding with Advancing Frequency-Based Vision Encoders

Lin, Wang; Wang, QingSong; Feng, Yueying; Wang, Shulei; Jin, Tao; Zhao, Zhou; Wu, Fei; Yao, Chang; Chen, Jingyuan

doi:10.1109/CVPR52734.2025.02770

CVPR 2025 pp. 29756-29766

Abstract

Large language models (LLMs) have significantly enhanced cross-modal understanding capabilities by integrating visual encoders with textual embeddings, giving rise to multimodal large language models (MLLMs). However, these models struggle with non-natural images such as geometric figures and charts, particularly in fields like education and finance. Despite efforts to collect datasets and fine-tune MLLMs, the gap with natural image understanding remains evident, and the cost of collecting large and diverse non-natural image datasets is high. To address this, we analyze the limitations of transformer-based vision encoders (ViT) within existing MLLMs from a frequency perspective. Studies have shown that ViT models are less effective at capturing high-frequency information, impairing their ability to capture elements like points, lines, and angles in non-natural images. In response, we introduce FM-ViT, a frequency-modulated vision encoder that uses Fourier decomposition to extract high- and low-frequency components from self-attention features and re-weights them during tuning on non-natural images. In addition, we combine the features of CNN models with FM-ViT and propose EDGE, an MLLM with enhanced graphical encoders tailored for understanding non-natural images. Extensive experiments have confirmed the effectiveness of our FM-ViT and EDGE in 4 types.
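
The idea of splitting features into low- and high-frequency components and re-weighting them can be illustrated with a small sketch. This is not the authors' FM-ViT implementation: it works on a plain 1-D signal rather than self-attention features, uses fixed weights rather than weights tuned during training, and the names `split_frequencies`, `reweight`, `cutoff`, `w_low`, and `w_high` are hypothetical, chosen only for this illustration.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a 1-D sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_frequencies(signal, cutoff):
    """Split a real signal into low- and high-frequency parts.

    Bins whose distance from the DC bin is at most `cutoff` count as low
    frequency (hypothetical rule for this sketch); the remainder is high.
    """
    n = len(signal)
    spectrum = dft(signal)
    # Symmetric low-pass mask, so the low-frequency part stays real-valued.
    low_spec = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    low = [v.real for v in idft(low_spec)]
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def reweight(signal, cutoff, w_low, w_high):
    """Recombine the two frequency bands with separate scalar weights."""
    low, high = split_frequencies(signal, cutoff)
    return [w_low * l + w_high * h for l, h in zip(low, high)]


# A step "edge": boosting the high band exaggerates the transition.
edge = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
sharpened = reweight(edge, cutoff=1, w_low=1.0, w_high=2.0)
```

With `w_low = w_high = 1.0` the signal is returned unchanged, while `w_high > 1.0` amplifies sharp transitions, the kind of detail (points, lines, angles) the abstract says ViT encoders tend to under-represent.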

Cite

Text

Lin et al. "Non-Natural Image Understanding with Advancing Frequency-Based Vision Encoders." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02770

Markdown

[Lin et al. "Non-Natural Image Understanding with Advancing Frequency-Based Vision Encoders." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/lin2025cvpr-nonnatural/) doi:10.1109/CVPR52734.2025.02770

BibTeX

@inproceedings{lin2025cvpr-nonnatural,
  title     = {{Non-Natural Image Understanding with Advancing Frequency-Based Vision Encoders}},
  author    = {Lin, Wang and Wang, QingSong and Feng, Yueying and Wang, Shulei and Jin, Tao and Zhao, Zhou and Wu, Fei and Yao, Chang and Chen, Jingyuan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29756-29766},
  doi       = {10.1109/CVPR52734.2025.02770},
  url       = {https://mlanthology.org/cvpr/2025/lin2025cvpr-nonnatural/}
}