IRGPT: Understanding Real-World Infrared Image with Bi-Cross-Modal Curriculum on Large-Scale Benchmark

Abstract

Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, their reliance on synthetic infrared images generated through style transfer from visible images limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from visible to infrared domains by considering the difficulty scores of both infrared-visible and infrared-text pairs. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.
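The curriculum strategy above orders training samples from easy to hard using difficulty scores from both modality pairings. A minimal sketch of such ordering, assuming each sample carries precomputed infrared-visible and infrared-text difficulty scores (the field names and the weighting `alpha` are hypothetical, not from the paper):

```python
def curriculum_order(samples, alpha=0.5):
    """Sort samples easy-to-hard by a weighted combination of
    infrared-visible ('iv_score') and infrared-text ('it_score')
    difficulty scores, where higher means harder."""
    def difficulty(s):
        return alpha * s["iv_score"] + (1 - alpha) * s["it_score"]
    return sorted(samples, key=difficulty)

# Toy samples with illustrative difficulty scores.
samples = [
    {"id": "a", "iv_score": 0.9, "it_score": 0.8},
    {"id": "b", "iv_score": 0.2, "it_score": 0.3},
    {"id": "c", "iv_score": 0.5, "it_score": 0.6},
]
ordered = [s["id"] for s in curriculum_order(samples)]
# easiest sample first: ['b', 'c', 'a']
```

In a real curriculum-transfer setup the scores would come from a model (e.g., cross-modal alignment confidence), and the schedule would gradually admit harder samples over training epochs.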

Cite

Text

Cao et al. "IRGPT: Understanding Real-World Infrared Image with Bi-Cross-Modal Curriculum on Large-Scale Benchmark." International Conference on Computer Vision, 2025.

Markdown

[Cao et al. "IRGPT: Understanding Real-World Infrared Image with Bi-Cross-Modal Curriculum on Large-Scale Benchmark." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/cao2025iccv-irgpt/)

BibTeX

@inproceedings{cao2025iccv-irgpt,
  title     = {{IRGPT: Understanding Real-World Infrared Image with Bi-Cross-Modal Curriculum on Large-Scale Benchmark}},
  author    = {Cao, Zhe and Zhang, Jin and Zhang, Ruiheng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {166--176},
  url       = {https://mlanthology.org/iccv/2025/cao2025iccv-irgpt/}
}