The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique like Photographers

Qi, Daiqing; Zhao, Handong; Shi, Jing; Jenni, Simon; Fan, Yifei; Dernoncourt, Franck; Cohen, Scott; Li, Sheng

CVPR 2025 pp. 24807-24816

Abstract

Photographer, curator, and former director of photography at the Museum of Modern Art (MoMA), John Szarkowski remarked in *William Eggleston's Guide*, "While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky." Szarkowski insightfully revealed a notable gap between general and aesthetic visual understanding: while the former emphasizes identifying factual elements in an image (the sky), the latter transcends mere object identification, viewing it instead as an aesthetic component--a pure expanse of blue, valued purely as a color block in visual aesthetics. Such distinctions between general visual understanding (detection, localization, etc.) and aesthetic perception (color, lighting, composition, etc.) pose a significant challenge for existing Multimodal Large Language Models (MLLMs) in comprehending image aesthetics, which is increasingly needed in real-world applications, from image recommendation and enhancement to generation. To fundamentally advance the aesthetic understanding of MLLMs, we introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, distinguished by its large scale, expertise, and diversity. Additionally, we propose a new model, PhotoEye, an MLLM featuring a language-guided multi-view vision fusion mechanism for understanding image aesthetics from multiple perspectives. Finally, we introduce PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. Our model demonstrates significant advantages over both open-source and commercial models on existing benchmarks and PhotoBench.

Cite

Text

Qi et al. "The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique like Photographers." Conference on Computer Vision and Pattern Recognition, 2025.

Markdown

[Qi et al. "The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique like Photographers." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/qi2025cvpr-photographer/)

BibTeX

@inproceedings{qi2025cvpr-photographer,
  title     = {{The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique like Photographers}},
  author    = {Qi, Daiqing and Zhao, Handong and Shi, Jing and Jenni, Simon and Fan, Yifei and Dernoncourt, Franck and Cohen, Scott and Li, Sheng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {24807-24816},
  url       = {https://mlanthology.org/cvpr/2025/qi2025cvpr-photographer/}
}