The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique like Photographers

Abstract

John Szarkowski, photographer, curator, and former director of photography at the Museum of Modern Art (MoMA), remarked in *William Eggleston's Guide*, "While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky." Szarkowski's remark reveals a notable gap between general and aesthetic visual understanding: the former emphasizes identifying factual elements in an image (the sky), whereas the latter goes beyond object identification and treats the same region as an aesthetic component, an expanse of blue valued purely as a color block. This distinction between general visual understanding (detection, localization, etc.) and aesthetic perception (color, lighting, composition, etc.) poses a significant challenge for existing Multimodal Large Language Models (MLLMs) in comprehending image aesthetics, a capability increasingly needed in real-world applications ranging from image recommendation and enhancement to generation. To advance the aesthetic understanding of MLLMs, we introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts and distinguished by its large scale, expertise, and diversity. We also propose a new model, PhotoEye, an MLLM featuring a language-guided multi-view vision fusion mechanism for understanding image aesthetics from multiple perspectives. Finally, we introduce PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. Our model demonstrates significant advantages over both open-source and commercial models on existing benchmarks and PhotoBench.

Cite

Text

Qi et al. "The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique like Photographers." Conference on Computer Vision and Pattern Recognition, 2025.

Markdown

[Qi et al. "The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique like Photographers." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/qi2025cvpr-photographer/)

BibTeX

@inproceedings{qi2025cvpr-photographer,
  title     = {{The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique like Photographers}},
  author    = {Qi, Daiqing and Zhao, Handong and Shi, Jing and Jenni, Simon and Fan, Yifei and Dernoncourt, Franck and Cohen, Scott and Li, Sheng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {24807--24816},
  url       = {https://mlanthology.org/cvpr/2025/qi2025cvpr-photographer/}
}