PixTalk: Controlling Photorealistic Image Processing and Editing with Language

Abstract

Text-guided image generation and editing is emerging as a fundamental problem in computer vision. However, most approaches lack control, and their results fall far short of professional photography quality standards. In this work, we propose the first approach that introduces language and explicit control into the image processing and editing pipeline. PixTalk is a vision-language multi-task image processing model guided by text instructions. Our method can perform over 40 transformations, covering the most popular techniques in photography, and delivers results comparable to those of professional photo-editing software. Our model can process 12MP images on consumer GPUs in real time (under 1 second). As part of this effort, we propose a novel dataset and benchmark for new research on multi-modal image processing and editing.

Cite

Text

Conde et al. "PixTalk: Controlling Photorealistic Image Processing and Editing with Language." International Conference on Computer Vision, 2025.

Markdown

[Conde et al. "PixTalk: Controlling Photorealistic Image Processing and Editing with Language." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/conde2025iccv-pixtalk/)

BibTeX

@inproceedings{conde2025iccv-pixtalk,
  title     = {{PixTalk: Controlling Photorealistic Image Processing and Editing with Language}},
  author    = {Conde, Marcos V. and Lu, Zihao and Timofte, Radu},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {19269--19279},
  url       = {https://mlanthology.org/iccv/2025/conde2025iccv-pixtalk/}
}