\texttt{Complex-Edit}: CoT-like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Yang, Siwei; Hui, Mude; Zhao, Bingchen; Zhou, Yuyin; Ruiz, Nataniel; Xie, Cihang

TMLR 2026

Abstract

We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform proprietary, closed-source models, and the performance gap widens as instruction complexity increases; 2) Increased instruction complexity primarily impairs the models’ ability to retain key elements from the input images; 3) Stronger models are not necessarily more resilient to higher complexity; 4) Decomposing a complex instruction into a sequence of atomic steps and executing them one at a time substantially degrades performance across multiple metrics; 5) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 6) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises, a phenomenon that intriguingly also manifests in the outputs of the latest GPT-Image-1. The code for evaluation and data generation and the test set are released at https://github.com/UCSC-VLAA/Complex-Edit.
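The Best-of-N selection in insight 5 can be illustrated with a minimal Python sketch. The abstract does not say how candidates are sampled or scored, so everything here is an assumption for illustration: `best_of_n`, `edit_fn`, and `score_fn` are hypothetical names, the editor is assumed to be stochastic and controlled by an integer seed, and the score is assumed to be a single scalar (for example, one produced by a VLM-based evaluator).

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")  # placeholder for whatever image type the editor returns


def best_of_n(
    edit_fn: Callable[[int], Image],      # hypothetical: seed -> edited image
    score_fn: Callable[[Image], float],   # hypothetical: edited image -> scalar score
    n: int = 4,
) -> Tuple[Image, float]:
    """Run the editor n times with different seeds and keep the top-scoring output."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Generate n candidate edits, one per seed.
    candidates = [edit_fn(seed) for seed in range(n)]
    # Score every candidate independently.
    scores = [score_fn(candidate) for candidate in candidates]
    # Pick the index of the highest score (first one wins on ties).
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

In this sketch the same routine could wrap either a single direct edit or each step of a sequential edit, since it only assumes a callable that returns a candidate and a callable that scores it.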

Cite

Text

Yang et al. "\texttt{Complex-Edit}: CoT-like Instruction Generation for Complexity-Controllable Image Editing Benchmark." Transactions on Machine Learning Research, 2026.

Markdown

[Yang et al. "\texttt{Complex-Edit}: CoT-like Instruction Generation for Complexity-Controllable Image Editing Benchmark." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/yang2026tmlr-complexedit/)

BibTeX

@article{yang2026tmlr-complexedit,
  title     = {{\texttt{Complex-Edit}: CoT-like Instruction Generation for Complexity-Controllable Image Editing Benchmark}},
  author    = {Yang, Siwei and Hui, Mude and Zhao, Bingchen and Zhou, Yuyin and Ruiz, Nataniel and Xie, Cihang},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/yang2026tmlr-complexedit/}
}