Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Abstract

Instruction-based image editing enables precise modifications via natural language prompts, but existing methods face a precision-efficiency tradeoff: fine-tuning demands massive datasets (>10M samples) and substantial computational resources, while training-free approaches suffer from weak instruction comprehension. We address this by proposing ICEdit, which leverages the inherent comprehension and generation abilities of large-scale Diffusion Transformers (DiTs) through three key innovations: (1) an in-context editing paradigm that requires no architectural modifications; (2) minimal parameter-efficient fine-tuning for quality improvement; (3) Early Filter Inference-Time Scaling, which uses VLMs to select high-quality noise samples for efficiency. Experiments show that ICEdit achieves state-of-the-art editing performance with only 0.1% of the training data and 1% of the trainable parameters used by previous methods. Our approach establishes a new paradigm for balancing precision and efficiency in instructional image editing.
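As a rough illustration of the Early Filter Inference-Time Scaling idea mentioned in the abstract, the sketch below draws several candidate initial noise seeds, produces a cheap low-step preview edit for each, asks a VLM to score how well each preview follows the instruction, and spends the full denoising budget only on the best seed. The `edit_preview`, `edit_full`, and `vlm_score` callables are hypothetical placeholders standing in for a DiT editing pipeline and a VLM scorer; this is not the authors' implementation.

```python
import random

def early_filter_edit(image, instruction, edit_preview, edit_full, vlm_score,
                      num_seeds=4, preview_steps=4, full_steps=28):
    """Sketch of VLM-guided early filtering over initial noise seeds.

    edit_preview(image, instruction, seed, steps) -> low-step draft edit
    edit_full(image, instruction, seed, steps)    -> full-quality edit
    vlm_score(draft, instruction)                 -> scalar score of how well
                                                     the draft follows the prompt
    All three callables are hypothetical placeholders.
    """
    seeds = [random.randrange(2**31) for _ in range(num_seeds)]

    # Cheap previews: denoise each candidate seed for only a few steps.
    previews = [(seed, edit_preview(image, instruction, seed, preview_steps))
                for seed in seeds]

    # Let the VLM pick the preview that best satisfies the editing instruction.
    best_seed, _ = max(previews, key=lambda p: vlm_score(p[1], instruction))

    # Spend the full sampling budget only on the winning seed.
    return edit_full(image, instruction, best_seed, full_steps)
```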

Cite

Text

Zhang et al. "Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhang et al. "Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-enabling/)

BibTeX

@inproceedings{zhang2025neurips-enabling,
  title     = {{Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer}},
  author    = {Zhang, Zechuan and Xie, Ji and Lu, Yu and Yang, Zongxin and Yang, Yi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhang2025neurips-enabling/}
}