Instruction-Based Image Editing with Planning, Reasoning, and Generation

Ji, Liya; Qi, Chenyang; Chen, Qifeng


ICCV 2025 pp. 17506-17515


Abstract

Editing images via instructions is a natural way to generate interactive content, but it is challenging because it demands both strong scene understanding and high-quality generation. Prior work chains together large language models, object segmentation models, and editing models for this task. However, the understanding models in such pipelines operate on a single modality, which limits editing quality. We aim to bridge understanding and generation with a new multi-modal model that gives instruction-based image editing models the intelligence needed to handle more complex cases. To achieve this goal, we decompose the instruction-based editing task into three stages driven by multi-modal chain-of-thought prompts: Chain-of-Thought (CoT) planning, editing region reasoning, and editing. In CoT planning, a large language model reasons about appropriate sub-prompts, taking into account both the given instruction and the capabilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts these hints to generate the edited image. Extensive experiments demonstrate that our method achieves competitive editing performance on complex real-world images. The source code will be publicly available.
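The three-stage decomposition (planning, region reasoning, editing) can be illustrated with a minimal Python sketch of how the stages compose. Everything in it is hypothetical and is not the authors' code: the class PlanReasonEditPipeline and the functions toy_planner, toy_region_reasoner, and toy_editor are illustrative stand-ins for the large language model, the multi-modal region network, and the diffusion-based editor. It also assumes that sub-prompts are applied one after another and that the predicted region mask is the "hint" passed to the editor; the abstract does not specify either detail.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]   # toy grayscale image: rows of pixel intensities
Mask = List[List[bool]]   # True where a pixel may be edited


@dataclass
class PlanReasonEditPipeline:
    # Hypothetical interfaces for the three stages described in the abstract.
    planner: Callable[[str], List[str]]            # stage 1: CoT planning
    region_reasoner: Callable[[Image, str], Mask]  # stage 2: editing region reasoning
    editor: Callable[[Image, str, Mask], Image]    # stage 3: hint-guided editing

    def run(self, image: Image, instruction: str) -> Image:
        # Stage 1: decompose the instruction into sub-prompts.
        for sub_prompt in self.planner(instruction):
            # Stage 2: predict where this sub-prompt should apply.
            mask = self.region_reasoner(image, sub_prompt)
            # Stage 3: edit using the predicted region as the hint (assumption).
            image = self.editor(image, sub_prompt, mask)
        return image


def toy_planner(instruction: str) -> List[str]:
    # Hypothetical stand-in for an LLM: split compound instructions on " and ".
    return [part.strip() for part in instruction.split(" and ") if part.strip()]


def toy_region_reasoner(image: Image, sub_prompt: str) -> Mask:
    # Hypothetical stand-in for the region network: select the left or right half.
    width = len(image[0])
    want_left = "left" in sub_prompt
    return [[(x < width // 2) == want_left for x in range(width)] for _ in image]


def toy_editor(image: Image, sub_prompt: str, mask: Mask) -> Image:
    # Hypothetical stand-in for the diffusion editor: brighten or darken masked pixels.
    value = 255 if "brighten" in sub_prompt else 0
    return [
        [value if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


pipeline = PlanReasonEditPipeline(toy_planner, toy_region_reasoner, toy_editor)
result = pipeline.run(
    [[128] * 4 for _ in range(2)],
    "brighten the left side and darken the right side",
)
print(result)  # [[255, 255, 0, 0], [255, 255, 0, 0]]
```

Keeping each stage behind a plain callable makes the point of the decomposition visible: the planner never touches pixels, the region reasoner only localizes, and the editor only changes pixels inside the region it is given, so any stage could be swapped for a real model without changing the control flow.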


Cite

Text

Ji et al. "Instruction-Based Image Editing with Planning, Reasoning, and Generation." International Conference on Computer Vision, 2025.

Markdown

[Ji et al. "Instruction-Based Image Editing with Planning, Reasoning, and Generation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ji2025iccv-instructionbased/)

BibTeX

@inproceedings{ji2025iccv-instructionbased,
  title     = {{Instruction-Based Image Editing with Planning, Reasoning, and Generation}},
  author    = {Ji, Liya and Qi, Chenyang and Chen, Qifeng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17506-17515},
  url       = {https://mlanthology.org/iccv/2025/ji2025iccv-instructionbased/}
}