CLIPDrag: Combining Text-Based and Drag-Based Instructions for Image Editing

ICLR 2025

Abstract

Precise and flexible image editing remains a fundamental challenge in computer vision. Based on the modified areas, most editing methods can be divided into two main types: global editing and local editing. In this paper, we choose the two most common editing approaches (i.e., text-based editing and drag-based editing) and analyze their drawbacks. Specifically, text-based methods often fail to describe the desired modifications precisely, while drag-based methods suffer from ambiguity. To address these issues, we propose CLIPDrag, a novel image editing method that is the first to combine text and drag signals for precise and ambiguity-free manipulations on diffusion models. To fully leverage these two signals, we treat text signals as global guidance and drag points as local information. We then introduce a novel global-local motion supervision method that integrates text signals into existing drag-based methods by adapting a pre-trained language-vision model such as CLIP. Furthermore, we address the problem of slow convergence in CLIPDrag by presenting a fast point-tracking method that forces drag points to move in the correct directions. Extensive experiments demonstrate that CLIPDrag outperforms existing methods that rely on drag signals or text signals alone.

Cite

Text

Jiang et al. "CLIPDrag: Combining Text-Based and Drag-Based Instructions for Image Editing." International Conference on Learning Representations, 2025.

Markdown

[Jiang et al. "CLIPDrag: Combining Text-Based and Drag-Based Instructions for Image Editing." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/jiang2025iclr-clipdrag/)

BibTeX

@inproceedings{jiang2025iclr-clipdrag,
  title     = {{CLIPDrag: Combining Text-Based and Drag-Based Instructions for Image Editing}},
  author    = {Jiang, Ziqi and Wang, Zhen and Chen, Long},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/jiang2025iclr-clipdrag/}
}