DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes

Abstract

The image difference captioning (IDC) task is to describe the differences between two images. However, existing datasets do not offer comprehensive coverage across all image-difference categories. In this work, we introduce DiffTell, a high-quality dataset covering various types of image manipulations, including global image alterations, object-level changes, and text manipulations. Data quality is controlled by careful human filtering. Additionally, to scale up data collection without prohibitive human labor costs, we explore automatic filtering for quality control. We demonstrate that both traditional methods and recent multimodal large language models (MLLMs) improve on the IDC task after training on the DiffTell dataset. Through extensive ablation studies, we provide a detailed analysis of the performance gains attributed to DiffTell. Experiments show that DiffTell significantly enhances the availability of resources for IDC research, offering a more comprehensive foundation and benchmark for future investigations.

Cite

Text

Di et al. "DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes." International Conference on Computer Vision, 2025.

Markdown

[Di et al. "DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/di2025iccv-difftell/)

BibTeX

@inproceedings{di2025iccv-difftell,
  title     = {{DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes}},
  author    = {Di, Zonglin and Shi, Jing and Fan, Yifei and Tan, Hao and Black, Alexander and Collomosse, John and Liu, Yang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {24580--24590},
  url       = {https://mlanthology.org/iccv/2025/di2025iccv-difftell/}
}