SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning

Zhang, Xu; Yuan, Jin; Zhang, Hanwang; Zhong, Guojin; Zang, Yongsheng; Lin, Jiacheng; Li, Zhiyong

doi:10.1609/AAAI.V39I10.33113

AAAI 2025 pp. 10257-10265

Abstract

Controllable image semantic understanding tasks, such as captioning or segmentation, require users to input a prompt (e.g., text or bounding boxes) to predict a single outcome, which leads to problems such as costly prompt input or limited output information. This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning), which aims to translate a simple prompt, such as a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs, allowing users to select flexibly among the results. The task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks. Technically, we propose a novel Scene Graph Guided Diffusion Model (SGDiff) that leverages structured scene graph features for correlated mask-caption prediction. First, we introduce a Prompt-Centric Scene Graph Adaptor that maps a user's prompt to a scene graph, effectively capturing the user's intention. Next, we employ a diffusion process incorporating a Scene Graph Guided Bimodal Transformer to predict correlated caption-mask pairs by uncovering the intricate correlations between them. To ensure accurate alignment, we design a Multi-Entities Contrastive Learning loss that explicitly aligns visual and textual entities by considering inter-modal similarity, yielding well-aligned caption-mask pairs. Extensive experiments on two datasets demonstrate that SGDiff achieves superior performance in SegCaptioning, with promising results for both captioning and segmentation from minimal prompt input.
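
The abstract does not give the form of the Multi-Entities Contrastive Learning loss. As a rough illustration of the general idea of aligning visual and textual entities through inter-modal similarity, the sketch below implements a generic symmetric InfoNCE-style contrastive loss. It is not the paper's loss: the function names, the cosine similarity, the temperature value, and the assumption that the i-th visual entity pairs with the i-th textual entity are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _diagonal_cross_entropy(sim, temperature):
    """Mean cross-entropy where row i should match column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)


def symmetric_contrastive_loss(visual, textual, temperature=0.1):
    """Generic symmetric InfoNCE loss (assumed form, not the paper's).

    visual[i] and textual[i] are embeddings of the same entity; every
    other pairing is treated as a negative.
    """
    sim = [[cosine(v, t) for t in textual] for v in visual]
    sim_transposed = [list(col) for col in zip(*sim)]
    visual_to_text = _diagonal_cross_entropy(sim, temperature)
    text_to_visual = _diagonal_cross_entropy(sim_transposed, temperature)
    return 0.5 * (visual_to_text + text_to_visual)


# Toy embeddings: matching pairs give a much lower loss than swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is near zero when each visual entity is most similar to its own textual entity and grows when the pairing is swapped, which is the behaviour a contrastive alignment objective is meant to reward.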

Cite

Text

Zhang et al. "SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33113

Markdown

[Zhang et al. "SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhang2025aaai-sgdiff/) doi:10.1609/AAAI.V39I10.33113

BibTeX

@inproceedings{zhang2025aaai-sgdiff,
  title     = {{SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning}},
  author    = {Zhang, Xu and Yuan, Jin and Zhang, Hanwang and Zhong, Guojin and Zang, Yongsheng and Lin, Jiacheng and Li, Zhiyong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10257-10265},
  doi       = {10.1609/AAAI.V39I10.33113},
  url       = {https://mlanthology.org/aaai/2025/zhang2025aaai-sgdiff/}
}