Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Abstract

Recently, diffusion models have increasingly demonstrated their capabilities in visual understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase sentence-level localization, leaving the potential for leveraging contextual information for phrase-level understanding largely unexplored. In this paper, we utilize Panoptic Narrative Grounding (PNG) as a proxy task to investigate this capability further. PNG aims to segment object instances mentioned by multiple noun phrases within a given narrative text. Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion model's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps. The framework first identifies anchor points using cross-attention mechanisms and subsequently performs segmentation with self-attention to achieve zero-shot PNG. Moreover, we introduce a refinement module based on SAM to enhance the quality of the segmentation masks. Our extensive experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG setting, conclusively demonstrating the diffusion model's capability for context-aware, phrase-level understanding. Source code is available at https://github.com/nini0919/DiffPNG.
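The three-stage decomposition described above (cross-attention localization, self-attention segmentation, SAM refinement) can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the paper's implementation: the function name `diffpng_masks`, the tensor shapes, the top-k anchor heuristic, and the fixed threshold are all hypothetical, and the actual pipeline extracts attention maps from a frozen text-to-image diffusion U-Net.

```python
import numpy as np

def diffpng_masks(cross_attn, self_attn, phrase_token_ids, top_k=5, thresh=0.5):
    """Sketch of zero-shot phrase-level segmentation in the spirit of DiffPNG.

    Assumed (hypothetical) inputs:
      cross_attn:       (H*W, T) cross-attention from image patches to text
                        tokens, taken from a frozen diffusion U-Net.
      self_attn:        (H*W, H*W) self-attention among image patches.
      phrase_token_ids: list of token-index lists, one per noun phrase.
    Returns one boolean mask of shape (H*W,) per phrase.
    """
    masks = []
    for token_ids in phrase_token_ids:
        # 1) Localization: average the phrase's cross-attention columns and
        #    keep the top-k most attended patches as anchor points.
        phrase_attn = cross_attn[:, token_ids].mean(axis=1)
        anchors = np.argsort(phrase_attn)[-top_k:]

        # 2) Segmentation: propagate the anchors through self-attention, so
        #    patches that attend strongly to the anchors join the mask.
        score = self_attn[:, anchors].mean(axis=1)
        score = (score - score.min()) / (score.max() - score.min() + 1e-8)
        masks.append(score > thresh)
    return masks
```

The paper's SAM-based refinement stage could then be approximated by prompting SAM with points sampled from each coarse mask and keeping the returned high-quality mask; that step is omitted from this sketch.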

Cite

Text

Yang et al. "Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73668-1_10

Markdown

[Yang et al. "Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/yang2024eccv-exploring/) doi:10.1007/978-3-031-73668-1_10

BibTeX

@inproceedings{yang2024eccv-exploring,
  title     = {{Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model}},
  author    = {Yang, Danni and Dong, Ruohan and Ji, Jiayi and Ma, Yiwei and Wang, Haowei and Sun, Xiaoshuai and Ji, Rongrong},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73668-1_10},
  url       = {https://mlanthology.org/eccv/2024/yang2024eccv-exploring/}
}