Synergistic Dual Spatial-Aware Generation of Image-to-Text and Text-to-Image
Abstract
In the visual spatial understanding (VSU) field, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we consider modeling the SI2T and ST2I together under a dual learning framework. During the dual framework, we then propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared and beneficial to both tasks. Further, inspired by the intuition that the easier 3D$\to$image and 3D$\to$text processes also exist symmetrically in the ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD$^3$) framework, which utilizes the intermediate features of the 3D$\to$X processes to guide the hard X$\to$3D processes, such that the overall ST2I and SI2T will benefit each other. On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly.Further in-depth analysis reveals how our dual learning strategy advances.
Cite
Text
Zhao et al. "Synergistic Dual Spatial-Aware Generation of Image-to-Text and Text-to-Image." Neural Information Processing Systems, 2024. doi:10.52202/079017-3888Markdown
[Zhao et al. "Synergistic Dual Spatial-Aware Generation of Image-to-Text and Text-to-Image." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhao2024neurips-synergistic/) doi:10.52202/079017-3888BibTeX
@inproceedings{zhao2024neurips-synergistic,
title = {{Synergistic Dual Spatial-Aware Generation of Image-to-Text and Text-to-Image}},
author = {Zhao, Yu and Fei, Hao and Li, Xiangtai and Qin, Libo and Ji, Jiayi and Zhu, Hongyuan and Zhang, Meishan and Zhang, Min and Wei, Jianguo},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-3888},
url = {https://mlanthology.org/neurips/2024/zhao2024neurips-synergistic/}
}