$\textit{Bifröst}$: 3D-Aware Image Compositing with Language Instructions

Abstract

This paper introduces $\textit{Bifröst}$, a novel 3D-aware framework built upon diffusion models for instruction-based image composition. Previous methods perform image compositing at the 2D level and therefore fall short in handling complex spatial relationships ($\textit{e.g.}$, occlusion). $\textit{Bifröst}$ addresses these issues by training a multimodal large language model (MLLM) as a 2.5D location predictor and by integrating depth maps as an extra condition during generation. This bridges the gap between 2D and 3D, enhancing spatial comprehension and supporting sophisticated spatial interactions. Our method begins by fine-tuning the MLLM on a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. The image-compositing model is then designed to process multiple types of input features, enabling high-fidelity compositions that account for occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that $\textit{Bifröst}$ significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios that demand intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by making effective use of existing resources.
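
To illustrate why a 2.5D location (a 2D box plus a depth value) is enough to reason about occlusion, the following is a minimal sketch and not the paper's diffusion-based compositing model. It assumes a per-pixel background depth map in which smaller values are closer to the camera; the function name `occlusion_mask`, the `(top, left, bottom, right)` box layout, and the toy scene are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Grid = List[List[float]]


def occlusion_mask(
    background_depth: Grid,
    box: Tuple[int, int, int, int],
    object_depth: float,
) -> List[List[bool]]:
    """Return a mask that is True where an inserted object would be visible.

    box is (top, left, bottom, right) in pixel indices, with bottom and
    right exclusive. Assumption: smaller depth means closer to the camera.
    """
    top, left, bottom, right = box
    height, width = len(background_depth), len(background_depth[0])
    mask = [[False] * width for _ in range(height)]
    # Clamp the box to the image so out-of-range boxes do not raise.
    for y in range(max(top, 0), min(bottom, height)):
        for x in range(max(left, 0), min(right, width)):
            # The object is visible only where it is nearer than the scene.
            mask[y][x] = object_depth < background_depth[y][x]
    return mask


# Toy 3x4 scene: a near foreground column (depth 1.0) at x == 2,
# and a far wall (depth 5.0) everywhere else.
scene_depth = [[5.0, 5.0, 1.0, 5.0] for _ in range(3)]

# Place an object at depth 2.0 in a box covering columns 1..3.
visible = occlusion_mask(scene_depth, box=(0, 1, 3, 4), object_depth=2.0)
```

In this toy scene the object is hidden wherever the foreground column is nearer than it (column 2) and visible in the rest of its box. A purely 2D box carries no depth value, so it cannot express this distinction.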

Cite

Text

Li et al. "$\textit{Bifr\"ost}$: 3D-Aware Image Compositing with Language Instructions." Neural Information Processing Systems, 2024. doi:10.52202/079017-4114

Markdown

[Li et al. "$\textit{Bifr\"ost}$: 3D-Aware Image Compositing with Language Instructions." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/li2024neurips-bifr/) doi:10.52202/079017-4114

BibTeX

@inproceedings{li2024neurips-bifr,
  title     = {{$\textit{Bifr\"ost}$: 3D-Aware Image Compositing with Language Instructions}},
  author    = {Li, Lingxiao and Gong, Kaixiong and Li, Weihong and Dai, Xili and Chen, Tao and Yuan, Xiaojun and Yue, Xiangyu},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-4114},
  url       = {https://mlanthology.org/neurips/2024/li2024neurips-bifr/}
}