PICS: Pairwise Image Compositing with Spatial Interactions

Abstract

Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among fully or partially visible objects and the background. At its core, an Interaction Transformer employs a mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive $\alpha$-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both out-of-plane and in-plane pose changes of objects. Our method delivers superior pairwise compositing quality and substantially improved stability, with extensive evaluations across virtual try-on, indoor, and street scene settings showing consistent gains over state-of-the-art baselines. Code and data are available at https://github.com/RyanHangZhou/PICS.
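To make the region terminology concrete, the following is a minimal sketch of how two object masks can be partitioned into background, exclusive, and overlap regions, with an $\alpha$-blend applied only where the objects overlap. It is not the authors' implementation: the function names `partition_regions` and `composite_pair` are hypothetical, the blend is done directly in pixel space, and `alpha` is supplied by the caller here, whereas the abstract describes it as inferred by the model.

```python
import numpy as np

def partition_regions(mask_a, mask_b):
    """Split the image plane into four disjoint regions from two object masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    return {
        "background": ~a & ~b,   # covered by neither object
        "exclusive_a": a & ~b,   # covered only by object A
        "exclusive_b": b & ~a,   # covered only by object B
        "overlap": a & b,        # covered by both objects
    }

def composite_pair(background, obj_a, obj_b, mask_a, mask_b, alpha):
    """Compose two objects onto a background (all images are H x W x C).

    alpha is a scalar or an H x W array in [0, 1]; it weights object A in
    the overlap region. Assumption: alpha is given, not learned.
    """
    regions = partition_regions(mask_a, mask_b)
    alpha = np.broadcast_to(np.asarray(alpha, dtype=np.float64), mask_a.shape)

    out = background.astype(np.float64).copy()
    out[regions["exclusive_a"]] = obj_a[regions["exclusive_a"]]
    out[regions["exclusive_b"]] = obj_b[regions["exclusive_b"]]

    ov = regions["overlap"]
    w = alpha[ov][:, None]  # per-pixel weight, broadcast over channels
    out[ov] = w * obj_a[ov] + (1.0 - w) * obj_b[ov]
    return out
```

In the method described above, these regions are routed to dedicated experts inside the Interaction Transformer rather than filled by direct pixel copying; the sketch only illustrates the mask partition and the form of the overlap blend.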

Cite

Text

Zhou et al. "PICS: Pairwise Image Compositing with Spatial Interactions." International Conference on Learning Representations, 2026.

Markdown

[Zhou et al. "PICS: Pairwise Image Compositing with Spatial Interactions." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhou2026iclr-pics/)

BibTeX

@inproceedings{zhou2026iclr-pics,
  title     = {{PICS: Pairwise Image Compositing with Spatial Interactions}},
  author    = {Zhou, Hang and Zuo, Xinxin and Wang, Sen and Cheng, Li},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhou2026iclr-pics/}
}