PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement

Abstract

Scene rearrangement, such as table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages a perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop an object-level representation that integrates generation, segmentation, and feature encoding into a single step. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down settings. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy of 87% and an average execution success rate of 67%.

Cite

Text

Jin et al. "PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Jin et al. "PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/jin2025wacv-paca/)

BibTeX

@inproceedings{jin2025wacv-paca,
  title     = {{PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement}},
  author    = {Jin, Shutong and Wang, Ruiyu and Chen, Kuangyi and Pokorny, Florian T.},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {6559--6569},
  url       = {https://mlanthology.org/wacv/2025/jin2025wacv-paca/}
}