SPAD: Spatially Aware Multi-View Diffusers
Abstract
We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high-quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g., MVDream) leads to content copying between views. Therefore, we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency, we utilize Plücker coordinates derived from camera rays and inject them as positional encoding. This enables SPAD to reason well over spatial proximity in 3D. Compared to concurrent works that can only generate views at fixed azimuth and elevation (e.g., MVDream, SyncDreamer), SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue.
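The abstract mentions injecting Plücker coordinates derived from camera rays as positional encodings. Below is a minimal sketch of that parameterization, assuming PyTorch and a per-pixel ray representation; the helper name and tensor layout are illustrative assumptions, not the paper's released code. A ray through origin o with unit direction d maps to the 6D Plücker coordinate (d, o × d).

import torch

def plucker_ray_embedding(origins: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    # Standard Plücker parameterization of a line: (d, o x d), where d is the
    # unit ray direction and o is any point on the ray (here, the camera center).
    # Hypothetical helper for illustration; SPAD's actual positional-encoding
    # pipeline may differ in normalization and tensor layout.
    d = torch.nn.functional.normalize(directions, dim=-1)  # (..., 3) unit directions
    m = torch.cross(origins, d, dim=-1)                    # (..., 3) moment vectors o x d
    return torch.cat([d, m], dim=-1)                       # (..., 6) Plücker coordinates

Because the moment o × d is invariant to sliding o along the ray, the resulting 6D code identifies the ray itself rather than a particular sample point, which is what makes it suitable as a camera-aware positional encoding.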
Cite
Text
Kant et al. "SPAD: Spatially Aware Multi-View Diffusers." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00956
Markdown
[Kant et al. "SPAD: Spatially Aware Multi-View Diffusers." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/kant2024cvpr-spad/) doi:10.1109/CVPR52733.2024.00956
BibTeX
@inproceedings{kant2024cvpr-spad,
title = {{SPAD: Spatially Aware Multi-View Diffusers}},
author = {Kant, Yash and Siarohin, Aliaksandr and Wu, Ziyi and Vasilkovsky, Michael and Qian, Guocheng and Ren, Jian and Guler, Riza Alp and Ghanem, Bernard and Tulyakov, Sergey and Gilitschenski, Igor},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {10026--10038},
doi = {10.1109/CVPR52733.2024.00956},
url = {https://mlanthology.org/cvpr/2024/kant2024cvpr-spad/}
}