Multi-View Geometry-Aware Diffusion Transformer for Indoor Novel View Synthesis
Abstract
Recent advances in diffusion-based novel view synthesis for indoor scenes have attracted significant attention, particularly for generating views at target poses from a single source image. While existing methods produce plausible nearby views, they struggle to extrapolate to perspectives far beyond the input. Moreover, achieving multi-view consistency typically requires computationally expensive 3D priors, which limits scalability for long-range generation. In this paper, we propose a transformer-based latent diffusion model that integrates view-geometry constraints to enable long-range, consistent novel view synthesis. Our approach explicitly warps input-view feature maps into the denoised target view and conditions generation on a combination of epipolar-weighted source-image features, Plücker raymaps, and camera poses. This design enables semantically and geometrically coherent extrapolation of novel views in a single shot. We evaluate our model on the ScanNet and RealEstate10K datasets using diverse metrics for view quality and consistency. Experimental results demonstrate its superiority over existing methods, highlighting its potential for scalable, high-fidelity novel view synthesis in video generation.
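The Plücker-raymap conditioning mentioned in the abstract can be made concrete with a short sketch. The following is a minimal illustration (an assumption-laden sketch, not the authors' code) of how a per-pixel 6D Plücker raymap is commonly built from pinhole intrinsics K and a camera-to-world pose: each pixel is encoded by its unit ray direction d in world coordinates and the moment o x d, where o is the camera center. The function name plucker_raymap, the intrinsics, and the 64x64 resolution are illustrative choices, not taken from the paper.

# Minimal sketch of a Plücker raymap for geometric conditioning
# (illustrative only; names, shapes, and intrinsics are assumptions).
import numpy as np

def plucker_raymap(K, c2w, h, w):
    """Return an (h, w, 6) Plücker raymap for a pinhole camera.
    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose."""
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (h, w, 3)
    # Back-project to camera-space rays, then rotate into world space:
    # d_world = R @ (K^-1 p), written in row-vector form below.
    dirs = pix @ np.linalg.inv(K).T @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Plücker coordinates: unit direction d and moment m = o x d.
    origin = c2w[:3, 3]                                   # camera center o
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)        # (h, w, 6)

# Example: 64x64 raymap for an identity pose.
K = np.array([[64.0, 0.0, 32.0], [0.0, 64.0, 32.0], [0.0, 0.0, 1.0]])
print(plucker_raymap(K, np.eye(4), 64, 64).shape)         # (64, 64, 6)

The resulting 6-channel map is translation- and rotation-aware per pixel, which is why such raymaps are a common pose encoding for conditioning diffusion backbones; how exactly the paper fuses it with the epipolar-weighted features is described in the full text.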
Cite
Text
Kang et al. "Multi-View Geometry-Aware Diffusion Transformer for Indoor Novel View Synthesis." ICLR 2025 Workshops: DeLTa, 2025.

Markdown
[Kang et al. "Multi-View Geometry-Aware Diffusion Transformer for Indoor Novel View Synthesis." ICLR 2025 Workshops: DeLTa, 2025.](https://mlanthology.org/iclrw/2025/kang2025iclrw-multiview/)

BibTeX
@inproceedings{kang2025iclrw-multiview,
title = {{Multi-View Geometry-Aware Diffusion Transformer for Indoor Novel View Synthesis}},
author = {Kang, Xueyang and Xiang, Zhengkang and Zhang, Zezheng and Khoshelham, Kourosh},
booktitle = {ICLR 2025 Workshops: DeLTa},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/kang2025iclrw-multiview/}
}