MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

Abstract

We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g., Panoptic Studio); such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in the wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g., four equidistant, inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to the limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on Panoptic Studio and Ego-Exo4D demonstrate that our method achieves higher-quality reconstructions than prior art, particularly when rendering novel views.
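The core step described above, fusing independent per-camera monocular reconstructions into one consistent scene, can be illustrated with a similarity alignment. The sketch below is not the authors' code: it assumes 3D correspondences between a camera's monocular point cloud and a shared reference frame are already available (e.g., from triangulated keypoints), and umeyama_alignment is a hypothetical helper implementing the standard Umeyama solution for scale, rotation, and translation.

# Minimal sketch (assumed setup, not the paper's implementation): align one
# camera's monocular point cloud to a shared world frame via a similarity
# transform estimated from 3D correspondences.
import numpy as np

def umeyama_alignment(src: np.ndarray, dst: np.ndarray):
    """Estimate s, R, t minimizing ||dst - (s * R @ src + t)||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)            # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                            # correct for reflections
    R = U @ S @ Vt                              # optimal rotation
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src      # optimal isotropic scale
    t = mu_dst - s * R @ mu_src                 # optimal translation
    return s, R, t

# Synthetic usage: a "monocular" point cloud that is an offset, rescaled copy
# of the reference points; alignment recovers the shared world frame.
rng = np.random.default_rng(0)
pts_world = rng.normal(size=(100, 3))           # reference 3D points
pts_cam = (pts_world - 1.0) / 0.5               # per-camera monocular frame
s, R, t = umeyama_alignment(pts_cam, pts_world)
fused = (s * (R @ pts_cam.T)).T + t             # mapped into the world frame
assert np.allclose(fused, pts_world, atol=1e-6)

In practice a per-camera transform like this would only be a coarse initialization; producing time- and view-consistent 4D reconstructions additionally requires resolving the per-frame scale ambiguity of monocular depth, which the sketch does not model.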

Cite

Text

Wang et al. "MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion." International Conference on Computer Vision, 2025.

Markdown

[Wang et al. "MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/wang2025iccv-monofusion/)

BibTeX

@inproceedings{wang2025iccv-monofusion,
  title     = {{MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion}},
  author    = {Wang, Zihan and Tan, Jeff and Khurana, Tarasha and Peri, Neehar and Ramanan, Deva},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {8252--8263},
  url       = {https://mlanthology.org/iccv/2025/wang2025iccv-monofusion/}
}