PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

Abstract

We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images, even with little visual overlap, while simultaneously estimating the relative camera poses in ~1.3 seconds on a single A100 GPU. PF-LRM is a highly scalable method utilizing self-attention blocks to exchange information between 3D object tokens and 2D image tokens; we predict a coarse point cloud for each view, and then use a differentiable Perspective-n-Point (PnP) solver to obtain camera poses. When trained on a large amount of multi-view posed data (~1M objects), PF-LRM shows strong cross-dataset generalization ability, and outperforms baseline methods by a large margin in terms of pose prediction accuracy and 3D reconstruction quality on various unseen evaluation datasets. We also demonstrate our model's applicability in downstream text/image-to-3D tasks with fast feed-forward inference. Our project website is at: https://totoro97.github.io/pf-lrm.
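To make the pose-recovery step concrete, below is a minimal sketch (not the authors' code) of how a camera pose can be obtained from a per-view point cloud via a PnP solve. The paper uses a differentiable PnP solver inside training; this sketch swaps in OpenCV's non-differentiable solvePnP purely to illustrate the geometric step, and the function name pose_from_pointcloud, the array shapes, and the intrinsics values are illustrative assumptions rather than details from the paper.

import cv2
import numpy as np

def pose_from_pointcloud(points_3d: np.ndarray,
                         points_2d: np.ndarray,
                         intrinsics: np.ndarray):
    """Estimate a rotation R and translation t such that the predicted
    3D points project onto their associated 2D pixel locations.

    points_3d:  (N, 3) coarse 3D points predicted for one view (assumption).
    points_2d:  (N, 2) pixel coordinates the points should project to,
                e.g. the centers of the image patches that produced them.
    intrinsics: (3, 3) camera intrinsics matrix K (assumed known).
    """
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        intrinsics.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        raise RuntimeError("PnP failed to converge")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle vector -> 3x3 rotation matrix
    return R, tvec.reshape(3)

# Usage on synthetic data: project points with a known (identity) pose,
# then check that the PnP solve recovers that pose.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = np.array([[500.0, 0.0, 128.0],
                  [0.0, 500.0, 128.0],
                  [0.0, 0.0, 1.0]])
    pts = rng.uniform(-1, 1, size=(64, 3)) + np.array([0.0, 0.0, 4.0])
    proj = (K @ pts.T).T
    uv = proj[:, :2] / proj[:, 2:3]
    R, t = pose_from_pointcloud(pts, uv, K)
    assert np.allclose(R, np.eye(3), atol=1e-4)
    assert np.allclose(t, 0.0, atol=1e-4)

In PF-LRM this step is made differentiable so that the pose-induced reprojection error can supervise the point-cloud prediction end-to-end; the sketch above only shows the inference-time geometry.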

Cite

Text

Wang et al. "PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction." International Conference on Learning Representations, 2024.

Markdown

[Wang et al. "PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/wang2024iclr-pflrm/)

BibTeX

@inproceedings{wang2024iclr-pflrm,
  title     = {{PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction}},
  author    = {Wang, Peng and Tan, Hao and Bi, Sai and Xu, Yinghao and Luan, Fujun and Sunkavalli, Kalyan and Wang, Wenping and Xu, Zexiang and Zhang, Kai},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/wang2024iclr-pflrm/}
}