Joint Optimization for 4D Human-Scene Reconstruction in the Wild

Abstract

Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, these prior methods struggle to reconstruct the natural and diverse human motion and scene context found in web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. Unlike prior works that optimize the human, the camera, and the scene separately, JOSH leverages human-scene contact constraints to jointly optimize all parameters in a single stage. Experimental results demonstrate that jointly optimizing scene geometry, human motion, and camera poses allows JOSH to significantly improve 4D human-scene reconstruction, global human motion estimation, and dense scene reconstruction. Further studies show that JOSH can enable scalable training of end-to-end global human motion models on extensive web data, highlighting its robustness and generalizability. The code and model are available at [https://vail-ucla.github.io/JOSH/](https://vail-ucla.github.io/JOSH/).
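To make the idea of contact-driven joint optimization concrete, the following is a minimal toy sketch, not JOSH's actual objective or code. It assumes a one-dimensional setting in which a monocular scene estimate is known only up to a scale and the human's vertical offset is unknown; foot-ground contacts then tie the two together, and both unknowns are solved in a single least-squares step. The function name `joint_fit`, its parameters, and the quadratic priors are all hypothetical choices made for illustration.

```python
def joint_fit(ground_rel, foot_local, scale_init=1.0, shift_init=0.0, prior_weight=1e-3):
    """Toy joint solve for a scene scale and a human vertical shift.

    Hypothetical setup (not taken from the paper):
      ground_rel[i] : up-to-scale ground height under contact i
      foot_local[i] : foot height in the human's local frame at contact i
    Minimizes
      sum_i (foot_local[i] + shift - scale * ground_rel[i])^2
      + prior_weight * ((scale - scale_init)^2 + (shift - shift_init)^2)
    """
    n = len(ground_rel)
    if n != len(foot_local):
        raise ValueError("each contact needs one ground and one foot height")

    # Sufficient statistics for the 2x2 normal equations.
    s_g = sum(ground_rel)
    s_gg = sum(g * g for g in ground_rel)
    s_f = sum(foot_local)
    s_gf = sum(g * f for g, f in zip(ground_rel, foot_local))
    w = prior_weight

    # Normal equations: [[a, b], [b, d]] @ [scale, shift] = [e, r]
    a, b, d = s_gg + w, -s_g, n + w
    e = s_gf + w * scale_init
    r = -s_f + w * shift_init

    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("contacts do not constrain both unknowns")

    # Cramer's rule: both unknowns are recovered in one stage.
    scale = (e * d - b * r) / det
    shift = (a * r - b * e) / det
    return scale, shift


# Synthetic contacts generated with scale 2.0 and shift 0.5.
ground = [0.0, 0.1, 0.25, 0.4]
feet = [2.0 * g - 0.5 for g in ground]
scale, shift = joint_fit(ground, feet, prior_weight=0.0)
```

In this toy problem, neither unknown can be determined from the scene or the human alone; it is the contact residual that couples them, which is the intuition behind optimizing them together rather than in separate stages.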

Cite

Text

Liu et al. "Joint Optimization for 4D Human-Scene Reconstruction in the Wild." International Conference on Learning Representations, 2026.

Markdown

[Liu et al. "Joint Optimization for 4D Human-Scene Reconstruction in the Wild." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liu2026iclr-joint/)

BibTeX

@inproceedings{liu2026iclr-joint,
  title     = {{Joint Optimization for 4D Human-Scene Reconstruction in the Wild}},
  author    = {Liu, Zhizheng and Lin, Joe and Wu, Wayne and Zhou, Bolei},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/liu2026iclr-joint/}
}