Locality Sensitive Avatars from Video

Abstract

We present a locality-sensitive avatar, a neural radiance field (NeRF)-based network for learning human motion from monocular videos. To this end, we estimate a canonical representation shared across the frames of a video via a non-linear mapping from observation to canonical space, which we decompose into a skeletal rigid motion and a non-rigid counterpart. Our key contribution is to retain fine-grained details by modeling the non-rigid part with a graph neural network (GNN) that keeps pose information local to neighboring body parts. Compared to prior canonical-representation-based methods that operate solely on the coordinate space of the whole shape, our locality-sensitive motion modeling reproduces both realistic shape contours and vivid fine-grained details. We evaluate on ZJU-MoCap, SynWild, ActorsHQ, MVHumanNet, and various outdoor videos. The experiments show that, with the locality-sensitive deformation to a canonical feature space, we are the first to achieve state-of-the-art results across novel view synthesis, novel pose animation, and 3D shape reconstruction simultaneously. Our code is available at https://github.com/ChunjinSong/lsavatar.
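To make the decomposition described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes a rigid term computed by inverse linear blend skinning and a non-rigid residual conditioned on per-body-part pose features aggregated by a small GNN over the skeleton graph. All class names, the 6-D per-part pose input, the single round of message passing, and the dominant-part feature gathering are illustrative assumptions; the released code at the repository above defines the actual architecture.

```python
# Hypothetical sketch (not the authors' code): decompose the observation-to-canonical
# mapping into a skeletal rigid term (inverse linear blend skinning) plus a non-rigid
# residual predicted from locally aggregated body-part pose features.
import torch
import torch.nn as nn


class BodyPartGNN(nn.Module):
    """One round of message passing over a fixed skeleton graph; each node is a body part."""

    def __init__(self, num_parts: int, adjacency: torch.Tensor, feat_dim: int = 64):
        super().__init__()
        # Row-normalized adjacency with self-loops keeps pose features local
        # to neighboring body parts instead of mixing the whole skeleton.
        adj = adjacency + torch.eye(num_parts)
        self.register_buffer("adj_norm", adj / adj.sum(dim=-1, keepdim=True))
        self.in_proj = nn.Linear(6, feat_dim)  # assumed per-part pose: axis-angle + translation
        self.msg = nn.Linear(feat_dim, feat_dim)
        self.out_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, part_pose: torch.Tensor) -> torch.Tensor:
        # part_pose: (P, 6) -> locally aggregated per-part features (P, feat_dim)
        h = torch.relu(self.in_proj(part_pose))
        h = torch.relu(self.adj_norm @ self.msg(h))  # aggregate only neighboring parts
        return self.out_proj(h)


class ObservationToCanonical(nn.Module):
    """x_canonical = rigid(x_obs) + non_rigid_residual(rigid(x_obs), local pose features)."""

    def __init__(self, num_parts: int, adjacency: torch.Tensor, feat_dim: int = 64):
        super().__init__()
        self.gnn = BodyPartGNN(num_parts, adjacency, feat_dim)
        self.residual_mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, 128), nn.ReLU(), nn.Linear(128, 3)
        )

    def forward(self, x_obs, skin_weights, bone_transforms, part_pose):
        # Rigid part: per-bone observation->canonical transforms blended by skinning weights.
        # x_obs: (N, 3), skin_weights: (N, P), bone_transforms: (P, 4, 4), part_pose: (P, 6)
        x_h = torch.cat([x_obs, torch.ones_like(x_obs[:, :1])], dim=-1)         # (N, 4)
        per_bone = torch.einsum("pij,nj->npi", bone_transforms, x_h)[..., :3]   # (N, P, 3)
        x_rigid = (skin_weights.unsqueeze(-1) * per_bone).sum(dim=1)            # (N, 3)

        # Non-rigid part: residual conditioned on the pose features of the point's
        # dominant body part (a simplification for this sketch).
        part_feats = self.gnn(part_pose)          # (P, feat_dim)
        dominant = skin_weights.argmax(dim=1)     # (N,)
        cond = part_feats[dominant]               # (N, feat_dim)
        return x_rigid + self.residual_mlp(torch.cat([x_rigid, cond], dim=-1))
```

A usage call would pass sampled observation-space points together with their skinning weights, the per-bone transforms of the current frame, and the frame's per-part pose; the returned canonical-space points then condition a NeRF that is shared across all frames.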

Cite

Text

Song et al. "Locality Sensitive Avatars from Video." International Conference on Learning Representations, 2025.

Markdown

[Song et al. "Locality Sensitive Avatars from Video." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/song2025iclr-locality/)

BibTeX

@inproceedings{song2025iclr-locality,
  title     = {{Locality Sensitive Avatars from Video}},
  author    = {Song, Chunjin and Wu, Zhijie and Su, Shih-Yang and Wandt, Bastian and Sigal, Leonid and Rhodin, Helge},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/song2025iclr-locality/}
}