StyleLipSync: Style-Based Personalized Lip-Sync Video Generation

Abstract

In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate an identity-agnostic lip-synchronized video from arbitrary audio. To generate videos of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, in which video consistency can also be enforced with a linear transformation. In contrast to previous lip-sync methods, we introduce a pose-aware masking scheme that dynamically locates the mask for each frame using a 3D parametric mesh predictor, improving naturalness across frames. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual details. Extensive experiments demonstrate that our model generates accurate lip-sync videos even in the zero-shot setting and can capture the characteristics of an unseen face from only a few seconds of target video through the proposed adaptation method.
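To make the few-shot adaptation idea concrete, below is a minimal PyTorch-style sketch (not the authors' code) of fine-tuning a lip-sync generator on a few seconds of target video while a sync regularizer keeps its audio-driven output close to a frozen pre-trained copy, preserving lip-sync generalization. All module and parameter names here (`Generator`, `adapt_few_shot`, `sync_reg_weight`) are illustrative assumptions, not the paper's actual interfaces.

```python
# Hedged sketch of few-shot lip-sync adaptation with a sync regularizer.
# Reconstruction loss fits the target person's frames; the regularizer penalizes
# deviation from the frozen pre-trained generator on the same audio.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Stand-in for a StyleGAN-based lip-sync generator (audio + reference -> frame)."""
    def __init__(self, audio_dim=128, img_ch=3):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, 64 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 + img_ch, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_ch, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, audio_feat, ref_frame):
        b = audio_feat.size(0)
        a = self.audio_proj(audio_feat).view(b, 64, 8, 8)
        r = F.interpolate(ref_frame, size=(8, 8))
        return self.decoder(torch.cat([a, r], dim=1))  # (B, 3, 32, 32)

def adapt_few_shot(generator, target_clips, sync_reg_weight=1.0, steps=200, lr=1e-4):
    """Fine-tune `generator` on a few (audio, reference, ground-truth) triplets."""
    frozen = copy.deepcopy(generator).eval()       # frozen pre-trained copy
    for p in frozen.parameters():
        p.requires_grad_(False)

    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for step in range(steps):
        audio, ref, gt = target_clips[step % len(target_clips)]
        pred = generator(audio, ref)
        recon = F.l1_loss(pred, gt)                # learn person-specific appearance
        with torch.no_grad():
            pred_frozen = frozen(audio, ref)
        sync_reg = F.l1_loss(pred, pred_frozen)    # preserve pre-trained lip-sync behavior
        loss = recon + sync_reg_weight * sync_reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator

# Toy usage with random tensors (shapes only illustrate the sketch).
gen = Generator()
clips = [(torch.randn(1, 128), torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32))]
adapt_few_shot(gen, clips, steps=10)
```

In this sketch, the regularizer weight trades off person-specific detail against how tightly the adapted generator tracks the pre-trained model's lip motion; the actual regularization used in the paper may differ.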

Cite

Text

Ki and Min. "StyleLipSync: Style-Based Personalized Lip-Sync Video Generation." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.02088

Markdown

[Ki and Min. "StyleLipSync: Style-Based Personalized Lip-Sync Video Generation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/ki2023iccv-stylelipsync/) doi:10.1109/ICCV51070.2023.02088

BibTeX

@inproceedings{ki2023iccv-stylelipsync,
  title     = {{StyleLipSync: Style-Based Personalized Lip-Sync Video Generation}},
  author    = {Ki, Taekyung and Min, Dongchan},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {22841--22850},
  doi       = {10.1109/ICCV51070.2023.02088},
  url       = {https://mlanthology.org/iccv/2023/ki2023iccv-stylelipsync/}
}