MODA: Mapping-Once Audio-Driven Portrait Animation with Dual Attentions
Abstract
Audio-driven portrait animation aims to synthesize portrait videos conditioned on a given audio signal. Animating high-fidelity and multimodal video portraits has a variety of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training separate models or sampling signals from given videos. However, the lack of correlation learning between lip-sync and other movements (e.g., head pose and eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages: 1) a Mapping-Once network with Dual Attentions (MODA) generates a talking representation from the given audio; within MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities; 2) a facial composer network generates dense and detailed face landmarks; and 3) a temporally guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits compared to previous methods.
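The abstract describes the dual-attention idea only at a high level. Below is a minimal, illustrative PyTorch sketch (not the authors' implementation) of one way two parallel attention branches could map shared audio features to lip motion and to other movements such as head pose and eye blinking; all module names, feature dimensions, and output parameterizations here are assumptions for illustration.

```python
# Illustrative sketch only: two attention branches over shared audio features,
# one specialized for lip-sync motion and one for other movements.
# Dimensions and output sizes are arbitrary assumptions, not the paper's.
import torch
import torch.nn as nn


class DualAttentionSketch(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Each branch self-attends over the audio features with its own parameters,
        # so the two branches can specialize in different motion modes.
        self.lip_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.other_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lip_head = nn.Linear(dim, 40 * 3)  # e.g., 40 mouth landmarks (x, y, z) - assumed
        self.other_head = nn.Linear(dim, 6)     # e.g., head pose + blink parameters - assumed

    def forward(self, audio_feat):
        # audio_feat: (batch, frames, dim) features from an upstream audio encoder.
        lip_ctx, _ = self.lip_attn(audio_feat, audio_feat, audio_feat)
        other_ctx, _ = self.other_attn(audio_feat, audio_feat, audio_feat)
        return self.lip_head(lip_ctx), self.other_head(other_ctx)


feats = torch.randn(2, 100, 256)  # 2 clips, 100 audio frames
lip_motion, other_motion = DualAttentionSketch()(feats)
print(lip_motion.shape, other_motion.shape)  # torch.Size([2, 100, 120]) torch.Size([2, 100, 6])
```

In the paper's pipeline, outputs of this kind would then feed the facial composer network for dense landmarks and the temporally guided renderer for the final video; the sketch above covers only the first stage's branching idea.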
Cite
Text
Liu et al. "MODA: Mapping-Once Audio-Driven Portrait Animation with Dual Attentions." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.02104

Markdown

[Liu et al. "MODA: Mapping-Once Audio-Driven Portrait Animation with Dual Attentions." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/liu2023iccv-moda/) doi:10.1109/ICCV51070.2023.02104

BibTeX
@inproceedings{liu2023iccv-moda,
title = {{MODA: Mapping-Once Audio-Driven Portrait Animation with Dual Attentions}},
author = {Liu, Yunfei and Lin, Lijian and Yu, Fei and Zhou, Changyin and Li, Yu},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {23020-23029},
doi = {10.1109/ICCV51070.2023.02104},
url = {https://mlanthology.org/iccv/2023/liu2023iccv-moda/}
}