1DFormer: A Transformer Architecture Learning 1d Landmark Representations for Facial Landmark Tracking

Yin, Shi; Huan, Shijie; Wang, Shangfei; Hu, Jinshui; Guo, Tao; Yin, Bing; Yin, Baocai; Liu, Cong

doi:10.24963/ijcai.2024/176

IJCAI 2024 pp. 1588-1597

Abstract

Generating high-fidelity talking heads that maintain stable head poses and achieve robust lip sync remains a significant challenge. Although methods based on 3D Gaussian Splatting (3DGS) offer a promising solution via point-based deformation, they suffer from inconsistent head dynamics and mismatched mouth movements due to unstable Gaussian initialization and incomplete speech features. To overcome these limitations, we introduce SyncGaussian, a 3DGS-based framework that ensures stable head poses, enhanced lip sync, and realistic appearances with real-time rendering. SyncGaussian employs a stable head Gaussian initialization strategy to mitigate head jitter by optimizing commonly used rough head pose parameters. To enhance lip sync, we propose a sync-enhanced encoder that leverages audio-to-text and audio-to-visual speech features. Guided by a tailored cosine similarity loss function, the encoder integrates discriminative speech features through a multi-level sync adaptation mechanism, enabling the learning of an adaptive speech feature space. Extensive experiments demonstrate that SyncGaussian outperforms state-of-the-art methods in image quality, dynamic motion, and lip sync, with the potential for real-time applications.

Cite

Text

Yin et al. "1DFormer: A Transformer Architecture Learning 1d Landmark Representations for Facial Landmark Tracking." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/176

Markdown

[Yin et al. "1DFormer: A Transformer Architecture Learning 1d Landmark Representations for Facial Landmark Tracking." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/yin2024ijcai-dformer/) doi:10.24963/ijcai.2024/176

BibTeX

@inproceedings{yin2024ijcai-dformer,
  title     = {{1DFormer: A Transformer Architecture Learning 1d Landmark Representations for Facial Landmark Tracking}},
  author    = {Yin, Shi and Huan, Shijie and Wang, Shangfei and Hu, Jinshui and Guo, Tao and Yin, Bing and Yin, Baocai and Liu, Cong},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {1588--1597},
  doi       = {10.24963/ijcai.2024/176},
  url       = {https://mlanthology.org/ijcai/2024/yin2024ijcai-dformer/}
}