SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models

Abstract

Speech-driven 3D facial animation aims to create lifelike facial expressions that synchronize accurately with speech. Despite significant progress, many existing methods focus on generating facial animation with a fixed emotional state, neglecting the diverse changes of facial emotion that a given speech input can induce. To address this issue, we explore the refined alignment between speech representations and multiple domains of facial expression information. We aim to disentangle spoken-language and emotional facial priors from speech, and use them to guide the refinement of facial vertices conditioned on the speech input. To this end, we propose ExpTalk, which first applies an Adaptive Disentanglement Variational Autoencoder (AD-VAE) to decouple facial expressions aligned with the spoken language and emotion of speech through contrastive learning. A Refined Alignment Diffusion (RAD) model then iteratively refines the decoupled facial expression priors through diffusion-based perturbations, producing facial animations that track the emotional variations of the given speech. Extensive experiments demonstrate the effectiveness of ExpTalk, which surpasses state-of-the-art methods by a large margin.
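The contrastive-learning step used to align speech and facial-expression embeddings can be illustrated with a generic symmetric InfoNCE objective. The sketch below is a hypothetical, minimal NumPy illustration of that general technique, not the paper's AD-VAE implementation; the function name, batch shapes, and temperature value are all assumptions.

```python
import numpy as np

def info_nce_loss(speech_emb, face_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Hypothetical sketch: matching (speech, face) rows are positive
    pairs; every other row in the batch serves as a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    logits = s @ f.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(s))          # diagonal entries are the positives

    def xent(l):
        # Cross-entropy of each row against its diagonal target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average both directions: speech-to-face and face-to-speech.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Training with such a loss pulls each speech embedding toward the facial-expression embedding of the same clip while pushing it away from the other clips in the batch, which is one standard way to obtain the kind of aligned, disentangled priors the abstract describes.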

Cite

Text

Zuo et al. "SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/202

Markdown

[Zuo et al. "SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/zuo2024ijcai-scenediff/) doi:10.24963/ijcai.2024/202

BibTeX

@inproceedings{zuo2024ijcai-scenediff,
  title     = {{SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models}},
  author    = {Zuo, Ran and Hu, Haoxiang and Deng, Xiaoming and Gao, Cangjun and Zhang, Zhengming and Lai, Yu-Kun and Ma, Cuixia and Liu, Yong-Jin and Wang, Hongan},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {1825--1833},
  doi       = {10.24963/ijcai.2024/202},
  url       = {https://mlanthology.org/ijcai/2024/zuo2024ijcai-scenediff/}
}