LLM-Driven Multimodal and Multi-Identity Listening Head Generation

Abstract

Generating natural listener responses in conversational scenarios is crucial for creating engaging digital humans and avatars. Recent work has shown that large language models (LLMs) can be effectively leveraged for this task, demonstrating remarkable capabilities in generating contextually appropriate listener behaviors. However, current LLM-based methods face two critical limitations: they rely solely on speech content, overlooking other crucial communication signals, and they entangle listener identity with response generation, compromising output fidelity and generalization. In this work, we present a novel framework that addresses these limitations while maintaining the advantages of LLMs. Our approach introduces a Multimodal-LM architecture that jointly processes speech content, acoustics, and speaker emotion, capturing the full spectrum of communication cues. Additionally, we propose an identity disentanglement strategy using instance normalization and adaptive instance normalization in a VQ-VAE framework, enabling high-fidelity listening head synthesis with flexible identity control. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of response naturalness and fidelity, while enabling effective identity control without retraining.
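The identity disentanglement idea mentioned in the abstract builds on two standard operations: instance normalization, which strips per-channel statistics from a feature map, and adaptive instance normalization (AdaIN), which re-injects the statistics of a reference. The following is a minimal NumPy sketch of those two operations only, not the authors' implementation. The feature layout `(channels, time)`, the function names, and the random toy inputs are all assumptions made for illustration; the abstract does not specify how the operations are placed within the VQ-VAE.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """Normalize each channel of x (channels, time) to zero mean, unit variance.

    Assumption: statistics are taken over the time axis, per channel.
    """
    mean = x.mean(axis=-1, keepdims=True)
    std = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mean) / std


def adain(content, style, eps=1e-5):
    """Adaptive instance normalization.

    Removes the per-channel statistics of `content`, then applies the
    per-channel mean and standard deviation of `style`.
    """
    style_mean = style.mean(axis=-1, keepdims=True)
    style_std = np.sqrt(style.var(axis=-1, keepdims=True) + eps)
    return instance_norm(content, eps) * style_std + style_mean


# Hypothetical toy features: 4 channels, different sequence lengths.
rng = np.random.default_rng(0)
motion = rng.normal(2.0, 3.0, size=(4, 50))         # stands in for listener motion features
identity_ref = rng.normal(-1.0, 0.5, size=(4, 80))  # stands in for a target identity's features

normalized = instance_norm(motion)      # per-channel statistics removed
restyled = adain(motion, identity_ref)  # reference statistics applied
```

In this sketch, `normalized` no longer carries the per-channel mean and scale of `motion`, while `restyled` takes on the per-channel mean and scale of `identity_ref`. This mirrors, in a simplified way, how normalization statistics can serve as a handle for swapping identity without retraining.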

Cite

Text

Lai et al. "LLM-Driven Multimodal and Multi-Identity Listening Head Generation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00996

Markdown

[Lai et al. "LLM-Driven Multimodal and Multi-Identity Listening Head Generation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/lai2025cvpr-llmdriven/) doi:10.1109/CVPR52734.2025.00996

BibTeX

@inproceedings{lai2025cvpr-llmdriven,
  title     = {{LLM-Driven Multimodal and Multi-Identity Listening Head Generation}},
  author    = {Lai, Peiwen and Zhong, Weizhi and Qin, Yipeng and Ren, Xiaohang and Wang, Baoyuan and Li, Guanbin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {10656--10666},
  doi       = {10.1109/CVPR52734.2025.00996},
  url       = {https://mlanthology.org/cvpr/2025/lai2025cvpr-llmdriven/}
}