VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Sung-Bin, Kim; Choi, Jeongsoo; Peng, Puyuan; Chung, Joon Son; Oh, Tae-Hyun; Harwath, David

ICCV 2025, pp. 14623-14632

Abstract

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.

Cite

Text

Sung-Bin et al. "VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models." International Conference on Computer Vision, 2025.

Markdown

[Sung-Bin et al. "VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/sungbin2025iccv-voicecraftdub/)

BibTeX

@inproceedings{sungbin2025iccv-voicecraftdub,
  title     = {{VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models}},
  author    = {Sung-Bin, Kim and Choi, Jeongsoo and Peng, Puyuan and Chung, Joon Son and Oh, Tae-Hyun and Harwath, David},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {14623--14632},
  url       = {https://mlanthology.org/iccv/2025/sungbin2025iccv-voicecraftdub/}
}