LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Abstract

Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies on vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data pipeline for YouTube videos and their closed captions (CC), resulting in the Live-CC-10M pre-training set and the Live-WhisperX-408K high-quality supervised fine-tuning (SFT) set. Remarkably, even without SFT, the pre-trained model LiveCC-7B-Base demonstrates significant improvements in general video QA and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new benchmark, LiveSports-3K, using LLM-as-a-judge to measure free-form commentary. Experiments show our final model LiveCC-7B can surpass LLaVA-Video-72B in commentary quality even when working in real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B scale on popular benchmarks such as VideoMME, demonstrating its broad generalizability. All resources of this paper have been released at [showlab.github.io/livecc](https://showlab.github.io/livecc).
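The core of the streaming training approach is interleaving ASR words with video frames by timestamp. Below is a minimal, hypothetical Python sketch of that interleaving step; it is not the authors' released code, and all names (`Frame`, `Word`, `interleave_asr_frames`) are illustrative assumptions about how such a data layout could be built.

```python
# Hypothetical sketch: merge video frames and ASR words into one
# timestamp-ordered stream, so words follow the frame they are spoken over.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Frame:
    t: float        # frame timestamp in seconds
    frame_id: int   # index into the decoded video frames


@dataclass
class Word:
    t: float        # word start time from the ASR transcript
    text: str


def interleave_asr_frames(frames: List[Frame], words: List[Word]) -> List[Union[Frame, Word]]:
    """Merge frames and ASR words into a single timestamp-ordered sequence.

    On ties, the frame is emitted first, so each word appears after the
    latest frame whose timestamp does not exceed the word's start time.
    """
    frames = sorted(frames, key=lambda f: f.t)
    words = sorted(words, key=lambda w: w.t)
    sequence: List[Union[Frame, Word]] = []
    fi, wi = 0, 0
    while fi < len(frames) or wi < len(words):
        take_frame = wi >= len(words) or (fi < len(frames) and frames[fi].t <= words[wi].t)
        if take_frame:
            sequence.append(frames[fi])
            fi += 1
        else:
            sequence.append(words[wi])
            wi += 1
    return sequence


if __name__ == "__main__":
    frames = [Frame(t, i) for i, t in enumerate([0.0, 0.5, 1.0])]
    words = [Word(0.2, "the"), Word(0.4, "player"), Word(0.9, "shoots")]
    for item in interleave_asr_frames(frames, words):
        print(item)
```

In a real training pipeline, the interleaved sequence would then be tokenized, with frames mapped to visual tokens and words to text tokens, preserving this temporal order.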

Cite

Text

Chen et al. "LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02708

Markdown

[Chen et al. "LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/chen2025cvpr-livecc/) doi:10.1109/CVPR52734.2025.02708

BibTeX

@inproceedings{chen2025cvpr-livecc,
  title     = {{LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale}},
  author    = {Chen, Joya and Zeng, Ziyun and Lin, Yiqi and Li, Wei and Ma, Zejun and Shou, Mike Zheng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29083--29095},
  doi       = {10.1109/CVPR52734.2025.02708},
  url       = {https://mlanthology.org/cvpr/2025/chen2025cvpr-livecc/}
}