VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Abstract

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a challenge due to fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capabilities, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating the end-to-end multimodal response speed. By comparing against state-of-the-art counterparts across image, video, and speech benchmarks, we demonstrate that our omni model achieves both strong visual and speech capabilities, enabling omni-modal understanding and interaction.
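
The multi-stage recipe described above can be pictured as a schedule that freezes and unfreezes sub-modules of an omni model from stage to stage. The sketch below is illustrative only: the module names (vision_encoder, speech_adapter, speech_decoder, etc.), the toy objective, and the stage list are assumptions rather than details from the paper. It shows the general pattern of progressively training vision- and speech-related components around a shared LLM, not the authors' actual configuration.

import torch
import torch.nn as nn

class OmniModel(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.vision_encoder = nn.Linear(dim, dim)   # stand-in for a vision encoder
        self.speech_encoder = nn.Linear(dim, dim)   # stand-in for an audio encoder
        self.vision_adapter = nn.Linear(dim, dim)   # projects vision features into LLM space
        self.speech_adapter = nn.Linear(dim, dim)   # projects speech features into LLM space
        self.llm = nn.Linear(dim, dim)              # stand-in for the language model
        self.speech_decoder = nn.Linear(dim, dim)   # stand-in for end-to-end speech generation

    def forward(self, image_feats, audio_feats):
        v = self.vision_adapter(self.vision_encoder(image_feats))
        s = self.speech_adapter(self.speech_encoder(audio_feats))
        h = self.llm(v + s)                         # fused multimodal hidden states
        return h, self.speech_decoder(h)            # text-side states and speech output

# Hypothetical stage schedule: each stage names the sub-modules that are trainable;
# everything else stays frozen so earlier capabilities are preserved.
STAGES = [
    ("vision-language alignment", ["vision_adapter"]),
    ("vision-language fine-tuning", ["vision_adapter", "llm"]),
    ("speech input alignment", ["speech_encoder", "speech_adapter"]),
    ("speech output tuning", ["speech_decoder"]),
]

def run_stage(model, trainable, steps=10, dim=64):
    # Freeze everything, then unfreeze only the modules scheduled for this stage.
    for p in model.parameters():
        p.requires_grad = False
    for name in trainable:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4
    )
    for _ in range(steps):
        image_feats = torch.randn(8, dim)           # toy batches stand in for real data
        audio_feats = torch.randn(8, dim)
        text_h, speech_out = model(image_feats, audio_feats)
        loss = text_h.pow(2).mean() + speech_out.pow(2).mean()  # placeholder objective
        opt.zero_grad()
        loss.backward()
        opt.step()

model = OmniModel()
for stage_name, modules in STAGES:
    print(f"stage: {stage_name} -> training {modules}")
    run_stage(model, modules)

The point this pattern illustrates is that parameters trained in earlier vision-language stages stay frozen while later stages add speech input and output, which is how a staged schedule can introduce a new modality without eroding previously acquired vision-language capabilities.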

Cite

Text

Fu et al. "VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction." Advances in Neural Information Processing Systems, 2025.

Markdown

[Fu et al. "VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/fu2025neurips-vita1/)

BibTeX

@inproceedings{fu2025neurips-vita1,
  title     = {{VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction}},
  author    = {Fu, Chaoyou and Lin, Haojia and Wang, Xiong and Zhang, YiFan and Shen, Yunhang and Liu, Xiaoyu and Cao, Haoyu and Long, Zuwei and Gao, Heting and Li, Ke and Ma, Long and Zheng, Xiawu and Ji, Rongrong and Sun, Xing and Shan, Caifeng and He, Ran},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/fu2025neurips-vita1/}
}