Video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Abstract
Speech understanding, as an element of the more general task of video understanding with audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing that can understand not only visual frame sequences, audio events, and music, but speech as well. To obtain the fine-grained temporal information required by speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to avoid frame or modality dominance. On the introduced audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvement on the video-QA task and over 30% absolute accuracy improvement on audio-visual QA tasks with human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented for other av-LLMs. Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/
Cite
Text
Sun et al. "Video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models." International Conference on Machine Learning, 2024.
Markdown
[Sun et al. "Video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/sun2024icml-videosalmonn/)
BibTeX
@inproceedings{sun2024icml-videosalmonn,
title = {{Video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models}},
author = {Sun, Guangzhi and Yu, Wenyi and Tang, Changli and Chen, Xianzhao and Tan, Tian and Li, Wei and Lu, Lu and Ma, Zejun and Wang, Yuxuan and Zhang, Chao},
booktitle = {International Conference on Machine Learning},
year = {2024},
pages = {47198-47217},
volume = {235},
url = {https://mlanthology.org/icml/2024/sun2024icml-videosalmonn/}
}