Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs with temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, they can acquire temporal understanding capability in video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising 114,081 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarsely annotated audio-visual dataset VALOR through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we develop AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
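To make the pseudo-untrimmed construction described above concrete, the sketch below shows one plausible way to turn a cluster of trimmed, event-level clips into a single pseudo-untrimmed video with per-event timestamps: each clip's duration is randomly scaled, the clips are permuted, and the concatenation order yields start/end annotations. This is a minimal, hypothetical illustration; the Clip and Event structures, the build_pseudo_untrimmed function, and the scale range are assumptions for exposition, not the authors' released pipeline.

# Hypothetical sketch of pseudo-untrimmed video construction (not the authors' code).
import random
from dataclasses import dataclass

@dataclass
class Clip:
    caption: str      # event description from the coarse-annotated source dataset
    duration: float   # original clip length in seconds

@dataclass
class Event:
    caption: str
    start: float      # start time within the pseudo-untrimmed video
    end: float        # end time within the pseudo-untrimmed video

def build_pseudo_untrimmed(clips, scale_range=(0.5, 1.5), seed=None):
    """Permute clips, randomly scale their durations, and concatenate them,
    recording each event's start/end time as the temporal annotation."""
    rng = random.Random(seed)
    order = rng.sample(range(len(clips)), len(clips))        # random permutation
    events, cursor = [], 0.0
    for idx in order:
        clip = clips[idx]
        scaled = clip.duration * rng.uniform(*scale_range)    # random temporal scaling
        events.append(Event(clip.caption, round(cursor, 2), round(cursor + scaled, 2)))
        cursor += scaled
    return events, cursor                                     # annotations, total length

if __name__ == "__main__":
    # A toy event cluster; in practice clips would come from event-based clustering of VALOR.
    cluster = [
        Clip("a dog barks at the door", 4.0),
        Clip("a man plays the guitar", 6.5),
        Clip("rain falls on a window", 5.0),
    ]
    annotations, total = build_pseudo_untrimmed(cluster, seed=0)
    for ev in annotations:
        print(f"[{ev.start:6.2f}s - {ev.end:6.2f}s] {ev.caption}")
    print(f"total length: {total:.2f}s")

Under this kind of construction, the printed intervals serve as the ground-truth temporal annotations that pair each event caption with a time span in the assembled video.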
Cite
Text
Tang et al. "Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I7.32784
Markdown
[Tang et al. "Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/tang2025aaai-empowering/) doi:10.1609/AAAI.V39I7.32784
BibTeX
@inproceedings{tang2025aaai-empowering,
title = {{Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding}},
author = {Tang, Yunlong and Shimada, Daiki and Bi, Jing and Feng, Mingqian and Hua, Hang and Xu, Chenliang},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {7293--7301},
doi = {10.1609/AAAI.V39I7.32784},
url = {https://mlanthology.org/aaai/2025/tang2025aaai-empowering/}
}