LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
Abstract
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
Cite
Text
Geng et al. "LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01766
Markdown
[Geng et al. "LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/geng2025cvpr-longvale/) doi:10.1109/CVPR52734.2025.01766
BibTeX
@inproceedings{geng2025cvpr-longvale,
title = {{LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos}},
author = {Geng, Tiantian and Zhang, Jinrui and Wang, Qingni and Wang, Teng and Duan, Jinming and Zheng, Feng},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {18959-18969},
doi = {10.1109/CVPR52734.2025.01766},
url = {https://mlanthology.org/cvpr/2025/geng2025cvpr-longvale/}
}