Language-Guided Audio-Visual Learning for Long-Term Sports Assessment
Abstract
Long-term sports assessment is a challenging task in video understanding since it requires judging complex movement variations and action-music coordination. However, there is no direct correlation between the diverse background music and movements in sporting events. Previous works require a large number of model parameters to learn potential associations between actions and music. To address this issue, we propose a language-guided audio-visual learning (MLAVL) framework that models "audio-action-visual" correlations guided by the low-cost language modality. In our framework, multidimensional domain-based actions form action knowledge graphs, motivating the audio-visual modalities to focus on task-relevant actions. We further design a shared-specific context encoder to integrate deep multimodal semantics, and an audio-visual cross-modal fusion module to evaluate action-music consistency. To match the sport's rules, we then propose a dual-branch prompt-guided grading module to weigh both visual and audio-visual performance. Extensive experiments demonstrate that our approach achieves state-of-the-art results on four public long-term sports benchmarks while maintaining a low parameter count. Our code is available at https://github.com/XuHuangbiao/MLAVL.
Cite
Text
Xu et al. "Language-Guided Audio-Visual Learning for Long-Term Sports Assessment." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02232
Markdown
[Xu et al. "Language-Guided Audio-Visual Learning for Long-Term Sports Assessment." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/xu2025cvpr-languageguided/) doi:10.1109/CVPR52734.2025.02232
BibTeX
@inproceedings{xu2025cvpr-languageguided,
title = {{Language-Guided Audio-Visual Learning for Long-Term Sports Assessment}},
author = {Xu, Huangbiao and Ke, Xiao and Wu, Huanqi and Xu, Rui and Li, Yuezhou and Guo, Wenzhong},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {23967-23977},
doi = {10.1109/CVPR52734.2025.02232},
url = {https://mlanthology.org/cvpr/2025/xu2025cvpr-languageguided/}
}