Multimodal Class-Aware Semantic Enhancement Network for Audio-Visual Video Parsing

Zhao, Pengcheng; Zhou, Jinxing; Zhao, Yang; Guo, Dan; Chen, Yanxiang

doi:10.1609/AAAI.V39I10.33134

AAAI 2025 pp. 10448-10456

Abstract

The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. We further design a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM exploits inter-class dependencies of concurrent events within the same timestamp with the aid of a novel event co-occurrence loss. The LGSF further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in new state-of-the-art parsing performance.
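
To make the task output described in the abstract concrete, the following is a minimal sketch of how per-segment, per-class decisions for each modality could be turned into temporally localized events. It does not reproduce the paper's network (CAFD, SECM, and LGSF are not implemented here). The function names `segments_to_events` and `parse_video`, the 0/1 input format, and the one-second segment length are all assumptions made for illustration.

```python
from itertools import groupby


def segments_to_events(active, segment_seconds=1.0):
    """Merge runs of consecutive active segments into (start, end) intervals.

    `active` is a list of 0/1 flags, one per segment (assumed input format).
    """
    events = []
    index = 0
    for is_active, run in groupby(active):
        length = len(list(run))
        if is_active:
            start = index * segment_seconds
            end = (index + length) * segment_seconds
            events.append((start, end))
        index += length
    return events


def parse_video(audio, visual, segment_seconds=1.0):
    """Build audio, visual, and audio-visual events for every class.

    `audio` and `visual` map a class name to its per-segment 0/1 flags
    (hypothetical format). A class missing from one modality is treated
    as inactive in that modality.
    """
    result = {}
    for name in sorted(set(audio) | set(visual)):
        n = len(audio.get(name) or visual.get(name))
        a = audio.get(name, [0] * n)
        v = visual.get(name, [0] * n)
        # An audio-visual event requires the class to be active in both streams.
        both = [1 if (x and y) else 0 for x, y in zip(a, v)]
        result[name] = {
            "audio": segments_to_events(a, segment_seconds),
            "visual": segments_to_events(v, segment_seconds),
            "audio_visual": segments_to_events(both, segment_seconds),
        }
    return result


# Toy input: five segments, two event classes (made-up values).
audio_flags = {"dog": [1, 1, 0, 0, 1], "speech": [0, 1, 1, 1, 0]}
visual_flags = {"dog": [0, 1, 1, 0, 1]}
parsed = parse_video(audio_flags, visual_flags)
```

In this toy input, `dog` is audible in segments 0-1 and 4 and visible in segments 1-2 and 4, so its audio-visual events cover only the overlap, while `speech` is audio-only. Several classes can be active in the same segment, which is the situation the abstract describes as producing semantically mixed segment features.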

Cite

Text

Zhao et al. "Multimodal Class-Aware Semantic Enhancement Network for Audio-Visual Video Parsing." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33134

Markdown

[Zhao et al. "Multimodal Class-Aware Semantic Enhancement Network for Audio-Visual Video Parsing." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhao2025aaai-multimodal/) doi:10.1609/AAAI.V39I10.33134

BibTeX

@inproceedings{zhao2025aaai-multimodal,
  title     = {{Multimodal Class-Aware Semantic Enhancement Network for Audio-Visual Video Parsing}},
  author    = {Zhao, Pengcheng and Zhou, Jinxing and Zhao, Yang and Guo, Dan and Chen, Yanxiang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10448-10456},
  doi       = {10.1609/AAAI.V39I10.33134},
  url       = {https://mlanthology.org/aaai/2025/zhao2025aaai-multimodal/}
}