Adaptive Multimodal Fusion: Dynamic Attention Allocation for Intent Recognition

Hu, Bo; Zhang, Kai; Zhang, Yanghai; Ye, Yuyang

doi:10.1609/AAAI.V39I16.33898

AAAI 2025 pp. 17267-17275

Abstract

In recent years, deep multimodal learning has seen significant advancements. However, there remains a lack of multimodal fusion methods capable of dynamically adjusting the weighting of information both within and across modalities based on input samples. In the domain of multimodal intent recognition, the text modality often contains the most relevant information for intent detection, while the audio and visual modalities provide comparatively less critical information. There is a significant variation in the density of important information across different modalities and samples. To address this challenge, we propose a Dynamic Attention Allocation Fusion (DAF) method with an adaptive network structure that dynamically allocates attention both within individual modalities and across multiple modalities. This approach enables the model to focus more effectively on the most informative modalities and their respective internal features. Furthermore, we introduce a multi-view contrastive learning framework based on DAF (MVCL-DAF). This framework uses distinct and isolated modules to process information from various modalities, taking inspiration from the way the human brain processes multimodal information. Each modality independently infers intent using its respective module, while DAF integrates the multimodal information to produce a comprehensive global intent prediction. The text modality, functioning as the primary modality due to its rich semantic content, guides the other modules in the multi-view contrastive learning process. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods.
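The abstract's core idea, sample-dependent weighting both within and across modalities, can be illustrated with a toy two-level attention pooling. This is a minimal sketch under stated assumptions, not the authors' DAF implementation: the function names `attend` and `fuse`, the fixed `query` vector (which a real model would learn), and the dot-product scoring are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(vectors, query):
    """Pool vectors with weights softmax(dot(vector, query)).

    Returns the pooled vector and the attention weights.
    """
    weights = softmax([dot(v, query) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


def fuse(sample, query):
    """Fuse one sample; `sample` maps modality name -> list of feature vectors.

    Assumption: all modalities share one feature dimension and one query.
    """
    names = list(sample)
    # Level 1 (within modality): pool each modality's sequence into one vector.
    pooled = [attend(sample[name], query)[0] for name in names]
    # Level 2 (across modalities): weight the pooled vectors for this sample.
    fused, weights = attend(pooled, query)
    return fused, dict(zip(names, weights))


# Made-up toy features: text aligns with the query, audio and video do not.
sample = {
    "text": [[2.0, 0.0], [1.0, 0.0]],
    "audio": [[0.0, 1.0]],
    "video": [[0.0, 0.5], [0.0, 0.0]],
}
fused, modality_weights = fuse(sample, query=[1.0, 0.0])
print(modality_weights)
```

Because the weights are recomputed from each sample's features, a sample whose audio features align with the query would shift weight toward audio instead; that input dependence is the property the abstract emphasizes.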

Cite

Text

Hu et al. "Adaptive Multimodal Fusion: Dynamic Attention Allocation for Intent Recognition." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I16.33898

Markdown

[Hu et al. "Adaptive Multimodal Fusion: Dynamic Attention Allocation for Intent Recognition." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/hu2025aaai-adaptive-a/) doi:10.1609/AAAI.V39I16.33898

BibTeX

@inproceedings{hu2025aaai-adaptive-a,
  title     = {{Adaptive Multimodal Fusion: Dynamic Attention Allocation for Intent Recognition}},
  author    = {Hu, Bo and Zhang, Kai and Zhang, Yanghai and Ye, Yuyang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {17267--17275},
  doi       = {10.1609/AAAI.V39I16.33898},
  url       = {https://mlanthology.org/aaai/2025/hu2025aaai-adaptive-a/}
}