Adaptive Multimodal Fusion: Dynamic Attention Allocation for Intent Recognition

Abstract

Deep multimodal learning has advanced significantly in recent years, yet few multimodal fusion methods can dynamically adjust the weighting of information both within and across modalities for each input sample. In multimodal intent recognition, the text modality often contains the information most relevant to intent detection, while the audio and visual modalities provide comparatively less critical information. The density of important information also varies significantly across modalities and samples. To address this challenge, we propose a Dynamic Attention Allocation Fusion (DAF) method with an adaptive network structure that dynamically allocates attention both within individual modalities and across multiple modalities. This approach enables the model to focus more effectively on the most informative modalities and their respective internal features. Furthermore, we introduce a multi-view contrastive learning framework based on DAF (MVCL-DAF). Inspired by the way the human brain processes multimodal information, this framework processes each modality with a distinct, isolated module. Each modality independently infers intent using its respective module, while DAF integrates the multimodal information to produce a comprehensive global intent prediction. The text modality, functioning as the primary modality due to its rich semantic content, guides the other modules in the multi-view contrastive learning process. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods.
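
The general idea of allocating attention within each modality and then across modalities can be illustrated with a minimal NumPy sketch. This is not the authors' DAF implementation: the abstract does not specify the architecture, so the function names (`intra_modal_pool`, `dynamic_fusion`), the dot-product scoring, and the parameters `queries` and `gate_w` are all assumptions made for illustration. The sketch only shows how both sets of weights can depend on the input sample.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def intra_modal_pool(features, query):
    """Attention-pool one modality's sequence (T, D) into a vector (D,).

    Hypothetical within-modality step: each time step is scored against a
    query vector, so the weights depend on the input sample.
    """
    scores = features @ query / np.sqrt(features.shape[-1])  # (T,)
    weights = softmax(scores)                                 # sums to 1 over T
    return weights @ features, weights


def dynamic_fusion(modalities, queries, gate_w):
    """Fuse pooled modality vectors with sample-dependent gates.

    modalities: dict name -> array (T_m, D)
    queries:    dict name -> array (D,)   (assumed learned parameters)
    gate_w:     array (D,)                (assumed learned parameter)
    """
    names = list(modalities)
    pooled = np.stack(
        [intra_modal_pool(modalities[n], queries[n])[0] for n in names]
    )                                   # (M, D)
    gate = softmax(pooled @ gate_w)     # (M,), one weight per modality
    fused = gate @ pooled               # (D,)
    return fused, dict(zip(names, gate))


# Illustrative usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
dim = 8
sample = {
    "text": rng.normal(size=(6, dim)),
    "audio": rng.normal(size=(10, dim)),
    "video": rng.normal(size=(4, dim)),
}
params = {name: rng.normal(size=dim) for name in sample}
fused_vec, modality_weights = dynamic_fusion(sample, params, rng.normal(size=dim))
print(modality_weights)
```

Because the gates are computed from the pooled features themselves, a different input sample yields a different weighting of text, audio, and video, which is the behavior the abstract describes at a high level.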

Cite

Text

Hu et al. "Adaptive Multimodal Fusion: Dynamic Attention Allocation for Intent Recognition." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I16.33898

Markdown

[Hu et al. "Adaptive Multimodal Fusion: Dynamic Attention Allocation for Intent Recognition." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/hu2025aaai-adaptive-a/) doi:10.1609/AAAI.V39I16.33898

BibTeX

@inproceedings{hu2025aaai-adaptive-a,
  title     = {{Adaptive Multimodal Fusion: Dynamic Attention Allocation for Intent Recognition}},
  author    = {Hu, Bo and Zhang, Kai and Zhang, Yanghai and Ye, Yuyang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {17267--17275},
  doi       = {10.1609/AAAI.V39I16.33898},
  url       = {https://mlanthology.org/aaai/2025/hu2025aaai-adaptive-a/}
}