H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

Abstract

With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context at different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate all the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving; e.g., for risk object detection, it outperforms the previous SOTA method with a 5.5% mIoU improvement.
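The two-stage idea in the abstract (context gathered at several temporal resolutions, then selected by a query built from the current frame) can be illustrated with a toy sketch. This is not the authors' implementation: the fixed-decay recurrence below stands in for a real Mamba block, whose state space parameters are learned and input-dependent, and nothing here is trained. The function names, the strides, and the decay value are all hypothetical choices made for illustration.

```python
import math


def ssm_scan(xs, decay):
    """Toy diagonal linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    Assumption: a stand-in for a state space layer, NOT a selective (Mamba) scan.
    """
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [decay * hi + (1.0 - decay) * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states


def multi_granularity_context(frames, strides=(1, 2, 4), decay=0.9):
    """Scan the clip at several temporal strides and keep each final state.

    The strides are hypothetical; each one yields one context vector.
    """
    contexts = []
    for s in strides:
        # Subsample backwards from the end so the latest frame is always kept.
        sub = frames[::-1][::s][::-1]
        contexts.append(ssm_scan(sub, decay)[-1])
    return contexts


def query_select(current, contexts):
    """Use the current frame as the query and softmax-weight the contexts.

    Assumption: plain dot-product attention with a residual connection;
    the paper's Q-Mamba uses a learnable query, which is omitted here.
    """
    scale = math.sqrt(len(current))
    scores = [sum(c * q for c, q in zip(ctx, current)) / scale for ctx in contexts]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        q + sum(w * ctx[i] for w, ctx in zip(weights, contexts))
        for i, q in enumerate(current)
    ]
    return fused, weights


# Usage: 8 synthetic "frames", each a 4-dimensional feature vector.
frames = [[math.sin(0.3 * t + d) for d in range(4)] for t in range(8)]
contexts = multi_granularity_context(frames)
fused, weights = query_select(frames[-1], contexts)
```

In this sketch, `weights` holds one attention weight per temporal stride, so it shows how much each granularity contributes to the fused representation of the current frame.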

Cite

Text

Chen et al. "H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I2.32220

Markdown

[Chen et al. "H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/chen2025aaai-h/) doi:10.1609/AAAI.V39I2.32220

BibTeX

@inproceedings{chen2025aaai-h,
  title     = {{H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving}},
  author    = {Chen, Siran and Luo, Yuxiao and Ma, Yue and Qiao, Yu and Wang, Yali},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {2212-2220},
  doi       = {10.1609/AAAI.V39I2.32220},
  url       = {https://mlanthology.org/aaai/2025/chen2025aaai-h/}
}