MaskViM: Domain Generalized Semantic Segmentation with State Space Models

Abstract

Domain Generalized Semantic Segmentation (DGSS) aims to use a segmentation model trained on known source domains to make predictions on unknown target domains. Existing DGSS methods follow one of two network architectures: one based on Convolutional Neural Networks (CNNs) and the other based on Vision Transformers (ViTs). Both face challenges: CNN-based methods lack a global receptive field, while ViT-based methods incur a higher computational cost. Drawing inspiration from State Space Models (SSMs), which offer a global receptive field while maintaining linear complexity, we propose an SSM-based method for DGSS. In this work, we first explain why masking makes sense in SSM-based DGSS and propose our mask learning mechanism. Leveraging this mechanism, we present our Mask Vision Mamba network (MaskViM), a model for SSM-based DGSS, and design a mask loss to optimize it. Our method achieves superior performance in four diverse DGSS settings, demonstrating its effectiveness.

Cite

Text

Li et al. "MaskViM: Domain Generalized Semantic Segmentation with State Space Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I5.32502

Markdown

[Li et al. "MaskViM: Domain Generalized Semantic Segmentation with State Space Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/li2025aaai-maskvim/) doi:10.1609/AAAI.V39I5.32502

BibTeX

@inproceedings{li2025aaai-maskvim,
  title     = {{MaskViM: Domain Generalized Semantic Segmentation with State Space Models}},
  author    = {Li, Jiahao and Lu, Yang and Xie, Yuan and Qu, Yanyun},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {4752--4760},
  doi       = {10.1609/AAAI.V39I5.32502},
  url       = {https://mlanthology.org/aaai/2025/li2025aaai-maskvim/}
}