SegMAN: Omni-Scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation

Abstract

High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to deliver all three capabilities simultaneously. We therefore aim to empower segmentation networks to perform efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation across varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder, dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. Our SegMAN-B Encoder achieves 85.1% ImageNet-1k accuracy (+1.5% over VMamba-S with fewer parameters). When paired with our decoder, the full SegMAN-B model achieves 52.6% mIoU on ADE20K (+1.6% over SegNeXt-L with 15% fewer GFLOPs), 83.8% mIoU on Cityscapes (+2.1% over SegFormer-B3 with half the GFLOPs), and 1.6% higher mIoU than VWFormer-B3 on COCO-Stuff with lower GFLOPs. Our code is available at https://github.com/yunxiangfu2001/SegMAN.
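The core idea in the abstract, combining a sliding local-attention branch (local detail) with a linear-time state-space scan (global context) inside one token mixer, can be illustrated with a deliberately simplified 1-D sketch. All names, the additive branch combination, and the single-decay diagonal recurrence below are illustrative assumptions for intuition only, not the paper's actual SegMAN Encoder implementation (see the repository linked above for that).

```python
import numpy as np

def sliding_window_attention(x, window=4):
    """Toy 1-D sliding local attention: each token attends only to
    tokens within +/- `window` positions of itself.
    x: (seq_len, dim) array."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = x[lo:hi]                       # (w, d) local neighborhood
        scores = keys @ x[i] / np.sqrt(d)     # (w,) dot-product scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax over the window
        out[i] = weights @ keys               # weighted sum of neighbors
    return out

def ssm_scan(x, decay=0.9):
    """Toy diagonal state-space recurrence h_t = a * h_{t-1} + x_t:
    a linear-time scan that propagates global context down the sequence."""
    h = np.zeros(x.shape[1])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def hybrid_block(x, window=4, decay=0.9):
    """Stand-in for a hybrid token mixer: sum of the local-attention
    branch (fine detail) and the state-space branch (global context)."""
    return sliding_window_attention(x, window) + ssm_scan(x, decay)

x = np.random.default_rng(0).standard_normal((16, 8))
y = hybrid_block(x)
```

Both branches run in time linear in sequence length (the attention branch because each token's window has constant size), which is the property that lets such hybrids scale to high-resolution segmentation inputs.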

Cite

Text

Fu et al. "SegMAN: Omni-Scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01777

Markdown

[Fu et al. "SegMAN: Omni-Scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/fu2025cvpr-segman/) doi:10.1109/CVPR52734.2025.01777

BibTeX

@inproceedings{fu2025cvpr-segman,
  title     = {{SegMAN: Omni-Scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation}},
  author    = {Fu, Yunxiang and Lou, Meng and Yu, Yizhou},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {19077--19087},
  doi       = {10.1109/CVPR52734.2025.01777},
  url       = {https://mlanthology.org/cvpr/2025/fu2025cvpr-segman/}
}