MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-Modal Bottleneck Fusion and Calibrated Decoder Pruning

Abstract

Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
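
The 55% and 75% figures apply to the pixel decoder and transformer decoder separately, so the saving for the whole model depends on how much of the compute budget each component takes. A minimal sketch of that arithmetic follows. The function name `total_flops_reduction`, the assumption that the backbone cost is unchanged, and the component budget in the usage line are all hypothetical placeholders; none of them come from the paper.

```python
def total_flops_reduction(backbone, pixel_decoder, transformer_decoder,
                          pixel_cut=0.55, transformer_cut=0.75):
    """Fraction of total FLOPs saved when only the two decoders shrink.

    The three positional arguments are baseline FLOPs per component, in any
    consistent unit. The default cuts are the upper bounds quoted in the
    abstract. The backbone is assumed unchanged (an assumption of this sketch).
    """
    before = backbone + pixel_decoder + transformer_decoder
    if before <= 0:
        raise ValueError("baseline FLOPs must be positive")
    after = (backbone
             + pixel_decoder * (1.0 - pixel_cut)
             + transformer_decoder * (1.0 - transformer_cut))
    return 1.0 - after / before


# Placeholder budget, NOT taken from the paper: 100 / 60 / 40 units.
saving = total_flops_reduction(100.0, 60.0, 40.0)
print(f"{saving:.1%}")
```

Under this reading, the end-to-end saving is always smaller than either per-decoder figure whenever the backbone accounts for a non-trivial share of the compute.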

Cite

Text

Segu et al. "MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-Modal Bottleneck Fusion and Calibrated Decoder Pruning." International Conference on Computer Vision, 2025.

Markdown

[Segu et al. "MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-Modal Bottleneck Fusion and Calibrated Decoder Pruning." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/segu2025iccv-mobius/)

BibTeX

@inproceedings{segu2025iccv-mobius,
  title     = {{MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-Modal Bottleneck Fusion and Calibrated Decoder Pruning}},
  author    = {Segu, Mattia and Gazulla, Marta Tintore and Xian, Yongqin and Van Gool, Luc and Tombari, Federico},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {20726--20736},
  url       = {https://mlanthology.org/iccv/2025/segu2025iccv-mobius/}
}