SpaceServe: Spatial Multiplexing of Complementary Encoders and Decoders for Multimodal LLMs

Abstract

Recent multimodal large language models (MLLMs) marry modality-specific vision or audio encoders with a shared text decoder. While the encoder is compute-intensive but memory-light, the decoder is the opposite, yet state-of-the-art serving stacks still time-multiplex these complementary kernels, idling SMs or HBM in turn. We introduce SpaceServe, a serving system that space-multiplexes MLLMs: it decouples all modality encoders from the decoder, and co-locates them on the same GPU using fine-grained SM partitioning available in modern runtimes. A cost-model-guided Space-Inference Scheduler (SIS) dynamically assigns SM slices, while a Time-Windowed Shortest-Remaining-First (TWSRFT) policy batches encoder requests to minimise completion latency and smooth decoder arrivals. Evaluation shows that SpaceServe reduces time-per-output-token by 4.81× on average and up to 28.9× on NVIDIA A100 GPUs. SpaceServe is available at https://github.com/gofreelee/SpaceServe.
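The TWSRFT policy described above can be sketched as follows: group encoder requests into fixed time windows, then within each window dispatch the requests with the least estimated remaining encoder work first. This is a minimal illustrative sketch, not the paper's implementation; the class and function names, the window parameter, and the work estimates are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class EncoderRequest:
    remaining_work: float                    # estimated encoder work left (hypothetical units)
    request_id: int = field(compare=False)
    arrival_time: float = field(compare=False)

def twsrft_batches(requests, window, max_batch):
    """Time-Windowed Shortest-Remaining-First (sketch).

    Collect requests arriving inside each time window, then batch them
    shortest-remaining-work first, up to max_batch per batch.
    """
    batches = []
    requests = sorted(requests, key=lambda r: r.arrival_time)
    i = 0
    while i < len(requests):
        # the window opens at the first pending request's arrival
        window_end = requests[i].arrival_time + window
        j = i
        while j < len(requests) and requests[j].arrival_time <= window_end:
            j += 1
        # within the window, serve shortest remaining work first
        pending = sorted(requests[i:j])
        for k in range(0, len(pending), max_batch):
            batches.append([r.request_id for r in pending[k:k + max_batch]])
        i = j
    return batches
```

Batching shortest-first within a window trades a bounded amount of queueing delay for lower mean completion latency and a steadier stream of encoder outputs arriving at the decoder.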

Cite

Text

Li et al. "SpaceServe: Spatial Multiplexing of Complementary Encoders and Decoders for Multimodal LLMs." Advances in Neural Information Processing Systems, 2025.

Markdown

[Li et al. "SpaceServe: Spatial Multiplexing of Complementary Encoders and Decoders for Multimodal LLMs." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/li2025neurips-spaceserve/)

BibTeX

@inproceedings{li2025neurips-spaceserve,
  title     = {{SpaceServe: Spatial Multiplexing of Complementary Encoders and Decoders for Multimodal LLMs}},
  author    = {Li, Zhicheng and Zhang, Shuoming and Zhao, Jiacheng and Li, Siqi and Shi, Xiyu and Zhang, Yangyu and Li, Shuaijiang and Yu, Donglin and Yang, Zheming and Wen, Yuan and Cui, Huimin},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/li2025neurips-spaceserve/}
}