From Models to Systems: A Comprehensive Survey of Efficient Multimodal Learning

Abstract

The rapid expansion of multimodal models has surfaced formidable bottlenecks in computation, memory, and deployment, catalyzing the rise of Efficient Multimodal Learning (EML) as a pivotal research frontier. Despite intensive progress, a cohesive understanding of $\textit{what}$, $\textit{how}$, and $\textit{where}$ efficiency is manifested across the learning stack remains fragmented. This survey systematizes the EML landscape by introducing the first structured, model-to-system taxonomy. We distill insights from over 300 seminal works into three hierarchical levels ($\textit{model}$, $\textit{algorithm}$, and $\textit{system}$), addressing architectural parsimony, execution refinement, and hardware-aware orchestration, respectively. Moving beyond a purely categorical review, we offer a methodological synthesis of the vertical synergies between these layers, elucidating how cross-layer co-design contributes to the fundamental "Efficiency-Utility-Privacy" trade-off. Through an integrative case study of Multimodal Large Language Models (MLLMs), we trace the field’s evolutionary trajectory from initial structural adjustments to modern full-stack resource orchestration. Furthermore, we provide a holistic discussion and application-specific optimization blueprints for diverse domains and posit a paradigm shift toward self-regulating intelligence, where efficiency is an intrinsic, emergent property of the model’s fundamental design rather than a post-hoc constraint. Finally, we present open challenges and future directions that will define the trajectory of EML research. This survey establishes a structured framework for multimodal systems that are not only high-performing and generalizable but also natively efficient and ready for ubiquitous deployment. A continuously updated version is available at https://github.com/pwang322/Efficient-Multimodal-Learning-Survey.

Cite

Text

Wang et al. "From Models to Systems: A Comprehensive Survey of Efficient Multimodal Learning." Transactions on Machine Learning Research, 2026.

Markdown

[Wang et al. "From Models to Systems: A Comprehensive Survey of Efficient Multimodal Learning." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/wang2026tmlr-models/)

BibTeX

@article{wang2026tmlr-models,
  title     = {{From Models to Systems: A Comprehensive Survey of Efficient Multimodal Learning}},
  author    = {Wang, Pan and Song, Siwei and Ji, Hui and Cao, Siqi and Yu, Heng and Liu, Zhijian and Yang, Huanrui and Lin, Yingyan Celine and Chen, Beidi and Bansal, Mohit and Liu, Xiaoming and Zhou, Pengfei and Yang, Ming-Hsuan and Chen, Tianlong and Hu, Jingtong},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/wang2026tmlr-models/}
}