FedPIA - Permuting and Integrating Adapters Leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning
Abstract
Large Vision-Language Models (VLMs), possessing millions or billions of parameters, typically require large text and image datasets for effective fine-tuning. However, collecting data from various sites, especially in healthcare, is challenging due to strict privacy regulations. An alternative is to fine-tune these foundation models on end-user devices, such as in medical clinics and hospitals, without sending data to a server. These local clients typically have limited computing power and small datasets, which are not enough to fully fine-tune large VLMs on their own. A naive solution for these scenarios is to leverage parameter-efficient fine-tuning (PEFT) strategies such as adapters and apply federated learning (FL) algorithms to combine the learned adapter weights, thereby respecting the resource limitations and data privacy of the clients. However, this approach does not fully leverage the knowledge from multiple adapters trained on diverse data distributions and for diverse tasks. The adapters are adversely impacted by data heterogeneity and task heterogeneity across clients, resulting in sub-optimal convergence. To this end, we propose a novel framework called FedPIA that improves upon the naive combinations of FL and PEFT by introducing Permutation and Integration of the local Adapters at the server and of the global Adapters at the clients, exploiting Wasserstein barycenters for improved blending of client-specific and client-agnostic knowledge. This layerwise permutation helps to bridge the gap in the parameter space of local and global adapters before integration. We conduct over 2000 client-level experiments using 48 medical image datasets across five different medical vision-language FL task settings, encompassing visual question answering as well as image and report-based multi-label disease detection. Our experiments involving diverse client settings, ten different modalities, and two VLM backbones demonstrate that FedPIA consistently outperforms the state-of-the-art PEFT-FL baselines.
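The idea of permuting an adapter layer before integrating it can be illustrated with a toy sketch. This is not the FedPIA algorithm: the abstract specifies Wasserstein barycenters, whereas the sketch below assumes a simpler stand-in, a hard permutation found by brute force that minimizes the squared distance between the hidden units of a local and a global layer, followed by a plain average. The function names (`best_permutation`, `permute_and_average`) and the toy weights are hypothetical. The link to optimal transport is that, for two equal-sized sets of units with uniform weights, an optimal transport plan can always be chosen to be a permutation.

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_permutation(local_rows, global_rows):
    """Return p minimising sum_i ||local_rows[p[i]] - global_rows[i]||^2.

    Brute force over all permutations: only viable for tiny toy layers.
    (Assumption: stands in for the transport-based alignment in the paper.)
    """
    n = len(global_rows)
    return min(
        permutations(range(n)),
        key=lambda p: sum(
            sq_dist(local_rows[p[i]], global_rows[i]) for i in range(n)
        ),
    )


def average(rows_a, rows_b):
    """Element-wise mean of two layers given as lists of rows."""
    return [[(a + b) / 2 for a, b in zip(u, v)] for u, v in zip(rows_a, rows_b)]


def permute_and_average(local_rows, global_rows):
    """Align the local layer to the global one, then integrate by averaging."""
    p = best_permutation(local_rows, global_rows)
    aligned = [local_rows[j] for j in p]
    return p, average(aligned, global_rows)


# Toy layers: each row holds the incoming weights of one hidden unit.
global_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The local layer contains the same units in a different order.
local_rows = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

perm, merged = permute_and_average(local_rows, global_rows)
naive = average(local_rows, global_rows)

print(perm)    # permutation that aligns local units to global units
print(merged)  # aligned average
print(naive)   # unaligned average, for comparison
```

In this toy case the two layers hold identical units in different orders, so the aligned average recovers the global layer while the unaligned average blurs distinct units together. In a real adapter, the same permutation would also have to be applied to the matching dimension of the following projection so that the adapter computes the same function after reordering.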
Cite
Text
Saha et al. "FedPIA - Permuting and Integrating Adapters Leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I19.34228
Markdown
[Saha et al. "FedPIA - Permuting and Integrating Adapters Leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/saha2025aaai-fedpia/) doi:10.1609/AAAI.V39I19.34228
BibTeX
@inproceedings{saha2025aaai-fedpia,
title = {{FedPIA - Permuting and Integrating Adapters Leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning}},
author = {Saha, Pramit and Mishra, Divyanshu and Wagner, Felix and Kamnitsas, Konstantinos and Noble, J. Alison},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {20228-20236},
doi = {10.1609/AAAI.V39I19.34228},
url = {https://mlanthology.org/aaai/2025/saha2025aaai-fedpia/}
}