Generalized Representation Learning for Multimodal Histology Imaging Data Through Vision-Language Modeling
Abstract
We introduce a trimodal vision-language framework that unifies multiplexed spatial proteomics (SP), H&E histology, and textual metadata in a single embedding space. A specialized transformer-based SP encoder, alongside pretrained H&E and language models, captures diverse morphological, molecular, and semantic signals. Preliminary results demonstrate improved retrieval, zero-shot classification, and patient-level phenotype predictions, indicating the promise of this multimodal approach for deeper insights and translational applications in digital pathology.
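The abstract describes aligning a spatial-proteomics encoder, an H&E encoder, and a language model in a single embedding space. Below is a minimal, hypothetical sketch of how such a trimodal shared space could be trained with a CLIP-style symmetric contrastive objective; the encoder stand-ins, embedding dimension, temperature initialization, and equal weighting of the three pairwise terms are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch only (not the paper's code): CLIP-style alignment of spatial proteomics (SP),
# H&E, and text embeddings in one shared space. All dimensions and encoders are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 256  # assumed shared embedding dimension


class ProjectionHead(nn.Module):
    """Projects an encoder's output into the shared embedding space."""
    def __init__(self, in_dim, out_dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)  # unit norm so dot products are cosine similarities


class TrimodalModel(nn.Module):
    def __init__(self, sp_encoder, he_encoder, text_encoder, sp_dim, he_dim, text_dim):
        super().__init__()
        # sp_encoder: e.g., a transformer over multiplexed marker channels (placeholder)
        # he_encoder / text_encoder: pretrained H&E and language backbones (placeholders)
        self.sp_encoder, self.he_encoder, self.text_encoder = sp_encoder, he_encoder, text_encoder
        self.sp_head, self.he_head, self.text_head = ProjectionHead(sp_dim), ProjectionHead(he_dim), ProjectionHead(text_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style learnable temperature

    def forward(self, sp, he, text):
        return (self.sp_head(self.sp_encoder(sp)),
                self.he_head(self.he_encoder(he)),
                self.text_head(self.text_encoder(text)))


def clip_loss(z_a, z_b, logit_scale):
    """Symmetric InfoNCE between two batches of paired, normalized embeddings."""
    logits = logit_scale.exp() * z_a @ z_b.t()
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def trimodal_loss(z_sp, z_he, z_txt, logit_scale):
    # Assumed equal weighting of the three pairwise alignment terms.
    return (clip_loss(z_sp, z_he, logit_scale)
            + clip_loss(z_sp, z_txt, logit_scale)
            + clip_loss(z_he, z_txt, logit_scale)) / 3.0


if __name__ == "__main__":
    # Toy stand-ins for the real encoders, with random inputs, just to show the data flow.
    sp_enc = nn.Sequential(nn.Flatten(), nn.Linear(40 * 16, 512))
    he_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16, 384))
    txt_enc = nn.Linear(128, 768)
    model = TrimodalModel(sp_enc, he_enc, txt_enc, 512, 384, 768)
    z = model(torch.randn(8, 40, 16), torch.randn(8, 3, 16), torch.randn(8, 128))
    print(trimodal_loss(*z, model.logit_scale).item())
```

In this kind of setup, retrieval and zero-shot classification follow directly from the shared space: queries from one modality are matched against candidates from another by cosine similarity of the projected embeddings.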
Cite
Text
Leiby et al. "Generalized Representation Learning for Multimodal Histology Imaging Data Through Vision-Language Modeling." ICLR 2025 Workshops: LMRL, 2025.
Markdown
[Leiby et al. "Generalized Representation Learning for Multimodal Histology Imaging Data Through Vision-Language Modeling." ICLR 2025 Workshops: LMRL, 2025.](https://mlanthology.org/iclrw/2025/leiby2025iclrw-generalized/)
BibTeX
@inproceedings{leiby2025iclrw-generalized,
title = {{Generalized Representation Learning for Multimodal Histology Imaging Data Through Vision-Language Modeling}},
author = {Leiby, Jacob S and Trevino, Alexandro E and Mayer, Aaron T and Wu, Zhenqin and Kim, Dokyoon and Huang, Zhi},
booktitle = {ICLR 2025 Workshops: LMRL},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/leiby2025iclrw-generalized/}
}