Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations
Abstract
Cross-modal distillation has emerged as a critical technique for leveraging strengths across different modalities. However, existing methods have not demonstrated performance benefits transferred between models trained on data from different modalities. In this work, we introduce a cross-modal alignment regularization (CMAR) term into language model training that aligns the language model's representations with those of a vision model at specific layers. Our experiments demonstrate that this method enhances language model performance across various downstream tasks, in both pre-training and fine-tuning settings. Specifically, in the pre-training setting, we observe accuracy improvements of 1.01% on the Language Modeling Broadened to Account for Discourse Aspects (LAMBADA) dataset and 1.49% on the Choice of Plausible Alternatives (COPA) dataset. Our method also proves effective in the fine-tuning setting, boosting performance by 1.20% on LAMBADA and 2.00% on COPA, indicating that a vision model can substantially enhance language model performance. CMAR is a simple yet effective strategy for consistently improving language models through representation alignment with vision models, opening a new avenue for direct cross-modal representation alignment.
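To make the idea concrete, the following is a minimal sketch of what such an alignment regularizer could look like in PyTorch. It is written under stated assumptions, not as the paper's exact formulation: the names (CMARHead, cmar_loss), the cosine-distance objective, the pooled feature shapes, and the weight 0.1 on the regularizer are all illustrative choices; the abstract only specifies that representations of a chosen language-model layer are aligned with those of a vision model during training.

# Minimal sketch of a cross-modal alignment regularization term.
# Assumptions: pooled per-example features, a frozen vision model, a learned
# linear projection from the LM space to the vision space, and a cosine-
# distance penalty added to the usual language-modeling loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CMARHead(nn.Module):
    """Hypothetical projection head mapping LM hidden states into the vision feature space."""

    def __init__(self, d_lm: int, d_vision: int):
        super().__init__()
        self.proj = nn.Linear(d_lm, d_vision)

    def forward(self, lm_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(lm_hidden)


def cmar_loss(lm_hidden: torch.Tensor,
              vision_feats: torch.Tensor,
              head: CMARHead) -> torch.Tensor:
    """Cosine-distance alignment between pooled LM states and frozen vision features.

    lm_hidden:    (batch, d_lm)     pooled hidden states from a chosen LM layer
    vision_feats: (batch, d_vision) pooled features from a frozen vision model
    """
    z_lm = F.normalize(head(lm_hidden), dim=-1)
    z_vis = F.normalize(vision_feats.detach(), dim=-1)  # no gradients into the vision model
    return (1.0 - (z_lm * z_vis).sum(dim=-1)).mean()


# Illustrative training step (shapes and the 0.1 weight are assumptions):
if __name__ == "__main__":
    head = CMARHead(d_lm=768, d_vision=1024)
    h_layer_k = torch.randn(8, 768)    # stand-in for LM hidden states at layer k
    v_feats = torch.randn(8, 1024)     # stand-in for vision-model features
    lm_loss = torch.tensor(2.3)        # stand-in for the usual next-token loss
    total_loss = lm_loss + 0.1 * cmar_loss(h_layer_k, v_feats, head)
    print(float(total_loss))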
Cite
@inproceedings{gan2025iclrw-crossmodal,
title = {{Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations}},
author = {Gan, Yulu and Zhao, Kaiya Ivy and Isola, Phillip},
booktitle = {ICLR 2025 Workshops: Re-Align},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/gan2025iclrw-crossmodal/}
}