Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations

Gan, Yulu; Zhao, Kaiya Ivy; Isola, Phillip

ICLRW 2025

Abstract

Cross-modal distillation has emerged as a critical technique for leveraging strengths across different modalities. However, existing methods have not enabled performance benefits between models trained on different modal data. In this work, we introduce a cross-modal alignment regularization (CMAR) term into language model training, aligning its representations with those of a vision model at specific layers. Our experiments demonstrate that our method enhances language model performance across various downstream tasks, in both pre-training and fine-tuning settings. Specifically, in the pre-training setting, we observe accuracy improvements of 1.01% on the Language Modeling Broadened to Account for Discourse Aspects (LAMBADA) dataset and 1.49% on the Causal Reasoning (COPA) dataset. Our method also proves effective in the fine-tuning setting, boosting performance by 1.20% on LAMBADA and 2.00% on COPA, indicating that a vision model can substantially enhance language model performance. CMAR provides a simple yet effective strategy to consistently enhance language model performance through representation alignment with vision models, which opens new avenues for improving model performance through direct cross-modal representation alignment.

Cite

Text

Gan et al. "Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations." ICLR 2025 Workshops: Re-Align, 2025.

Markdown

[Gan et al. "Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations." ICLR 2025 Workshops: Re-Align, 2025.](https://mlanthology.org/iclrw/2025/gan2025iclrw-crossmodal/)

BibTeX

@inproceedings{gan2025iclrw-crossmodal,
  title     = {{Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations}},
  author    = {Gan, Yulu and Zhao, Kaiya Ivy and Isola, Phillip},
  booktitle = {ICLR 2025 Workshops: Re-Align},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/gan2025iclrw-crossmodal/}
}