Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment

Abstract

Adapting large language models (LLMs) to specialized domains typically requires domain-specific corpora for continual pre-training to facilitate knowledge memorization and related instructions for fine-tuning to apply this knowledge. However, this approach can lead to inefficient knowledge memorization, because continual pre-training is unaware of how the knowledge will later be utilized, and it requires LLMs to simultaneously learn knowledge utilization and format alignment under divergent training objectives during fine-tuning. To enhance the domain adaptation of LLMs, we revise this process and propose a new domain adaptation framework comprising domain knowledge learning and general format alignment, called \emph{Mix-CPT}. Specifically, we first conduct a knowledge mixture continual pre-training that concurrently focuses on knowledge memorization and utilization. To avoid catastrophic forgetting, we further propose a logit swap self-distillation constraint. By leveraging the knowledge and capabilities acquired during continual pre-training, we then efficiently perform instruction tuning and alignment with a few general training samples to achieve format alignment. Extensive experiments show that our proposed \emph{Mix-CPT} framework can simultaneously improve the task-solving capabilities of LLMs on the target and general domains.
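The abstract mentions a logit swap self-distillation constraint for mitigating catastrophic forgetting during continual pre-training. Below is a minimal, hypothetical sketch of one way such a constraint could be realized; it assumes the frozen base model's logits are modified so that the ground-truth token receives the top logit (by swapping it with the current top-ranked logit) before being used as a distillation target. The function name, tensor shapes, and this exact formulation are illustrative assumptions, not the paper's definitive implementation.

```python
import torch
import torch.nn.functional as F


def logit_swap_distillation_loss(student_logits, reference_logits, target_ids):
    """Hypothetical sketch of a logit-swap self-distillation constraint.

    Assumption: the frozen reference model's logits are rearranged so the
    ground-truth token becomes the argmax (its logit is swapped with the
    current top logit), and the resulting soft distribution is distilled
    into the continually pre-trained (student) model.

    student_logits:   (batch, seq_len, vocab) logits of the model being trained
    reference_logits: (batch, seq_len, vocab) logits of the frozen base model
    target_ids:       (batch, seq_len) ground-truth next-token ids
    """
    swapped = reference_logits.clone()

    top_ids = swapped.argmax(dim=-1, keepdim=True)   # reference model's top prediction
    gold_ids = target_ids.unsqueeze(-1)               # ground-truth token index

    gold_logit = swapped.gather(-1, gold_ids)
    top_logit = swapped.gather(-1, top_ids)

    # Swap the two logits so the gold token becomes the argmax while the
    # rest of the reference distribution is left untouched.
    swapped.scatter_(-1, gold_ids, top_logit)
    swapped.scatter_(-1, top_ids, gold_logit)

    # KL divergence between the swapped reference distribution (teacher)
    # and the student distribution.
    teacher_log_probs = F.log_softmax(swapped.detach(), dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_log_probs,
                    log_target=True, reduction="batchmean")
```

In practice such a term would presumably be added to the standard next-token cross-entropy loss during the knowledge mixture continual pre-training stage; the weighting between the two objectives is left unspecified here.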

Cite

Text

Jiang et al. "Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment." International Conference on Learning Representations, 2025.

Markdown

[Jiang et al. "Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/jiang2025iclr-mixcpt/)

BibTeX

@inproceedings{jiang2025iclr-mixcpt,
  title     = {{Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment}},
  author    = {Jiang, Jinhao and Li, Junyi and Zhao, Xin and Song, Yang and Zhang, Tao and Wen, Ji-Rong},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/jiang2025iclr-mixcpt/}
}