KaLM-Embedding-V2: Superior Training Techniques and Data Inspire a Versatile Embedding Model

Zhao, Xinping; Hu, Xinshuo; Shan, Zifei; Huang, Shouzheng; Zhou, Yao; Zhang, Xin; Sun, Zetian; Liu, Zhenyu; Li, Dongfang; Wei, Xinyuan; Pan, Youcheng; Xiang, Yang; Zhang, Meishan; Wang, Haofen; Yu, Jun; Hu, Baotian; Zhang, Min

ICLR 2026

Abstract

Recent advances in text embedding models based on Large Language Models (LLMs) focus primarily on data scaling or synthesis, with limited exploration of training techniques and data quality, which constrains performance. In this work, we propose KaLM-Embedding-V2 from the Lychee-KaLM team, a series of versatile and compact embedding models that systematically incentivize advanced embedding capability in LLMs through superior training techniques and high-quality data. For model architecture, we implement the models at a compact 0.5B size, use simple mean-pooling to produce fixed-length embeddings, and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning on supervised high-quality datasets, and contrastive distillation with fine-grained soft signals. The pipeline is integrated with focal-style reweighting, which emphasizes difficult samples, and online hard-negative mixing, which enriches hard negatives. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation to improve both performance and generalization, and we leverage task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters. The code, data, and models are available at https://kalm-embedding.github.io/.
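
To make two of the ideas in the abstract concrete, the following is a minimal sketch of masked mean-pooling and of a focal-style reweighted contrastive loss. It is not the authors' implementation. The abstract does not give the exact reweighting formula, so the `(1 - p) ** gamma` factor is borrowed from the standard focal loss as an assumption, and the function names `mean_pool`, `cosine`, and `focal_info_nce`, as well as the `temperature` and `gamma` values, are hypothetical.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def focal_info_nce(query, positive, negatives, temperature=0.05, gamma=0.5):
    """InfoNCE loss for one query, scaled by an assumed focal-style weight.

    Returns (weighted_loss, p_pos), where p_pos is the softmax probability
    assigned to the positive. The weight (1 - p_pos) ** gamma is an
    assumption taken from the standard focal loss, not from the paper.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    p_pos = exps[0] / sum(exps)
    loss = -math.log(p_pos)
    weight = (1.0 - p_pos) ** gamma  # small for easy samples, near 1 for hard ones
    return weight * loss, p_pos


# The padded third token is excluded from the average.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])

# An easy sample (positive aligned with the query) versus a hard one
# (the negative is closer to the query than the positive).
easy_loss, easy_p = focal_info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard_loss, hard_p = focal_info_nce([1.0, 0.0], [0.6, 0.8], [[0.8, 0.6]])
```

Under this assumed weighting, a sample the model already ranks correctly contributes almost nothing to the loss, while a sample whose negative outranks its positive keeps nearly its full loss, which is one way to "emphasize difficult samples".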

Cite

Text

Zhao et al. "KaLM-Embedding-V2: Superior Training Techniques and Data Inspire a Versatile Embedding Model." International Conference on Learning Representations, 2026.

Markdown

[Zhao et al. "KaLM-Embedding-V2: Superior Training Techniques and Data Inspire a Versatile Embedding Model." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhao2026iclr-kalmembeddingv2/)

BibTeX

@inproceedings{zhao2026iclr-kalmembeddingv2,
  title     = {{KaLM-Embedding-V2: Superior Training Techniques and Data Inspire a Versatile Embedding Model}},
  author    = {Zhao, Xinping and Hu, Xinshuo and Shan, Zifei and Huang, Shouzheng and Zhou, Yao and Zhang, Xin and Sun, Zetian and Liu, Zhenyu and Li, Dongfang and Wei, Xinyuan and Pan, Youcheng and Xiang, Yang and Zhang, Meishan and Wang, Haofen and Yu, Jun and Hu, Baotian and Zhang, Min},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhao2026iclr-kalmembeddingv2/}
}