LightTransfer: Your Long-Context LLM Is Secretly a Hybrid Model with Effortless Adaptation

Abstract

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation. In this work, we propose \textsc{LightTransfer}, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies \textit{lazy} layers---those focusing on recent or initial tokens---and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks, or with minimal fine-tuning for o1-like long-reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as \textit{lazy}, \textsc{LightTransfer} achieves up to a 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and reaches 53.3\% on the math benchmark AIME24 with the advanced o1-like long-reasoning model QwQ-STILL.
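
To make the lazy-layer idea concrete, below is a minimal Python sketch of how such layers might be detected: score each layer by how much of its attention mass falls on the initial (sink) and most recent tokens, then pick the most concentrated layers as candidates for streaming attention. This is an illustration under assumptions, not the paper's implementation; the names lazy_score, select_lazy_layers, n_sink, n_recent, and budget are hypothetical, and the actual scoring rule and threshold in \textsc{LightTransfer} may differ.

import torch

def lazy_score(attn_weights: torch.Tensor, n_sink: int = 4, n_recent: int = 64) -> float:
    """Fraction of attention mass that the last query places on the first
    n_sink (initial) and last n_recent tokens, averaged over heads.

    attn_weights: (num_heads, q_len, k_len) softmax-normalized attention
    weights of one layer. Assumed scoring rule, not the paper's exact one.
    """
    last_query = attn_weights[:, -1, :]                                   # (num_heads, k_len)
    mass = last_query[:, :n_sink].sum(-1) + last_query[:, -n_recent:].sum(-1)
    return mass.mean().item()

def select_lazy_layers(per_layer_attn, budget: int, n_sink: int = 4, n_recent: int = 64):
    """Return the indices of the `budget` layers whose attention is most
    concentrated on initial + recent tokens; these are the candidates whose
    full attention would be replaced by streaming attention."""
    scores = [lazy_score(a, n_sink, n_recent) for a in per_layer_attn]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]

At decode time, the selected layers would keep only the sink and recent KV entries (streaming attention) while the remaining layers retain the full KV cache, which is the source of the memory and throughput savings described in the abstract.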

Cite

Text

Zhang et al. "LightTransfer: Your Long-Context LLM Is Secretly a Hybrid Model with Effortless Adaptation." Transactions on Machine Learning Research, 2025.

Markdown

[Zhang et al. "LightTransfer: Your Long-Context LLM Is Secretly a Hybrid Model with Effortless Adaptation." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/zhang2025tmlr-lighttransfer/)

BibTeX

@article{zhang2025tmlr-lighttransfer,
  title     = {{LightTransfer: Your Long-Context LLM Is Secretly a Hybrid Model with Effortless Adaptation}},
  author    = {Zhang, Xuan and Zhang, Fengzhuo and Du, Cunxiao and Du, Chao and Pang, Tianyu and Gao, Wei and Lin, Min},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/zhang2025tmlr-lighttransfer/}
}