LongMagpie: A Self-Synthesis Method for Generating Large-Scale Long-Context Instructions
Abstract
High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and LongBench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
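The query-elicitation trick described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's released pipeline: the model name, the Llama-3-style chat-template strings, and the sampling settings are assumptions made for the example.

```python
# Minimal sketch of a LongMagpie-style self-synthesis loop, assuming a
# Llama-3-style chat template. The model name, special-token strings, and
# sampling settings are illustrative assumptions, not the paper's exact
# configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed aligned long-context model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(prompt: str, max_new_tokens: int) -> str:
    """Sample a continuation and return only the newly generated text."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7
    )
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def synthesize(document: str) -> tuple[str, str]:
    # Step 1: present the document followed by the special tokens that open
    # a user turn. The aligned model auto-regressively completes the turn,
    # yielding a query that is contextually relevant to the document.
    pre_query = (
        f"<|begin_of_text|>{document}\n"
        "<|start_header_id|>user<|end_header_id|>\n\n"
    )
    query = generate(pre_query, max_new_tokens=128).strip()

    # Step 2: re-assemble the harvested (document, query) pair as an
    # ordinary chat prompt and collect the model's answer as the response.
    messages = [{"role": "user", "content": f"{document}\n\n{query}"}]
    chat_prompt = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    response = generate(chat_prompt, max_new_tokens=512).strip()
    return query, response
```

Looping `synthesize` over a document corpus yields (document, query, response) triples that can be filtered and used directly as long-context instruction data, which is the harvesting step the abstract describes.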
Cite
Text
Gao et al. "LongMagpie: A Self-Synthesis Method for Generating Large-Scale Long-Context Instructions." Advances in Neural Information Processing Systems, 2025.
Markdown
[Gao et al. "LongMagpie: A Self-Synthesis Method for Generating Large-Scale Long-Context Instructions." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/gao2025neurips-longmagpie/)
BibTeX
@inproceedings{gao2025neurips-longmagpie,
title = {{LongMagpie: A Self-Synthesis Method for Generating Large-Scale Long-Context Instructions}},
author = {Gao, Chaochen and Wu, Xing and Lin, Zijia and Zhang, Debing and Hu, Songlin},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/gao2025neurips-longmagpie/}
}