Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

Deng, Haoran; Lin, Yingyu; Lin, Zhenghao; Liu, Xiao; Sun, Yizhou; Ma, Yian; Gong, Yeyun

ICLR 2026

Abstract

Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
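The contrast described in the abstract can be illustrated with a minimal sketch. It assumes per-token log-probabilities for the same target tokens have already been computed twice by some language model, once with the full long context and once with only a short local window. The function names `long_context_gain` and `select_top_fraction`, the mean-difference aggregation, and the top-fraction selection rule are all hypothetical choices made for illustration; the abstract does not specify LongFilter's exact scoring formula or threshold.

```python
from typing import List, Sequence


def long_context_gain(logp_long: Sequence[float], logp_short: Sequence[float]) -> float:
    """Mean per-token log-probability gain from seeing the longer context.

    Assumption: both sequences score the same target tokens, in the same order.
    A value near zero suggests local context was already enough to predict them.
    """
    if len(logp_long) != len(logp_short):
        raise ValueError("both sequences must score the same tokens")
    if not logp_long:
        return 0.0
    total = sum(l - s for l, s in zip(logp_long, logp_short))
    return total / len(logp_long)


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the highest-scoring documents, in original order.

    Assumption: keeping a fixed fraction is one simple selection rule;
    it always keeps at least one document.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: document 0 gains nothing from long context, document 1 gains a lot.
docs = [
    ([-1.0, -1.0], [-1.0, -1.0]),
    ([-0.5, -0.5], [-2.0, -3.0]),
]
scores = [long_context_gain(long, short) for long, short in docs]
kept = select_top_fraction(scores, fraction=0.5)
```

In this toy setup, a document whose tokens are equally predictable with or without the extended context scores zero and is filtered out, while a document whose tokens become much more predictable with the extended context is kept.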

Cite

Text

Deng et al. "Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data." International Conference on Learning Representations, 2026.

Markdown

[Deng et al. "Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/deng2026iclr-beyond/)

BibTeX

@inproceedings{deng2026iclr-beyond,
  title     = {{Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data}},
  author    = {Deng, Haoran and Lin, Yingyu and Lin, Zhenghao and Liu, Xiao and Sun, Yizhou and Ma, Yian and Gong, Yeyun},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/deng2026iclr-beyond/}
}