Learned Meta-Tokens for Language Modeling
Abstract
Transformer-based language models (LMs) notably struggle to reliably capture distant contextual information. This work introduces a novel approach using meta-tokens -- special tokens injected during pre-training -- paired with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model equipped with meta-attention in addition to causal multi-head attention on <100B tokens, achieving strong performance on a suite of synthetic tasks. Our method facilitates length generalization up to 2× the context window after extension with YaRN. We provide an information-theoretic analysis which reveals that meta-tokens sharpen the positional encoding, allowing them to operate as content-based anchors that compress preceding context and “cache” it within the meta-token. We empirically confirm this by visualizing model internals to study the residual stream. Together, our findings demonstrate that meta-tokens and meta-attention provide a simple, data-efficient pre-training method, grounded by new mechanistic insights into their role in enabling length generalization behavior.
Cite
Text
Shah et al. "Learned Meta-Tokens for Language Modeling." International Conference on Learning Representations, 2026.
Markdown
[Shah et al. "Learned Meta-Tokens for Language Modeling." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/shah2026iclr-learned/)
BibTeX
@inproceedings{shah2026iclr-learned,
title = {{Learned Meta-Tokens for Language Modeling}},
author = {Shah, Alok and Gupta, Khush and Ramji, Keshav and Chaudhari, Pratik},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/shah2026iclr-learned/}
}