Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Abstract

Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter’s in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key-information retrieval and long-context reasoning tasks, while performing on par with or better than NTP on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP’s effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model’s focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models. Code has been submitted.
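The core recipe in the abstract (corrupt a small fraction of input tokens, then run ordinary next-token prediction with unchanged targets) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `MASK_ID`, the helper name `meap_inputs`, and the 15% default mask ratio are all assumptions introduced here for clarity.

```python
import random

MASK_ID = 0  # hypothetical reserved mask-token id; real vocabularies define their own


def meap_inputs(tokens, mask_ratio=0.15, seed=None):
    """Build one MEAP training pair from a token-id sequence.

    A small random fraction of the INPUT tokens is replaced with MASK_ID,
    while the next-token targets remain the ORIGINAL sequence, so the loss
    is the standard causal-LM cross-entropy of a decoder-only Transformer.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in rng.sample(range(len(tokens)), n_mask):
        masked[i] = MASK_ID

    # Standard NTP shift: position t sees the (possibly masked) prefix
    # and must predict the original token at position t + 1.
    inputs = masked[:-1]
    targets = list(tokens[1:])
    return inputs, targets
```

Because only the inputs are perturbed and the targets and attention pattern are the usual causal ones, this adds no extra forward passes or architectural changes, which matches the abstract's claim of zero additional overhead.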

Cite

Text

Zhuang et al. "Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Zhuang et al. "Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/zhuang2025icml-maskenhanced/)

BibTeX

@inproceedings{zhuang2025icml-maskenhanced,
  title     = {{Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More}},
  author    = {Zhuang, Xialie and Jia, Zhikai and Li, Jianjin and Zhang, Zhenyu and Shen, Li and Cao, Zheng and Liu, Shiwei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {80516-80532},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/zhuang2025icml-maskenhanced/}
}