Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
Abstract
Large Language Models (LLMs) have been found to struggle with accurately retrieving key information from their context. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter’s in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par with or better than it on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP’s effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model’s focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models. Code has been submitted.
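The core idea of masking inputs while keeping the standard causal-LM objective can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the 15% default mask ratio, and the mask-token id are assumptions for the example.

```python
import random

def meap_corrupt(input_ids, mask_token_id, mask_ratio=0.15, seed=None):
    """Randomly replace a fraction of input tokens with a mask token.

    The targets remain the ORIGINAL sequence, so training uses the
    unchanged causal next-token-prediction loss on the corrupted inputs;
    no bidirectional attention or encoder-decoder machinery is needed.
    """
    rng = random.Random(seed)
    corrupted = list(input_ids)
    for i in range(len(corrupted)):
        if rng.random() < mask_ratio:
            corrupted[i] = mask_token_id
    return corrupted

# Decoder inputs are the corrupted ids; labels are the original ids
# shifted by one position, exactly as in standard next-token prediction.
tokens = [5, 17, 42, 8, 99, 3]   # toy token ids
MASK = 0                          # hypothetical mask-token id
inputs = meap_corrupt(tokens, MASK, mask_ratio=0.3, seed=1)
labels = tokens[1:]               # predict the original next tokens
```

Because only the inputs are corrupted while the loss and attention pattern stay those of ordinary NTP, the paradigm plugs into an existing decoder-only pre-training pipeline without extra compute.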
Cite
Zhuang et al. "Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More." Proceedings of the 42nd International Conference on Machine Learning, 2025.

BibTeX:
@inproceedings{zhuang2025icml-maskenhanced,
title = {{Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More}},
author = {Zhuang, Xialie and Jia, Zhikai and Li, Jianjin and Zhang, Zhenyu and Shen, Li and Cao, Zheng and Liu, Shiwei},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {80516-80532},
volume = {267},
url = {https://mlanthology.org/icml/2025/zhuang2025icml-maskenhanced/}
}