Leveraging Attention to Effectively Compress Prompts for Long-Context LLMs

Abstract

Prompt compression is increasingly studied for its potential to reduce computational costs and alleviate the burden on language models when processing lengthy prompts. Prior research has decided which tokens to retain or remove by computing information entropy. However, prompt compression faces two significant challenges: (1) information entropy, while widely used, may not be the optimal compression metric; and (2) the semantic significance of a token is context-dependent, which makes independent token retention decisions inadequate. We posit that the solution to these challenges lies in the intrinsic mechanisms of language models. Large language models (LLMs) exhibit strong contextual processing capabilities, and recent studies of their internal dynamics reveal that the attention mechanism plays a crucial role in how LLMs leverage long contexts. Building on this insight, we introduce AttnComp, a novel approach that exploits the attention mechanism within language models to guide prompt compression. Our method employs causal cross-attention from the query to the context to evaluate the significance of each token, and we develop a graph-based algorithm that efficiently clusters tokens into semantic units, thereby avoiding retention decisions made for each token in isolation. We conduct experiments on datasets for retrieval-augmented generation and on multiple long-context tasks involving single-document and multi-document QA. AttnComp outperforms previous baselines, and analytical experiments validate the contributions of its components. Compared with other methods that use a causal LM for prompt compression, our approach achieves lower latency and improved performance.
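
The scoring idea described in the abstract, using the causal attention that query tokens pay to context tokens as an importance signal and keeping only the highest-scoring tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the stand-in model (`gpt2`), the averaging over layers, heads, and query positions, the `keep_ratio` parameter, and the helper names `attention_scores` and `compress` are assumptions for illustration, and the paper's graph-based clustering into semantic units is omitted.

```python
# Minimal sketch of attention-guided prompt compression (illustrative only).
# Each context token is scored by the causal attention it receives from the
# query tokens appended after the context; the top-scoring tokens are kept.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumed stand-in; any causal LM with attention outputs works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def attention_scores(context: str, query: str) -> tuple[list[int], torch.Tensor]:
    """Return context token ids and an attention-based importance score per token."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids[0]
    qry_ids = tokenizer(query, return_tensors="pt").input_ids[0]
    input_ids = torch.cat([ctx_ids, qry_ids]).unsqueeze(0)

    with torch.no_grad():
        out = model(input_ids, output_attentions=True)

    # out.attentions: tuple over layers, each of shape (batch, heads, seq, seq).
    # Take attention FROM query positions TO context positions, then average
    # over layers, heads, and query positions (an assumed aggregation choice).
    n_ctx = ctx_ids.size(0)
    attn = torch.stack(out.attentions)          # (layers, 1, heads, seq, seq)
    cross = attn[:, 0, :, n_ctx:, :n_ctx]       # (layers, heads, n_query, n_ctx)
    scores = cross.mean(dim=(0, 1, 2))          # (n_ctx,)
    return ctx_ids.tolist(), scores


def compress(context: str, query: str, keep_ratio: float = 0.5) -> str:
    """Keep the top `keep_ratio` fraction of context tokens, preserving order."""
    ctx_ids, scores = attention_scores(context, query)
    k = max(1, int(len(ctx_ids) * keep_ratio))
    keep = sorted(torch.topk(scores, k).indices.tolist())
    return tokenizer.decode([ctx_ids[i] for i in keep])


if __name__ == "__main__":
    ctx = ("The Eiffel Tower, completed in 1889, is 330 metres tall "
           "and stands on the Champ de Mars in Paris.")
    print(compress(ctx, "How tall is the Eiffel Tower?", keep_ratio=0.4))
```

Dropping tokens independently, as in this sketch, is exactly the limitation the abstract notes; the paper addresses it by grouping tokens into semantic units before deciding what to keep.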

Cite

Text

Zhao et al. "Leveraging Attention to Effectively Compress Prompts for Long-Context LLMs." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I24.34800

Markdown

[Zhao et al. "Leveraging Attention to Effectively Compress Prompts for Long-Context LLMs." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhao2025aaai-leveraging/) doi:10.1609/AAAI.V39I24.34800

BibTeX

@inproceedings{zhao2025aaai-leveraging,
  title     = {{Leveraging Attention to Effectively Compress Prompts for Long-Context LLMs}},
  author    = {Zhao, Yunlong and Wu, Haoran and Xu, Bo},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {26048--26056},
  doi       = {10.1609/AAAI.V39I24.34800},
  url       = {https://mlanthology.org/aaai/2025/zhao2025aaai-leveraging/}
}