UniGist: Towards General and Hardware-Aligned Sequence-Level Long Context Compression
Abstract
Large language models are increasingly capable of handling long-context inputs, but the memory overhead of the KV cache remains a major bottleneck for general-purpose deployment. While many compression strategies have been explored, sequence-level compression is particularly challenging because it tends to discard important details. We present UniGist, a gist-token-based framework for sequence-level long-context compression that removes the need for chunk-wise training, enabling the model to learn how to compress and utilize long-range context during training. To fully exploit the resulting sparsity, we introduce a gist shift trick that transforms the attention layout into a right-aligned block structure and build a block-table-free sparse attention kernel on top of it. UniGist further supports one-pass training and flexible chunk sizes at inference time, allowing efficient and adaptive context processing. Experiments across multiple long-context tasks show that UniGist significantly improves compression quality, with especially strong performance on detail recall and long-range dependency modeling.
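The core idea the abstract describes, i.e. hiding the raw tokens of earlier chunks behind periodic gist tokens so that attention becomes sparse, can be pictured with a small mask-construction sketch. The following is a minimal illustration under assumed details: the chunk size `g`, the `GIST_ID` value, and the helper names are all hypothetical, and this is not the authors' kernel or training code.

```python
# Conceptual sketch only (not the UniGist implementation): interleave one gist
# token after every `g` raw tokens and build an attention mask in which past
# chunks are visible only through their gist tokens.
import torch

GIST_ID = 32000  # hypothetical id for the special gist token


def insert_gist_tokens(input_ids: torch.Tensor, g: int) -> torch.Tensor:
    """Append one gist token after every `g` raw tokens of a 1-D id tensor."""
    pieces = []
    for chunk in input_ids.split(g):
        pieces.append(chunk)
        pieces.append(torch.tensor([GIST_ID], dtype=input_ids.dtype))
    return torch.cat(pieces)


def gist_attention_mask(input_ids: torch.Tensor, g: int) -> torch.Tensor:
    """Boolean (seq, seq) mask, True = attend.

    Each query attends causally to (a) tokens in its own chunk and
    (b) gist tokens of earlier chunks; raw tokens of earlier chunks are
    hidden, which is what makes the attention layout sparse.
    """
    ids = insert_gist_tokens(input_ids, g)
    n = ids.numel()
    pos = torch.arange(n)
    chunk_of = pos // (g + 1)                      # chunk index per position
    is_gist = ids.eq(GIST_ID)
    causal = pos[:, None] >= pos[None, :]
    same_chunk = chunk_of[:, None] == chunk_of[None, :]
    past_gist = is_gist[None, :] & (chunk_of[None, :] < chunk_of[:, None])
    return causal & (same_chunk | past_gist)


# Example: 12 raw tokens, chunks of 4 -> 15 positions including 3 gist tokens.
print(gist_attention_mask(torch.arange(12), g=4).int())
```

The gist shift trick and the block-table-free sparse kernel described in the paper operate on this kind of layout at the hardware level; the sketch only shows the logical visibility pattern.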
Cite
Text
Deng et al. "UniGist: Towards General and Hardware-Aligned Sequence-Level Long Context Compression." Advances in Neural Information Processing Systems, 2025.Markdown
[Deng et al. "UniGist: Towards General and Hardware-Aligned Sequence-Level Long Context Compression." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/deng2025neurips-unigist/)BibTeX
@inproceedings{deng2025neurips-unigist,
  title = {{UniGist: Towards General and Hardware-Aligned Sequence-Level Long Context Compression}},
  author = {Deng, Chenlong and Zhang, Zhisong and Mao, Kelong and Li, Shuaiyi and Fang, Tianqing and Zhang, Hongming and Mi, Haitao and Yu, Dong and Dou, Zhicheng},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/deng2025neurips-unigist/}
}