DeFT: Flash Tree-Attention with IO-Awareness for Efficient Tree-Search-Based LLM Inference
Abstract
Decoding using tree search can greatly enhance the inference quality for transformer-based Large Language Models (LLMs). Depending on the guidance signal, it searches for the best root-to-leaf path in a tree formed by LLM outputs to improve controllability, reasoning ability, alignment, et cetera. However, current tree decoding strategies and their inference systems do not suit each other well due to redundancy in computation, memory footprints, and memory access, resulting in inefficient inference. To address this issue, we propose DeFT, an IO-aware tree attention algorithm that maintains memory-efficient attention calculation with low memory footprints in two stages: (1) QKV Preparation: we propose a KV-Guided Tree Split strategy to group QKV wisely for high GPU utilization and to reduce memory reads/writes of the KV cache between GPU global memory and on-chip shared memory as much as possible; (2) Attention Calculation: we compute the partial attention of each QKV group in a fused kernel and then apply a Tree-topology-aware Global Reduction strategy to obtain the final attention. Thanks to a reduction in KV cache IO by 3.6-4.5x, along with an additional reduction in IO for QK^T and Softmax equivalent to 25% of the total KV cache IO, DeFT can achieve a speedup of 1.7-2.4x in end-to-end latency across two practical reasoning tasks over the SOTA attention algorithms.
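The abstract describes a two-stage structure: partial attention per KV group, then a topology-aware global reduction. The following is a minimal NumPy sketch of that structure, not the authors' fused GPU kernel, and it assumes the KV groups have already been produced by the tree split. The function names (`partial_attention`, `tree_attention`) and the `kv_groups`/`group_to_queries` layout are illustrative assumptions; the merge step uses a standard log-sum-exp reduction of partial softmax outputs, which is one common way to realize such a global reduction.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention of queries q over one KV group; returns the group-normalized
    partial output and the per-query log-sum-exp of the scores."""
    s = q @ k.T / np.sqrt(q.shape[-1])            # (n_q, n_kv) attention scores
    m = s.max(axis=-1, keepdims=True)             # per-query max for numerical stability
    p = np.exp(s - m)
    lse = m.squeeze(-1) + np.log(p.sum(axis=-1))  # log-sum-exp per query
    o = p @ v / p.sum(axis=-1, keepdims=True)     # partial output, normalized within the group
    return o, lse

def tree_attention(queries, kv_groups, group_to_queries):
    """Merge the partial outputs of every KV group lying on a query's
    root-to-leaf path, reweighting by exp(lse) (log-sum-exp reduction)."""
    n_q, d = queries.shape
    out = np.zeros((n_q, d))
    denom = np.full(n_q, -np.inf)                 # running log normalizer per query
    for (k, v), q_idx in zip(kv_groups, group_to_queries):
        o, lse = partial_attention(queries[q_idx], k, v)
        new = np.logaddexp(denom[q_idx], lse)     # combined normalizer
        out[q_idx] = (out[q_idx] * np.exp(denom[q_idx] - new)[:, None]
                      + o * np.exp(lse - new)[:, None])
        denom[q_idx] = new
    return out
```

In the paper's framing, stage (1) chooses the grouping so that a tree node's KV cache is loaded from global memory once and shared across the queries that descend from it; the Python loop above merely stands in for the fused kernel and reduction described in the abstract.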
Cite
Text
Yao et al. "DeFT: Flash Tree-Attention with IO-Awareness for Efficient Tree-Search-Based LLM Inference." ICLR 2024 Workshops: AGI, 2024.
Markdown
[Yao et al. "DeFT: Flash Tree-Attention with IO-Awareness for Efficient Tree-Search-Based LLM Inference." ICLR 2024 Workshops: AGI, 2024.](https://mlanthology.org/iclrw/2024/yao2024iclrw-deft/)
BibTeX
@inproceedings{yao2024iclrw-deft,
title = {{DeFT: Flash Tree-Attention with IO-Awareness for Efficient Tree-Search-Based LLM Inference}},
author = {Yao, Jinwei and Zhang, Kexun and Chen, Kaiqi and You, Jiaxuan and Wang, Zeke and Yuan, Binhang and Lin, Tao},
booktitle = {ICLR 2024 Workshops: AGI},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/yao2024iclrw-deft/}
}