This research presents Hydragen, a hardware-aware, exact implementation of attention for transformer-based large language models (LLMs) serving batches of sequences that share a common prefix, a frequent pattern in LLM inference. Hydragen computes attention over the shared prefix and the unique suffixes separately, which removes redundant memory reads of the prefix and replaces many small attention operations with hardware-friendly matrix multiplications. This yields a significant increase in LLM throughput, with improvements of up to 32x over competitive baselines. The method also scales to very long shared contexts and generalizes to tree-based prompt sharing patterns, further reducing inference time.
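To make the decomposition concrete, below is a minimal sketch (not the paper's code) of the underlying idea: attention is computed separately over a shared prefix and the per-sequence suffixes, and the two partial results are merged with a softmax (log-sum-exp) correction. All function names and tensor shapes here are illustrative assumptions; the sketch covers the decoding case (one query token per sequence), where no causal mask is needed.

```python
import torch

def partial_attention(q, k, v):
    """Attention over one KV chunk; also returns the log-sum-exp of the
    scores so partial results can be merged exactly later.
    q: (batch, heads, q_len, dim), k/v: (batch, heads, kv_len, dim)."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)   # (b, h, q, 1)
    out = torch.softmax(scores, dim=-1) @ v                # (b, h, q, d)
    return out, lse

def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    """q, k_suffix, v_suffix are per-sequence; k_prefix/v_prefix have shape
    (1, heads, prefix_len, dim) and are stored once, then broadcast across
    the batch, so the prefix pass becomes one large matmul instead of
    re-reading the same prefix KV cache for every sequence."""
    b = q.shape[0]
    kp = k_prefix.expand(b, -1, -1, -1)   # broadcast shared prefix keys
    vp = v_prefix.expand(b, -1, -1, -1)   # broadcast shared prefix values
    out_p, lse_p = partial_attention(q, kp, vp)
    out_s, lse_s = partial_attention(q, k_suffix, v_suffix)
    # Merge the two partial results with a numerically stable softmax
    # correction: weight each chunk by its total (unnormalized) score mass.
    max_lse = torch.maximum(lse_p, lse_s)
    w_p = torch.exp(lse_p - max_lse)
    w_s = torch.exp(lse_s - max_lse)
    return (out_p * w_p + out_s * w_s) / (w_p + w_s)
```

The key design point illustrated here is that the prefix pass attends all queries in the batch against a single copy of the prefix key/value cache, turning what would otherwise be many redundant memory-bound reads into a single compute-friendly matrix multiplication, while the suffix pass stays small and per-sequence.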


Publication date: 7 Feb 2024
Project Page: https://arxiv.org/abs/2402.05099v1
Paper: https://arxiv.org/pdf/2402.05099