The paper presents Hydra heads, a method for improving the efficiency of speculative decoding in transformer-based large language models (LLMs). The work builds on the Medusa decoding framework, which speculates candidate continuations using lightweight draft heads attached to the base model. Whereas standard draft heads predict each speculated token independently of the others, Hydra heads are sequentially dependent: each head conditions on the tokens speculated earlier in the candidate continuation, which significantly improves speculation accuracy and decoding throughput. The authors also propose Hydra++, a carefully tuned Hydra head recipe that further increases decoding throughput. They conclude that Hydra heads are a simple but highly effective intervention for speeding up draft-head-based speculative decoding.
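To make the distinction concrete, here is a minimal sketch (not the authors' code) contrasting an independent, Medusa-style draft head with a sequentially dependent, Hydra-style head. The module names, dimensions, simple MLP head, and mean-pooling of earlier speculated tokens are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class IndependentDraftHead(nn.Module):
    """Medusa-style: predicts the k-th speculated token from the base model's
    hidden state alone, independent of the other speculated tokens."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, hidden_state):
        return self.proj(hidden_state)  # logits for the k-th speculated token

class SequentialDraftHead(nn.Module):
    """Hydra-style: additionally conditions on the tokens speculated so far,
    making the heads sequentially dependent."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.mlp = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.SiLU())
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, hidden_state, prev_speculated_tokens):
        # Summarize earlier speculated tokens (mean of embeddings is an
        # illustrative choice for this sketch).
        ctx = self.embed(prev_speculated_tokens).mean(dim=1)
        h = self.mlp(torch.cat([hidden_state, ctx], dim=-1))
        return self.proj(h)

# Toy usage: greedily speculate 3 tokens with sequentially dependent heads.
d_model, vocab = 64, 1000
hidden = torch.randn(1, d_model)                   # base model's last hidden state
heads = [SequentialDraftHead(d_model, vocab) for _ in range(3)]
speculated = torch.zeros(1, 1, dtype=torch.long)   # seed with a dummy token id
for head in heads:
    logits = head(hidden, speculated)
    next_tok = logits.argmax(dim=-1, keepdim=True)
    speculated = torch.cat([speculated, next_tok], dim=1)
print(speculated[:, 1:])  # the 3 speculated token ids
```

The speculated tokens are then verified in a single forward pass of the base model, as in other speculative decoding schemes; the sequential dependence only changes how the candidate continuation is proposed.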
Publication date: 7 Feb 2024
Project Page: https://arxiv.org/abs/2402.05109
Paper: https://arxiv.org/pdf/2402.05109