LONGNET: Scaling Transformers to 1,000,000,000 Tokens

LONGNET is designed to address the challenge of scaling sequence length in large language models. Existing methods struggle with either computational complexity or model expressivity, which limits the maximum sequence length. LONGNET, however, can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. It achieves this through dilated attention, which expands the attentive field exponentially as the distance between tokens grows. The approach offers several advantages: linear computational complexity, a logarithmic dependency between any two tokens in a sequence, and seamless integration with existing Transformer-based optimization as a drop-in replacement for standard attention. The paper suggests that LONGNET opens up new possibilities for modeling very long sequences, such as treating an entire corpus or even the entire Internet as a sequence.
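
To make the idea concrete, below is a minimal, single-head sketch of dilated attention in PyTorch. It assumes a simplified variant: the sequence is split into segments of length w, every r-th token is kept within each segment, dense attention runs over each sparsified segment, and the branches for the different (w, r) pairs are simply averaged rather than weighted by their softmax denominators as in the paper. The function name `dilated_attention` and its parameters are illustrative, not the official implementation.

```python
# A minimal, single-head sketch of dilated attention (illustrative only).
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_lengths=(2048, 4096), dilation_rates=(1, 2)):
    """q, k, v: (batch, seq_len, dim). seq_len must be divisible by each segment length."""
    b, n, d = q.shape
    out = torch.zeros_like(q)
    for w, r in zip(segment_lengths, dilation_rates):
        # Split the sequence into segments of length w.
        qs = q.view(b, n // w, w, d)
        ks = k.view(b, n // w, w, d)
        vs = v.view(b, n // w, w, d)
        # Within each segment, keep every r-th token (the dilation).
        idx = torch.arange(0, w, r, device=q.device)
        q_sel, k_sel, v_sel = qs[:, :, idx], ks[:, :, idx], vs[:, :, idx]
        # Dense (causal) attention over the sparsified segment.
        attn = F.scaled_dot_product_attention(q_sel, k_sel, v_sel, is_causal=True)
        # Scatter the outputs back to their original positions; branches are
        # averaged here instead of being weighted as in the paper.
        scatter = torch.zeros_like(qs)
        scatter[:, :, idx] = attn
        out = out + scatter.view(b, n, d) / len(segment_lengths)
    return out

# Example usage (shapes are arbitrary):
# q = k = v = torch.randn(1, 8192, 64)
# y = dilated_attention(q, k, v)
```

With a fixed ratio w/r across branches, each branch attends over only w/r tokens per segment, so the cost stays roughly linear in the sequence length; stacking branches whose segment lengths and dilation rates grow geometrically is what lets the attentive field expand to cover the whole sequence while keeping computation linear.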


Publication date: July 5, 2023
Project Page: https://aka.ms/LongNet
Paper: https://arxiv.org/pdf/2307.02486.pdf