The paper discusses the limitations of current stagewise pretraining methods for large language models and proposes a new framework, progressive subnetwork training. The focus is on a simple instantiation of this framework, Random Path Training (RAPTR), which trains a random sub-path of the network's layers at each step and progressively increases the path length across stages. RAPTR achieves better pretraining loss for BERT and UL2 language models while requiring 20-33% fewer FLOPs than standard training. It also improves downstream performance for UL2, with gains of 1-5% on QA tasks and SuperGLUE relative to standard training and stacking. The authors further provide a theoretical basis for RAPTR, justifying the increasing complexity of subnetworks across stages and attributing the stability of the loss at stage transitions to residual connections and layer normalization.
Publication date: 9 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.05913
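
Below is a minimal sketch of the progressive random-path idea described above, assuming a small Transformer encoder in PyTorch. Names such as `PathTransformer` and `sample_path`, the stage schedule, and the uniform path-sampling rule are illustrative placeholders, not the paper's implementation; they are meant only to show how skipped layers act as identity maps thanks to residual connections, while path length grows stage by stage.

```python
# Sketch of progressive subnetwork training with random layer paths.
# All hyperparameters and the toy objective are placeholders.
import random
import torch
import torch.nn as nn

class PathTransformer(nn.Module):
    def __init__(self, num_layers=12, d_model=256, nhead=4, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens, path):
        # Layers not on the sampled path are skipped entirely, i.e. they
        # behave as identity maps; residual connections and layer norm
        # keep the subnetwork's output scale close to the full model's,
        # which is what stabilizes loss across stage transitions.
        x = self.embed(tokens)
        for idx in path:
            x = self.layers[idx](x)
        return self.head(self.norm(x))

def sample_path(num_layers, path_len):
    # Sample a random ordered sub-path of `path_len` layers (uniformly,
    # for illustration; the paper may fix or favor certain layers).
    return sorted(random.sample(range(num_layers), path_len))

model = PathTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
num_layers = len(model.layers)

# Stage schedule: progressively longer paths, ending with the full network.
# The step counts here are arbitrary, not the paper's schedule.
stages = [(6, 25), (8, 25), (10, 25), (num_layers, 25)]

for path_len, steps in stages:
    for _ in range(steps):
        tokens = torch.randint(0, 1000, (8, 64))   # toy batch of token ids
        path = sample_path(num_layers, path_len)
        logits = model(tokens, path)
        loss = nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), tokens.view(-1)
        )  # toy objective: reconstruct the input tokens
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In the final stage the path covers all layers, so training finishes on the full network; earlier stages spend FLOPs only on the sampled sub-paths, which is where the reported compute savings come from.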