Despite the widespread use of weight decay in training deep networks, its role is not well understood. This study demonstrates that in overparameterized deep networks, weight decay modifies the optimization dynamics by enhancing the implicit regularization of stochastic gradient descent (SGD). In contrast, in underparameterized large language models (LLMs) trained with nearly online SGD, weight decay balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss. The study thus offers a unified perspective on the function of weight decay in deep learning, from ResNets on vision tasks to LLMs.
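For reference, the update rule under discussion is plain SGD with (coupled) weight decay, θ ← θ − η(∇L(θ) + λθ). Below is a minimal, illustrative sketch of that update on a toy regression problem; the function name, data, and hyperparameters are assumptions for demonstration only and are not taken from the paper or its repository.

```python
import numpy as np

def sgd_weight_decay_step(theta, grad, lr=0.1, wd=1e-4):
    """One SGD step with coupled weight decay:
    theta <- theta - lr * (grad + wd * theta).

    `grad` is a stochastic gradient of the training loss at `theta`;
    `lr` and `wd` are illustrative values, not the paper's settings.
    """
    return theta - lr * (grad + wd * theta)

# Toy usage: noisy linear regression. Weight decay shrinks the parameter
# norm during training (all hyperparameters here are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=256)

theta = np.zeros(10)
for step in range(1000):
    idx = rng.integers(0, 256, size=32)      # sample a minibatch
    residual = X[idx] @ theta - y[idx]
    grad = X[idx].T @ residual / len(idx)    # stochastic gradient of MSE/2
    theta = sgd_weight_decay_step(theta, grad)

print("final parameter norm:", np.linalg.norm(theta))
```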

Publication date: 6 Oct 2023
Project Page: https://github.com/tml-epfl/why-weight-decay
Paper: https://arxiv.org/pdf/2310.04415