Despite the widespread use of weight decay in training deep networks, its role is not well understood. This study demonstrates that in overparameterized deep networks, weight decay modifies the optimization dynamics by enhancing the implicit regularization of stochastic gradient descent (SGD). In contrast, in underparameterized large language models (LLMs) trained with nearly online SGD, weight decay balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss. The study thus offers a unified perspective on the function of weight decay in deep learning, from ResNets on vision tasks to LLMs.
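For reference, the update rule under discussion is plain SGD with (coupled) weight decay, θ ← θ − η(∇L(θ) + λθ). Below is a minimal, illustrative sketch of that update on a toy regression problem; the function name, data, and hyperparameters are assumptions for demonstration only and are not taken from the paper or its repository.

```python
import numpy as np

def sgd_weight_decay_step(theta, grad, lr=0.1, wd=1e-4):
    """One SGD step with coupled weight decay:
    theta <- theta - lr * (grad + wd * theta).

    `grad` is a stochastic gradient of the training loss at `theta`;
    `lr` and `wd` are illustrative values, not the paper's settings.
    """
    return theta - lr * (grad + wd * theta)

# Toy usage: noisy linear regression. Weight decay shrinks the parameter
# norm during training (all hyperparameters here are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=256)

theta = np.zeros(10)
for step in range(1000):
    idx = rng.integers(0, 256, size=32)      # sample a minibatch
    residual = X[idx] @ theta - y[idx]
    grad = X[idx].T @ residual / len(idx)    # stochastic gradient of MSE/2
    theta = sgd_weight_decay_step(theta, grad)

print("final parameter norm:", np.linalg.norm(theta))
```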

Publication date: 6 Oct 2023
Project Page: https://github.com/tml-epfl/why-weight-decay
Paper: https://arxiv.org/pdf/2310.04415