The study presents a method for dynamically selecting which layers of a deep transformer network to train, in order to reduce the number of trainable parameters. Reinforcement Learning is used to decide, for each layer, whether to train its weights independently or to copy them from an earlier layer. This decision promotes weight sharing and acts as a form of regularization. Experimental evaluations show that the model achieves modest improvements in perplexity over the baseline transformer while significantly reducing both the number of trainable parameters and memory consumption during training.
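The core mechanism can be illustrated with a small sketch. This is not the paper's implementation: the function names are hypothetical, and a seeded random policy stands in for the learned RL agent that would score the train-vs-copy actions. It only shows the structure of the decision (each layer either trains its own weights or is tied to an earlier layer) and how tying shrinks the trainable-parameter count.

```python
import random

def sample_tying_policy(num_layers, epsilon=0.5, seed=0):
    """Sketch of per-layer train-vs-copy decisions.

    For each layer i, an RL agent would choose an action:
      - action == i : train layer i's weights independently
      - action == j < i : tie (copy) layer i's weights to layer j
    Here a seeded random policy is a stand-in for the learned agent.
    """
    rng = random.Random(seed)
    actions = [0]  # the first layer always trains its own weights
    for i in range(1, num_layers):
        if rng.random() < epsilon:
            actions.append(i)                 # train independently
        else:
            actions.append(rng.randrange(i))  # tie to an earlier layer
    return actions

def trainable_layer_count(actions):
    # Tied layers share one parameter tensor, so only layers mapped to
    # themselves contribute independent trainable parameters.
    return sum(1 for i, a in enumerate(actions) if a == i)
```

With, say, 12 layers and a policy that mostly chooses "copy", only a handful of layers hold independent weights, which is the source of the parameter and memory savings the paper reports.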

Publication date: 23 Jan 2024
Abstract: https://arxiv.org/abs/2401.12819v1
Paper: https://arxiv.org/pdf/2401.12819