The paper by Bobby He & Thomas Hofmann focuses on simplifying transformer blocks in deep learning. The authors ask whether components such as skip connections, projection/value matrices, sequential sub-blocks, and normalization layers can be removed without slowing down training. Their experiments show that the simplified transformers match the per-update training speed and performance of standard transformers, while achieving 15% faster training throughput and using 15% fewer parameters. They highlight the role of signal propagation theory in motivating these modifications, as well as its limitations, on its own, for understanding the training dynamics of deep neural networks.
Publication date: 6 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.01906
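
To make the described simplifications concrete, below is a minimal, hypothetical PyTorch sketch of a block with no skip connections, no normalization layers, identity value/projection matrices, and the MLP applied in parallel with attention. It is only illustrative: the class name, single-head setup, and hyperparameters are assumptions, and the paper's shaped-attention initialization and multi-head details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedBlock(nn.Module):
    """Illustrative sketch (not the paper's exact block): skip connections,
    normalization layers, and the value/output projections are removed,
    and the MLP runs in parallel with attention."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Only query/key projections remain; value and output projections
        # are fixed to the identity, so attention mixes the tokens directly.
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        # Parallel MLP sub-block (no separate sequential sub-block).
        self.ff1 = nn.Linear(d_model, d_ff)
        self.ff2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); note there is no LayerNorm and
        # no residual branch anywhere in this block.
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)
        attn_out = attn @ x                      # identity value/projection
        mlp_out = self.ff2(F.gelu(self.ff1(x)))  # parallel MLP path
        return attn_out + mlp_out


# Usage example with assumed toy dimensions.
block = SimplifiedBlock(d_model=64, d_ff=256)
out = block(torch.randn(2, 10, 64))              # -> shape (2, 10, 64)
```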