The authors investigate ‘Eureka-moments’ in transformers trained on multi-step tasks: abrupt jumps in performance that occur after training and validation loss have stagnated for a long time. They find that transformers struggle to learn the intermediate task required for the final one, whereas CNNs do not exhibit this plateau. The long stagnation is traced back to the Softmax function in the self-attention block of transformers. The researchers propose fixes that alleviate the problem, speeding up training and increasing the likelihood that the intermediate task is learned, which in turn yields higher final accuracy and greater robustness to hyper-parameters.
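For reference, the Softmax the paper points to sits inside standard scaled dot-product self-attention. The sketch below (PyTorch) marks that spot; the temperature knob `tau` is a hypothetical illustration of the kind of softmax adjustment such fixes target, not necessarily the authors' exact method.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v, tau=1.0):
    """Single-head scaled dot-product self-attention.

    x:            (batch, seq_len, d_model) input tokens
    w_q/w_k/w_v:  (d_model, d_k) projection matrices
    tau:          softmax temperature; tau > 1 flattens the attention
                  distribution (an illustrative knob, assumed here)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Attention logits; the softmax over them is where the paper
    # locates the cause of the long loss plateau.
    logits = q @ k.transpose(-2, -1) / (d_k ** 0.5)
    attn = F.softmax(logits / tau, dim=-1)
    return attn @ v

# Example usage with toy shapes:
x = torch.randn(2, 5, 16)
w = [torch.randn(16, 8) for _ in range(3)]
out = self_attention(x, *w, tau=2.0)  # -> (2, 5, 8)
```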

Publication date: 19 Oct 2023
Project Page: https://arxiv.org/abs/2310.12956
Paper: https://arxiv.org/pdf/2310.12956