This research article addresses the problem of over-smoothing in deep transformer models, where token representations become increasingly similar, and ultimately nearly identical, as model depth grows. The authors propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens of self-attention and the input tokens. Minimizing the resulting regularized functional yields the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a new class of transformer models that mitigates over-smoothing. The authors demonstrate NeuTRENO's advantages over baseline transformers on practical tasks including object classification, image segmentation, and language modeling.
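To make the idea concrete, below is a minimal PyTorch sketch of a self-attention layer augmented with a NeuTRENO-style correction term. The module name `NeuTRENOStyleAttention`, the weight `lam`, and the choice of using the first layer's value vectors as a proxy for the "input tokens" are illustrative assumptions based on the summary above, not the authors' exact implementation (see the paper for the precise formulation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuTRENOStyleAttention(nn.Module):
    """Self-attention with a NeuTRENO-style correction term (illustrative sketch).

    The extra term lam * (v_first - v) pulls the attention output back toward
    the value vectors of the first layer, counteracting the tendency of deep
    stacks of attention layers to smooth all tokens toward one representation.
    """

    def __init__(self, dim, num_heads=8, lam=0.6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.lam = lam  # hypothetical regularization weight, not the paper's tuned value
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, v_first=None):
        # x: (batch, seq_len, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # Standard scaled dot-product self-attention.
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v

        # NeuTRENO-style correction: steer the output back toward the value
        # vectors of the first layer (a proxy for the input tokens).
        if v_first is None:
            v_first = v  # first layer: the correction vanishes
        out = out + self.lam * (v_first - v)

        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.proj(out), v_first


# Usage: the first layer returns its value vectors, which deeper layers reuse.
layer = NeuTRENOStyleAttention(dim=64)
x = torch.randn(2, 16, 64)
out, v1 = layer(x)        # correction is zero at the first layer
out2, _ = layer(out, v1)  # deeper layers are pulled back toward v1
```

In a full model, each block would receive the first layer's value vectors, so every layer's output stays anchored to the original token representations rather than collapsing toward a common average.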

Publication date: 1 Dec 2023
Project Page: https://arxiv.org/abs/2312.00751
Paper: https://arxiv.org/pdf/2312.00751