The paper studies how transformer-based models learn in context from unstructured data in linear regression tasks. It combines experiments on transformer architectures with theoretical explanations. Key findings: a transformer with two layers of softmax attention under a look-ahead attention mask can learn from the prompt, positional encoding can further improve performance, and multi-head attention with a large input embedding dimension achieves better prediction performance than single-head attention.
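For concreteness, below is a minimal PyTorch sketch (not the authors' code) of the kind of setup the summary describes: prompts made of (x, y) example pairs followed by a query x, fed to a small two-layer multi-head softmax-attention model with a look-ahead (causal) mask that is trained to predict the query's label. All dimensions, the number of heads, the embedding width, the learning rate, and the helper names (`make_prompt`, `TwoLayerAttention`) are illustrative assumptions, and positional encoding is omitted for brevity.

```python
# Minimal sketch of in-context linear regression with a two-layer,
# multi-head softmax-attention model using a look-ahead (causal) mask.
import torch
import torch.nn as nn

d, n_examples, batch = 5, 20, 64      # feature dim, in-context examples, batch size (assumed)
embed_dim, n_heads = 64, 4            # illustrative embedding width and head count

def make_prompt(batch, n_examples, d):
    """Build prompts: each token is an (x_i, y_i) pair; the last token is the query x with its label zeroed out."""
    w = torch.randn(batch, d, 1)                      # per-prompt regression weights
    x = torch.randn(batch, n_examples + 1, d)         # in-context inputs plus one query
    y = (x @ w).squeeze(-1)                           # noiseless linear targets
    tokens = torch.cat([x, y.unsqueeze(-1)], dim=-1)  # token = [x_i, y_i]
    tokens[:, -1, -1] = 0.0                           # hide the query's label
    return tokens, y[:, -1]                           # sequence and the label to predict

class TwoLayerAttention(nn.Module):
    """Two layers of multi-head softmax attention with a look-ahead (causal) mask."""
    def __init__(self, d, embed_dim, n_heads):
        super().__init__()
        self.embed = nn.Linear(d + 1, embed_dim)
        self.attn1 = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.readout = nn.Linear(embed_dim, 1)

    def forward(self, tokens):
        h = self.embed(tokens)
        seq_len = h.size(1)
        # Look-ahead mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h, _ = self.attn1(h, h, h, attn_mask=mask)
        h, _ = self.attn2(h, h, h, attn_mask=mask)
        return self.readout(h[:, -1]).squeeze(-1)     # prediction at the query position

model = TwoLayerAttention(d, embed_dim, n_heads)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):                               # short training loop for illustration
    tokens, target = make_prompt(batch, n_examples, d)
    loss = ((model(tokens) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch, switching `n_heads` between 1 and a larger value is the kind of single-head versus multi-head comparison the paper's prediction-performance finding refers to; the exact architectures and training protocol used in the paper may differ.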


Publication date: 2 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.00743