The study by Eshaan Nichani, Alex Damian, and Jason D. Lee from Princeton University investigates how transformers learn causal structure with gradient descent. The authors attribute much of transformers' success in sequence modeling to the self-attention mechanism, which enables information transfer between different parts of a sequence. The paper introduces an in-context learning task that requires learning latent causal structure and proves that gradient descent on a simplified two-layer transformer learns to solve it. Experiments with transformers trained on the in-context learning task confirm the theoretical findings, demonstrating that trained models recover a variety of causal structures. A hedged toy sketch of this kind of setup is given after the links below.
Publication date: 23 Feb 2024
Project Page: https://arxiv.org/abs/2402.14735v1
Paper: https://arxiv.org/pdf/2402.14735
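
To make the setting concrete, here is a minimal, hedged sketch (not the paper's exact construction): a two-layer attention-only transformer, written in PyTorch, trained with gradient descent on sequences whose tokens are noisy copies of a latent "parent" position. The data-generation process, model names, and hyperparameters below are illustrative assumptions chosen to mirror the flavor of the in-context causal-structure task; the paper's formal task and architecture differ in details.

```python
# Toy sketch (assumptions, not the paper's exact setup): sequences where each
# position noisily copies the token at a latent parent position, and a
# two-layer attention-only transformer trained to predict the next token.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, DIM = 16, 32, 64
torch.manual_seed(0)

# Latent causal structure: each position t > 0 has one random earlier parent.
parent = torch.tensor([0] + [int(torch.randint(0, t, (1,))) for t in range(1, SEQ_LEN)])

def sample_batch(batch_size):
    # Position 0 is uniform; position t copies x[parent[t]] with prob. 0.9.
    x = torch.zeros(batch_size, SEQ_LEN, dtype=torch.long)
    x[:, 0] = torch.randint(VOCAB, (batch_size,))
    for t in range(1, SEQ_LEN):
        noise = torch.randint(VOCAB, (batch_size,))
        keep = torch.rand(batch_size) < 0.9
        x[:, t] = torch.where(keep, x[:, parent[t]], noise)
    return x

class TwoLayerAttention(nn.Module):
    # Two causal attention layers with a linear readout; no MLP blocks,
    # mirroring the simplified architectures used in theoretical analyses.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(SEQ_LEN, DIM)
        self.attn1 = nn.MultiheadAttention(DIM, 1, batch_first=True)
        self.attn2 = nn.MultiheadAttention(DIM, 1, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, x):
        L = x.size(1)
        h = self.embed(x) + self.pos(torch.arange(L))
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        a1, w1 = self.attn1(h, h, h, attn_mask=mask)
        h = h + a1
        a2, _ = self.attn2(h, h, h, attn_mask=mask)
        return self.head(h + a2), w1  # also return layer-1 attention weights

model = TwoLayerAttention()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(500):
    x = sample_batch(64)
    logits, _ = model(x)
    # Next-token prediction: logits at position t-1 predict the token at t.
    loss = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, VOCAB), x[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# In this toy setup, the layer-1 attention at query position t-1 tends to
# concentrate near parent[t], i.e. the attention map roughly recovers the
# latent causal graph.
with torch.no_grad():
    _, w1 = model(sample_batch(8))
print(w1.mean(0).argmax(-1)[:-1])  # compare against parent[1:]
```

The final print compares where each query position attends most strongly with the latent parent array; rough agreement after training is the toy analogue of the paper's claim that trained attention recovers the causal structure.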