The study by Eshaan Nichani, Alex Damian, and Jason D. Lee from Princeton University investigates how transformers learn causal structure with gradient descent. The authors attribute much of transformers' success on sequence modeling tasks to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. The paper introduces an in-context learning task that requires learning a latent causal structure, and it proves that gradient descent on a simplified two-layer transformer solves this task by encoding the latent causal graph in the first attention layer. Experiments with transformers trained on the in-context learning task confirm the theoretical findings, showing that the trained models recover a variety of causal structures.
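To make the setup concrete, below is a minimal sketch (not the authors' code) of the special case discussed in the paper where sequences are drawn from in-context Markov chains: each sequence has its own random transition matrix, and a two-layer attention-only transformer is trained by gradient descent to predict the next token. The architecture here uses standard PyTorch attention rather than the paper's simplified "disentangled" parameterization, and all sizes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, DIM = 10, 64, 32  # illustrative sizes, not taken from the paper


def sample_markov_batch(batch_size):
    """Each sequence is drawn from its own random transition matrix,
    so the causal (chain) structure must be inferred in context."""
    P = torch.distributions.Dirichlet(torch.ones(VOCAB)).sample((batch_size, VOCAB))
    seqs = torch.zeros(batch_size, SEQ_LEN, dtype=torch.long)
    seqs[:, 0] = torch.randint(VOCAB, (batch_size,))
    for t in range(1, SEQ_LEN):
        probs = P[torch.arange(batch_size), seqs[:, t - 1]]
        seqs[:, t] = torch.multinomial(probs, 1).squeeze(-1)
    return seqs


class TwoLayerAttention(nn.Module):
    """Two attention-only layers: in the paper's analysis, the first layer
    can encode the latent causal graph and the second retrieves the
    relevant token for next-token prediction."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Parameter(torch.randn(SEQ_LEN, DIM) * 0.02)
        self.attn1 = nn.MultiheadAttention(DIM, 1, batch_first=True)
        self.attn2 = nn.MultiheadAttention(DIM, 1, batch_first=True)
        self.readout = nn.Linear(DIM, VOCAB)

    def forward(self, x):
        h = self.embed(x) + self.pos[: x.shape[1]]
        # Causal mask: True entries are disallowed attention positions.
        mask = torch.triu(torch.ones(x.shape[1], x.shape[1], dtype=torch.bool), 1)
        h = h + self.attn1(h, h, h, attn_mask=mask)[0]
        h = h + self.attn2(h, h, h, attn_mask=mask)[0]
        return self.readout(h)


model = TwoLayerAttention()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):  # short run, for illustration only
    seqs = sample_markov_batch(64)
    logits = model(seqs[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), seqs[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this special case, the paper shows the trained model behaves like an induction head: the first layer attends from each token to its predecessor, and the second layer uses that information to predict the next token from the in-context transition statistics.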


Publication date: 23 Feb 2024
Project Page: https://arxiv.org/abs/2402.14735v1
Paper: https://arxiv.org/pdf/2402.14735