This study investigates whether the recent trend of scaling up neural networks, in both dataset and model size, has made their internal workings easier to understand, a core goal of mechanistic interpretability. The researchers ran a psychophysical experiment across a diverse suite of vision models and found no correlation between scale and interpretability: modern, larger models were in fact less interpretable than older ones, suggesting a regression rather than an improvement. The paper highlights the need for models explicitly designed to be interpretable and for more effective interpretability methods. To support further research, the researchers also released a dataset of over 120,000 human responses collected in their experiment.
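
As a concrete illustration of the kind of analysis this finding implies, the sketch below computes a rank correlation between model scale and a per-model human interpretability score. It is a minimal, hypothetical example, not the authors' code: the model names, parameter counts, and accuracy scores are made-up placeholders, and the study's actual data and evaluation pipeline are available via the project page.

```python
# Minimal sketch (hypothetical data, not from the paper): test whether
# model scale correlates with how interpretable a model's units are to humans.
from scipy.stats import spearmanr

# Placeholder (model name, parameters in millions, mean human accuracy) rows;
# in the study, such scores would come from the released human-response dataset.
models = [
    ("older_small_model",   25, 0.72),
    ("mid_scale_model",     90, 0.70),
    ("modern_large_model", 630, 0.66),
]

scales = [params for _, params, _ in models]
scores = [score for _, _, score in models]

# Spearman's rho uses ranks, so it is robust to the wide range of model sizes.
rho, p_value = spearmanr(scales, scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```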

Publication date: July 11, 2023
Project Page: brendel-group.github.io/imi
Paper: https://arxiv.org/pdf/2307.05471.pdf