A Reply to Makelov et al. (2023)’s Interpretability Illusion Arguments

This paper responds to Makelov et al. (2023), who review subspace interchange intervention methods such as Distributed Alignment Search (DAS) and claim that these methods can cause interpretability illusions. The reply's authors argue that what Makelov et al. describe as illusions are artifacts of their training and evaluation paradigms. They also emphasize that Makelov et al.’s examples and discussion have pushed the field of interpretability forward, even though they disagree with the paper's core characterization.
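For readers unfamiliar with the term, the following is a minimal sketch of what a subspace interchange intervention does in the simplest case, a one-dimensional subspace. It is an illustration only, not code from either paper: the function names, the toy vectors, and the hand-picked direction are all assumptions. In DAS the subspace is learned rather than chosen by hand.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def patch_along_direction(base, source, direction):
    """Return `base` with its component along `direction` replaced by
    the corresponding component of `source`.

    Everything orthogonal to `direction` is left as it was in `base`.
    (Hypothetical helper for illustration; not an API from either paper.)
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    # How far the source activation sits from the base activation
    # along the chosen direction, in units of `direction`.
    coeff = (dot(source, direction) - dot(base, direction)) / norm_sq
    return [b + coeff * d for b, d in zip(base, direction)]


# Toy activations (assumed values, chosen only to make the arithmetic easy).
base = [1.0, 2.0, 3.0]
source = [4.0, 0.0, -1.0]
direction = [1.0, 1.0, 0.0]

patched = patch_along_direction(base, source, direction)
```

After the intervention, `patched` agrees with `source` along `direction` and with `base` everywhere orthogonal to it. The disagreement between the two papers concerns how to interpret the behavioral effect of such an intervention, not this basic operation.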

Publication date: 23 Jan 2024
Project Page: https://arxiv.org/abs/2401.12631v1
Paper: https://arxiv.org/pdf/2401.12631

activation patching, artificial neural networks, distributed alignment search, interpretability illusion, polysemantic neurons
