Artificial Intelligence; Computation and Language

A Reply to Makelov et al. (2023)’s Interpretability Illusion Arguments

Posted by root, January 24, 2024

The authors respond to a recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods such as Distributed Alignment Search (DAS) and claims that these methods can produce interpretability illusions. They argue that…
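For readers new to the term, a subspace interchange intervention takes a hidden representation computed on a base input and overwrites only its component within a chosen linear subspace with the corresponding component from a source input. Everything else in the base representation is left untouched. The sketch below illustrates that single operation under stated assumptions. The subspace is supplied as a fixed orthonormal basis, whereas DAS learns the subspace through a trained rotation. The names `subspace_interchange` and `dot` are hypothetical and do not come from either paper.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def subspace_interchange(
    base: Sequence[float],
    source: Sequence[float],
    basis: Sequence[Sequence[float]],
) -> Vector:
    """Swap the span(basis) component of `base` for that of `source`.

    Assumption: `basis` holds orthonormal vectors chosen by hand (hypothetical);
    DAS would instead learn this subspace. Computes
        base + sum_i u_i * <u_i, source - base>,
    i.e. base + P (source - base), where P projects onto span(basis).
    """
    diff = [s - b for s, b in zip(source, base)]
    patched = list(base)
    for u in basis:
        # Coefficient of the base/source difference along basis direction u.
        coeff = dot(u, diff)
        for j, u_j in enumerate(u):
            patched[j] += coeff * u_j
    return patched
```

For instance, `subspace_interchange([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [[1.0, 0.0, 0.0]])` swaps in only the first coordinate and returns `[10.0, 2.0, 3.0]`. With a non-axis-aligned unit direction, the patched vector's projection onto that direction matches the source's, while its orthogonal complement still comes from the base input.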
