A Reply to Makelov et al. (2023)’s Interpretability Illusion Arguments
The authors respond to Makelov et al.’s recent paper, which reviews subspace interchange intervention methods such as Distributed Alignment Search (DAS) and claims that these methods could cause interpretability illusions. In response, they argue that…