The authors respond to Makelov et al.’s recent paper, which reviews subspace interchange intervention methods such as Distributed Alignment Search (DAS) and claims that these methods could cause interpretability illusions. They argue that the illusions Makelov et al. observe are artifacts of Makelov et al.’s own training and evaluation paradigms. Despite disagreeing with this core characterization, the authors emphasize that Makelov et al.’s examples and discussion have pushed the field of interpretability forward.

Publication date: 23 Jan 2024
Project Page: https://arxiv.org/abs/2401.12631v1
Paper: https://arxiv.org/pdf/2401.12631