The paper presents a method for explaining text classifiers using counterfactual representations. A counterfactual is a hypothetical event identical to a real observation except for the value of one categorical feature. Constructing counterfactuals for text is challenging because some attribute values may not correspond to plausible real-world events. The authors instead generate counterfactuals by intervening in the space of text representations, arguing that such interventions are minimally disruptive and theoretically sound. The method is validated on a synthetic dataset and in a real-world scenario, demonstrating its potential for explaining classifiers and mitigating bias.
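To make the idea of intervening in representation space concrete, here is a minimal illustrative sketch (not the paper's actual method): assuming a sensitive attribute is encoded along a single direction in the embedding space, a counterfactual representation can be built by swapping the component along that direction while leaving everything else unchanged. The variable names and the linear-direction assumption are purely for illustration.

```python
import numpy as np

def counterfactual_representation(h, concept_dir, target_value):
    """Return a copy of representation h whose component along
    concept_dir is set to target_value, leaving the orthogonal
    complement untouched (a toy linear intervention)."""
    u = concept_dir / np.linalg.norm(concept_dir)
    # Remove the current concept component, then add the target one.
    return h - (h @ u) * u + target_value * u

rng = np.random.default_rng(0)
h = rng.normal(size=8)   # stand-in for a text representation
u = rng.normal(size=8)   # hypothetical direction encoding the attribute
h_cf = counterfactual_representation(h, u, target_value=1.0)

# The counterfactual now projects to the target value on the concept axis.
print(round(float(h_cf @ (u / np.linalg.norm(u))), 6))
```

Feeding `h_cf` (instead of `h`) to the classifier head and comparing predictions is the basic recipe for probing how much the attribute drives the decision; the paper's contribution lies in making such interventions principled and minimally disruptive.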
Publication date: 2 Feb 2024
Project Page: github.com/toinesayan/counterfactual-representations-for-explanation
Paper: https://arxiv.org/pdf/2402.00711