The paper addresses the issue of unfaithful explanations produced by visual question answering (VQA) systems. Current post-hoc methods that generate natural language explanations (NLE) often fail to align with human logical inference, leading to problems such as deductive unsatisfiability, factual inconsistency, and insensitivity to semantic perturbations. To address these issues, the authors propose a self-supervised Multi-level Contrastive Learning based natural language Explanation model (MCLE) for VQA. The model uses factual and counterfactual samples at the semantic, image, and instance levels to extract discriminative features and to align the feature space of explanations with those of the visual question and answer. This approach is shown to generate more faithful and consistent explanations.
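
The paper's exact loss formulation is not reproduced in this summary; as a rough sketch of how one level of such a contrastive objective might look, the snippet below assumes an InfoNCE-style loss that pulls an explanation embedding toward its factual (question, image, answer) embedding and pushes it away from counterfactual embeddings. All names, shapes, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of one level of a multi-level contrastive objective.
# Assumed interface: precomputed embeddings for explanations, their factual
# counterparts (positives), and counterfactual perturbations (negatives).
import torch
import torch.nn.functional as F

def contrastive_loss(expl_emb: torch.Tensor,
                     factual_emb: torch.Tensor,
                     counterfactual_embs: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """expl_emb:            (B, D) explanation features
    factual_emb:         (B, D) matching QA/image features (positives)
    counterfactual_embs: (B, K, D) K perturbed negatives per sample
    """
    # Cosine similarities, scaled by a temperature hyperparameter.
    expl = F.normalize(expl_emb, dim=-1)
    pos = F.normalize(factual_emb, dim=-1)
    neg = F.normalize(counterfactual_embs, dim=-1)

    pos_sim = (expl * pos).sum(-1, keepdim=True) / temperature     # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', expl, neg) / temperature  # (B, K)

    # InfoNCE: cross-entropy with the positive fixed at index 0.
    logits = torch.cat([pos_sim, neg_sim], dim=1)                  # (B, 1+K)
    labels = torch.zeros(expl.size(0), dtype=torch.long, device=expl.device)
    return F.cross_entropy(logits, labels)

# A full MCLE-style objective would presumably sum such losses over the
# semantic, image, and instance levels, each with its own sample sets:
#   total = sum(contrastive_loss(e, f, cf) for (e, f, cf) in levels)
```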

Publication date: 22 Dec 2023
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2312.13594