This academic article examines the challenge machines face in understanding defeasible commonsense norms in a visual context. The authors introduce a new multimodal benchmark, NormLens, consisting of 10,000 human judgments with free-form explanations covering 2,000 multimodal situations. The benchmark is meant to gauge how well models align with average human judgment and how well they can explain their predicted judgments. The study finds that current state-of-the-art models are not well aligned with human annotations. The authors also propose a new approach for better aligning models with humans by distilling social commonsense knowledge from large language models.
Publication date: 17 Oct 2023
Project Page: https://seungjuhan.me/normlens
Paper: https://arxiv.org/pdf/2310.10418