The paper introduces MMToM-QA, a multimodal Theory of Mind (ToM) question-answering benchmark, together with a new method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models), for engineering multimodal ToM capacity. MMToM-QA evaluates machine ToM on both unimodal and multimodal data about a person's activity in a household environment. The paper compares BIP-ALM against human performance and state-of-the-art models, including GPT-4. It concludes that while current large language and multimodal models lack robust ToM capacity, BIP-ALM shows promise by combining model-based mental inference with the power of language models.
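To make the idea of Bayesian inverse planning concrete, here is a minimal sketch (not the paper's BIP-ALM implementation): a posterior over a person's goals is computed from a prior and per-action likelihoods. In BIP-ALM those likelihoods are scored by a language model; here the goals, actions, and likelihood values are purely hypothetical.

```python
# Minimal sketch of Bayesian inverse planning: infer P(goal | actions)
# proportional to P(goal) * product over t of P(action_t | goal).
# All goal names, actions, and likelihood values below are hypothetical,
# chosen only to illustrate the inference step.

def posterior_over_goals(prior, likelihoods, observed_actions):
    """Return a normalized posterior over goals given observed actions."""
    scores = {}
    for goal, p_goal in prior.items():
        score = p_goal
        for action in observed_actions:
            score *= likelihoods[goal][action]  # P(action | goal)
        scores[goal] = score
    total = sum(scores.values())
    return {goal: s / total for goal, s in scores.items()}

# Hypothetical household scenario: is the person fetching an apple or a cup?
prior = {"get_apple": 0.5, "get_cup": 0.5}
likelihoods = {
    "get_apple": {"walk_to_kitchen": 0.6, "open_fridge": 0.7},
    "get_cup":   {"walk_to_kitchen": 0.6, "open_fridge": 0.1},
}
posterior = posterior_over_goals(
    prior, likelihoods, ["walk_to_kitchen", "open_fridge"]
)
# Opening the fridge is far more likely under "get_apple", so that goal
# dominates the posterior.
```

Swapping the hand-written likelihood table for likelihoods scored by a language model is, at a high level, the acceleration step the method's name refers to.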
Publication date: 16 Jan 2024
Project Page: https://chuanyangjin.com/mmtom-qa
Paper: https://arxiv.org/pdf/2401.08743