The paper presents Toloka Visual Question Answering, a crowdsourced dataset designed to test how well machine learning systems perform on the grounding visual question answering task: given an image and a textual question, the system must draw a bounding box around the object that correctly answers the question. The paper also describes the data collection process and evaluates current pre-trained and fine-tuned models on this task. To date, none of the evaluated models has outperformed the non-expert crowdsourcing baseline.

 

Publication date: 28 Sep 2023
Project Page: https://arxiv.org/abs/2309.16511
Paper: https://arxiv.org/pdf/2309.16511