The paper presents a method for leveraging pretrained Vision Language Models (VLMs) to annotate 3D objects, aggregating answers while accounting for the object's full appearance, the phrasing of the question, and other factors that affect the VLM's responses. The study shows that this aggregation approach can outperform summarization by a language model and improve downstream VLM predictions. The method was evaluated on the large-scale Objaverse dataset, showing that VLMs can approach the quality of human-verified type and material annotations without additional training.
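The summary above does not spell out how per-view answers are combined, so the following is only a minimal sketch of one plausible reading: score-based marginalization of VLM label probabilities over multiple rendered views of an object. The function names and the mock `score_labels_for_view` scorer are hypothetical placeholders; in practice that step would query a pretrained VLM for image-conditioned label likelihoods.

```python
import numpy as np

# Hypothetical stand-in for a VLM query: in practice this would return the
# VLM's log-probabilities for each candidate label given one rendered view.
def score_labels_for_view(view_image, candidate_labels):
    rng = np.random.default_rng(abs(hash(view_image)) % (2**32))
    logits = rng.normal(size=len(candidate_labels))
    return logits - np.logaddexp.reduce(logits)  # normalize to log-probs

def aggregate_over_views(view_images, candidate_labels):
    """Marginalize over views: average per-view label probabilities,
    treating each rendered view as an equally weighted observation."""
    per_view_logprobs = np.stack(
        [score_labels_for_view(v, candidate_labels) for v in view_images]
    )
    # log of the mean probability across views
    agg = np.logaddexp.reduce(per_view_logprobs, axis=0) - np.log(len(view_images))
    best_label = candidate_labels[int(np.argmax(agg))]
    return best_label, agg

# Example usage with hypothetical renders and candidate object types.
views = ["render_front.png", "render_side.png", "render_top.png"]
labels = ["chair", "table", "lamp"]
prediction, scores = aggregate_over_views(views, labels)
print(prediction, scores)
```

The same averaging could in principle also run over question phrasings, which is one way to read the claim that the approach outperforms using a language model to summarize per-view answers.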
Publication date: 29 Nov 2023
arXiv Page: https://arxiv.org/abs/2311.17851v1
Paper: https://arxiv.org/pdf/2311.17851