The article presents LLM-Grounder, a zero-shot method for 3D visual grounding that uses a large language model (LLM) to decompose complex natural-language queries into semantic constituents. The LLM grounds each constituent with an open-vocabulary visual grounding tool, such as OpenScene or LERF, to propose candidate objects in the 3D scene, then evaluates the spatial and commonsense relations among the candidates to reach a final grounding decision. The method requires no labeled training data and generalizes to novel 3D scenes and arbitrary text queries. On the ScanRefer benchmark, the authors show that the LLM significantly improves grounding accuracy, especially for complex language queries.
Publication date: 22 Sep 2023
Project Page: https://chat-with-nerf.github.io/
Paper: https://arxiv.org/pdf/2309.12311
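The decompose-ground-evaluate pipeline described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `decompose` stands in for the LLM's query parsing (hard-coded here instead of an actual LLM call), `ground` stands in for a 3D grounding tool such as OpenScene or LERF, and the spatial check is reduced to centroid distance for a "near" relation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str
    center: tuple  # (x, y, z) centroid returned by the grounding tool
    score: float   # tool confidence for this candidate

def decompose(query):
    """Stand-in for the LLM step: split a query into a target phrase,
    a landmark phrase, and their relation. A real system would prompt
    an LLM; this parse is hard-coded for illustration."""
    return {"target": "chair", "landmark": "table", "relation": "near"}

def ground(phrase, scene):
    """Stand-in for a visual grounding tool: return all scene objects
    whose label matches the phrase."""
    return [c for c in scene if c.label == phrase]

def resolve(query, scene):
    """Decompose the query, ground each constituent separately, then
    score target candidates by spatial consistency with the landmark
    (here 'near' = smallest centroid distance)."""
    parts = decompose(query)
    targets = ground(parts["target"], scene)
    landmarks = ground(parts["landmark"], scene)

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a.center, b.center)) ** 0.5

    # Pick the target instance closest to any landmark instance.
    return min(targets, key=lambda t: min(dist(t, l) for l in landmarks))

# Toy scene: two chairs, one table. Only the second chair is near the table.
scene = [
    Candidate("chair", (0.0, 0.0, 0.0), 0.9),
    Candidate("chair", (5.0, 0.0, 0.0), 0.8),
    Candidate("table", (4.5, 0.5, 0.0), 0.95),
]
best = resolve("the chair near the table", scene)
print(best.center)  # → (5.0, 0.0, 0.0)
```

Note that the grounding tool alone would likely prefer the higher-confidence chair at the origin; the relation check is what lets the system pick the chair that actually satisfies "near the table", which is the gap LLM-Grounder targets.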