LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

The article presents LLM-Grounder, a method for open-vocabulary 3D visual grounding that uses a large language model (LLM) as an agent. The LLM first decomposes a complex natural language query into semantic constituents, then calls a visual grounding tool, such as OpenScene or LERF, to propose candidate objects in the 3D scene. Finally, the LLM evaluates the spatial and commonsense relations among the proposed objects to make the grounding decision. The method requires no labeled training data and can generalize to new 3D scenes and arbitrary text queries. On the ScanRefer benchmark, the authors show that the LLM significantly improves grounding capability, especially for complex language queries.
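The decompose, ground, then reason loop described above can be illustrated with a small toy sketch. This is not the authors' implementation: the `Candidate` and `Plan` types, `toy_grounder`, and `select_target` are hypothetical names invented for illustration. The LLM's decomposition is hard-coded as a `Plan`, the grounder is a label lookup standing in for a tool such as OpenScene or LERF, and only a "near" relation is handled, using distance between box centers.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """A proposed object: label, 3D box center, and grounder confidence."""
    label: str
    center: Tuple[float, float, float]
    score: float


@dataclass
class Plan:
    """Hypothetical output of the LLM's query decomposition."""
    target: str    # object the query asks for
    landmark: str  # reference object mentioned in the query
    relation: str  # only "near" is handled in this sketch


def toy_grounder(phrase: str, scene: List[Candidate]) -> List[Candidate]:
    """Stand-in for an open-vocabulary grounder; a real tool would match
    the phrase against visual features rather than exact labels."""
    return [c for c in scene if c.label == phrase]


def select_target(plan: Plan, scene: List[Candidate]) -> Optional[Candidate]:
    """Pick the target candidate that best satisfies the spatial relation."""
    targets = toy_grounder(plan.target, scene)
    landmarks = toy_grounder(plan.landmark, scene)
    if not targets:
        return None
    # Without a usable landmark or relation, fall back to grounder confidence.
    if not landmarks or plan.relation != "near":
        return max(targets, key=lambda c: c.score)
    landmark = max(landmarks, key=lambda c: c.score)
    # "near": choose the target whose center is closest to the landmark.
    return min(targets, key=lambda c: math.dist(c.center, landmark.center))


# Toy scene for the query "the chair near the window".
scene = [
    Candidate("chair", (0.0, 0.0, 0.0), 0.9),
    Candidate("chair", (4.0, 0.0, 0.0), 0.6),
    Candidate("window", (5.0, 0.0, 1.0), 0.8),
]
plan = Plan(target="chair", landmark="window", relation="near")
chosen = select_target(plan, scene)
print(chosen)
```

In this toy scene, picking by grounder confidence alone would return the chair at the origin, while the relation check selects the lower-scoring chair beside the window. That gap between matching a noun and satisfying the whole query is the one the summary says the LLM agent is meant to close.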

Publication date: 22 Sep 2023
Project Page: https://chat-with-nerf.github.io/
Paper: https://arxiv.org/pdf/2309.12311



