The paper introduces SInViG, a self-evolving interactive visual agent designed to improve human-robot interaction by resolving language ambiguity through multi-turn visual-language dialogues. The system learns continuously from unlabeled images and large language models, without human intervention, making it more robust to visual and linguistic complexity. Thanks to this self-evolving training, it achieves new state-of-the-art results on several interactive visual grounding benchmarks. The paper also discusses how individual users' distinct preferences call for a multi-modal approach to human-robot communication.
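To make the interactive-grounding idea concrete, below is a minimal Python sketch of the kind of multi-turn disambiguation loop the summary describes: the agent either grounds the instruction to a unique image region or asks a clarifying question, and the user's answer is folded back into the dialogue. All names here (`ground_or_ask`, `interact`, `AgentReply`) are hypothetical illustrations under stated assumptions, not the authors' actual API or model.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a multi-turn interactive visual grounding loop.
# Names, signatures, and stopping logic are illustrative assumptions,
# not SInViG's implementation.

@dataclass
class AgentReply:
    grounded: bool               # True -> box holds the resolved target
    box: tuple | None = None     # (x1, y1, x2, y2) in pixels
    question: str | None = None  # clarifying question when still ambiguous

@dataclass
class Dialogue:
    instruction: str
    turns: list = field(default_factory=list)  # (question, answer) pairs

def ground_or_ask(image, dialogue: Dialogue) -> AgentReply:
    """Stand-in for the visual-language model: given the image and the
    dialogue so far, either return a unique box or ask a question that
    narrows down the referent."""
    if not dialogue.turns:  # toy proxy for "instruction is still ambiguous"
        return AgentReply(grounded=False,
                          question="Do you mean the red cup or the blue one?")
    return AgentReply(grounded=True, box=(40, 60, 120, 160))

def interact(image, instruction: str, ask_user, max_turns: int = 3):
    """Multi-turn disambiguation: query the model, relay its questions to
    the user, and stop once the instruction is grounded or turns run out."""
    dialogue = Dialogue(instruction)
    for _ in range(max_turns):
        reply = ground_or_ask(image, dialogue)
        if reply.grounded:
            return reply.box
        answer = ask_user(reply.question)
        dialogue.turns.append((reply.question, answer))
    return None  # still ambiguous after max_turns

if __name__ == "__main__":
    box = interact(image=None,
                   instruction="pick up the cup",
                   ask_user=lambda q: input(q + " "))
    print("grounded box:", box)
```

In the paper's self-evolving setting, a loop like this would be driven by a learned model rather than the placeholder above, with new training dialogues generated automatically from unlabeled images and large language models.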
Publication date: 20 Feb 2024
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2402.11792