The paper presents MultiRankIt, a novel approach to the Learning-to-Rank Physical Objects (LTRPO) task: retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting. MultiRankIt uses a Crossmodal Noun Phrase Encoder and a Crossmodal Region Feature Encoder to model the relationships between phrases, target objects, and their surrounding environment. It is evaluated on a new dataset of complex instructions paired with images of real indoor environments, where it outperforms the baseline method. The study also includes physical experiments with a domestic service robot in a real-world setting, which achieved an 80% success rate for object retrieval.
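To make the ranking formulation concrete, the following is a minimal sketch of scoring candidate objects against an instruction and sorting them by score. It is not the MultiRankIt architecture: the two crossmodal encoders are replaced here by hand-written toy vectors, cosine similarity is an assumed scoring function, and the names `cosine`, `rank_objects`, and the example objects are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_objects(instruction_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar.

    `candidates` maps an object name to its embedding. In a real system the
    embeddings would come from learned encoders; here they are toy values
    (an assumption made purely for illustration).
    """
    scored = [(name, cosine(instruction_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings standing in for encoder outputs (hypothetical values).
instruction = [1.0, 0.0, 1.0]
candidates = {
    "towel": [0.9, 0.1, 0.8],
    "mug": [0.0, 1.0, 0.1],
    "lamp": [0.5, 0.5, 0.0],
}

ranking = rank_objects(instruction, candidates)
top_names = [name for name, _ in ranking]
print(top_names)
```

In a human-in-the-loop setting, a ranked list like this would plausibly be shown to the user, who then selects the intended object from the top candidates instead of relying on a single automatic prediction.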
Publication date: 29 Dec 2023
Project Page: https://github.com/keio-smilab23/MultiRankIt
Paper: https://arxiv.org/pdf/2312.15844