This research from Naver Labs Europe targets goal-oriented visual navigation trained with large-scale machine learning. The main challenge is twofold: learning compact representations that generalize to unseen environments, and learning high-capacity perception modules that can reason over high-dimensional input. The study introduces a new dual encoder built on a large-capacity binocular ViT model and shows that correspondence solutions emerge from the training signals, which significantly improves performance on ImageNav and Instance-ImageNav. The authors argue that the approach addresses what they see as a key bottleneck in perception: wide-baseline relative pose estimation and visibility prediction in complex scenes.
Publication date: 28 Sep 2023
Project Page: https://arxiv.org/abs/2309.16634
Paper: https://arxiv.org/pdf/2309.16634