FindThis: Language-Driven Object Disambiguation in Indoor Environments
Abstract
Natural language is naturally ambiguous. In this work, we consider interactions between a user and a mobile service robot tasked with locating a desired object, specified by a language utterance. We present a task, FindThis, which addresses the problem of how to disambiguate and locate the particular object instance desired through a dialog with the user. To approach this problem we propose an algorithm, GoFind, which exploits visual attributes of the object that may be intrinsic (e.g., color, shape) or extrinsic (e.g., location, relationships to other entities), expressed in an open vocabulary. GoFind leverages the visual common sense learned by large language models to enable fine-grained object localization and attribute differentiation in a zero-shot manner. We also provide a new visio-linguistic dataset, 3D Objects in Context (3DOC), for evaluating agents on this task, which consists of Google Scanned Objects placed in Habitat-Matterport 3D scenes. Finally, we validate our approach on a real robot operating in an unstructured physical office environment using complex fine-grained language instructions.
Cite
Text
Majumdar et al. "FindThis: Language-Driven Object Disambiguation in Indoor Environments." Conference on Robot Learning, 2023.
Markdown
[Majumdar et al. "FindThis: Language-Driven Object Disambiguation in Indoor Environments." Conference on Robot Learning, 2023.](https://mlanthology.org/corl/2023/majumdar2023corl-findthis/)
BibTeX
@inproceedings{majumdar2023corl-findthis,
title = {{FindThis: Language-Driven Object Disambiguation in Indoor Environments}},
author = {Majumdar, Arjun and Xia, Fei and Ichter, Brian and Batra, Dhruv and Guibas, Leonidas},
booktitle = {Conference on Robot Learning},
year = {2023},
pages = {1335--1347},
volume = {229},
url = {https://mlanthology.org/corl/2023/majumdar2023corl-findthis/}
}