Embodied Question Answering

Abstract

We present a new AI task - Embodied Question Answering (EmbodiedQA) - where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'). In order to answer, the agent must first intelligently navigate to explore the environment, gather the necessary visual information through first-person (egocentric) vision, and then answer the question ('orange'). EmbodiedQA requires a range of AI skills - language understanding, visual recognition, active perception, goal-driven navigation, commonsense reasoning, long-term memory, and grounding language into actions. In this work, we develop a dataset of questions and answers in House3D environments [1], evaluation metrics, and a hierarchical model trained with imitation and reinforcement learning.
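The episode structure the abstract describes (spawn, navigate, observe, stop, answer) can be made concrete with a short sketch. The Python below is illustrative only: the environment interface (env.reset and env.step returning an egocentric frame), the action set, and the RandomAgent placeholder are hypothetical stand-ins, not the authors' actual House3D or model code.

import random

class RandomAgent:
    """Placeholder policy: chooses navigation actions at random.
    The paper's hierarchical model would go here instead."""
    ACTIONS = ("forward", "turn-left", "turn-right", "stop")

    def act(self, question, frame):
        return random.choice(self.ACTIONS)

    def answer(self, question, frames):
        # A trained model would predict the answer from the question
        # and the egocentric frames gathered along the route.
        return "orange"

def run_episode(env, agent, question, max_steps=100):
    frame = env.reset()             # agent is spawned at a random location
    frames = [frame]                # first-person observations collected so far
    for _ in range(max_steps):
        action = agent.act(question, frame)
        if action == "stop":        # the agent decides it has seen enough
            break
        frame = env.step(action)    # move and receive a new egocentric view
        frames.append(frame)
    return agent.answer(question, frames)

In this framing, navigation and answering are separate decisions, which is why the task exercises both goal-driven exploration and visual question answering.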

Cite

Text

Das et al. "Embodied Question Answering." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018. doi:10.1109/CVPRW.2018.00279

Markdown

[Das et al. "Embodied Question Answering." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.](https://mlanthology.org/cvprw/2018/das2018cvprw-embodied/) doi:10.1109/CVPRW.2018.00279

BibTeX

@inproceedings{das2018cvprw-embodied,
  title     = {{Embodied Question Answering}},
  author    = {Das, Abhishek and Datta, Samyak and Gkioxari, Georgia and Lee, Stefan and Parikh, Devi and Batra, Dhruv},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2018},
  pages     = {2054--2063},
  doi       = {10.1109/CVPRW.2018.00279},
  url       = {https://mlanthology.org/cvprw/2018/das2018cvprw-embodied/}
}