ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Shridhar, Mohit; Thomason, Jesse; Gordon, Daniel; Bisk, Yonatan; Han, Winson; Mottaghi, Roozbeh; Zettlemoyer, Luke; Fox, Dieter

doi:10.1109/CVPR42600.2020.01075

CVPR 2020

Abstract

We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
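
As a rough illustration of the structure the abstract describes (a high-level goal, low-level instructions, and an expert action sequence), the following Python sketch may help. It is not ALFRED's actual data schema: the `Directive` class, its field names, the `summarize` helper, and the action strings are all hypothetical and chosen only for this example. The two language strings are the ones quoted in the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Directive:
    """Hypothetical container for one annotated task (not ALFRED's real schema)."""
    goal: str                 # high-level goal in natural language
    instructions: List[str]   # low-level, step-by-step language instructions
    # Expert demonstration as a list of action names (placeholder values only).
    actions: List[str] = field(default_factory=list)


def summarize(d: Directive) -> str:
    """Return a one-line summary of a directive and its sizes."""
    return f"{d.goal} ({len(d.instructions)} instruction(s), {len(d.actions)} action(s))"


example = Directive(
    goal="Rinse off a mug and place it in the coffee maker.",
    instructions=["Walk to the coffee maker on the right."],
    actions=["MoveAhead", "RotateRight", "MoveAhead"],  # made-up sequence for illustration
)

print(summarize(example))
```

A model for this kind of benchmark would take the goal, the instructions, and egocentric images as input and predict the action sequence; the sketch above only shows how the language and action parts relate.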

Cite

Text

Shridhar et al. "ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01075

Markdown

[Shridhar et al. "ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/shridhar2020cvpr-alfred/) doi:10.1109/CVPR42600.2020.01075

BibTeX

@inproceedings{shridhar2020cvpr-alfred,
  title     = {{ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks}},
  author    = {Shridhar, Mohit and Thomason, Jesse and Gordon, Daniel and Bisk, Yonatan and Han, Winson and Mottaghi, Roozbeh and Zettlemoyer, Luke and Fox, Dieter},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.01075},
  url       = {https://mlanthology.org/cvpr/2020/shridhar2020cvpr-alfred/}
}