From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
Abstract
Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use — via pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
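To make the observation/action framing concrete, below is a minimal sketch of a generic pixel-coordinate mouse-and-keyboard action space of the kind the abstract describes. The class names, fields, and action vocabulary here are illustrative assumptions, not the paper's exact interface.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ActionType(Enum):
    # Generic mouse/keyboard primitives; not the paper's exact vocabulary.
    CLICK = auto()
    DOUBLE_CLICK = auto()
    SCROLL = auto()
    KEY_PRESS = auto()
    TYPE_TEXT = auto()


@dataclass
class UIAction:
    """A single low-level action over a screenshot-based observation.

    Coordinates are expressed on the rendered screenshot, so the agent
    needs no access to the underlying HTML or accessibility tree.
    """
    action_type: ActionType
    x: Optional[int] = None    # pixel column for mouse actions
    y: Optional[int] = None    # pixel row for mouse actions
    key: Optional[str] = None  # key name for KEY_PRESS, e.g. "Enter"
    text: Optional[str] = None # string to type for TYPE_TEXT


# Example episode fragment: click a text box at pixel (120, 45),
# type a query, then press Enter.
episode = [
    UIAction(ActionType.CLICK, x=120, y=45),
    UIAction(ActionType.TYPE_TEXT, text="new york weather"),
    UIAction(ActionType.KEY_PRESS, key="Enter"),
]
```

The appeal of such an interface is that the same action schema applies to any GUI that can be rendered as pixels, independent of whether HTML or other structured representations are available.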
Cite
Text
Shaw et al. "From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces." Neural Information Processing Systems, 2023.
Markdown
[Shaw et al. "From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/shaw2023neurips-pixels/)
BibTeX
@inproceedings{shaw2023neurips-pixels,
  title = {{From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces}},
  author = {Shaw, Peter and Joshi, Mandar and Cohan, James and Berant, Jonathan and Pasupat, Panupong and Hu, Hexiang and Khandelwal, Urvashi and Lee, Kenton and Toutanova, Kristina N},
  booktitle = {Neural Information Processing Systems},
  year = {2023},
  url = {https://mlanthology.org/neurips/2023/shaw2023neurips-pixels/}
}