Scaling Offline Q-Learning with Vision Transformers

Abstract

It has been shown that offline RL methods, such as conservative Q-learning (CQL), scale favorably for training generalist agents with a ResNet backbone. Recent vision and natural language processing research shows that transformer-based models scale more favorably than domain-specific models with strong inductive biases (such as convolutional and recurrent neural networks). In this paper, we investigate how well Vision Transformers (ViTs) serve as backbones for CQL when training single-game agents. We enhance the ViT for image-based RL by introducing spatio-temporal attention layers, and we further investigate the impact of various embedding-sequence aggregation methods on ViT performance. Overall, our modified ViT outperforms standard ViTs in the single-game Atari setting.
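The aggregation question above concerns how the sequence of ViT token embeddings is reduced to a single feature vector before the Q-value head. As a rough illustration only (not the authors' implementation), the sketch below contrasts two common choices, a learned CLS token and mean pooling over patch tokens, in PyTorch; the `TokenAggregator` module and its parameters are hypothetical names introduced here for clarity.

```python
import torch
import torch.nn as nn


class TokenAggregator(nn.Module):
    """Reduce a ViT token sequence (B, N, D) to a single feature (B, D).

    Hypothetical sketch: 'cls' prepends a learned token and returns its
    final embedding; 'mean' averages the patch-token embeddings.
    """

    def __init__(self, dim: int, mode: str = "mean"):
        super().__init__()
        assert mode in ("cls", "mean")
        self.mode = mode
        if mode == "cls":
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def prepend_cls(self, tokens: torch.Tensor) -> torch.Tensor:
        # Call before the transformer encoder when using CLS aggregation.
        if self.mode != "cls":
            return tokens
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        if self.mode == "cls":
            return tokens[:, 0]       # embedding of the CLS token
        return tokens.mean(dim=1)     # mean over all patch tokens


# Usage: aggregate encoder outputs before the Q-value head.
agg = TokenAggregator(dim=256, mode="mean")
encoded = torch.randn(32, 64, 256)    # (batch, tokens, dim) from a ViT encoder
features = agg(encoded)               # (32, 256), fed to the Q-head
```

Which aggregation works best is exactly the kind of design choice the paper ablates; the sketch only makes the two options concrete.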

Cite

Text

Miao et al. "Scaling Offline Q-Learning with Vision Transformers." NeurIPS 2023 Workshops: FMDM, 2023.

Markdown

[Miao et al. "Scaling Offline Q-Learning with Vision Transformers." NeurIPS 2023 Workshops: FMDM, 2023.](https://mlanthology.org/neuripsw/2023/miao2023neuripsw-scaling/)

BibTeX

@inproceedings{miao2023neuripsw-scaling,
  title     = {{Scaling Offline Q-Learning with Vision Transformers}},
  author    = {Miao, Yingjie and Orbay, Jordi and Agarwal, Rishabh and Kumar, Aviral and Tucker, George and Faust, Aleksandra},
  booktitle = {NeurIPS 2023 Workshops: FMDM},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/miao2023neuripsw-scaling/}
}