Off-Team Learning

Abstract

Zero-shot coordination (ZSC) evaluates an algorithm by the performance of a team of agents that were trained independently under that algorithm. Off-belief learning (OBL) is a recent method that achieves state-of-the-art results in ZSC in the game Hanabi. However, the implementation of OBL relies on a belief model that experiences covariate shift. Moreover, during ad-hoc coordination, OBL or any other neural policy may experience test-time covariate shift. We present two methods addressing these issues. The first method, off-team belief learning (OTBL), attempts to improve the accuracy of the belief model of a target policy π_T on a broader range of inputs by weighting trajectories approximately according to the distribution induced by a different policy π_b. The second, off-team off-belief learning (OT-OBL), attempts to compute an OBL equilibrium, where the fixed-point error is weighted according to the distribution induced by cross-play between the training policy π and a different fixed policy π_b instead of self-play of π. We investigate these methods in variants of Hanabi.
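
Schematically, the two objectives described in the abstract can be written as follows, using notation introduced here rather than taken from the paper: B_θ is the belief model, D_π the distribution over trajectories induced by self-play of π, D_{π, π_b} the distribution induced by cross-play between π and π_b, s the hidden information the belief predicts, and ε_OBL(π, τ) the OBL fixed-point error at trajectory τ.

$$
\mathcal{L}_{\text{OTBL}}(\theta) \;\approx\; \mathbb{E}_{\tau \sim \mathcal{D}_{\pi_b}}\!\big[-\log B_\theta(s \mid \tau)\big]
\quad\text{instead of}\quad
\mathbb{E}_{\tau \sim \mathcal{D}_{\pi_T}}\!\big[-\log B_\theta(s \mid \tau)\big],
$$

$$
\mathcal{L}_{\text{OT-OBL}}(\pi) \;=\; \mathbb{E}_{\tau \sim \mathcal{D}_{\pi,\pi_b}}\!\big[\varepsilon_{\text{OBL}}(\pi,\tau)\big]
\quad\text{instead of}\quad
\mathbb{E}_{\tau \sim \mathcal{D}_{\pi}}\!\big[\varepsilon_{\text{OBL}}(\pi,\tau)\big].
$$

The "≈" in the first line reflects that OTBL only approximately matches the distribution induced by π_b; this is a sketch of the abstract's description, and the paper's exact losses and weighting scheme may differ.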

Cite

Text

Cui et al. "Off-Team Learning." Neural Information Processing Systems, 2022.

Markdown

[Cui et al. "Off-Team Learning." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/cui2022neurips-offteam/)

BibTeX

@inproceedings{cui2022neurips-offteam,
  title     = {{Off-Team Learning}},
  author    = {Cui, Brandon and Hu, Hengyuan and Lupu, Andrei and Sokota, Samuel and Foerster, Jakob},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/cui2022neurips-offteam/}
}