Off-Team Learning
Abstract
Zero-shot coordination (ZSC) evaluates an algorithm by the performance of a team of agents that were trained independently under that algorithm. Off-belief learning (OBL) is a recent method that achieves state-of-the-art results in ZSC in the game Hanabi. However, the implementation of OBL relies on a belief model that experiences covariate shift. Moreover, during ad-hoc coordination, OBL or any other neural policy may experience test-time covariate shift. We present two methods addressing these issues. The first method, off-team belief learning (OTBL), attempts to improve the accuracy of the belief model of a target policy π_T on a broader range of inputs by weighting trajectories approximately according to the distribution induced by a different policy π_b. The second, off-team off-belief learning (OT-OBL), attempts to compute an OBL equilibrium, where the fixed-point error is weighted according to the distribution induced by cross-play between the training policy π and a different fixed policy π_b instead of self-play of π. We investigate these methods in variants of Hanabi.
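To make the trajectory-reweighting idea behind OTBL more concrete, the sketch below shows one plausible way to train a belief model on data generated by a target policy π_T while importance-weighting each trajectory toward the distribution induced by a different policy π_b. This is not the authors' implementation; the names (BeliefModel, trajectory_weight, weighted_belief_loss) and all shapes are hypothetical assumptions for illustration only.

```python
import torch
import torch.nn as nn

class BeliefModel(nn.Module):
    """Hypothetical belief model: predicts a distribution over a hidden state
    (e.g. a player's own hand in Hanabi) from features of the observed trajectory."""
    def __init__(self, obs_dim: int, hidden_dim: int, num_hidden_states: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_hidden_states),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # logits over possible hidden states

def trajectory_weight(logp_target: torch.Tensor, logp_other: torch.Tensor) -> torch.Tensor:
    """Importance weight of one trajectory generated by pi_T:
    prod_t pi_b(a_t | tau_t) / pi_T(a_t | tau_t), computed from per-step
    log-probabilities of the actions actually taken under each policy."""
    return torch.exp((logp_other - logp_target).sum())

def weighted_belief_loss(model: BeliefModel,
                         obs: torch.Tensor,
                         true_hidden: torch.Tensor,
                         weights: torch.Tensor) -> torch.Tensor:
    """Cross-entropy belief loss with each sample reweighted toward pi_b's distribution."""
    per_sample = nn.functional.cross_entropy(model(obs), true_hidden, reduction="none")
    return (weights * per_sample).mean()

# Toy usage (all shapes and numbers are placeholders):
model = BeliefModel(obs_dim=64, hidden_dim=128, num_hidden_states=25)
obs = torch.randn(32, 64)                  # batch of observation features
true_hidden = torch.randint(0, 25, (32,))  # ground-truth hidden states
weights = torch.stack([trajectory_weight(torch.randn(10), torch.randn(10))
                       for _ in range(32)])
loss = weighted_belief_loss(model, obs, true_hidden, weights)
loss.backward()
```

In practice, raw importance weights like these can have high variance and would typically be clipped or normalized; the sketch is only meant to show where the π_b/π_T reweighting enters the belief-model loss, not to reproduce the paper's "approximate" weighting scheme.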
Cite
Text
Cui et al. "Off-Team Learning." Neural Information Processing Systems, 2022.
Markdown
[Cui et al. "Off-Team Learning." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/cui2022neurips-offteam/)
BibTeX
@inproceedings{cui2022neurips-offteam,
title = {{Off-Team Learning}},
author = {Cui, Brandon and Hu, Hengyuan and Lupu, Andrei and Sokota, Samuel and Foerster, Jakob},
booktitle = {Neural Information Processing Systems},
year = {2022},
url = {https://mlanthology.org/neurips/2022/cui2022neurips-offteam/}
}