Can Go AIs Be Adversarially Robust?

Abstract

Prior work found that superhuman Go AIs such as KataGo are vulnerable to opponents playing simple adversarial strategies. This shows that superhuman average-case capabilities may not translate to satisfactory worst-case robustness. However, Go AIs were never designed with security in mind, raising the question: can simple defenses make KataGo robust? In this paper, we test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find these defenses protect against previously discovered attacks, but we uncover several qualitatively distinct adversarial strategies that beat our defended agents. Our results suggest that achieving robustness is challenging, even in narrow domains such as Go. Our code is available at https://github.com/AlignmentResearch/go_attack.
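The abstract mentions iterated adversarial training as one of the tested defenses. As a rough illustration only, and not the authors' pipeline (see the linked go_attack repository for that), such a loop alternates between training an attacker against a frozen victim and fine-tuning the victim on the games it loses. The function names below are hypothetical placeholders for full KataGo-style training runs.

```python
# Hypothetical sketch of an iterated adversarial training loop -- illustrative
# only, not the authors' implementation. `train_adversary` and `finetune_victim`
# are stub placeholders standing in for full training runs.

def train_adversary(victim):
    """Stub: train an attacker against the frozen victim; return the games it wins."""
    return [f"game-exploiting-{victim}"]

def finetune_victim(victim, adversarial_games):
    """Stub: fine-tune the victim on positions from the games it lost."""
    return f"{victim}+patched({len(adversarial_games)} games)"

def iterated_adversarial_training(victim, rounds=3):
    for _ in range(rounds):
        winning_games = train_adversary(victim)          # attacker exploits current victim
        victim = finetune_victim(victim, winning_games)  # victim learns from its losses
    return victim

print(iterated_adversarial_training("victim-base"))
```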

Cite

Text

Tseng et al. "Can Go AIs Be Adversarially Robust?" ICML 2024 Workshops: NextGenAISafety, 2024.

Markdown

[Tseng et al. "Can Go AIs Be Adversarially Robust?" ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/tseng2024icmlw-go/)

BibTeX

@inproceedings{tseng2024icmlw-go,
  title     = {{Can Go AIs Be Adversarially Robust?}},
  author    = {Tseng, Tom and McLean, Euan and Pelrine, Kellin and Wang, Tony Tong and Gleave, Adam},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/tseng2024icmlw-go/}
}