Combining Online and Offline Knowledge in UCT

Abstract

The UCT algorithm learns a value function online using sample-based search. The TD(λ) algorithm can learn a value function offline for the on-policy distribution. We consider three approaches for combining offline and online value functions in the UCT algorithm. First, the offline value function is used as a default policy during Monte-Carlo simulation. Second, the UCT value function is combined with a rapid online estimate of action values. Third, the offline value function is used as prior knowledge in the UCT search tree. We evaluate these algorithms in 9 × 9 Go against GnuGo 3.7.10. The first algorithm performs better than UCT with a random simulation policy, but surprisingly, worse than UCT with a weaker, handcrafted simulation policy. The second algorithm outperforms UCT altogether. The third algorithm outperforms UCT with handcrafted prior knowledge. We combine these algorithms in MoGo, the world's strongest 9 × 9 Go program. Each technique significantly improves MoGo's playing strength.
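The second and third approaches in the abstract can be sketched in a few lines: a tree node whose statistics are seeded by an offline value function (prior knowledge), and whose value blends a rapid online estimate with the slower Monte-Carlo average. This is a minimal illustrative sketch, not the paper's exact formulation; the class, the name `combined_value`, and the simple `beta = k / (k + visits)` decay schedule are assumptions chosen for clarity.

```python
import math


class Node:
    """A UCT tree node carrying both Monte-Carlo and rapid (RAVE-style) statistics.

    The field names and the beta schedule are illustrative assumptions,
    not the exact scheme used in the paper.
    """

    def __init__(self, prior_value=0.0, prior_visits=0):
        # Third approach from the abstract: seed the node with "virtual"
        # visits drawn from an offline-learned value function.
        self.visits = prior_visits
        self.value_sum = prior_value * prior_visits
        self.rave_visits = 0
        self.rave_sum = 0.0

    def mc_value(self):
        """Mean Monte-Carlo return (0 if the node is unvisited)."""
        return self.value_sum / self.visits if self.visits else 0.0

    def rave_value(self):
        """Rapid online estimate of the action value (0 if no samples)."""
        return self.rave_sum / self.rave_visits if self.rave_visits else 0.0

    def combined_value(self, k=1000.0):
        """Second approach: blend the rapid estimate with the UCT value.

        beta decays toward 0 as real visits accumulate, so the search
        trusts the unbiased Monte-Carlo value in the limit.
        """
        beta = k / (k + self.visits)
        return beta * self.rave_value() + (1.0 - beta) * self.mc_value()


def uct_score(node, parent_visits, c=math.sqrt(2.0)):
    """Standard UCB1-style selection rule used by UCT."""
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    return node.combined_value() + c * math.sqrt(
        math.log(parent_visits) / node.visits
    )
```

Seeding a node with prior visits makes the offline knowledge behave like evidence that the search can gradually override, while the beta schedule lets the fast but biased online estimate dominate early and fade as real simulations accumulate.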

Cite

Text

Gelly and Silver. "Combining Online and Offline Knowledge in UCT." International Conference on Machine Learning, 2007. doi:10.1145/1273496.1273531

Markdown

[Gelly and Silver. "Combining Online and Offline Knowledge in UCT." International Conference on Machine Learning, 2007.](https://mlanthology.org/icml/2007/gelly2007icml-combining/) doi:10.1145/1273496.1273531

BibTeX

@inproceedings{gelly2007icml-combining,
  title     = {{Combining Online and Offline Knowledge in UCT}},
  author    = {Gelly, Sylvain and Silver, David},
  booktitle = {International Conference on Machine Learning},
  year      = {2007},
  pages     = {273--280},
  doi       = {10.1145/1273496.1273531},
  url       = {https://mlanthology.org/icml/2007/gelly2007icml-combining/}
}