Temporal-Difference Search in Computer Go
Abstract
Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, the value function is updated from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We apply temporal-difference search to the game of 9×9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9×9 Computer Go Server.
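The bootstrapped value update the abstract describes can be illustrated on a toy problem. The sketch below runs TD(0) with linear value-function approximation on a 5-state random walk, with one-hot state features standing in for the paper's million binary pattern features. This is a minimal illustration under assumed names and an assumed toy MDP, not the authors' implementation.

```python
import random

def td0_episode(theta, alpha=0.05, gamma=1.0, n_states=5):
    """One episode of a 5-state random walk with TD(0) updates.

    theta: one weight per state (one-hot features, so the state value
    is just theta[s]). Reward is 1 for exiting right, 0 for exiting left.
    """
    s = n_states // 2  # start in the middle state
    while True:
        s_next = s + random.choice((-1, 1))
        if s_next < 0:
            r, terminal = 0.0, True   # fell off the left edge
        elif s_next >= n_states:
            r, terminal = 1.0, True   # fell off the right edge
        else:
            r, terminal = 0.0, False
        v_next = 0.0 if terminal else theta[s_next]
        delta = r + gamma * v_next - theta[s]  # TD error: bootstrap from the next value estimate
        theta[s] += alpha * delta              # gradient step on the active feature
        if terminal:
            return
        s = s_next

random.seed(0)
theta = [0.0] * 5
for _ in range(5000):
    td0_episode(theta)
# The true values of this walk are 1/6, 2/6, ..., 5/6 from left to right;
# after many episodes theta approaches them.
```

Temporal-difference search applies this same update rule inside simulated episodes from the current position, rather than episodes of real experience.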
Cite

Text

Silver et al. "Temporal-Difference Search in Computer Go." Machine Learning, 2012. doi:10.1007/s10994-012-5280-0

Markdown

[Silver et al. "Temporal-Difference Search in Computer Go." Machine Learning, 2012.](https://mlanthology.org/mlj/2012/silver2012mlj-temporaldifference/) doi:10.1007/s10994-012-5280-0

BibTeX
@article{silver2012mlj-temporaldifference,
title = {{Temporal-Difference Search in Computer Go}},
author = {Silver, David and Sutton, Richard S. and Müller, Martin},
journal = {Machine Learning},
year = {2012},
pages = {183-219},
doi = {10.1007/s10994-012-5280-0},
volume = {87},
url = {https://mlanthology.org/mlj/2012/silver2012mlj-temporaldifference/}
}