Approximate Policy Iteration with Bisimulation Metrics

Abstract

Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Due to this property they provide theoretical guarantees in value function approximation (VFA). In this work we first prove that bisimulation and $\pi$-bisimulation metrics can be defined via a more general class of Sinkhorn distances, which unifies various state similarity metrics used in recent work. Then we describe an approximate policy iteration (API) procedure that uses a bisimulation-based discretization of the state space for VFA and prove asymptotic performance bounds. Next, we bound the difference between $\pi$-bisimulation metrics in terms of the change in the policies themselves. Based on these results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. We discuss how such API procedures map onto practical actor-critic methods that use bisimulation metrics for state representation learning. Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs.
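
To make the objects mentioned in the abstract concrete, the sketch below computes a $\pi$-bisimulation metric for a small finite MDP by iterating its fixed-point equation, with the Wasserstein term replaced by an entropically regularized (Sinkhorn) distance, in the spirit of the abstract's connection between bisimulation metrics and Sinkhorn distances. This is only an illustrative sketch, not the authors' implementation: the function names, the toy MDP, and the hyperparameters (`gamma`, `eps`, iteration counts) are assumptions chosen for the example.

```python
# Illustrative sketch (not the authors' code): pi-bisimulation metric for a
# small finite MDP via fixed-point iteration, with the 1-Wasserstein term
# approximated by an entropically regularized (Sinkhorn) distance.
import numpy as np

def sinkhorn_distance(cost, p, q, eps=0.1, n_iters=200):
    """Entropic OT cost between distributions p and q under ground cost
    matrix `cost`, using standard Sinkhorn scaling iterations."""
    K = np.exp(-cost / eps)
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v + 1e-30)
        v = q / (K.T @ u + 1e-30)
    plan = np.outer(u, v) * K          # approximate transport plan
    return float(np.sum(plan * cost))  # transport cost under `cost`

def pi_bisimulation_metric(r_pi, P_pi, gamma=0.9, tol=1e-5, max_iters=500):
    """Iterate d(s, s') = |r_pi(s) - r_pi(s')| + gamma * W(P_pi(.|s), P_pi(.|s')),
    where W is the Sinkhorn approximation above. r_pi: (n,) rewards under pi;
    P_pi: (n, n) state-to-state transition matrix under pi."""
    n = len(r_pi)
    d = np.zeros((n, n))
    reward_diff = np.abs(r_pi[:, None] - r_pi[None, :])
    for _ in range(max_iters):
        d_new = np.zeros_like(d)
        for s in range(n):
            for t in range(s + 1, n):
                w = sinkhorn_distance(d, P_pi[s], P_pi[t])
                d_new[s, t] = d_new[t, s] = reward_diff[s, t] + gamma * w
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# Toy 3-state MDP under a fixed policy pi (values chosen arbitrarily).
r_pi = np.array([0.0, 0.1, 1.0])
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.1, 0.9]])
d_pi = pi_bisimulation_metric(r_pi, P_pi)
print(np.round(d_pi, 3))  # approximate pi-bisimulation distances between states
```

The resulting distance matrix is the kind of state similarity structure that could drive a bisimulation-based discretization of the state space for value function approximation; the regularization strength `eps` trades off closeness to the exact Wasserstein-based metric against numerical stability in this toy setup.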

Cite

Text

Kemertas and Jepson. "Approximate Policy Iteration with Bisimulation Metrics." Transactions on Machine Learning Research, 2022.

Markdown

[Kemertas and Jepson. "Approximate Policy Iteration with Bisimulation Metrics." Transactions on Machine Learning Research, 2022.](https://mlanthology.org/tmlr/2022/kemertas2022tmlr-approximate/)

BibTeX

@article{kemertas2022tmlr-approximate,
  title     = {{Approximate Policy Iteration with Bisimulation Metrics}},
  author    = {Kemertas, Mete and Jepson, Allan Douglas},
  journal   = {Transactions on Machine Learning Research},
  year      = {2022},
  url       = {https://mlanthology.org/tmlr/2022/kemertas2022tmlr-approximate/}
}