Better Rates for Any Adversarial Deterministic MDP
Abstract
We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state of the art forward in two ways. First, it attains a regret of O(T^2/3) with respect to the best fixed policy in hindsight, improving on the previous best bound of O(T^3/4). Second, the algorithm and its analysis apply to any feasible ADMDP graph topology, whereas all previous approaches required additional restrictions on the graph topology.
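To make the setting concrete, here is a minimal toy sketch of an ADMDP and of regret against the best fixed policy in hindsight. All names, the graph, and the loss sequence are illustrative assumptions for exposition; this is not the paper's algorithm, and the naive learner below simply commits to one fixed policy.

```python
import itertools

# Toy deterministic MDP: each (state, action) pair leads to a unique next
# state, so the dynamics form a directed graph over the states.
transition = {
    0: {"a": 1, "b": 2},
    1: {"a": 2, "b": 0},
    2: {"a": 0, "b": 1},
}
states = list(transition)
actions = ["a", "b"]
T = 12  # horizon

def adv_loss(t, s, a):
    """Adversarially chosen per-round loss in [0, 1] (a fixed toy sequence)."""
    return ((t + s + ord(a)) % 3) / 2.0

def run_policy(policy, start=0):
    """Cumulative loss of a fixed stationary policy (state -> action)."""
    s, total = start, 0.0
    for t in range(T):
        a = policy[s]
        total += adv_loss(t, s, a)
        s = transition[s][a]
    return total

# Best fixed policy in hindsight: enumerate all |A|^|S| stationary policies.
all_policies = [dict(zip(states, choice))
                for choice in itertools.product(actions, repeat=len(states))]
best = min(run_policy(p) for p in all_policies)

# A naive learner that commits to a single fixed policy; since that policy
# is itself in the comparator class, its regret is nonnegative.
learner = run_policy({s: "a" for s in states})
regret = learner - best
```

Under bandit feedback the learner would only observe the loss of the edges it actually traverses, not the whole table, and brute-force enumeration of policies is of course only feasible in a toy instance like this one.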
Cite
Text
Dekel and Hazan. "Better Rates for Any Adversarial Deterministic MDP." International Conference on Machine Learning, 2013.

Markdown
[Dekel and Hazan. "Better Rates for Any Adversarial Deterministic MDP." International Conference on Machine Learning, 2013.](https://mlanthology.org/icml/2013/dekel2013icml-better/)

BibTeX
@inproceedings{dekel2013icml-better,
  title     = {{Better Rates for Any Adversarial Deterministic MDP}},
  author    = {Dekel, Ofer and Hazan, Elad},
  booktitle = {International Conference on Machine Learning},
  year      = {2013},
  pages     = {675-683},
  volume    = {28},
  url       = {https://mlanthology.org/icml/2013/dekel2013icml-better/}
}