Probing in the Dark: State Entropy Maximization for POMDPs

Abstract

Sample efficiency is one of the main bottlenecks for optimal decision making via reinforcement learning. Pretraining a policy to maximize the entropy of the state visitation can substantially speed up reinforcement learning of downstream tasks. It is still an open question how to maximize the state entropy in POMDPs, where the true states of the environment, or their entropy, are not observed. In this work, we propose to maximize the entropy of a sufficient statistic of the history, which is called an information state. First, we show that a recursive latent model that predicts future observations is an information state in this setting. Then, we provide a practical algorithm, called LatEnt, to simultaneously learn the latent model and a latent-based policy maximizing the corresponding entropy objective from reward-free interactions with the POMDP. We empirically show that our approach induces higher state entropy than existing methods, which translates to better performance on downstream tasks. As a byproduct, we open-source PROBE, the first benchmark to test reward-free pretraining in POMDPs.

Cite

Text

Ashlag et al. "Probing in the Dark: State Entropy Maximization for POMDPs." International Conference on Learning Representations, 2026.

Markdown

[Ashlag et al. "Probing in the Dark: State Entropy Maximization for POMDPs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ashlag2026iclr-probing/)

BibTeX

@inproceedings{ashlag2026iclr-probing,
  title     = {{Probing in the Dark: State Entropy Maximization for POMDPs}},
  author    = {Ashlag, Yonatan and Mutti, Mirco and Tamar, Aviv and Levy, Kfir Yehuda},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ashlag2026iclr-probing/}
}