Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction

Sowmya, Emani Naga Sai Venkata; Kesari, Amit; Joseph, Ajin George

TMLR 2026

Abstract

We study the off-policy value prediction problem in reinforcement learning, where the value function of a target policy is estimated from sample trajectories generated by a behaviour policy. Importance sampling is a standard tool for correcting action-level mismatch between behaviour and target policies. However, it only addresses single-step discrepancies. It cannot correct steady-state bias, which arises from long-horizon differences between the state-visitation distributions of the behaviour and target policies. In this paper, we propose an off-policy value-prediction algorithm under linear function approximation that explicitly corrects discrepancies in state visitation distributions. We provide rigorous theoretical guarantees for the resulting estimator. In particular, we prove asymptotic convergence under Markov noise and show that the corrected update matrix has favourable spectral properties that ensure stability. We also derive an error decomposition showing that the estimation error is bounded by a constant multiple of the best achievable approximation error in the function class. This constant depends transparently on the quality of the distribution estimate and the choice of features. Empirical evaluation across multiple benchmark domains demonstrates that our method effectively mitigates steady-state bias and can be a robust alternative to existing methods in scenarios where distributional shift is critical.
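
The distinction between action-level and state-level correction can be illustrated on a toy problem. The sketch below is not the algorithm proposed in the paper; it is a generic, model-based illustration under stated assumptions. The 3-state MDP, the policies `pi` and `mu`, the features `Phi`, and the helper names are all hypothetical, and the state-density ratio is computed exactly from a known model, whereas in practice it would have to be estimated. With per-step importance sampling alone, the expected linear TD(0) update is weighted by the behaviour policy's stationary distribution; reweighting each state by the ratio of target to behaviour stationary distributions recovers the on-policy TD fixed point.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    n = P.shape[0]
    # Stack d^T P = d^T with the normalisation sum(d) = 1 and solve.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return d

def td_fixed_point(Phi, P_pi, r_pi, weights, gamma):
    """Solve Phi^T D (I - gamma P_pi) Phi theta = Phi^T D r_pi, D = diag(weights)."""
    D = np.diag(weights)
    A = Phi.T @ D @ (np.eye(len(weights)) - gamma * P_pi) @ Phi
    b = Phi.T @ D @ r_pi
    return np.linalg.solve(A, b)

# Hypothetical toy MDP: P[a, s, s'] are transition probabilities, r[s, a] rewards.
P = np.array([
    [[0.9, 0.1, 0.0], [0.5, 0.4, 0.1], [0.5, 0.1, 0.4]],  # action 0: drift to state 0
    [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],  # action 1: move forward
])
r = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
pi = np.array([[0.1, 0.9]] * 3)  # target policy (assumed)
mu = np.array([[0.9, 0.1]] * 3)  # behaviour policy (assumed)
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # linear features (assumed)
gamma = 0.9

# State-to-state dynamics and expected rewards under each policy.
P_pi = np.einsum("sa,ast->st", pi, P)
P_mu = np.einsum("sa,ast->st", mu, P)
r_pi = (pi * r).sum(axis=1)

d_pi = stationary_distribution(P_pi)
d_mu = stationary_distribution(P_mu)

# Per-step importance sampling fixes the action mismatch, so the expected update
# uses P_pi and r_pi, but states are still visited according to d_mu.
theta_is_only = td_fixed_point(Phi, P_pi, r_pi, d_mu, gamma)

# Additionally reweighting each state by d_pi / d_mu restores the d_pi weighting.
ratio = d_pi / d_mu
theta_corrected = td_fixed_point(Phi, P_pi, r_pi, d_mu * ratio, gamma)

# Reference: the on-policy TD fixed point.
theta_on_policy = td_fixed_point(Phi, P_pi, r_pi, d_pi, gamma)
```

Comparing `theta_is_only` with `theta_on_policy` shows the effect of the state-visitation mismatch on this toy problem, while `theta_corrected` matches `theta_on_policy` up to floating-point error because the exact ratio is used here.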

Cite

Text

Sowmya et al. "Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction." Transactions on Machine Learning Research, 2026.

Markdown

[Sowmya et al. "Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/sowmya2026tmlr-mitigating/)

BibTeX

@article{sowmya2026tmlr-mitigating,
  title     = {{Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction}},
  author    = {Sowmya, Emani Naga Sai Venkata and Kesari, Amit and Joseph, Ajin George},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/sowmya2026tmlr-mitigating/}
}