Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction
Abstract
We study off-policy value prediction in reinforcement learning, where the value function of a target policy is estimated from sample trajectories generated by a behaviour policy. Importance sampling is a standard tool for correcting the action-level mismatch between behaviour and target policies. However, it addresses only single-step discrepancies. It cannot correct steady-state bias, which arises because the behaviour policy's long-run state visitation differs from that of the target policy. In this paper, we propose an off-policy value-prediction algorithm under linear function approximation that explicitly corrects discrepancies in state visitation distributions. We provide rigorous theoretical guarantees for the resulting estimator. In particular, we prove asymptotic convergence under Markov noise and show that the corrected update matrix has favourable spectral properties that ensure stability. We also derive an error decomposition showing that the estimation error is bounded by a constant multiple of the best achievable approximation error in the function class. This constant depends transparently on the quality of the distribution estimate and the choice of features. Empirical evaluation across multiple benchmark domains demonstrates that our method effectively mitigates steady-state bias and can be a robust alternative to existing methods in scenarios where distributional shift is critical.
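To make the notion of steady-state bias concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a hypothetical 3-state, 2-action MDP, hypothetical features `Phi`, and exact stationary distributions in place of estimates, and it works with expected update matrices rather than sampled trajectories. It compares the linear TD(0) fixed point weighted by the behaviour policy's state distribution (what per-step importance sampling yields in expectation) with the fixed point weighted by the target policy's state distribution (what an ideal state-distribution correction would yield).

```python
import numpy as np

gamma = 0.9  # discount factor (assumed)

# Hypothetical MDP: P[a, s, s'] transition probabilities, R[s, a] rewards.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]],  # action 1
])
R = np.array([[0.0, 1.0], [0.0, 2.0], [1.0, 0.0]])

pi = np.array([[0.1, 0.9]] * 3)  # target policy pi[s, a] (assumed)
mu = np.array([[0.9, 0.1]] * 3)  # behaviour policy mu[s, a] (assumed)
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # hypothetical features

def induced_chain(policy):
    """State transition matrix and expected reward under a policy."""
    P_pol = np.einsum("sa,ast->st", policy, P)
    r_pol = (policy * R).sum(axis=1)
    return P_pol, r_pol

def stationary(P_pol):
    """Stationary distribution d solving d P = d with sum(d) = 1."""
    n = P_pol.shape[0]
    A = np.vstack([P_pol.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def td_fixed_point(d, P_tgt, r_tgt):
    """Solve A theta = b for the expected TD(0) update weighted by d."""
    D = np.diag(d)
    A = Phi.T @ D @ (np.eye(len(d)) - gamma * P_tgt) @ Phi
    b = Phi.T @ D @ r_tgt
    return np.linalg.solve(A, b), A

def weighted_norm(x, d):
    return float(np.sqrt(np.sum(d * x * x)))

P_pi, r_pi = induced_chain(pi)
P_mu, _ = induced_chain(mu)
d_pi, d_mu = stationary(P_pi), stationary(P_mu)

# True value function of the target policy.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Per-step importance sampling only: states are still weighted by d_mu.
theta_is, A_is = td_fixed_point(d_mu, P_pi, r_pi)
# Ideal state-distribution correction: states are reweighted to d_pi.
theta_corr, A_corr = td_fixed_point(d_pi, P_pi, r_pi)

# Best achievable approximation: d_pi-weighted projection of v_pi onto span(Phi).
D_pi = np.diag(d_pi)
theta_best = np.linalg.solve(Phi.T @ D_pi @ Phi, Phi.T @ D_pi @ v_pi)

err_is = weighted_norm(Phi @ theta_is - v_pi, d_pi)
err_corr = weighted_norm(Phi @ theta_corr - v_pi, d_pi)
err_best = weighted_norm(Phi @ theta_best - v_pi, d_pi)
print("IS only:", err_is, "corrected:", err_corr, "best:", err_best)
```

With target-distribution weighting, the matrix `A_corr` is positive definite and the error satisfies the classical on-policy bound of at most `1 / (1 - gamma)` times the best approximation error. No such guarantee holds in general for `A_is`, which is one way to see why a correction at the level of state distributions is needed.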
Cite
Text
Sowmya et al. "Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction." Transactions on Machine Learning Research, 2026.
Markdown
[Sowmya et al. "Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/sowmya2026tmlr-mitigating/)
BibTeX
@article{sowmya2026tmlr-mitigating,
title = {{Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction}},
author = {Sowmya, Emani Naga Sai Venkata and Kesari, Amit and Joseph, Ajin George},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/sowmya2026tmlr-mitigating/}
}