Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach

Singh, Utsav; Chakraborty, Souradip; Suttle, Wesley A.; Sadler, Brian M.; Asher, Derrik E.; Sahu, Anit Kumar; Shah, Mubarak; Namboodiri, Vinay P.; Bedi, Amrit Singh

ICLR 2026

Abstract

Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods face two fundamental challenges: (i) non-stationarity caused by the evolving lower-level policy during training, which destabilizes higher-level learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. To address these challenges, we introduce DIPPER, a novel HRL framework that formulates goal-conditioned HRL as a bilevel optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy. By learning from preference comparisons over subgoal sequences rather than rewards that depend on the evolving lower-level policy, DIPPER mitigates the impact of non-stationarity on higher-level learning. To address infeasible subgoals, DIPPER incorporates lower-level value function regularization that encourages the higher-level policy to propose achievable subgoals. We introduce two novel metrics to quantitatively verify that DIPPER mitigates the non-stationarity and infeasible subgoal generation issues in HRL. Empirical evaluation on challenging robotic navigation and manipulation benchmarks shows that DIPPER achieves up to 40% improvement over state-of-the-art baselines in sparse-reward scenarios, highlighting the potential of preference-based learning for addressing longstanding HRL limitations.
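The two ingredients named in the abstract, a DPO-style preference loss for the higher-level policy and a lower-level value regularizer, can be illustrated with a minimal sketch. This is not the paper's objective: it uses the generic DPO loss form with a reference policy and simply subtracts a weighted value term. The function name `dipper_style_loss`, the parameters `beta` and `lam`, and the way the value term enters are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dipper_style_loss(
    logp_preferred: float,
    logp_rejected: float,
    ref_logp_preferred: float,
    ref_logp_rejected: float,
    lower_level_value: float,
    beta: float = 1.0,   # hypothetical preference temperature
    lam: float = 0.1,    # hypothetical regularization weight
) -> float:
    """Generic DPO loss on a pair of subgoal sequences, minus a value bonus.

    logp_* are summed log-probabilities of each subgoal sequence under the
    higher-level policy; ref_logp_* are the same under a fixed reference
    policy. lower_level_value is an assumed estimate of how achievable the
    proposed subgoals are for the lower-level policy (higher is better).
    """
    # Implicit reward margin between preferred and rejected sequences.
    margin = (logp_preferred - ref_logp_preferred) - (
        logp_rejected - ref_logp_rejected
    )
    preference_loss = -log_sigmoid(beta * margin)
    # Assumed regularizer: reward subgoals the lower level can reach.
    return preference_loss - lam * lower_level_value


# Usage: the loss drops when the policy favors the preferred sequence
# and when the proposed subgoals have higher lower-level value.
loss = dipper_style_loss(-1.0, -3.0, -2.0, -2.0, lower_level_value=0.5)
```

Because the preference term depends only on comparisons between subgoal sequences, it does not use a higher-level reward computed from the changing lower-level policy, which is the property the abstract credits for mitigating non-stationarity.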

Cite

Text

Singh et al. "Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach." International Conference on Learning Representations, 2026.

Markdown

[Singh et al. "Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/singh2026iclr-direct/)

BibTeX

@inproceedings{singh2026iclr-direct,
  title     = {{Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach}},
  author    = {Singh, Utsav and Chakraborty, Souradip and Suttle, Wesley A. and Sadler, Brian M. and Asher, Derrik E. and Sahu, Anit Kumar and Shah, Mubarak and Namboodiri, Vinay P. and Bedi, Amrit Singh},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/singh2026iclr-direct/}
}