Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control

Zhang, Zhilong; Mei, Yunpeng; Du, Xinghao; Cao, Hongjie; Wang, Haonan; Min, Pengyuan; Wang, Chenyu; Chen, Pengfei; Xin, Chenbo; Wang, Yijie; Luo, Wenyu; Sun, Yihao; Wang, Yidi; Yuan, Lei; Wang, Gang; Yu, Yang

ICLR 2026

Abstract

Scaling imitation learning to high-DoF whole-body robots is fundamentally constrained by the scarcity of expert demonstrations. In contrast, large amounts of suboptimal data are readily available and offer a practical way to alleviate supervision bottlenecks in real-world whole-body control. However, leveraging such data introduces two central challenges: how to extract informative signals from imperfect trajectories, and how to cope with the increased learning complexity induced by high-dimensional control. To address these challenges, we propose **HVD** (Hierarchical Value-Decomposed Offline Reinforcement Learning). The offline RL formulation provides principled data selection over suboptimal datasets, enabling the policy to prioritize high-value behaviors while down-weighting harmful ones. Complementarily, hierarchical value decomposition organizes learning along the robot’s kinematic structure, improving credit assignment and reducing learning complexity in high-DoF systems. Built on a Transformer-based architecture, HVD supports *multi-modal* and *multi-task* learning, allowing flexible integration of diverse sensory inputs. To enable realistic training and evaluation, we further introduce **WB-50**, a 50-hour dataset of teleoperated and policy-rollout trajectories that are annotated with rewards and preserve natural imperfections, including partial successes, corrections, and failures. Experiments show that HVD significantly outperforms existing baselines in success rate across complex whole-body tasks. Our results suggest that effective policy learning for high-DoF systems can emerge not from perfect demonstrations, but from structured learning over realistic, imperfect data. Our code is available at https://github.com/LAMDA-RL/HVD.
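The abstract's idea of prioritizing high-value behaviors per body part can be illustrated with a small sketch. This is not the authors' implementation: it assumes an advantage-weighted-regression style weight (an exponentiated advantage), an additive split of the value estimates into three hypothetical body-part groups (`base`, `torso`, `arm`), and made-up numbers for a single transition. All function names and parameters below are illustrative assumptions.

```python
import math

# Hypothetical body-part groups; the paper's actual kinematic hierarchy is not stated here.
PARTS = ("base", "torso", "arm")


def part_weights(q, v, temperature=1.0, max_weight=20.0):
    """Return one imitation weight per body part from per-part Q and V estimates.

    The weight is exp(advantage / temperature), clipped at max_weight, as in
    advantage-weighted regression (an assumed stand-in for the paper's rule).
    """
    return {
        p: min(math.exp((q[p] - v[p]) / temperature), max_weight)
        for p in PARTS
    }


def weighted_bc_loss(pred, target, weights):
    """Squared-error behavior-cloning loss, each part scaled by its own weight."""
    return sum(weights[p] * (pred[p] - target[p]) ** 2 for p in PARTS)


# One made-up transition: the arm action has positive advantage, the base action negative.
q = {"base": -1.0, "torso": 0.0, "arm": 1.0}
v = {"base": 0.0, "torso": 0.0, "arm": 0.0}
w = part_weights(q, v)

# Scalar stand-ins for predicted and logged actions of each part.
pred = {p: 0.0 for p in PARTS}
target = {p: 1.0 for p in PARTS}
loss = weighted_bc_loss(pred, target, w)
```

In this toy setup the arm action is imitated most strongly and the base action least, even though both come from the same trajectory. That is the kind of finer-grained credit assignment a per-part decomposition is meant to allow.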

Cite

Text

Zhang et al. "Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-hierarchical/)

BibTeX

@inproceedings{zhang2026iclr-hierarchical,
  title     = {{Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control}},
  author    = {Zhang, Zhilong and Mei, Yunpeng and Du, Xinghao and Cao, Hongjie and Wang, Haonan and Min, Pengyuan and Wang, Chenyu and Chen, Pengfei and Xin, Chenbo and Wang, Yijie and Luo, Wenyu and Sun, Yihao and Wang, Yidi and Yuan, Lei and Wang, Gang and Yu, Yang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-hierarchical/}
}