Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Gao, Heyang; Sun, Zexu; Min, Erxue; Cai, Hengyi; Wang, Shuaiqiang; Yin, Dawei; Chen, Xu

ICLR 2026

Abstract

Large Language Models (LLMs) acting as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Aligning these agents via preference-based methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory-level DPO provides stable signals but blurs where credit should be assigned within long trajectories, whereas step-level DPO offers fine-grained supervision but can be statistically noisy and data-inefficient when Monte Carlo rollouts are limited, and it struggles to fully exploit structured multi-step behaviors whose effect only becomes visible over several actions. To balance this trade-off, we introduce **H**ierarchical **P**reference **L**earning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level. Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex. This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups. Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods.
Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.
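
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows (a) the standard DPO loss on a single preference pair, (b) one possible way to combine trajectory-, step-, and group-level losses, and (c) one possible two-axis curriculum over group length and reward gap. Everything named here is an assumption made for illustration: the `GroupPair` record, the thresholds `length_cut` and `gap_cut`, the equal loss weights, the stage ordering, and the reading that a larger reward gap means an easier sample.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupPair:
    """Hypothetical record for one group-level preference pair."""
    length: int        # number of actions in the group (sub-task complexity)
    reward_gap: float  # reward(preferred) - reward(dispreferred)
    logratio_w: float  # log pi(y_w|x) - log pi_ref(y_w|x), preferred group
    logratio_l: float  # log pi(y_l|x) - log pi_ref(y_l|x), dispreferred group


def dpo_loss(logratio_w: float, logratio_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (logratio_w - logratio_l))."""
    margin = beta * (logratio_w - logratio_l)
    # -log(sigmoid(m)) == softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def hierarchical_loss(traj_loss, step_loss, group_losses,
                      w_traj=1.0, w_step=1.0, w_group=1.0):
    """Weighted sum of the three granularities (weights are assumptions)."""
    group_term = sum(group_losses) / len(group_losses) if group_losses else 0.0
    return w_traj * traj_loss + w_step * step_loss + w_group * group_term


def curriculum_stage(pair: GroupPair, length_cut: int = 3,
                     gap_cut: float = 0.5) -> int:
    """Map a pair to a stage 0..3, lower meaning easier.

    Assumption: short groups and large reward gaps are the easy cases,
    and group length is the slower-moving (outer) axis.
    """
    long_group = pair.length > length_cut
    hard_sample = pair.reward_gap < gap_cut
    return 2 * int(long_group) + int(hard_sample)


def pairs_for_phase(pairs, phase: int):
    """Pairs unlocked at a training phase: every stage up to `phase`."""
    return [p for p in pairs if curriculum_stage(p) <= phase]
```

In this sketch, training would call `pairs_for_phase` with an increasing `phase`, so early updates see only short groups with clear reward gaps and later updates add longer groups and closer pairs. The actual schedule, thresholds, and loss weighting used by HPL are specified in the paper, not here.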

Cite

Text

Gao et al. "Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents." International Conference on Learning Representations, 2026.

Markdown

[Gao et al. "Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/gao2026iclr-solving/)

BibTeX

@inproceedings{gao2026iclr-solving,
  title     = {{Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents}},
  author    = {Gao, Heyang and Sun, Zexu and Min, Erxue and Cai, Hengyi and Wang, Shuaiqiang and Yin, Dawei and Chen, Xu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/gao2026iclr-solving/}
}