A Roadmap for Human-Agent Moral Alignment: Integrating Pre-Defined Intrinsic Rewards and Learned Reward Models

Abstract

The prevailing practice in alignment relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and essentially inferred from relative preferences over different model outputs. This approach suffers from low transparency, limited controllability, and high cost. More recently, researchers have introduced intrinsic reward functions that explicitly encode core human moral values for Reinforcement Learning-based fine-tuning of foundation agent models. This approach offers a way to define transparent values for agents explicitly, while also being cost-effective thanks to automated agent fine-tuning. However, its weaknesses include over-simplification, lack of flexibility, and an inability to adapt dynamically to the needs or preferences of (potentially diverse) users. In this position paper, we argue that a combination of intrinsic rewards and learned reward models may provide an effective way forward for alignment research, one that preserves human agency and control. Integrating intrinsic rewards and learned reward models in post-training can allow models to act in a way that respects individual users' moral preferences while also relying on a transparent foundation of pre-defined values.
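
To make the proposed integration concrete, here is a minimal sketch (our own illustration, not the paper's implementation) of how a pre-defined intrinsic reward might be blended with a learned reward model's score to form the training signal for RL-based post-training. The function names, the toy moral rule, and the mixing weight `lam` are all hypothetical assumptions.

```python
from typing import Callable

def intrinsic_reward(response: str) -> float:
    """Transparent, pre-defined moral rule (toy example): penalize deceptive language."""
    return -1.0 if "deceive" in response.lower() else 1.0

def combined_reward(
    prompt: str,
    response: str,
    learned_rm: Callable[[str, str], float],
    lam: float = 0.5,
) -> float:
    """Convex combination of a fixed intrinsic reward and a learned
    (user-preference) reward model's score; the resulting scalar would
    feed an RL fine-tuning loop such as PPO."""
    r_intrinsic = intrinsic_reward(response)
    r_learned = learned_rm(prompt, response)
    return lam * r_intrinsic + (1.0 - lam) * r_learned
```

Under this assumed formulation, the weight `lam` controls how strongly the transparent pre-defined values constrain behaviour relative to user-specific preferences captured by the learned reward model.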

Cite

Text

Tennant et al. "A Roadmap for Human-Agent Moral Alignment: Integrating Pre-Defined Intrinsic Rewards and Learned Reward Models." ICLR 2025 Workshops: Bi-Align, 2025.

Markdown

[Tennant et al. "A Roadmap for Human-Agent Moral Alignment: Integrating Pre-Defined Intrinsic Rewards and Learned Reward Models." ICLR 2025 Workshops: Bi-Align, 2025.](https://mlanthology.org/iclrw/2025/tennant2025iclrw-roadmap/)

BibTeX

@inproceedings{tennant2025iclrw-roadmap,
  title     = {{A Roadmap for Human-Agent Moral Alignment: Integrating Pre-Defined Intrinsic Rewards and Learned Reward Models}},
  author    = {Tennant, Elizaveta and Hailes, Stephen and Musolesi, Mirco},
  booktitle = {ICLR 2025 Workshops: Bi-Align},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/tennant2025iclrw-roadmap/}
}