Meta-Reinforcement Learning with Adaptation from Human Feedback via Preference-Order-Preserving Task Embedding
Abstract
This paper studies meta-reinforcement learning with adaptation from human feedback. It aims to pre-train a meta-model that can achieve few-shot adaptation to new tasks from human preference queries, without relying on reward signals. To solve this problem, we propose the framework of adaptation via Preference-Order-preserving EMbedding (POEM). During meta-training, the framework learns a task encoder, which maps tasks to a preference-order-preserving task embedding space, and a decoder, which maps the embeddings to task-specific policies. During adaptation from human feedback, the task encoder enables efficient inference of a new task's embedding from the preference queries, from which the task-specific policy is obtained. We provide a theoretical guarantee for the convergence of the adaptation process to the task-specific optimal policy and experimentally demonstrate its state-of-the-art performance, with substantial improvements over baseline methods.
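To make the adaptation step concrete, below is a minimal, hypothetical sketch in PyTorch. It is not the authors' implementation: the module names (EmbeddingConditionedPolicy, PreferenceScorer), the network shapes, and the use of a Bradley-Terry likelihood over trajectory summaries are all illustrative assumptions, not details confirmed by the paper. The sketch only shows the general pattern of freezing meta-trained components and optimizing a new task's embedding against human preference feedback, then conditioning the policy on that embedding.

# Hypothetical sketch of POEM-style adaptation from preference queries (not the authors' code).
# Assumptions: a low-dimensional task embedding, an embedding-conditioned policy as the decoder,
# and a Bradley-Terry preference likelihood over trajectory summaries.
import torch
import torch.nn as nn

EMB_DIM, OBS_DIM, ACT_DIM = 8, 4, 2


class EmbeddingConditionedPolicy(nn.Module):
    """Decoder: maps (observation, task embedding) to action outputs."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + EMB_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM)
        )

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))


class PreferenceScorer(nn.Module):
    """Scores a trajectory summary under a candidate task embedding.

    Stands in for the meta-trained component that makes the embedding space
    preference-order-preserving; the actual architecture is given in the paper.
    """

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + EMB_DIM, 64), nn.Tanh(), nn.Linear(64, 1)
        )

    def forward(self, traj_summary, z):
        return self.net(torch.cat([traj_summary, z], dim=-1)).squeeze(-1)


def adapt_embedding(scorer, queries, steps=200, lr=1e-2):
    """Infer a new task's embedding from human preference queries.

    `queries` is a list of (summary_a, summary_b, label), with label = 1 if the
    human preferred trajectory a over b. A Bradley-Terry log-likelihood is
    maximized over the embedding z while the meta-trained scorer stays frozen.
    """
    for p in scorer.parameters():          # freeze the meta-trained component
        p.requires_grad_(False)
    z = torch.zeros(EMB_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for summary_a, summary_b, label in queries:
            # Bradley-Terry logit: score difference between the two trajectories.
            logit = scorer(summary_a, z) - scorer(summary_b, z)
            target = torch.tensor(float(label))
            loss = loss + nn.functional.binary_cross_entropy_with_logits(logit, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()


if __name__ == "__main__":
    scorer, policy = PreferenceScorer(), EmbeddingConditionedPolicy()
    # Toy preference queries with random trajectory summaries.
    queries = [(torch.randn(OBS_DIM), torch.randn(OBS_DIM), 1) for _ in range(5)]
    z_new = adapt_embedding(scorer, queries)
    action = policy(torch.randn(OBS_DIM), z_new)  # task-specific policy conditioned on z_new
    print(z_new, action)

In the actual POEM framework, the embedding space is trained to be preference-order-preserving, which is what makes this kind of query-driven embedding inference efficient; the sketch only illustrates where the inferred embedding enters the adaptation loop and how it conditions the resulting policy.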
Cite
Text
Xu and Zhu. "Meta-Reinforcement Learning with Adaptation from Human Feedback via Preference-Order-Preserving Task Embedding." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown
[Xu and Zhu. "Meta-Reinforcement Learning with Adaptation from Human Feedback via Preference-Order-Preserving Task Embedding." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/xu2025icml-metareinforcement/)

BibTeX
@inproceedings{xu2025icml-metareinforcement,
title = {{Meta-Reinforcement Learning with Adaptation from Human Feedback via Preference-Order-Preserving Task Embedding}},
author = {Xu, Siyuan and Zhu, Minghui},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {69967--69991},
volume = {267},
url = {https://mlanthology.org/icml/2025/xu2025icml-metareinforcement/}
}