ML Anthology
McKee-Reid, Leo
2 publications
NeurIPSW 2024
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni
NeurIPSW 2024
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni