Balesni, Mikita

6 publications

NeurIPSW 2024 Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni
ICLRW 2024 Large Language Models Can Strategically Deceive Their Users When Put Under Pressure Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn
NeurIPS 2024 Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans
ICLR 2024 The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A” Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
NeurIPSW 2023 The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A” Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans