Pres, Itamar

2 publications

ICML 2024. A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea.
NeurIPSW 2024. Towards Reliable Evaluation of Behavior Steering Interventions in LLMs. Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David Krueger.