ML Anthology
Pres, Itamar
2 publications
ICML 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
NeurIPSW 2024
Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David Krueger