Thomas, Drake

2 publications

NeurIPS 2024 Catastrophic Goodhart: Regularizing RLHF with KL Divergence Does Not Mitigate Heavy-Tailed Reward Misspecification Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
ICMLW 2024 Catastrophic Goodhart: Regularizing RLHF with KL Divergence Does Not Mitigate Heavy-Tailed Reward Misspecification Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso