Dorn, Diego

2 publications

ICMLW 2024. BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards. Diego Dorn, Alexandre Variengien, Charbel-Raphael Segerie, Vincent Corruble
NeurIPSW 2023. Goal Misgeneralization as Implicit Goal Conditioning. Diego Dorn, Neel Alex, David Krueger