Xhonneux, Sophie

6 publications

ICLRW 2025 A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel
ICLR 2025 Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville
NeurIPS 2024 Efficient Adversarial Training in LLMs with Continuous Attacks Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn
NeurIPSW 2024 Faster, More Efficient RLHF Through Off-Policy Asynchronous Learning Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville
ICMLW 2024 In-Context Learning, Can It Break Safety? Sophie Xhonneux, David Dobre, Michael Noukhovitch, Jian Tang, Gauthier Gidel, Dhanya Sridhar
NeurIPS 2024 Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs Through the Embedding Space Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann