Xhonneux, Sophie

6 publications

ICLRW 2025. A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens. Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel

ICLR 2025. Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models. Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville

NeurIPS 2024. Efficient Adversarial Training in LLMs with Continuous Attacks. Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn

NeurIPSW 2024. Faster, More Efficient RLHF Through Off-Policy Asynchronous Learning. Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville

ICMLW 2024. In-Context Learning, Can It Break Safety? Sophie Xhonneux, David Dobre, Michael Noukhovitch, Jian Tang, Gauthier Gidel, Dhanya Sridhar

NeurIPS 2024. Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs Through the Embedding Space. Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann