Wei, Boyi

8 publications

TMLR 2025. An Adversarial Perspective on Machine Unlearning for AI Safety. Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando
NeurIPS 2025. Dynamic Risk Assessments for Offensive Cybersecurity Agents. Boyi Wei, Benedikt Stroebl, Jiacen Xu, Joie Zhang, Zhou Li, Peter Henderson
ICLR 2025. On Evaluating the Durability of Safeguards for Open-Weight LLMs. Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson
ICLR 2025. SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal. Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
NeurIPSW 2024. An Adversarial Perspective on Machine Unlearning for AI Safety. Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando
ICML 2024. Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications. Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
ICLRW 2024. Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications. Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
NeurIPS 2024. Evaluating Copyright Takedown Methods for Language Models. Boyi Wei, Weijia Shi, Yangsibo Huang, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, Peter Henderson