Yin, Xuwang

6 publications

NeurIPS 2025. Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs. Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Oliver Zhang, Dan Hendrycks
ICML 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks
NeurIPSW 2024. RenderAttack: Hundreds of Adversarial Attacks Through Differentiable Texture Generation. Dron Hazra, Alex Bie, Mantas Mazeika, Xuwang Yin, Andy Zou, Dan Hendrycks, Maximilian Kaufmann
NeurIPS 2024. Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? Richard Ren, Steven Basart, Adam Khoja, Alexander Pan, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Gabriel Mukobi, Ryan Hwang Kim, Stephen Fitz, Dan Hendrycks
ECCV 2022. Learning Energy-Based Models with Adversarial Training. Xuwang Yin, Shiying Li, Gustavo K. Rohde
ICLR 2020. GAT: Generative Adversarial Training for Adversarial Example Detection and Classification. Xuwang Yin, Soheil Kolouri, Gustavo K. Rohde