Ren, Richard

3 publications

NeurIPS 2025. Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs. Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Oliver Zhang, Dan Hendrycks
NeurIPS 2024. Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? Richard Ren, Steven Basart, Adam Khoja, Alexander Pan, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Gabriel Mukobi, Ryan Hwang Kim, Stephen Fitz, Dan Hendrycks
NeurIPSW 2023. Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching. James Campbell, Phillip Guo, Richard Ren