Collins, Anne

4 publications

ICLR 2026. PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm. Jing-Jing Li, Joel Mire, Eve Fleisig, Valentina Pyatkin, Anne Collins, Maarten Sap, Sydney Levine
ICLRW 2025. Investigating the Role of Representation Switching Costs in Goal Persistence Bias. Gaia Molinaro, Aly Lidayan, Anne Collins
ICML 2025. SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior. Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine
NeurIPSW 2024. SafetyAnalyst: Interpretable, Transparent, and Steerable LLM Safety Moderation. Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine