Hariharan, Kaivalya

5 publications

NeurIPSW 2024. Between the Bars: Gradient-Based Jailbreaks Are Bugs That Induce Features. Kaivalya Hariharan, Uzay Girit.
NeurIPS 2024. Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs. Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans.
NeurIPSW 2023. Forbidden Facts: An Investigation of Competing Objectives in Llama 2. Tony Wang, Miles Kai, Kaivalya Hariharan, Nir Shavit.
NeurIPS 2023. Red Teaming Deep Neural Networks with Feature Synthesis Tools. Stephen Casper, Tong Bu, Yuxiao Li, Jiawei Li, Kevin Zhang, Kaivalya Hariharan, Dylan Hadfield-Menell.
NeurIPSW 2022. Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks. Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell.