Chughtai, Bilal

9 publications

ICML 2025 Detecting Strategic Deception with Linear Probes Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn
TMLR 2025 Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
TMLR 2025 Open Problems in Mechanistic Interpretability Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
NeurIPS 2025 Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda
ICLRW 2025 Robustly Identifying Concepts Introduced During Chat Fine-Tuning Using Crosscoders Julian Minder, Clément Dumas, Bilal Chughtai, Neel Nanda
NeurIPS 2024 Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans
NeurIPSW 2024 Model Manipulation Attacks Enable More Rigorous Evaluations of LLM Capabilities Zora Che, Stephen Casper, Anirudh Satheesh, Rohit Gandikota, Domenic Rosati, Stewart Slocum, Lev E McKinney, Zichu Wu, Zikui Cai, Bilal Chughtai, Daniel Filan, Furong Huang, Dylan Hadfield-Menell
ICML 2023 A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations Bilal Chughtai, Lawrence Chan, Neel Nanda
ICLRW 2023 Neural Networks Learn Representation Theory: Reverse Engineering How Networks Perform Group Operations Bilal Chughtai, Lawrence Chan, Neel Nanda