Goldowsky-Dill, Nicholas

5 publications

ICML 2025. Detecting Strategic Deception with Linear Probes. Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn
TMLR 2025. Open Problems in Mechanistic Interpretability. Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
NeurIPS 2024. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey
ICMLW 2024. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey
ICMLW 2024. Using Degeneracy in the Loss Landscape for Mechanistic Interpretability. Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn