Bau, David

38 publications

NeurIPS 2025 Erasing Conceptual Knowledge from Language Models Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
NeurIPS 2025 LLMs Encode Harmfulness and Refusal Separately Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi
ICML 2025 MIB: A Mechanistic Interpretability Benchmark Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fried Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
NeurIPS 2025 Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark Lemley, Nicolas Papernot, Katherine Lee
ICLR 2025 NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals Jaden Fried Fiotto-Kaufman, Alexander Russell Loftus, Eric Todd, Jannik Brinkmann, Koyena Pal, Dmitrii Troitskii, Michael Ripa, Adam Belfki, Can Rager, Caden Juang, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Nikhil Prakash, Carla E. Brodley, Arjun Guha, Jonathan Bell, Byron C Wallace, David Bau
NeurIPS 2025 One-Step Is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models Viacheslav Surkov, Chris Wendler, Antonio Mari, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre, David Bau
TMLR 2025 Open Problems in Mechanistic Interpretability Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
TMLR 2025 Open Problems in Technical AI Governance Anka Reuel, Benjamin Bucknall, Stephen Casper, Timothy Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, David Bau, Paul Bricman, Neel Guha, Jessica Newman, Yoshua Bengio, Tobin South, Alex Pentland, Sanmi Koyejo, Mykel Kochenderfer, Robert Trager
ICCV 2025 SliderSpace: Decomposing the Visual Capabilities of Diffusion Models Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin
ICLR 2025 Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
NeurIPS 2025 When Are Concepts Erased from Diffusion Models? Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen
ECCV 2024 Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models Rohit Gandikota, Joanna Materzyńska, Tingrui Zhou, Antonio Torralba, David Bau
ICLR 2024 Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau
ICLR 2024 Function Vectors in Large Language Models Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, David Bau
ICLR 2024 Linearity of Relation Decoding in Transformer Language Models Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau
NeurIPS 2024 Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
ICMLW 2024 Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
WACV 2024 Unified Concept Editing in Diffusion Models Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau
NeurIPSW 2023 An Alternative to Regulation: The Case for Public AI Nicholas Vincent, David Bau, Sarah Schwettmann, Joshua Tan
ICMLW 2023 Discovering Variable Binding Circuitry with Desiderata Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau
ICLR 2023 Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
ICCV 2023 Erasing Concepts from Diffusion Models Rohit Gandikota, Joanna Materzyńska, Jaden Fiotto-Kaufman, David Bau
NeurIPS 2023 FIND: A Function Description Benchmark for Evaluating Interpretability Methods Sarah Schwettmann, Tamar Shaham, Joanna Materzyńska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba
ICLR 2023 Mass-Editing Memory in a Transformer Kevin Meng, Arnab Sen Sharma, Alex J Andonian, Yonatan Belinkov, David Bau
ICCVW 2023 Multimodal Neurons in Pretrained Text-Only Transformers Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, Antonio Torralba
NeurIPSW 2023 Testing Language Model Agents Safely in the Wild Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau
CVPR 2022 Disentangling Visual and Written Concepts in CLIP Joanna Materzyńska, Antonio Torralba, David Bau
NeurIPS 2022 Locating and Editing Factual Associations in GPT Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov
ICLR 2022 Natural Language Descriptions of Deep Visual Features Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas
NeurIPS 2021 Editing a Classifier by Rewriting Its Prediction Rules Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, Aleksander Madry
ICCV 2021 Sketch Your Own GAN Sheng-Yu Wang, David Bau, Jun-Yan Zhu
ICCV 2021 Toward a Visual Concept Vocabulary for GAN Latent Space Sarah Schwettmann, Evan Hernandez, David Bau, Samuel Klein, Jacob Andreas, Antonio Torralba
ECCV 2020 Rewriting a Deep Generative Model David Bau, Steven Liu, Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba
ECCV 2020 What Makes Fake Images Detectable? Understanding Properties That Generalize Lucy Chai, David Bau, Ser-Nam Lim, Phillip Isola
ICLR 2019 GAN Dissection: Visualizing and Understanding Generative Adversarial Networks David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba
ICLRW 2019 Visualizing and Understanding GANs David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba
ECCV 2018 Interpretable Basis Decomposition for Visual Explanation Bolei Zhou, Yiyou Sun, David Bau, Antonio Torralba
CVPR 2017 Network Dissection: Quantifying Interpretability of Deep Visual Representations David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, Antonio Torralba