Bau, David

38 publications

NeurIPS 2025 Erasing Conceptual Knowledge from Language Models Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau

NeurIPS 2025 LLMs Encode Harmfulness and Refusal Separately Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi

ICML 2025 MIB: A Mechanistic Interpretability Benchmark Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fried Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

NeurIPS 2025 Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark Lemley, Nicolas Papernot, Katherine Lee

ICLR 2025 NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals Jaden Fried Fiotto-Kaufman, Alexander Russell Loftus, Eric Todd, Jannik Brinkmann, Koyena Pal, Dmitrii Troitskii, Michael Ripa, Adam Belfki, Can Rager, Caden Juang, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Nikhil Prakash, Carla E. Brodley, Arjun Guha, Jonathan Bell, Byron C Wallace, David Bau

NeurIPS 2025 One-Step Is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models Viacheslav Surkov, Chris Wendler, Antonio Mari, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre, David Bau

TMLR 2025 Open Problems in Mechanistic Interpretability Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath

TMLR 2025 Open Problems in Technical AI Governance Anka Reuel, Benjamin Bucknall, Stephen Casper, Timothy Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, David Bau, Paul Bricman, Neel Guha, Jessica Newman, Yoshua Bengio, Tobin South, Alex Pentland, Sanmi Koyejo, Mykel Kochenderfer, Robert Trager

ICCV 2025 SliderSpace: Decomposing the Visual Capabilities of Diffusion Models Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin

ICLR 2025 Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

NeurIPS 2025 When Are Concepts Erased from Diffusion Models? Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen

ECCV 2024 Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models Rohit Gandikota, Joanna Materzyńska, Tingrui Zhou, Antonio Torralba, David Bau

ICLR 2024 Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau

ICLR 2024 Function Vectors in Large Language Models Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, David Bau

ICLR 2024 Linearity of Relation Decoding in Transformer Language Models Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau

NeurIPS 2024 Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks

ICMLW 2024 Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks

WACV 2024 Unified Concept Editing in Diffusion Models Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau

NeurIPSW 2023 An Alternative to Regulation: The Case for Public AI Nicholas Vincent, David Bau, Sarah Schwettmann, Joshua Tan

ICMLW 2023 Discovering Variable Binding Circuitry with Desiderata Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

ICLR 2023 Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg

ICCV 2023 Erasing Concepts from Diffusion Models Rohit Gandikota, Joanna Materzyńska, Jaden Fiotto-Kaufman, David Bau

NeurIPS 2023 FIND: A Function Description Benchmark for Evaluating Interpretability Methods Sarah Schwettmann, Tamar Shaham, Joanna Materzyńska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba

ICLR 2023 Mass-Editing Memory in a Transformer Kevin Meng, Arnab Sen Sharma, Alex J Andonian, Yonatan Belinkov, David Bau

ICCVW 2023 Multimodal Neurons in Pretrained Text-Only Transformers Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, Antonio Torralba

NeurIPSW 2023 Testing Language Model Agents Safely in the Wild Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau

CVPR 2022 Disentangling Visual and Written Concepts in CLIP Joanna Materzyńska, Antonio Torralba, David Bau

NeurIPS 2022 Locating and Editing Factual Associations in GPT Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov

ICLR 2022 Natural Language Descriptions of Deep Visual Features Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas

NeurIPS 2021 Editing a Classifier by Rewriting Its Prediction Rules Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, Aleksander Madry

ICCV 2021 Sketch Your Own GAN Sheng-Yu Wang, David Bau, Jun-Yan Zhu

ICCV 2021 Toward a Visual Concept Vocabulary for GAN Latent Space Sarah Schwettmann, Evan Hernandez, David Bau, Samuel Klein, Jacob Andreas, Antonio Torralba

ECCV 2020 Rewriting a Deep Generative Model David Bau, Steven Liu, Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba

ECCV 2020 What Makes Fake Images Detectable? Understanding Properties That Generalize Lucy Chai, David Bau, Ser-Nam Lim, Phillip Isola

ICLR 2019 GAN Dissection: Visualizing and Understanding Generative Adversarial Networks David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba

ICLRW 2019 Visualizing and Understanding GANs David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba

ECCV 2018 Interpretable Basis Decomposition for Visual Explanation Bolei Zhou, Yiyou Sun, David Bau, Antonio Torralba

CVPR 2017 Network Dissection: Quantifying Interpretability of Deep Visual Representations David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, Antonio Torralba