Wattenberg, Martin

17 publications

ICML 2025. Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models. Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba E. Ba, Talia Konkle
ICLR 2025. ICLR: In-Context Learning of Representations. Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, Hidenori Tanaka
TMLR 2025. Open Problems in Mechanistic Interpretability. Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J. Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
ICLRW 2025. Shared Global and Local Geometry of Language Model Embeddings. Andrew Lee, Fernanda Viégas, Martin Wattenberg
ICML 2025. When Bad Data Leads to Good Models. Kenneth Li, Yida Chen, Fernanda Viégas, Martin Wattenberg
ICML 2024. A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
NeurIPSW 2024. Causation Does Not Imply Correlation: A Study of Circuit Mechanisms and Model Behaviors. Jenny Kaufmann, Victoria R. Li, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra
ICLR 2024. Linearity of Relation Decoding in Transformer Language Models. Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau
ICML 2024. Q-Probe: A Lightweight Approach to Reward Maximization for Language Models. Kenneth Li, Samy Jelassi, Hugh Zhang, Sham M. Kakade, Martin Wattenberg, David Brandfonbrener
ICMLW 2024. Relational Composition in Neural Networks: A Survey and Call to Action. Martin Wattenberg, Fernanda Viégas
ICLR 2023. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
NeurIPS 2023. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
NeurIPSW 2022. Identifying Structure in the MIMIC ICU Dataset. Zad Chin, Shivam Raval, Finale Doshi-Velez, Martin Wattenberg, Leo Anthony Celi
NeurIPS 2019. Visualizing and Measuring the Geometry of BERT. Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, Been Kim
ICML 2018. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viégas, Rory Sayres
Distill 2016. How to Use t-SNE Effectively. Martin Wattenberg, Fernanda Viégas, Ian Johnson
NeurIPS 1995. Stochastic Hillclimbing as a Baseline Method for Evaluating Genetic Algorithms. Ari Juels, Martin Wattenberg