Wu, Zhengxuan

15 publications

ICML 2025 AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, Christopher Potts

JMLR 2025 Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard

NeurIPS 2025 Improved Representation Steering for Language Models Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D Manning, Christopher Potts

NeurIPS 2025 LLMs Encode Harmfulness and Refusal Separately Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi

CLeaR 2024 Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah Goodman

ICML 2024 In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation Shiqi Chen, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, Junxian He

ICLRW 2024 In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation Shiqi Chen, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, Junxian He

NeurIPS 2024 ReFT: Representation Finetuning for Language Models Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts

ICLRW 2024 Symbolic Variables in Distributed Networks That Count Satchel Grant, Zhengxuan Wu, James Lloyd McClelland, Noah Goodman

ICML 2023 Causal Proxy Models for Concept-Based Model Explanations Zhengxuan Wu, Karel D'Oosterlinck, Atticus Geiger, Amir Zur, Christopher Potts

NeurIPS 2023 Interpretability at Scale: Identifying Causal Mechanisms in Alpaca Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah Goodman

NeurIPS 2022 CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior Eldar D Abraham, Karel D'Oosterlinck, Amir Feder, Yair Gat, Atticus Geiger, Christopher Potts, Roi Reichart, Zhengxuan Wu

ICML 2022 Inducing Causal Structure for Interpretable Neural Networks Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, Christopher Potts

NeurIPS 2022 ZeroC: A Neuro-Symbolic Model for Zero-Shot Concept Recognition and Acquisition at Inference Time Tailin Wu, Megan Tjandrasuwita, Zhengxuan Wu, Xuelin Yang, Kevin Liu, Rok Sosic, Jure Leskovec

AAAI 2021 Context-Guided BERT for Targeted Aspect-Based Sentiment Analysis Zhengxuan Wu, Desmond C. Ong