Wu, Zhengxuan

15 publications

ICML 2025 AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, Christopher Potts
JMLR 2025 Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard
NeurIPS 2025 Improved Representation Steering for Language Models Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D Manning, Christopher Potts
NeurIPS 2025 LLMs Encode Harmfulness and Refusal Separately Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi
CLeaR 2024 Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah Goodman
ICML 2024 In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation Shiqi Chen, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, Junxian He
ICLRW 2024 In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation Shiqi Chen, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, Junxian He
NeurIPS 2024 ReFT: Representation Finetuning for Language Models Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts
ICLRW 2024 Symbolic Variables in Distributed Networks That Count Satchel Grant, Zhengxuan Wu, James Lloyd McClelland, Noah Goodman
ICML 2023 Causal Proxy Models for Concept-Based Model Explanations Zhengxuan Wu, Karel D'Oosterlinck, Atticus Geiger, Amir Zur, Christopher Potts
NeurIPS 2023 Interpretability at Scale: Identifying Causal Mechanisms in Alpaca Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah Goodman
NeurIPS 2022 CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior Eldar D Abraham, Karel D'Oosterlinck, Amir Feder, Yair Gat, Atticus Geiger, Christopher Potts, Roi Reichart, Zhengxuan Wu
ICML 2022 Inducing Causal Structure for Interpretable Neural Networks Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, Christopher Potts
NeurIPS 2022 ZeroC: A Neuro-Symbolic Model for Zero-Shot Concept Recognition and Acquisition at Inference Time Tailin Wu, Megan Tjandrasuwita, Zhengxuan Wu, Xuelin Yang, Kevin Liu, Rok Sosic, Jure Leskovec
AAAI 2021 Context-Guided BERT for Targeted Aspect-Based Sentiment Analysis Zhengxuan Wu, Desmond C. Ong