Stolfo, Alessandro

8 publications

ICLRW 2025 Antipodal Pairing and Mechanistic Signals in Dense SAE Latents Alessandro Stolfo, Ben Peng Wu, Mrinmaya Sachan
NeurIPS 2025 Dense SAE Latents Are Features, Not Bugs Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Peng Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark
ICLR 2025 Improving Instruction-Following in Language Models Through Activation Steering Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi
ICML 2025 MIB: A Mechanistic Interpretability Benchmark Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fried Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
NeurIPS 2025 Transferring Linear Features Across Language Models with Model Stitching Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick
NeurIPS 2024 Confidence Regulation Neurons in Language Models Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
ICMLW 2024 Confidence Regulation Neurons in Language Models Alessandro Stolfo, Ben Peng Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
ICML 2024 Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? Andreas Opedal, Alessandro Stolfo, Haruki Shirakami, Ying Jiao, Ryan Cotterell, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan