Kwa, Thomas

7 publications

NeurIPS 2025. Measuring AI Ability to Complete Long Software Tasks. Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Roa Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M Ziegler, Elizabeth Barnes, Lawrence Chan

NeurIPS 2024. Catastrophic Goodhart: Regularizing RLHF with KL Divergence Does Not Mitigate Heavy-Tailed Reward Misspecification. Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso

ICMLW 2024. Catastrophic Goodhart: Regularizing RLHF with KL Divergence Does Not Mitigate Heavy-Tailed Reward Misspecification. Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso

NeurIPS 2024. Compact Proofs of Model Performance via Mechanistic Interpretability. Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

ICMLW 2024. Compact Proofs of Model Performance via Mechanistic Interpretability. Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

NeurIPS 2024. InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques. Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso

ICMLW 2024. InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques. Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso