Kwa, Thomas

7 publications

NeurIPS 2025 Measuring AI Ability to Complete Long Software Tasks Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Roa Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M Ziegler, Elizabeth Barnes, Lawrence Chan
NeurIPS 2024 Catastrophic Goodhart: Regularizing RLHF with KL Divergence Does Not Mitigate Heavy-Tailed Reward Misspecification Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
ICMLW 2024 Catastrophic Goodhart: Regularizing RLHF with KL Divergence Does Not Mitigate Heavy-Tailed Reward Misspecification Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
NeurIPS 2024 Compact Proofs of Model Performance via Mechanistic Interpretability Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan
ICMLW 2024 Compact Proofs of Model Performance via Mechanistic Interpretability Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan
NeurIPS 2024 InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso
ICMLW 2024 InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso