He, He

26 publications

ICLR 2025 Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
ICLR 2025 Language Models Learn to Mislead Humans via RLHF Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng
ICLRW 2025 Monitoring LLM Agents for Sequentially Contextual Harm Chen Yueh-Han, Nitish Joshi, Yulin Chen, He He, Rico Angell
NeurIPS 2025 Predicting Empirical AI Research Outcomes with Language Models Jiaxin Wen, Chenglei Si, Chen Yueh-Han, He He, Shi Feng
ICLR 2025 Transformers Struggle to Learn to Search Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Mehran Kazemi, Najoung Kim, He He
NeurIPSW 2024 Beyond the Binary: Capturing Diverse Preferences with Reward Regularization Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, He He
ICML 2024 Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown
ICLR 2024 Does Writing with Language Models Reduce Content Diversity? Vishakh Padmakumar, He He
TMLR 2024 Foundational Challenges in Assuring Alignment and Safety of Large Language Models Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric J Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Chenyu Zhang, Ruiqi Zhong, Sean O hEigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Aleksandar Petrov, Christian Schroeder de Witt, Sumeet Ramesh Motwani, Yoshua Bengio, Danqi Chen, Philip Torr, Samuel Albanie, Tegan Maharaj, Jakob Nicolaus Foerster, Florian Tramèr, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
NeurIPS 2024 Iterative Reasoning Preference Optimization Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
TMLR 2024 Nuisances via Negativa: Adjusting for Spurious Correlations via Data Augmentation Aahlad Manas Puli, Nitish Joshi, Yoav Wald, He He, Rajesh Ranganath
NeurIPS 2024 The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale
ICML 2023 Extrapolative Controlled Sequence Generation via Iterative Refinement Vishakh Padmakumar, Richard Yuanzhe Pang, He He, Ankur P Parikh
ICLR 2023 Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought Abulhair Saparov, He He
NeurIPS 2023 Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Mehran Kazemi, Najoung Kim, He He
NeurIPS 2022 SeqPATE: Differentially Private Text Generation via Knowledge Distillation Zhiliang Tian, Yingxiu Zhao, Ziyue Huang, Yu-Xiang Wang, Nevin L. Zhang, He He
NeurIPS 2021 IRM—when It Works and When It Doesn't: A Test Case of Natural Language Inference Yana Dranker, He He, Yonatan Belinkov
ICLR 2021 Text Generation by Learning from Demonstrations Richard Yuanzhe Pang, He He
JMLR 2020 GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, Yi Zhu
IJCAI 2020 Partial Adversarial Behavior Deception in Security Games Thanh Hong Nguyen, Arunesh Sinha, He He
NeurIPS 2016 A Credit Assignment Compiler for Joint Prediction Kai-Wei Chang, He He, Stephane Ross, Hal Daumé III, John Langford
WACV 2016 Object Detection in 20 Questions Xi Stephen Chen, He He, Larry S. Davis
ICML 2016 Opponent Modeling in Deep Reinforcement Learning He He, Jordan Boyd-Graber, Kevin Kwok, Hal Daumé
NeurIPS 2014 Learning to Search in Branch and Bound Algorithms He He, Hal Daumé III, Jason M Eisner
NeurIPS 2012 Imitation Learning by Coaching He He, Jason Eisner, Hal Daume
CVPR 2011 Single Image Super-Resolution Using Gaussian Process Regression He He, Wan-Chi Siu