Smith, Logan

2 publications

NeurIPS 2024. Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models. Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks.
NeurIPS 2021. Optimal Policies Tend to Seek Power. Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli.