Steinhardt, Jacob
79 publications
ICLR
2025
Iterative Label Refinement Matters More than Preference Optimization Under Weak Supervision
ICML
2024
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
NeurIPSW
2022
Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations
NeurIPSW
2022
Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
ICML
2022
More than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize