Bergsma, Shane
10 publications
NeurIPSW
2024
Empirical Upper Bounds for Unstructured Sparsity in Compute-Efficient Language Modeling
NeurIPS
2024
Normalization Layer Per-Example Gradients Are Sufficient to Predict Gradient Noise Scale in Transformers
NeurIPS
2024
Sparse Maximal Update Parameterization: A Holistic Approach to Sparse Training Dynamics