Wang, Mingze
14 publications
ICML
2025
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
NeurIPSW
2023
The Noise Geometry of Stochastic Gradient Descent: A Quantitative and Analytical Characterization
NeurIPS
2023
Understanding Multi-Phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks