Improving the Straight-Through Estimator with Zeroth-Order Information

Abstract

We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters with gradients computed via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, a first-order method, recent works have instead explored zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality but biased gradients, whereas ZO gradients are unbiased but can be expensive to compute. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO), which reduces STE bias while requiring less computation than ZO methods. Empirically, we show that FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, at the same number of iterations as STE, FOGZO yields a 1-8% accuracy improvement for DeiT Tiny/Small, a 1-2% accuracy improvement for ResNet 18/50, and a 1-22 point perplexity improvement for LLaMA models with up to 0.3 billion parameters. For the same loss, FOGZO yields a 796× reduction in computation versus n-SPSA for a 2-layer MLP on MNIST. Code is available at [https://github.com/1733116199/fogzo](https://github.com/1733116199/fogzo).
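
As a minimal illustration of the two gradient estimators contrasted in the abstract, the sketch below implements a straight-through quantizer and a two-point SPSA-style zeroth-order estimate in PyTorch. The uniform quantizer, the `QuantizeSTE` module, and the `spsa_grad` helper are illustrative assumptions, not the paper's implementation; the abstract does not specify how FOGZO combines the two signals, so no such combination is shown.

```python
# Illustrative sketch only (not the paper's FOGZO implementation): contrasts a
# straight-through (first-order, biased) gradient with a two-point SPSA-style
# zeroth-order estimate for a uniformly quantized weight vector.
import torch


class QuantizeSTE(torch.autograd.Function):
    """Uniform quantizer whose backward pass is the straight-through estimator."""

    @staticmethod
    def forward(ctx, w, scale):
        # Round to the nearest quantization level; this op has zero gradient a.e.
        return torch.round(w / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pretend quantization is the identity and pass the gradient through.
        return grad_output, None


def spsa_grad(loss_fn, w, eps=1e-3):
    """Two-point SPSA estimate of d loss / d w using a Rademacher perturbation."""
    delta = torch.randint_like(w, 0, 2) * 2 - 1  # entries are +1 or -1
    loss_plus = loss_fn(w + eps * delta)
    loss_minus = loss_fn(w - eps * delta)
    return (loss_plus - loss_minus) / (2 * eps) * delta


if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(8, requires_grad=True)
    target = torch.randn(8)
    scale = 0.1

    def quantized_loss(v):
        return ((torch.round(v / scale) * scale - target) ** 2).sum()

    # First-order path: STE gradient via one back-propagation pass.
    loss = ((QuantizeSTE.apply(w, scale) - target) ** 2).sum()
    loss.backward()
    print("STE gradient:", w.grad)

    # Zeroth-order path: SPSA gradient from two extra forward passes.
    print("ZO  gradient:", spsa_grad(quantized_loss, w.detach()))
```

The two print statements expose the tradeoff the abstract describes: the STE gradient is cheap and low-variance but ignores the rounding step, while the SPSA estimate touches the true quantized loss yet needs extra forward passes and is noisy.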

Cite

Text

Yang and Aamodt. "Improving the Straight-Through Estimator with Zeroth-Order Information." Advances in Neural Information Processing Systems, 2025.

Markdown

[Yang and Aamodt. "Improving the Straight-Through Estimator with Zeroth-Order Information." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yang2025neurips-improving/)

BibTeX

@inproceedings{yang2025neurips-improving,
  title     = {{Improving the Straight-Through Estimator with Zeroth-Order Information}},
  author    = {Yang, Ningfeng and Aamodt, Tor M.},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/yang2025neurips-improving/}
}