Neuro-Symbolic Data Generation for Math Reasoning

Abstract

A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent, or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework that combines the intuitive informalization strengths of LLMs with the precise symbolic reasoning of math solvers, together with projected Markov chain Monte Carlo sampling in the highly irregular symbolic space. Empirical experiments demonstrate the high quality of data generated by the proposed method, and that LLMs, specifically LLaMA-2 and Mistral, when realigned with the generated data, surpass their state-of-the-art counterparts.
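The core idea of "mutate, then project onto the valid region" can be illustrated with a toy sketch. This is not the paper's implementation: the problem encoding (coefficients of a linear equation), the `is_valid` check standing in for a symbolic math solver, and the simple rejection-based projection are all hypothetical simplifications, and the chain below is a plain random walk rather than a full MCMC sampler with a target distribution.

```python
import random

def is_valid(a, b, c):
    # Hypothetical stand-in for a symbolic solver check:
    # require the equation a*x + b = c to have a positive integer solution.
    return a != 0 and (c - b) % a == 0 and (c - b) // a > 0

def propose(state, rng):
    # Mutate one coefficient of the problem by a small random step,
    # mimicking a local move in the symbolic problem space.
    s = list(state)
    s[rng.randrange(3)] += rng.choice([-2, -1, 1, 2])
    return tuple(s)

def projected_chain(seed_problem, steps, rng):
    # Random-walk chain over the irregular symbolic space: proposals
    # that fail the solver check are rejected ("projected out"), so
    # every state in the chain is a well-posed problem.
    chain = [seed_problem]
    state = seed_problem
    for _ in range(steps):
        cand = propose(state, rng)
        if is_valid(*cand):  # projection onto the valid region
            state = cand
            chain.append(state)
    return chain

rng = random.Random(0)
# Seed problem: 2x + 3 = 11, whose solution is x = 4.
chain = projected_chain((2, 3, 11), steps=200, rng=rng)
```

Each `(a, b, c)` in `chain` is a valid mutated problem that could then be informalized back into natural language by an LLM, as the abstract describes.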

Cite

Text

Li et al. "Neuro-Symbolic Data Generation for Math Reasoning." Neural Information Processing Systems, 2024. doi:10.52202/079017-0740

Markdown

[Li et al. "Neuro-Symbolic Data Generation for Math Reasoning." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/li2024neurips-neurosymbolic/) doi:10.52202/079017-0740

BibTeX

@inproceedings{li2024neurips-neurosymbolic,
  title     = {{Neuro-Symbolic Data Generation for Math Reasoning}},
  author    = {Li, Zenan and Zhou, Zhi and Yao, Yuan and Li, Yu-Feng and Cao, Chun and Yang, Fan and Zhang, Xian and Ma, Xiaoxing},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0740},
  url       = {https://mlanthology.org/neurips/2024/li2024neurips-neurosymbolic/}
}