Program Synthesis for Character Level Language Modeling

Abstract

We propose a statistical model applicable to character-level language modeling and show that it is a good fit for both program source code and English text. The model is parameterized by a program from a domain-specific language (DSL) that allows expressing non-trivial data dependencies. Learning is done in two phases: (i) we synthesize a program from the DSL, essentially learning a good representation for the data, and (ii) we learn parameters from the training data; this second phase is done via counting, as in simple language models such as n-gram models. Our experiments show that the precision of our model is comparable to that of neural networks while sharing a number of advantages with n-gram models, such as fast query time and the capability to quickly add and remove training data samples. Further, the model is parameterized by a program that can be manually inspected, understood and updated, addressing a major problem of neural networks.
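The counting-based second phase described above can be illustrated with a minimal sketch. The sketch below uses a plain fixed-length character context (i.e. an ordinary n-gram model); the paper's actual contribution is to replace this fixed context with a context function synthesized from a DSL, which is not reproduced here. The class name and structure are illustrative, not taken from the paper.

```python
from collections import defaultdict

class CharNGramModel:
    """Character-level model trained purely by counting.

    Sketch of the counting backbone only: the paper replaces the
    fixed (n-1)-character context below with a DSL-synthesized
    program that computes the conditioning context.
    """

    def __init__(self, n=3):
        self.n = n
        # counts[context][next_char] = number of occurrences
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, text):
        # Training is just incrementing (context -> next char) counts.
        padded = "^" * (self.n - 1) + text
        for i in range(len(text)):
            ctx = padded[i:i + self.n - 1]
            self.counts[ctx][padded[i + self.n - 1]] += 1

    def remove(self, text):
        # Removing a training sample is just decrementing the same
        # counts -- the fast add/remove property noted in the abstract.
        padded = "^" * (self.n - 1) + text
        for i in range(len(text)):
            ctx = padded[i:i + self.n - 1]
            self.counts[ctx][padded[i + self.n - 1]] -= 1

    def prob(self, ctx, ch):
        # Maximum-likelihood estimate from counts (no smoothing).
        c = self.counts[ctx]
        total = sum(c.values())
        return c[ch] / total if total else 0.0
```

Because both training and untraining are simple count updates, queries and dataset edits are cheap, in contrast to retraining a neural network.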

Cite

Text

Bielik et al. "Program Synthesis for Character Level Language Modeling." International Conference on Learning Representations, 2017.

Markdown

[Bielik et al. "Program Synthesis for Character Level Language Modeling." International Conference on Learning Representations, 2017.](https://mlanthology.org/iclr/2017/bielik2017iclr-program/)

BibTeX

@inproceedings{bielik2017iclr-program,
  title     = {{Program Synthesis for Character Level Language Modeling}},
  author    = {Bielik, Pavol and Raychev, Veselin and Vechev, Martin T.},
  booktitle = {International Conference on Learning Representations},
  year      = {2017},
  url       = {https://mlanthology.org/iclr/2017/bielik2017iclr-program/}
}