CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

Abstract

Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures to source code. This work investigates another important aspect of such models: the effect of different subtokenization options. It aims at identifying the most effective and length-efficient subtokenizations, taking source code specifics into account. We propose a subtokenization that reduces average length by 17–40% without a drop in downstream performance, and show that a carefully chosen subtokenization may improve quality by 0.5–2%, possibly at the cost of some length increase.
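The quantity behind the abstract's length-efficiency claims is the average number of subtokens a trained byte-pair encoding (BPE) subtokenizer produces per code snippet. The following is a minimal sketch, assuming the HuggingFace tokenizers library, a tiny toy corpus, and an arbitrary vocabulary size (none of these are the paper's actual setup), of training such a subtokenizer on source code and measuring average encoded length:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Tiny illustrative "corpus" of source code; the paper pretrains on far larger data.
corpus = [
    "def add(a, b):\n    return a + b",
    "def multiply(a, b):\n    return a * b",
    "for i in range(10):\n    print(i)",
]

# Train a byte-level BPE subtokenizer on the code corpus.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(vocab_size=300, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Average encoded length is the quantity the length-efficiency comparison refers to.
lengths = [len(tokenizer.encode(snippet).ids) for snippet in corpus]
print("average subtoken length:", sum(lengths) / len(lengths))

Comparing the subtokenization options studied in the paper amounts to changing how the trainer and pre-tokenizer are configured while keeping this length measurement (and the downstream evaluation) fixed.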

Cite

Text

Chirkova and Troshin. "CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code." International Conference on Learning Representations, 2023.

Markdown

[Chirkova and Troshin. "CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/chirkova2023iclr-codebpe/)

BibTeX

@inproceedings{chirkova2023iclr-codebpe,
  title     = {{CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code}},
  author    = {Chirkova, Nadezhda and Troshin, Sergey},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/chirkova2023iclr-codebpe/}
}