CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Abstract
Recent work has widely adopted large language model pretraining for source code, suggested source-code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures to source code. This work investigates another important aspect of such models, the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking source code specifics into account. We propose a subtokenization that reduces average sequence length by 17--40% without a drop in downstream performance, and show that a carefully chosen subtokenization may significantly improve quality by 0.5--2%, possibly at the cost of some length increase.
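For illustration, the following is a minimal sketch of the kind of subtokenization studied in the paper: training a BPE subtokenizer on a toy code corpus with the HuggingFace tokenizers library and inspecting the resulting subtoken sequence and its length. The library choice, toy corpus, and vocabulary size are assumptions for the example and do not reproduce the authors' exact pipeline.

# A minimal sketch (not the authors' pipeline): train a BPE subtokenizer on a
# tiny, hypothetical code corpus and inspect how a snippet is split and how
# long the resulting subtoken sequence is.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus of Python snippets; a real setup trains on a large code corpus.
corpus = [
    "def add(a, b): return a + b",
    "def get_user_name(user): return user.name",
    "for item in items: print(item)",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Simple whitespace/punctuation pre-tokenization; code-specific splitting
# (e.g. of snake_case identifiers) is one of the options such studies compare.
tokenizer.pre_tokenizer = Whitespace()
# Vocabulary size chosen arbitrarily for this toy example.
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("def get_item_name(item): return item.name")
print(encoding.tokens)       # subtoken sequence produced by the learned merges
print(len(encoding.tokens))  # sequence length, the quantity traded off against downstream quality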
Cite
Text
Chirkova and Troshin. "CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code." ICLR 2022 Workshops: DL4C, 2022.
Markdown
[Chirkova and Troshin. "CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code." ICLR 2022 Workshops: DL4C, 2022.](https://mlanthology.org/iclrw/2022/chirkova2022iclrw-codebpe/)
BibTeX
@inproceedings{chirkova2022iclrw-codebpe,
title = {{CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code}},
author = {Chirkova, Nadezhda and Troshin, Sergey},
booktitle = {ICLR 2022 Workshops: DL4C},
year = {2022},
url = {https://mlanthology.org/iclrw/2022/chirkova2022iclrw-codebpe/}
}