Character-Level Tokenizations as Powerful Inductive Biases for RNA Foundational Models
Abstract
RNA plays a critical role in cellular functions and is increasingly targeted for therapeutics, yet its structural complexity poses challenges for computational modeling. While foundational models have transformed protein representation learning, achieving similar success for RNA remains elusive. We introduce ChaRNABERT, a suite of sample- and parameter-efficient RNA foundational models that leverage a learnable tokenization process to achieve superior performance across established benchmarks. We further validate its capabilities on downstream tasks, including RNA-protein and aptamer-protein interaction prediction. The ChaRNABERT-8M model, along with inference code, will be publicly available for academic research, with additional models provided upon request.
Cite
Text
Morales-Pastor et al. "Character-Level Tokenizations as Powerful Inductive Biases for RNA Foundational Models." ICLR 2025 Workshops: AI4NA, 2025.
Markdown
[Morales-Pastor et al. "Character-Level Tokenizations as Powerful Inductive Biases for RNA Foundational Models." ICLR 2025 Workshops: AI4NA, 2025.](https://mlanthology.org/iclrw/2025/moralespastor2025iclrw-characterlevel/)
BibTeX
@inproceedings{moralespastor2025iclrw-characterlevel,
title = {{Character-Level Tokenizations as Powerful Inductive Biases for RNA Foundational Models}},
author = {Morales-Pastor, Adrian and Vázquez-Reza, Raquel and Wieczór, Miłosz and Valverde, Clàudia and Gil-Sorribes, Manel and Miquel-Oliver, Bertran and Serrano, Alvaro Ciudad and Molina, Alexis},
booktitle = {ICLR 2025 Workshops: AI4NA},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/moralespastor2025iclrw-characterlevel/}
}