ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Abstract

Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channel ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48% of the data required by traditional two-stage methods.
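To make the tokenization idea concrete, below is a minimal Python sketch of a BPE-style tokenizer applied to a quantized single-lead ECG. This is an illustration under stated assumptions, not the paper's implementation: the quantization scheme, the greedy merge loop, and the names quantize, train_bpe, and decode are all hypothetical.

# Minimal sketch of BPE over a discretized ECG lead (hypothetical, not ECG-Byte itself).
import numpy as np
from collections import Counter

def quantize(signal, n_bins=64):
    # Map continuous ECG samples to integer symbol IDs (assumed uniform binning).
    lo, hi = signal.min(), signal.max()
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    return np.digitize(signal, edges).tolist()

def train_bpe(seq, n_merges=100):
    # Learn merge rules by repeatedly fusing the most frequent adjacent pair.
    merges = {}
    next_id = max(seq) + 1  # new token IDs start above the base alphabet
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges[pair] = next_id
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_id += 1
    return merges, seq

def decode(tokens, merges):
    # Expand merged tokens back to base symbols, recovering the quantized signal.
    inverse = {tid: pair for pair, tid in merges.items()}
    out, changed = list(tokens), True
    while changed:
        changed = False
        expanded = []
        for t in out:
            if t in inverse:
                expanded.extend(inverse[t])
                changed = True
            else:
                expanded.append(t)
        out = expanded
    return out

# Example: round-trip a synthetic lead.
ecg_lead = np.sin(np.linspace(0, 8 * np.pi, 2000)) + 0.05 * np.random.randn(2000)
symbols = quantize(ecg_lead)
merges, tokens = train_bpe(symbols, n_merges=200)
assert decode(tokens, merges) == symbols  # lossless back-mapping to quantized symbols
print(f"{len(symbols)} symbols compressed to {len(tokens)} tokens")

The decode function illustrates the interpretability property claimed above: because each merged token expands deterministically into base symbols, every ECG token can be traced back to the segment of quantized signal it covers.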

Cite

Text

Han et al. "ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling." Proceedings of the 10th Machine Learning for Healthcare Conference, 2025.

Markdown

[Han et al. "ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling." Proceedings of the 10th Machine Learning for Healthcare Conference, 2025.](https://mlanthology.org/mlhc/2025/han2025mlhc-ecgbyte/)

BibTeX

@inproceedings{han2025mlhc-ecgbyte,
  title     = {{ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling}},
  author    = {Han, William and Duan, Chaojing and Rosenberg, Michael and Liu, Emerson and Zhao, Ding},
  booktitle = {Proceedings of the 10th Machine Learning for Healthcare Conference},
  year      = {2025},
  volume    = {298},
  url       = {https://mlanthology.org/mlhc/2025/han2025mlhc-ecgbyte/}
}