Latent Speech-Text Transformer

Lu, Yen-Ju; Gaur, Yashesh; Zhou, Wei; Muller, Benjamin; Villalba, Jesus; Dehak, Najim; Zettlemoyer, Luke; Ghosh, Gargi; Lewis, Mike; Iyer, Srini; Le, Duc

ICLR 2026

Abstract

Autoregressive speech–text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly because speech token sequences are much longer than their text counterparts. This modality imbalance allocates a disproportionate share of pre-training and inference compute to speech, potentially hindering effective cross-modal alignment and slowing performance scaling by orders of magnitude. We introduce the Latent Speech-Text Transformer (LST), which aggregates speech tokens into latent speech patches that serve as higher-level autoregressive units. This design aligns the sequence-modeling granularity between speech and text while improving computational efficiency. The resulting patches can align with textual units to facilitate cross-modal knowledge transfer and compactly capture recurring acoustic patterns such as silence. Across story-completion benchmarks under both compute-controlled and data-controlled settings, LST consistently improves speech accuracy while also improving text performance, achieving up to +6.5% absolute gain on speech HellaSwag in compute-controlled training (+5.3% in data-controlled training). Under compute-controlled scaling from 420M to 1.8B parameters in a near compute-optimal regime, gains grow with scale, and improvements persist up to 7B parameters under fixed-token budgets. These benefits extend to downstream tasks: LST stabilizes ASR adaptation and reduces the effective autoregressive sequence length during ASR and TTS inference, lowering computational cost without degrading reconstruction quality. Code is available at https://github.com/facebookresearch/lst.
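To make the sequence-length argument concrete, the following is a minimal sketch of how grouping speech tokens into patches shrinks the number of autoregressive units in an interleaved sequence. It is not the authors' implementation: the function name `patch_speech`, the `(modality, token_id)` representation, and the fixed `patch_size` are all assumptions made for illustration, and the abstract does not specify how LST forms or encodes its patches.

```python
from typing import List, Tuple

# Hypothetical representation: (modality, token id), modality is "text" or "speech".
Token = Tuple[str, int]
Unit = Tuple[str, List[int]]


def patch_speech(tokens: List[Token], patch_size: int) -> List[Unit]:
    """Group consecutive speech tokens into patches of at most `patch_size` ids.

    Text tokens pass through unchanged as single-id units. Fixed-size
    grouping is an assumption of this sketch, not a claim about LST.
    """
    if patch_size < 1:
        raise ValueError("patch_size must be >= 1")
    units: List[Unit] = []
    for modality, tok in tokens:
        last = units[-1] if units else None
        can_extend = (
            modality == "speech"
            and last is not None
            and last[0] == "speech"
            and len(last[1]) < patch_size
        )
        if can_extend:
            last[1].append(tok)  # extend the open speech patch
        else:
            units.append((modality, [tok]))  # start a new unit
    return units


# Illustrative input: 2 text tokens, 10 speech tokens, 1 text token.
sequence: List[Token] = (
    [("text", 1), ("text", 2)]
    + [("speech", i) for i in range(10)]
    + [("text", 3)]
)
units = patch_speech(sequence, patch_size=4)
print(len(sequence), "tokens ->", len(units), "autoregressive units")
```

In this toy input the ten speech tokens collapse into three patches, so the model would step over 6 units instead of 13 tokens, while the text tokens keep their original granularity.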

Cite

Text

Lu et al. "Latent Speech-Text Transformer." International Conference on Learning Representations, 2026.

Markdown

[Lu et al. "Latent Speech-Text Transformer." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lu2026iclr-latent/)

BibTeX

@inproceedings{lu2026iclr-latent,
  title     = {{Latent Speech-Text Transformer}},
  author    = {Lu, Yen-Ju and Gaur, Yashesh and Zhou, Wei and Muller, Benjamin and Villalba, Jesus and Dehak, Najim and Zettlemoyer, Luke and Ghosh, Gargi and Lewis, Mike and Iyer, Srini and Le, Duc},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lu2026iclr-latent/}
}