Symbolic Autoencoding for Self-Supervised Sequence Learning

Abstract

Traditional language models (LMs) excel at next-token prediction in text sequences but often struggle with transduction tasks involving distinct symbolic systems, particularly when parallel data is scarce or nonexistent. This issue is even more pronounced in domains dealing with complex, non-natural-language sequences, such as audio signals, protein structures, or biological sequences, where the strengths of LMs in natural language do not directly translate. To address this challenge, we introduce symbolic autoencoding ($\Sigma$AE), a self-supervised framework designed to exploit the wealth of non-parallel data alongside limited parallel data. $\Sigma$AE connects two generative models via a discrete bottleneck layer and optimizes the entire system end-to-end: an unsupervised reconstruction loss is minimized on all data so that the sequence generated at the discrete bottleneck can be read out as the transduced input sequence, and the two models are separately optimized with a supervised loss on the subset of labeled parallel data. To enable optimization through the discrete symbols, we use a family of straight-through gradient estimators. We demonstrate the effectiveness of $\Sigma$AE on four sequence-to-sequence transduction tasks, showing that it significantly outperforms strong baselines in weakly supervised settings.
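
To make the setup concrete, below is a minimal sketch (not the authors' released code) of the core idea: two sequence models coupled through a discrete bottleneck, trained with an unsupervised reconstruction loss, with a straight-through estimator letting gradients flow through the discrete symbols. All module names, sizes, and the specific estimator (here, the hard Gumbel-softmax variant) are illustrative assumptions, and the recurrent encoder/decoder stand in for whatever generative models are actually used.

# Minimal sketch of a symbolic-autoencoding-style pipeline (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StraightThroughBottleneck(nn.Module):
    """Maps hidden states to discrete symbols; gradients pass straight through."""
    def __init__(self, hidden_dim, z_vocab, tau=1.0):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, z_vocab)
        self.embed = nn.Embedding(z_vocab, hidden_dim)
        self.tau = tau

    def forward(self, h):
        logits = self.proj(h)                                  # (B, T, V_z)
        # Hard one-hot in the forward pass, soft gradient in the backward pass.
        one_hot = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        symbols = one_hot.argmax(dim=-1)                       # discrete readout at the bottleneck
        h_discrete = one_hot @ self.embed.weight               # differentiable embedding of the symbols
        return h_discrete, symbols

class SigmaAE(nn.Module):
    """x -> z (discrete) -> x autoencoder built from two sequence models."""
    def __init__(self, x_vocab, z_vocab, hidden_dim=256):
        super().__init__()
        self.x_embed = nn.Embedding(x_vocab, hidden_dim)
        self.model_xz = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # stands in for the x->z model
        self.bottleneck = StraightThroughBottleneck(hidden_dim, z_vocab)
        self.model_zx = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # stands in for the z->x model
        self.out = nn.Linear(hidden_dim, x_vocab)

    def forward(self, x_tokens):
        h, _ = self.model_xz(self.x_embed(x_tokens))
        z_hidden, z_symbols = self.bottleneck(h)               # transduced sequence at the bottleneck
        r, _ = self.model_zx(z_hidden)
        return self.out(r), z_symbols

# Unsupervised reconstruction step on non-parallel data; on labeled parallel pairs,
# a supervised loss on the bottleneck symbols would be added to this objective.
model = SigmaAE(x_vocab=100, z_vocab=50)
x = torch.randint(0, 100, (8, 12))
x_logits, z_symbols = model(x)
loss = F.cross_entropy(x_logits.transpose(1, 2), x)
loss.backward()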

Cite

Text

Amani et al. "Symbolic Autoencoding for Self-Supervised Sequence Learning." ICML 2024 Workshops: Differentiable_Almost_Everything, 2024.

Markdown

[Amani et al. "Symbolic Autoencoding for Self-Supervised Sequence Learning." ICML 2024 Workshops: Differentiable_Almost_Everything, 2024.](https://mlanthology.org/icmlw/2024/amani2024icmlw-symbolic/)

BibTeX

@inproceedings{amani2024icmlw-symbolic,
  title     = {{Symbolic Autoencoding for Self-Supervised Sequence Learning}},
  author    = {Amani, Mohammad Hossein and Baldwin, Nicolas and Mansouri, Amin and Josifoski, Martin and Peyrard, Maxime and West, Robert},
  booktitle = {ICML 2024 Workshops: Differentiable_Almost_Everything},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/amani2024icmlw-symbolic/}
}