Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers
Abstract
Modern systems for automatic speech recognition, including the RNN-Transducer (RNN-T) and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information when mapping the audio sequence into the embedding; alignment to the final text output is instead performed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the ''Aligner-Encoder''. To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention---it simply scans embedding frames in order from the beginning, producing one token per frame until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform ''self-transduction''.
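The decoding procedure described above can be illustrated with a minimal sketch. This is not the authors' code: `predict_fn` stands in for the text-only recurrent decoder, and the token names are hypothetical. The key point it shows is that, because the encoder has already aligned the audio internally, the decoder can simply walk the embedding frames left to right, emitting one token per frame until it predicts end-of-message.

```python
EOS = "<eos>"  # hypothetical end-of-message token
BOS = "<bos>"  # hypothetical start token fed to the decoder first

def aligner_decode(frames, predict_fn):
    """Greedy Aligner-Encoder decoding sketch: scan encoder output
    frames strictly in order, one token per frame, stopping when the
    end-of-message token is predicted. No cross-attention is needed."""
    tokens = []
    prev = BOS
    for frame in frames:
        tok = predict_fn(frame, prev)  # decoder conditions only on prior text
        if tok == EOS:
            break
        tokens.append(tok)
        prev = tok
    return tokens

# Toy predictor: each frame already "contains" its aligned token,
# standing in for an encoder that performed the alignment internally.
if __name__ == "__main__":
    frames = ["h", "i", EOS]
    print(aligner_decode(frames, lambda frame, prev: frame))  # → ['h', 'i']
```

Note the absence of any per-frame blank/emit decision (as in RNN-T) or learned attention over the full encoder output (as in AED); the scan order itself supplies the alignment.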
Cite

Text

Stooke et al. "Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers." Neural Information Processing Systems, 2024. doi:10.52202/079017-3184

Markdown

[Stooke et al. "Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/stooke2024neurips-alignerencoders/) doi:10.52202/079017-3184

BibTeX
@inproceedings{stooke2024neurips-alignerencoders,
title = {{Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers}},
author = {Stooke, Adam and Prabhavalkar, Rohit and Sim, Khe Chai and Mengibar, Pedro Moreno},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-3184},
url = {https://mlanthology.org/neurips/2024/stooke2024neurips-alignerencoders/}
}