TRecViT: A Recurrent Video Transformer

Abstract

We propose a novel block for causal video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs mix over channels. The resulting architecture, TRecViT, is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes, and is the first causal video model in the state-space model family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400), while having 3x fewer parameters, a 12x smaller memory footprint, and a 5x lower FLOPs count than the full self-attention ViViT, with an inference throughput of about 300 frames per second, running comfortably in real time. When compared with causal transformer-based models (TSM, RViT) and other recurrent models like LSTM, TRecViT obtains state-of-the-art results on the challenging SSv2 dataset. Code and checkpoints are available online at https://github.com/google-deepmind/trecvit.
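The sketch below illustrates the time-space-channel factorisation described in the abstract, using flax.linen. It is a minimal, illustrative reconstruction, not the released implementation: the names `GatedLRU` and `TRecViTBlock` are hypothetical, and the temporal block is simplified to a plain sigmoid-gated diagonal recurrence rather than the exact LRU parameterisation used in the paper.

```python
# Minimal sketch (assumed names, simplified gating) of a factorised
# time-space-channel block: causal recurrence over time, self-attention
# over space, MLP over channels. Input shape: [batch, time, tokens, channels].
import jax
import jax.numpy as jnp
import flax.linen as nn


class GatedLRU(nn.Module):
    """Simplified gated linear recurrence over time (illustrative only)."""
    dim: int

    @nn.compact
    def __call__(self, x):  # x: [batch * tokens, time, dim]
        a = nn.sigmoid(nn.Dense(self.dim)(x))  # decay gate in (0, 1)
        b = nn.sigmoid(nn.Dense(self.dim)(x))  # input gate

        def step(h, inputs):
            a_t, b_t, x_t = inputs
            h = a_t * h + b_t * x_t  # diagonal recurrence h_t = a_t*h_{t-1} + b_t*x_t
            return h, h

        h0 = jnp.zeros((x.shape[0], self.dim))
        # Scan causally over the time axis.
        _, hs = jax.lax.scan(
            step, h0,
            (a.swapaxes(0, 1), b.swapaxes(0, 1), x.swapaxes(0, 1)))
        return hs.swapaxes(0, 1)


class TRecViTBlock(nn.Module):
    """One factorised block: LRU over time, attention over space, MLP over channels."""
    num_heads: int = 8

    @nn.compact
    def __call__(self, x):  # x: [batch, time, tokens, dim]
        B, T, N, D = x.shape
        # 1) Temporal mixing: an independent causal recurrence per spatial token.
        xt = x.transpose((0, 2, 1, 3)).reshape(B * N, T, D)
        xt = GatedLRU(dim=D)(nn.LayerNorm()(xt)) + xt
        x = xt.reshape(B, N, T, D).transpose((0, 2, 1, 3))
        # 2) Spatial mixing: self-attention among the tokens of each frame.
        xs = x.reshape(B * T, N, D)
        h = nn.LayerNorm()(xs)
        xs = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(h, h) + xs
        x = xs.reshape(B, T, N, D)
        # 3) Channel mixing: position-wise MLP with residual connection.
        y = nn.Dense(4 * D)(nn.LayerNorm()(x))
        y = nn.Dense(D)(nn.gelu(y))
        return x + y
```

As a usage sketch, `TRecViTBlock(num_heads=8).init(rng, x)` with `x` of shape `[batch, time, tokens, channels]` initialises the parameters; causality comes solely from the temporal scan, since attention only mixes tokens within a single frame.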

Cite

Text

Patraucean et al. "TRecViT: A Recurrent Video Transformer." Transactions on Machine Learning Research, 2026.

Markdown

[Patraucean et al. "TRecViT: A Recurrent Video Transformer." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/patraucean2026tmlr-trecvit/)

BibTeX

@article{patraucean2026tmlr-trecvit,
  title     = {{TRecViT: A Recurrent Video Transformer}},
  author    = {Patraucean, Viorica and He, Xu Owen and Heyward, Joseph and Zhang, Chuhan and Sajjadi, Mehdi S. M. and Muraru, George-Cristian and Zholus, Artem and Karami, Mahdi and Goroshin, Ross and Chen, Yutian and Osindero, Simon and Carreira, Joao and Pascanu, Razvan},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/patraucean2026tmlr-trecvit/}
}