MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation

Abstract

Long-term dense action anticipation is very challenging since it requires predicting actions and their durations several minutes into the future based on provided video observations. To model the uncertainty of future outcomes, stochastic models predict several potential future action sequences for the same observation. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, previous work struggles to achieve such long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets (Breakfast, 50Salads, and Assembly101) while also significantly improving computational and memory efficiency. Our code is available at https://github.com/olga-zats/DIFF_MANTA.

Cite

Text

Zatsarynna et al. "MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00326

Markdown

[Zatsarynna et al. "MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zatsarynna2025cvpr-manta/) doi:10.1109/CVPR52734.2025.00326

BibTeX

@inproceedings{zatsarynna2025cvpr-manta,
  title     = {{MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation}},
  author    = {Zatsarynna, Olga and Bahrami, Emad and Farha, Yazan Abu and Francesca, Gianpiero and Gall, Juergen},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3438--3448},
  doi       = {10.1109/CVPR52734.2025.00326},
  url       = {https://mlanthology.org/cvpr/2025/zatsarynna2025cvpr-manta/}
}