UniSTD: Towards Unified Spatio-Temporal Learning Across Diverse Disciplines

Abstract

Traditional spatiotemporal models generally rely on task-specific architectures, whose domain-specific design requirements limit their generalizability and scalability across diverse tasks. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, inspired by recent advances in foundation models that follow the two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve learning capabilities across domains, our framework employs a rank-adaptive mixture-of-experts adaptation that uses fractional interpolation to relax the discrete rank variables so that they can be optimized in continuous space. Additionally, we introduce a temporal module to explicitly incorporate temporal dynamics. We evaluate our approach on a large-scale benchmark covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.
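
To illustrate the core idea of relaxing a discrete rank choice via fractional interpolation, the sketch below shows a hypothetical LoRA-style adapter whose rank is treated as a continuous, learnable variable. This is not the authors' released implementation (see the GitHub link above); module names, shapes, and the sigmoid parameterization of the rank are assumptions made for illustration only.

```python
# Minimal sketch (assumed names/shapes): a low-rank adapter whose rank is relaxed
# from a discrete choice to a continuous variable via fractional interpolation,
# so the rank can be optimized with gradients alongside the other parameters.
import torch
import torch.nn as nn


class RankAdaptiveLoRA(nn.Module):
    def __init__(self, dim_in: int, dim_out: int, max_rank: int = 16):
        super().__init__()
        self.max_rank = max_rank
        self.A = nn.Parameter(torch.randn(max_rank, dim_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(dim_out, max_rank))        # up-projection
        # Unconstrained logit mapped to a continuous rank in [1, max_rank].
        self.rank_logit = nn.Parameter(torch.tensor(0.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Continuous rank r in [1, max_rank].
        r = 1.0 + (self.max_rank - 1.0) * torch.sigmoid(self.rank_logit)
        low = torch.floor(r)
        high = torch.clamp(torch.ceil(r), max=self.max_rank)
        frac = r - low  # fractional part, differentiable w.r.t. rank_logit

        def masked_delta(k: torch.Tensor) -> torch.Tensor:
            # Keep only the first k rank components (a hard, discrete rank).
            mask = (torch.arange(self.max_rank, device=x.device) < k).float()
            return x @ (self.A * mask.unsqueeze(1)).T @ self.B.T

        # Fractional interpolation between the two neighboring discrete ranks,
        # keeping the effective rank differentiable in continuous space.
        return (1.0 - frac) * masked_delta(low) + frac * masked_delta(high)
```

In this sketch, only the interpolation weight carries gradients to the rank parameter; after training, the rank can be rounded back to the nearest discrete value for deployment.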

Cite

Text

Tang et al. "UniSTD: Towards Unified Spatio-Temporal Learning Across Diverse Disciplines." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02720

Markdown

[Tang et al. "UniSTD: Towards Unified Spatio-Temporal Learning Across Diverse Disciplines." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/tang2025cvpr-unistd/) doi:10.1109/CVPR52734.2025.02720

BibTeX

@inproceedings{tang2025cvpr-unistd,
  title     = {{UniSTD: Towards Unified Spatio-Temporal Learning Across Diverse Disciplines}},
  author    = {Tang, Chen and Ma, Xinzhu and Su, Encheng and Song, Xiufeng and Liu, Xiaohong and Li, Wei-Hong and Bai, Lei and Ouyang, Wanli and Yue, Xiangyu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29213--29224},
  doi       = {10.1109/CVPR52734.2025.02720},
  url       = {https://mlanthology.org/cvpr/2025/tang2025cvpr-unistd/}
}