ViTs for SITS: Vision Transformers for Satellite Image Time Series
Abstract
In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code can be found at https://github.com/michaeltrs/DeepSatModels.
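The temporal-then-spatial factorization described in the abstract can be illustrated with a short sketch. The PyTorch snippet below is not the authors' implementation: it omits the acquisition-time-specific positional encodings and the multiple learnable class tokens from the paper, and the class name, token shape, and hyperparameters are illustrative assumptions; see the linked repository for the reference code.

```python
import torch
import torch.nn as nn

class FactorizedTemporoSpatialEncoder(nn.Module):
    """Illustrative temporal-then-spatial factorized encoder (not the official TSViT)."""

    def __init__(self, dim=128, depth=2, heads=4):
        super().__init__()
        # Two independent Transformer stacks: one attends across acquisition
        # times, the other across spatial patch positions.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=depth)
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=depth)

    def forward(self, x):
        # x: (B, T, N, D) -- batch, acquisition times, spatial patches, embed dim.
        B, T, N, D = x.shape
        # Temporal stage first: fold spatial patches into the batch so every
        # patch location is encoded as an independent length-T sequence.
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        x = self.temporal(x)
        # Spatial stage second: fold time into the batch so every time step is
        # encoded as an independent length-N sequence of patch tokens.
        x = x.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B * T, N, D)
        x = self.spatial(x)
        return x.reshape(B, T, N, D)

# Example: a batch of 2 records, 4 acquisition times, 16 patches, 128-dim tokens.
tokens = torch.randn(2, 4, 16, 128)
out = FactorizedTemporoSpatialEncoder()(tokens)
print(out.shape)  # torch.Size([2, 4, 16, 128])
```

Encoding over time before space mirrors the abstract's argument that, unlike in natural images, the per-location temporal dynamics of a SITS record carry much of the discriminative signal, so they are aggregated before spatial context is mixed in.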
Cite
Text
Tarasiou et al. "ViTs for SITS: Vision Transformers for Satellite Image Time Series." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01004
Markdown
[Tarasiou et al. "ViTs for SITS: Vision Transformers for Satellite Image Time Series." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/tarasiou2023cvpr-vits/) doi:10.1109/CVPR52729.2023.01004
BibTeX
@inproceedings{tarasiou2023cvpr-vits,
title = {{ViTs for SITS: Vision Transformers for Satellite Image Time Series}},
author = {Tarasiou, Michail and Chavez, Erik and Zafeiriou, Stefanos},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {10418--10428},
doi = {10.1109/CVPR52729.2023.01004},
url = {https://mlanthology.org/cvpr/2023/tarasiou2023cvpr-vits/}
}