D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-Shot Action Recognition

Abstract

Adapting pre-trained image models to the video modality has proven to be an effective strategy for robust few-shot action recognition. In this work, we explore the potential of adapter tuning for image-to-video model adaptation and propose a novel video adapter tuning framework, called Disentangled-and-Deformable Spatio-Temporal Adapter (D^2ST-Adapter). It features a lightweight design, low adaptation overhead, and powerful spatio-temporal feature adaptation capabilities. D^2ST-Adapter has an internal dual-pathway architecture that disentangles the encoding of spatial and temporal features within the adapter while integrating seamlessly into the single-stream feature learning framework of pre-trained image models. In particular, we develop an efficient yet effective implementation of D^2ST-Adapter that uses a specially devised anisotropic Deformable Spatio-Temporal Attention as its pivotal operation. This mechanism can be tailored individually to the two pathways by using anisotropic sampling densities along the spatial and temporal domains of the 3D spatio-temporal space, which enables disentangled encoding of spatial and temporal features while keeping the design lightweight. Extensive experiments that instantiate our method on both pre-trained ResNet and ViT backbones demonstrate its superiority over state-of-the-art methods. Our method is particularly well suited to challenging scenarios in which temporal dynamics are critical for action recognition. Code will be released.
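The idea of anisotropic sampling densities can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 8x14x14 feature volume, the per-pathway grid sizes, and the assumption that the spatial pathway samples densely in space while the temporal pathway samples densely in time are all hypothetical choices made for illustration. The sketch only builds the reference points that a deformable attention layer might shift with learned offsets; the attention computation itself is omitted.

```python
from itertools import product


def reference_points(n_t, n_h, n_w):
    """Uniform grid of (t, y, x) reference points, normalized to (-1, 1)."""
    def centers(n):
        # Centers of n equal cells covering the interval [-1, 1].
        return [(2 * i + 1) / n - 1 for i in range(n)]
    return list(product(centers(n_t), centers(n_h), centers(n_w)))


def clamp(v, lo=-1.0, hi=1.0):
    """Keep a coordinate inside the normalized feature volume."""
    return max(lo, min(hi, v))


def deform(points, offsets):
    """Shift each reference point by its offset, clamped to the volume."""
    return [tuple(clamp(p + o) for p, o in zip(pt, off))
            for pt, off in zip(points, offsets)]


# Hypothetical feature volume: T=8 frames of 14x14 tokens.
T, H, W = 8, 14, 14

# Spatial pathway (assumed): dense along height/width, sparse along time.
spatial_refs = reference_points(n_t=2, n_h=7, n_w=7)
# Temporal pathway (assumed): dense along time, sparse along height/width.
temporal_refs = reference_points(n_t=8, n_h=2, n_w=2)

# Zero offsets stand in for what a learned offset network would predict.
spatial_pts = deform(spatial_refs, [(0.0, 0.0, 0.0)] * len(spatial_refs))

# Both pathways attend to far fewer points than the full token volume.
print(len(spatial_refs), len(temporal_refs), T * H * W)  # 98 32 1568
```

Under these assumed grid sizes, each pathway attends to a few dozen sampled points instead of all 1568 tokens, which is one way a design of this kind could stay lightweight while treating the spatial and temporal axes differently.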

Cite

Text

Pei et al. "D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-Shot Action Recognition." International Conference on Computer Vision, 2025.

Markdown

[Pei et al. "D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-Shot Action Recognition." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/pei2025iccv-d2stadapter/)

BibTeX

@inproceedings{pei2025iccv-d2stadapter,
  title     = {{D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-Shot Action Recognition}},
  author    = {Pei, Wenjie and Tan, Qizhong and Lu, Guangming and Tian, Jiandong and Yu, Jun},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {11317--11326},
  url       = {https://mlanthology.org/iccv/2025/pei2025iccv-d2stadapter/}
}