Spatio-Temporal Contrastive Domain Adaptation for Action Recognition
Abstract
Unsupervised domain adaptation (UDA) for human action recognition is a practical and challenging problem. Compared with image-based UDA, video-based UDA must bridge the domain shift in both spatial representation and temporal dynamics. Most previous works focus on short-term modeling and alignment with frame-level or clip-level features, which are not sufficiently discriminative for video-based UDA tasks. To address these problems, we propose to establish cross-modal domain alignment via a self-supervised contrastive framework, i.e., spatio-temporal contrastive domain adaptation (STCDA), which learns joint clip-level and video-level representation alignment. Since effective representations can be learned from unlabeled data by self-supervised learning (SSL), we propose spatio-temporal contrastive learning (STCL) to explore useful long-term feature representations for classification, using a self-supervised setting trained on contrastive clip/video pairs that are either positive or negative. In addition, we introduce a novel domain metric scheme, i.e., video-based contrastive alignment (VCA), to optimize category-aware video-level alignment and generalization between the source and target domains. The proposed STCDA achieves state-of-the-art results on several UDA benchmarks for action recognition.
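The abstract does not give the form of the contrastive objective. As a rough illustration of training on positive and negative clip pairs, the sketch below computes a generic InfoNCE-style loss over embedding vectors. This is an assumption chosen for illustration, not the STCL or VCA loss from the paper; the names `cosine`, `info_nce`, and the `temperature` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss for one anchor embedding (illustrative only).

    anchor, positive: embeddings of two clips treated as a positive pair.
    negatives: embeddings of clips treated as negatives for the anchor.
    temperature: hypothetical scaling factor, not taken from the paper.
    """
    # Scaled similarities; index 0 holds the positive pair.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum - logits[0]


# Toy 2-D embeddings: the loss is small when the positive matches the anchor
# and large when a negative matches it instead.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In this toy setup `aligned` is close to zero and `misaligned` is much larger, which is the behavior a contrastive objective relies on: embeddings of positive pairs are pulled together relative to negatives.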
Cite
Text
Song et al. "Spatio-Temporal Contrastive Domain Adaptation for Action Recognition." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00966
Markdown
[Song et al. "Spatio-Temporal Contrastive Domain Adaptation for Action Recognition." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/song2021cvpr-spatiotemporal/) doi:10.1109/CVPR46437.2021.00966
BibTeX
@inproceedings{song2021cvpr-spatiotemporal,
title = {{Spatio-Temporal Contrastive Domain Adaptation for Action Recognition}},
author = {Song, Xiaolin and Zhao, Sicheng and Yang, Jingyu and Yue, Huanjing and Xu, Pengfei and Hu, Runbo and Chai, Hua},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2021},
pages = {9787-9795},
doi = {10.1109/CVPR46437.2021.00966},
url = {https://mlanthology.org/cvpr/2021/song2021cvpr-spatiotemporal/}
}