Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning

Abstract

Integration with CLIP (Contrastive Language-Image Pre-training) has significantly raised the accuracy of FSAR (Few-Shot Action Recognition) methods. However, the training overhead of aligning the CLIP and FSAR domains is often unbearable. To mitigate this issue, we present an Efficient Multi-Level Post-Reasoning Network, namely EMP-Net. By design, a post-reasoning mechanism handles domain adaptation while avoiding most gradient backpropagation, which improves efficiency; meanwhile, multi-level representations are utilised during the reasoning and matching processes to improve discriminability, which ensures effectiveness. Specifically, the proposed EMP-Net starts with a skip-fusion of cached multi-stage features extracted by CLIP. The fused feature is then decoupled into multi-level representations: global-level, patch-level, and frame-level. The ensuing spatiotemporal reasoning module operates on these multi-level representations to generate discriminative features. For matching, the text-visual and support-query contrasts are integrated to provide comprehensive guidance. The experimental results demonstrate that EMP-Net can unlock the potential performance of CLIP in a more efficient manner. The code and supplementary material can be found at https://github.com/cong-wu/EMP-Net.
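
To make the matching step more concrete, the following is a minimal sketch of how a text-visual contrast and a support-query contrast could be combined into one score per class. It is not the EMP-Net implementation: the function names, the mean-of-shots prototype, the cosine similarity, the mixing weight `alpha`, and the toy 3-dimensional features are all assumptions chosen for illustration, and the abstract does not state how the two contrasts are actually integrated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(query, support, text_embeddings, alpha=0.5):
    """Score one query feature against every class and return the best one.

    support:         {class_name: [feature, ...]}, the few labelled shots
    text_embeddings: {class_name: feature}, one text feature per class name
    alpha:           hypothetical weight mixing the two contrasts (assumption)
    """
    scores = {}
    for name, shots in support.items():
        prototype = mean_vector(shots)  # assumed: class prototype = mean of shots
        support_query = cosine(query, prototype)
        text_visual = cosine(query, text_embeddings[name])
        scores[name] = alpha * text_visual + (1 - alpha) * support_query
    return max(scores, key=scores.get), scores


# Toy features standing in for cached, pooled video and text features.
support = {
    "jump": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "run": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.1]],
}
text_embeddings = {"jump": [1.0, 0.0, 0.0], "run": [0.0, 1.0, 0.0]}
query = [0.8, 0.3, 0.0]

label, scores = classify(query, support, text_embeddings)
print(label, scores)
```

Because the features are treated as fixed inputs, nothing in this sketch requires gradients through the feature extractor, which is the general idea behind working on cached features; the authors' repository linked above is the reference for the real method.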

Cite

Text

Wu et al. "Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72646-0_3

Markdown

[Wu et al. "Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wu2024eccv-efficient/) doi:10.1007/978-3-031-72646-0_3

BibTeX

@inproceedings{wu2024eccv-efficient,
  title     = {{Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning}},
  author    = {Wu, Cong and Wu, Xiao-Jun and Li, Linze and Xu, Tianyang and Feng, Zhenhua and Kittler, Josef},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72646-0_3},
  url       = {https://mlanthology.org/eccv/2024/wu2024eccv-efficient/}
}