CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner

Yan, Tingbing; Zeng, Wenzheng; Xiao, Yang; Tong, Xingyu; Tan, Bo; Fang, Zhiwen; Cao, Zhiguo; Zhou, Joey Tianyi

doi:10.1007/978-3-031-72661-3_7

ECCV 2024

Abstract

Most existing one-shot skeleton-based action recognition methods focus on raw low-level information (e.g., joint location), and may suffer from local information loss and low generalization ability. To alleviate these issues, we propose to leverage text descriptions generated by large language models (LLMs), which contain high-level human knowledge, to guide feature learning in a global-local-global way. In particular, during training, we design two prompts to obtain global and local text descriptions of each action from an LLM. We first utilize the global text description to guide the skeleton encoder to focus on informative joints (i.e., global-to-local). Then we build non-local interaction between local text and joint features to form the final global representation (i.e., local-to-global). To mitigate the asymmetry between the training and inference phases, we further design a dual-branch architecture that allows the model to perform novel-class inference without any text input, while keeping the additional inference cost negligible compared with the base skeleton encoder. Extensive experiments on three different benchmarks show that CrossGLG consistently outperforms existing SOTA methods by large margins, and its inference cost (model size) is only 2.8% of that of the previous SOTA. Code is available at CrossGLG.

† Yang Xiao and Wenzheng Zeng are corresponding authors.
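The global-to-local and local-to-global steps described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the function names, the softmax joint weighting, the residual cross-attention, and the mean pooling are all assumptions chosen for illustration, and plain Python lists stand in for learned skeleton and text embeddings. The dual-branch design used for text-free inference is not modeled here.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_to_local(joint_feats, global_text):
    # Hypothetical step: weight each joint by its scaled similarity to
    # the global text embedding, so text-relevant joints are emphasized.
    scale = math.sqrt(len(global_text))
    weights = softmax([dot(j, global_text) / scale for j in joint_feats])
    reweighted = [[w * x for x in j] for w, j in zip(weights, joint_feats)]
    return reweighted, weights


def local_to_global(joint_feats, local_texts):
    # Hypothetical step: each joint attends over all local text embeddings
    # (a non-local interaction), the attended context is added back to the
    # joint feature, and the joints are mean-pooled into one global vector.
    dim = len(local_texts[0])
    scale = math.sqrt(dim)
    fused = []
    for j in joint_feats:
        attn = softmax([dot(j, t) / scale for t in local_texts])
        context = [sum(a * t[d] for a, t in zip(attn, local_texts))
                   for d in range(dim)]
        fused.append([x + c for x, c in zip(j, context)])
    return [sum(f[d] for f in fused) / len(fused) for d in range(dim)]


# Toy data (assumed): 3 joints with 2-D features, 1 global and 2 local texts.
joints = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
global_text = [2.0, 0.0]
local_texts = [[1.0, 0.0], [0.0, 1.0]]

reweighted, weights = global_to_local(joints, global_text)
representation = local_to_global(reweighted, local_texts)
```

In this toy setup the first joint is the one most aligned with the global text embedding, so it receives the largest weight before the local text interaction is applied.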

Cite

Text

Yan et al. "CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72661-3_7

Markdown

[Yan et al. "CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/yan2024eccv-crossglg/) doi:10.1007/978-3-031-72661-3_7

BibTeX

@inproceedings{yan2024eccv-crossglg,
  title     = {{CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner}},
  author    = {Yan, Tingbing and Zeng, Wenzheng and Xiao, Yang and Tong, Xingyu and Tan, Bo and Fang, Zhiwen and Cao, Zhiguo and Zhou, Joey Tianyi},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72661-3_7},
  url       = {https://mlanthology.org/eccv/2024/yan2024eccv-crossglg/}
}