Action Quality Assessment with Temporal Parsing Transformer

Abstract

Action Quality Assessment (AQA) is important for action understanding, and the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits their ability to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer that decomposes the holistic feature into temporal part-level representations. Specifically, we use a set of learnable queries to represent the atomic temporal patterns of a specific action. Our decoding process converts the frame representations into a fixed number of temporally ordered part representations. To obtain the quality score, we adopt state-of-the-art contrastive regression on the part representations. Since existing AQA datasets do not provide temporal part-level labels or partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss that ensures the learnable queries satisfy the temporal order in cross attention, and a sparsity loss that encourages the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
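
The two attention-based losses can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: single-head dot-product cross attention with a softmax over frames, an "attention center" defined as the expected normalized frame position of each query, a hinge-style ranking loss on consecutive centers, and a sparsity loss that penalizes attention mass far from each center. The function names and the `margin` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, frames):
    """Single-head cross attention (assumed form).

    queries: (K, D) learnable part queries
    frames:  (T, D) frame representations
    returns: attn (K, T) with rows summing to 1, parts (K, D)
    """
    d = queries.shape[-1]
    logits = queries @ frames.T / np.sqrt(d)
    attn = softmax(logits, axis=-1)   # each query attends over all frames
    parts = attn @ frames             # one part representation per query
    return attn, parts

def frame_positions(num_frames):
    # Frame indices normalized to [0, 1].
    return np.arange(num_frames) / max(num_frames - 1, 1)

def attention_centers(attn):
    # Expected temporal position of each query under its attention weights.
    return attn @ frame_positions(attn.shape[-1])

def ranking_loss(attn, margin=0.0):
    # Hinge penalty whenever query k's center is not before query k+1's.
    centers = attention_centers(attn)
    gaps = centers[:-1] - centers[1:] + margin
    return float(np.maximum(gaps, 0.0).sum())

def sparsity_loss(attn):
    # Attention-weighted distance of each frame from the query's center;
    # small when each query focuses on a compact temporal window.
    positions = frame_positions(attn.shape[-1])
    centers = attention_centers(attn)
    spread = np.abs(positions[None, :] - centers[:, None])
    return float((attn * spread).sum())

# Usage with random stand-in features (K=4 queries, T=32 frames, D=16).
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))
frames = rng.normal(size=(32, 16))
attn, parts = cross_attention(queries, frames)
losses = {"rank": ranking_loss(attn, margin=0.1), "sparsity": sparsity_loss(attn)}
```

Under these assumptions, attention that is already temporally ordered and sharply peaked gives zero for both losses, while reversed or diffuse attention is penalized. In a trained model these terms would be added to the regression objective and minimized with an autodiff framework rather than evaluated in NumPy.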

Cite

Text

Bai et al. "Action Quality Assessment with Temporal Parsing Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19772-7_25

Markdown

[Bai et al. "Action Quality Assessment with Temporal Parsing Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/bai2022eccv-action/) doi:10.1007/978-3-031-19772-7_25

BibTeX

@inproceedings{bai2022eccv-action,
  title     = {{Action Quality Assessment with Temporal Parsing Transformer}},
  author    = {Bai, Yang and Zhou, Desen and Zhang, Songyang and Wang, Jian and Ding, Errui and Guan, Yu and Long, Yang and Wang, Jingdong},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19772-7_25},
  url       = {https://mlanthology.org/eccv/2022/bai2022eccv-action/}
}