PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

Abstract

We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to recent transformer-based approaches that use a hierarchical structure. We argue that combining the self-attention mechanism with multiple sub-sampling processes in hierarchical approaches increases the loss of positional information. We evaluate our approach on two challenging dense multi-label benchmark datasets and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, achieving new state-of-the-art mAP of 26.5% and 44.6%. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
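
To illustrate the first idea in the abstract, the following is a minimal NumPy sketch of single-head self-attention with a relative positional term added to the attention logits. It is not the authors' implementation: the abstract does not specify the form of the encoding, so the sketch assumes a simple learned scalar bias per clipped temporal offset, and all names (`relative_position_index`, `self_attention_with_relative_bias`, `rel_bias`, `max_dist`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relative_position_index(T, max_dist):
    # index[i, j] encodes the temporal offset (j - i), clipped to
    # [-max_dist, max_dist] and shifted into the range [0, 2 * max_dist].
    pos = np.arange(T)
    offsets = pos[None, :] - pos[:, None]
    return np.clip(offsets, -max_dist, max_dist) + max_dist


def self_attention_with_relative_bias(x, w_q, w_k, w_v, rel_bias, max_dist):
    # x: (T, d) sequence of per-timestep features.
    # rel_bias: (2 * max_dist + 1,) one scalar per clipped offset
    # (an assumed, simplified form of relative positional encoding).
    T = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    logits = (q @ k.T) / np.sqrt(q.shape[-1])
    # Add the position-dependent term before the softmax.
    logits = logits + rel_bias[relative_position_index(T, max_dist)]
    attn = softmax(logits, axis=-1)
    return attn @ v, attn


# Toy usage with random weights (illustrative values only).
rng = np.random.default_rng(0)
T, d, max_dist = 6, 8, 2
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
rel_bias = rng.normal(size=2 * max_dist + 1)
out, attn = self_attention_with_relative_bias(x, w_q, w_k, w_v, rel_bias, max_dist)
```

With `rel_bias` set to zeros, reordering the input timesteps simply reorders the output rows, which is one way to see why plain self-attention carries no temporal order on its own. A non-zero bias makes the output depend on the offsets between timesteps.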

Cite

Text

Sardari et al. "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00321

Markdown

[Sardari et al. "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/sardari2023iccvw-pat/) doi:10.1109/ICCVW60793.2023.00321

BibTeX

@inproceedings{sardari2023iccvw-pat,
  title     = {{PAT: Position-Aware Transformer for Dense Multi-Label Action Detection}},
  author    = {Sardari, Faegheh and Mustafa, Armin and Jackson, Philip J. B. and Hilton, Adrian},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2023},
  pages     = {2980--2989},
  doi       = {10.1109/ICCVW60793.2023.00321},
  url       = {https://mlanthology.org/iccvw/2023/sardari2023iccvw-pat/}
}