Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos

Abstract

Ultra-high-resolution aerial videos are becoming increasingly popular for enhancing surveillance capabilities in sparsely populated areas. To realize their surveillance potential, it is desirable to analyze human activities in these videos automatically, answering questions such as "who is doing what?". Atomic visual action detection has successfully recognized such activities in movie data, but adapting it to ultra-high-resolution aerial videos is challenging because the target persons appear relatively tiny from overhead views and are sparsely located. Moreover, existing atomic visual action detection methods are based on single-label actions, whereas people can perform multiple actions simultaneously, so a multi-label approach is more appropriate. To address these problems, we propose a multi-label action detection and recognition framework using a hybrid attention vision transformer (HAT) to recognize recurrent actions more efficiently. In addition, a multi-scale, multi-granularity module inside the action recognition transformer block extracts relevant features without redundancy. Using the Okutama Dataset, we demonstrate that our method outperforms existing state-of-the-art methods for interpreting human activity in aerial videos.

Cite

Text

Khan et al. "Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00474

Markdown

[Khan et al. "Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/khan2024cvprw-dronehat/) doi:10.1109/CVPRW63382.2024.00474

BibTeX

@inproceedings{khan2024cvprw-dronehat,
  title     = {{Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos}},
  author    = {Khan, Mustaqeem and Ahmad, Jamil and El Saddik, Abdulmotaleb and Gueaieb, Wail and De Masi, Giulia and Karray, Fakhri},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {4713--4722},
  doi       = {10.1109/CVPRW63382.2024.00474},
  url       = {https://mlanthology.org/cvprw/2024/khan2024cvprw-dronehat/}
}