TriDet: Temporal Action Detection with Relative Boundary Modeling

Abstract

In this paper, we present a one-stage framework, TriDet, for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose a Scalable-Granularity Perception (SGP) layer to aggregate information across different temporal granularities, which is much more efficient than recent transformer-based feature pyramids. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks: THUMOS14, HACS, and EPIC-KITCHENS 100, with lower computational cost than previous methods. For example, TriDet achieves an average mAP of 69.3% on THUMOS14, outperforming the previous best by 2.5% while requiring only 74.6% of its latency.
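The core idea of relative boundary modeling is to predict a boundary not as a single regressed value but as the expectation over a discretized probability distribution of candidate offsets around the boundary. The sketch below illustrates this expectation step only; the function name, bin layout, and example logits are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def expected_boundary_offset(logits):
    """Hypothetical sketch: turn per-bin boundary logits into a single
    offset by taking the expectation over a softmax distribution,
    in the spirit of TriDet's relative boundary modeling."""
    # Softmax over candidate offset bins (stabilized by subtracting the max)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Candidate relative offsets, here simply 0, 1, ..., B-1 bins
    bins = np.arange(len(logits), dtype=float)
    # Expected offset: a soft, sub-bin-accurate boundary estimate
    return float((probs * bins).sum())

# Logits peaked at bin 3 give an expected offset close to (but not exactly) 3,
# since neighboring bins also contribute probability mass.
offset = expected_boundary_offset(np.array([0.1, 0.2, 2.5, 4.0, 1.0]))
```

Compared with direct regression, the expectation over a distribution lets the model express boundary ambiguity explicitly and tends to yield more stable, sub-bin-accurate boundary estimates.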

Cite

Text

Shi et al. "TriDet: Temporal Action Detection with Relative Boundary Modeling." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01808

Markdown

[Shi et al. "TriDet: Temporal Action Detection with Relative Boundary Modeling." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/shi2023cvpr-tridet/) doi:10.1109/CVPR52729.2023.01808

BibTeX

@inproceedings{shi2023cvpr-tridet,
  title     = {{TriDet: Temporal Action Detection with Relative Boundary Modeling}},
  author    = {Shi, Dingfeng and Zhong, Yujie and Cao, Qiong and Ma, Lin and Li, Jia and Tao, Dacheng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {18857--18866},
  doi       = {10.1109/CVPR52729.2023.01808},
  url       = {https://mlanthology.org/cvpr/2023/shi2023cvpr-tridet/}
}