Prediction-Feedback DETR for Temporal Action Detection

Abstract

Temporal Action Detection (TAD) is fundamental yet challenging for real-world video applications. Leveraging the strengths of transformers, various DETR-based approaches have been adopted for TAD. However, recent work has identified that attention collapse in self-attention degrades the performance of DETR for TAD. Building on that research, this paper addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from the predictions, indicating a shortcut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which uses predictions to recover from the collapse and to align cross- and self-attention with the predictions. Specifically, we devise novel prediction-feedback objectives guided by the relations among the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks, including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.
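
The abstract does not spell out the form of the prediction-feedback objectives, so the following is only an illustrative sketch of the general idea of aligning an attention map with relations among predictions. It assumes, hypothetically, that the relation is pairwise temporal IoU between predicted segments and that alignment is measured with a KL divergence; the function names (`temporal_iou`, `relation_targets`, `kl_alignment_loss`) and the toy segments are invented for illustration and are not taken from the paper.

```python
import math

def temporal_iou(a, b):
    """IoU of two 1-D segments, each given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def relation_targets(segments, eps=1e-8):
    """Row-normalized pairwise IoU among predictions (assumed relation)."""
    rows = []
    for a in segments:
        row = [temporal_iou(a, b) + eps for b in segments]
        total = sum(row)
        rows.append([v / total for v in row])
    return rows

def kl_alignment_loss(targets, attention, eps=1e-8):
    """Mean KL(target || attention) over query rows (assumed objective)."""
    loss = 0.0
    for t_row, a_row in zip(targets, attention):
        loss += sum(t * math.log((t + eps) / (a + eps))
                    for t, a in zip(t_row, a_row))
    return loss / len(targets)

# Three hypothetical predicted segments (start, end), in seconds.
segments = [(0.0, 4.0), (2.0, 6.0), (10.0, 12.0)]
targets = relation_targets(segments)

# A "collapsed" attention map: every query attends almost only to key 0.
collapsed = [[0.98, 0.01, 0.01] for _ in segments]

loss_collapsed = kl_alignment_loss(targets, collapsed)
loss_aligned = kl_alignment_loss(targets, targets)
```

In this toy setting the collapsed map incurs a clearly positive loss, while an attention map that already matches the prediction relations incurs none, which is the kind of signal a feedback objective of this sort could provide.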

Cite

Text

Kim et al. "Prediction-Feedback DETR for Temporal Action Detection." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32448

Markdown

[Kim et al. "Prediction-Feedback DETR for Temporal Action Detection." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/kim2025aaai-prediction/) doi:10.1609/AAAI.V39I4.32448

BibTeX

@inproceedings{kim2025aaai-prediction,
  title     = {{Prediction-Feedback DETR for Temporal Action Detection}},
  author    = {Kim, Jihwan and Lee, Miso and Cho, Cheol-Ho and Lee, Jihyun and Heo, Jae-Pil},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {4266-4274},
  doi       = {10.1609/AAAI.V39I4.32448},
  url       = {https://mlanthology.org/aaai/2025/kim2025aaai-prediction/}
}