Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation

Abstract

Although human action anticipation is an inherently multi-modal task, state-of-the-art methods on well-known action anticipation datasets leverage this data by applying ensemble methods and averaging the scores of uni-modal anticipation networks. In this work we introduce transformer-based modality fusion techniques, which unify multi-modal data at an early stage. Our Anticipative Feature Fusion Transformer (AFFT) proves superior to popular score-fusion approaches and achieves state-of-the-art results, outperforming previous methods on EpicKitchens-100 and EGTEA Gaze+. Our model is easily extensible and allows new modalities to be added without architectural changes. Consequently, we extract audio features for EpicKitchens-100, which we add to the set of features commonly used in the community.
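
The following is a minimal sketch (not the authors' implementation) of the idea described in the abstract: per-modality features are projected to a shared dimension, tagged with a learned modality embedding, and fused by a transformer encoder before a single anticipation head, in contrast to averaging the scores of separate uni-modal networks. All class and parameter names, dimensions, and the exact fusion layout are illustrative assumptions.

import torch
import torch.nn as nn


class FeatureFusionSketch(nn.Module):
    def __init__(self, modality_dims, d_model=512, num_classes=10, nhead=8, num_layers=2):
        super().__init__()
        # One linear projection per modality (e.g. RGB, optical flow, objects, audio).
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in modality_dims])
        # Learned embedding marking which modality each token came from.
        self.modality_emb = nn.Parameter(torch.zeros(len(modality_dims), d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, feats):
        # feats: list of per-modality tensors, each of shape (batch, dim_m)
        tokens = torch.stack(
            [p(f) + self.modality_emb[m] for m, (p, f) in enumerate(zip(self.proj, feats))],
            dim=1)                           # (batch, num_modalities, d_model)
        fused = self.fusion(tokens)          # modalities attend to each other (early fusion)
        return self.head(fused.mean(dim=1))  # one prediction from the fused representation


# Usage with three hypothetical modalities of different feature sizes;
# adding a new modality only means appending another dimension to the list.
model = FeatureFusionSketch(modality_dims=[2048, 1024, 512])
scores = model([torch.randn(4, 2048), torch.randn(4, 1024), torch.randn(4, 512)])
print(scores.shape)  # torch.Size([4, 10])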

Cite

Text

Zhong et al. "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Zhong et al. "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/zhong2023wacv-anticipative/)

BibTeX

@inproceedings{zhong2023wacv-anticipative,
  title     = {{Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation}},
  author    = {Zhong, Zeyun and Schneider, David and Voit, Michael and Stiefelhagen, Rainer and Beyerer, Jürgen},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {6068--6077},
  url       = {https://mlanthology.org/wacv/2023/zhong2023wacv-anticipative/}
}