Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

Abstract

In this paper we revisit feature fusion, a long-standing topic, in the new context of text-to-video retrieval. Unlike previous research that considers feature fusion at only one end, be it video or text, we aim to fuse features at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF's attention weights can further be exploited for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
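The core idea, attentional fusion as a learned convex combination rather than multi-head self-attention, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and parameter names are hypothetical, and the per-feature projection layers that LAFF uses to map heterogeneous features to a common dimension are assumed to have been applied already.

```python
import numpy as np

def laff_fuse(features, w, b=0.0):
    """Sketch of lightweight attentional fusion.

    features: (k, d) array of k feature vectors, assumed already
              projected to a common dimension d.
    w, b:     parameters of a single linear scorer that maps each
              feature vector to a scalar attention logit.
    Returns the fused (d,) vector and the (k,) attention weights.
    """
    logits = features @ w + b              # one scalar logit per feature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()               # softmax: non-negative, sums to 1
    fused = weights @ features             # convex combination of the inputs
    return fused, weights

# Toy usage: fuse 4 eight-dimensional feature vectors.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
w = rng.standard_normal(8)
fused, att = laff_fuse(feats, w)
```

Because the softmax weights are non-negative and sum to one, inspecting `att` directly reveals how much each input feature contributes, which is what makes the fusion interpretable and usable for feature selection.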

Cite

Text

Hu et al. "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19781-9_26

Markdown

[Hu et al. "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/hu2022eccv-lightweight/) doi:10.1007/978-3-031-19781-9_26

BibTeX

@inproceedings{hu2022eccv-lightweight,
  title     = {{Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval}},
  author    = {Hu, Fan and Chen, Aozhu and Wang, Ziyue and Zhou, Fangming and Dong, Jianfeng and Li, Xirong},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19781-9_26},
  url       = {https://mlanthology.org/eccv/2022/hu2022eccv-lightweight/}
}