Learning Trajectory-Word Alignments for Video-Language Tasks

Abstract

In a video, an object typically appears as a trajectory, i.e., it spans only a few spatial patches but many temporal ones, and thus carries abundant spatiotemporal context. However, modern video-language BERTs (VDL-BERTs) neglect this trajectory characteristic: they usually follow image-language BERTs (IL-BERTs) and deploy patch-to-word (P2W) attention, which may over-exploit trivial spatial contexts while neglecting significant temporal ones. To remedy this, we propose TW-BERT, which learns trajectory-word alignments through a newly designed trajectory-to-word (T2W) attention for solving video-language tasks. Moreover, previous VDL-BERTs usually feed the model only a few uniformly sampled frames, while trajectories vary in temporal granularity, i.e., some span many frames and others only a few, so sampling a few frames loses useful temporal context. However, simply sampling more frames makes pre-training infeasible due to the greatly increased training cost. To alleviate this problem, during the fine-tuning stage we insert a novel Hierarchical Frame-Selector (HFS) module into the video encoder. HFS gradually selects suitable frames conditioned on the text so that the subsequent cross-modal encoder can learn better trajectory-word alignments. With the proposed T2W attention and HFS, TW-BERT achieves state-of-the-art performance on text-to-video retrieval tasks and performs comparably on video question-answering tasks to some VDL-BERTs trained on much more data. The code will be available in the supplementary material.
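The sketch below illustrates the general idea of trajectory-to-word cross-attention described in the abstract; it is not the authors' implementation. The `T2WAttention` class, the mean-pooling used to form trajectory tokens, and the tensor shapes are all assumptions for illustration: here a trajectory token is crudely approximated by pooling one spatial patch location over time, whereas the paper's actual trajectory construction and HFS frame selection are more involved.

```python
# Minimal sketch of trajectory-to-word (T2W) cross-attention.
# Assumption: trajectory tokens are approximated by temporally mean-pooling
# patch features at each spatial location; this is NOT the paper's method,
# only an illustration of aligning trajectory tokens with word tokens.
import torch
import torch.nn as nn


class T2WAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, T, N, D) patch features from T frames with N patches each
        # word_feats:  (B, L, D)    word embeddings from the text encoder
        traj = patch_feats.mean(dim=1)  # (B, N, D) crude trajectory tokens
        # Trajectory tokens query the words, yielding trajectory-word alignments.
        out, _ = self.attn(query=traj, key=word_feats, value=word_feats)
        return out  # (B, N, D) trajectory tokens enriched with text context


if __name__ == "__main__":
    patches = torch.randn(2, 8, 49, 768)  # 2 clips, 8 frames, 7x7 patches
    words = torch.randn(2, 12, 768)       # 2 captions, 12 tokens
    aligned = T2WAttention()(patches, words)
    print(aligned.shape)                  # torch.Size([2, 49, 768])
```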

Cite

Text

Yang et al. "Learning Trajectory-Word Alignments for Video-Language Tasks." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00237

Markdown

[Yang et al. "Learning Trajectory-Word Alignments for Video-Language Tasks." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/yang2023iccv-learning/) doi:10.1109/ICCV51070.2023.00237

BibTeX

@inproceedings{yang2023iccv-learning,
  title     = {{Learning Trajectory-Word Alignments for Video-Language Tasks}},
  author    = {Yang, Xu and Li, Zhangzikang and Xu, Haiyang and Zhang, Hanwang and Ye, Qinghao and Li, Chenliang and Yan, Ming and Zhang, Yu and Huang, Fei and Huang, Songfang},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {2504-2514},
  doi       = {10.1109/ICCV51070.2023.00237},
  url       = {https://mlanthology.org/iccv/2023/yang2023iccv-learning/}
}