Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

Abstract

Text-based video segmentation aims to segment the target object in a video according to a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features for accurate segmentation. Specifically, we propose a multi-modal video transformer that fuses and aggregates multi-modal and temporal features across frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features at each feature level under the guidance of linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences demonstrate the performance and generalization ability of our method compared with state-of-the-art methods.
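
The abstract does not spell out how the language-guided feature fusion works, so the following is only a minimal illustrative sketch in PyTorch of one plausible form: a pooled sentence embedding produces per-channel gates that weight the mix of appearance and motion features at a single feature level. The module name, dimensions, and gating design are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class LanguageGuidedFusion(nn.Module):
    """Hypothetical sketch of language-guided fusion of appearance and
    motion features at one feature level (assumed design, not the paper's)."""

    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        # Project the sentence embedding to per-channel gates, one set per modality.
        self.gate = nn.Sequential(
            nn.Linear(lang_dim, 2 * vis_dim),
            nn.Sigmoid(),
        )
        self.out = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)

    def forward(self, appearance, motion, sentence):
        # appearance, motion: (B, C, H, W); sentence: (B, lang_dim)
        g = self.gate(sentence)                      # (B, 2C)
        g_app, g_mot = g.chunk(2, dim=1)             # (B, C) each
        g_app = g_app.unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        g_mot = g_mot.unsqueeze(-1).unsqueeze(-1)
        fused = g_app * appearance + g_mot * motion  # language-weighted mix
        return self.out(fused)

if __name__ == "__main__":
    fusion = LanguageGuidedFusion(vis_dim=256, lang_dim=768)
    app = torch.randn(2, 256, 28, 28)    # appearance features (e.g. RGB backbone)
    mot = torch.randn(2, 256, 28, 28)    # motion features (e.g. optical-flow backbone)
    sent = torch.randn(2, 768)           # pooled sentence embedding
    print(fusion(app, mot, sent).shape)  # torch.Size([2, 256, 28, 28])

In the paper this fusion is applied progressively at multiple feature levels and combined with a multi-modal video transformer and an alignment loss; the sketch above covers only the single-level gating idea.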

Cite

Text

Zhao et al. "Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01144

Markdown

[Zhao et al. "Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/zhao2022cvpr-modeling/) doi:10.1109/CVPR52688.2022.01144

BibTeX

@inproceedings{zhao2022cvpr-modeling,
  title     = {{Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation}},
  author    = {Zhao, Wangbo and Wang, Kai and Chu, Xiangxiang and Xue, Fuzhao and Wang, Xinchao and You, Yang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {11737--11746},
  doi       = {10.1109/CVPR52688.2022.01144},
  url       = {https://mlanthology.org/cvpr/2022/zhao2022cvpr-modeling/}
}