Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation

Abstract

Referring video object segmentation (RVOS) is a challenging language-guided video grounding task, which requires comprehensively understanding the semantic information of both video content and language queries for object prediction. However, existing methods adopt multi-modal fusion at a frame-based spatial granularity. This limited visual representation is prone to causing vision-language mismatches and producing poor segmentation results. To address this, we propose a novel multi-level representation learning approach, which explores the inherent structure of the video content to provide a set of discriminative visual embeddings, enabling more effective vision-language semantic alignment. Specifically, we embed different visual cues in terms of visual granularity, including multi-frame long-temporal information at the video level, intra-frame spatial semantics at the frame level, and an enhanced object-aware feature prior at the object level. With the powerful multi-level visual embeddings and a carefully designed dynamic alignment, our model can generate a robust representation for accurate video object segmentation. Extensive experiments on Refer-DAVIS17 and Refer-YouTube-VOS demonstrate that our model achieves superior performance in both segmentation accuracy and inference speed.

Cite

Text

Wu et al. "Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00494

Markdown

[Wu et al. "Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/wu2022cvpr-multilevel/) doi:10.1109/CVPR52688.2022.00494

BibTeX

@inproceedings{wu2022cvpr-multilevel,
  title     = {{Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation}},
  author    = {Wu, Dongming and Dong, Xingping and Shao, Ling and Shen, Jianbing},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {4996--5005},
  doi       = {10.1109/CVPR52688.2022.00494},
  url       = {https://mlanthology.org/cvpr/2022/wu2022cvpr-multilevel/}
}