SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Abstract

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content makes it difficult to exploit inter-frame relationships and to understand textual descriptions of how objects change over time. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. In addition, the emphasis on temporal coherence improves the segmentation stability of our method and its adaptability to text expressions that describe temporal variations. Code is available at https://github.com/RobertLuo1/NeurIPS2023_SOC.

Cite

Text

Luo et al. "SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation." Neural Information Processing Systems, 2023.

Markdown

[Luo et al. "SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/luo2023neurips-soc/)

BibTeX

@inproceedings{luo2023neurips-soc,
  title     = {{SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation}},
  author    = {Luo, Zhuoyan and Xiao, Yicheng and Liu, Yong and Li, Shuyan and Wang, Yitong and Tang, Yansong and Li, Xiu and Yang, Yujiu},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/luo2023neurips-soc/}
}