Robust Referring Video Object Segmentation with Cyclic Structural Consensus
Abstract
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression. Most existing R-VOS methods have a critical assumption: the object referred to must appear in the video. This assumption, which we refer to as "semantic consensus", is often violated in real-world scenarios, where the expression may be queried against false videos. In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches. Accordingly, we propose an extended task called Robust R-VOS (RRVOS), which accepts unpaired video-text inputs. We tackle this problem by jointly modeling the primary R-VOS problem and its dual (text reconstruction). A structural text-to-text cycle constraint is introduced to discriminate semantic consensus between video-text pairs and impose it in positive pairs, thereby achieving multi-modal alignment from both positive and negative pairs. Our structural constraint effectively addresses the challenge posed by linguistic diversity, overcoming the limitations of previous methods that relied on the point-wise constraint. A new evaluation dataset, RRYTVOS is constructed to measure the model robustness. Our model achieves state-of-the-art performance on R-VOS benchmarks, Ref-DAVIS17 and Ref-Youtube-VOS, and also our RRYTVOS dataset.
Cite
Text
Li et al. "Robust Referring Video Object Segmentation with Cyclic Structural Consensus." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.02032Markdown
[Li et al. "Robust Referring Video Object Segmentation with Cyclic Structural Consensus." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/li2023iccv-robust/) doi:10.1109/ICCV51070.2023.02032BibTeX
@inproceedings{li2023iccv-robust,
title = {{Robust Referring Video Object Segmentation with Cyclic Structural Consensus}},
author = {Li, Xiang and Wang, Jinglu and Xu, Xiaohao and Li, Xiao and Raj, Bhiksha and Lu, Yan},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {22236-22245},
doi = {10.1109/ICCV51070.2023.02032},
url = {https://mlanthology.org/iccv/2023/li2023iccv-robust/}
}