CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos

Kao, Shiu-hong; Tai, Yu-Wing; Tang, Chi-Keung

ICLR 2026

Abstract

Reasoning Video Object Segmentation is a challenging task that aims to generate a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLMs) for the task, they still fail on video inputs when given complex, temporally sensitive queries, indicating a lack of temporal and spatial integration in complex scenarios. In this paper, we propose **CoT-RVS**, a novel framework that employs the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges through **temporal-semantic reasoning**: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses, for each object, a corresponding keyframe among all frames in which the object can be observed effortlessly (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, and it can be applied to Reasoning Video Instance Segmentation. Because the framework is training-free, it also extends to online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
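
The keyframe selection step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Candidate` record, the numeric `visibility` score, and the `margin` threshold are all hypothetical stand-ins for judgments that the paper obtains from an MLLM's chain-of-thought output. The sketch only shows the shape of the idea, namely picking one keyframe per candidate object and, in the online setting, switching targets when a clearly better one appears.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One object judged to possibly match the query in one frame (hypothetical record)."""
    frame_index: int
    object_name: str
    visibility: float  # hypothetical score in [0, 1]; higher means easier to observe


def select_keyframes(candidates):
    """Return {object_name: frame_index}, choosing the most visible frame per object.

    On ties, the first candidate encountered is kept.
    """
    best = {}
    for cand in candidates:
        current = best.get(cand.object_name)
        if current is None or cand.visibility > current.visibility:
            best[cand.object_name] = cand
    return {name: cand.frame_index for name, cand in best.items()}


def update_online_target(current, incoming, margin=0.1):
    """Online variant: switch targets only when the incoming candidate is clearly better.

    `margin` is an assumed hysteresis threshold, not a parameter from the paper.
    """
    if current is None or incoming.visibility > current.visibility + margin:
        return incoming
    return current


if __name__ == "__main__":
    observations = [
        Candidate(frame_index=0, object_name="dog", visibility=0.4),
        Candidate(frame_index=3, object_name="dog", visibility=0.9),
        Candidate(frame_index=1, object_name="cat", visibility=0.7),
    ]
    print(select_keyframes(observations))
```

In a real pipeline the selected keyframes would presumably be handed to a segmentation and tracking model to produce the mask sequence; that stage is omitted here because the abstract does not specify it.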

Cite

Text

Kao et al. "CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos." International Conference on Learning Representations, 2026.

Markdown

[Kao et al. "CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kao2026iclr-cotrvs/)

BibTeX

@inproceedings{kao2026iclr-cotrvs,
  title     = {{CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos}},
  author    = {Kao, Shiu-hong and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/kao2026iclr-cotrvs/}
}