CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos
Abstract
Reasoning Video Object Segmentation is a challenging task that aims to generate a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLMs) for the task, they still fail on video inputs when given complex, temporally sensitive queries, indicating a lack of temporal and spatial integration in complex scenarios. In this paper, we propose **CoT-RVS**, a novel framework that employs the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges through **temporal-semantic reasoning**: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses, among all frames, a corresponding keyframe for each object in which that object can be observed effortlessly (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, and it can be applied to Reasoning Video Instance Segmentation. Because it is training-free, the framework further extends to online video streams, where the CoT is used at test time to update the object of interest when a better target emerges and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
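To make the temporal-semantic idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes two hypothetical callables standing in for MLLM prompts: `propose_objects` (semantic step: which objects in a frame might match the query) and `visibility` (temporal step: how easily an object can be observed in a frame). Gathering candidates from every frame and picking the keyframe by a simple argmax are also assumptions made for illustration; the toy frames are made-up dictionaries, not real video data.

```python
from typing import Callable, Dict, List, Mapping, Sequence

Frame = Mapping[str, float]  # toy stand-in for an image: object name -> visible fraction


def select_keyframes(
    frames: Sequence[Frame],
    query: str,
    propose_objects: Callable[[Frame, str], List[str]],
    visibility: Callable[[Frame, str], float],
) -> Dict[str, int]:
    """Map each candidate object to the index of the frame where it is most visible."""
    # Semantic step (assumed): gather objects that may match the query,
    # keeping the order in which they are first seen.
    candidates: List[str] = []
    for frame in frames:
        for obj in propose_objects(frame, query):
            if obj not in candidates:
                candidates.append(obj)

    # Temporal step (assumed): for each object, pick the frame with the
    # highest visibility score as its keyframe.
    keyframes: Dict[str, int] = {}
    for obj in candidates:
        scores = [visibility(frame, obj) for frame in frames]
        keyframes[obj] = max(range(len(frames)), key=scores.__getitem__)
    return keyframes


# Hypothetical toy data: three frames with hand-written visibility values.
toy_frames: List[Frame] = [
    {"dog": 0.2},
    {"dog": 0.9, "cat": 0.4},
    {"cat": 0.8},
]


def toy_propose(frame: Frame, query: str) -> List[str]:
    # Stand-in for an MLLM call: keep objects whose name appears in the query.
    return [name for name in frame if name in query]


def toy_visibility(frame: Frame, obj: str) -> float:
    # Stand-in for an MLLM call: read the hand-written visibility value.
    return frame.get(obj, 0.0)


result = select_keyframes(toy_frames, "the dog chasing the cat", toy_propose, toy_visibility)
print(result)  # {'dog': 1, 'cat': 2}
```

In a real pipeline the selected keyframes would presumably be handed to a segmentation model to produce the mask sequence; that stage is outside this sketch.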
Cite
Text
Kao et al. "CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos." International Conference on Learning Representations, 2026.
Markdown
[Kao et al. "CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kao2026iclr-cotrvs/)
BibTeX
@inproceedings{kao2026iclr-cotrvs,
title = {{CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos}},
author = {Kao, Shiu-hong and Tai, Yu-Wing and Tang, Chi-Keung},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/kao2026iclr-cotrvs/}
}