Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Yuan, Zhenlong; Qu, Xiangyan; Qian, Chengxuan; Chen, Rui; Tang, Jing; Sun, Lei; Chu, Xiangxiang; Zhang, Dapeng; Wang, Yiwei; Cai, Yujun; Li, Shuo

ICLR 2026

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that combines contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching and dynamically invokes domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance: our method outperforms existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, while maintaining computational efficiency.
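
The abstract names three reward components (tool-usage efficiency, sub-motion relevance, and structural coherence) but does not give their form. The following is only an illustrative sketch of how such a layered reward could be composed; the `Rollout` fields, the gating order, the weights, and the `max_tool_calls` budget are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Rollout:
    """One model response for a video (hypothetical structure, not from the paper)."""
    predicted_label: str
    tool_calls: int                      # number of external tool invocations
    sub_motions: list[str] = field(default_factory=list)
    well_formed: bool = True             # reasoning followed the expected output structure


def hierarchical_reward(
    rollout: Rollout,
    true_label: str,
    reference_sub_motions: list[str],
    max_tool_calls: int = 3,             # assumed budget, chosen for illustration
) -> float:
    """Toy layered reward: structure gates everything, then accuracy,
    sub-motion relevance, and a tool-usage bonus. All weights are assumptions."""
    # Level 1: structural coherence. A malformed response earns nothing.
    if not rollout.well_formed:
        return 0.0
    reward = 0.2

    # Level 2: correctness of the predicted action label.
    correct = rollout.predicted_label == true_label
    if correct:
        reward += 1.0

    # Level 3: sub-motion relevance, measured as the fraction of reference
    # sub-motions that the response mentions.
    reference = set(reference_sub_motions)
    if reference:
        overlap = len(reference & set(rollout.sub_motions)) / len(reference)
        reward += 0.5 * overlap

    # Level 4: tool-usage efficiency. Tools are rewarded only when they
    # accompany a correct answer and stay within the budget.
    if correct and 0 < rollout.tool_calls <= max_tool_calls:
        reward += 0.3

    return reward
```

Under these assumed weights, a well-formed, correct response that covers every reference sub-motion and uses one tool call scores 2.0, while any malformed response scores 0.0 regardless of its answer.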

Cite

Text

Yuan et al. "Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools." International Conference on Learning Representations, 2026.

Markdown

[Yuan et al. "Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yuan2026iclr-videostar/)

BibTeX

@inproceedings{yuan2026iclr-videostar,
  title     = {{Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools}},
  author    = {Yuan, Zhenlong and Qu, Xiangyan and Qian, Chengxuan and Chen, Rui and Tang, Jing and Sun, Lei and Chu, Xiangxiang and Zhang, Dapeng and Wang, Yiwei and Cai, Yujun and Li, Shuo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yuan2026iclr-videostar/}
}