VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos

Abstract

Recognizing subtle similarities and differences among sets of similar activities is central to many real-world applications, including skill acquisition, sports performance evaluation, and anomaly detection. Humans excel at such fine-grained analysis, which requires comprehensive video understanding and cross-video reasoning about action attributes, poses, positions, and emotional states. Yet existing video-based large language models typically address only single-video recognition, leaving their capacity for multi-video reasoning largely unexplored. We introduce VideoSetDiff, a curated dataset designed to test detail-oriented recognition across diverse activities, from subtle action attributes to viewpoint transitions. Our evaluation of current video-based LLMs on VideoSetDiff reveals critical shortcomings, particularly in fine-grained detail recognition and multi-video reasoning. To mitigate these issues, we propose an automatically generated dataset for instruction tuning alongside a novel multi-video recognition framework. While instruction tuning and specialized multi-video reasoning improve performance, all tested models remain far from satisfactory. These findings underscore the need for more robust video-based LLMs capable of handling complex multi-video tasks, enabling diverse real-world applications.

Cite

Text

Qiu et al. "VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos." International Conference on Computer Vision, 2025.

Markdown

[Qiu et al. "VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/qiu2025iccv-videosetdiff/)

BibTeX

@inproceedings{qiu2025iccv-videosetdiff,
  title     = {{VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos}},
  author    = {Qiu, Yue and Sun, Yanjun and Yagi, Takuma and Egami, Shusaku and Miyata, Natsuki and Fukuda, Ken and Hara, Kensho and Sagawa, Ryusuke},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {12242-12252},
  url       = {https://mlanthology.org/iccv/2025/qiu2025iccv-videosetdiff/}
}