SVIP: Sequence VerIfication for Procedures in Videos
Abstract
In this paper, we propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones that undergo step-level transformations while still conducting the same task. This challenging task resides in an open-set setting, without relying on prior action detection or segmentation, which would require event-level or even frame-level annotations. To that end, we carefully reorganize two publicly available action-related datasets into a step-procedure-task structure. To fully investigate the effectiveness of any method, we collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments. In addition, a novel evaluation metric, Weighted Distance Ratio, is introduced to ensure equivalence among different step-level transformations during evaluation. Finally, a simple but effective baseline based on a transformer encoder with a novel sequence alignment loss is introduced to better characterize long-term dependencies between steps, and it outperforms other action recognition methods. Code and data will be released.
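At its core, sequence verification scores a pair of videos by the distance between their learned video-level embeddings: positive pairs (same step sequence) should lie close together, negative pairs (same task, transformed step order) far apart. The sketch below illustrates this pipeline only in outline; the encoder here is a trivial mean-pooling stand-in for the paper's transformer model, and all function names and the threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def embed_video(frame_features: np.ndarray) -> np.ndarray:
    # Stand-in encoder: mean-pool per-frame features into a single
    # video-level embedding, then L2-normalize. The paper instead uses a
    # transformer encoder trained with a sequence alignment loss; this
    # placeholder only marks where such a model would plug in.
    v = frame_features.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def verification_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Euclidean distance between the two normalized video embeddings.
    return float(np.linalg.norm(embed_video(a) - embed_video(b)))

def is_positive_pair(a: np.ndarray, b: np.ndarray,
                     threshold: float = 1.0) -> bool:
    # Hypothetical decision rule: a pair is verified as performing the
    # same step sequence if its embedding distance falls below a
    # threshold chosen on validation data.
    return verification_distance(a, b) < threshold
```

With a trained encoder in place of `embed_video`, the same distance comparison underlies both verification decisions and ranking-style metrics over positive and negative pairs.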
Cite
Text

Qian et al. "SVIP: Sequence VerIfication for Procedures in Videos." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01927

Markdown

[Qian et al. "SVIP: Sequence VerIfication for Procedures in Videos." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/qian2022cvpr-svip/) doi:10.1109/CVPR52688.2022.01927

BibTeX
@inproceedings{qian2022cvpr-svip,
title = {{SVIP: Sequence VerIfication for Procedures in Videos}},
author = {Qian, Yicheng and Luo, Weixin and Lian, Dongze and Tang, Xu and Zhao, Peilin and Gao, Shenghua},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
  pages = {19890--19902},
doi = {10.1109/CVPR52688.2022.01927},
url = {https://mlanthology.org/cvpr/2022/qian2022cvpr-svip/}
}