Self-Supervised Video Hashing via Bidirectional Transformers

Abstract

Most existing unsupervised video hashing methods are built on unidirectional models with less reliable training objectives, which underuse the correlations among frames and the similarity structure between videos. To enable efficient and scalable video retrieval, we propose a self-supervised video Hashing method based on Bidirectional Transformers (BTH). Based on the encoder-decoder structure of transformers, we design a visual cloze task to fully exploit the bidirectional correlations between frames. To unveil the similarity structure between unlabeled video data, we further develop a similarity reconstruction task by establishing reliable and effective similarity connections in the video space. Furthermore, we develop a cluster assignment task to exploit the structural statistics of the whole dataset such that more discriminative binary codes can be learned. Extensive experiments on three public benchmark datasets, FCVID, ActivityNet, and YFCC, demonstrate the superiority of our proposed approach.
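The abstract names three training signals: a visual cloze (masked-frame reconstruction) task over a bidirectional transformer encoder, a similarity reconstruction task, and a cluster assignment task. Below is a minimal PyTorch-style sketch of the first idea only, assuming pre-extracted frame features; it is not the authors' implementation, and the module names, dimensions, masking ratio, and mean-pooled hash head are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalVideoHasher(nn.Module):
    """Sketch: bidirectional transformer over frame features + visual cloze loss + hash head."""
    def __init__(self, feat_dim=2048, model_dim=256, num_heads=4,
                 num_layers=2, code_bits=64, max_frames=32):
        super().__init__()
        self.embed = nn.Linear(feat_dim, model_dim)            # project frame features
        self.pos = nn.Parameter(torch.zeros(max_frames, model_dim))
        self.mask_token = nn.Parameter(torch.zeros(model_dim)) # placeholder for masked frames
        layer = nn.TransformerEncoderLayer(model_dim, num_heads,
                                           dim_feedforward=4 * model_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers) # bidirectional self-attention
        self.decoder = nn.Linear(model_dim, feat_dim)            # reconstruct masked frame features
        self.hash_head = nn.Linear(model_dim, code_bits)         # continuous relaxation of binary codes

    def forward(self, frames, mask_ratio=0.3):
        # frames: (batch, num_frames, feat_dim) pre-extracted frame features
        b, t, _ = frames.shape
        x = self.embed(frames) + self.pos[:t]
        # visual cloze: randomly replace a fraction of frame tokens with the mask token
        cloze = torch.rand(b, t, device=frames.device) < mask_ratio
        x = torch.where(cloze.unsqueeze(-1), self.mask_token.expand(b, t, -1), x)
        h = self.encoder(x)                                      # each frame attends in both directions
        recon_loss = F.mse_loss(self.decoder(h)[cloze], frames[cloze])
        video_repr = h.mean(dim=1)                               # pool frames into a video vector
        codes = torch.tanh(self.hash_head(video_repr))           # in (-1, 1); sign() at retrieval time
        return codes, recon_loss

# Usage of the sketch: binarize with codes.sign() for retrieval; the paper's
# similarity reconstruction and cluster assignment losses would be added on `codes`.
model = BidirectionalVideoHasher()
codes, loss = model(torch.randn(8, 32, 2048))

In this sketch the similarity reconstruction and cluster assignment objectives are omitted; they would supply additional loss terms defined over the relaxed codes across videos, per the abstract.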

Cite

Text

Li et al. "Self-Supervised Video Hashing via Bidirectional Transformers." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01334

Markdown

[Li et al. "Self-Supervised Video Hashing via Bidirectional Transformers." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/li2021cvpr-selfsupervised/) doi:10.1109/CVPR46437.2021.01334

BibTeX

@inproceedings{li2021cvpr-selfsupervised,
  title     = {{Self-Supervised Video Hashing via Bidirectional Transformers}},
  author    = {Li, Shuyan and Li, Xiu and Lu, Jiwen and Zhou, Jie},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {13549--13558},
  doi       = {10.1109/CVPR46437.2021.01334},
  url       = {https://mlanthology.org/cvpr/2021/li2021cvpr-selfsupervised/}
}