MDMMT: Multidomain Multimodal Transformer for Video Retrieval

Abstract

We present a new state-of-the-art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, state-of-the-art results are achieved with a single model and without finetuning. This multidomain generalisation is achieved by a proper combination of different video caption datasets. We show that training on multiple datasets with our practical approach can improve test results on each of them. Additionally, we check the intersection between many popular datasets and show that MSRVTT as well as ActivityNet contain a significant overlap between the test and training parts. More details are available at https://github.com/papermsucode/mdmmt.

Cite

Text

Dzabraev et al. "MDMMT: Multidomain Multimodal Transformer for Video Retrieval." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021. doi:10.1109/CVPRW53098.2021.00374

Markdown

[Dzabraev et al. "MDMMT: Multidomain Multimodal Transformer for Video Retrieval." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.](https://mlanthology.org/cvprw/2021/dzabraev2021cvprw-mdmmt/) doi:10.1109/CVPRW53098.2021.00374

BibTeX

@inproceedings{dzabraev2021cvprw-mdmmt,
  title     = {{MDMMT: Multidomain Multimodal Transformer for Video Retrieval}},
  author    = {Dzabraev, Maksim and Kalashnikov, Maksim and Komkov, Stepan and Petiushko, Aleksandr},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2021},
  pages     = {3354--3363},
  doi       = {10.1109/CVPRW53098.2021.00374},
  url       = {https://mlanthology.org/cvprw/2021/dzabraev2021cvprw-mdmmt/}
}