MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

Abstract

Efficient lightweight neural networks have received increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video models still focus on larger ViT architectures, and few works attempt to build efficient architectures. Since many efficient contrastive language-image pre-training (CLIP) models have shown strong zero-shot classification and retrieval capability, we attempt to fill the gap in video-text understanding models and propose MobileViCLIP, a fast and efficient video-text model with strong zero-shot reasoning capability that can be deployed on mobile devices. In particular, our MobileViCLIP-Small obtains zero-shot retrieval performance similar to InternVideo2-L14 on the text-to-video retrieval dataset MSR-VTT while being 46.7x faster when deployed on mobile devices. Furthermore, MobileViCLIP-Small generalizes to the zero-shot action recognition task and obtains 1.0% better Top-1 accuracy than InternVideo2-S14 while being 5.6x faster on mobile devices. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
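To illustrate the zero-shot retrieval setting the abstract refers to, the sketch below shows the generic CLIP-style scoring step: a text query and a set of videos are embedded into a shared space, and videos are ranked by cosine similarity. This is a minimal illustration, not the paper's implementation; the encoders are replaced by placeholder tensors and the embedding dimension (512) is assumed. The actual MobileViCLIP model and loading code are in the GitHub repository linked above.

```python
# Minimal sketch of CLIP-style zero-shot text-to-video retrieval scoring.
# Encoder outputs are mocked with random tensors; swap in real text/video
# embeddings from a video-text model such as MobileViCLIP to use this.
import torch
import torch.nn.functional as F


def rank_videos(text_emb: torch.Tensor, video_embs: torch.Tensor) -> torch.Tensor:
    """Return video indices sorted by cosine similarity to the text query."""
    text_emb = F.normalize(text_emb, dim=-1)      # (D,)  unit-normalize query
    video_embs = F.normalize(video_embs, dim=-1)  # (N, D) unit-normalize gallery
    sims = video_embs @ text_emb                  # (N,)  cosine similarities
    return sims.argsort(descending=True)


# Placeholder embeddings standing in for encoder outputs (dimension assumed).
text_emb = torch.randn(512)
video_embs = torch.randn(100, 512)
print(rank_videos(text_emb, video_embs)[:5])  # indices of the top-5 retrieved videos
```

Zero-shot classification follows the same pattern with class names rendered as text prompts and ranked per video instead of per query.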

Cite

Text

Yang et al. "MobileViCLIP: An Efficient Video-Text Model for Mobile Devices." International Conference on Computer Vision, 2025.

Markdown

[Yang et al. "MobileViCLIP: An Efficient Video-Text Model for Mobile Devices." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/yang2025iccv-mobileviclip/)

BibTeX

@inproceedings{yang2025iccv-mobileviclip,
  title     = {{MobileViCLIP: An Efficient Video-Text Model for Mobile Devices}},
  author    = {Yang, Min and Jia, Zihan and Dai, Zhilin and Guo, Sheng and Wang, Limin},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {20824-20835},
  url       = {https://mlanthology.org/iccv/2025/yang2025iccv-mobileviclip/}
}