Multi-Entity Video Transformers for Fine-Grained Video Representation Learning

Walmer, Matthew; Kanjirathinkal, Rose Catherine; Tai, Kai Sheng; Muzumdar, Keyur; Tian, Tai-Peng; Shrivastava, Abhinav

CVPRW 2025 pp. 2110-2120

Abstract

The area of temporally fine-grained video representation learning focuses on generating frame-by-frame representations for temporally dense tasks, such as fine-grained action phase classification and frame retrieval. In this work, we advance the state-of-the-art for self-supervised models in this area by re-examining the design of transformer architectures for video representation learning. A key aspect of our approach is the improved sharing of scene information in the temporal pipeline by representing multiple salient entities per frame. Prior works use late-fusion architectures that reduce frames to a single-dimensional vector before modeling any cross-frame dynamics. In contrast, our Multi-entity Video Transformer (MV-Former) processes the frames as groups of entities represented as tokens linked across time. To achieve this, we propose a Learnable Spatial Token Pooling strategy to identify and extract features for multiple salient regions per frame. Through our experiments, we show that MV-Former outperforms previous self-supervised methods, and also surpasses some prior works that use additional supervision or training data. When combined with additional pre-training data from Kinetics-400, MV-Former achieves a further performance boost. Overall, our MV-Former achieves state-of-the-art results on multiple fine-grained video benchmarks and shows that parsing video scenes as collections of entities can enhance performance in video tasks.
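The contrast the abstract draws between late fusion (one vector per frame) and multiple entity tokens per frame can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how Learnable Spatial Token Pooling works, so the attention-style pooling below, the function name `pool_entities`, the `queries` array (a random stand-in for learned parameters), and all sizes are assumptions chosen only to show the shapes involved.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_entities(patch_tokens, queries):
    """Pool K entity tokens per frame from P patch tokens (hypothetical sketch).

    patch_tokens: (T, P, D) array of per-frame patch features.
    queries:      (K, D) array standing in for learnable pooling queries.
    Returns:      (T, K, D) array with one pooled token per query per frame.
    """
    d = patch_tokens.shape[-1]
    # Similarity of every query to every patch in every frame: (T, K, P).
    scores = np.einsum("kd,tpd->tkp", queries, patch_tokens) / np.sqrt(d)
    # Each query distributes attention over the patches of a frame.
    weights = softmax(scores, axis=-1)
    # Weighted sum of patch features gives one token per query per frame.
    return np.einsum("tkp,tpd->tkd", weights, patch_tokens)


rng = np.random.default_rng(0)
T, P, D, K = 4, 16, 8, 3  # frames, patches per frame, feature dim, entities
patches = rng.normal(size=(T, P, D))
queries = rng.normal(size=(K, D))  # would be trained in a real model

# Late fusion: each frame collapses to a single vector before temporal modeling.
late_fusion = patches.mean(axis=1)  # (T, D)

# Multi-entity: each frame keeps K tokens, all passed on to the temporal stage.
entities = pool_entities(patches, queries)  # (T, K, D)
temporal_tokens = entities.reshape(T * K, D)  # (T*K, D) sequence across time
```

Under these assumptions, a temporal transformer would receive `T * K` tokens instead of `T`, which is one way to read "groups of entities represented as tokens linked across time". With all-zero queries the pooling weights are uniform and each entity token reduces to the per-frame mean, i.e. the late-fusion vector.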

Cite

Text

Walmer et al. "Multi-Entity Video Transformers for Fine-Grained Video Representation Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Walmer et al. "Multi-Entity Video Transformers for Fine-Grained Video Representation Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/walmer2025cvprw-multientity/)

BibTeX

@inproceedings{walmer2025cvprw-multientity,
  title     = {{Multi-Entity Video Transformers for Fine-Grained Video Representation Learning}},
  author    = {Walmer, Matthew and Kanjirathinkal, Rose Catherine and Tai, Kai Sheng and Muzumdar, Keyur and Tian, Tai-Peng and Shrivastava, Abhinav},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {2110-2120},
  url       = {https://mlanthology.org/cvprw/2025/walmer2025cvprw-multientity/}
}