Elysium: Exploring Object-Level Perception in Videos Through Semantic Integration Using MLLMs

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application to video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. First, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Second, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce a large-scale video dataset that supports three tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). The dataset contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we train MLLMs and propose a token-compression model, T-Selector, to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that attempts to conduct object-level tasks in videos without requiring any additional plug-in or expert models. All code and datasets are released at https://github.com/Hon-Wong/Elysium.
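The second challenge in the abstract, fitting many frames into an LLM context window, can be illustrated with simple token arithmetic. The sketch below is not the paper's T-Selector and does not reproduce any figure from the paper; the function names and the numbers (256 or 32 visual tokens per frame, a 4096-token context window, 512 tokens reserved for text) are hypothetical values chosen only to show why compressing per-frame tokens increases the number of frames that fit.

```python
def visual_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens the LLM must process for a clip."""
    return num_frames * tokens_per_frame


def fits_in_context(num_frames: int, tokens_per_frame: int,
                    context_window: int, text_budget: int = 0) -> bool:
    """True if the visual tokens plus the reserved text tokens fit in the window."""
    total = visual_token_count(num_frames, tokens_per_frame) + text_budget
    return total <= context_window


def max_frames(tokens_per_frame: int, context_window: int,
               text_budget: int = 0) -> int:
    """Largest number of frames that fits after reserving text_budget tokens."""
    return max(0, (context_window - text_budget) // tokens_per_frame)


if __name__ == "__main__":
    # Hypothetical settings, not taken from the paper.
    context_window = 4096
    text_budget = 512
    for tokens_per_frame in (256, 32):
        frames = max_frames(tokens_per_frame, context_window, text_budget)
        print(f"{tokens_per_frame} tokens/frame: up to {frames} frames")
```

Under these assumed numbers, reducing the per-frame token count by a factor of 8 raises the frame capacity by the same factor, which is the kind of trade-off a token-compression module targets.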

Cite

Text

Wang et al. "Elysium: Exploring Object-Level Perception in Videos Through Semantic Integration Using MLLMs." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72670-5_10

Markdown

[Wang et al. "Elysium: Exploring Object-Level Perception in Videos Through Semantic Integration Using MLLMs." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wang2024eccv-elysium/) doi:10.1007/978-3-031-72670-5_10

BibTeX

@inproceedings{wang2024eccv-elysium,
  title     = {{Elysium: Exploring Object-Level Perception in Videos Through Semantic Integration Using MLLMs}},
  author    = {Wang, Han and Wang, Yanjie and Ye, Yongjie and Nie, Yuxiang and Huang, Can},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72670-5_10},
  url       = {https://mlanthology.org/eccv/2024/wang2024eccv-elysium/}
}