Aligning Instance Brownian Bridge with Texts for Open-Vocabulary Video Instance Segmentation

Abstract

Temporally locating objects with arbitrary class texts is the primary pursuit of open-vocabulary Video Instance Segmentation (VIS). Because video data offer only a limited vocabulary, previous methods leverage image-text pretrained models to recognize object instances, aligning each frame with class texts separately. This per-frame separation breaks the instance movement context of videos and incurs substantial inference overhead. To tackle these issues, we propose BridgeText Alignment (BTA), which links frame-level instance representations as a Brownian bridge. On one hand, we can compute the global descriptor of a Brownian bridge to capture instance dynamics, so the alignment with texts considers temporal information in addition to the static information of each frame. On the other hand, owing to the goal-conditioned property of the Brownian bridge, intermediate frame features can be estimated from the start- and end-frame features, so computing the global feature of a Brownian bridge requires inferring only a few frames, which greatly reduces inference overhead. We term the overall pipeline BriVIS. Following the training settings of previous works, BriVIS surpasses the state of the art (OV2Seg) by a clear margin. For example, on the challenging large-vocabulary datasets BURST and LVVIS, BriVIS achieves 5.7 and 20.9 mAP, an improvement of +2.2 to +6.7 mAP over OV2Seg. Furthermore, after training via BTA, using only the head and tail frames for alignment speeds up inference by 32% (2.77 → 1.88 s/iter) while decreasing performance by only 0.2 mAP (21.1 → 20.9 mAP).
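
For intuition, the sketch below illustrates the goal-conditioned property the abstract relies on: the conditional mean of a Brownian bridge pinned to the head- and tail-frame features is a linear interpolation of the two endpoints, so intermediate frame features can be approximated without running the model on every frame. This is a minimal sketch under that assumption; the names (bridge_expectation, head_feat, tail_feat) are hypothetical and do not reflect the released BriVIS implementation.

import numpy as np

def bridge_expectation(head, tail, t, T):
    # Conditional mean of a Brownian bridge at time t, given its endpoints:
    #   E[X_t | X_0 = head, X_T = tail] = (1 - t/T) * head + (t/T) * tail
    w = t / T
    return (1.0 - w) * head + w * tail

# Hypothetical usage: approximate per-frame instance features for a 10-frame clip
# from only the first and last frame features, then average them into a
# clip-level descriptor that could be aligned with class text embeddings.
T = 9
head_feat = np.random.randn(256)   # instance feature of the first frame (placeholder)
tail_feat = np.random.randn(256)   # instance feature of the last frame (placeholder)
frame_feats = np.stack([bridge_expectation(head_feat, tail_feat, t, T)
                        for t in range(T + 1)])
clip_descriptor = frame_feats.mean(axis=0)  # global descriptor of the bridge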

Cite

Text

Cheng et al. "Aligning Instance Brownian Bridge with Texts for Open-Vocabulary Video Instance Segmentation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I3.32250

Markdown

[Cheng et al. "Aligning Instance Brownian Bridge with Texts for Open-Vocabulary Video Instance Segmentation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/cheng2025aaai-aligning/) doi:10.1609/AAAI.V39I3.32250

BibTeX

@inproceedings{cheng2025aaai-aligning,
  title     = {{Aligning Instance Brownian Bridge with Texts for Open-Vocabulary Video Instance Segmentation}},
  author    = {Cheng, Zesen and Li, Kehan and Li, Hao and Jin, Peng and Zheng, Xiawu and Liu, Chang and Chen, Jie},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {2482--2490},
  doi       = {10.1609/AAAI.V39I3.32250},
  url       = {https://mlanthology.org/aaai/2025/cheng2025aaai-aligning/}
}