SMITE: Segment Me in TimE

Abstract

Segmenting an object in a video presents significant challenges. Each pixel must be accurately labeled, and these labels must remain consistent across frames. The difficulty increases when segmentation is performed at arbitrary granularity, meaning the number of segments can vary, and masks are defined from only one or a few sample images. In this paper, we address this problem by employing a pre-trained text-to-image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively handle various segmentation scenarios and outperforms state-of-the-art alternatives. The project page is available at https://segment-me-in-time.github.io/.

Cite

Text

Alimohammadi et al. "SMITE: Segment Me in TimE." International Conference on Learning Representations, 2025.

Markdown

[Alimohammadi et al. "SMITE: Segment Me in TimE." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/alimohammadi2025iclr-smite/)

BibTeX

@inproceedings{alimohammadi2025iclr-smite,
  title     = {{SMITE: Segment Me in TimE}},
  author    = {Alimohammadi, Amirhossein and Nag, Sauradip and Asgari, Saeid and Tagliasacchi, Andrea and Hamarneh, Ghassan and Amiri, Ali Mahdavi},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/alimohammadi2025iclr-smite/}
}