Weakly Supervised Learning of Object Segmentations from Web-Scale Video
Abstract
We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as “dog”, without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatiotemporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a subset of 50,000 frames with pixel-level ground-truth annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
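The segment-level weak supervision described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which features or classifiers the paper uses, so the two-dimensional synthetic features, the logistic regression model, and the names `train_segment_classifier` and `score_segment` are all assumptions made for illustration. Every segment simply inherits the tag of its video, which makes the labels noisy because background segments in tagged videos are also marked positive. The graph-cut refinement step is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_segment_classifier(features, labels, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent (illustrative model choice)."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def score_segment(x, w, b):
    """Probability that a segment belongs to the tagged object class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Synthetic data (assumption): 40 videos with 10 segments each.
# The first 20 videos carry the tag; half of their segments show the object.
rng = random.Random(0)
features, labels, is_object = [], [], []
for video in range(40):
    tagged = video < 20
    for seg in range(10):
        obj = tagged and seg < 5
        center = 2.0 if obj else 0.0
        features.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        labels.append(1.0 if tagged else 0.0)  # weak label: the video-level tag
        is_object.append(obj)

w, b = train_segment_classifier(features, labels)
scores = [score_segment(x, w, b) for x in features]

obj_scores = [s for s, o in zip(scores, is_object) if o]
bg_scores = [s for s, o in zip(scores, is_object) if not o]
obj_mean = sum(obj_scores) / len(obj_scores)
bg_mean = sum(bg_scores) / len(bg_scores)
```

Although the background segments of tagged videos are labeled positive during training, the classifier scores true object segments higher on average in this toy setup, because background appearance is shared with the untagged videos. High-scoring segments would then play the role of the object seeds mentioned in the abstract.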
Cite
Text
Hartmann et al. "Weakly Supervised Learning of Object Segmentations from Web-Scale Video." European Conference on Computer Vision Workshops, 2012. doi:10.1007/978-3-642-33863-2_20
Markdown
[Hartmann et al. "Weakly Supervised Learning of Object Segmentations from Web-Scale Video." European Conference on Computer Vision Workshops, 2012.](https://mlanthology.org/eccvw/2012/hartmann2012eccvw-weakly/) doi:10.1007/978-3-642-33863-2_20
BibTeX
@inproceedings{hartmann2012eccvw-weakly,
title = {{Weakly Supervised Learning of Object Segmentations from Web-Scale Video}},
author = {Hartmann, Glenn and Grundmann, Matthias and Hoffman, Judy and Tsai, David and Kwatra, Vivek and Madani, Omid and Vijayanarasimhan, Sudheendra and Essa, Irfan A. and Rehg, James M. and Sukthankar, Rahul},
booktitle = {European Conference on Computer Vision Workshops},
year = {2012},
pages = {198-208},
doi = {10.1007/978-3-642-33863-2_20},
url = {https://mlanthology.org/eccvw/2012/hartmann2012eccvw-weakly/}
}