Weakly-Supervised Semantic Segmentation Using Motion Cues
Abstract
Fully convolutional neural networks (FCNNs) trained on a large number of images with strong pixel-level annotations have become the new state of the art for the semantic segmentation task. While there have been recent attempts to learn FCNNs from image-level weak annotations, they need additional constraints, such as the size of an object, to obtain reasonable performance. To address this issue, we present motion-CNN (M-CNN), a novel FCNN framework which incorporates motion cues and is learned from video-level weak annotations. Our learning scheme to train the network uses motion segments as soft constraints, thereby handling noisy motion information. When trained on weakly-annotated videos, our method outperforms the state-of-the-art approach on the PASCAL VOC 2012 image segmentation benchmark. We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images. Finally, M-CNN substantially outperforms recent approaches in a related task of video co-localization on the YouTube-Objects dataset.
Cite
Text
Tokmakov et al. "Weakly-Supervised Semantic Segmentation Using Motion Cues." European Conference on Computer Vision, 2016. doi:10.1007/978-3-319-46493-0_24
Markdown
[Tokmakov et al. "Weakly-Supervised Semantic Segmentation Using Motion Cues." European Conference on Computer Vision, 2016.](https://mlanthology.org/eccv/2016/tokmakov2016eccv-weakly/) doi:10.1007/978-3-319-46493-0_24
BibTeX
@inproceedings{tokmakov2016eccv-weakly,
title = {{Weakly-Supervised Semantic Segmentation Using Motion Cues}},
author = {Tokmakov, Pavel and Alahari, Karteek and Schmid, Cordelia},
booktitle = {European Conference on Computer Vision},
year = {2016},
  pages = {388--404},
doi = {10.1007/978-3-319-46493-0_24},
url = {https://mlanthology.org/eccv/2016/tokmakov2016eccv-weakly/}
}