Semantic Video CNNs Through Representation Warping

Abstract

In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping module, called NetWarp, that can augment existing architectures at very little extra computational cost, and we demonstrate its use for a range of network architectures. The main design principle is to use optical flow of adjacent frames to warp internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only a small extra computational cost while improving performance when video streams are available. We achieve new state-of-the-art results on the CamVid and Cityscapes benchmark datasets and show consistent improvements over different baseline networks. Our code and models are available at http://segmentation.is.tue.mpg.de
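The core operation the abstract describes, warping a feature map from the previous frame into alignment with the current frame using optical flow, can be sketched as a backward warp with bilinear sampling. This is a minimal NumPy illustration of the general technique, not the authors' NetWarp implementation (which additionally transforms the flow with a small learned network); the function name and array layout are assumptions for the example.

```python
import numpy as np

def warp_features(feat, flow):
    """Backward-warp a feature map along optical flow (bilinear sampling).

    feat: (C, H, W) internal CNN features from the previous frame.
    flow: (2, H, W) flow from the current frame to the previous one,
          as per-pixel (dx, dy) displacements.
    Returns (C, H, W) features resampled into the current frame's grid.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Real-valued sample locations in the previous frame, clipped to the image.
    x = np.clip(xs + flow[0], 0, W - 1)
    y = np.clip(ys + flow[1], 0, H - 1)
    # Integer corners and fractional offsets for bilinear interpolation.
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    # Weighted sum over the four neighbouring feature vectors.
    return (feat[:, y0, x0] * (1 - wy) * (1 - wx)
          + feat[:, y0, x1] * (1 - wy) * wx
          + feat[:, y1, x0] * wy * (1 - wx)
          + feat[:, y1, x1] * wy * wx)
```

Because the sampling is differentiable in both the features and the flow, a module like this can be dropped between layers of an existing segmentation CNN and trained end-to-end, which is what makes the combination with fast flow methods practical.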

Cite

Text

Gadde et al. "Semantic Video CNNs Through Representation Warping." International Conference on Computer Vision, 2017. doi:10.1109/ICCV.2017.477

Markdown

[Gadde et al. "Semantic Video CNNs Through Representation Warping." International Conference on Computer Vision, 2017.](https://mlanthology.org/iccv/2017/gadde2017iccv-semantic/) doi:10.1109/ICCV.2017.477

BibTeX

@inproceedings{gadde2017iccv-semantic,
  title     = {{Semantic Video CNNs Through Representation Warping}},
  author    = {Gadde, Raghudeep and Jampani, Varun and Gehler, Peter V.},
  booktitle = {International Conference on Computer Vision},
  year      = {2017},
  doi       = {10.1109/ICCV.2017.477},
  url       = {https://mlanthology.org/iccv/2017/gadde2017iccv-semantic/}
}