Recurrent Temporal Deep Field for Semantic Video Labeling

Abstract

This paper specifies a new deep architecture, called Recurrent Temporal Deep Field (RTDF), for semantic video labeling. RTDF is a conditional random field (CRF) that combines a deconvolution network (DeconvNet) and a recurrent temporal restricted Boltzmann machine (RTRBM). DeconvNet is grounded onto pixels of a new frame for estimating the unary potential of the CRF. RTRBM estimates a high-order potential of the CRF by capturing long-term spatiotemporal dependencies among pixel labels that RTDF has already predicted in previous frames. We derive a mean-field inference algorithm to jointly predict all latent variables in both the RTRBM and the CRF, and conduct end-to-end joint training of all DeconvNet, RTRBM, and CRF parameters. The joint learning and inference integrate the three components into a unified deep model, RTDF. Our evaluation on the benchmark YouTube Face Database (YFDB) and the Cambridge-driving Labeled Video Database (CamVid) demonstrates that RTDF outperforms the state of the art both qualitatively and quantitatively.
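The abstract's decomposition, a CRF whose unary potential comes from DeconvNet and whose high-order potential comes from the RTRBM conditioned on labels predicted in earlier frames, can be sketched as an energy function. The symbols below are illustrative assumptions for exposition, not notation taken from the paper:

```latex
% Hedged sketch of the RTDF energy for frame t.
% x_t: frame pixels; y_t: pixel labels; y_{<t}: labels predicted earlier.
% All symbol choices here are assumptions for illustration.
E(\mathbf{y}_t \mid \mathbf{x}_t, \mathbf{y}_{<t})
  = \sum_{i} \psi^{\text{DeconvNet}}_i(y_{t,i}; \mathbf{x}_t)
  \;+\; \Psi^{\text{RTRBM}}(\mathbf{y}_t; \mathbf{y}_{<t}),
```

where the first sum is the per-pixel unary term grounded in the current frame, and the second is a single high-order term scoring the whole label map against the temporal context; mean-field inference then approximates the minimizer of this energy jointly with the RTRBM's latent variables.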

Cite

Text

Lei and Todorovic. "Recurrent Temporal Deep Field for Semantic Video Labeling." European Conference on Computer Vision, 2016. doi:10.1007/978-3-319-46454-1_19

Markdown

[Lei and Todorovic. "Recurrent Temporal Deep Field for Semantic Video Labeling." European Conference on Computer Vision, 2016.](https://mlanthology.org/eccv/2016/lei2016eccv-recurrent/) doi:10.1007/978-3-319-46454-1_19

BibTeX

@inproceedings{lei2016eccv-recurrent,
  title     = {{Recurrent Temporal Deep Field for Semantic Video Labeling}},
  author    = {Lei, Peng and Todorovic, Sinisa},
  booktitle = {European Conference on Computer Vision},
  year      = {2016},
  pages     = {302--317},
  doi       = {10.1007/978-3-319-46454-1_19},
  url       = {https://mlanthology.org/eccv/2016/lei2016eccv-recurrent/}
}