A Multi-Scale CNN for Affordance Segmentation in RGB Images

Abstract

Given a single RGB image, our goal is to label every pixel with an affordance type. By affordance, we mean an object’s capability to readily support a certain human action, without requiring precursor actions. We focus on segmenting five affordance types in indoor scenes: ‘walkable’, ‘sittable’, ‘lyable’, ‘reachable’, and ‘movable’. Our approach uses a deep architecture, consisting of a number of multi-scale convolutional neural networks (CNNs), to extract mid-level visual cues and combine them toward affordance segmentation. The mid-level cues include a depth map, surface normals, and a segmentation into four surface types: floor, structure, furniture, and props. For evaluation, we augmented the NYUv2 dataset with new ground-truth annotations of the five affordance types. We are not aware of prior work that starts from pixels, infers mid-level cues, and combines them in a feed-forward fashion to predict dense affordance maps from a single RGB image.
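The feed-forward pipeline described above (mid-level cue branches whose outputs are concatenated and fused into per-pixel affordance scores) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the branch functions, the 1x1 fusion layer, and all weights here are hypothetical placeholders with the right shapes, standing in for the trained multi-scale CNNs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the mid-level CNN branches. In the actual
# method each branch is a trained multi-scale CNN; here they return
# random arrays of the right shape, purely to show the data flow.

def predict_depth(rgb):
    """Placeholder depth branch -> (H, W, 1) depth map."""
    h, w, _ = rgb.shape
    return rng.random((h, w, 1))

def predict_normals(rgb):
    """Placeholder normals branch -> (H, W, 3) unit surface normals."""
    h, w, _ = rgb.shape
    n = rng.standard_normal((h, w, 3))
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def predict_surfaces(rgb):
    """Placeholder segmentation branch -> (H, W, 4) scores for
    floor / structure / furniture / props."""
    h, w, _ = rgb.shape
    logits = rng.standard_normal((h, w, 4))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

AFFORDANCES = ["walkable", "sittable", "lyable", "reachable", "movable"]

def affordance_maps(rgb, fusion_w=None):
    """Concatenate the mid-level cues along channels and fuse them
    feed-forward into five per-pixel affordance probability maps."""
    cues = np.concatenate(
        [predict_depth(rgb), predict_normals(rgb), predict_surfaces(rgb)],
        axis=-1,
    )  # (H, W, 1 + 3 + 4) = (H, W, 8)
    if fusion_w is None:
        # 1x1 "fusion" layer; learned in a real model, random here.
        fusion_w = rng.standard_normal((cues.shape[-1], len(AFFORDANCES)))
    logits = cues @ fusion_w  # (H, W, 5)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # per-pixel softmax

rgb = rng.random((8, 8, 3))          # toy 8x8 "image"
probs = affordance_maps(rgb)
print(probs.shape)                   # (8, 8, 5): one map per affordance
```

Note that an affordance map is dense: every pixel gets a distribution over the five affordance types, in the same spirit as semantic segmentation.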

Cite

Text

Roy and Todorovic. "A Multi-Scale CNN for Affordance Segmentation in RGB Images." European Conference on Computer Vision, 2016. doi:10.1007/978-3-319-46493-0_12

Markdown

[Roy and Todorovic. "A Multi-Scale CNN for Affordance Segmentation in RGB Images." European Conference on Computer Vision, 2016.](https://mlanthology.org/eccv/2016/roy2016eccv-multi/) doi:10.1007/978-3-319-46493-0_12

BibTeX

@inproceedings{roy2016eccv-multi,
  title     = {{A Multi-Scale CNN for Affordance Segmentation in RGB Images}},
  author    = {Roy, Anirban and Todorovic, Sinisa},
  booktitle = {European Conference on Computer Vision},
  year      = {2016},
  pages     = {186--201},
  doi       = {10.1007/978-3-319-46493-0_12},
  url       = {https://mlanthology.org/eccv/2016/roy2016eccv-multi/}
}