Learning Spatial Common Sense with Geometry-Aware Recurrent Networks
Abstract
We integrate two powerful ideas, geometry and deep visual representation learning, into recurrent network architectures for mobile visual scene understanding. The proposed networks learn to “lift” 2D visual features and integrate them over time into latent 3D feature maps of the scene. They are equipped with differentiable geometric operations, such as projection, unprojection, and egomotion stabilization, to compute a geometrically consistent mapping between the world scene and their 3D latent feature space. We train the proposed architectures to predict novel image views given short frame sequences as input. Their predictions generalize strongly to scenes with novel numbers, appearances, and configurations of objects, and greatly outperform the predictions of previous works that do not consider egomotion stabilization or a space-aware latent feature space. Our experiments suggest that the proposed space-aware latent feature arrangement and egomotion-stabilized convolutions are essential architectural choices for spatial common sense to emerge in artificial embodied visual agents.
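The central operation the abstract describes is unprojection: lifting a 2D feature map into a latent 3D feature grid. The sketch below is a minimal, hypothetical illustration of such a differentiable unprojection in PyTorch, not the authors' implementation; the function name unproject, the grid bounds, and the depth range are illustrative assumptions. Each voxel center of a regular grid in camera coordinates is projected into the image with pinhole intrinsics K, and the 2D features are bilinearly sampled at that pixel, so every voxel along a camera ray inherits that pixel's feature.

import torch
import torch.nn.functional as F

def unproject(feat2d, K, grid_size=32, depth_range=(1.0, 5.0)):
    """Lift 2D image features into a latent 3D feature grid.

    A hypothetical sketch, not the paper's code.

    feat2d: (B, C, H, W) image feature map
    K:      (3, 3) pinhole camera intrinsics
    """
    B, C, H, W = feat2d.shape
    D = grid_size
    # Voxel centers: a D x D x D box in camera coordinates (assumed bounds).
    zs = torch.linspace(depth_range[0], depth_range[1], D)
    ys = torch.linspace(-1.0, 1.0, D)
    xs = torch.linspace(-1.0, 1.0, D)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")  # each (D, D, D)
    # Pinhole projection of every voxel center: u = fx*X/Z + cx, v = fy*Y/Z + cy.
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    # Normalize pixel coordinates to [-1, 1], as grid_sample expects.
    grid = torch.stack([2.0 * u / (W - 1) - 1.0,
                        2.0 * v / (H - 1) - 1.0], dim=-1)  # (D, D, D, 2)
    grid = grid.view(1, D, D * D, 2).expand(B, -1, -1, -1)
    # Bilinear sampling; voxels projecting outside the image receive zeros.
    feat3d = F.grid_sample(feat2d, grid, align_corners=True)  # (B, C, D, D*D)
    return feat3d.view(B, C, D, D, D)

Under the same assumptions, egomotion stabilization can be sketched analogously: the 3D feature grid is resampled under the inverse camera motion (rotating and translating the voxel coordinates, then sampling trilinearly), so that the recurrent update and its convolutions operate in a world-stable frame, as the abstract argues is essential.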
Cite
Text
Tung et al. "Learning Spatial Common Sense with Geometry-Aware Recurrent Networks." ICLR 2019 Workshops: LLD, 2019.
Markdown
[Tung et al. "Learning Spatial Common Sense with Geometry-Aware Recurrent Networks." ICLR 2019 Workshops: LLD, 2019.](https://mlanthology.org/iclrw/2019/tung2019iclrw-learning/)
BibTeX
@inproceedings{tung2019iclrw-learning,
  title = {{Learning Spatial Common Sense with Geometry-Aware Recurrent Networks}},
  author = {Tung, Hsiao-Yu and Cheng, Ricson and Fragkiadaki, Katerina},
  booktitle = {ICLR 2019 Workshops: LLD},
  year = {2019},
  url = {https://mlanthology.org/iclrw/2019/tung2019iclrw-learning/}
}