3DQ-Nets: Visual Concepts Emerge in Pose Equivariant 3D Quantized Neural Scene Representations
Abstract
Concept learning lies at the very heart of intelligence, providing organizing principles with which to comprehend the world (6). Most computer vision models learn concept classifiers or detectors using labelled examples of object boxes, poses and categories. Self-supervised or unsupervised approaches mostly focus on (pre)training CNNs on auxiliary pretext tasks to lessen the need for human labels for a downstream recognition task. The visual "concepts" learnt from pretext tasks are implicit, represented as distributed neural CNN activations (7). We see the following limitations with representing concepts (solely) as neural activations across multiple layers of a deep network: i) visual memory and computation are not separated (3), which means that computation increases exponentially with the number of visual concepts learnt, ii) concepts are not stored explicitly and cannot be referred to or retrieved on demand, iii) the number of concepts cannot grow automatically with new visual experiences; rather, it is fixed by the processing architecture, which is at odds with the idea that animals are capable of spontaneous concept instantiation in novel scenes (2), iv) concepts cannot be mentally manipulated by imagining variations, transformations or mental simulations (8), v) concepts have no spatial extent and are hard to use for spatial reasoning.
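Limitations ii) and iii) above contrast distributed activations with an explicit, addressable concept memory. A minimal sketch of such a memory is a growable prototype dictionary that quantizes feature vectors to their nearest stored prototype and instantiates a new prototype when nothing matches; the class name, distance metric, and threshold below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class ConceptDictionary:
    """Sketch of an external, growable concept memory (illustrative only).

    Feature vectors (e.g. pooled object features) are quantized to the
    nearest stored prototype; if no prototype is close enough, a new
    concept is spontaneously instantiated. This separates memory (the
    prototype table) from computation (the nearest-neighbour lookup).
    """

    def __init__(self, dim, threshold=0.5):
        self.threshold = threshold            # max distance to reuse a prototype
        self.prototypes = np.empty((0, dim))  # concept memory, starts empty

    def quantize(self, feature):
        """Return the index of the matching prototype, creating one if needed."""
        if len(self.prototypes) > 0:
            dists = np.linalg.norm(self.prototypes - feature, axis=1)
            idx = int(np.argmin(dists))
            if dists[idx] < self.threshold:
                return idx                    # retrieve an existing concept
        # no close match: instantiate a new concept
        self.prototypes = np.vstack([self.prototypes, feature[None, :]])
        return len(self.prototypes) - 1

    def retrieve(self, idx):
        """Concepts are addressable and can be fetched on demand."""
        return self.prototypes[idx]
```

Two nearby features then map to the same concept index, while a distant feature allocates a fresh one, so the number of concepts is driven by experience rather than by the architecture.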
Cite
Text
Prabhudesai et al. "3DQ-Nets: Visual Concepts Emerge in Pose Equivariant 3D Quantized Neural Scene Representations." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020. doi:10.1109/CVPRW50498.2020.00202
BibTeX
@inproceedings{prabhudesai2020cvprw-3dqnets,
title = {{3DQ-Nets: Visual Concepts Emerge in Pose Equivariant 3D Quantized Neural Scene Representations}},
author = {Prabhudesai, Mihir and Lal, Shamit and Tung, Hsiao-Yu Fish and Harley, Adam W. and Potdar, Shubhankar and Fragkiadaki, Katerina},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2020},
pages = {1567-1570},
doi = {10.1109/CVPRW50498.2020.00202},
url = {https://mlanthology.org/cvprw/2020/prabhudesai2020cvprw-3dqnets/}
}