Spatial Knowledge Distillation to Aid Visual Reasoning
Abstract
For tasks involving language and vision, the current state-of-the-art methods tend not to leverage any additional information that might be present to gather relevant (commonsense) knowledge. A representative task is Visual Question Answering, where large diagnostic datasets have been proposed to test a system's capability of answering questions about images. The training data is often accompanied by annotations of individual object properties and spatial locations. In this work, we take a step towards integrating this additional privileged information in the form of spatial knowledge to aid in visual reasoning. We propose a framework that combines recent advances in knowledge distillation (teacher-student framework), relational reasoning and probabilistic logical languages to incorporate such knowledge in existing neural networks for the task of Visual Question Answering. Specifically, for a question posed against an image, we use a probabilistic logical language to encode the spatial knowledge and the spatial understanding about the question in the form of a mask that is directly provided to the teacher network. The student network learns from the ground-truth information as well as the teacher's prediction via distillation. We also demonstrate the impact of predicting such a mask inside the teacher's network using attention. Empirically, we show that both methods improve the test accuracy over a state-of-the-art approach on a publicly available dataset.
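The student's training signal described in the abstract (ground truth plus the teacher's prediction) can be illustrated with a generic distillation objective. The following is a minimal sketch, not the paper's exact formulation: the convex combination of the two terms, the `imitation_weight` parameter, and all function names are assumptions made for illustration, and the spatial mask given to the teacher is not modeled here.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def distillation_loss(student_logits, teacher_logits, label, imitation_weight=0.5):
    """Hypothetical student objective: mix of a ground-truth term and a teacher term.

    imitation_weight (assumed name) balances the two terms:
    0.0 uses only the ground-truth label, 1.0 uses only the teacher's prediction.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_term = cross_entropy(one_hot, student_probs)        # ground-truth supervision
    soft_term = cross_entropy(teacher_probs, student_probs)  # imitation of the teacher
    return (1.0 - imitation_weight) * hard_term + imitation_weight * soft_term
```

In this sketch, a student whose answer distribution agrees with both the label and the teacher receives a lower loss than one that disagrees with them, and setting `imitation_weight` to zero reduces the objective to ordinary supervised cross-entropy.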
Cite
Text
Aditya et al. "Spatial Knowledge Distillation to Aid Visual Reasoning." IEEE/CVF Winter Conference on Applications of Computer Vision, 2019. doi:10.1109/WACV.2019.00030
Markdown
[Aditya et al. "Spatial Knowledge Distillation to Aid Visual Reasoning." IEEE/CVF Winter Conference on Applications of Computer Vision, 2019.](https://mlanthology.org/wacv/2019/aditya2019wacv-spatial/) doi:10.1109/WACV.2019.00030
BibTeX
@inproceedings{aditya2019wacv-spatial,
title = {{Spatial Knowledge Distillation to Aid Visual Reasoning}},
author = {Aditya, Somak and Saha, Rudra and Yang, Yezhou and Baral, Chitta},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
year = {2019},
pages = {227-235},
doi = {10.1109/WACV.2019.00030},
url = {https://mlanthology.org/wacv/2019/aditya2019wacv-spatial/}
}