VRT-Net: Real-Time Scene Parsing via Variable Resolution Transform
Abstract
Urban scene parsing is a basic requirement for autonomous navigation systems, especially self-driving vehicles. Most available approaches employ generic image parsing architectures designed to segment object-focused scenes captured in indoor setups. However, images captured by car-mounted cameras exhibit an extreme perspective effect, causing a significant scale disparity between near and far objects. Recognizing this, we formalize a unique Variable Resolution Transform (VRT) technique motivated by foveal magnification in the human eye. We then design a Fovea Estimation Network (FEN), which is trained to estimate the single most suitable fixation location, along with the associated magnification factor, for a given input image. The proposed framework is designed to be used as a wrapper over available real-time scene parsing models, and it demonstrates a superior trade-off between speed and quality compared to the prior state of the art.
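The abstract does not give the VRT formula, so the following is only a minimal sketch of the general idea of a foveal warp: resample an image so that the region around a fixation point is magnified while the borders stay in place. The function names `foveal_map` and `warp_image`, the quadratic blend, and the nearest-neighbour sampling are all assumptions made for illustration, not the authors' method. In the framework described above, the fixation and magnification would come from the FEN; here they are plain arguments.

```python
import numpy as np


def foveal_map(u, fixation, magnification):
    """Map normalized output coordinates u in [0, 1] to source coordinates.

    Hypothetical warp (not the paper's formula): the fixation maps to itself,
    both borders are preserved, and the sampling step at the fixation is
    1/magnification of the uniform step. Assumes magnification >= 1.
    """
    u = np.asarray(u, dtype=float)
    f = float(np.clip(fixation, 1e-3, 1.0 - 1e-3))  # keep fixation off the border
    side = np.where(u >= f, 1.0 - f, f)   # distance from fixation to the border
    t = np.abs(u - f) / side              # 0 at the fixation, 1 at the border
    # Slope 1/magnification at t = 0, reaching 1 at t = 1, increasing throughout.
    g = t / magnification + (1.0 - 1.0 / magnification) * t ** 2
    return f + np.sign(u - f) * side * g


def warp_image(image, fixation_xy, magnification):
    """Resample an H x W (x C) array with nearest-neighbour foveal sampling."""
    h, w = image.shape[:2]
    fx, fy = fixation_xy  # normalized (x, y) fixation, each in [0, 1]
    rows = np.rint(foveal_map(np.linspace(0, 1, h), fy, magnification) * (h - 1))
    cols = np.rint(foveal_map(np.linspace(0, 1, w), fx, magnification) * (w - 1))
    return image[np.ix_(rows.astype(int), cols.astype(int))]
```

With `magnification=1.0` the mapping reduces to the identity, so the wrapped parsing model would see the unmodified image. A full wrapper would also need the inverse mapping to bring the predicted label map back to the original pixel grid, which is omitted here.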
Cite
Text
Kundu et al. "VRT-Net: Real-Time Scene Parsing via Variable Resolution Transform." Winter Conference on Applications of Computer Vision, 2020.
Markdown
[Kundu et al. "VRT-Net: Real-Time Scene Parsing via Variable Resolution Transform." Winter Conference on Applications of Computer Vision, 2020.](https://mlanthology.org/wacv/2020/kundu2020wacv-vrtnet/)
BibTeX
@inproceedings{kundu2020wacv-vrtnet,
title = {{VRT-Net: Real-Time Scene Parsing via Variable Resolution Transform}},
author = {Kundu, Jogendra Nath and Rajput, Gaurav Singh and Radhakrishnan, Venkatesh Babu},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2020},
url = {https://mlanthology.org/wacv/2020/kundu2020wacv-vrtnet/}
}