Scene Classification with Semantic Fisher Vectors

Abstract

With the help of a convolutional neural network (CNN) trained to recognize objects, a scene image is represented as a bag of semantics (BoS). This involves classifying image patches using the network and treating the class posterior probability vectors as locally extracted semantic descriptors. The image BoS is summarized using a Fisher vector (FV) embedding that exploits the properties of the space of these descriptors. The resulting representation is referred to as a semantic Fisher vector. Two implementations of a semantic FV are investigated. The first involves modeling the BoS with a Dirichlet mixture and computing the Fisher gradients for this model. Due to the difficulty of mixture modeling on a non-Euclidean probability simplex, this approach is shown to be unsuccessful. A second implementation is derived by interpreting the semantic descriptors as parameters of a multinomial distribution. Like the parameters of any exponential family, these can be projected into their natural parameter space. For a CNN, this is shown to be equivalent to using the inputs of its soft-max layer as patch descriptors. A semantic FV is then computed as a Gaussian mixture FV in the space of these natural parameters. This representation is shown to outperform alternatives such as FVs of features from the intermediate CNN layers or a classifier obtained by adapting (fine-tuning) the CNN. The proposed FV represents an embedding for object classification probabilities. As an image representation, therefore, it is complementary to the features obtained from a scene classification CNN. A combination of the two representations is shown to achieve state-of-the-art results on the MIT Indoor Scenes and SUN datasets.
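
The link between natural parameters and soft-max inputs can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`softmax`, `natural_params`), the random stand-in logits, and the choice of the last class as the reference category are all assumptions made here for illustration. With that reference, the natural parameters of a multinomial are log(p_k / p_K), which equal the soft-max inputs up to a per-patch additive constant.

```python
import numpy as np

def softmax(z):
    """Row-wise soft-max, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def natural_params(p, eps=1e-12):
    """Map multinomial parameters p to natural parameters log(p_k / p_K).

    Using the last class K as the reference is an assumption of this sketch;
    any reference class gives the same result up to an additive constant.
    """
    logp = np.log(np.clip(p, eps, None))
    return logp[..., :-1] - logp[..., -1:]

# Hypothetical stand-in for CNN soft-max inputs: 5 patches, 4 object classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))

p = softmax(logits)       # semantic descriptors on the probability simplex
nu = natural_params(p)    # descriptors in natural parameter space

# The same quantity computed directly from the soft-max inputs.
ref = logits[:, :-1] - logits[:, -1:]
print(np.allclose(nu, ref))
```

In this space the descriptors are no longer confined to the simplex, which is what makes a Gaussian mixture a reasonable model to fit before computing the Fisher vector.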

Cite

Text

Dixit et al. "Scene Classification with Semantic Fisher Vectors." Conference on Computer Vision and Pattern Recognition, 2015. doi:10.1109/CVPR.2015.7298916

Markdown

[Dixit et al. "Scene Classification with Semantic Fisher Vectors." Conference on Computer Vision and Pattern Recognition, 2015.](https://mlanthology.org/cvpr/2015/dixit2015cvpr-scene/) doi:10.1109/CVPR.2015.7298916

BibTeX

@inproceedings{dixit2015cvpr-scene,
  title     = {{Scene Classification with Semantic Fisher Vectors}},
  author    = {Dixit, Mandar and Chen, Si and Gao, Dashan and Rasiwasia, Nikhil and Vasconcelos, Nuno},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2015},
  doi       = {10.1109/CVPR.2015.7298916},
  url       = {https://mlanthology.org/cvpr/2015/dixit2015cvpr-scene/}
}