Representing Sets of Instances for Visual Recognition

Abstract

In computer vision, a complex entity such as an image or video is often represented as a set of instance vectors extracted from different parts of that entity. It is therefore essential to design a representation that encodes the information in a set of instances robustly. Existing methods such as the Fisher Vector (FV) and VLAD are designed from a generative perspective, and their performance fluctuates when different types of instance vectors are used (i.e., they are not robust). The proposed D3 method compares two sets as two distributions and introduces a directional total variation distance (DTVD) to measure their dissimilarity. Furthermore, a classifier-based method is proposed to estimate the DTVD robustly and to represent these sets efficiently. D3 is evaluated on action and image recognition tasks and achieves excellent robustness, accuracy, and speed.
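The core intuition behind the classifier-based estimation can be illustrated with a generic sketch: the harder it is for a linear classifier to separate two sets of instance vectors, the more similar their underlying distributions are. The snippet below is only a rough, hypothetical illustration of that idea using cross-validated accuracy as a separability proxy; it is not the paper's DTVD estimator or the D3 representation itself.

```python
# Generic sketch: classifier-based dissimilarity between two sets of
# instance vectors. NOT the paper's D3/DTVD algorithm, just the idea that
# linear separability of two sets reflects the distance between their
# distributions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def classifier_based_dissimilarity(set_a: np.ndarray, set_b: np.ndarray) -> float:
    """Estimate how dissimilar two sets of instance vectors are.

    set_a, set_b: arrays of shape (n_instances, n_dims), e.g. local
    features extracted from two images or videos.
    Returns a value in [0, 1]: near 0 when the sets are indistinguishable,
    near 1 when a linear classifier separates them almost perfectly.
    """
    X = np.vstack([set_a, set_b])
    y = np.concatenate([np.zeros(len(set_a)), np.ones(len(set_b))])
    # Cross-validated accuracy of a linear classifier trained to tell the
    # two sets apart.
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    # Map chance-level accuracy (0.5) to 0 and perfect separation (1.0) to 1.
    return max(0.0, 2.0 * acc - 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(0.0, 1.0, size=(200, 64))
    similar = rng.normal(0.0, 1.0, size=(200, 64))
    shifted = rng.normal(0.5, 1.0, size=(200, 64))
    print(classifier_based_dissimilarity(base, similar))  # close to 0
    print(classifier_based_dissimilarity(base, shifted))  # clearly larger
```

In this toy example, two sets drawn from the same Gaussian yield a dissimilarity near zero, while a mean-shifted set yields a clearly larger value, mirroring the intuition that classifier separability tracks distributional distance.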

Cite

Text

Wu et al. "Representing Sets of Instances for Visual Recognition." AAAI Conference on Artificial Intelligence, 2016. doi:10.1609/AAAI.V30I1.10184

Markdown

[Wu et al. "Representing Sets of Instances for Visual Recognition." AAAI Conference on Artificial Intelligence, 2016.](https://mlanthology.org/aaai/2016/wu2016aaai-representing/) doi:10.1609/AAAI.V30I1.10184

BibTeX

@inproceedings{wu2016aaai-representing,
  title     = {{Representing Sets of Instances for Visual Recognition}},
  author    = {Wu, Jianxin and Gao, Bin-Bin and Liu, Guoqing},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2016},
  pages     = {2237-2243},
  doi       = {10.1609/AAAI.V30I1.10184},
  url       = {https://mlanthology.org/aaai/2016/wu2016aaai-representing/}
}