Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

Abstract

This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), however with two significant differentiators to prior art: (i) we tackle all variants (inter-category, intra-category, and cross-dataset) of ZS-SBIR with just one network ("everything"), and (ii) we would really like to understand how this sketch-photo matching operates ("explainable"). Our key innovation lies in the realization that such a cross-modal matching problem can be reduced to comparisons of groups of key local patches, akin to the seasoned "bag-of-words" paradigm. With just this change, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network with three novel components: (i) a self-attention module with a learnable tokenizer to produce visual tokens that correspond to the most informative local regions, (ii) a cross-attention module to compute local correspondences between the visual tokens across the two modalities, and finally (iii) a kernel-based relation network to assemble local putative matches and produce an overall similarity metric for a sketch-photo pair. Experiments show that ours indeed delivers superior performance across all ZS-SBIR settings. The all-important explainability goal is elegantly achieved by visualizing cross-modal token correspondences and, for the first time, via sketch-to-photo synthesis by universal replacement of all matched photo patches.
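The token-matching idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `cross_attention` and `pair_similarity` are hypothetical, no parameters are learned, the visual tokens are assumed to be given (the learnable tokenizer is omitted), and a plain mean of cosine similarities stands in for the kernel-based relation network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(sketch_tokens, photo_tokens):
    """Scaled dot-product attention from sketch tokens to photo tokens.

    Returns the attention weights, shape (n_sketch, n_photo), and the
    photo content aligned to each sketch token, shape (n_sketch, d).
    Assumption: no learned query/key/value projections are used here.
    """
    d = sketch_tokens.shape[-1]
    scores = sketch_tokens @ photo_tokens.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    aligned = weights @ photo_tokens
    return weights, aligned

def pair_similarity(sketch_tokens, photo_tokens):
    """Aggregate local matches into one score for a sketch-photo pair.

    Assumption: a mean of per-token cosine similarities replaces the
    kernel-based relation network described in the abstract.
    """
    _, aligned = cross_attention(sketch_tokens, photo_tokens)
    num = (sketch_tokens * aligned).sum(axis=-1)
    den = (np.linalg.norm(sketch_tokens, axis=-1)
           * np.linalg.norm(aligned, axis=-1) + 1e-8)
    return float((num / den).mean())

# Illustrative inputs only: 4 sketch tokens and 6 photo tokens of dimension 8.
rng = np.random.default_rng(0)
sketch = rng.normal(size=(4, 8))
photo = rng.normal(size=(6, 8))
weights, _ = cross_attention(sketch, photo)
score = pair_similarity(sketch, photo)
```

Each row of `weights` indicates which photo tokens a given sketch token attends to; a matrix of this kind is what one would visualize to inspect local correspondences, which is the sense in which token-level matching lends itself to explanation.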

Cite

Text

Lin et al. "Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.02236

Markdown

[Lin et al. "Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/lin2023cvpr-zeroshot/) doi:10.1109/CVPR52729.2023.02236

BibTeX

@inproceedings{lin2023cvpr-zeroshot,
  title     = {{Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style}},
  author    = {Lin, Fengyin and Li, Mingkang and Li, Da and Hospedales, Timothy and Song, Yi-Zhe and Qi, Yonggang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {23349-23358},
  doi       = {10.1109/CVPR52729.2023.02236},
  url       = {https://mlanthology.org/cvpr/2023/lin2023cvpr-zeroshot/}
}