Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

Abstract

Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT). Lacking the guidance of semantic information, such a backbone fails to learn matched visual-semantic correspondences for representing semantic-related visual features, which results in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly considers two properties in the whole network: i) discovering the semantic-related visual representations explicitly, and ii) discarding the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement and to discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse visual tokens with low semantic-visual correspondence to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that our ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2.
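
The two operations described in the abstract (scoring visual tokens against semantic information, then fusing the low-correspondence tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_and_fuse`, the scaled dot-product scoring, the `keep` parameter, and the choice to merge all low-scoring tokens into a single weighted-average token are assumptions made for illustration. It also assumes the visual tokens and the semantic vector have already been projected into a shared space of equal dimension.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_and_fuse(tokens, semantic, keep):
    """Keep the `keep` tokens most aligned with `semantic`; fuse the rest.

    Hypothetical helper: tokens is a list of vectors, semantic is one vector
    of the same dimension. Returns the kept tokens in their original order,
    followed by one fused token if any tokens were left over.
    """
    # Assumed scoring rule: scaled dot product followed by softmax.
    scale = math.sqrt(len(semantic))
    weights = softmax([dot(t, semantic) / scale for t in tokens])

    # Rank tokens by attention weight, highest first.
    order = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    kept_idx = sorted(order[:keep])
    rest_idx = order[keep:]

    kept = [tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept

    # Assumed fusion rule: attention-weighted average of the remaining tokens.
    rest_total = sum(weights[i] for i in rest_idx)
    if rest_total == 0.0:
        # Fall back to a plain mean if the weights underflowed to zero.
        fused = [
            sum(tokens[i][d] for i in rest_idx) / len(rest_idx)
            for d in range(len(semantic))
        ]
    else:
        fused = [
            sum(weights[i] * tokens[i][d] for i in rest_idx) / rest_total
            for d in range(len(semantic))
        ]
    return kept + [fused]
```

Calling `select_and_fuse(tokens, semantic, keep=2)` on four tokens returns three: the two best-aligned tokens plus one fused token summarizing the rest. Applying such a step in several successive encoder stages would shrink the token set progressively, which is the general idea the abstract describes; the actual scoring and fusion rules in ZSLViT may differ.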

Cite

Text

Chen et al. "Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02262

Markdown

[Chen et al. "Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/chen2024cvpr-progressive/) doi:10.1109/CVPR52733.2024.02262

BibTeX

@inproceedings{chen2024cvpr-progressive,
  title     = {{Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning}},
  author    = {Chen, Shiming and Hou, Wenjin and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {23964-23974},
  doi       = {10.1109/CVPR52733.2024.02262},
  url       = {https://mlanthology.org/cvpr/2024/chen2024cvpr-progressive/}
}