Multi-Scale Activation, Selection, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition

Abstract

Given the critical role of birds in ecosystems, Fine-Grained Bird Recognition (FGBR) has gained increasing attention, particularly for distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT models hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an "Activation-Selection-Aggregation" paradigm. Specifically, we first propose a multi-scale cue activation module to ensure that the discriminative cues learned at different stages are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism to produce the final model decision. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely used FGBR benchmarks.
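The selection and aggregation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how tokens are scored or how fusion weights are computed, so the top-k rule, the per-token `scores`, the per-stage `stage_gates`, and the function names `select_top_k` and `aggregate` are all assumptions made for illustration.

```python
import math


def select_top_k(tokens, scores, k):
    """Keep the k highest-scoring tokens of one stage, in their original order.

    Assumption: each token already has a scalar importance score
    (for example, an attention weight). The abstract does not say how
    scores are obtained.
    """
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and len(tokens)")
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(stage_logits, stage_gates):
    """Fuse per-stage class logits using softmax-normalised gate values.

    Assumption: each stage contributes one scalar gate; a softmax-weighted
    sum stands in for the paper's dynamic aggregation mechanism.
    """
    weights = softmax(stage_gates)
    n_classes = len(stage_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, stage_logits))
        for c in range(n_classes)
    ]


# Hypothetical toy input: four tokens at one stage, two stages, two classes.
kept = select_top_k(["t0", "t1", "t2", "t3"], [0.1, 0.9, 0.3, 0.8], k=2)
fused = aggregate([[1.0, 0.0], [0.0, 1.0]], stage_gates=[0.0, 0.0])
```

With equal gates, each stage contributes equally to the fused prediction; a larger gate for one stage shifts the decision toward that stage's logits.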

Cite

Text

Zhang et al. "Multi-Scale Activation, Selection, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33127

Markdown

[Zhang et al. "Multi-Scale Activation, Selection, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhang2025aaai-multi-c/) doi:10.1609/AAAI.V39I10.33127

BibTeX

@inproceedings{zhang2025aaai-multi-c,
  title     = {{Multi-Scale Activation, Selection, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition}},
  author    = {Zhang, Zhicheng and Tang, Hao and Tang, Jinhui},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10385--10393},
  doi       = {10.1609/AAAI.V39I10.33127},
  url       = {https://mlanthology.org/aaai/2025/zhang2025aaai-multi-c/}
}