BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
Abstract
In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits performance degradation when faced with datasets of varying image sizes. Previous approaches tend to resize images to a fixed size or adopt structural modifications, hindering the preservation of SAM's rich prior knowledge. Moreover, such task-specific tuning necessitates complete retraining of the model, which is costly and unacceptable for deployment in downstream tasks. In this paper, we reformulate this challenge as a length extrapolation problem, where the token sequence length varies while a consistent patch size is maintained for images of different sizes. To this end, we propose a Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structural modifications. First, we introduce a new scaling factor to ensure consistent magnitude in the attention layer's dot-product values when the token sequence length changes. Second, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to significantly mitigate performance degradation in the zero-shot setting and to achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously.
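The abstract names two mechanisms: a scaling factor that keeps dot-product magnitudes stable as the token count changes, and an additive bias mask that penalizes attention to distant tokens. A minimal NumPy sketch of these two ideas is below; the function name, the log-length scaling, and the linear distance penalty (in the spirit of ALiBi-style biases) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ba_sam_style_attention(q, k, v, slope=0.1):
    """Illustrative single-head attention combining a length-aware scale
    and an additive distance-based bias mask.

    The exact scaling factor and mask shape used by BA-SAM may differ;
    this sketch only demonstrates the two ideas named in the abstract.
    """
    n, d = q.shape
    # Length-aware scaling: grows with log(n) so the magnitude of the
    # dot products stays comparable when the token sequence length changes.
    scale = np.log(n) / np.sqrt(d)
    scores = (q @ k.T) * scale

    # Bias-mode mask: a linear penalty on token distance, so each token
    # prioritizes neighboring information over untrained distant tokens.
    idx = np.arange(n)
    bias = -slope * np.abs(idx[:, None] - idx[None, :])
    scores = scores + bias

    # Numerically stable softmax over keys, then weighted sum of values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because both the scale and the bias depend only on `n` and token distance, the same module can be applied to longer sequences (larger images at a fixed patch size) without retraining, which is the length-extrapolation setting the paper targets.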
Cite
Text
Song et al. "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00305
Markdown
[Song et al. "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/song2024cvpr-basam/) doi:10.1109/CVPR52733.2024.00305
BibTeX
@inproceedings{song2024cvpr-basam,
title = {{BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model}},
author = {Song, Yiran and Zhou, Qianyu and Li, Xiangtai and Fan, Deng-Ping and Lu, Xuequan and Ma, Lizhuang},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {3162--3173},
doi = {10.1109/CVPR52733.2024.00305},
url = {https://mlanthology.org/cvpr/2024/song2024cvpr-basam/}
}