BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
Abstract
In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits performance degradation when faced with datasets of varying image sizes. Previous approaches tend to resize images to a fixed size or adopt structural modifications, hindering the preservation of SAM's rich prior knowledge. Moreover, such task-specific tuning necessitates complete retraining of the model, which is costly and unacceptable for deployment in downstream tasks. In this paper, we reformulate this challenge as a length extrapolation problem, where the token sequence length varies while a consistent patch size is maintained for images of different sizes. To this end, we propose a Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structural modifications. First, we introduce a new scaling factor to ensure consistent magnitude in the attention layer's dot-product values when the token sequence length changes. Second, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to significantly mitigate performance degradation in the zero-shot setting and to achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously.
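The abstract names two mechanisms: a scaling factor that keeps dot-product magnitudes stable as the token count changes, and an additive bias mask that penalizes attention to distant tokens. A minimal NumPy sketch of these two ideas is below; the function name, the log-length scaling, and the linear distance penalty (in the spirit of ALiBi-style biases) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ba_sam_style_attention(q, k, v, slope=0.1):
    """Illustrative single-head attention combining a length-aware scale
    and an additive distance-based bias mask.

    The exact scaling factor and mask shape used by BA-SAM may differ;
    this sketch only demonstrates the two ideas named in the abstract.
    """
    n, d = q.shape
    # Length-aware scaling: grows with log(n) so the magnitude of the
    # dot products stays comparable when the token sequence length changes.
    scale = np.log(n) / np.sqrt(d)
    scores = (q @ k.T) * scale

    # Bias-mode mask: a linear penalty on token distance, so each token
    # prioritizes neighboring information over untrained distant tokens.
    idx = np.arange(n)
    bias = -slope * np.abs(idx[:, None] - idx[None, :])
    scores = scores + bias

    # Numerically stable softmax over keys, then weighted sum of values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because both the scale and the bias depend only on `n` and token distance, the same module can be applied to longer sequences (larger images at a fixed patch size) without retraining, which is the length-extrapolation setting the paper targets.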
Cite
Text
Song et al. "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00305
Markdown
[Song et al. "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/song2024cvpr-basam/) doi:10.1109/CVPR52733.2024.00305
BibTeX
@inproceedings{song2024cvpr-basam,
title = {{BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model}},
author = {Song, Yiran and Zhou, Qianyu and Li, Xiangtai and Fan, Deng-Ping and Lu, Xuequan and Ma, Lizhuang},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {3162--3173},
doi = {10.1109/CVPR52733.2024.00305},
url = {https://mlanthology.org/cvpr/2024/song2024cvpr-basam/}
}