Which Backbone to Use: A Resource-Efficient Domain Specific Comparison for Computer Vision
Abstract
For computer vision applications on small, niche, and proprietary datasets, fine-tuning a neural network (NN) backbone pre-trained on a large dataset, such as ImageNet, is common practice. However, it is unknown whether backbones that perform well on large datasets, such as vision transformers, are also the right choice for fine-tuning on smaller custom datasets. This comprehensive analysis aims to help machine learning practitioners select the most suitable backbone for their specific problem. We systematically evaluated multiple lightweight, pre-trained backbones under consistent training settings across domains spanning natural, medical, deep-space, and remote-sensing images. We found that even though attention-based architectures are gaining popularity, they tend to perform poorly compared to CNNs when fine-tuned on small amounts of domain-specific data. We also observed that certain CNN architectures consistently outperform others when controlling for network size. Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones across a broad spectrum of computer vision domains.
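As a minimal sketch of the fine-tuning practice the abstract refers to, the following PyTorch snippet loads a lightweight ImageNet-pretrained CNN, swaps its classification head for a small custom label set, and fine-tunes all weights. The backbone choice (ResNet-18), learning rate, and NUM_CLASSES are illustrative assumptions, not the paper's evaluated configuration.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # assumption: number of classes in the custom dataset

# Load a lightweight backbone pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet head with one sized for the custom labels.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Fine-tune every parameter with a small learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(model, loader, device="cpu"):
    # loader yields (images, labels): float tensors of shape (B, 3, H, W)
    # and integer class labels
    model.train().to(device)
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()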
Cite

Text

P and Sethi. "Which Backbone to Use: A Resource-Efficient Domain Specific Comparison for Computer Vision." Transactions on Machine Learning Research, 2025.

Markdown

[P and Sethi. "Which Backbone to Use: A Resource-Efficient Domain Specific Comparison for Computer Vision." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/p2025tmlr-backbone/)

BibTeX
@article{p2025tmlr-backbone,
  title = {{Which Backbone to Use: A Resource-Efficient Domain Specific Comparison for Computer Vision}},
  author = {P, Pranav Jeevan and Sethi, Amit},
  journal = {Transactions on Machine Learning Research},
  year = {2025},
  url = {https://mlanthology.org/tmlr/2025/p2025tmlr-backbone/}
}