Efficient Test-Time Scaling for Small Vision-Language Models
Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The approach generalizes across model scales and across different VLMs without additional tuning.
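The token-level aggregation in TTAug can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule, so the code below assumes one plausible reading: at each decoding step, the next-token distributions from all augmented views are averaged and the most likely token is chosen greedily. The names `ttaug_decode`, `next_token_logits`, and `toy_model` are hypothetical and stand in for a real VLM forward pass; this is not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: maps (augmented view, tokens generated so far)
# to next-token logits. A real VLM forward pass would go here.
NextTokenLogits = Callable[[str, List[int]], Sequence[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ttaug_decode(
    views: Sequence[str],
    next_token_logits: NextTokenLogits,
    eos_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Greedy decoding that aggregates augmented views at the token level."""
    generated: List[int] = []
    for _ in range(max_new_tokens):
        # One forward pass per augmented view, all sharing the same prefix.
        per_view = [softmax(next_token_logits(v, generated)) for v in views]
        vocab = len(per_view[0])
        # Assumed aggregation rule: arithmetic mean of per-view distributions.
        mean_probs = [
            sum(p[i] for p in per_view) / len(per_view) for i in range(vocab)
        ]
        token = max(range(vocab), key=mean_probs.__getitem__)
        generated.append(token)
        if token == eos_id:
            break
    return generated


def toy_model(view: str, prefix: List[int]) -> Sequence[float]:
    """Toy stand-in for a VLM: 3-token vocabulary, token 2 is end-of-sequence."""
    if prefix:
        return [0.0, 0.0, 5.0]  # after the first token, always end
    table = {"a": [2.0, 1.0, 0.0], "b": [0.0, 3.0, 0.0], "c": [0.0, 3.0, 0.0]}
    return table[view]


result = ttaug_decode(["a", "b", "c"], toy_model, eos_id=2)
print(result)  # [1, 2]
```

In this toy run, view `"a"` alone would pick token 0, but the averaged distribution over all three views favors token 1. No parameters are updated, which matches the abstract's description of TTAug; the cost is one forward pass per view per decoding step.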
Cite
Text
Kaya et al. "Efficient Test-Time Scaling for Small Vision-Language Models." International Conference on Learning Representations, 2026.
Markdown
[Kaya et al. "Efficient Test-Time Scaling for Small Vision-Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kaya2026iclr-efficient/)
BibTeX
@inproceedings{kaya2026iclr-efficient,
title = {{Efficient Test-Time Scaling for Small Vision-Language Models}},
author = {Kaya, Mehmet Onurcan and Elliott, Desmond and Papadopoulos, Dim},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/kaya2026iclr-efficient/}
}