Analyzing the Impact of Learnable SoftMax Temperature in Contrastive Visual-Textual Alignment Systems: Benefits, Drawbacks, and Alternative Approaches
Abstract
This work does NOT read like “fabricate motivation - propose something - obtain SOTA results”. Instead, we provide an in-depth analysis of the learnable softmax temperature parameter in the practical training of contrastive visual-textual alignment models, commonly known as CLIP models. This parameter is critical for optimal system performance, yet its mechanism and potential drawbacks have been largely overlooked. Our study addresses this gap and proposes a novel solution that exploits the architecture of Vision Transformers (ViTs). We focus on the crucial role of the softmax temperature in managing noisy training data. We demonstrate that there is a balance in the gradient of the contrastive loss, with the temperature parameter acting as a distance scaling factor. If the temperature is not properly calibrated, the model struggles to align positive pairs due to numerical issues in the loss term. Conversely, a temperature that is too high can lead to unstable learning dynamics. We explore alternative approaches to mitigate this problem from a topological perspective of the contrastive loss. Ultimately, we leverage multiple class tokens embedded within the transformer architecture to present a concise solution. This configuration significantly enhances zero-shot classification performance, improving baseline CLIP models pretrained on large-scale datasets by an average of 6.1%.
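To make the "distance scaling factor" role concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a small similarity matrix. It is an illustration only, not the authors' implementation: the function name `info_nce`, the toy similarity values, and the convention that similarities are divided by `tau` are all assumptions (CLIP-style code often stores a learnable multiplier instead, and the paper's own convention may differ). The multiple-class-token solution is not shown.

```python
import math


def info_nce(sim, tau):
    """Symmetric contrastive loss for a square similarity matrix.

    sim[i][j] is the similarity between image i and text j; matching
    pairs sit on the diagonal. tau divides the similarities before the
    softmax (assumed convention, see the text above).
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / tau for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in logits))
            total += log_z - logits[i]  # cross-entropy with target i
        return total / n

    cols = [list(c) for c in zip(*sim)]  # text-to-image direction
    return 0.5 * (one_direction(sim) + one_direction(cols))


# Hypothetical 2x2 batch: diagonal entries are the positive pairs.
sim = [[0.9, 0.1], [0.2, 0.8]]
loss_sharp = info_nce(sim, tau=0.01)  # small tau stretches distances
loss_flat = info_nce(sim, tau=1.0)    # large tau compresses them
```

With the same similarities, the smaller `tau` drives the loss for this already well-separated batch toward zero, while the larger `tau` leaves it well above zero. This is one way to see that the parameter rescales the gaps between positive and negative pairs rather than changing their order.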
Cite
Text
Sun and Li. "Analyzing the Impact of Learnable SoftMax Temperature in Contrastive Visual-Textual Alignment Systems: Benefits, Drawbacks, and Alternative Approaches." Transactions on Machine Learning Research, 2024.
Markdown
[Sun and Li. "Analyzing the Impact of Learnable SoftMax Temperature in Contrastive Visual-Textual Alignment Systems: Benefits, Drawbacks, and Alternative Approaches." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/sun2024tmlr-analyzing/)
BibTeX
@article{sun2024tmlr-analyzing,
title = {{Analyzing the Impact of Learnable SoftMax Temperature in Contrastive Visual-Textual Alignment Systems: Benefits, Drawbacks, and Alternative Approaches}},
author = {Sun, Zhun and Li, Chao},
journal = {Transactions on Machine Learning Research},
year = {2024},
url = {https://mlanthology.org/tmlr/2024/sun2024tmlr-analyzing/}
}