How CNNs and ViTs Perceive Similarities Between Categories
Abstract
Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) trained for supervised tasks are the leading networks used in practical computer vision. Although they rely on different mechanisms, both continuously refine their object recognition capabilities. In this race, it is overall accuracy that matters most. But is it enough? Should we not also care about the correct perception of inter-class similarities? We believe we should, as similarity is a fundamental aspect of categorization and the structure of the world is highly correlated. Models should assess similarities reasonably to achieve more nuanced perception, and we should examine this ability for greater transparency and trust. For this reason, we analyzed what state-of-the-art object recognition networks perceive as similar. We propose a framework to visually and numerically examine and compare the perception of different trained models, and we use it to answer a series of similarity-related questions based on experiments on a large population of 42 models.
Cite
Text
Filus and Domanska. "How CNNs and ViTs Perceive Similarities Between Categories." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025. doi:10.1007/978-3-032-06078-5_2
Markdown
[Filus and Domanska. "How CNNs and ViTs Perceive Similarities Between Categories." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025.](https://mlanthology.org/ecmlpkdd/2025/filus2025ecmlpkdd-cnns/) doi:10.1007/978-3-032-06078-5_2
BibTeX
@inproceedings{filus2025ecmlpkdd-cnns,
title = {{How CNNs and ViTs Perceive Similarities Between Categories}},
author = {Filus, Katarzyna and Domanska, Joanna},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2025},
pages = {22--40},
doi = {10.1007/978-3-032-06078-5_2},
url = {https://mlanthology.org/ecmlpkdd/2025/filus2025ecmlpkdd-cnns/}
}