A Novel Perspective for Multi-Modal Multi-Label Skin Lesion Classification
Abstract
The efficacy of deep learning-based Computer-Aided Diagnosis (CAD) methods for skin diseases relies on analyzing multiple data modalities (i.e., clinical images, dermoscopic images, and patient metadata) and on addressing the challenges of multi-label classification. Current approaches tend to rely on limited multi-modal techniques and treat the multi-label problem as multiple multi-class problems, overlooking issues related to imbalanced learning and multi-label correlation. This paper introduces a Skin Lesion Classifier that uses a Multi-modal Multi-label TransFormer-based model (SkinM2Former). For multi-modal analysis, we introduce the Tri-Modal Cross-attention Transformer (TMCT), which fuses the three modalities (the two image types and the metadata) at various feature levels of a transformer encoder. For multi-label classification, we introduce a multi-head attention (MHA) module that learns multi-label correlations, complemented by an optimisation that handles the multi-label and imbalanced learning problems. SkinM2Former achieves a mean average accuracy of 77.27% and a mean diagnostic accuracy of 77.85% on the public Derm7pt dataset, outperforming state-of-the-art (SOTA) methods.
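To make the idea of cross-attention fusion across three modalities concrete, the following is a minimal sketch in plain Python. It is not the paper's TMCT: the abstract does not specify the architecture, so everything here is an assumption for illustration. In particular, the function names (`cross_attention`, `fuse_three`), the choice to let each modality attend to the tokens of the other two, the residual addition, and the omission of learned projections, multiple heads, and multi-level fusion are all hypothetical simplifications.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention with no learned projections.

    Illustrative assumption: the context tokens serve as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query token to every context token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)])
    return out


def fuse_three(clinical, dermoscopic, metadata):
    """Hypothetical fusion: each modality attends to the other two.

    The attended features are added back to the original tokens (residual).
    """
    mods = {"clinical": clinical, "dermoscopic": dermoscopic, "metadata": metadata}
    fused = {}
    for name, tokens in mods.items():
        others = [t for other, toks in mods.items() if other != name for t in toks]
        attended = cross_attention(tokens, others)
        fused[name] = [[a + b for a, b in zip(tok, att)]
                       for tok, att in zip(tokens, attended)]
    return fused


if __name__ == "__main__":
    # Toy 2-dimensional token features, made up for the example.
    result = fuse_three(
        clinical=[[1.0, 0.0], [0.0, 1.0]],
        dermoscopic=[[0.5, 0.5]],
        metadata=[[0.2, 0.8]],
    )
    for name, tokens in result.items():
        print(name, tokens)
```

Each modality keeps its own token count and feature dimension after fusion, so a step like this could in principle be repeated at several encoder depths, which is one plausible reading of "at various feature levels".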
Cite
Text
Zhang et al. "A Novel Perspective for Multi-Modal Multi-Label Skin Lesion Classification." Winter Conference on Applications of Computer Vision, 2025.
Markdown
[Zhang et al. "A Novel Perspective for Multi-Modal Multi-Label Skin Lesion Classification." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/zhang2025wacv-novel/)
BibTeX
@inproceedings{zhang2025wacv-novel,
title = {{A Novel Perspective for Multi-Modal Multi-Label Skin Lesion Classification}},
author = {Zhang, Yuan and Xie, Yutong and Wang, Hu and Avery, Jodie C and Hull, M Louise and Carneiro, Gustavo},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2025},
pages = {3549--3558},
url = {https://mlanthology.org/wacv/2025/zhang2025wacv-novel/}
}