TY - JOUR
T1 - Explainable Disease Classification
T2 - Exploring Grad-CAM Analysis of CNNs and ViTs
AU - Alqutayfi, Ali
AU - Almattar, Wadha
AU - Al-Azani, Sadam
AU - Khan, Fakhri Alam
AU - Al Qahtani, Abdullah
AU - Alageel, Solaiman
AU - Alzahrani, Mohammed
N1 - Publisher Copyright: © 2025 by the authors.
PY - 2025
Y1 - 2025
N2 - Deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), are playing an increasingly crucial role in early diagnosis and treatment across medical fields. As these AI models are integrated into clinical practice, the need for explainable AI tools, like Gradient-weighted Class Activation Mapping (Grad-CAM), becomes paramount to building clinician trust and ensuring the reliability of AI-driven diagnoses. However, a gap exists in the literature regarding comprehensive, quantitative, and qualitative comparisons of CNN and ViT performance across diverse medical imaging tasks, particularly those involving variations in object scale. This study compares CNN-based and ViT-based models for two medical imaging tasks: diabetic retinopathy detection from fundus images (small objects) and pneumonia detection from chest X-rays (large objects). We evaluate popular CNN architectures (ResNet, EfficientNet, VGG, Inception) and ViT models (ViT-Base, ViT-Large, ViT-Huge), using both quantitative metrics and expert qualitative assessments. We also analyze Grad-CAM’s effectiveness for visualizing regions of interest in these models. Our results show that ViT-Large outperforms other models on X-rays, while EfficientNet excels on fundus images. However, Grad-CAM struggles to highlight small regions of interest, particularly in diabetic retinopathy, revealing a limitation in current explainable AI methods. This work underscores the need for optimization of explainability tools and contributes to a better understanding of CNN and ViT strengths in medical imaging.
KW - Convolutional Neural Network (CNN)
KW - Gradient-weighted Class Activation Mapping (Grad-CAM)
KW - Vision Transformer (ViT)
KW - explainable AI
KW - medical imaging
UR - https://www.scopus.com/pages/publications/85219145592
U2 - 10.12720/jait.16.2.264-273
DO - 10.12720/jait.16.2.264-273
M3 - Article
AN - SCOPUS:85219145592
SN - 1798-2340
VL - 16
SP - 264
EP - 273
JO - Journal of Advances in Information Technology
JF - Journal of Advances in Information Technology
IS - 2
ER -