Explainable Disease Classification: Exploring Grad-CAM Analysis of CNNs and ViTs

Ali Alqutayfi; Wadha Almattar; Sadam Al-Azani; Fakhri Alam Khan; Abdullah Al Qahtani; Solaiman Alageel; Mohammed Alzahrani

doi:10.12720/jait.16.2.264-273

Ali Alqutayfi^*, Wadha Almattar^*, Sadam Al-Azani, Fakhri Alam Khan, Abdullah Al Qahtani, Solaiman Alageel, Mohammed Alzahrani

^*Corresponding author for this work

Ophthalmology Department

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), are playing an increasingly crucial role in early diagnosis and treatment across medical fields. As these AI models are integrated into clinical practice, the need for explainable AI tools, like Gradient-weighted Class Activation Mapping (Grad-CAM), becomes paramount to building clinician trust and ensuring the reliability of AI-driven diagnoses. However, a gap exists in the literature regarding comprehensive, quantitative, and qualitative comparisons of CNN and ViT performance across diverse medical imaging tasks, particularly those involving variations in object scale. This study compares CNN-based and ViT-based models for two medical imaging tasks: diabetic retinopathy detection from fundus images (small objects) and pneumonia detection from chest X-rays (large objects). We evaluate popular CNN architectures (ResNet, EfficientNet, VGG, Inception) and ViT models (ViT-Base, ViT-Large, ViT-Huge), using both quantitative metrics and expert qualitative assessments. We also analyze Grad-CAM’s effectiveness for visualizing regions of interest in these models. Our results show that ViT-Large outperforms other models on X-rays, while EfficientNet excels on fundus images. However, Grad-CAM struggles to highlight small regions of interest, particularly in diabetic retinopathy, revealing a limitation in current explainable AI methods. This work underscores the need for optimization of explainability tools and contributes to a better understanding of CNN and ViT strengths in medical imaging.

Original language	English
Pages (from-to)	264-273
Number of pages	10
Journal	Journal of Advances in Information Technology
Volume	16
Issue number	2
DOIs	https://doi.org/10.12720/jait.16.2.264-273
State	Published - 2025

Keywords

Convolutional Neural Network (CNN)
Gradient-weighted Class Activation Mapping (Grad-CAM)
Vision Transformer (ViT)
explainable AI
medical imaging

Cite this

@article{39d1ecdc63384be4a01f6b24a21ddf22,

title = "Explainable Disease Classification: Exploring Grad-CAM Analysis of CNNs and ViTs",

abstract = "Deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), are playing an increasingly crucial role in early diagnosis and treatment across medical fields. As these AI models are integrated into clinical practice, the need for explainable AI tools, like Gradient-weighted Class Activation Mapping (Grad-CAM), becomes paramount to building clinician trust and ensuring the reliability of AI-driven diagnoses. However, a gap exists in the literature regarding comprehensive, quantitative, and qualitative comparisons of CNN and ViT performance across diverse medical imaging tasks, particularly those involving variations in object scale. This study compares CNN-based and ViT-based models for two medical imaging tasks: diabetic retinopathy detection from fundus images (small objects) and pneumonia detection from chest X-rays (large objects). We evaluate popular CNN architectures (ResNet, EfficientNet, VGG, Inception) and ViT models (ViT-Base, ViT-Large, ViT-Huge), using both quantitative metrics and expert qualitative assessments. We also analyze Grad-CAM{\textquoteright}s effectiveness for visualizing regions of interest in these models. Our results show that ViT-Large outperforms other models on X-rays, while EfficientNet excels on fundus images. However, Grad-CAM struggles to highlight small regions of interest, particularly in diabetic retinopathy, revealing a limitation in current explainable AI methods. This work underscores the need for optimization of explainability tools and contributes to a better understanding of CNN and ViT strengths in medical imaging.",

keywords = "Convolutional Neural Network (CNN), Gradient-weighted Class Activation Mapping (Grad-CAM), Vision Transformer (ViT), explainable AI, medical imaging",

author = "Ali Alqutayfi and Wadha Almattar and Sadam Al-Azani and Khan, {Fakhri Alam} and {Al Qahtani}, Abdullah and Solaiman Alageel and Mohammed Alzahrani",

note = "Publisher Copyright: {\textcopyright} 2025 by the authors.",

year = "2025",

doi = "10.12720/jait.16.2.264-273",

language = "English",

volume = "16",

pages = "264--273",

journal = "Journal of Advances in Information Technology",

issn = "1798-2340",

number = "2",

}

TY - JOUR

T1 - Explainable Disease Classification

T2 - Exploring Grad-CAM Analysis of CNNs and ViTs

AU - Alqutayfi, Ali

AU - Almattar, Wadha

AU - Al-Azani, Sadam

AU - Khan, Fakhri Alam

AU - Al Qahtani, Abdullah

AU - Alageel, Solaiman

AU - Alzahrani, Mohammed

PY - 2025

Y1 - 2025

N2 - Deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), are playing an increasingly crucial role in early diagnosis and treatment across medical fields. As these AI models are integrated into clinical practice, the need for explainable AI tools, like Gradient-weighted Class Activation Mapping (Grad-CAM), becomes paramount to building clinician trust and ensuring the reliability of AI-driven diagnoses. However, a gap exists in the literature regarding comprehensive, quantitative, and qualitative comparisons of CNN and ViT performance across diverse medical imaging tasks, particularly those involving variations in object scale. This study compares CNN-based and ViT-based models for two medical imaging tasks: diabetic retinopathy detection from fundus images (small objects) and pneumonia detection from chest X-rays (large objects). We evaluate popular CNN architectures (ResNet, EfficientNet, VGG, Inception) and ViT models (ViT-Base, ViT-Large, ViT-Huge), using both quantitative metrics and expert qualitative assessments. We also analyze Grad-CAM’s effectiveness for visualizing regions of interest in these models. Our results show that ViT-Large outperforms other models on X-rays, while EfficientNet excels on fundus images. However, Grad-CAM struggles to highlight small regions of interest, particularly in diabetic retinopathy, revealing a limitation in current explainable AI methods. This work underscores the need for optimization of explainability tools and contributes to a better understanding of CNN and ViT strengths in medical imaging.

KW - Convolutional Neural Network (CNN)

KW - Gradient-weighted Class Activation Mapping (Grad-CAM)

KW - Vision Transformer (ViT)

KW - explainable AI

KW - medical imaging

UR - https://www.scopus.com/pages/publications/85219145592

U2 - 10.12720/jait.16.2.264-273

DO - 10.12720/jait.16.2.264-273

M3 - Article

AN - SCOPUS:85219145592

SN - 1798-2340

VL - 16

SP - 264

EP - 273

JO - Journal of Advances in Information Technology

JF - Journal of Advances in Information Technology

IS - 2

ER -
