Evaluating ChatGPT’s adherence to evidence-based heart failure guidelines: a comparative analysis using the 2023 ESC and 2022 ACC/AHA/HFSA recommendations

Mohammed Jassim Alnuwaysir; Abdullah Shaker Aljama; Abdulkarim Kassim Abdulgalil Galib; Wafa Ali Aldawood; Farah Nedal Taiseer AlRatrout; Reem S. AlSulaiman; Kawthar Ali AlNasser; Mohammed Taha Al-Hariri

doi:10.1080/00015385.2026.2668803

Mohammed Jassim Alnuwaysir, Abdullah Shaker Aljama, Abdulkarim Kassim Abdulgalil Galib, Wafa Ali Aldawood, Farah Nedal Taiseer AlRatrout, Reem S. AlSulaiman, Kawthar Ali AlNasser, Mohammed Taha Al-Hariri^*

^*Corresponding author for this work

Physiology Department

Imam Abdulrahman Bin Faisal University

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Heart failure (HF) remains a major cause of morbidity and mortality worldwide. Large language models (LLMs) such as ChatGPT are emerging as potential clinical decision support tools, but their adherence to specialty guidelines is not well characterised. Objectives: To evaluate the accuracy and guideline concordance of ChatGPT-5 in managing real-world HF scenarios compared with the 2023 European Society of Cardiology (ESC) and 2022 American College of Cardiology (ACC)/American Heart Association (AHA)/Heart Failure Society of America (HFSA) recommendations. Methods: Thirty-eight anonymised HF clinical vignettes spanning reduced, mildly reduced, and preserved ejection fraction phenotypes and varied New York Heart Association (NYHA) classes were presented to ChatGPT-5. Two board-certified cardiologists independently graded each response for concordance with guideline recommendations using a 4-point scale (3 = fully concordant, 2 = partially concordant, 1 = discordant, 0 = unsafe/harmful). Discrepancies were adjudicated by a third reviewer. Descriptive statistics summarised performance and inter-rater agreement. Results: Of the 38 responses, 20 (53%) were fully concordant, 4 (11%) partially concordant, 8 (21%) discordant, and 6 (16%) unsafe/harmful. Most inaccuracies involved vague drug titration guidance, incomplete device therapy recommendations, or omission of guideline-directed medical therapy (GDMT). Unsafe suggestions occurred in complex device or advanced therapy decisions. Inter-rater agreement was high. Conclusions: ChatGPT-5 showed moderate concordance with ESC and ACC/AHA/HFSA HF guidelines, indicating potential value as a tool for knowledge synthesis and preliminary clinical support. However, its outputs require expert validation, and safe clinical integration will depend on future models incorporating guideline-based frameworks, real-time data, and rigorous physician oversight.
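The reported percentages can be rechecked from the counts given in the Results. The following is a minimal sketch, not the authors' analysis code: the names `GRADE_LABELS`, `reported_counts`, and `summarise` are hypothetical, and rounding to the nearest whole percent is an assumption that happens to reproduce the published figures.

```python
# Grade labels from the 4-point scale described in the Methods.
GRADE_LABELS = {
    3: "fully concordant",
    2: "partially concordant",
    1: "discordant",
    0: "unsafe/harmful",
}

# Counts per grade as reported in the Results (38 responses in total).
reported_counts = {3: 20, 2: 4, 1: 8, 0: 6}


def summarise(counts):
    """Return {label: (count, whole-number percentage)} for each grade."""
    total = sum(counts.values())
    return {
        GRADE_LABELS[grade]: (n, round(100 * n / total))
        for grade, n in counts.items()
    }


summary = summarise(reported_counts)
total_responses = sum(reported_counts.values())
percent_sum = sum(pct for _, pct in summary.values())

for label, (n, pct) in summary.items():
    print(f"{label}: {n}/{total_responses} ({pct}%)")
```

Because each percentage is rounded independently, the four values add up to 101 rather than 100; this is a rounding effect, not a discrepancy in the counts.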

Original language	English
Journal	Acta Cardiologica
DOIs	https://doi.org/10.1080/00015385.2026.2668803
State	Accepted/In press - 2026

Keywords

ChatGPT-5
clinical vignettes
Heart failure

Access to Document

10.1080/00015385.2026.2668803

Cite this

Alnuwaysir, M. J., Aljama, A. S., Galib, A. K. A., Aldawood, W. A., AlRatrout, F. N. T., AlSulaiman, R. S., AlNasser, K. A., & Al-Hariri, M. T. (Accepted/In press). Evaluating ChatGPT’s adherence to evidence-based heart failure guidelines: a comparative analysis using the 2023 ESC and 2022 ACC/AHA/HFSA recommendations. Acta Cardiologica. https://doi.org/10.1080/00015385.2026.2668803

@article{9fab0b2ae8634b2fb177f80d7395c4e9,

title = "Evaluating ChatGPT{\textquoteright}s adherence to evidence-based heart failure guidelines: a comparative analysis using the 2023 ESC and 2022 ACC/AHA/HFSA recommendations",

abstract = "Background: Heart failure (HF) remains a major cause of morbidity and mortality worldwide. Large language models (LLMs) such as ChatGPT are emerging as potential clinical decision support tools, but their adherence to specialty guidelines is not well characterised. Objectives: To evaluate the accuracy and guideline concordance of ChatGPT-5 in managing real-world HF scenarios compared with the 2023 European Society of Cardiology (ESC) and 2022 American College of Cardiology (ACC)/American Heart Association (AHA)/Heart Failure Society of America (HFSA) recommendations. Methods: Thirty-eight anonymised HF clinical vignettes spanning reduced, mildly reduced, and preserved ejection fraction phenotypes and varied New York Heart Association (NYHA) classes were presented to ChatGPT-5. Two board-certified cardiologists independently graded each response for concordance with guideline recommendations using a 4-point scale (3 = fully concordant, 2 = partially concordant, 1 = discordant, 0 = unsafe/harmful). Discrepancies were adjudicated by a third reviewer. Descriptive statistics summarised performance and inter-rater agreement. Results: Of the 38 responses, 20 (53\%) were fully concordant, 4 (11\%) partially concordant, 8 (21\%) discordant, and 6 (16\%) unsafe/harmful. Most inaccuracies involved vague drug titration guidance, incomplete device therapy recommendations, or omission of guideline-directed medical therapy (GDMT). Unsafe suggestions occurred in complex device or advanced therapy decisions. Inter-rater agreement was high. Conclusions: ChatGPT-5 showed moderate concordance with ESC and ACC/AHA/HFSA HF guidelines, indicating potential value as a tool for knowledge synthesis and preliminary clinical support. However, its outputs require expert validation, and safe clinical integration will depend on future models incorporating guideline-based frameworks, real-time data, and rigorous physician oversight.",

keywords = "ChatGPT-5, clinical vignettes, Heart failure",

author = "Alnuwaysir, {Mohammed Jassim} and Aljama, {Abdullah Shaker} and Galib, {Abdulkarim Kassim Abdulgalil} and Aldawood, {Wafa Ali} and AlRatrout, {Farah Nedal Taiseer} and AlSulaiman, {Reem S.} and AlNasser, {Kawthar Ali} and Al-Hariri, {Mohammed Taha}",

note = "Publisher Copyright: {\textcopyright} 2026 Belgian Society of Cardiology.",

year = "2026",

doi = "10.1080/00015385.2026.2668803",

language = "English",

journal = "Acta Cardiologica",

issn = "0001-5385",

}

TY - JOUR

T1 - Evaluating ChatGPT’s adherence to evidence-based heart failure guidelines

T2 - a comparative analysis using the 2023 ESC and 2022 ACC/AHA/HFSA recommendations

AU - Alnuwaysir, Mohammed Jassim

AU - Aljama, Abdullah Shaker

AU - Galib, Abdulkarim Kassim Abdulgalil

AU - Aldawood, Wafa Ali

AU - AlRatrout, Farah Nedal Taiseer

AU - AlSulaiman, Reem S.

AU - AlNasser, Kawthar Ali

AU - Al-Hariri, Mohammed Taha

PY - 2026

Y1 - 2026

AB - Background: Heart failure (HF) remains a major cause of morbidity and mortality worldwide. Large language models (LLMs) such as ChatGPT are emerging as potential clinical decision support tools, but their adherence to specialty guidelines is not well characterised. Objectives: To evaluate the accuracy and guideline concordance of ChatGPT-5 in managing real-world HF scenarios compared with the 2023 European Society of Cardiology (ESC) and 2022 American College of Cardiology (ACC)/American Heart Association (AHA)/Heart Failure Society of America (HFSA) recommendations. Methods: Thirty-eight anonymised HF clinical vignettes spanning reduced, mildly reduced, and preserved ejection fraction phenotypes and varied New York Heart Association (NYHA) classes were presented to ChatGPT-5. Two board-certified cardiologists independently graded each response for concordance with guideline recommendations using a 4-point scale (3 = fully concordant, 2 = partially concordant, 1 = discordant, 0 = unsafe/harmful). Discrepancies were adjudicated by a third reviewer. Descriptive statistics summarised performance and inter-rater agreement. Results: Of the 38 responses, 20 (53%) were fully concordant, 4 (11%) partially concordant, 8 (21%) discordant, and 6 (16%) unsafe/harmful. Most inaccuracies involved vague drug titration guidance, incomplete device therapy recommendations, or omission of guideline-directed medical therapy (GDMT). Unsafe suggestions occurred in complex device or advanced therapy decisions. Inter-rater agreement was high. Conclusions: ChatGPT-5 showed moderate concordance with ESC and ACC/AHA/HFSA HF guidelines, indicating potential value as a tool for knowledge synthesis and preliminary clinical support. However, its outputs require expert validation, and safe clinical integration will depend on future models incorporating guideline-based frameworks, real-time data, and rigorous physician oversight.

KW - ChatGPT-5

KW - clinical vignettes

KW - Heart failure

UR - https://www.scopus.com/pages/publications/105038093098

U2 - 10.1080/00015385.2026.2668803

DO - 10.1080/00015385.2026.2668803

M3 - Article

AN - SCOPUS:105038093098

SN - 0001-5385

JO - Acta Cardiologica

JF - Acta Cardiologica

ER -
