The Efficacy of GPT-4 in Systematic Reviews within the Cardiovascular Disease Domain

Authors
Mizuno A1, Ota E
1St. Luke's International Hospital, Tokyo, Japan
Abstract
"Background:
The surge in cardiovascular disease research, especially randomized controlled trials (RCTs), demands innovative systematic review methodologies. The complexity of cardiovascular diseases requires extensive screening efforts, which strains the efficiency of traditional review processes. Recent developments in generative AI, such as GPT models, show promise in streamlining systematic review screening.

Objectives:
This study aims to assess the efficiency of GPT models in enhancing screening processes within cardiovascular disease research.

Methods:
In the first part of our two-part study, we assessed the performance of four versions of GPT models (GPT3.5 and GPT4) on an artificial dataset comprising significant heart failure RCTs and epidemiological studies. We compared the models on accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). In the second part, we used the initial screening data from previously published systematic reviews on heart failure, covering 5,148 papers, to compare the performance of two GPT4 versions.
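The five comparison metrics can be derived from a 2x2 table of model decisions against reference screening decisions. The following is a minimal sketch, not the authors' analysis code: the names `ScreeningCounts` and `screening_metrics` are hypothetical, and the counts in the example are invented purely for illustration and are not study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Confusion-matrix counts for a screening model (hypothetical helper)."""
    tp: int  # relevant papers the model included
    fp: int  # irrelevant papers the model included
    tn: int  # irrelevant papers the model excluded
    fn: int  # relevant papers the model excluded


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def screening_metrics(c):
    """Compute the five metrics named in the Methods from the counts."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": safe_ratio(c.tp + c.tn, total),
        "sensitivity": safe_ratio(c.tp, c.tp + c.fn),
        "specificity": safe_ratio(c.tn, c.tn + c.fp),
        "ppv": safe_ratio(c.tp, c.tp + c.fp),
        "npv": safe_ratio(c.tn, c.tn + c.fn),
    }


# Invented counts for illustration only (not taken from the study).
example = ScreeningCounts(tp=8, fp=0, tn=10, fn=2)
metrics = screening_metrics(example)
for name, value in metrics.items():
    print(name, value)
```

With no false positives, specificity and PPV are both 1.0 whatever the number of missed papers, so a model can score perfectly on those two metrics while still having limited sensitivity.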

Results:
In the first analysis, we evaluated 404 papers, of which 319 (79%) were pertinent studies, including 299 (74%) RCTs specifically addressing heart failure. The GPT3.5_0125 model achieved an accuracy of 94.6%, while the GPT4 versions demonstrated 100% specificity and PPV, affirming their precision in identifying relevant research. However, in screening 5,148 articles derived from established systematic reviews, GPT4_1106 and GPT4_0125 achieved accuracies of 56.6% and 58.3%, respectively, with sensitivities around 17% and specificities above 98.6%. The only missed papers were those without explicit mentions of heart failure in the abstracts. These were primarily prevention-focused articles whose populations did include patients with NYHA class II or higher heart failure, but confirming this required full-text screening. This underscores how nuanced such studies are for abstract-level screening.
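Overall accuracy is a prevalence-weighted average of sensitivity and specificity, which is why accuracy can sit near 57% even when specificity exceeds 98%. The sketch below illustrates that identity with the rounded figures reported above. The function names are hypothetical, and the implied share of relevant articles is a back-of-envelope estimate from rounded inputs, not a value reported by the study.

```python
def accuracy_from(sensitivity, specificity, prevalence):
    """Accuracy as a prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


def implied_prevalence(accuracy, sensitivity, specificity):
    """Solve the identity above for prevalence (share of relevant articles)."""
    if sensitivity == specificity:
        raise ValueError("prevalence is not identifiable when the two are equal")
    return (specificity - accuracy) / (specificity - sensitivity)


# Rounded inputs from the Results (GPT4_1106); treat the output as approximate.
approx_prevalence = implied_prevalence(
    accuracy=0.566, sensitivity=0.17, specificity=0.986
)
print(round(approx_prevalence, 3))

# Round trip: plugging the estimate back in recovers the reported accuracy.
print(round(accuracy_from(0.17, 0.986, approx_prevalence), 3))
```

Under these rounded inputs, roughly half of the screened articles would have to be relevant for the reported accuracy to hold, so missed relevant papers weigh heavily on the overall figure.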

Conclusions:
The GPT-4 models show promise in streamlining the initial screening process for heart failure systematic reviews, marked by high specificity and PPV. However, they warrant fine-tuning to capture a broader spectrum of studies, especially in primary prevention. Integrating GPT-4 into systematic review methodologies could be an important step towards reducing manual screening burdens in medical research."