AI-driven assessment of the application of GRADE evidence-to-decision frameworks in German oncological guidelines

Authors
Follmann M1, Jacobs A1, Langer T1, Wenzel G1
1German Cancer Society, Berlin, Germany
Abstract
Background: The application of evidence-to-decision (EtD) frameworks can help establish clear decision-making rationales for the formulation of recommendations. For guidelines in the German Guideline Program in Oncology (GGPO), the application of EtDs is not mandatory. With the advent of artificial intelligence (AI), there is potential to standardize and support this process, enhancing consistency and reliability in guideline development.

Objectives: This study investigates the reliability of AI-driven assessments of EtD framework application in oncological guidelines, in comparison with those made by human evaluators. The aim is to determine the feasibility of using AI to assist guideline authors in adhering to EtD criteria during the formulation of recommendations and associated explanatory texts.

Methods: Thirty consensus-based recommendations were randomly extracted from the 34 currently published guidelines of the GGPO, stratified by diagnosis and treatment. Three ChatGPT instances and two human assessors evaluated these recommendations against the EtD criteria of Dewidar et al. (2022) on a 5-point scale. Intra-rater (IaRR) and inter-rater reliability (IeRR) were measured using intraclass correlation coefficients (ICC).
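The reliability computation can be sketched as follows. This is an illustrative example with hypothetical ratings, not the study's data, and it assumes a two-way random-effects, absolute-agreement, single-rater model, ICC(2,1) in the Shrout and Fleiss taxonomy; the abstract does not specify which ICC model was used.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_targets, k_raters) matrix, e.g. recommendations x raters.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)  # per recommendation
    col_means = ratings.mean(axis=0)  # per rater
    # Partition the total sum of squares into rows, columns, and error
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)            # between-targets mean square
    ms_c = ss_cols / (k - 1)            # between-raters mean square
    ms_e = ss_err / ((n - 1) * (k - 1)) # residual mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical 5-point scores for 5 recommendations by 3 raters
scores = np.array([[3, 3, 2],
                   [1, 2, 1],
                   [4, 4, 5],
                   [2, 2, 2],
                   [5, 4, 4]])
print(round(icc2_1(scores), 3))  # → 0.868
```

In practice a library such as pingouin (`pingouin.intraclass_corr`) would also report confidence intervals, as quoted in the Results.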

Results: Overall, the mean assessment score (AS) was 2.71 for the human assessors versus 1.73 for the AI. Stratified by EtD criterion, the AS ranged from 1.00 to 3.87 for the human assessors and from 1.15 to 2.60 for the AI. The IaRR for the AI was 0.755 (95% CI 0.53 to 0.87). The IeRR between the human raters was 0.37 (95% CI -0.32 to 0.70); the IeRR between humans and the AI was 0.09 (95% CI -0.11 to 0.36).

Conclusion: Congruence of ratings across the AI runs was high, indicating stable EtD assessments. The low congruence between the human raters was unexpected, represents a limiting factor of this study, and may drive the low comparability between AI and humans. As a next step, and before a conclusive proposal for integrating AI into an EtD-based guideline development workflow is made, we plan to explore the causes of the low inter-rater reliability between the human assessors.