Clinical practice guideline evaluation with large language models: a comparative analysis using the AGREE II instrument

Authors
Chen Y1, Luo X2, Wang B2
1Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences, School of Basic Medical Sciences, Lanzhou University, Lanzhou City, Gansu Province, China; Key Laboratory of Evidence Based Medicine of Gansu Province, Lanzhou City, Gansu Province, China; WHO Collaborating Centre for Guideline Implementation and Knowledge Translation, Lanzhou City, Gansu Province, China
2Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou City, Gansu Province, China
Abstract
Background:
Large language models (LLMs) have been studied extensively across various stages of clinical practice guideline development, including biomedical retrieval, information extraction, and clinical decision support, demonstrating their capacity to handle complex medical tasks. However, whether LLMs can evaluate the methodological quality of clinical practice guidelines using the Appraisal of Guidelines for Research and Evaluation II (AGREE II) instrument remains unclear.

Objectives:
As an international standard for assessing the quality of clinical practice guidelines, AGREE II plays a crucial role in enhancing guideline reliability. This study investigates the feasibility and effectiveness of LLMs, specifically ChatGPT 4, in evaluating the quality of clinical practice guidelines with the AGREE II instrument.

Methods:
This study is based on an article published in JAMA Network Open by Manuel M. Montero-Odasso et al, which evaluated 15 clinical practice guidelines with AGREE II. Using the article's evaluation results as a benchmark, ChatGPT 4 independently evaluated the same 15 guidelines against the 23 items of the AGREE II instrument. Each guideline was evaluated 3 times to ensure the consistency and reliability of the results, and percentage scores across the 6 AGREE II domains were then calculated for each guideline. Differences in domain scores between the 2 groups (original evaluation results vs ChatGPT 4 evaluation results) were compared using paired-sample t-tests or Wilcoxon signed-rank tests, and Cohen's d was calculated to quantify the practical significance of the differences between the 2 groups' evaluation results.
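The paired comparison described above can be sketched in a few lines. The following is a minimal illustration (not the study's actual analysis code, which is not reported in the abstract) of Cohen's d for paired samples and the paired-sample t statistic; the domain scores below are hypothetical placeholders, not data from the study. In practice a statistics library such as scipy would typically be used.

```python
import math
from statistics import mean, stdev

def paired_cohens_d(before, after):
    # Cohen's d for paired samples: mean of the pairwise
    # differences divided by their standard deviation.
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / stdev(diffs)

def paired_t_statistic(before, after):
    # Paired-sample t statistic: t = mean(diff) / (sd(diff) / sqrt(n)).
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical AGREE II domain percentage scores for the 15 guidelines,
# for illustration only (original authors vs ChatGPT 4).
original = [85, 72, 90, 64, 78, 81, 69, 88, 75, 92, 70, 83, 77, 86, 68]
chatgpt = [78, 65, 84, 60, 70, 76, 63, 80, 71, 85, 66, 79, 72, 80, 61]

d = paired_cohens_d(original, chatgpt)
t = paired_t_statistic(original, chatgpt)
print(f"Cohen's d = {d:.2f}, t = {t:.2f} (df = {len(original) - 1})")
```

Note that for paired data the two quantities are linked: t equals d multiplied by the square root of the number of pairs, so the t-test and the effect size are computed from the same vector of pairwise differences.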

Results:
ChatGPT 4's overall evaluations of the guidelines scored lower than those in the original article, with the greatest discrepancy observed in Domain 6 (Editorial Independence). The differences in scores between the 2 groups across all AGREE II domains will be presented at the conference using difference-mean plots and other methods.

Conclusions:
Using ChatGPT 4 to assess the quality of clinical practice guidelines against the AGREE II tool holds potential for practical application. Further research is needed to optimize LLMs for this task, providing new avenues for conducting quality assessments of clinical guidelines.