Using large language models to assess the compliance of randomized controlled trials on AI interventions with CONSORT-AI: a cross-sectional survey

Article type
Authors
Bian Z1, Chen F2, Chen Y3, Li Z4, Luo X3, on behalf of the ADVANCED working group5, Yang L1, Zhang4
1School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
2School of Information Science & Engineering, Lanzhou University, Lanzhou City, Gansu, China
3Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou City, Gansu, China
4Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
5ADVANCED working group
Abstract
Background
Chatbots based on large language models have shown promise in evaluating the compliance of research reporting. Previously, researchers used ChatGPT to assess whether randomized controlled trial (RCT) abstracts adhered to the CONSORT-Abstract guidelines. However, whether large language models can assess the compliance of RCTs on AI interventions with the CONSORT-AI standards remains unclear.

Objectives
To assess the compliance of RCTs on AI interventions with CONSORT-AI using chatbots based on large language models.

Methods
We employed GPT-4 as an example to assess the compliance of RCTs on AI interventions. The sample was drawn from an article by Deborah Plana et al., published in JAMA Network Open, which included a total of 41 RCTs. To ensure error-free text extraction, all PDF documents were converted to Microsoft Word format, and the same prompt was then provided across different models to evaluate their effectiveness. We calculated the recall for each chatbot, using a prompt to score each RCT against every item in the CONSORT-AI checklist.
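As an illustration of the recall calculation described above, the following minimal Python sketch compares a chatbot's item-level judgments against the human gold standard, treating "reported" as the positive class. The variable names and example data are hypothetical and are not taken from the study.

```python
# Minimal sketch: recall of a chatbot's item-level judgments against the
# human gold standard. "reported" is treated as the positive class.
# Example data are illustrative placeholders, not study data.

def recall(human_labels, chatbot_labels):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for h, c in zip(human_labels, chatbot_labels)
             if h == "reported" and c == "reported")
    fn = sum(1 for h, c in zip(human_labels, chatbot_labels)
             if h == "reported" and c == "not reported")
    return tp / (tp + fn) if (tp + fn) else float("nan")

# Hypothetical judgments for the 11 CONSORT-AI items of one trial.
human   = ["reported"] * 9 + ["not reported"] * 2
chatbot = ["reported"] * 8 + ["not reported"] * 3
print(f"Recall: {recall(human, chatbot):.2f}")  # 0.89
```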
Building on this methodology, each item was subsequently scored and categorized into one of two classes (reported or not reported). An overall compliance score (OCS) was calculated out of 11, along with an OCS percentage. Bland-Altman analysis was used to evaluate the overall agreement between human- and chatbot-generated OCS percentages.
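The sketch below illustrates, under stated assumptions, how the OCS percentage and the Bland-Altman bias and 95% limits of agreement could be computed; the input arrays are hypothetical placeholders, not study data.

```python
# Minimal sketch of the scoring and agreement steps: an overall compliance
# score (OCS) out of 11 items, its percentage, and Bland-Altman bias and
# 95% limits of agreement between human and chatbot OCS percentages.
import numpy as np

N_ITEMS = 11  # number of CONSORT-AI items assessed per trial

def ocs_percentage(item_judgements):
    """item_judgements: list of 11 booleans, True if the item was reported."""
    return sum(item_judgements) / N_ITEMS * 100

def bland_altman(human_pct, chatbot_pct):
    """Return the mean difference (bias) and 95% limits of agreement."""
    human_pct, chatbot_pct = np.asarray(human_pct), np.asarray(chatbot_pct)
    diff = chatbot_pct - human_pct
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical OCS percentages for a handful of trials.
human   = [81.8, 90.9, 72.7, 100.0, 63.6]
chatbot = [81.8, 81.8, 72.7, 100.0, 72.7]
bias, (lo, hi) = bland_altman(human, chatbot)
print(f"Bias: {bias:.1f} points; 95% limits of agreement: {lo:.1f} to {hi:.1f}")
```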

Results
Our analysis of 41 RCTs revealed a median OCS percentage of 81.8% (range 63.6% to 100%), with a mean of 84.5%. Three RCTs matched the gold standard perfectly in their evaluations. Among the 11 selected CONSORT-AI items, item 2 ("State the inclusion and exclusion criteria at the level of the input data") and item 4 ("State which version of the AI algorithm was used") had the lowest compliance, at 68.3%, whereas item 8 ("Specify the output of the AI intervention") had a compliance of 100%.

Conclusions
GPT-4 demonstrates strong recall in assessing the compliance of RCTs with CONSORT-AI. Nonetheless, refining the prompts could enhance the precision and consistency of the outcomes. Comparing performance across different models is also crucial to generalize these findings more reliably.