Assessing the Risk of Bias in Randomised Controlled Trials with Large Language Models: A Feasibility Study

Authors
Lai H1, Ge L1, Talukdar J2, Estill J3
1School of Public Health, Lanzhou University, Lanzhou, China
2Department of Health Research Methods, Evidence, and Impact, McMaster University, Ontario, Canada
3Institute of Global Health, University of Geneva, Geneva, Switzerland
Abstract
Background
Large language models (LLMs) may facilitate the labour-intensive process of systematic reviews.
Objective
To explore the feasibility and reliability of utilizing LLMs to assess risk of bias (ROB) in randomised controlled trials (RCTs).
Methods
We conducted a pilot and feasibility study of 30 RCTs selected from published systematic reviews. We developed robust prompts to guide ChatGPT (GPT-4) and Claude (Claude-2) in assessing the ROB of these RCTs using a modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University. Each RCT was assessed twice by each model, and the results were documented. These results were compared with an assessment by three experts, which we considered the gold standard. We calculated correct assessment rates, sensitivity, specificity, and F1 scores to reflect accuracy, both overall and for each domain of the Cochrane ROB tool; consistent assessment rates, Cohen's kappa (κ), and prevalence- and bias-adjusted kappa (PABAK) to gauge consistency; and assessment time to measure efficiency. Performance between the two models was compared using risk differences.
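As an illustrative sketch only (not the authors' analysis code), the accuracy and agreement metrics named above can be computed for a single ROB domain from paired binary judgements; the function names and the "high"/"low" risk coding below are assumptions for illustration.

```python
# Illustrative sketch: accuracy and agreement metrics for one ROB domain,
# given a model's judgements and the expert gold standard, each coded
# "high" or "low" risk of bias. Not the study's actual analysis code.

def confusion_counts(model, expert, positive="high"):
    """Count TP/FP/FN/TN, treating `positive` as the positive class."""
    tp = sum(m == positive and e == positive for m, e in zip(model, expert))
    fp = sum(m == positive and e != positive for m, e in zip(model, expert))
    fn = sum(m != positive and e == positive for m, e in zip(model, expert))
    tn = sum(m != positive and e != positive for m, e in zip(model, expert))
    return tp, fp, fn, tn

def metrics(model, expert, positive="high"):
    tp, fp, fn, tn = confusion_counts(model, expert, positive)
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n  # "correct assessment rate"
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else float("nan"))
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else float("nan")
    # Prevalence- and bias-adjusted kappa (PABAK) for two raters, two classes.
    pabak = 2 * p_o - 1
    return accuracy, sensitivity, specificity, f1, kappa, pabak
```

The same function applied to repeated runs of one model (rather than model versus expert) yields the consistency measures (consistent assessment rate, κ, PABAK).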
Results
Both models demonstrated high correct assessment rates. ChatGPT reached a mean correct assessment rate of 84.5% (95% confidence interval [CI]: 81.5% to 87.3%), and Claude a significantly higher rate of 89.5% (95% CI: 87.0% to 91.8%). In most domains, domain-specific correct assessment rates were around 80-90%; however, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns). Domains 4 (missing outcome data), 5 (selective outcome reporting), and 6 had F1 scores below 0.50. The consistency rates between the two assessments were 84.0% for ChatGPT and 87.3% for Claude. ChatGPT's κ exceeded 0.80 in seven domains, and Claude's in eight. The mean assessment time was 77 seconds for ChatGPT and 53 seconds for Claude.
Conclusion
ChatGPT and Claude showed substantial accuracy in assessing ROB in RCTs, indicating their potential as supportive tools in systematic review processes.