Assessing the Risk of Bias in Randomized Controlled Trials using RoB2 by Large Language Models: A Feasibility Study

Authors
Huang J1, Lai H2, Pan B3, Ge L4
1College of Nursing, Gansu University of Chinese Medicine, Lanzhou, China
2Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China; Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
3Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
4Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China; Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China; Key Laboratory of Evidence Based Medicine and Knowledge Translation of Gansu Province, Lanzhou, China
Abstract
Background: The revised version of the risk of bias tool (RoB2) overcomes some limitations of the original, but also introduces new challenges in its application. Large language models (LLMs) may assist in applying RoB2; however, the appropriate methods and their reliability remain uncertain.
Objective: To explore the feasibility and reliability of utilizing LLMs to assess risk of bias (ROB) in randomized controlled trials (RCTs) with RoB2.
Methods: We conducted a pilot and feasibility study in RCTs selected from published systematic reviews. We developed structured prompts to guide Claude-2 in assessing the ROB in these RCTs using RoB2. Each RCT was assessed twice and the results were documented. The results were compared with an assessment by three experts, which we considered the gold standard. We calculated correct assessment rates for each domain of RoB2, Cohen's kappa (κ) to gauge consistency between the two assessments, and assessment time to measure efficiency. Because the domain "deviations from the intended intervention" has different signaling questions depending on the effect of interest (assignment or adhering), we calculated correct assessment rates separately for the two types of assessment in this domain.
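The agreement and accuracy statistics described above can be illustrated with a minimal Python sketch (not the authors' actual analysis code; the example labels and helper names are hypothetical). It computes Cohen's kappa from two lists of domain-level judgments and a normal-approximation (Wald) 95% confidence interval for a correct assessment rate:

```python
import math

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters (or two assessment runs) over the same items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed over categories.
    categories = set(ratings_a) | set(ratings_b)
    p_e = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

def proportion_ci(successes, n, z=1.96):
    """Point estimate and Wald 95% CI for a proportion (e.g. a correct assessment rate)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

# Hypothetical example: two assessment runs over four study outcomes,
# using RoB2-style judgments.
run_1 = ["low", "low", "high", "some concerns"]
run_2 = ["low", "high", "high", "some concerns"]
print(round(cohens_kappa(run_1, run_2), 3))   # moderate agreement

# Hypothetical example: 15 of 20 domain judgments match the expert gold standard.
p, lo, hi = proportion_ci(15, 20)
print(f"{p:.1%} (95% CI: {lo:.1%} to {hi:.1%})")
```

Note that the Wald interval can be inaccurate for small samples or extreme proportions; a Wilson score interval (e.g. `statsmodels.stats.proportion.proportion_confint` with `method="wilson"`) is often preferred in that setting.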
Results: A total of 30 outcomes from 30 RCTs were included. The overall correct assessment rate was relatively high at 69.7% (95% confidence interval [CI]: 63.9% to 75.5%). Domain-specific results showed particularly high correct assessment rates for "deviations from intended intervention (assignment)" (81.1%, 95% CI: 68.0% to 94.4%) and "measurement of the outcome" (80.0%, 95% CI: 67.6% to 92.4%). Correct rates were fair for the remaining domains, ranging from 64.3% (95% CI: 32.2% to 96.4%) for "deviations from intended intervention (adhering)" to 75.0% (95% CI: 61.6% to 88.4%) for "randomization process". Cohen's kappa coefficient between the two assessments was 0.44, indicating moderate agreement. Mean time to apply the tool was 147.9 seconds (standard deviation 6.7) per study outcome.
Conclusions: LLMs rapidly assessed the risk of bias in RCTs using RoB2 and exhibited a comparatively high level of accuracy. This suggests the potential utility of LLMs as adjunctive tools in the systematic review process.