How good is the agreement between machine and humans? Use of RobotReviewer to evaluate the risk of bias of randomized trials

Authors
Armijo-Olivo S1, Craig R1, Campbell S2
1Institute of Health Economics
2University of Alberta
Abstract
Background: Evidence from new technologies and treatments is growing, along with demand for evidence to inform policy decisions. The need for knowledge synthesis products (i.e., Health Technology Assessments (HTAs) and systematic reviews (SRs)) is therefore expected to increase, making it challenging to complete assessments in a timely manner. Tools such as RobotReviewer, a semi-autonomous risk of bias (RoB) assessment tool, aim to reduce the time and resources required to complete HTAs/SRs. However, evidence validating such software for use in the HTA/SR process is limited.
Objectives: To assess the accuracy of RobotReviewer and its agreement with RoB assessments generated by consensus among human reviewers.
Methods: We used a random sample of randomized controlled trials (RCTs) and compared consensus RoB assessments from two human reviewers with the ratings generated by RobotReviewer. We assessed agreement between RobotReviewer and the human reviewers using weighted kappa (K), and assessed the accuracy of RobotReviewer by calculating sensitivity and specificity.
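For readers unfamiliar with these metrics, the following is a minimal sketch (not the authors' analysis code) of how weighted kappa and sensitivity/specificity could be computed for one RoB domain, assuming judgments coded as ordinal labels ("low", "unclear", "high") and using scikit-learn; the example ratings are hypothetical.

```python
# Sketch only: agreement and accuracy metrics for one RoB domain,
# with hypothetical ratings and the human consensus as the reference standard.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

human_consensus = ["low", "high", "unclear", "low", "high", "low"]
robotreviewer   = ["low", "high", "low",     "low", "unclear", "low"]

labels = ["low", "unclear", "high"]  # ordinal ordering used for weighting

# Weighted kappa (linear weights) between RobotReviewer and human consensus.
kappa = cohen_kappa_score(human_consensus, robotreviewer,
                          labels=labels, weights="linear")
print(f"weighted kappa: {kappa:.2f}")

# Sensitivity/specificity for identifying "high" risk of bias.
y_true = [r == "high" for r in human_consensus]
y_pred = [r == "high" for r in robotreviewer]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[False, True]).ravel()
print(f"sensitivity: {tp / (tp + fn):.2f}, specificity: {tn / (tn + fp):.2f}")
```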
Results: In total, 372 trials were included in this study. Agreement on individual domains of the RoB tool ranged from K = -0.01 (95% CI -0.03 to 0.001; no agreement) for overall RoB to K = 0.62 (95% CI 0.534 to 0.697; good agreement) for random sequence generation. Agreement was fair for allocation concealment (K = 0.41; 95% CI 0.31 to 0.51), slight for blinding of outcome assessment (K = 0.23; 95% CI 0.13 to 0.34), and poor for blinding of participants and personnel (K = 0.06; 95% CI 0.002 to 0.1). More than 70% of the quotes supporting the RoB judgments for blinding of participants and personnel (72.6%) and blinding of outcome assessment (70.4%) were irrelevant.
Conclusions: This is the first study to provide a thorough analysis of the usability of RobotReviewer. Agreement between RobotReviewer and human reviewers ranged from no agreement to good agreement. However, RobotReviewer selected a high percentage of irrelevant quotes in making RoB assessments. Use of RobotReviewer in isolation as a first or second reviewer is not recommended at this time.
Patient or health consumer involvement: It is hoped that these results will help knowledge synthesis teams decide whether to use such a tool to speed up the knowledge synthesis process.