Machine learning tools to expedite citation screening and risk of bias appraisal in systematic reviews: evaluations of Abstrackr and RobotReviewer

Tags: Oral
Gates A1, Vandermeer B1, Johnson C1, Hartling L1
1Alberta Research Centre for Health Evidence, Department of Pediatrics, University of Alberta

Background: Abstrackr and RobotReviewer are emerging tools that semi-automate citation screening and 'Risk of bias' (RoB) appraisal in systematic reviews (SRs).

Objectives: We compared Abstrackr's predictions of relevant records and RobotReviewer's RoB judgments against human reviewer consensus.

Methods: We used a convenience sample of SRs completed at our centre. For Abstrackr, we selected four SRs that were heterogeneous with respect to search yield, topic, and screening complexity. We uploaded the records to Abstrackr and screened until a prediction of the relevance of the remaining records became available. We compared the predictions to human reviewer consensus and calculated precision, proportion missed, and workload savings. For RobotReviewer, we used 1180 trials from 10 SRs or methodological research projects that varied by topic. We compared RobotReviewer's RoB judgments across six domains to human reviewer consensus and calculated reliability (Cohen's kappa coefficient), sensitivity, and specificity.
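The evaluation metrics named above can be illustrated with a short sketch. This is not the authors' code, and the exact denominators used in the study are not stated in the abstract; here, proportion missed is assumed to be missed relevant records over all relevant records, and workload savings the fraction of records not screened manually. Labels are hypothetical (1 = relevant or high/unclear risk, 0 = irrelevant or low risk).

```python
def screening_metrics(predicted, actual, n_screened_manually, n_total):
    """Precision, proportion missed, and workload savings for
    semi-automated citation screening (assumed definitions)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Assumption: missed relevant records over all relevant records.
    proportion_missed = fn / (fn + tp) if (fn + tp) else 0.0
    # Assumption: fraction of records the tool spared reviewers from screening.
    workload_savings = 1 - n_screened_manually / n_total
    return precision, proportion_missed, workload_savings

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between tool and human consensus."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```

For example, comparing RobotReviewer's domain judgments to consensus with `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` yields 0.5, which would fall in the "moderate" agreement band reported for the first three domains.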

Results: Abstrackr's precision varied by screening task (median 27%, range 15% to 65%). The proportion missed was 0.1% for three of the SRs and 6% for the fourth; the missed records accounted for a median of 4% (range 0% to 12%) of those included in the final reports. The workload savings were often large (median 27%, range 10% to 88%). RobotReviewer's reliability (95% confidence interval (CI)) was moderate for random sequence generation (0.48 (0.43 to 0.53)), allocation concealment (0.45 (0.40 to 0.51)), and blinding of participants and personnel (0.42 (0.36 to 0.47)). Reliability (95% CI) was slight for blinding of outcome assessors (0.10 (0.05 to 0.14)), incomplete outcome data (0.14 (0.08 to 0.19)), and selective reporting (0.02 (-0.02 to 0.05)). Across topics, sensitivity and specificity (95% CI) ranged from 0.20 (0.18 to 0.23) to 0.76 (0.72 to 0.80) and from 0.61 (0.56 to 0.65) to 0.95 (0.93 to 0.96), respectively.

Conclusions: Abstrackr's reliability and the associated workload savings varied by SR, and the savings came at the expense of missing potentially relevant records. For most domains, RobotReviewer's reliability was similar to that reported between author groups. These promising tools should be tested on large samples of heterogeneous SRs to inform their practical utility and guidance for their use.

Patient or healthcare consumer involvement: None.