Using RobotReviewer to assist humans in conducting risk of bias assessments—a randomized user study

Authors
Soboczenski F1, Trikalinos TA2, Kuiper J3, Bias RG4, Wallace BC5, Marshall IJ1
1King's College London
2Brown University
3Vortext Systems
4University of Texas, Austin
5Northeastern University
Abstract
Background: Assessing risk of bias in randomized controlled trials (RCTs) is an important but time-consuming task when conducting systematic reviews. RobotReviewer, an open-source machine learning (ML) system, provides automatic suggestions for bias assessments with the aim of accelerating this manual task.

Objectives: We conducted a user study comparing ML-assisted bias assessment (ML suggestions presented to reviewers within the online RCT journal article) with a fully manual process (the original journal article presented with no suggestions), using a set of example articles. We sought to evaluate whether ML assistance saved time, the extent to which human reviewers changed or revised the ML suggestions, and whether the system was easy to use.

Methods: Systematic reviewers applied the Cochrane ‘Risk of bias’ tool (Version 1) to four randomly selected RCT articles. Reviewers judged whether an RCT was at low or high/unclear risk of bias for each bias domain in the tool, and highlighted article text justifying their decision (a screenshot of the interface is given in Figure 1). For a random two of the four articles, the process was ‘semi-automated’: users were provided with ML-suggested bias judgments and text highlights, and could delete the suggestions or add new ones as necessary. We measured the time taken to complete the task, the extent to which users modified the ML suggestions, and usability via the System Usability Scale (SUS); we also collected qualitative feedback.
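
To make the allocation step concrete, the following is a minimal sketch in Python of how two of each participant's four articles might be randomly assigned to the semi-automated condition. This is illustrative only, not the study's actual implementation; the function name, article identifiers, and seeding are hypothetical.

    import random

    def allocate_conditions(article_ids, n_semi_automated=2, seed=None):
        # Randomly pick n_semi_automated articles for the ML-assisted
        # ('semi-automated') condition; the remainder are assessed manually.
        rng = random.Random(seed)
        semi = set(rng.sample(article_ids, n_semi_automated))
        return {a: 'semi-automated' if a in semi else 'manual'
                for a in article_ids}

    # Example: one participant, four randomly selected RCT articles
    print(allocate_conditions(['rct_01', 'rct_02', 'rct_03', 'rct_04']))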

Results: For 41 volunteers, semi-automation was on average quicker than manual assessment (mean 755 vs. 824 seconds; relative time 0.75, 95% confidence interval (CI) 0.62 to 0.92). Reviewers accepted 301/328 (91%) of the ML risk of bias judgments and 202/328 (62%) of the text highlights without change. Overall, the snippets (text highlights) suggested by ML had a recall of 0.90 (SD 0.14) and a precision of 0.87 (SD 0.21) with respect to the users’ final versions (see Figure 2 for a breakdown of user interactions with the snippets). Reviewers assigned the system a mean SUS score of 77.7, corresponding to a rating between ‘good’ and ‘excellent’ (a breakdown is provided in Figure 3).
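
For clarity, the sketch below shows one plausible way these metrics could be computed. The helper names are hypothetical, and exact snippet matching is an assumption (the study may have scored partial or token-level overlap); the SUS function implements the standard 0-100 scoring rule.

    def precision_recall(suggested, final):
        # Score ML-suggested snippets against the user's final snippet set.
        # Assumption: snippets are compared by exact match; token-level
        # overlap scoring is also possible and may be what the study used.
        suggested, final = set(suggested), set(final)
        tp = len(suggested & final)
        precision = tp / len(suggested) if suggested else 0.0
        recall = tp / len(final) if final else 0.0
        return precision, recall

    def sus_score(responses):
        # Standard System Usability Scale: ten 1-5 Likert responses;
        # odd-numbered items contribute (r - 1), even-numbered items (5 - r);
        # the sum is multiplied by 2.5 to give a 0-100 score.
        assert len(responses) == 10
        total = sum((r - 1) if i % 2 == 0 else (5 - r)
                    for i, r in enumerate(responses))
        return total * 2.5

    # A participant who kept 2 of 3 suggested snippets and added a new one:
    print(precision_recall(['s1', 's2', 's3'], ['s1', 's2', 's4']))  # ~(0.67, 0.67)
    print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))                 # 77.5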

Conclusions: Semi-automation (where humans validate ML suggestions) can improve the efficiency of evidence synthesis. Our system was rated as highly usable and expedited bias assessment of RCTs. We found evidence that users engaged meaningfully with the ML suggestions rather than accepting them automatically.

Patient or healthcare consumer involvement: Patients were not involved in the conduct of this study, which evaluated a technology for researchers conducting systematic reviews.