Is it time to trust the robots? The reliability and usability of machine learning tools for screening in systematic reviews

Tags: Oral
Gates A1, Pillay J1, Guitard S1, Elliott S1, Dyson M1, Newton A2
1Alberta Research Centre for Health Evidence, University of Alberta, 2Department of Pediatrics, University of Alberta

Background. Machine learning tools can expedite the completion of systematic reviews (SRs) by reducing manual screening workloads, yet their application has been minimal. Evidence of their benefits and enhanced usability may improve their acceptance within the SR community.

Objectives. We tested the performance of three tools when used to: (a) eliminate irrelevant records (Simulation A) and (b) replace one of two independent reviewers (Simulation B). We also evaluated the usability of each tool.

Methods. We selected three SRs completed at our centre and subjected these to two retrospective screening simulations. Using each tool (Abstrackr, DistillerSR, and RobotAnalyst), we screened a 200-record training set and downloaded the predicted relevance of the remaining records. To test their performance, we calculated the proportion missed, workload savings, and estimated time savings compared to dual independent screening by two reviewers. To test usability, screeners undertook a screening exercise in each tool and completed a user experience survey, incorporating the System Usability Scale (SUS).
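The performance measures named above can be illustrated with a minimal sketch. The abstract does not state the exact formulas, so everything here is an assumption made for illustration: the `Record` type, the function names, the choice of denominators (all relevant records for proportion missed, all non-training records for workload savings), and the hypothetical `minutes_per_record` screening rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    relevant: bool            # reference decision from the completed review
    predicted_relevant: bool  # tool's prediction after the training set


def proportion_missed(records: List[Record]) -> float:
    """Share of truly relevant records that the tool predicted irrelevant (assumed definition)."""
    relevant = [r for r in records if r.relevant]
    if not relevant:
        return 0.0
    missed = sum(1 for r in relevant if not r.predicted_relevant)
    return missed / len(relevant)


def workload_savings(records: List[Record]) -> float:
    """Share of records that would not need manual screening (assumed definition)."""
    if not records:
        return 0.0
    skipped = sum(1 for r in records if not r.predicted_relevant)
    return skipped / len(records)


def time_savings_hours(records: List[Record], minutes_per_record: float = 0.5) -> float:
    """Hours saved, assuming a fixed (hypothetical) screening time per record."""
    skipped = sum(1 for r in records if not r.predicted_relevant)
    return skipped * minutes_per_record / 60


# Toy data, invented for illustration only: 4 relevant and 6 irrelevant records.
toy_records = (
    [Record(True, True)] * 3
    + [Record(True, False)] * 1
    + [Record(False, True)] * 1
    + [Record(False, False)] * 5
)
```

On the toy data, one of four relevant records is missed and six of ten records are skipped. The real simulations may count workload differently, for example per reviewer decision rather than per record.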

Results. Using Abstrackr, DistillerSR, and RobotAnalyst respectively, the median (range) proportion of records missed was 5 (0-28)%, 97 (96-100)%, and 70 (23-100)% in Simulation A and 1 (0-2)%, 2 (0-7)%, and 2 (0-4)% in Simulation B. The median (range) workload savings was 90 (82-93)%, 99 (98-99)%, and 85 (85-88)% for Simulation A and 40 (32-43)%, 49 (48-49)%, and 35 (34-38)% for Simulation B. The median (range) time savings was 154 (91-183), 185 (95-201), and 157 (86-172) hours for Simulation A and 61 (42-82), 92 (46-100), and 64 (37-71) hours for Simulation B. Based on the median (IQR) SUS scores (/100), Abstrackr fell in the usable (79 (23)), DistillerSR the marginal (64 (31)), and RobotAnalyst the unacceptable (31 (8)) usability range (n = 8). Participants indicated that usability was contingent on six interdependent properties: user friendliness, qualities of the user interface, features and functions, trustworthiness, ease and speed of obtaining predictions, and practicality of the export file(s).
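For readers unfamiliar with how a SUS score out of 100 is obtained, the standard scoring rule for the 10-item questionnaire can be sketched as follows. The function name and the example responses are hypothetical and are not the study's data; the sketch assumes each item is answered on a 1 to 5 scale, with odd-numbered items positively worded and even-numbered items negatively worded.

```python
from statistics import median
from typing import List, Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Convert ten 1-5 item responses into a 0-100 SUS score."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("each response must be between 1 and 5")
        if item_number % 2 == 1:
            total += response - 1   # odd items: higher agreement is better
        else:
            total += 5 - response   # even items: lower agreement is better
    return total * 2.5              # rescale the 0-40 sum to 0-100


# Hypothetical responses from three participants (not the study's data).
example_responses: List[List[int]] = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    [1, 5, 1, 5, 1, 5, 1, 5, 1, 5],
]
example_scores = [sus_score(r) for r in example_responses]
example_median = median(example_scores)
```

The per-participant scores would then be summarized across screeners, for instance as a median with an interquartile range.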

Conclusions. Our findings support the cautious use of machine learning tools to replace the second reviewer (Simulation B); the workload savings were substantial and few, if any, records were erroneously excluded. Designing tools based on reviewers’ self-identified preferences may improve their usability.

Patient or healthcare consumer involvement. None.