Abstract
Background: with the increasing number of publications, new tools that employ text mining and machine learning have been developed to accelerate the screening of citations (titles and abstracts) when conducting systematic reviews (SRs). Based on human reviewers' decisions to include or exclude the citations screened so far, these tools predict the likelihood that the remaining, unscreened citations are eligible for a given SR.
Objectives: the aim of this study was to compare the performance of three freely available tools for semi-automated screening (Abstrackr, Rayyan, and RobotAnalyst) using a set of diverse SRs.
Methods: we used a convenience sample of nine SRs from different medical fields as gold standards. The authors of the reviews provided us with their bibliographic databases, which documented the eligible and ineligible citations after title and abstract screening. We applied the screening tools to the citations of each review after training the tools with 5% and 10% of each database's coded citations. We then downloaded the tools' predictions for the unscreened citations, cross-checked them against the gold standard databases, and calculated the sensitivity and specificity of each tool.
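As an illustration of this cross-checking step, the following minimal Python sketch (hypothetical function and data, not taken from the study) shows how tool predictions can be compared against gold-standard screening decisions to obtain sensitivity and specificity:

```python
from typing import Sequence

def sensitivity_specificity(gold: Sequence[bool], predicted: Sequence[bool]) -> tuple[float, float]:
    """Compare tool predictions with gold-standard screening decisions.

    gold[i] is True if the human reviewers included citation i;
    predicted[i] is True if the tool predicted inclusion.
    """
    tp = sum(g and p for g, p in zip(gold, predicted))          # eligible citations the tool flagged
    fn = sum(g and not p for g, p in zip(gold, predicted))      # eligible citations the tool missed
    tn = sum(not g and not p for g, p in zip(gold, predicted))  # ineligible citations correctly excluded
    fp = sum(not g and p for g, p in zip(gold, predicted))      # ineligible citations flagged as eligible
    sensitivity = tp / (tp + fn)  # share of eligible citations recovered
    specificity = tn / (tn + fp)  # share of ineligible citations screened out
    return sensitivity, specificity

# Hypothetical example with six unscreened citations:
gold      = [True, True, False, False, False, True]
predicted = [True, False, False, False, True, True]
print(sensitivity_specificity(gold, predicted))  # (0.666..., 0.666...)
```

In a high-recall task such as SR screening, sensitivity (not missing eligible citations) is typically weighted more heavily than specificity, which is why the two measures are reported separately below.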
Results: the nine SRs comprised a median (minimum to maximum) number of 2073 (243 to 4734) citations. After screening 10% of all citations and using them as the training set for the tools, the sensitivities of Abstrackr, Rayyan, and RobotAnalyst were 0.92 (0.64 to 1), 0.99 (0.90 to 1), and 0.72 (0.50 to 1), respectively. The corresponding specificities were 0.74 (0.28 to 0.97), 0.17 (0.08 to 0.26), and 0.99 (0.97 to 1), respectively.
Conclusions: Abstrackr, Rayyan, and RobotAnalyst differed in sensitivity and specificity. At present, none of these tools is accurate enough to replace human screeners in an SR.
Patient or healthcare consumer involvement: no patients were involved in the development of this methodological research paper.