Semi-automating citation screening: a retrospective assessment of a hybrid machine learning/crowdsourced approach using one year’s worth of human-generated data from the Embase crowdsourcing project

Tags: Oral
Wallace B1, Thomas J2, Cohen A3, Smalheiser N4, Dooley G5, Foxlee R6, Noel-Storr A7
1Department of Computer Science, University of Texas, USA, 2Institute of Education, University College London, UK, 3Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, USA, 4Department of Psychiatry, University of Illinois, Chicago, USA, 5Metaxis, UK, 6Cochrane Editorial Unit, Cochrane, UK, 7Cochrane Dementia and Cognitive Improvement Group, Oxford University, UK

Background: Previous work has shown that machine learning applications can successfully classify citations into prespecified categories, reducing human citation screening workload by 40% to 50%.

Objectives: We assessed the potential role of a machine learning approach in helping the crowd to identify reports of randomised trials eligible for Cochrane’s Central Register of Controlled Trials (CENTRAL).

Methods: The Embase project used crowdsourcing to identify reports of randomised trials from highly sensitive searches run in Embase. Using the citations fully assessed by the crowd in this project as a gold standard, we ran a number of simulations comparing machine performance alone, or in various combinations with human assessment, to understand the potential workload reductions and the effects on recall and precision.

Results: A total of 60,468 fully assessed citations were included in the analyses. Six analyses were performed. The first was a simple comparison of machine predictions against the gold standard: the area under the curve was 0.977, and the maximum point on this curve corresponded to a recall of 71.2% and a precision of 73.4%. We then explored use of the machine classifier alongside human workers via simulation experiments. The most effective approach replaced one human screener with the machine prediction when three or more screeners were used; this resulted in a recall of 98.5% while reducing workload substantially. In addition, when the decision was deferred entirely to the machine for sufficiently confident predictions, 95% recall was achieved with a correspondingly dramatic reduction in workload.
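The two simulated workflows described above can be illustrated in code. The sketch below is a minimal, hypothetical rendering of the ideas only: the vote-combination rule, the function names, and all thresholds are assumptions for illustration, not the study's actual implementation or data.

```python
# Hypothetical sketch of two screening workflows from the simulations:
# (1) replace one of three human screeners with the machine's vote and
#     decide by simple majority;
# (2) defer the decision entirely to the machine when its predicted
#     probability is sufficiently confident, otherwise send to the crowd.
# All thresholds and data below are illustrative assumptions.

def screen_with_machine(human_votes, machine_prob, threshold=0.5):
    """Combine two human votes with one machine vote (the machine replaces
    a third human screener); include the citation on a simple majority."""
    machine_vote = machine_prob >= threshold
    votes = list(human_votes[:2]) + [machine_vote]
    return sum(votes) >= 2

def defer_to_machine(machine_prob, low=0.1, high=0.9):
    """Auto-decide only when the classifier is confident: True (include)
    above `high`, False (exclude) below `low`, else None (route to crowd)."""
    if machine_prob >= high:
        return True
    if machine_prob <= low:
        return False
    return None

def recall(decisions, gold):
    """Fraction of gold-standard RCTs that the workflow retained."""
    true_pos = sum(1 for d, g in zip(decisions, gold) if d and g)
    positives = sum(gold)
    return true_pos / positives if positives else 1.0

# Toy usage: two human votes plus a fairly confident machine prediction.
print(screen_with_machine([True, False], 0.8))  # majority include -> True
print(defer_to_machine(0.95))                   # confident -> True
print(defer_to_machine(0.5))                    # uncertain -> None (crowd)
```

In a real evaluation, recall would be computed over all citations against the crowd's gold-standard labels, and the workload saving would be the fraction of citations the machine resolved without (or with fewer) human assessments.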

Conclusions: The results of this work have informed next steps towards implementation in the workflow for the Evidence Pipeline and Cochrane Crowd components of Project Transform. The identification of RCTs can be semi-automated and, when applied appropriately within a crowd model, can offer significant opportunities to reduce human effort without compromising recall.