Semi-automating citation screening: a retrospective assessment of a hybrid machine learning/crowdsourced approach using one year’s worth of human-generated data from the Embase crowdsourcing project

Oral

2016 Seoul

Wallace B¹, Thomas J², Cohen A³, Smalheiser N⁴, Dooley G⁵, Foxlee R⁶, Noel-Storr A⁷

¹Department of Computer Science, University of Texas, USA

²Institute of Education, University College London, UK

³Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, USA

⁴Department of Psychiatry, University of Illinois, Chicago, USA

⁵Metaxis, UK

⁶Cochrane Editorial Unit, Cochrane, UK

⁷Cochrane Dementia and Cognitive Improvement Group, Oxford University, UK

Background: Previous work has shown that machine learning applications can feasibly classify citations into prespecified categories, and has demonstrated reductions in human citation screening of 40% to 50%.

Objectives: We assessed the potential role of a machine learning approach in helping the crowd to identify reports of randomised trials eligible for Cochrane’s Central Register of Controlled Trials (CENTRAL).

Methods: The Embase project used crowdsourcing to identify reports of randomised trials from highly sensitive searches run in Embase. Using the citations fully assessed by the crowd from this project as a gold standard, we ran a number of simulations comparing machine performance alone or in various combinations with human assessment in order to understand the potential workload reductions and effects on recall and precision.
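As an illustration of how recall and precision can be measured against a gold standard of this kind, the following is a minimal sketch and not the project's own evaluation code. The function name `recall_precision` and the representation of labels as booleans (True meaning "report of a randomised trial") are assumptions made for this example.

```python
def recall_precision(gold, predicted):
    """Compare predicted labels with gold-standard labels.

    Both arguments are equal-length sequences of booleans, where True
    means the citation is judged to be a report of a randomised trial.
    (This data layout is an assumption made for illustration.)
    """
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)
    false_pos = sum(1 for g, p in pairs if not g and p)

    # Recall: share of gold-standard trials that were found.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    # Precision: share of predicted trials that really are trials.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    return recall, precision


if __name__ == "__main__":
    gold = [True, True, False, False]
    predicted = [True, False, True, False]
    print(recall_precision(gold, predicted))
```

Workload reduction can be tracked in the same loop by counting how many human assessments each simulated strategy consumes per citation.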

Results: A total of 60,468 fully assessed citations were included in the analyses. Six analyses were performed. The first was a simple comparison of machine predictions with the gold standard. The area under the curve was 0.977, and the maximum point on this curve corresponded to a recall of 71.2% and a precision of 73.4%. We then explored use of the machine classifier in addition to human workers via simulation experiments. The most effective approach entailed replacing one human screener with a computer prediction when three or more screeners were used. This resulted in a recall of 98.5% while reducing workload substantially. In addition, when the decision was deferred entirely to the machine whenever it was sufficiently confident in its prediction, 95% recall was achieved with a correspondingly dramatic reduction in workload.
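The two hybrid strategies described above could be simulated per citation along the following lines. This is a hedged sketch only: the abstract does not state how crowd votes were combined or which confidence cut-offs were used, so the simple majority rule, the function names, and the threshold values (`threshold`, `low`, `high`) are all assumptions introduced for illustration.

```python
def majority(votes):
    """Strict majority of boolean votes; ties count as 'exclude'.
    (The majority rule itself is an assumption, not the project's rule.)"""
    return 2 * sum(votes) > len(votes)


def replace_one_screener(human_votes, machine_score, threshold=0.5):
    """Swap one human vote for the machine's vote when three or more
    human screeners are available. Returns (include, humans_used)."""
    machine_vote = machine_score >= threshold  # hypothetical cut-off
    if len(human_votes) >= 3:
        votes = list(human_votes[:-1]) + [machine_vote]
        humans_used = len(human_votes) - 1
    else:
        votes = list(human_votes)
        humans_used = len(human_votes)
    return majority(votes), humans_used


def defer_when_confident(human_votes, machine_score, low=0.05, high=0.95):
    """Let the machine decide alone when its score is extreme; otherwise
    fall back to the human votes. Returns (include, humans_used)."""
    if machine_score >= high:   # hypothetical "confident include" cut-off
        return True, 0
    if machine_score <= low:    # hypothetical "confident exclude" cut-off
        return False, 0
    return majority(human_votes), len(human_votes)


if __name__ == "__main__":
    print(replace_one_screener([True, False, True], 0.8))
    print(defer_when_confident([True, True, True], 0.99))
```

Summing `humans_used` over all citations and comparing it with the total number of human assessments in the original data gives one way to express the workload reduction of each strategy.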

Conclusions: The results of this work have informed the next steps towards implementation in the workflow for the Evidence Pipeline and Cochrane Crowd components of Project Transform. The identification of RCTs can be semi-automated; when applied appropriately within a crowd model, semi-automation offers significant opportunities to reduce human effort without compromising recall.