Integrating Machine Learning into a Systematic Review Workflow: Testing the Cochrane RCT Classifier in a Research Consultancy Setting

Authors
Marshall C1, Bracewell J1, Littlewood A1, Ferrante di Ruffano L1, Edwards M1, McCool R1
1York Health Economics Consortium
Abstract
Background: There is strong evidence that machine learning can substantially reduce the burden of manual screening in systematic reviews (SRs). However, outside Cochrane and some academic groups, adoption and acceptance of these tools remain limited.

Objectives: To assess the accuracy of the Cochrane Randomised Controlled Trial (RCT) Classifier as a means of accelerating systematic review screening in a research consultancy setting.

Methods: Our Review and Evidence Synthesis (RES) team are developing a semiautomated workflow (“RESbot”), which combines compatible tools to accelerate SR production. We tested the Cochrane RCT Classifier as a candidate for inclusion in the workflow to support screening. From December 2022 to February 2023, the classifier was tested on three SRs of RCTs covering interventions for renal denervation (Review A), postpartum depression (Review B), and schizophrenia (Review C). For each review, the search results were screened manually by two independent reviewers. The same results were run through the classifier, using both the “sensitive” and “precision” versions. Classifier results were cross-checked against reviewer decisions.

Results: For Review A, the search retrieved 2,795 records. Manual screening found 24 eligible trials. Running the search results through the classifier reduced the volume to 1,504 (sensitive) and 701 (precision). The sensitive set contained all 24 included trials and the precision set contained 23. For Review B, the search retrieved 2,153 records. Manual screening found 22 eligible trials. The classifier reduced the volume to 1,594 (sensitive) and 1,265 (precision). Both sets contained all 22 trials. For Review C, the search retrieved 1,172 records. Manual screening found 20 eligible trials. The classifier reduced the volume to 823 (sensitive) and 548 (precision). Both sets contained all 20 trials. Across the three tests, the reduction in screening burden ranged from 25.9% to 46.1% with the sensitive setting and from 41.2% to 74.8% with the precision setting.
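The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes that "reduction in screening burden" means the share of retrieved records removed by the classifier, and that recall is the share of manually identified eligible trials still present in the classifier output. The names `ReviewTest`, `workload_reduction`, `recall` and `summarise` are hypothetical. The abstract does not state how the reported percentages were rounded, so values computed this way may differ from them slightly in the final decimal place.

```python
from dataclasses import dataclass


@dataclass
class ReviewTest:
    """Counts for one review, taken from the Results paragraph above."""
    name: str
    retrieved: int   # records retrieved by the search
    eligible: int    # eligible trials found by manual screening
    kept: dict       # setting -> records remaining after the classifier
    found: dict      # setting -> eligible trials present in the remaining set


def workload_reduction(retrieved: int, kept: int) -> float:
    """Percentage of retrieved records that no longer need manual screening.

    Assumed definition: (retrieved - kept) / retrieved.
    """
    return 100.0 * (retrieved - kept) / retrieved


def recall(found: int, eligible: int) -> float:
    """Percentage of eligible trials retained in the classifier output."""
    return 100.0 * found / eligible


REVIEWS = [
    ReviewTest("A", 2795, 24,
               {"sensitive": 1504, "precision": 701},
               {"sensitive": 24, "precision": 23}),
    ReviewTest("B", 2153, 22,
               {"sensitive": 1594, "precision": 1265},
               {"sensitive": 22, "precision": 22}),
    ReviewTest("C", 1172, 20,
               {"sensitive": 823, "precision": 548},
               {"sensitive": 20, "precision": 20}),
]


def summarise(reviews):
    """Return (review, setting, reduction %, recall %) for every test."""
    rows = []
    for r in reviews:
        for setting in ("sensitive", "precision"):
            rows.append((
                r.name,
                setting,
                workload_reduction(r.retrieved, r.kept[setting]),
                recall(r.found[setting], r.eligible),
            ))
    return rows


if __name__ == "__main__":
    for name, setting, reduction, rec in summarise(REVIEWS):
        print(f"Review {name} {setting:<9} "
              f"reduction={reduction:.1f}%  recall={rec:.1f}%")
```

Reporting recall alongside the reduction keeps the trade-off visible: the precision setting removes more records, but in Review A it did so at the cost of one eligible trial.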

Conclusions: The Cochrane RCT Classifier performed well in our tests, with no trials missed across the three reviews using the sensitive setting and only one trial missed using the precision setting (in Review A). Our findings support the wider adoption of this classifier to accelerate review production.