Background: machine learning can expedite evidence synthesis by semi-automating title and abstract screening. Evidence of the relative advantages and reliability of semi-automated screening approaches is needed to inform guidance on their integration into modern review processes.
Objectives: compared to screening by a single experienced reviewer in rapid reviews (RRs) and dual independent screening in systematic reviews (SRs), we investigated the reliability and relative advantages of using a machine learning tool to 1) automatically exclude irrelevant records and 2) replace one of two independent review authors. We evaluated the impact of erroneously excluded records on the primary outcome.
Methods: we selected 11 SRs and six RRs completed at our Centre and subjected these to two retrospective screening simulations in 'Abstrackr', a machine learning tool. For each SR and RR, we screened a 200-record training set and downloaded the predicted relevance of the remaining records. We calculated the proportion missed, workload savings, and estimated time savings compared to single (RRs and SRs) and dual-independent screening (SRs only) by human reviewers. We performed a citing articles search in Scopus or Google Scholar to determine if the missed studies would be identified via reference list scanning. For SRs with pairwise meta-analyses, we removed the missed studies and compared the pooled estimates of effects for the primary outcome to those in the final reports.
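The screening metrics described above can be illustrated with a minimal sketch. This is not the authors' code: the formulas follow common definitions in the screening-automation literature, and the per-record screening rate is an assumed value for illustration only.

```python
# Illustrative sketch of the screening-evaluation metrics (assumed
# definitions, not the study's actual analysis code).

def proportion_missed(relevant_excluded: int, total_relevant: int) -> float:
    """Relevant studies wrongly predicted irrelevant, as a % of all relevant."""
    return 100 * relevant_excluded / total_relevant

def workload_saving(records_not_screened: int, total_records: int) -> float:
    """% of records the human reviewer(s) no longer need to screen."""
    return 100 * records_not_screened / total_records

def time_saving_hours(records_not_screened: int,
                      seconds_per_record: float = 30.0) -> float:
    """Estimated screening time avoided; 30 s/record is an assumed rate."""
    return records_not_screened * seconds_per_record / 3600
```

For example, if a tool spares reviewers 830 of 1,000 records, the workload saving is 83%, and at an assumed 30 seconds per record this corresponds to roughly 7 hours of screening avoided.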
Results: when we used Abstrackr to exclude irrelevant records, the median (interquartile range (IQR)) proportion missed was 20 (21)% (i.e. 9 (10) studies) for the SRs and 6 (12)% (i.e. 2 (10) studies) for the RRs. When used to replace one of two reviewers in the SRs, the median (IQR) proportion missed was 0 (1)% (i.e. 0 (2) studies). This diminished to 0 (1) studies following the citing articles search (0 studies in 7 SRs, 1 study in 2 SRs and 2 studies in 2 SRs). The missed studies had no impact on the results of the SRs. When used to exclude irrelevant records, the median (IQR) workload saving was 83 (12)% for the SRs and 34 (11)% for the RRs, for an estimated time saving of 44 (67) hours and 3 (3) hours, respectively. When used to replace one of two reviewers in the SRs, the median (IQR) workload saving was 33 (12)%, for an estimated time saving of 20 (30) hours.
Conclusions: too many relevant studies were missed when Abstrackr was used to automatically exclude irrelevant records for this approach to be considered reliable. Few (≤ 3), if any, relevant studies were missed when Abstrackr was used to replace the second reviewer in a pair; however, this amounted to up to 14% of the included studies in small SRs. The proportion missed diminished to 10% or less (≤ 2 studies) after scanning reference lists. In the context of SRs with comprehensive search strategies, the cautious application of machine learning to replace one reviewer in a pair could save considerable screening time without impacting the results.
Patient or healthcare consumer involvement: none