Decoding semi-automated title-abstract screening: an exploration of the review, study, and publication characteristics associated with Abstrackr's relevance predictions

Tags: Oral
Gates A1, Gates M1, Elliott S1, Pillay J1, DaRosa D1, Rahman S1, Vandermeer B1, Hartling L1
1Alberta Research Centre for Health Evidence, University of Alberta

Background. Machine learning (ML) tools can reduce screening workloads in systematic reviews, but their adoption has been slow. To build trust, review teams may benefit from a better understanding of how and when ML-assisted screening can be applied most safely and effectively.

Objectives. We evaluated the risks (missed records) and benefits (time savings) of using Abstrackr to semi-automate title-abstract screening, and explored whether Abstrackr’s predictions varied by review- or study-level characteristics.

Methods. For each of 16 reviews, we uploaded the records to Abstrackr, screened a 200-record training set, and downloaded the predicted relevance of the remaining records. We then retrospectively simulated the liberal-accelerated screening approach, whereby the senior reviewer screened the records predicted to be relevant, and the second reviewer then screened those predicted to be irrelevant plus those excluded by the senior reviewer. We estimated the time savings (assuming 30 seconds per record) and calculated the proportion missed (records included in the final reports that would have been wrongly excluded), compared with dual independent screening. We compared the review- and study-level characteristics of Abstrackr’s ‘correct’ and ‘incorrect’ predictions using Fisher’s exact tests and unpaired t-tests.
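For readers who want the workload arithmetic spelled out, the sketch below (not code from the study) illustrates how screening effort and the proportion missed could be tallied under the approach described above. All record counts in the example are hypothetical; only the 30-seconds-per-record assumption and the dual-independent-screening comparator come from the abstract.

```python
# Illustrative sketch of the workload and proportion-missed arithmetic described
# in the Methods. Record counts below are hypothetical, not study data.

SECONDS_PER_RECORD = 30  # screening time assumed per record (per the abstract)


def dual_independent_workload(n_records: int) -> int:
    """Comparator: two reviewers each screen every record."""
    return 2 * n_records


def liberal_accelerated_workload(n_predicted_relevant: int,
                                 n_predicted_irrelevant: int,
                                 n_excluded_by_senior: int) -> int:
    """Senior reviewer screens records predicted relevant; second reviewer
    screens records predicted irrelevant plus the senior reviewer's exclusions."""
    return n_predicted_relevant + n_predicted_irrelevant + n_excluded_by_senior


def hours_saved(n_records: int, n_predicted_relevant: int,
                n_predicted_irrelevant: int, n_excluded_by_senior: int) -> float:
    """Time saved relative to dual independent screening, in hours."""
    saved_decisions = (dual_independent_workload(n_records)
                       - liberal_accelerated_workload(n_predicted_relevant,
                                                      n_predicted_irrelevant,
                                                      n_excluded_by_senior))
    return saved_decisions * SECONDS_PER_RECORD / 3600


def proportion_missed(n_included_in_report: int, n_wrongly_excluded: int) -> float:
    """Records in the final report that the simulated approach would have excluded."""
    return n_wrongly_excluded / n_included_in_report


# Example with made-up numbers for a single review of ~2,100 records:
print(hours_saved(n_records=2123, n_predicted_relevant=600,
                  n_predicted_irrelevant=1523, n_excluded_by_senior=300))
print(proportion_missed(n_included_in_report=50, n_wrongly_excluded=1))
```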

Results. The median (interquartile range (IQR)) screening workload was 2123 (4641) records. Across systematic reviews, our approach wrongly excluded 0 to 3 (0 to 14%) of the records included in the final reports and saved a median (IQR) of 26 (33) hours of screening time. Of 802 records in the final reports, 87% were correctly predicted as relevant. The correctness of the predictions did not differ by review type (systematic (88% correct) or rapid (84%); P=0.37) or intervention type (simple (88%) or complex (86%); P=0.47). The predictions were more often correct in reviews with multiple (89%) vs. single (83%) research questions (P=0.01), and in reviews that included only trials (95%) vs. multiple study designs (86%) (P=0.003). At the study level, trials (91%), mixed methods studies (100%), and qualitative studies (93%) were more often correctly predicted as relevant than observational studies (79%) or reviews (83%) (P=0.0006). Studies at high or unclear (88%) vs. low (80%) risk of bias (P=0.039), and those published more recently (mean (SD) publication year 2008 (7) vs. 2006 (10); P=0.02), were more often correctly predicted as relevant. There was no difference in the mean (SD) journal impact factor between correctly included (4.91 (8.39)) and wrongly excluded (4.61 (9.14)) records (P=0.74).
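For context only, the sketch below shows how one such between-group comparison of prediction correctness could be computed with a Fisher's exact test, as named in the Methods. The 2x2 counts are invented to match the reported percentages approximately; the study's actual contingency tables are not given in this abstract.

```python
# Hypothetical 2x2 comparison of prediction correctness by review type,
# mirroring the Fisher's exact tests described in the Methods. Counts are invented.
from scipy.stats import fisher_exact

# Rows: systematic reviews, rapid reviews; columns: correct, incorrect predictions.
table = [[440, 60],   # ~88% correct in systematic reviews (invented counts)
         [252, 48]]   # ~84% correct in rapid reviews (invented counts)

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.2f}")
```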

Conclusions. Our ML-assisted screening approach saved considerable time and may be suitable when the limited risk of missing relevant records is tolerable (e.g., in rapid or scoping reviews). ML-assisted screening may be most trustworthy for reviews that seek to include only trials or more recent publications; however, because several of our findings are paradoxical, further study is needed to understand the tasks to which ML-assisted screening is best suited.

Patient or healthcare consumer involvement: None.