Does screening automation negatively impact meta-analyses in systematic reviews of diagnostic test accuracy?

2019 Santiago

Norman C¹, Leeflang M², Porcher R³, Névéol A¹

¹LIMSI, CNRS

²AMC, University of Amsterdam

³METHODS Team, CRESS, Inserm U1153; Université Paris Descartes

Introduction: the large and increasing number of new studies published each year is making literature identification in systematic reviews ever more time-consuming and costly. Technological assistance has been suggested as a way to mitigate the cost, but has so far seen little adoption by the systematic review community. One likely reason is that systematic reviews conventionally assume that all relevant studies have been identified, which current methods cannot guarantee. In this study we examined to what extent missing studies would have negatively impacted the findings of systematic reviews of diagnostic test accuracy. In particular, we examined whether it is possible to gauge when a systematic review of diagnostic test accuracy has uncovered enough evidence that additional evidence is unlikely to change its findings, and whether perfect recall in the screening process is necessary to estimate the summary sensitivity and specificity of the diagnostic tests accurately.

Methods: we simulated the screening process in 48 Cochrane Reviews of diagnostic test accuracy, and re-ran 400 meta-analyses, each based on at least three studies. We compared screening prioritization with screening in randomized order, and examined whether the screening could have been stopped before identifying all relevant studies while still producing reliable summary sensitivity and specificity estimates.
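
As an illustration of the kind of comparison described above, the following minimal sketch replays a screening process in prioritized and in randomized order and records how many candidates had been screened when each relevant study was found. It is not the simulation code used in this study: the function name `screening_effort`, the toy relevance labels and the ranking scores are all hypothetical.

```python
import random


def screening_effort(relevance, order):
    """For each relevant article, return how many articles had been
    screened (inclusive) at the moment it was found."""
    effort = []
    for n_screened, idx in enumerate(order, start=1):
        if relevance[idx]:
            effort.append(n_screened)
    return effort


# Hypothetical toy data: 10 candidate articles, 3 of them relevant.
relevance = [False, True, False, False, True, False, False, False, True, False]
# Hypothetical ranking scores, standing in for the output of a ranking model.
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.05, 0.4, 0.15, 0.7, 0.25]

# Screening prioritization: screen the highest-scoring candidates first.
prioritized = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Randomized order: shuffle the candidates with a fixed seed.
rng = random.Random(0)
randomized = list(range(len(scores)))
rng.shuffle(randomized)

prioritized_effort = screening_effort(relevance, prioritized)
randomized_effort = screening_effort(relevance, randomized)
```

In this toy example the prioritized order finds all three relevant studies within the first three candidates, whereas the position of the last relevant study in the randomized order depends on the shuffle.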

Results: the main meta-analysis in each systematic review could have been performed after screening an average of 30% of the candidate articles (range: 0.07% to 100%) (Fig. 1). No systematic review would have required screening more than 2,308 studies, whereas manual screening would have required screening up to 43,363 studies. Despite leaving out 30% of the relevant articles, this procedure would have changed the estimates by only 1.3% on average (Fig. 1). For comparison, systematic review authors are free to choose between several software packages to calculate the summary estimates (e.g. NLMixed in SAS, meqrlogit/xtmelogit in Stata, or reitsma from the mada R package), and we have previously estimated the results to vary by around 2% on average depending on the choice of software used.

Discussion: we observe that with screening prioritization the summary estimates converge to their final values much more quickly and reliably than when screening in arbitrary order. In many cases the screening could be stopped early while keeping the estimation error within prespecified limits (Fig. 1).
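
To make the idea of a stopping criterion concrete, the sketch below stops once the summary estimates have stayed within a tolerance over the most recently included studies. It rests on strong simplifying assumptions and is not the criterion evaluated in this study: it uses naive pooling of 2x2 tables rather than a bivariate meta-analysis model, and the function names, the `tolerance`, `window` and `min_studies` parameters and the example tables are hypothetical. A stability rule of this kind is a heuristic and does not by itself guarantee a bound on the final estimation error.

```python
def pooled_estimates(tables):
    """Naive pooled (sensitivity, specificity) from (TP, FP, FN, TN) tuples.

    Assumes the pooled tables contain at least one diseased and one
    non-diseased participant, so neither denominator is zero.
    """
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    tn = sum(t[3] for t in tables)
    return tp / (tp + fn), tn / (tn + fp)


def stopping_point(tables_in_order, tolerance=0.02, window=3, min_studies=3):
    """Return the number of included studies after which both estimates
    varied by at most `tolerance` over the last `window` additions,
    or None if that never happens."""
    history = []
    for n in range(1, len(tables_in_order) + 1):
        history.append(pooled_estimates(tables_in_order[:n]))
        if n < min_studies + window:
            continue  # too few studies to judge stability
        recent = history[-(window + 1):]
        sens_span = max(s for s, _ in recent) - min(s for s, _ in recent)
        spec_span = max(p for _, p in recent) - min(p for _, p in recent)
        if sens_span <= tolerance and spec_span <= tolerance:
            return n
    return None


# Hypothetical included studies, in the order screening found them.
stable = [(80, 10, 20, 90)] * 8
unstable = [(90, 10, 10, 90), (10, 90, 90, 10)] * 3

stable_stop = stopping_point(stable)
unstable_stop = stopping_point(unstable)
```

With the identical toy tables the rule triggers as soon as enough studies have accumulated, while the strongly alternating tables never satisfy it.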

Conclusion: screening prioritization coupled with stopping criteria can reliably detect when the screening process can be safely stopped in diagnostic test accuracy reviews, and can therefore be used without having to sacrifice the reliability of a systematic review.

Patient and consumer involvement: we hope that in the long run patients will benefit from timely and complete reviews, but since this study is methodological and highly technical, we see no way to meaningfully involve patients.