Trustworthy stopping criteria for reliable work savings from machine learning prioritised screening

Authors
Callaghan M1, Müller-Hansen F1
1Mercator Research Institute on Global Commons and Climate Change
Abstract
Background: Systematic review is a vital tool for producing trustworthy evidence that improves patient outcomes. Screening studies requires repetitive human labour, and machine learning promises to generate labour savings by learning to recognise and prioritise relevant studies. However, many systems report theoretical maximum labour savings that are unreachable in realistic conditions without prior knowledge of the number of relevant studies. Several systems suggest stopping criteria that risk zero work savings or catastrophic failures to meet targeted levels of recall. Previous work by the author team has identified a reliable stopping criterion using the hypergeometric distribution, although it is overly conservative.
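
The sketch below illustrates the general shape of such a hypergeometric stopping rule: it computes a one-sided p-value for the null hypothesis that recall is still below a target, treating the most recent run of screened documents as if it were a random sample. The function name, variable names, and parameterisation are illustrative assumptions, not the published implementation.

```python
# A minimal sketch of a hypergeometric stopping rule for a recall target;
# illustrative only, not the authors' published implementation.
from math import floor
from scipy.stats import hypergeom

def p_recall_below_target(N, n_screened, r_seen, n_last, k_last, target=0.95):
    """One-sided p-value for the null hypothesis that recall < target.

    N           total documents in the dataset
    n_screened  documents screened so far (in ranked order)
    r_seen      relevant documents found among them
    n_last      size of the most recent run of screened documents, treated
                conservatively as a random draw without replacement
    k_last      relevant documents found within that run
    target      recall target, e.g. 0.95
    """
    # Smallest total number of relevant documents still consistent with
    # recall r_seen / K being below the target (the worst case under H0).
    K_null = floor(r_seen / target) + 1
    # Documents that were still unscreened before the last run started.
    pool = N - (n_screened - n_last)
    # Relevant documents that pool would contain under the null hypothesis.
    K_pool = K_null - (r_seen - k_last)
    # Probability of finding at most k_last relevant documents in a random
    # draw of n_last documents from that pool; screening can stop once this
    # falls below the chosen significance level.
    return hypergeom.cdf(k_last, pool, K_pool, n_last)

# Hypothetical example: 10,000 documents, 4,000 screened, 95 relevant found,
# none of them in the last 500 screened documents.
print(p_recall_below_target(10_000, 4_000, 95, 500, 0))
```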

Objectives: This paper presents an extension of our previous stopping criterion that uses biased urn theory to make the criterion less conservative without sacrificing its reliability.
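
As a rough illustration of the extension, the unbiased urn in the sketch above can be replaced with a biased urn in which relevant documents have higher odds of being drawn early, reflecting the ranking produced by the machine-learning model. The version below uses SciPy's Wallenius' noncentral hypergeometric distribution with a single odds parameter; this parameterisation is an assumption for illustration, not the paper's exact formulation.

```python
# A sketch of a biased-urn variant using Wallenius' noncentral hypergeometric
# distribution; illustrative assumptions, not the paper's exact formulation.
from math import floor
from scipy.stats import nchypergeom_wallenius

def p_recall_below_target_biased(N, n_screened, r_seen, n_last, k_last,
                                 target=0.95, odds=1.0):
    """As the unbiased sketch, but relevant documents are drawn with `odds`.

    odds > 1 encodes the assumption that machine-learning prioritisation
    makes relevant documents more likely to be screened early; odds = 1
    recovers the ordinary hypergeometric criterion.
    """
    K_null = floor(r_seen / target) + 1
    pool = N - (n_screened - n_last)
    K_pool = K_null - (r_seen - k_last)
    return nchypergeom_wallenius.cdf(k_last, pool, K_pool, n_last, odds)

# With a larger bias, the same observation provides stronger evidence against
# the null hypothesis, so screening can stop earlier (hypothetical numbers).
for odds in (1.0, 2.0, 5.0):
    print(odds, p_recall_below_target_biased(10_000, 4_000, 95, 500, 0,
                                             odds=odds))
```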

Methods: We demonstrate the reliability of the criterion by simulating machine-learning-assisted screening on evaluation datasets under realistic conditions, and we compare the work savings and recall it achieves with those of commonly suggested heuristic stopping criteria.

Results: We show that our biased-urn-based stopping criterion achieves work savings close to the theoretical maximum for a given dataset and classifier when the bias parameter can be accurately optimised. The stopping criterion enables reliable work savings that are independent of the classifier and dataset used.

Conclusions: Our stopping criterion offers a basis for capitalising on the labour-saving potential of machine-learning technologies in systematic review production without compromising the trustworthiness of the evidence produced. The criterion can be adjusted to the user's preferred recall target and level of confidence, and its results can be communicated simply as the rejection of a null hypothesis that the recall target has not been reached, making the use of machine-learning technologies transparent and comprehensible to non-specialists.