Background: Much is expected of the semi-automation of title and abstract screening, given the ever-increasing rate of scientific output. However, the reported experiences of users of existing machine learning systems have been mixed, especially in terms of achieved sensitivity. Arguably, an important source of this variability is that the percentage of citations that must be screened manually to achieve high sensitivity (the size of the training set) differs for every review. This parameter cannot be estimated a priori; it emerges only after exploring a subset of the data, for example in an active learning scenario. It remains unclear, however, how to improve the reliability of updates, especially in the context of living systematic reviews, which, owing to their frequency, require a more hands-off approach.
Objectives: To identify the review-specific factors that can lead to poor performance of machine learning models used for screening, and to design improved models that are free of these limitations.
Methods: We constructed a representative sample of 36 systematic reviews and performed a retrospective, simulated screening update using a re-implementation of the current state-of-the-art models, trained on 50% of each dataset and tested on the remaining half. For 10 of the reviews, we conducted an error analysis to determine the sources of poor performance.
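The evaluation described above rests on two standard screening metrics. As a minimal sketch (the function names and toy labels below are illustrative assumptions, not the authors' actual code), sensitivity and precision over a simulated update can be computed as follows:

```python
# Hedged sketch: sensitivity and precision for a simulated screening update.
# Labels use 1 = included citation, 0 = excluded citation.

def sensitivity(y_true, y_pred):
    """Recall over the included citations: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(y_true, y_pred):
    """Fraction of flagged citations that are truly included: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Toy example: 3 true inclusions, the model flags 4 citations.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
print(sensitivity(y_true, y_pred))  # 2 of 3 inclusions found -> ~0.667
print(precision(y_true, y_pred))    # 2 of 4 flagged are true -> 0.5
```

In screening, missing an included study is far costlier than screening an extra abstract, which is why sensitivity is the primary metric and precision is tolerated at low values.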
Results: The performance of the baseline models varied greatly, ranging from 33% to 100% in sensitivity and from 3% to 19% in precision. We attribute this variability to training set sizes that were insufficient for some of the reviews. The error analysis further showed that low recall was caused by a small number of included studies in the original review, complex inclusion criteria (e.g. indirect evidence), and topic drift (for instance, the appearance of a new intervention in the updated review). Based on these findings, we constructed a formalized framework for defining inclusion and exclusion criteria and designed a new machine learning model that can use this information. This model, inspired by recent advances in few-shot learning, differs from previous approaches in that it is not trained individually for every review but is pre-trained on a large set of existing reviews to meta-learn the specifics of the screening task. This direction has proved very promising, and although the work is still ongoing, we hope our results will encourage other groups to pursue it.
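The meta-learning setup described above is typically trained episodically: each training step treats one review as a small "screening task", with a handful of labelled citations as a support set and the rest as queries. The sketch below is a hypothetical illustration of such an episode sampler; the data structure, episode sizes, and names are assumptions, not the authors' implementation.

```python
import random

def sample_episode(reviews, n_support=5, n_query=10, rng=random):
    """Pick one review and split its labelled citations into a support set
    (examples the model adapts to) and a query set (examples it is scored on),
    mimicking a single few-shot screening task."""
    review_id = rng.choice(sorted(reviews))
    citations = list(reviews[review_id])  # (citation_id, included?) pairs
    rng.shuffle(citations)
    support = citations[:n_support]
    query = citations[n_support:n_support + n_query]
    return review_id, support, query

# Toy corpus of two reviews with different inclusion rates (25% vs 10%),
# echoing the variability in inclusion rates noted in the error analysis.
reviews = {
    "review_a": [("cite%d" % i, i % 4 == 0) for i in range(40)],
    "review_b": [("cite%d" % i, i % 10 == 0) for i in range(40)],
}
rid, support, query = sample_episode(reviews, rng=random.Random(0))
print(rid, len(support), len(query))
```

Pre-training on many such episodes, rather than fitting a fresh classifier per review, is what lets the model transfer to a new review from only a few labelled citations.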
Conclusions: Typical problems encountered when applying machine learning to screening may be caused by insufficient information being supplied to the model. For reviews with very low inclusion rates, it may never be feasible to train models on the included and excluded citations alone. We therefore propose to redefine the problem of screening automation to include other data, such as inclusion and exclusion criteria or the citation graph.
Patient or healthcare consumer involvement: None