Automating Screening of Studies for Systematic Reviews Using a Large Language Model

2024 Prague [Global Evidence Summit]

Xu Z¹, Teng L², Millard L¹, Higgins J¹, Martin R¹, Markozannes G², Tsilidis K², Chan D², Gaunt T¹, Liu Y¹

¹MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, UK

²Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, UK

"Background
Systematic reviews demand extensive human resources. The World Cancer Research Fund International Global Cancer Update Programme (WCRF CUP-Global) employs a team of experienced researchers to conduct high-quality and up-to-date systematic reviews on how diet, nutrition, and physical activity affect cancer incidence and survival. A significant challenge in this process is the substantial time and personnel needed for the initial screening of relevant studies, typically taking two reviewers about a month for each systematic review. Here, we utilise machine learning and Large Language Model (LLM) approaches to automate this screening process.

Objective
This study aims to reduce the manual labour needed for the study screening process by using LLM-based automated inference to assist reviewers from the CUP-Global team in decision-making. Specifically, we aim to develop a pipeline that predicts a study’s inclusion/exclusion status, together with the likelihood of that status, for both existing and new review topics.

Methods
We use the BlueBERT LLM as the foundation model, as it has been pre-trained for sophisticated language understanding in the biomedical domain. We further fine-tune BlueBERT models for our study screening task using CUP-Global and public domain records containing studies’ inclusion/exclusion status for different review topics. The pipeline involves a set of document classification models (Figure 1) and a generalised dense retrieval model (Figure 2). The document classification models are specifically optimised for a set of prioritised CUP-Global topics, ensuring consistent and robust inference for future updates of the existing key topics. Since the performance of the classification models depends on adequate training data for each topic, we also introduce a dense retrieval model into the pipeline. The retrieval model is trained on rich inclusion/exclusion data across wide-ranging systematic review topics, providing a unified and generalised model for new systematic review topics or topics without sufficient data. The pipeline’s predictions of inclusion/exclusion status are compared with human decisions in terms of accuracy, precision, and sensitivity.
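As a rough illustration of the retrieval-and-evaluation step described above, the following minimal sketch scores studies against a review topic and compares the resulting decisions with human labels. It is not the authors' pipeline: the two-dimensional vectors stand in for embeddings that a fine-tuned encoder would produce, and cosine similarity, the 0.65 threshold, and the function names `cosine_similarity`, `screen`, and `screening_metrics` are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def screen(topic_vec, study_vecs, threshold):
    """Score each study against the topic, then threshold the score
    into include (1) or exclude (0). The threshold is an assumed
    tuning parameter, not a value taken from the abstract."""
    scores = [cosine_similarity(topic_vec, v) for v in study_vecs]
    decisions = [1 if s >= threshold else 0 for s in scores]
    return scores, decisions


def screening_metrics(predicted, human):
    """Accuracy, precision, and sensitivity of predicted decisions,
    treating the human decisions as the reference standard."""
    pairs = list(zip(predicted, human))
    tp = sum(1 for p, h in pairs if p == 1 and h == 1)
    tn = sum(1 for p, h in pairs if p == 0 and h == 0)
    fp = sum(1 for p, h in pairs if p == 1 and h == 0)
    fn = sum(1 for p, h in pairs if p == 0 and h == 1)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy stand-ins for embeddings of one review topic and four studies.
topic_vec = [1.0, 0.0]
study_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.6, 0.8]]
human = [1, 0, 1, 1]  # hypothetical reviewer decisions

scores, decisions = screen(topic_vec, study_vecs, threshold=0.65)
metrics = screening_metrics(decisions, human)
print(decisions)
print(metrics)
```

In a screening setting, sensitivity is usually the metric to watch most closely, since a missed relevant study (a false negative) cannot be recovered later in the review, whereas a false positive only costs additional reviewer time.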

Statement on relevance and importance to patients: This study contributes to the development of automation methods that reduce the time taken to carry out systematic reviews, benefiting patients through more abundant and timely evidence-based guidelines and policies."