The use of natural language processing (NLP) for rapid literature screening in the update of systematic reviews: a comparison of four different models

Authors
Qin X1, Liu J1, Wang Y1, Li L1, Sun X1
1Chinese Evidence-Based Medicine Centre and Cochrane China Centre, West China Hospital, Sichuan University, Chengdu, Sichuan, China
Abstract
Background: Updating systematic reviews is an important mission of Cochrane. Literature screening accounts for a large proportion of the effort involved in updating systematic reviews (SRs). Natural language processing (NLP) technology may have great potential for improving the efficiency of literature screening when updating systematic reviews, particularly when the technology has learned from previously screened literature (i.e., a gold-standard set).

Objectives: To compare the performance of different NLP models used for literature screening in the update of systematic reviews.

Methods: In our earlier systematic review of randomized controlled trials (RCTs) of SGLT2 inhibitors for the treatment of type 2 diabetes (T2DM), we obtained 3460 de-duplicated reports by searching MEDLINE, EMBASE, and the Cochrane Central Register of Controlled Trials (CENTRAL) from inception to June 2019. Two methodologically trained reviewers, using explicit eligibility criteria, manually screened the titles and abstracts of these reports. We randomly divided the 3460 reports into training, development, and test sets at a ratio of 3:1:1. We first developed four supervised learning models (i.e., NLP models) using the training and development sets: BlueBERT-base uncased pre-trained on PubMed abstracts (BlueBUP), BlueBERT-base uncased pre-trained on PubMed abstracts and clinical notes (BlueBUPC), BERT-base cased (BBC), and BERT-base uncased (BBU). We then used the test set to evaluate the performance of the four NLP models in terms of precision (i.e., the fraction of true positives among the retrieved positive samples), recall (i.e., sensitivity), F1 score (i.e., the harmonic mean of precision and recall), accuracy, and the area under the receiver operating characteristic curve (AUROC).
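
For readers unfamiliar with this workflow, the sketch below shows how such a screening classifier can be fine-tuned with the Hugging Face transformers library. The checkpoint identifier, toy records, field names, and hyperparameters are illustrative assumptions, not the exact configuration used in this study.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# A BlueBERT-base uncased checkpoint pre-trained on PubMed abstracts (BlueBUP);
# the PubMed + clinical-notes variant (BlueBUPC) and the generic
# "bert-base-cased"/"bert-base-uncased" checkpoints are fine-tuned the same way.
MODEL_NAME = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy stand-ins for the screened records: title and abstract concatenated
# into one string, with label 1 = relevant report, 0 = irrelevant.
train_ds = Dataset.from_dict({
    "text": ["Empagliflozin versus placebo in type 2 diabetes ...",
             "An observational cohort study of statin use ..."],
    "label": [1, 0],
})
dev_ds = Dataset.from_dict({
    "text": ["Dapagliflozin randomized controlled trial in T2DM ...",
             "A narrative review of diabetic foot care ..."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
dev_ds = dev_ds.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0)
    # Stable softmax to get the probability of the "relevant" class for AUROC.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp[:, 1] / exp.sum(axis=-1)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy_score(labels, preds),
            "auroc": roc_auc_score(labels, probs)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="screening_model", num_train_epochs=3),
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # precision/recall/F1/accuracy/AUROC on the dev set
```

In practice, the development set guides hyperparameter choices, and the held-out test set is scored only once per model, as described above.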

Results: For the four NLP models (BlueBUPC, BlueBUP, BBC, and BBU), the precision scores were 0.767, 0.724, 0.724, and 0.728; the recall scores were 0.818, 0.869, 0.803, and 0.854; the F1 scores were 0.792, 0.815, 0.764, and 0.796; the accuracy scores were 0.915, 0.907, 0.902, and 0.919; and the AUROC values were 0.96, 0.95, 0.96, and 0.96, respectively. Detailed results are shown in Table 1, and the receiver operating characteristic (ROC) curves in Figure 1.
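
As a concrete illustration of how these metrics relate to the model outputs, the following sketch computes the same quantities with scikit-learn; the labels and probabilities are made-up placeholders, not the study data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve)

# Hypothetical test-set labels (1 = relevant report) and the model's
# predicted probabilities of relevance, standing in for the held-out test set.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.91, 0.12, 0.78, 0.45, 0.30, 0.05, 0.66, 0.52])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 decision threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("AUROC:    ", roc_auc_score(y_true, y_prob))  # threshold-independent

# Points of the ROC curve (false positive rate vs. true positive rate),
# of the kind plotted in Figure 1.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
```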

Conclusions: Our study showed that NLP may be a useful tool to assist literature screening when updating a systematic review, and that the BlueBUP model may be the preferred method, given its highest recall and good F1 score, both of which are essential in literature screening. This approach applies only to updating systematic reviews, and more validation studies are warranted.

Patient or healthcare consumer involvement: None.