Article type
Abstract
Background
Literature screening, a cornerstone of developing systematic reviews and clinical practice guidelines (CPGs), entails identifying studies that align with predefined eligibility criteria. Screening directly affects the credibility and validity of the resulting review or guideline. Conventional screening methods rely primarily on manual effort, a time-consuming and repetitive process. However, the emergence of large language models (LLMs) powered by natural language processing (NLP) offers novel opportunities for streamlined and potentially more efficient literature screening.
Objectives
To compare the accuracy of different LLMs for literature screening.
Methods
We identified a published systematic review (SR) from the high-quality medical literature, verifying that it reported its complete search strategy and the full results of its screening. We searched selected key literature databases using the same search strategy as the original SR and extracted the titles and abstracts of the identified records. Next, we used the following LLMs to screen the extracted titles and abstracts: GPT-3.5, GPT-4, Gemini, ERNIE Bot (Wen Xin Yi Yan), and GLM-10B-Chinese. Records retained at the title and abstract stage then underwent full-text screening. We repeated the process 5 times using both Chinese and English prompts tailored to the original inclusion criteria of the SR. Finally, we evaluated and compared the reliability and validity of each LLM's literature selection against the results of the original SR.
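The abstract does not specify how validity against the original SR was computed. As a minimal sketch (not the authors' code), one common approach treats the SR's inclusion decisions as the reference standard and reports sensitivity and specificity of each LLM's include/exclude calls; the record IDs and decisions below are hypothetical illustrations.

```python
# Sketch only: validity of an LLM screener vs. the original SR's decisions.
# True = include, False = exclude; all data below are hypothetical.

def screening_metrics(llm_decisions, sr_decisions):
    """Return (sensitivity, specificity) of LLM screening decisions,
    using the original SR's decisions as the reference standard."""
    tp = fp = tn = fn = 0
    for rec_id, truth in sr_decisions.items():
        pred = llm_decisions[rec_id]
        if truth and pred:
            tp += 1          # correctly included
        elif truth and not pred:
            fn += 1          # wrongly excluded (missed study)
        elif not truth and pred:
            fp += 1          # wrongly included
        else:
            tn += 1          # correctly excluded
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Hypothetical example: 5 records, SR included r1 and r2.
sr = {"r1": True, "r2": True, "r3": False, "r4": False, "r5": False}
llm = {"r1": True, "r2": False, "r3": False, "r4": True, "r5": False}
sens, spec = screening_metrics(llm, sr)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

For screening, sensitivity is usually the critical metric, since a missed eligible study cannot be recovered later in the review; reliability across the 5 repeated runs would be assessed separately (e.g., as agreement between runs).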
Results
We will present the results in detail at the Summit, including (1) the reliability of the different LLMs for screening titles and abstracts and full texts; (2) whether results differ between models and between Chinese and English prompts; and (3) whether the models' accuracy differs when screening Chinese- versus English-language literature.
Conclusions
This study provides insights into the reliability and validity of LLMs for selecting literature for SRs. We will analyze the advantages and shortcomings of LLM-based screening compared with traditional methods. Based on the results, we will propose recommendations for the optimized application of LLMs in literature screening, which could ultimately accelerate the development of CPGs and allow patients to benefit from the latest evidence as early as possible.