Abstract
Background
As the basis of health care practice, systematic reviews are experiencing a surge in demand, with a growing emphasis on expanding their volume and enhancing their methodological rigor. The rise of large language models (LLMs) offers the possibility of greatly improving the productivity and quality of systematic reviews, especially for labor-intensive processes such as data extraction.
Objective
To explore the feasibility and reliability of using LLMs to extract data from randomized controlled trials (RCTs).
Methods
We conducted a pilot feasibility study of 10 RCTs selected from published systematic reviews. We developed structured prompts to guide Claude (Claude-2) in extracting data from the 10 RCTs in accordance with the Cochrane Handbook, and we established a gold standard against which to evaluate accuracy at the overall, study-specific, domain-specific, and item-specific levels. In addition, we estimated the efficiency of data extraction by recording the mean time required per RCT.
Results
Across the 10 RCTs, Claude achieved an overall accuracy rate of 94.77% (95% CI 93.66% to 95.73%). At the domain-specific level, the "Others" domain showed the highest mean accuracy rate, at 100% (95% CI 83.16% to 100%), while the "Baseline characteristics" domain performed worst, with a mean accuracy rate of 77.97% (95% CI 72.72% to 82.64%). The remaining domains (Methods, Participants, Outcomes, and Data and Analysis) each maintained mean accuracy rates exceeding 95%. At the item-specific level, 65.52% of items (38/58) achieved a 100% accuracy rate and 20.69% (12/58) had accuracy rates over 90%; only 8 items (13.79%) had accuracy rates below 90%. The mean time required for data extraction was 88 seconds per RCT.
Conclusion
The structured prompts we developed are likely to facilitate efficient and accurate data extraction by Claude, which could greatly assist human reviewers in conducting systematic reviews.
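
For illustration, the sketch below shows one minimal way the pieces described above could fit together: a structured extraction prompt organized by the domains named in the abstract, and a Wilson score confidence interval for an accuracy rate. The prompt wording, item list, interval method, and counts are all assumptions added for illustration, not the authors' actual protocol or data.

```python
import math


def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion. The interval method
    is an assumption; the abstract does not state which one was used."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2))
    return center - half, center + half


# Hypothetical structured prompt: the domains follow those named in the
# abstract, but the exact wording and item list are illustrative only.
PROMPT_TEMPLATE = """You are assisting with systematic-review data extraction.
From the trial report below, extract the following items and answer in JSON.
Write "not reported" for any item the report does not provide.
- Methods: study design, randomization, allocation concealment, blinding
- Participants: sample size, setting, inclusion/exclusion criteria
- Baseline characteristics: per-arm demographics
- Outcomes: primary and secondary outcomes with time points
- Data and analysis: effect estimates and statistical methods

Trial report:
{full_text}
"""

# Hypothetical counts: 550 of 580 extracted items judged correct against
# the human gold standard (illustrative, not the paper's actual data).
low, high = wilson_ci(550, 580)
print(f"accuracy = {550 / 580:.2%}, 95% CI {low:.2%} to {high:.2%}")
```

In practice, the filled-in prompt would be sent to the model for each trial report, and the returned JSON would be compared item by item against the gold standard to produce the per-domain and per-item accuracy rates reported above.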