Exploring the use of a large language model for data extraction in systematic reviews: a rapid feasibility study

Authors
Campbell F1, Craig D1, Engelbert M2, Graziosi S3, Hair K4, Kapp C5, Khanteymoori A6, Schmidt L1, Thomas J3
1National Institute for Health and Care Research Innovation Observatory, Population Health Sciences Institute, Newcastle University
2International Initiative for Impact Evaluation (3ie)
3UCL Social Research Institute, University College London
4Centre for Clinical Brain Sciences, University of Edinburgh
5Institute for Quality and Efficiency in Health Care, Cologne, Germany
6Department of Neurosurgery, Neurocenter, Medical Center - University of Freiburg
Abstract
The emergence of large language models (LLMs) has stimulated much discussion about the potential of these and similar AI tools to promote evidence generation and use. One such use case is the acceleration of evidence synthesis in the form of systematic reviews. If LLMs can decrease the time and human effort required to produce systematic reviews, substantially more high-quality evidence synthesis could become available to decision-makers. However, careful testing of LLMs for accelerating systematic reviews is needed to ensure they can attain levels of reliability and accuracy comparable to those of human reviewers.

During the 2023 Evidence Synthesis Hackathon, we conducted 2 feasibility studies of using GPT-4, an LLM, to (semi)automate data extraction in systematic reviews. First, we used the LLM to automatically extract study characteristics from studies in the human clinical, animal, and social science domains, using 2 studies from each category for prompt development and 10 for evaluation. Second, we used the LLM to predict PICOs (participants, interventions, comparators, and outcomes) labeled within 100 abstracts in the EBM-NLP data set.
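To illustrate the kind of prompt-based extraction evaluated here, the sketch below shows one way the task could be framed against the OpenAI chat API. It assumes the OpenAI Python client (v1+); the characteristic list, prompt wording, and function name are illustrative and are not the exact prompts used in the study.

```python
# Minimal sketch of prompt-based study-characteristic extraction with GPT-4.
# Assumes the OpenAI Python client (v1+); the field list and prompt wording
# are illustrative, not the prompts used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FIELDS = ["study design", "population", "sample size", "intervention", "outcomes"]

def extract_characteristics(study_text: str) -> str:
    prompt = (
        "Extract the following study characteristics from the text below. "
        "Answer in JSON with one key per characteristic; use null if an item "
        "is not reported.\n"
        f"Characteristics: {', '.join(FIELDS)}\n\n"
        f"Text:\n{study_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduces (but does not eliminate) run-to-run variability
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```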

Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical studies, 80% for animal studies, and 72% for human social science studies). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), whereas outcomes were more challenging. We also found that the LLM’s responses were not entirely stable across multiple submissions of the same prompt, and that the order in which prompts were submitted made a substantial difference to the accuracy of responses.
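The stability issue can be probed with a simple repeat-submission check, sketched below. This is an illustrative approach rather than the evaluation code used in the study; it again assumes the OpenAI Python client (v1+) and uses exact-match agreement as a deliberately crude consistency metric.

```python
# Illustrative stability check: submit the same prompt several times and
# measure how often the model returns its single most common answer.
# Assumes the OpenAI Python client (v1+); exact-match agreement is a
# deliberately crude metric chosen for illustration.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stability_check(prompt: str, n_runs: int = 5) -> float:
    answers = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # determinism is not guaranteed even at temperature 0
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(response.choices[0].message.content.strip())
    # Proportion of runs that returned the single most common answer.
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_runs
```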

This work presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example, as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into systematic review procedures. Further research on stability and reliability in practical settings is warranted for each type of data that is processed by the LLM.