Exploring the use of a large language model for data extraction in systematic reviews: a rapid feasibility study

Authors
Campbell F1, Craig D1, Engelbert M2, Graziosi S3, Hair K4, Kapp C5, Khanteymoori A6, Schmidt L1, Thomas J3
1National Institute for Health and Care Research Innovation Observatory, Population Health Sciences Institute, Newcastle University
2International Initiative for Impact Evaluation (3ie)
3UCL Social Research Institute, University College London
4Centre for Clinical Brain Sciences, University of Edinburgh
5Institute for Quality and Efficiency in Health Care, Cologne, Germany
6Department of Neurosurgery, Neurocenter, Medical Center - University of Freiburg
Abstract
The emergence of large language models (LLMs) has stimulated much discussion about the potential of these and similar AI tools to promote evidence generation and use. One such use case is the acceleration of evidence synthesis in the form of systematic reviews. If LLMs can decrease the time and human effort required to produce systematic reviews, substantially more high-quality evidence synthesis could become available to decision-makers. However, careful testing of LLMs for accelerating systematic reviews is needed to ensure they can attain levels of reliability and accuracy comparable to those of human reviewers.

During the 2023 Evidence Synthesis Hackathon, we conducted 2 feasibility studies of using GPT-4, an LLM, to (semi)automate data extraction in systematic reviews. First, we used the LLM to automatically extract study characteristics from studies in the human clinical, animal, and social science domains, using 2 studies from each category for prompt development and 10 for evaluation. Second, we used the LLM to predict PICOs (participants, interventions, comparators, and outcomes) labeled within 100 abstracts in the EBM-NLP data set.
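To illustrate the kind of prompt-based extraction evaluated here, the sketch below shows one way the task could be framed against the OpenAI chat API. It assumes the OpenAI Python client (v1+); the characteristic list, prompt wording, and function name are illustrative and are not the exact prompts used in the study.

```python
# Minimal sketch of prompt-based study-characteristic extraction with GPT-4.
# Assumes the OpenAI Python client (v1+); the field list and prompt wording
# are illustrative, not the prompts used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FIELDS = ["study design", "population", "sample size", "intervention", "outcomes"]

def extract_characteristics(study_text: str) -> str:
    prompt = (
        "Extract the following study characteristics from the text below. "
        "Answer in JSON with one key per characteristic; use null if an item "
        "is not reported.\n"
        f"Characteristics: {', '.join(FIELDS)}\n\n"
        f"Text:\n{study_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduces (but does not eliminate) run-to-run variability
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```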

Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical studies, 80% for animal studies, and 72% for human social science studies). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), whereas outcomes were more challenging. We also found that the LLM’s responses were not entirely stable across multiple submissions of the same prompt, and that the order in which prompts were submitted made a substantial difference to the accuracy of responses.
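The stability issue can be probed with a simple repeat-submission check, sketched below. This is an illustrative approach rather than the evaluation code used in the study; it again assumes the OpenAI Python client (v1+) and uses exact-match agreement as a deliberately crude consistency metric.

```python
# Illustrative stability check: submit the same prompt several times and
# measure how often the model returns its single most common answer.
# Assumes the OpenAI Python client (v1+); exact-match agreement is a
# deliberately crude metric chosen for illustration.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stability_check(prompt: str, n_runs: int = 5) -> float:
    answers = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # determinism is not guaranteed even at temperature 0
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(response.choices[0].message.content.strip())
    # Proportion of runs that returned the single most common answer.
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_runs
```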

This work presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example, as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into systematic review procedures. Further research on stability and reliability in practical settings is warranted for each type of data that is processed by the LLM.