From testing to development: building and assessing a large language model for public health evidence synthesis

Authors
Simmons Z1, Fuller J1, Woolnough H1, Harris T1, Evans B1, Duval D1
1UK Health Security Agency, London, United Kingdom
Abstract
Background: Recent global health challenges, such as the COVID-19 pandemic, have underscored the need for rapid evidence synthesis to support public health decision-making. Preliminary testing of data extraction with Large Language Models (LLMs) such as ChatGPT has demonstrated promising capabilities to enhance capacity for delivering evidence synthesis. Encouraged by these initial results, and recognizing a gap not met by existing commercial solutions, the evidence review team and the data analytics and surveillance team within our public health organization have embarked on the development of a bespoke data extraction LLM.

Objectives: To develop and evaluate a bespoke data extraction LLM and assess its suitability for use in the evidence synthesis process.

Methods: A bespoke LLM tool is being developed to conduct data extraction for evidence synthesis in a public health context. The tool will use Retrieval-Augmented Generation (RAG) to perform data extraction, followed by text summarization, on predefined data extraction fields (such as study design) using open-source LLMs. An evaluation framework is being finalized that will include metrics for accuracy, reliability, and impact on review workflows. Human-completed data extractions from a range of observational and experimental study designs will serve as the comparative benchmark. If deemed suitable, the tool could be deployed to conduct initial data extraction, with the completed extraction quality-assured by a human reviewer. This approach facilitates an assessment of efficacy and of integration into existing processes, ensuring that technical benchmarks, standards of transparency, and methodological rigor in evidence synthesis are met.
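
The sketch below is a minimal illustration of this kind of retrieval-augmented extraction, not the tool itself: the field names, the TF-IDF retriever, and the `ask_llm` placeholder are assumptions made for the example, and the placeholder is where an open-source LLM would be plugged in. For each predefined extraction field, the most relevant passages of a study's full text are retrieved and the model is prompted to extract the field from that context only.

```python
# Illustrative sketch of RAG-style data extraction (not the authors' implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical extraction fields; the real tool uses predefined fields such as study design.
FIELDS = {
    "study_design": "What is the study design (e.g. cohort, case-control, randomised trial)?",
    "population": "What population was studied and how was it recruited?",
    "outcomes": "What primary outcomes were measured?",
}

def split_into_passages(full_text: str, window: int = 120) -> list[str]:
    """Naive fixed-size word windows; a production tool would use smarter chunking."""
    words = full_text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]

def retrieve(passages: list[str], query: str, k: int = 3) -> list[str]:
    """Rank passages against the field question using TF-IDF cosine similarity."""
    vectoriser = TfidfVectorizer(stop_words="english")
    matrix = vectoriser.fit_transform(passages + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    return [passages[i] for i in top]

def ask_llm(prompt: str) -> str:
    """Placeholder: wire up a locally hosted open-source LLM of your choice here."""
    raise NotImplementedError("Connect an open-source model to complete the sketch.")

def extract_fields(full_text: str) -> dict[str, str]:
    """Retrieve relevant context per field, then prompt the model to extract and summarise it."""
    passages = split_into_passages(full_text)
    extraction = {}
    for field, question in FIELDS.items():
        context = "\n\n".join(retrieve(passages, question))
        prompt = (
            "Using only the context below, answer the question concisely.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        extraction[field] = ask_llm(prompt)
    return extraction
```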

Expected Outcomes: The tool is still in development; however, initial results are promising. Evaluation is expected to be completed in March 2024, and the LLM's performance, assessed against the evaluation framework currently in development, will then be presented. We hypothesize that further testing will demonstrate the model's ability to help handle large volumes of data more efficiently.
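
As a concrete illustration of the benchmark comparison described in the Methods, the sketch below shows one simple field-level accuracy check against human-completed extractions. The data structures, the normalisation rule, and the example values are assumptions for this example; the actual evaluation framework is still being finalized.

```python
# Minimal sketch of field-level agreement with a human-completed benchmark (illustrative only).
from collections import defaultdict

def normalise(value: str) -> str:
    """Crude normalisation; a real framework would define field-specific matching rules."""
    return " ".join(value.lower().split())

def field_accuracy(benchmark: list[dict[str, str]], model: list[dict[str, str]]) -> dict[str, float]:
    """Proportion of studies where the model extraction matches the human benchmark, per field."""
    correct, total = defaultdict(int), defaultdict(int)
    for human_rec, model_rec in zip(benchmark, model):
        for field, human_value in human_rec.items():
            total[field] += 1
            if normalise(model_rec.get(field, "")) == normalise(human_value):
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

# Hypothetical example: two studies, with the model matching the human extraction in one of them.
human = [{"study_design": "Prospective cohort"}, {"study_design": "Randomised controlled trial"}]
llm = [{"study_design": "prospective cohort"}, {"study_design": "case-control study"}]
print(field_accuracy(human, llm))  # {'study_design': 0.5}
```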

Implications for Practice: The development of a bespoke LLM for data extraction addresses the need for increased capacity and pace in public health evidence synthesis. By evaluating the model's accuracy and operational impact, we aim to determine whether the bespoke LLM is suitable for integration into our evidence synthesis processes.