Preclinical Risk of Bias Assessment using Large Language Models

Authors
Wang Q1, Macleod M2
1National Science Library, Chinese Academy of Sciences, Beijing, China
2Centre for Clinical Brain Sciences, The University of Edinburgh, Edinburgh, United Kingdom
Abstract
Background
Risk of Bias (RoB) assessment is a critical step in preclinical systematic reviews, which in turn support translation from preclinical to clinical research. RoB assessment is time consuming, and the community has advocated the use of AI techniques to speed up the process. Previous work has demonstrated the advantage of applying neural networks and transformer-based models to automatically assess preclinical RoB items [1]. However, these are supervised methods which require substantial human effort to generate training sets. Large Language Models (LLMs) have brought significant transformations to many application scenarios, and might also contribute to RoB assessment.

Objectives
We sought to apply LLM techniques to the task of automatic risk of bias assessment in preclinical literature, which could speed the process of systematic review and substantially reduce human effort.

Methods
We use 784 full-text publications with randomisation and blinding labels, each annotated by multiple reviewers, as the gold standard, as reported in previous work [1]. We apply LangChain, a framework for developing applications powered by language models [2], to load and recursively split each full text, and store embeddings of the resulting text chunks in the vector database Chroma. We then conduct a similarity search with the query "random allocation" or "blinded assessment of outcome" to retrieve relevant text chunks. Without any training set or extra annotations, we build prompt templates from the retrieved text chunks with zero or two examples and feed them to ChatGLM3 (an open-source LLM) [3] to generate the final answer (yes/no). The whole pipeline is shown in Figure 1. We compare performance to the previous best approaches.
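The retrieval-augmented prompting pipeline described above can be sketched in a simplified, dependency-free form. The actual study uses LangChain's recursive splitter, Chroma with learned embeddings, and ChatGLM3; in this illustrative stand-in, a bag-of-words cosine similarity replaces the vector database, the example text and chunk sizes are invented for demonstration, and the final LLM call is left as a placeholder comment.

```python
# Simplified sketch of the retrieval-augmented RoB assessment pipeline.
# Bag-of-words cosine similarity stands in for Chroma; the LLM call
# (ChatGLM3 in the study) is indicated but not executed here.
import math
import re
from collections import Counter

def recursive_split(text, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks
    (cf. LangChain's RecursiveCharacterTextSplitter)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(text):
    """Toy embedding: lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks, query, k=2):
    """Return the k chunks most similar to the query (similarity search)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Zero-shot prompt template: retrieved context plus a yes/no question."""
    context = "\n".join(retrieve(chunks, query, k))
    return (f"Context:\n{context}\n\n"
            f"Does the study report {query}? Answer yes or no.")

# Hypothetical full-text fragment for illustration only.
paper = ("Animals were assigned to treatment groups using a random number "
         "table. Outcome assessment was performed by an investigator blinded "
         "to group allocation. Infarct volume was measured at 24 hours.")
chunks = recursive_split(paper, chunk_size=80, overlap=20)
prompt = build_prompt("random allocation", chunks)
# `prompt` would then be passed to the LLM,
# e.g. model.chat(tokenizer, prompt) for ChatGLM3.
```

A two-shot variant would simply prepend two worked question/answer examples to the same template; the study reports results for both zero- and two-example prompts.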

Results
Without any training set or extra annotations, the F1 scores achieved by our LLM pipeline on the test set are 70.2% for random allocation and 79.8% for blinded assessment of outcome (Table 1), which are respectively 12% and 2% lower than the performance of the previous best models, which were trained on more than 6000 annotated full-text publications.

Conclusions
Our study indicates the potential advantages of LLM approaches for RoB assessment in preclinical full-text publications. Such strategies could be extended to other RoB items and to specific domains.

Statement: No members of the public or consumers were involved in the study.