Automating risk-of-bias assessment with generative AI

Authors
Baldwin T1, Suster S2, Verspoor K3
1Department of Natural Language Processing, MBZUAI, United Arab Emirates
2School of Computing and Information Systems, The University of Melbourne, Australia
3School of Computing Technologies, RMIT University, Australia
Abstract
Background
Assessing the risk of bias (RoB) of studies included in systematic reviews is a critical component of medical evidence synthesis. While standard approaches to automating this task have typically relied on extensive training data to perform well, recent successes of generative artificial intelligence (AI) present an opportunity to achieve accurate predictions from task instructions alone, without large training sets. In this talk, we examine how accurately large language models (LLMs) can predict RoB when provided with the Cochrane RoB guidelines and input text from trial publications.

Methods
Following Cochrane's latest guidelines (RoB2), which were designed for human reviewers, we prepare instructions that are fed as input to LLMs, which then infer the risk associated with a trial publication. The LLMs receive either no examples (zero-shot) or a few examples (few-shot) of how to solve the task, or are adapted through task-specific fine-tuning on a labelled set. Our analysis covers a variety of general- and medical-domain LLMs.
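To make the prompting setup concrete, the sketch below illustrates zero- and few-shot RoB2 assessment through an OpenAI-compatible chat API. It is a minimal illustration under stated assumptions, not the study's actual implementation: the model name, the ROB2_GUIDELINES placeholder, and the assess_rob helper are all introduced here for exposition.

```python
# Minimal zero-/few-shot prompting sketch for RoB2 assessment.
# Assumptions: an OpenAI-compatible chat API; "gpt-4o" as a stand-in
# model name; ROB2_GUIDELINES and the few-shot examples are
# placeholders, not the actual instructions or data used in the study.
from openai import OpenAI

client = OpenAI()

# Placeholder for the RoB2 signalling questions and judgement criteria.
ROB2_GUIDELINES = "..."

def assess_rob(trial_text: str, few_shot: list[dict] | None = None) -> str:
    """Ask the model for a RoB2 judgement (e.g. low / some concerns / high)."""
    messages = [{"role": "system", "content": ROB2_GUIDELINES}]
    # Few-shot: prepend labelled examples as alternating user/assistant turns;
    # omit them (few_shot=None) for the zero-shot setting.
    for ex in few_shot or []:
        messages.append({"role": "user", "content": ex["trial_text"]})
        messages.append({"role": "assistant", "content": ex["label"]})
    messages.append({"role": "user", "content": trial_text})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```

Encoding labelled examples as alternating user/assistant turns is one common few-shot convention; packing them into a single instruction prompt is an equally valid alternative.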

Results
When given instructions alone, without any examples, LLMs fail to provide accurate predictions. Performance increases when a few task-specific examples are added. Adapting LLMs for RoB2 assessment on a larger labelled set is especially beneficial, bringing performance into a range similar to that of more data-intensive RoB1 approaches.

Conclusions
While our study attests to the difficulty of solving this task from assessment guidelines alone, using available labelled data increases the predictive performance of LLMs. We offer recommendations for practitioners seeking to integrate LLMs into their work.