Article type
Abstract
"Background: Since their debut in 2018 (1), large language models (LLMs) have been assessed for their application in research, particularly in evidence synthesis, where tasks such as screening are highly work-intensive. LLMs offer potential benefits by classifying evidence efficiently, saving time, and possibly reducing resource needs within research teams.
Objective: To test the capacity of openly available non-specifically trained models to classify abstracts of RCTs into drug or non-drug trials.
Methods: We included abstracts of RCTs published in rheumatology journals between 2009 and 2020. RCTs were classified into two categories, “drug” or “non-drug”, using two popular zero-shot text classification models 2 based on DeBERTa and BART, or using few-shot prompting 3 with LLaMA-2. The reference standard followed the FDA definition and was assessed by two reviewers (one with experience in evidence synthesis and one without); conflicts were resolved by a third reviewer. We calculated accuracy and the percentage of abstracts correctly categorized by the LLM in each class (the positive and negative predictive values, PPV and NPV).
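As an illustration of the zero-shot approach described above, the sketch below uses the Hugging Face `transformers` pipeline with the BART-large-MNLI model cited in the footnotes. The candidate labels and the example abstract text are illustrative assumptions, not the study's exact configuration.

```python
# Sketch of zero-shot binary classification of an RCT abstract with
# facebook/bart-large-mnli via the transformers pipeline API.
# Labels and input text are hypothetical examples, not the study's prompt.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

abstract_text = ("A randomized controlled trial of methotrexate versus "
                 "placebo in patients with rheumatoid arthritis.")
labels = ["drug intervention", "non-drug intervention"]

result = classifier(abstract_text, candidate_labels=labels)
# result["labels"] is sorted by score; the first entry is the predicted class
print(result["labels"][0])
```

The zero-shot pipeline needs no task-specific training: the MNLI-trained model scores each candidate label as a hypothesis entailed by the abstract, which is what makes openly available, non-specifically trained models usable here.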
Results: Of the 1055 eligible RCTs, 453 assessed drug and 602 non-drug interventions. The most common non-drug interventions were exercise therapy (147, 25%), procedures (144, 24%) and delivery of care (86, 14%). The zero-shot BART-based classification model achieved the highest accuracy in distinguishing drug from non-drug trials (Figure 1): accuracy 88% [95% CI: 86-90%], PPV 81% [77-84%], NPV 96% [94-97%]. Non-drug interventions misclassified as drugs were primarily procedures (e.g., intra-articular injections) or RCTs assessing food compounds, vitamins, or supplements. The model's categorization was slightly better than the non-expert reviewer's (accuracy 86% [84-88%], PPV 76% [72-89%], NPV 99% [98-100%]). Approaches based on LLaMA-2 and few-shot prompting did not yield better results and required significantly more computational power and tweaking.
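The metrics reported above (accuracy, PPV, NPV) follow directly from a 2x2 confusion matrix of model labels against the reference standard. The sketch below computes them from confusion-matrix counts; the counts used in the usage line are hypothetical, chosen only so they sum to a plausible total, and are not the study's data.

```python
# Accuracy, PPV and NPV from 2x2 confusion-matrix counts for a binary
# "drug" (positive) vs "non-drug" (negative) classification.
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, PPV, NPV) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    ppv = tp / (tp + fp)  # how often a "drug" call is correct
    npv = tn / (tn + fn)  # how often a "non-drug" call is correct
    return accuracy, ppv, npv

# Hypothetical counts (not the study's data), summing to 1055 abstracts:
acc, ppv, npv = classification_metrics(tp=400, fp=90, fn=53, tn=512)
print(f"accuracy={acc:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```

A high NPV, as reported for the BART-based model, means a "non-drug" prediction is rarely wrong, which is why the conclusions emphasize using the tool for exclusion decisions.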
Conclusions: The zero-shot LLM had adequate accuracy and was almost always correct when predicting the non-drug category (NPV: 96%). These tools could help streamline screening processes, mainly for exclusion criteria, or replace a non-expert reviewer.
2 https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
https://huggingface.co/facebook/bart-large-mnli