A Pretrained Language Model for Classification of Cochrane Plain Language Summaries on Conclusiveness of Recommendations

Mijatović A1, Ursić L1, Buljan I2, Marušić A1
1University of Split School of Medicine
2Department of Psychology, Faculty of Humanities and Social Sciences, University of Split
Background: A Cochrane Plain Language Summary (PLS) is a stand-alone summary of a Cochrane Review, used to disseminate health evidence to a non-research audience. PLSs can be categorized according to their level of conclusiveness, i.e., whether they contain conclusive recommendations on an intervention’s efficacy and safety. In the ever-growing field of natural language processing (NLP), machine learning models based on the Transformer architecture show excellent performance in text classification tasks.

Objectives: To fine-tune SciBERT, a pretrained deep learning language model, to classify PLSs according to three levels of conclusiveness: conclusive, inconclusive, and unclear.

Methods: Our data source was a dataset containing all Cochrane PLSs of systematic reviews of intervention studies published until 2019, already classified into nine categories of conclusiveness of effectiveness/safety evidence [1]. We merged these categories into three groups based on the strength of the conclusiveness: conclusive (0), inconclusive (1), and unclear (2). We fine-tuned SciBERT, a pretrained language model based on Bidirectional Encoder Representations from Transformers (BERT) and trained on 1.14M papers, mostly from the biomedical domain, for this classification task. The testbed was written in Python with the PyTorch framework, and the pretrained transformer model was taken from the HuggingFace transformers library. Evaluation was performed with the scikit-learn machine learning library, using metric functions from the sklearn.metrics module. The area under the receiver operating characteristic curve (AUCROC) was used to measure the model’s trade-off between sensitivity and specificity.
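
As an illustration of the per-category AUCROC described above, the following is a minimal sketch that treats each class in turn as the positive class (one-vs-rest). It is not the study's code: the evaluation there relied on sklearn.metrics, whereas this sketch uses only the Python standard library to show what the score measures. The function names, the toy labels, and the toy probabilities are all assumptions made for the example.

```python
# Illustrative sketch only; toy data, not results from the study.
# Labels follow the coding in the abstract: 0 = conclusive,
# 1 = inconclusive, 2 = unclear.
LABELS = {0: "conclusive", 1: "inconclusive", 2: "unclear"}


def binary_auc(is_positive, scores):
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half a win)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probabilities):
    """Return {class_id: AUC}, treating each class as 'positive' in turn."""
    return {
        c: binary_auc([y == c for y in y_true], [row[c] for row in probabilities])
        for c in LABELS
    }


# Hypothetical predicted class probabilities for six summaries.
y_true = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]

auc_per_class = one_vs_rest_auc(y_true, probabilities)
for class_id, auc in auc_per_class.items():
    print(f"{LABELS[class_id]}: AUCROC = {auc:.3f}")
```

With scikit-learn available, the same per-class value can be obtained by passing the binarized labels and the class's probability column to sklearn.metrics.roc_auc_score.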

Results: After only three epochs of training, the model achieved high prediction scores: AUCROC was 0.88, 0.91, and 0.94 for the conclusive, inconclusive, and unclear categories, respectively. Balanced accuracy on the validation dataset was 87%, indicating good classification performance.
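
Balanced accuracy is the mean of the per-class recalls, so a small class weighs as much as a large one. The sketch below illustrates this on toy data, under the assumption that the reported figure corresponds to the standard definition (as implemented by sklearn.metrics.balanced_accuracy_score). The function name and all values are hypothetical and are not taken from the study.

```python
# Illustrative sketch only; toy data, not results from the study.

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: each class contributes equally,
    regardless of how many examples it has."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced validation set: eight conclusive (0),
# two inconclusive (1), and two unclear (2) summaries.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2]

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

print(f"plain accuracy:    {plain_acc:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
```

In this toy case a single error in a minority class lowers balanced accuracy much more than plain accuracy, which is why the metric is informative for an unbalanced dataset.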

Conclusions: Despite a relatively small and unbalanced dataset, our model achieved very good performance. We expect it to improve further as we classify new PLSs published from 2020 to 2022 and add them to the training, testing, and validation datasets.

Patient, public, and/or healthcare consumer involvement: PLSs containing conclusive recommendations on an intervention’s safety and efficacy (whether positive or negative) will help patients obtain relevant information.