Automatic generation of comprehensive trial registers for specific health conditions

Authors
Marshall IJ1, Noel-Storr A2, Wallace BC3, Thomas J4
1King's College London
2University of Oxford/Cochrane Dementia
3Northeastern University
4EPPI-Centre, University College London
Abstract
Background: understanding health research is hampered by the lack of trial registers for important clinical conditions, and by the difficulty of keeping existing registers up to date. Here, we develop and evaluate methods for automating the production of topic-specific clinical trial registers.

Objectives: we seek to develop and evaluate automatic methods for: (1) updating an existing trial register; (2) creating a new trial register where some, but incomplete, data are available; and (3) creating a new trial register where no data are available.

Methods: to develop the system, we will use data from the Cochrane Register of Studies (CRS), restricted to the review groups whose published methods suggest that their registers are most likely to be comprehensive (Pregnancy and Childbirth, Airways, Stroke, Peripheral Vascular Diseases, and Dementia). We will add further groups’ registers if those groups publish methods for maintaining the register and have made their register data available in the CRS.

The CRS contains only relevant articles, whereas a machine learning system must be trained on both relevant and non-relevant examples. We will therefore start from a candidate set of all published randomized controlled trials (RCTs), obtained by running an existing machine learning classifier over the full contents of PubMed. Each RCT in this set will then be labeled ‘relevant’ or ‘irrelevant’ according to whether it appears in a review group’s register.
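
The labeling step described above can be sketched as a simple set-membership check. This is a minimal illustration only: the function name `label_candidates` and the identifiers are hypothetical, and it assumes that candidate RCTs and register records share a common identifier such as a PubMed ID.

```python
from typing import Dict, Iterable


def label_candidates(candidate_pmids: Iterable[str],
                     register_pmids: Iterable[str]) -> Dict[str, bool]:
    """Label each candidate RCT True (relevant) if it appears in the register."""
    register = set(register_pmids)
    return {pmid: pmid in register for pmid in candidate_pmids}


# Hypothetical identifiers, for illustration only.
candidates = ["1001", "1002", "1003", "1004"]
register = ["1002", "1004", "9999"]  # "9999" is absent from the candidate set

labels = label_candidates(candidates, register)
print(labels)
```

In this sketch, register records that are missing from the candidate set (here `9999`) receive no label at all; how such records would be handled in practice is not specified here.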

We will evaluate three classes of model: support vector machines, convolutional neural networks, and BERT (Bidirectional Encoder Representations from Transformers).
- For Objective 1 (updating an existing register): machine learning models will be trained on all of the data.
- For Objective 2 (some, but incomplete, data available): machine learning models will be trained on a small portion of the data.
- For Objective 3 (no data available): articles will be classified using an already-trained general PICO (population, intervention, comparator, outcome) model and automatic inference of MeSH (Medical Subject Headings) terms. The register will be defined by location in the MeSH tree.
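
One way to read "defined by location in the MeSH tree" is that an article belongs to the register when any of its inferred MeSH tree numbers falls under a chosen branch. The sketch below assumes that reading; the function names are hypothetical, and the tree numbers are illustrative and should be checked against the current MeSH release.

```python
from typing import Iterable


def in_subtree(tree_number: str, root: str) -> bool:
    """True if tree_number is the root itself or one of its descendants."""
    # Appending "." stops "C10.228.140.3800" from matching root "C10.228.140.380".
    return tree_number == root or tree_number.startswith(root + ".")


def belongs_to_register(article_tree_numbers: Iterable[str],
                        register_roots: Iterable[str]) -> bool:
    """True if any tree number of the article lies under any register root."""
    roots = list(register_roots)
    return any(in_subtree(t, r) for t in article_tree_numbers for r in roots)


# Illustrative branch root for a dementia register (an assumption).
roots = ["C10.228.140.380"]

print(belongs_to_register(["C10.228.140.380.100"], roots))  # descendant of the root
print(belongs_to_register(["C14.907.253"], roots))          # unrelated branch
```

Matching on the dotted prefix rather than on term names means that narrower terms added to the branch later are picked up without changing the register definition.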

Results: we will evaluate the precision and recall of each automated strategy on a recent portion (~2 years) of the generated dataset. We will report the overall accuracy of each system, together with learning curves, to estimate how much training data is needed to generate a clinical trial register with sufficient accuracy.
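
As a rough sketch of this evaluation, the code below holds out the most recent articles by publication date and computes precision and recall on sets of article identifiers. The function names, dates, and identifiers are hypothetical; the exact cutoff and the treatment of empty sets are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, Set, Tuple


def temporal_split(pub_dates: Dict[str, date],
                   cutoff: date) -> Tuple[Set[str], Set[str]]:
    """Split article IDs into (before cutoff, on or after cutoff)."""
    train = {pmid for pmid, d in pub_dates.items() if d < cutoff}
    test = {pmid for pmid, d in pub_dates.items() if d >= cutoff}
    return train, test


def precision_recall(predicted: Set[str],
                     relevant: Set[str]) -> Tuple[float, float]:
    """Precision = TP / predicted; recall = TP / relevant (0.0 if undefined)."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: two older articles for training, three recent for testing.
pub_dates = {
    "a": date(2014, 3, 1), "b": date(2015, 7, 9),
    "c": date(2017, 2, 2), "d": date(2017, 11, 30), "e": date(2018, 5, 15),
}
train_ids, test_ids = temporal_split(pub_dates, cutoff=date(2017, 1, 1))

predicted_relevant = {"c", "d"}   # what a classifier flagged in the test set
truly_relevant = {"d", "e"}       # what the register actually contains
precision, recall = precision_recall(predicted_relevant, truly_relevant)
print(precision, recall)
```

Repeating the same evaluation while training on progressively larger subsets of `train_ids` would give the points of a learning curve.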

Conclusions: clinical trial registers grouped by health specialty are key tools for the practice of evidence-based medicine, but are laborious to create and maintain. Successfully automating the process could allow registers to be created on new topics, and reduce the effort required to keep existing registers up to date.

Patient or healthcare consumer involvement: we are grateful to the volunteers of Cochrane Crowd, who were instrumental in developing the initial dataset of randomized controlled trials.