A fully automated pipeline for a living review of methods to (semi)automate data extraction

Authors
Schmidt L¹, Olorisade BK¹, McGuinness LA¹, Higgins JPT¹
¹ University of Bristol
Abstract
Background: Data extraction is one of the most time-consuming and complex tasks for the authors of systematic reviews (SRs). It is an area that holds promise for the application of machine learning and text mining. Information science and data science are constantly evolving, and there is a steady flow of new research in data mining, which supports the choice of conducting a living review in this topic area. However, living reviews are resource-intensive and require technological support to remain efficient and sustainable throughout their life cycles.

Objectives: (1) To conduct a living review of methods and tools for extracting specific items of information/data from reports of health research studies in order to (semi)automate parts of the systematic review process. (2) To develop fully automated, technology-backed workflows to assist with this living review throughout its life cycle.

Methods: Publications for this living review are regularly retrieved from Medline, Web of Science, IEEE, dblp and the computer science arXiv, using database APIs and Python and R libraries to search and scrape data. Screening of titles and abstracts is undertaken by two reviewers with the aid of machine learning algorithms, and takes place every two months. Eligible full texts are screened, and data related to design and quality of reporting are extracted for a cross-sectional analysis of the available evidence. Full review updates are planned at 6-month intervals if the amount of new evidence permits. For machine learning, we employ an ensemble of classical (SVM, LDA) and deep neural (BERT, XLM) methods.
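
To illustrate the API-based retrieval step, the following is a minimal sketch of how one source (arXiv) might be queried and parsed. It is not the pipeline's actual code: the search terms, the `cs.CL` category and the function names `build_arxiv_url` and `parse_atom` are assumptions made for this example. Only the public arXiv API endpoint and its `search_query`, `start` and `max_results` parameters are taken from the arXiv API itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
ATOM = "{http://www.w3.org/2005/Atom}"           # Atom namespace used in responses


def build_arxiv_url(terms, category="cs.CL", start=0, max_results=50):
    """Build a query URL (hypothetical helper).

    All terms must appear in the abstract; results are restricted to one
    arXiv category. The category default is an illustrative assumption.
    """
    abstract_part = " AND ".join(f'abs:"{t}"' for t in terms)
    query = f"cat:{category} AND ({abstract_part})"
    params = {"search_query": query, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_atom(feed_xml):
    """Turn an Atom feed string into a list of simple record dicts."""
    root = ET.fromstring(feed_xml)
    records = []
    for entry in root.findall(f"{ATOM}entry"):
        records.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Collapse line breaks and repeated spaces inside titles/abstracts.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", default="").split()),
        })
    return records
```

The URL returned by `build_arxiv_url` could be fetched on a schedule (for example with `urllib.request.urlopen`) and the response body passed to `parse_atom`. Other databases would need their own query builders and parsers.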

Results: The initial information retrieval is automated by the first two modules in our pipeline, which use APIs and database scraping to automate systematic searches of grey literature and information science databases that do not offer advanced search techniques in their interfaces. The third module in our pipeline applies an ensemble machine learning classifier based on our own and on other, previously published, machine learning architectures.
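
The abstract does not state how the individual classifiers are combined. As one possible reading, the sketch below uses weighted soft voting: each model supplies a probability that an abstract should be included, and the ensemble averages them. The function names, the model names in the usage note and the 0.3 threshold are assumptions made for illustration.

```python
def ensemble_scores(model_probs, weights=None):
    """Weighted average of per-model inclusion probabilities (soft voting).

    model_probs maps a model name to a list of P(include), one per abstract.
    weights optionally maps model names to non-negative weights.
    """
    names = list(model_probs)
    if not names:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = {name: 1.0 for name in names}
    n_items = len(model_probs[names[0]])
    if any(len(model_probs[name]) != n_items for name in names):
        raise ValueError("all models must score the same number of abstracts")
    total = sum(weights[name] for name in names)
    return [
        sum(weights[name] * model_probs[name][i] for name in names) / total
        for i in range(n_items)
    ]


def screen(scores, threshold=0.3):
    """Flag abstracts whose ensemble score reaches the threshold.

    A low threshold favours recall, which usually matters most in screening;
    the value 0.3 is an assumption, not a figure from this work.
    """
    return [score >= threshold for score in scores]
```

For example, `ensemble_scores({"svm": [0.9, 0.1], "bert": [0.7, 0.3]})` averages the two models per abstract, and `screen` then marks which abstracts would be passed on to the human reviewers.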

Conclusions: We present a fully automated information retrieval pipeline with an integrated, active-learning abstract screening system to support a living review throughout its life cycle (Figure 1). By re-using and integrating previously published classifiers into one ensemble we strive to reduce duplication of effort. The pipeline is modular, and the parts related to the search strategy, the searched databases, and the training of the machine learning models can be replaced when conducting a different living review.
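
The modularity described above can be sketched as a pipeline whose retrieval and scoring components are plain, swappable callables. This is an illustrative design under stated assumptions, not the authors' implementation: the `Pipeline` class, its field names and the title-based de-duplication are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]  # e.g. {"title": ..., "abstract": ...}


@dataclass
class Pipeline:
    """Hypothetical modular pipeline: every component can be replaced."""
    retrievers: List[Callable[[], List[Record]]]  # one callable per database
    scorer: Callable[[Record], float]             # returns P(include)
    threshold: float = 0.5

    def run(self) -> List[Record]:
        seen = set()
        candidates = []
        for retrieve in self.retrievers:
            for record in retrieve():
                # Assumption: de-duplicate on a normalised title.
                key = record["title"].casefold().strip()
                if key not in seen:
                    seen.add(key)
                    candidates.append(record)
        # Keep only records the classifier considers likely to be relevant.
        return [r for r in candidates if self.scorer(r) >= self.threshold]
```

A different living review would pass its own retriever functions and a scorer trained on its own screening decisions, leaving `run` unchanged.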

Patient or healthcare consumer involvement: No patients were involved in this research. We involved fellow systematic reviewers as stakeholders and aimed to integrate existing machine learning infrastructure into this project in order to reduce duplication of effort.