Text mining for screening efficiency? Testing within a Cochrane public health review

Article type
Authors
Weightman AL1, Baker PRA2, Thomas J3, Francis DP2, Lovie-Toon Y4, O'Mara-Eves A3
1Information Retrieval Methods Group, United Kingdom
2Public Health Review Group, Australia
3Pregnancy and Childbirth Group, United Kingdom
4None, Australia
Abstract
Background:
The requirement for dual screening of titles and abstracts to select papers for full-text examination can create a very large workload, particularly when the topic is complex and a broad search strategy is required, producing a large number of results. An automated system that reduces this burden while still assuring high accuracy could provide substantial efficiency savings within the review process.

Objectives:
To undertake a direct comparison of manual screening with a semi-automated process (priority screening) using a machine classifier. The research is being carried out as part of the current update of a population-level public health review.

Methods:
Authors have hand-selected studies for the review update, in duplicate, using the standard Cochrane Handbook methodology. A retrospective analysis will be reported that simulates a quasi ‘active learning’ process (whereby a classifier is repeatedly retrained on ‘manually’ labelled data) using different starting parameters. Tests will be carried out to see how far the composition and size of the training set affect classification performance, i.e. what percentage of papers would need to be screened manually to locate 100% of the papers included by the traditional manual method.
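
As a rough illustration of the kind of retrospective simulation described in the Methods, the sketch below re-ranks unscreened records after each ‘manual’ decision and reports the fraction of records screened by the time every included paper has been found. It is a minimal sketch under stated assumptions, not the study's classifier: the word log-odds scorer, the function names (`train`, `score`, `simulate_priority_screening`), the default seed and batch sizes, and the `TOY_RECORDS` data are all invented for illustration.

```python
import math
from collections import Counter


def train(labelled):
    """Fit per-word log-odds weights from (text, label) pairs.

    Hypothetical stand-in classifier with add-one smoothing; the abstract
    does not specify which machine classifier was used.
    """
    inc, exc = Counter(), Counter()
    for text, label in labelled:
        (inc if label else exc).update(set(text.lower().split()))
    n_inc = sum(1 for _, label in labelled if label)
    n_exc = len(labelled) - n_inc
    return {
        w: math.log((inc[w] + 1) / (n_inc + 2))
        - math.log((exc[w] + 1) / (n_exc + 2))
        for w in set(inc) | set(exc)
    }


def score(weights, text):
    """Sum the weights of the distinct words in a record; unseen words add 0."""
    return sum(weights.get(w, 0.0) for w in set(text.lower().split()))


def simulate_priority_screening(records, seed_size=2, batch_size=1):
    """Replay manual decisions in classifier-ranked order.

    records: list of (text, label), where label is the manual decision
    (1 = included, 0 = excluded). The first `seed_size` records act as the
    initial training set. Returns the fraction of all records screened
    (seed included) once every included record has been found.
    """
    total_includes = sum(label for _, label in records)
    screened = list(records[:seed_size])
    pool = list(records[seed_size:])
    found = sum(label for _, label in screened)
    while found < total_includes and pool:
        weights = train(screened)  # retrain on everything labelled so far
        pool.sort(key=lambda r: score(weights, r[0]), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        screened.extend(batch)  # "screen" the top-ranked records
        found += sum(label for _, label in batch)
    return len(screened) / len(records)


# Invented toy data, for illustration only (not from the review).
TOY_RECORDS = [
    ("community physical activity intervention trial", 1),
    ("drug dosage pharmacokinetics in mice", 0),
    ("community walking programme intervention", 1),
    ("protein folding simulation", 0),
    ("mice gene expression study", 0),
    ("city wide physical activity campaign", 1),
    ("drug trial pharmacokinetics adults", 0),
    ("protein expression in mice", 0),
]

if __name__ == "__main__":
    print(simulate_priority_screening(TOY_RECORDS))
```

Varying `seed_size`, `batch_size`, or which records form the seed corresponds loosely to the different starting parameters and training sets mentioned above.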

Results:
From a search retrieval set of 9555 papers, authors excluded 9494 papers at the title/abstract stage and a further 52 at full text, leaving nine papers for inclusion in the review update. The ability of the machine classifier to reduce the percentage of papers that need to be screened manually to identify all the included studies, under different training conditions, will be reported.
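
The screening flow above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the Results; the variable names and the `percent_screened` helper are hypothetical, and no classifier results are assumed.

```python
# Counts taken directly from the Results paragraph.
RETRIEVED = 9555
EXCLUDED_TITLE_ABSTRACT = 9494
EXCLUDED_FULL_TEXT = 52

# Papers that went forward to full-text assessment.
full_text_assessed = RETRIEVED - EXCLUDED_TITLE_ABSTRACT
# Papers remaining for inclusion in the review update.
included = full_text_assessed - EXCLUDED_FULL_TEXT
# Proportion of retrieved papers that were ultimately included.
inclusion_rate = included / RETRIEVED


def percent_screened(n_screened, n_total=RETRIEVED):
    """Percentage of the retrieval set screened manually (hypothetical helper)."""
    return 100.0 * n_screened / n_total


if __name__ == "__main__":
    print(full_text_assessed, included, f"{100 * inclusion_rate:.3f}%")
```

With so few included papers relative to the retrieval set, the percentage of records that must be screened to find all of them is the natural quantity to compare across training conditions.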

Conclusions:
The findings of this study will be presented with an estimate of any efficiency gains for the author team if the screening process can be semi-automated using text-mining methodology, together with a discussion of the implications for text mining in screening papers within complex health reviews.