Sense and sensibility: automated clustering to code studies across health and social care systematic reviews

Authors
Brunton G¹, Stansfield C¹, O’Mara-Eves A¹, Hauari H², Kavanagh J¹, Thomas J¹, Oliver S¹
¹ EPPI Centre, Institute of Education, UK
² Thomas Coram Research Unit, Institute of Education, University of London
Abstract
Background: Rapid systematic reviews for health and social care policy development often address complex concepts and must be completed to short deadlines. Classifying a review’s included studies requires careful development and piloting of a coding tool; further, the broad scope of these reviews can mean that more references are located than can be coded in the time available. The possibility of using text mining technology to semi-automate the coding process, and so save time, is clearly attractive.

Objectives: To assess the ability of automated clustering to provide rapid, accurate descriptive codes for a review’s included studies.

Methods: The CarrotSearch.com Lingo3G document clustering utility was applied to the text of included studies’ titles and abstracts across four systematic reviews, on life checks, incentives, protected areas and community engagement. Text was automatically extracted and clustered, grouping together documents that share similar words. Each set of documents was ‘labelled’ with a code describing the group. ‘Outlier’ studies were located and coded manually. Across reviews, the codes were descriptively and reflectively analysed in terms of the process, the time taken, the utility of hierarchical levels and the sensibility of the codes. In two reviews, the predictive ability of clustering was tested against manual coding developed (1) with a previously used tool and (2) in an independent parallel review.
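Although Lingo3G itself is proprietary, the general workflow described above (vectorising titles and abstracts, grouping similar documents, and attaching a descriptive label to each group) can be illustrated with open-source components. The sketch below is an illustrative approximation only, using scikit-learn TF-IDF vectors, agglomerative clustering and simple top-term labels; it does not reproduce the Lingo3G algorithm, and names such as cluster_abstracts are hypothetical.

    # Illustrative sketch: group study titles/abstracts into labelled clusters
    # with TF-IDF and agglomerative clustering. This approximates the idea of
    # document clustering described in Methods; it is not the Lingo3G algorithm.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    def cluster_abstracts(texts, n_clusters=5, n_label_terms=3):
        """Cluster documents; return per-document labels and crude cluster 'codes'."""
        vectoriser = TfidfVectorizer(stop_words="english", max_features=5000)
        X = vectoriser.fit_transform(texts).toarray()
        labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)

        terms = np.array(vectoriser.get_feature_names_out())
        codes = {}
        for c in range(n_clusters):
            centroid = X[labels == c].mean(axis=0)          # mean TF-IDF profile
            top = centroid.argsort()[::-1][:n_label_terms]  # highest-weighted terms
            codes[c] = ", ".join(terms[top])                # crude label for the cluster
        return labels, codes

    if __name__ == "__main__":
        abstracts = [
            "Financial incentives to encourage smoking cessation in adults",
            "Community engagement interventions for childhood obesity prevention",
            "Protected areas and biodiversity outcomes in conservation policy",
            "NHS life checks and cardiovascular risk screening uptake",
        ]
        labels, codes = cluster_abstracts(abstracts, n_clusters=2)
        for doc, lab in zip(abstracts, labels):
            print(lab, "|", codes[lab], "|", doc[:50])

In practice such clusters would be generated over hundreds or thousands of records, and the automatic labels would be reviewed and renamed by researchers, mirroring the manual interpretation step reported in the Results.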

Results: Initial evidence suggested that clustering requires less time than manual coding but more focused researcher input. A large proportion of the codes generated were nonsensical and required manual interpretation. Using two hierarchical cluster levels was useful. Clustering identified the same health topics as manual coding and sometimes identified new concepts previously unknown to the reviewers. Clustering also built reviewers’ topic knowledge and facilitated discussion between researchers and stakeholders.

Conclusions: Clustering is a promising technique for the accurate and efficient categorisation of studies in systematic reviews, allowing researcher time to be used more effectively. These methods are currently being tested further.