Link checking and cleaning: preparing records for Linked Data

Authors
Dooley G1, Anstee D1, Foxlee R2
1Metaxis Ltd, United Kingdom
2CEU, United Kingdom
Abstract
Background:
Cochrane authors have been linking references and studies together in reviews for 20 years. The Cochrane Register of Studies (CRS) takes advantage of this to give every Cochrane Review Group a set of linked records as part of its specialised register. With the growing importance of Linked Data to the Cochrane data architecture, those links among references, studies and reviews are crucial for ensuring accurate navigation across reviews that share common studies. Because review groups and authors report references to studies in different ways, it is difficult to detect when studies in different reviews are actually the same study. Metaxis was commissioned to clean up those links and ensure that the data are correct before Linked Data is rolled out.

Objectives:
The objectives of the project were:
1) to ensure that correct links among references, studies and reviews are detected and maintained;
2) to recognise where studies in different reviews refer to the same trial; and
3) to ensure that correct links are cascaded through to the CRS data store, and thus to segments and specialised registers, and to Linked Data APIs.
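
To illustrate the second objective, the sketch below groups studies from different reviews that point at the same reference. It is a minimal illustration rather than the project's actual routine: the function name group_studies, the identifiers in the example and the rule that "two studies sharing any reference are the same trial" are all assumptions made here for clarity.

```python
from collections import defaultdict


def group_studies(study_refs):
    """Group study IDs that share at least one reference key.

    study_refs maps a (hypothetical) study ID to the set of reference
    keys cited under that study. Sharing is treated as transitive, so
    it is resolved with a small union-find structure.
    """
    parent = {study: study for study in study_refs}

    def find(study):
        # Follow parent pointers to the group root, halving the path as we go.
        while parent[study] != study:
            parent[study] = parent[parent[study]]
            study = parent[study]
        return study

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Assumed rule: the first study seen for a reference anchors its group.
    first_study_for_ref = {}
    for study, refs in study_refs.items():
        for ref in refs:
            if ref in first_study_for_ref:
                union(first_study_for_ref[ref], study)
            else:
                first_study_for_ref[ref] = study

    groups = defaultdict(set)
    for study in study_refs:
        groups[find(study)].add(study)
    return list(groups.values())
```

Called with three studies where two of them cite the same reference key, the function returns two groups: one holding the two studies that share a reference and one holding the remaining study. In practice a shared reference is only evidence of a common trial, so a real pipeline would probably combine it with other checks.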

Results:
RevMan records yielded 314,943 studies representing 376,156 references. A series of routines was devised to match references and studies together and to link those to CRS references. The automatic routines found 129,850 studies that matched at least one other study, and 1,872 were flagged as 'likely matches' and passed for manual scrutiny.
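
The split between automatic matches and 'likely matches' can be pictured as a two-threshold classifier over a similarity score. The sketch below is an assumption-laden illustration, not the routines used in the project: the normalisation rules, the use of Python's difflib.SequenceMatcher as the similarity measure, and both threshold values are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical thresholds; the abstract does not state the real cut-offs.
AUTO_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.80


def normalise(citation):
    """Lower-case a citation and collapse punctuation and whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", citation.lower())
    return " ".join(text.split())


def classify_pair(citation_a, citation_b):
    """Return 'match', 'likely match' or 'no match' for two citations."""
    a, b = normalise(citation_a), normalise(citation_b)
    if a == b:
        # Conventional step: exact match after normalisation.
        return "match"
    # Fuzzy step: character-level similarity in the range 0..1.
    score = SequenceMatcher(None, a, b).ratio()
    if score >= AUTO_THRESHOLD:
        return "match"
    if score >= REVIEW_THRESHOLD:
        # Close but not certain: hand over for manual scrutiny.
        return "likely match"
    return "no match"
```

Under these assumptions, two citations that differ only in punctuation and capitalisation are accepted automatically, while pairs scoring between the two thresholds are queued for a human reviewer. Keeping that middle band narrow is what keeps the manual workload manageable.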

Conclusions:
Matching studies that come from different reviews, with different study names and different numbers of references to each study, is difficult to achieve automatically. This project illustrates how a wide variety of conventional and probabilistic matching techniques, combined with a manageable level of human intervention, can solve what seem like intractable matching problems.