Can artificial intelligence learn to identify systematic reviews on the effectiveness of public health interventions?

Authors
Read K¹, Husson H¹, Dobbins M¹
¹National Collaborating Centre for Methods and Tools (NCCMT)
Abstract
Background: Health Evidence™ aims to make it easier for public health professionals and decision-makers to use evidence in their programs and policies. We provide access to over 6,000 critically appraised systematic reviews on the effectiveness of public health interventions. As the number of reviews published each year continues to grow, maintaining this registry is becoming more resource-intensive. On average, 8,000-10,000 records are screened each month to identify around 50 relevant reviews that are critically appraised and uploaded to the registry. Artificial intelligence (AI) may be one way to ensure that maintaining this registry remains feasible.

Objectives: To explore whether AI can be used to accurately and efficiently conduct monthly relevance screening for the Health Evidence™ registry.

Methods: Using the DistillerSR platform, we uploaded a large reference set (n=4,584 relevant reviews and 18,699 not relevant reviews) to train the Distiller Artificial Intelligence System (DAISY) with the Health Evidence™ relevance criteria. The team trained DAISY on 70% of the labelled training set and had the platform score the articles. We then established an exclusion threshold based on the lowest score at which not relevant reviews were correctly identified. To test this threshold, we applied DAISY to the remaining 30% (n=9,985) of uploaded articles and to two additional sets from two monthly updates (month A = 7,917 records; month B = 7,848 records). We calculated the percentage of reviews DAISY automatically excluded and compared the predicted results with our manual screening of the monthly updates to identify classification errors.
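
The sketch below illustrates one plausible reading of this threshold-setting and evaluation logic, assuming DAISY-style relevance scores (0-1) have been exported alongside the manual screening labels. The column names and pandas workflow are illustrative only and are not the DistillerSR API.

```python
# Minimal sketch of the exclusion-threshold logic described above.
# Assumes an exported table with a "score" column (DAISY relevance score)
# and a "label" column ("relevant" / "not relevant" from manual screening).
import pandas as pd

def find_exclusion_threshold(train: pd.DataFrame, margin: float = 0.02) -> float:
    """Set the threshold just below the lowest score observed for a
    manually included (relevant) review, so no relevant review would be
    auto-excluded on the training data."""
    lowest_relevant_score = train.loc[train["label"] == "relevant", "score"].min()
    return lowest_relevant_score - margin

def evaluate(test: pd.DataFrame, threshold: float) -> dict:
    """Apply the threshold to a test set and compare against manual screening."""
    auto_excluded = test["score"] < threshold
    false_exclusions = auto_excluded & (test["label"] == "relevant")
    return {
        "pct_auto_excluded": 100 * auto_excluded.mean(),
        # false exclusions as a share of all screened records
        "false_exclusion_rate_pct": 100 * false_exclusions.mean(),
    }
```

With the figures reported below, this rule would place the threshold at 0.47 when the lowest safely excludable score observed was 0.49.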

Results: Using the 70% training set, the team identified an exclusion threshold of 0.47 (0.49 was the lowest score at which not relevant reviews were correctly identified). Applying DAISY to the additional test sets automatically excluded 24% of the records. Compared with the manual screening results, the false exclusion rate for these sets was 0.02%. On average, the Health Evidence™ team manually screens approximately 500 records per hour; using this estimate, the AI functionality in DistillerSR could save up to four hours of manual screening per month. Next steps are to test additional monthly update sets for which manual screening results are available and to explore both inclusion and exclusion threshold options.
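
As a rough check on the "up to four hours" figure, the arithmetic below reproduces it from the reported 24% exclusion rate and the 500 records/hour screening rate; the ~7,900-record monthly volume is approximated from the two test months.

```python
# Back-of-envelope reproduction of the reported time saving.
records_per_month = 7_900                   # approx. monthly update size (months A and B)
auto_excluded = 0.24 * records_per_month    # ~1,896 records removed by DAISY
screening_rate = 500                        # records manually screened per hour
hours_saved = auto_excluded / screening_rate
print(f"Estimated saving: {hours_saved:.1f} hours/month")  # ~3.8 hours
```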

Conclusions: The use of AI technology shows promise for helping to automate the Health Evidence™ monthly update process and for improving the feasibility of maintaining a registry of systematic reviews relevant to public health.