Abstract
Background: The presence of errors in search strategies affects the validity of findings in systematic reviews in public health. As many as 90% of search strategies in published systematic reviews contain at least one error, and the error rate within a single search can reach as high as 93%. To mitigate these errors, information scientists obtain peer review from other information scientists. However, because peer review is time-intensive, carries a degree of subjectivity, and is itself prone to human error, using generative artificial intelligence (GenAI) may circumvent these barriers and improve the quality of search strategies.
Objectives: To compare error detection rates between GenAI and two human peer-reviewers across existing search strategies of published systematic reviews.
Methods: We will introduce random errors (spelling and Boolean operator errors) into existing searches using error-generating software. Search strategies will be submitted to ChatGPT, Gemini, and Claude, and to two human information scientists, for review. The independent variable will be peer review with two levels (human and AI), while the outcome (proportion of errors detected relative to the errors present in the search) will be the average difference in error-detection rate between human and AI peer-review performance. Human peer-reviewers will be blind to each other's search strategies. The time it takes peer-reviewers to review strategies will be recorded.
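The error-injection step described above could be sketched roughly as follows. This is a hypothetical illustration only: the protocol does not specify the error-generating software, and the function names and error types here (dropping a character to simulate a typo, swapping one Boolean operator for another) are assumptions for demonstration.

```python
import random

BOOLEAN_OPERATORS = ["AND", "OR", "NOT"]

def inject_spelling_error(term: str, rng: random.Random) -> str:
    """Drop one random character from a search term to simulate a typo."""
    if len(term) < 2:
        return term
    i = rng.randrange(len(term))
    return term[:i] + term[i + 1:]

def inject_boolean_error(line: str, rng: random.Random) -> str:
    """Swap one Boolean operator in a search line for a different one,
    if any operator is present; otherwise return the line unchanged."""
    tokens = line.split()
    op_positions = [i for i, t in enumerate(tokens) if t in BOOLEAN_OPERATORS]
    if not op_positions:
        return line
    i = rng.choice(op_positions)
    tokens[i] = rng.choice([op for op in BOOLEAN_OPERATORS if op != tokens[i]])
    return " ".join(tokens)

rng = random.Random(42)
print(inject_spelling_error("vaccination", rng))
print(inject_boolean_error("vaccination AND (hesitancy OR refusal)", rng))
```

Seeded errors generated this way are known in advance, which makes the denominator for each reviewer's detection rate unambiguous.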
Results: Results of peer review by the human reviewers and by each AI tool will be compared to determine the proportion of errors each reviewer identified, and which tool(s) correctly identified all errors. Because of the anticipated small sample size, we will use the non-parametric Fisher's exact test (alpha level of .05) to compare the proportion of errors detected between AI and human peer review. Each point estimate will be reported with a 95% confidence interval.
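The planned comparison can be illustrated with a minimal sketch of Fisher's exact test on a 2×2 table of detected versus missed errors, one row per reviewer type. The counts below are hypothetical, not study data; the implementation sums hypergeometric probabilities no greater than that of the observed table, which is the standard two-sided form of the test.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def hyper(x: int) -> float:
        # P(X = x) under the hypergeometric distribution with fixed margins
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = hyper(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    eps = 1e-12  # tolerate floating-point ties with the observed probability
    return sum(hyper(x) for x in range(lo, hi + 1) if hyper(x) <= p_obs + eps)

# Hypothetical counts: AI detects 18 of 20 seeded errors, human detects 15 of 20.
p = fisher_exact_two_sided(18, 2, 15, 5)
print(f"p = {p:.3f}")
```

With these illustrative counts the difference would not reach the .05 alpha level, which is why the protocol's emphasis on reporting confidence intervals alongside each point estimate matters for interpreting small samples.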
Conclusions: If generative AI can perform peer review of search strategies with an accuracy that matches or exceeds that of a human, it could help to improve the quality of literature searches carried out for systematic reviews. The ability to use GenAI in place of human peer review may increase capacity and save time in situations where rapid evidence production is important (e.g., in pandemic situations).