Abstract
Background
Much of medical research requires assessment of all relevant previously published articles; often many thousands are assessed for relevance by at least two experts. Feasibility often means that only articles in English are considered and that the complete list of search results is first screened by title and abstract. Only a select few articles pass this first hurdle to be screened later as full texts. Rapidly developing AI methods show promise as multilingual and scalable alternatives to human screeners, meeting skepticism even as they gain ground in the scientific community.
Disagreements between human and model decisions are often taken as evidence of model mistakes, while disagreements between human experts are accepted magnanimously, with the incidence of human mistakes fundamentally underestimated.
Objectives
We bring to the forefront more realistic measures of mistakes made by human reviewers: both “visible” mistakes (a lack of consensus between reviewers) and “invisible” mistakes (false unanimous inclusions and exclusions). We also assess to what extent reviewer decisions can be attributed to common (but unobservable) criteria adopted by reviewer teams.
Methods
We use detailed decision-level data from projects screened entirely by human experts, and we compare human performance with commercially available abstract screening tools such as DistillerSR and EPPI Reviewer.
Results
The “visible” errors of human reviewers often amount to 10% or more of the abstracts selected for inclusion. There is a strong individual component in inclusion decisions: some reviewers are “easier graders” than others, and reviewers tend to place significant weight on criteria that cannot be identified as common to their teams. When judging AI reviewing capabilities with the standard accuracy, precision, specificity, and sensitivity measures (using human decisions as the “truth”), only precision is significantly better for human reviewers than for the models, although, to be fair, the models learned about inclusion from both unanimous and split-decision inclusions.
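To make the comparison concrete, the sketch below shows how these standard metrics are computed when human (reference) decisions are treated as the truth. It is a minimal illustration: the function name, variable names, and example data are illustrative and are not taken from the study.

```python
# Minimal sketch of the standard screening metrics, assuming binary
# include/exclude labels and treating human reference decisions as the "truth".
# Names and example data are illustrative, not from the study.

def screening_metrics(reference, predicted):
    """Return accuracy, precision, sensitivity, and specificity.

    reference: list of bools, True = reference (human) decision was "include"
    predicted: list of bools, True = compared reviewer/model decided "include"
    """
    tp = sum(r and p for r, p in zip(reference, predicted))          # true inclusions
    tn = sum((not r) and (not p) for r, p in zip(reference, predicted))  # true exclusions
    fp = sum((not r) and p for r, p in zip(reference, predicted))     # false inclusions
    fn = sum(r and (not p) for r, p in zip(reference, predicted))     # false exclusions

    accuracy = (tp + tn) / len(reference)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, precision, sensitivity, specificity

# Illustrative usage with made-up decisions for five abstracts
human = [True, False, False, True, False]
model = [True, True, False, True, False]
print(screening_metrics(human, model))
```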
Conclusions
A better understanding of human mistakes can lead to more precise inclusion criteria, improved coordination among reviewers, greater acceptance of AI helpers, and, ultimately, better and more comprehensive literature reviews that in turn yield more accurate conclusions by researchers.