Article type: Abstract
"Background
Decisionmaking requires contextual and background information in addition to the comparative effectiveness evidence derived from systematic reviews. There is no established methodology for addressing context questions in systematic reviews. Artificial intelligence (AI) may help draft the editorial structure and content of answers to context questions in systematic reviews.
Objectives
The project aimed to investigate the feasibility of AI-generated responses to context questions in systematic reviews and to compare the AI-generated answers with human-generated content from published systematic reviews.
Methods
We assessed the performance of three large language models (Bard/Gemini, ChatGPT, and Claude) in providing answers to 55 context questions in 20 systematic reviews published by the US Agency for Healthcare Research and Quality Evidence-based Practice Center (AHRQ EPC) program. The original context questions were used to prompt each AI tool with minimal prompt engineering. We established evaluation criteria a priori: face validity (i.e., is the answer sufficient?), content accuracy (i.e., the number of factual errors), and congruence (i.e., is the answer concordant with the human-generated content?). Two independent reviewers rated the characteristics of the context questions and the performance of each AI tool.
Results
We documented favorable results regarding the feasibility of AI-generated answers to context questions. Using the original context questions as prompts produced relevant responses, and an initial review suggested that face validity can be established across questions and report topics. However, responses differed across AI models in word count, editorial structure, and content. Performance may be further improved by refining prompts to elicit a pre-specified response for a target audience and by providing citations to support AI-generated content. Limitations of AI-generated answers include the inability of AI tools to transparently document their information sources and to reproduce answers, which are specific to the user and timestamp. Implications of incorporating AI into the overall workflow of producing systematic reviews will be discussed.
Conclusions
The project documents the feasibility of using AI tools to generate answers to context questions in AHRQ EPC systematic reviews; however, methodologic limitations, particularly the need for transparency in conducting systematic reviews, should be considered.