Abstract
Background: The use of large language models (LLMs) in evidence synthesis is increasing rapidly, driven by the surge in published evidence and the recent availability of advanced LLMs.
Objectives: This study investigated the performance of GPT-3.5 Turbo in conducting primary screening across different types of systematic literature reviews (SLRs).
Methods: We provided GPT-3.5 Turbo with three independent sets of screening rules for the primary screening of 200 studies each from an economic burden, a humanistic burden, and an epidemiological SLR. The decisions made by a human reviewer served as the reference standard for assessing the performance of GPT-3.5 Turbo. The assessment criteria were the decision match rate (identical inclusion and exclusion decisions between the human reviewer and GPT-3.5 Turbo) and the sensitivity score (correct inclusions by GPT-3.5 Turbo relative to the human reviewer).
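For illustration only, the two metrics can be computed from paired human/LLM decisions as in the minimal sketch below; the function names and sample data are hypothetical and are not the authors' code.

```python
# Hypothetical sketch: decision match rate and sensitivity from paired
# screening decisions. The "include"/"exclude" labels and sample lists
# below are assumptions for illustration, not the study's actual records.

def decision_match_rate(human, llm):
    """Share of articles where the LLM decision equals the human decision."""
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)

def sensitivity(human, llm):
    """Correct LLM inclusions relative to all human inclusions, i.e. TP / (TP + FN)."""
    true_pos = sum(h == "include" and m == "include" for h, m in zip(human, llm))
    human_inclusions = sum(h == "include" for h in human)
    return true_pos / human_inclusions

human = ["include", "exclude", "include", "exclude", "exclude"]
llm   = ["include", "exclude", "exclude", "exclude", "exclude"]

print(decision_match_rate(human, llm))  # 0.8 -> 80% decision match rate
print(sensitivity(human, llm))          # 0.5 -> half of human inclusions recovered
```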
Results: The human reviewer screened all 600 articles, whereas GPT-3.5 Turbo screened 581 articles and did not return a decision for the remaining 19. GPT-3.5 Turbo achieved decision match rates of 98.9%, 88.6%, and 75.5% for the economic burden, humanistic burden, and epidemiological SLRs, respectively; the corresponding sensitivity scores were 0.67, 0.61, and 0.94. In scenario analyses, the performance metrics of GPT-3.5 Turbo varied substantially with amendments to the screening rules.
Conclusions: The findings of this study demonstrate the potential of GPT-3.5 Turbo for primary screening in SLRs. However, its performance was not uniform across scenarios; researchers should exercise caution and conduct thorough scenario analyses to understand the model's strengths and limitations in different contexts. Future research should focus on refining the model, addressing domain-specific challenges, and exploring ways to enhance its adaptability across research contexts.