Inter-rater reliability of AMSTAR-2 in a review of systematic reviews about interventions to prevent adverse events in the intensive care unit

2023 London

Pantoja PE¹, Suclupe S¹, Requeijo C², Salas K³, Merchán A⁴, Uya J⁵, Martinez-Zapata MJ⁶

¹Iberoamerican Cochrane Centre

²Clinical Epidemiology and Public Health Service. Hospital de la Santa Creu i Sant Pau. Institut de Recerca IIB Sant Pau

³Vall D'Hebron University Hospital

⁴Iberoamerican Cochrane Centre. Department of Social Medicine and Family Health, Universidad Del Cauca, Colombia

⁵Hospital Universitario de Bellvitge, Instituto Catala de Salut, Nursing Research Group, Bellvitge Institute for Biomedical Research

⁶Iberoamerican Cochrane Centre-IIB Sant Pau. CIBERESP

Background:
AMSTAR-2 (A MeaSurement Tool to Assess systematic Reviews) is a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. It has 16 items (7 critical and 9 non-critical). Discordances between reviewers are usually quantified with a kappa statistic and resolved by a third reviewer, which demands additional time and effort from researchers completing an overview.

Objectives:
To evaluate the inter-rater reliability of AMSTAR-2, measured as percentage agreement and the weighted kappa statistic.

Methods:
We assessed methodological quality with the AMSTAR-2 tool in an overview of systematic reviews about interventions to prevent adverse events in the intensive care unit (1). The study team was divided into pairs to evaluate 38 systematic reviews. We measured inter-rater agreement between the reviewers in each pair. The weighted kappa for agreement between pairs of raters was calculated and compared by study and by AMSTAR-2 item.
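The two agreement measures named above can be illustrated with a minimal Python sketch. It is not the analysis code used in this study. The response scale ("No" < "Partial Yes" < "Yes"), the linear disagreement weights, the function names, and the example ratings are all assumptions made for illustration; the abstract does not state the weighting scheme or the software used.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of items on which the two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * same / len(rater_a)


def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters on an ordered scale.

    `categories` lists the response options from lowest to highest.
    `weights` is "linear" or "quadratic" (an assumed choice here).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    n = len(rater_a)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions: rows are rater A, columns are rater B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions of each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    d_obs = sum(disagreement(i, j) * observed[i][j]
                for i in range(k) for j in range(k))
    d_exp = sum(disagreement(i, j) * row[i] * col[j]
                for i in range(k) for j in range(k))
    if d_exp == 0:
        # Both raters used one and the same category throughout:
        # kappa is undefined, so signal it rather than guess a value.
        return float("nan")
    return 1.0 - d_obs / d_exp


# Hypothetical ratings of one AMSTAR-2 item across four reviews.
SCALE = ["No", "Partial Yes", "Yes"]
reviewer_1 = ["Yes", "Partial Yes", "No", "Yes"]
reviewer_2 = ["Yes", "Yes", "No", "Partial Yes"]

print(percent_agreement(reviewer_1, reviewer_2))
print(round(weighted_kappa(reviewer_1, reviewer_2, SCALE), 3))
```

Weighted kappa penalises a "Yes" versus "No" disagreement more than a "Yes" versus "Partial Yes" disagreement, which is why it suits ordered response scales better than the unweighted statistic.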

Results:
Agreement between reviewers was high (77.6%), with a good strength of agreement (kw = 0.65, p < .01). The results were consistent for critical items (74.3%, kw = 0.64, p < .01) and non-critical items (80.9%, kw = 0.62, p < .01). The critical items with the least agreement were those on risk of bias and on the assessment of heterogeneity in non-randomised studies (items 9.2 and 11.2, respectively). The non-critical items with the least agreement were the explanation of the selection of study designs and the detailed description of the included studies (items 3 and 8, respectively).

Conclusions:
Our results are in line with the AMSTAR-2 development and validation study (2). The levels of agreement achieved by the pairs of raters varied across items, but they were moderate to substantial for most items. Differences between raters reflect the demanding nature of some item-level judgments and should prompt group discussion of their causes and importance and, if needed, consultation with experts in subject matter and methods.
Reviewers need prior training in the AMSTAR-2 instrument to maximise consensus when they apply it independently.

Patient, public, and/or healthcare consumer involvement: No.

References:
1. Suclupe et al. Aust Crit Care. 2022; S1036-7314(22)00237-5; doi:10.1016/j.aucc.2022.11.003
2. Shea et al. BMJ. 2017; 358:j4008; doi:10.1136/bmj.j4008