Inter-rater reliability of AMSTAR-2 in a review of systematic reviews about interventions to prevent adverse events in the intensive care unit

Authors
Pantoja PE1, Suclupe S1, Requeijo C2, Salas K3, Merchán A4, Uya J5, Martinez-Zapata MJ6
1Iberoamerican Cochrane Centre
2Clinical Epidemiology and Public Health Service. Hospital de la Santa Creu i Sant Pau. Institut de Recerca IIB Sant Pau
3Vall D'Hebron University Hospital
4Iberoamerican Cochrane Centre. Department of Social Medicine and Family Health, Universidad Del Cauca, Colombia
5Hospital Universitario de Bellvitge, Instituto Catala de Salut, Nursing Research Group, Bellvitge Institute for Biomedical Research
6Iberoamerican Cochrane Centre-IIB Sant Pau. CIBERESP
Abstract
Background:
AMSTAR-2 (A MeaSurement Tool to Assess systematic Reviews) is a critical appraisal tool for systematic reviews that include randomised studies of healthcare interventions, non-randomised studies, or both. It has 16 items (7 critical and 9 non-critical). Discordances between reviewers are usually quantified with a kappa statistic and resolved by a third reviewer, which demands time and effort from researchers completing an overview.

Objectives:
To evaluate the inter-rater reliability of AMSTAR-2 using the weighted kappa statistic.

Methods:
We assessed methodological quality with the AMSTAR-2 tool in an overview of systematic reviews about interventions to prevent adverse events in the intensive care unit (1). The study team was divided into pairs to evaluate 38 systematic reviews. We measured inter-rater agreement between the reviewers in each pair. The weighted kappa score for agreement between pairs of raters was calculated and compared by study and by AMSTAR-2 item.
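As an illustration of the agreement measures named above, the following minimal Python sketch computes percentage agreement and a weighted kappa for one pair of raters. It is not the analysis code used in this study. The ordinal coding of the AMSTAR-2 responses ("No" < "Partial Yes" < "Yes"), the choice of linear weights, the function names and the example ratings are all assumptions made for the example.

```python
from collections import Counter

# Assumed ordinal coding of AMSTAR-2 responses (not stated in the abstract).
CATEGORIES = ["No", "Partial Yes", "Yes"]


def percent_agreement(rater_a, rater_b):
    """Share of ratings on which both raters chose the same category."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def weighted_kappa(rater_a, rater_b, categories=CATEGORIES):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of ratings")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Joint and marginal counts of the two raters' category choices.
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    obs_dis = 0.0  # observed weighted disagreement
    exp_dis = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            obs_dis += w * observed[(i, j)] / n
            exp_dis += w * (marg_a[i] / n) * (marg_b[j] / n)
    if exp_dis == 0:
        # Both raters used one identical category; kappa is formally
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return 1.0 - obs_dis / exp_dis


# Hypothetical ratings of six reviews on a single AMSTAR-2 item.
rater_1 = ["Yes", "Yes", "No", "Partial Yes", "No", "Yes"]
rater_2 = ["Yes", "Partial Yes", "No", "Partial Yes", "Yes", "Yes"]
print(round(100 * percent_agreement(rater_1, rater_2), 1))
print(round(weighted_kappa(rater_1, rater_2), 4))
```

Linear weights penalise a "Yes" versus "No" disagreement twice as much as a "Yes" versus "Partial Yes" disagreement; quadratic weights would be an equally plausible choice and would give different values.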

Results:
Agreement between reviewers was high (77.6%), with a good strength of agreement (kw = 0.65, p < .01). These results were consistent for critical and non-critical items (74.3%, kw = 0.64, p < .01; and 80.9%, kw = 0.62, p < .01, respectively). The critical items with the least agreement were those referring to the risk of bias and the assessment of heterogeneity in non-randomised studies (items 9.2 and 11.2, respectively). The non-critical items with the least agreement were the explanation of the selection of study designs and the detailed description of the included studies (items 3 and 8, respectively).

Conclusions:
Our results are in line with the AMSTAR-2 development and validation study (2). The levels of agreement achieved by the pairs of raters varied across items, but they were moderate to substantial for most items. Differences between raters reflect the demanding nature of some item-level judgments and should prompt group discussion of their causes and importance and, if needed, consultation with experts in subject matter and methods.
Prior training of the reviewers in the AMSTAR-2 instrument is necessary so that there is maximum consensus when applying it individually.

Patient, public, and/or healthcare consumer involvement: No.

References:
1. Suclupe et al. Aust Crit Care. 2022; S1036-7314(22)00237-5; doi:10.1016/j.aucc.2022.11.003
2. Shea et al. BMJ. 2017;358:j4008; doi:10.1136/bmj.j4008