The reliability, usability, and applicability of tools to appraise quality and risk of bias in systematic reviews: a prospective evaluation of AMSTAR, AMSTAR 2 and ROBIS

Authors
Gates M1, Gates A1, Duarte G2, Cary M3, Becker M4, Prediger B4, Vandermeer B1, Fernandes RM5, Pieper D4, Hartling L1
1Alberta Research Centre for Health Evidence (ARCHE), Department of Pediatrics, University of Alberta
2Clinical Pharmacology Unit, Instituto de Medicina Molecular, University of Lisbon
3Centre for Health Evaluation & Research (CEFAR), National Association of Pharmacies
4Institut für Forschung in der Operativen Medizin, Department für Humanmedizin, Universität Witten/Herdecke
5Clinical Pharmacology Unit, Instituto de Medicina Molecular, University of Lisbon & Department of Pediatrics, Santa Maria Hospital, Lisbon
Abstract
Background: readers of systematic reviews (SRs) and overview authors require valid, reliable, and practical means to evaluate the methodological quality and risk of bias of SRs. Evidence of the comparative reliability, usability, and applicability of common tools will inform how each should be used and interpreted.

Objective: to evaluate and compare the inter-rater and inter-centre reliability, usability, and applicability of AMSTAR, AMSTAR 2, and ROBIS.

Methods: using a random sample of 30 SRs of randomized trials, two review authors at each of three collaborating centres (Canada, Germany, Portugal) independently appraised the methodological quality or risk of bias of each SR using AMSTAR, AMSTAR 2, and ROBIS, then reached consensus. Using Gwet’s AC1 statistic, we assessed inter-rater reliability between the pairs of review authors at each centre and inter-centre reliability between the centres' consensus decisions. To estimate usability, we calculated the median (interquartile range (IQR)) time to complete the appraisals and reach consensus. To inform applications of the tools, we tested for associations between methodological quality or risk of bias and the results and conclusions of the SRs.
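
For readers unfamiliar with the statistic, the following is a minimal sketch of Gwet's AC1 for two raters and nominal categories, using the standard textbook formula. It is an illustration only and not the analysis code used in this study; the function name `gwet_ac1` and the example ratings are hypothetical.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters scoring the same items on a nominal scale.

    Illustrative sketch: AC1 = (pa - pe) / (1 - pe), where pa is the observed
    agreement and pe = sum_k pi_k * (1 - pi_k) / (q - 1), with pi_k the mean
    proportion of ratings in category k across both raters and q the number
    of categories.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    if categories is None:
        # Assumption: the scale consists of the categories actually observed.
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two possible categories")

    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Pooled category counts across both raters (2n ratings in total).
    counts = Counter(rater_a) + Counter(rater_b)
    # Chance-agreement probability under Gwet's model.
    pe = sum((counts[k] / (2 * n)) * (1 - counts[k] / (2 * n)) for k in categories) / (q - 1)
    return (pa - pe) / (1 - pe)


# Hypothetical ratings of one yes/no item across four reviews.
example = gwet_ac1(["yes", "yes", "yes", "yes"], ["yes", "yes", "yes", "no"],
                   categories=["yes", "no"])
```

Passing `categories` explicitly matters when one response option is never chosen, because the number of categories enters the chance-agreement term.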

Results: the median (IQR) time for review authors to complete the assessments was 15.7 (11.3), 19.7 (12.1), and 28.7 (17.4) minutes for AMSTAR, AMSTAR 2, and ROBIS, respectively. The median (IQR) time to reach consensus was 2.6 (3.2), 4.6 (5.3), and 10.9 (10.8) minutes for AMSTAR, AMSTAR 2, and ROBIS, respectively. Inter-rater reliability varied by centre, but across all centres was substantial (AC1 0.61 to 0.80) to almost perfect (AC1 0.81 to 0.99) for 8/11 (73%) AMSTAR, 8/16 (50%) AMSTAR 2, and 13/24 (54%) ROBIS items. Inter-centre reliability was substantial to almost perfect for 6/11 (55%) AMSTAR, 12/16 (75%) AMSTAR 2, and 10/24 (42%) ROBIS items. Agreement on confidence in the results of the review (AMSTAR 2) ranged from slight (AC1 0.05, 95% confidence interval (CI) −0.17 to 0.27) to perfect (1.00) between review authors and from moderate (AC1 0.58, 95% CI 0.30 to 0.85) to substantial (AC1 0.74, 95% CI 0.30 to 0.85) across centres. Agreement on overall risk of bias in the SR (ROBIS) ranged from moderate (AC1 0.47, 95% CI 0.17 to 0.77) to almost perfect (AC1 0.96, 95% CI 0.89 to 1.00) between review authors and from poor (AC1 −0.21, 95% CI −0.55 to 0.13) to moderate (AC1 0.56, 95% CI 0.30 to 0.83) across centres. There was no clear relationship between centre-specific appraisals and the results or conclusions of the SRs.
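
The verbal labels attached to the AC1 values above are consistent with the Landis and Koch benchmark scale. The sketch below shows one way such a mapping could be written. The "substantial" and "almost perfect" cut-offs are those stated in the results; the remaining bands are assumed from the conventional Landis and Koch scale, and the function name `agreement_label` is hypothetical.

```python
def agreement_label(ac1):
    """Map an agreement coefficient to a Landis and Koch style label.

    Illustrative sketch: bands other than 'substantial' (0.61 to 0.80) and
    'almost perfect' (0.81 to 0.99) are assumed from the conventional scale.
    """
    if ac1 < 0:
        return "poor"
    if ac1 >= 1.0:
        return "perfect"
    # Upper bounds of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper, label in bands:
        if ac1 <= upper:
            return label
    return "almost perfect"


# Point estimates reported in the results, paired with their labels.
labels = {value: agreement_label(value) for value in (-0.21, 0.05, 0.58, 0.74, 0.96, 1.00)}
```

Labels of this kind describe the point estimate only; the wide confidence intervals reported above mean that several ratings are compatible with more than one band.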

Conclusions: compared with AMSTAR 2 and ROBIS, review authors completed AMSTAR appraisals most quickly and reached at least substantial agreement on a greater proportion of items (8/11). Inter-centre reliability was highest for AMSTAR 2. Low levels of inter-centre reliability, particularly on overall AMSTAR 2 and ROBIS ratings, may limit readers’ ability to interpret the ratings applied by review groups. Improved documentation may be needed to assist review authors in consistently interpreting and applying each tool.

Patient or consumer involvement: none