The reliability, usability, and applicability of tools to appraise quality and risk of bias in systematic reviews: a prospective evaluation of AMSTAR, AMSTAR 2 and ROBIS

2019 Santiago

Gates M¹, Gates A¹, Duarte G², Cary M³, Becker M⁴, Prediger B⁴, Vandermeer B¹, Fernandes RM⁵, Pieper D⁴, Hartling L¹

¹Alberta Research Centre for Health Evidence (ARCHE), Department of Pediatrics, University of Alberta

²Clinical Pharmacology Unit, Instituto de Medicina Molecular, University of Lisbon

³Centre for Health Evaluation & Research (CEFAR), National Association of Pharmacies

⁴Institut für Forschung in der Operativen Medizin, Department für Humanmedizin, Universität Witten/Herdecke

⁵Clinical Pharmacology Unit, Instituto de Medicina Molecular, University of Lisbon & Department of Pediatrics, Santa Maria Hospital, Lisbon

Background: readers of systematic reviews (SRs) and overview authors require valid, reliable, and practical means to evaluate the methodological quality and risk of bias of SRs. Evidence of the comparative reliability, usability, and applicability of common tools will inform how each should be used and interpreted.

Objective: to evaluate and compare the inter-rater and inter-centre reliability, usability, and applicability of AMSTAR, AMSTAR 2, and ROBIS.

Methods: using a random sample of 30 SRs of randomized trials, two review authors at each of three collaborating centres (Canada, Germany, Portugal) independently appraised the methodological quality or risk of bias of each SR using AMSTAR, AMSTAR 2, and ROBIS and reached consensus. We assessed inter-rater reliability between the pair of review authors at each centre, and inter-centre reliability between the centres’ consensus decisions, using Gwet’s AC1 statistic. To estimate usability, we calculated the median (interquartile range (IQR)) time to complete the appraisals and reach consensus. To inform applications of the tools, we tested for associations between methodological quality or risk of bias and the results and conclusions of the SRs.
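
As an illustration of the two summary statistics named above, the following is a minimal sketch, not the authors' analysis code. It assumes the standard two-rater form of Gwet's AC1 on a nominal scale, with chance agreement computed from the average of the two raters' category proportions. The function names `gwet_ac1` and `median_iqr`, the example ratings, and the quartile convention (the default of Python's `statistics.quantiles`) are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median, quantiles


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters scoring the same items on a nominal scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    if categories is None:
        # Assumption: infer the response options from the observed ratings.
        categories = sorted(set(rater_a) | set(rater_b))
    k = len(categories)
    if k < 2:
        raise ValueError("AC1 needs at least two response categories")
    n = len(rater_a)

    # Observed agreement: share of items given the same rating by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum of pi * (1 - pi) over categories, divided by (k - 1),
    # where pi is the mean of the two raters' proportions for that category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    chance = 0.0
    for category in categories:
        pi = (counts_a[category] + counts_b[category]) / (2 * n)
        chance += pi * (1 - pi)
    chance /= k - 1

    return (observed - chance) / (1 - chance)


def median_iqr(minutes):
    """Return (median, IQR width) for a list of appraisal times in minutes."""
    # Assumption: quartiles follow the default convention of statistics.quantiles.
    q1, _, q3 = quantiles(minutes, n=4)
    return median(minutes), q3 - q1


# Hypothetical ratings for one item across ten SRs (not data from the study).
reviewer_1 = ["yes"] * 8 + ["no", "no"]
reviewer_2 = ["yes"] * 8 + ["no", "yes"]
print(round(gwet_ac1(reviewer_1, reviewer_2), 2))
print(median_iqr([10, 12, 14, 16, 18, 20, 22]))
```

Passing `categories` explicitly matters when a response option was available but never used by either rater, because the number of categories enters the chance-agreement term.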

Results: the median (IQR) time for review authors to complete the assessments was 15.7 (11.3), 19.7 (12.1), and 28.7 (17.4) minutes for AMSTAR, AMSTAR 2, and ROBIS, respectively. The time to reach consensus was 2.6 (3.2), 4.6 (5.3), and 10.9 (10.8) minutes for AMSTAR, AMSTAR 2, and ROBIS, respectively. Inter-rater reliability varied by centre, but across all centres was substantial (AC1 0.61 to 0.80) to almost perfect (AC1 0.81 to 0.99) for 8/11 (73%) AMSTAR, 8/16 (50%) AMSTAR 2, and 13/24 (54%) ROBIS items. Inter-centre reliability was substantial to almost perfect for 6/11 (55%) AMSTAR, 12/16 (75%) AMSTAR 2, and 10/24 (42%) ROBIS items. Agreement on confidence in the results of the review (AMSTAR 2) ranged from slight (AC1 0.05, 95% confidence interval (CI) −0.17 to 0.27) to perfect (1.00) between review authors and moderate (AC1 0.58, 95% CI 0.30 to 0.85) to substantial (AC1 0.74, 95% CI 0.30 to 0.85) across centres. Agreement on overall risk of bias in the SR (ROBIS) ranged from moderate (AC1 0.47, 95% CI 0.17 to 0.77) to almost perfect (AC1 0.96, 95% CI 0.89 to 1.00) between review authors and from poor (AC1 −0.21, 95% CI −0.55 to 0.13) to moderate (AC1 0.56, 95% CI 0.30 to 0.83) across centres. There was no clear relationship between centre-specific appraisals and the results or conclusions of the SRs.
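
The verbal labels in the results above can be reproduced with a small lookup, sketched here under stated assumptions. The substantial (0.61 to 0.80), almost perfect (0.81 to 0.99), and perfect (1.00) bands are taken from the text; the remaining cut-offs (poor below 0, slight up to 0.20, fair up to 0.40, moderate up to 0.60) are assumed to follow the commonly used Landis and Koch scale, which is consistent with the labels reported but is not stated in the abstract. The names `agreement_label` and `share_substantial_or_better` are hypothetical.

```python
# Upper bounds of each band; the bands below "substantial" are an assumption
# (Landis and Koch scale), not thresholds stated in the abstract.
BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
]


def agreement_label(ac1):
    """Map an AC1 value to a verbal agreement label."""
    if ac1 > 1.0:
        raise ValueError("AC1 cannot exceed 1")
    if ac1 == 1.0:
        return "perfect"
    if ac1 < 0.0:
        return "poor"
    for upper, label in BANDS:
        if ac1 <= upper:
            return label
    return "almost perfect"


def share_substantial_or_better(item_ac1_values):
    """Return (count, rounded percentage) of items rated substantial or better."""
    good = {"substantial", "almost perfect", "perfect"}
    count = sum(agreement_label(v) in good for v in item_ac1_values)
    return count, round(100 * count / len(item_ac1_values))


# Values quoted in the results above, mapped back to their labels.
for value in (-0.21, 0.05, 0.47, 0.74, 0.96, 1.00):
    print(value, agreement_label(value))
```

The per-tool item counts (for example 8/11 items for AMSTAR) correspond to applying `share_substantial_or_better` to the item-level AC1 values, which the abstract does not report individually.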

Conclusions: review authors completed AMSTAR appraisals more quickly than AMSTAR 2 or ROBIS appraisals and reached at least substantial inter-rater agreement on a greater proportion of its items (most of them). Inter-centre reliability was highest for AMSTAR 2. Low levels of inter-centre reliability, particularly on overall AMSTAR 2 and ROBIS ratings, may limit readers’ ability to interpret the ratings applied by review groups. Improved documentation may be needed to assist review authors in consistently interpreting and applying each tool.

Patient or consumer involvement: none