The reliability, usability, and utility of tools to appraise quality and risk of bias in systematic reviews: a prospective evaluation of AMSTAR, AMSTAR 2 and ROBIS

Tags: Oral
Gates M1, Gates A1, Duarte G2, Cary M3, Becker M4, Prediger B4, Vandermeer B1, Fernandes RM5, Pieper D4, Hartling L1
1Alberta Research Centre for Health Evidence, Department of Pediatrics, University of Alberta, 2Clinical Pharmacology Unit, Instituto de Medicina Molecular, University of Lisbon, 3Centre for Health Evaluation and Research (CEFAR), National Association of Pharmacies, 4Institut für Forschung in der Operativen Medizin, Department für Humanmedizin, Universität Witten/Herdecke, 5Clinical Pharmacology Unit, Instituto de Medicina Molecular, University of Lisbon; Department of Pediatrics, Santa Maria Hospital

Background. Readers of systematic reviews (SRs) and overview authors require valid, reliable, and practical means to evaluate the methodological quality and risk of bias of SRs. Evidence of the comparative reliability, usability, and utility of common tools will inform their use and interpretation.

Objective. To evaluate and compare the interrater and inter-centre reliability, usability, and utility (how the tool may be used to inform the inclusion of SRs in overviews) of three available tools for appraising the quality or risk of bias of SRs: AMSTAR, AMSTAR 2, and ROBIS.

Methods. Using a sample of 30 SRs of randomized trials, two reviewers at each of three centres (Canada, Germany, and Portugal) independently appraised the methodological quality or risk of bias of each SR using AMSTAR, AMSTAR 2, and ROBIS in a random sequence and reached consensus. To test for interrater reliability between pairs of reviewers and for agreement between the centres' consensus decisions (inter-centre reliability), we used Gwet’s AC1 statistic. To estimate usability, we calculated the median (interquartile range [IQR]) time to complete the appraisal and reach consensus for each tool. To assess the utility of each tool for informing the inclusion of SRs in overviews, we tested for associations between methodological quality or risk of bias and the results and conclusions of the SRs.
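For readers unfamiliar with Gwet's AC1, the following is a minimal sketch of the two-rater form of the statistic, not the analysis code used in this study. The function name `gwet_ac1`, its parameters, and the example ratings are hypothetical and chosen only for illustration. AC1 is computed as (pa - pe) / (1 - pe), where pa is the observed proportion of agreement and pe is the chance agreement, defined as the sum over the K categories of pi_k * (1 - pi_k), divided by (K - 1), with pi_k the mean proportion of ratings that the two raters assigned to category k.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories=None):
    """Gwet's AC1 for two raters and nominal categories (illustrative sketch)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(ratings_a)

    # If the full set of response options is not given, fall back to the
    # categories that were actually observed (an assumption of this sketch).
    if categories is None:
        categories = sorted(set(ratings_a) | set(ratings_b))
    k = len(categories)
    if k < 2:
        raise ValueError("At least two categories are required.")

    # Observed agreement: proportion of items rated identically.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement based on the mean marginal proportion per category.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    pe = 0.0
    for category in categories:
        pi_k = (count_a[category] + count_b[category]) / (2 * n)
        pe += pi_k * (1 - pi_k)
    pe /= (k - 1)

    return (pa - pe) / (1 - pe)


# Hypothetical ratings of four SRs on a single yes/no item.
reviewer_1 = ["yes", "yes", "no", "yes"]
reviewer_2 = ["yes", "no", "no", "yes"]
ac1 = gwet_ac1(reviewer_1, reviewer_2, categories=["yes", "no"])
```

Passing the full list of response options through `categories` matters when some options are never used in the sample, because K enters the chance-agreement term directly.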

Results. Reviewers completed AMSTAR, AMSTAR 2, and ROBIS in median (IQR) 15.7 (11.3), 19.7 (12.1), and 28.7 (17.4) minutes, and reached consensus in 2.6 (3.2), 4.6 (5.3), and 10.9 (10.8) minutes, respectively. Across all centres, interrater reliability was substantial to almost perfect (AC1 0.61 to 0.99) for 8/11 (73%) AMSTAR, 9/16 (56%) AMSTAR 2, and 12/24 (50%) ROBIS items. Inter-centre reliability was substantial to almost perfect for 6/11 (55%) AMSTAR, 12/16 (75%) AMSTAR 2, and 7/24 (62.5%) ROBIS items. Inter-centre reliability for confidence in the results of the review or overall risk of bias was moderate (AC1 0.58, 95% CI 0.30 to 0.85) to substantial (AC1 0.74, 95% CI 0.30 to 0.85) for AMSTAR 2 and poor (AC1 -0.21, 95% CI -0.55 to 0.13) to moderate (AC1 0.56, 95% CI 0.30 to 0.83) for ROBIS. There was no clear relationship between centre-specific appraisals and the results or conclusions of the SRs.

Conclusions. Compared with AMSTAR 2 and ROBIS, reviewers completed AMSTAR appraisals more quickly and with better agreement. Inter-centre reliability was highest for AMSTAR 2, but ratings of overall confidence in the results were variable. Both interrater and inter-centre reliability were highly variable for ROBIS. Low levels of inter-centre reliability, particularly on overall ratings of confidence or risk of bias, may limit readers’ ability to interpret the ratings applied by various review groups. It is not clear whether reviewers’ appraisals could be used to inform the inclusion or exclusion of SRs in overviews without altering the overview’s results or conclusions.

Patient or consumer involvement. Patients and consumers were not directly involved, but the findings will assist consumers in interpreting appraisals reported in overviews.