CHIMERAS showed better inter-rater reliability and inter-consensus reliability than GRADE in grading quality of evidence: a randomized controlled trial

2018 Edinburgh

Wu X¹, Chung VC², Wong CHL², Yip BHK², Cheung WKW², Wu JCY²

¹Central South University

²The Chinese University of Hong Kong

Background:

To inform decision making and guideline development, appraising the quality of evidence (QoE) is an essential step in performing a systematic review. The Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) approach is one of the tools for assessing QoE; however, concerns about its reliability and comprehensiveness have been raised.

Objective:

To address these shortcomings, we developed the Clinical and Health Intervention Meta-analysis Evidence RAting System (CHIMERAS). This randomized controlled trial aims to assess and compare the reliability of CHIMERAS and GRADE.

Methods:

A single-center, parallel randomized controlled trial was conducted to assess and compare the inter-rater (IR) reliability of CHIMERAS and GRADE, covering both IR reliability among individual raters and inter-consensus reliability across pairs of raters. Raters were randomly assigned to two groups and trained to use either GRADE or CHIMERAS for assessing QoE. QoE from 100 Cochrane systematic reviews (SRs) was assessed with GRADE in group 1 and with CHIMERAS in group 2. IR reliability and inter-consensus reliability were evaluated by calculating the two-way random, single-measures intraclass correlation coefficient (ICC).
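As an illustration of the statistic named above, the following is a minimal sketch of a two-way random, single-measures ICC, assuming the absolute-agreement form often written ICC(2,1) and a complete table with one row per review and one column per rater. The abstract does not state which software or exact ICC definition the authors used, so the function name `icc_2_1` and the toy ratings are hypothetical and are not trial data.

```python
def icc_2_1(ratings):
    """Two-way random, single-measures ICC (absolute agreement).

    `ratings` is a list of rows: one row per rated item (e.g. a review),
    one column per rater. Every row must have the same number of columns.
    Assumption: this is the ICC(2,1) form; the source does not confirm it.
    """
    n = len(ratings)          # number of rated items
    k = len(ratings[0])       # number of raters

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-item mean square
    ms_cols = ss_cols / (k - 1)               # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Toy example: 6 items scored by 4 raters (illustrative numbers only).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(toy), 2))  # 0.29
```

Because this form penalizes systematic differences between raters as well as random disagreement, a rater who is consistently stricter than the others lowers the coefficient even when the rank ordering of reviews is identical.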

Results:

The 100 SRs covered 17 different categories of conditions and included both pharmacological (37.0%) and non-pharmacological (63.0%) interventions. For IR reliability among individual raters, CHIMERAS showed moderate agreement (ICC = 0.54, 95% confidence interval (CI) 0.44 to 0.64), while GRADE showed fair agreement (ICC = 0.38, 95% CI 0.28 to 0.49). For inter-consensus reliability across pairs of raters, CHIMERAS showed substantial agreement (ICC = 0.78, 95% CI 0.69 to 0.84), while GRADE showed moderate agreement (ICC = 0.52, 95% CI 0.36 to 0.65). With GRADE, 77.0% of SRs were judged as having low or very low QoE and 11.0% as having high QoE. With CHIMERAS, 10.0% of SRs were judged as having low or very low QoE and 54.0% as having high or very high QoE.

Conclusions:

CHIMERAS outperformed GRADE in terms of IR reliability and inter-consensus reliability. CHIMERAS and GRADE also showed substantial disagreement in grading QoE, indicating that the choice of rating approach may affect decision making.

Patient or healthcare consumer involvement: No patients or healthcare consumers were involved in this trial.