Abstract
Background: GRADE was developed to address shortcomings of tools for assessing the quality of a body of evidence, a key step in making recommendations to inform decision-making. Although much has been published about GRADE, there are few empirical and systematic evaluations. Our objective was to assess the reliability of GRADE for systematic reviews (SRs) and to identify areas of uncertainty.

Methods: We applied GRADE to 2 SRs (n = 48 and 125 studies). Two reviewers independently graded the evidence for outcomes deemed clinically important a priori. Inter-rater reliability (IRR) was assessed using kappa statistics for the 4 main domains (risk of bias [RoB], consistency, directness, and precision) and for overall strength of evidence (SoE).

Results: For the first SR, 51 outcomes were graded across 6 comparisons. IRR was κ = 0.41 for RoB, 0.84 for consistency, 0.18 for precision, and 0.44 for overall SoE. Kappa could not be calculated for directness because one rater assessed all items as direct; assessors agreed in 41% of cases. For the second SR, 24 outcomes were graded across 11 comparisons. IRR was κ = 0.37 for consistency and 0.19 for precision. Kappa could not be calculated for the other domains; assessors agreed in 33% of cases for RoB, 100% for directness, and 58% for overall SoE. Precision created the most uncertainty because of difficulties in identifying the "optimal" information size and minimal clinically important differences, and in making assessments when there was no meta-analysis. Other sources of discrepancy were recorded and resolutions proposed. GRADE evaluations in other SRs are ongoing.

Conclusions: As researchers with varied levels of training and experience use GRADE, there is an increased risk of variability in interpretation and application. This study shows variable agreement across the GRADE domains, reflecting areas where judgment is required. Further evaluation is needed to enhance consistency and to ensure that the same methodological rigour applied to other steps of an SR is applied to grading the evidence.
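The kappa statistics reported above quantify agreement between two raters beyond what would be expected by chance. As a rough illustration only, and not the study's data or analysis code, the sketch below computes Cohen's kappa and raw percent agreement for two hypothetical reviewers' strength-of-evidence grades using scikit-learn; the rating arrays are invented for the example.

```python
# Minimal sketch of the inter-rater agreement calculation described in the
# Methods: Cohen's kappa between two raters' grades.
# The ratings below are hypothetical, not data from the study.
from sklearn.metrics import cohen_kappa_score

# Hypothetical overall strength-of-evidence grades assigned by two reviewers
# to the same set of outcomes.
rater_1 = ["high", "moderate", "low", "low", "insufficient", "moderate"]
rater_2 = ["high", "low", "low", "moderate", "insufficient", "moderate"]

kappa = cohen_kappa_score(rater_1, rater_2)
raw_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)

print(f"Cohen's kappa: {kappa:.2f}")   # chance-corrected agreement
print(f"Raw agreement: {raw_agreement:.0%}")  # simple percent agreement
```

Raw percent agreement (as reported where kappa could not be calculated, e.g., when one rater assigned the same category throughout) is shown alongside kappa because kappa is undefined or uninformative when a rater's ratings have no variability.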