Reproducibility of Grading of Recommendations Assessment, Development and Evaluation (GRADE) factors on the strength of recommendations: an empirical assessment

2015 Vienna

Kumar A¹, Miladinovic B¹, Guyatt GH², Schunemann HJ², Djulbegovic B¹

¹Morsani College of Medicine, USF Health, Program for CER, Tampa, USA

²McMaster University, Hamilton, Canada

Background: GRADE is a widely used methodology for developing clinical practice guidelines, but its reproducibility has not been tested in the context of actual guideline development.
Objective: To assess the reproducibility of all GRADE factors among guideline panel members with limited exposure to GRADE methodology.
Methods: The study was conducted as part of the clinical practice guideline development process of the American Association of Blood Banks (AABB) for the use of prophylactic versus therapeutic platelet transfusion in patients with thrombocytopenia. The results from the systematic review and meta-analysis for each question were summarized as a GRADE evidence profile. Inter-rater agreement for all GRADE factors and for the strength of recommendations was summarized using a weighted kappa statistic with 95% confidence intervals (CI).
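The weighted kappa statistic named in the Methods can be illustrated with a minimal sketch. The abstract does not state which weighting scheme was used or how ratings from the 18 panel members were pooled, so the code below is only an assumption-laden illustration: it computes Cohen's weighted kappa for two raters, with linear weights by default. The function name `weighted_kappa`, the category labels, and the example ratings are hypothetical and are not study data. Confidence intervals are not computed here.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters over ordered categories.

    Illustrative sketch only: the two-rater form and the weighting
    scheme are assumptions, not the method reported in the abstract.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two ordered categories are required")

    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)
    power = 1 if weights == "linear" else 2

    def disagreement(i, j):
        # 0 for identical categories, 1 for the two extreme categories
        return (abs(i - j) / (k - 1)) ** power

    pairs = Counter((index[a], index[b]) for a, b in zip(ratings_a, ratings_b))
    marg_a = Counter(index[a] for a in ratings_a)
    marg_b = Counter(index[b] for b in ratings_b)

    # Weighted disagreement actually observed between the two raters
    observed = sum(disagreement(i, j) * c for (i, j), c in pairs.items()) / n
    # Weighted disagreement expected by chance from the marginal totals
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed / expected


# Hypothetical ratings of evidence quality for six questions (not study data)
levels = ["very low", "low", "moderate", "high"]
rater_1 = ["high", "moderate", "moderate", "low", "high", "very low"]
rater_2 = ["high", "moderate", "low", "low", "moderate", "very low"]
print(round(weighted_kappa(rater_1, rater_2, levels), 3))
```

Weighting matters for ordinal GRADE judgments because a disagreement between adjacent categories (for example, "moderate" versus "low") is penalized less than one between distant categories. A 95% CI could be obtained, for instance, by bootstrapping the rated items, which is again an assumption rather than the reported procedure.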
Results: Eighteen panel members took part in the deliberations on the recommendations and completed the online questionnaire. They were given two one-hour lectures about GRADE. The inter-rater agreement for the domain of quality of evidence was good (kappa value: 0.68; 95% CI 0.541 to 0.837), and fair for balance of benefits and harms (kappa value: 0.4; 95% CI 0.253 to 0.574) and use of resources (kappa value: 0.275; 95% CI 0.116 to 0.421). The inter-rater agreement was moderate for the GRADE domain of patients’ values and preferences (kappa value: 0.441; 95% CI 0.307 to 0.555). The inter-rater agreement for making a for/against recommendation was good (kappa value: 0.738; 95% CI 0.331 to 0.914) and fair for a strong/weak recommendation (kappa value: 0.391; 95% CI 0.175 to 0.681).
Conclusions: While not all elements of the GRADE system had good agreement, the inter-rater agreement for assessing the quality of evidence and for issuing a recommendation for versus against was substantial. This is probably because GRADE has operationalized these two areas in more detail than the other domains. Further operationalization of all GRADE domains would likely improve reproducibility across the entire GRADE system.