Testing and revising an algorithm for the classification of study designs in systematic reviews of interventions and exposures

Authors
Dryden D1, Hartling L2, Bond K2
1ARCHE / Pediatrics, University of Alberta, Edmonton, Alberta, Canada
2Pediatrics, University of Alberta, Canada
Abstract
Background: Systematic reviewers may include nonrandomized studies to provide a more detailed picture of the current knowledge of an intervention. We previously described the development and testing of an algorithm to assist in the classification of study designs in systematic reviews (SRs) of interventions and exposures. Such a tool may be used to inform key steps of the review process.

Objectives: This study builds on our previous findings by testing the algorithm within the context of a single SR and refining the algorithm to further enhance its reliability.

Methods: The algorithm was applied to 51 studies included in an SR of the effectiveness of diabetes education. The reference standard classification was developed by 2 researchers who independently classified the studies; disagreements were resolved through discussion with a third reviewer. Four testers, varying in training and experience, independently applied the tool to the same 51 studies. Inter-rater reliability and accuracy against the reference standard were measured, and areas of disagreement were identified.

Results: All 4 testers agreed on the classification of 12 studies; 3 testers agreed for 19 studies, and 2 agreed for 17. For the remaining 3 studies, there was no agreement. The overall level of agreement was fair (κ=0.36); agreement among testers with graduate-level training was moderate (κ=0.47). All 4 testers agreed with the reference standard for 11 studies. Agreement between the reference standard developers was moderate (κ=0.57). Two decision nodes were modified. Testing of the revised algorithm is ongoing, and results within the context of a second SR will also be presented.

Conclusion: The algorithm helped to classify studies and to identify classification difficulties arising from inadequate reporting. The results concur with previous findings showing better reliability among individuals with more training and experience. Additional testing and refinement using different samples will enhance the utility of the tool.
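The agreement statistics reported above are kappa coefficients, which correct raw percent agreement for agreement expected by chance. As a minimal sketch of how a pairwise (Cohen's) kappa is computed, the following uses entirely hypothetical study-design labels for two raters; it is not the study's data or software, only an illustration of the statistic.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical classifications.

    kappa = (p_observed - p_expected) / (1 - p_expected),
    where p_expected sums, over categories, the product of each
    rater's marginal proportion for that category.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items labeled identically
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance-expected agreement from the raters' marginal distributions
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(rater_a) | set(rater_b)
    )
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical design classifications of 10 studies by two testers
a = ["RCT", "CCT", "cohort", "RCT", "case-control",
     "cohort", "RCT", "CCT", "cohort", "RCT"]
b = ["RCT", "cohort", "cohort", "RCT", "case-control",
     "cohort", "CCT", "CCT", "cohort", "RCT"]
print(round(cohens_kappa(a, b), 2))  # 0.72 (substantial agreement)
```

With more than two raters, as in this study, a generalization such as Fleiss' kappa is typically used instead; the pairwise version above conveys the same chance-correction idea.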