Abstract
Background: In 2016, the Cochrane Bias Methods Group developed a risk of bias (ROB) tool for non-randomized studies of interventions (Risk of Bias in Non-Randomized Studies of Interventions [ROBINS-I]). In 2019, this tool was adapted for non-randomized studies of exposures (Risk of Bias Instrument for Non-Randomized Studies of Exposures [ROB-NRSE]).
Objectives: To evaluate the impact of training and customized guidance on inter-rater reliability (IRR), inter-consensus reliability (ICR; comparison of consensus assessments across reviewer pairs), and evaluator burden of ROBINS-I and ROB-NRSE.
Methods: An international team of seven reviewers from six review centers appraised ROB using either ROBINS-I (n=44) or ROB-NRSE (n=44) in two stages. In stage one, reviewers assessed ROB before receiving any training or customized guidance; in stage two, they reassessed ROB after receiving both. Two pairs of reviewers independently assessed the same sample of study publications in both stages, and each pair then resolved conflicts through consensus. Reviewers also recorded the time taken to complete each step. For the analysis of IRR and ICR, we used Gwet's AC1 statistic. Agreement among reviewers was categorized as poor (0-0.09), slight (0.10-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), near perfect (0.81-0.99), or perfect (1.00).
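For context (notation ours, not reproduced from the study), the standard two-rater form of Gwet's AC1 is

$$\mathrm{AC}_1 = \frac{p_a - p_e}{1 - p_e}, \qquad p_e = \frac{1}{Q-1}\sum_{q=1}^{Q}\pi_q\,(1-\pi_q), \qquad \pi_q = \frac{p_{1q} + p_{2q}}{2},$$

where $p_a$ is the observed proportion of agreement, $Q$ is the number of rating categories, and $p_{1q}$ and $p_{2q}$ are the proportions of items that raters 1 and 2 assigned to category $q$.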
Results: For ROBINS-I, the IRR (Table 1) improved after training and customized guidance for all domains except “bias in classification of interventions”, which showed a decrease in IRR (from moderate to slight agreement). For ROB-NRSE, the IRR improved after training and customized guidance for all domains (Table 2) except “bias due to missing data”, for which there was no improvement, and “bias in classification of exposures”, for which there was a slight decrease in IRR (from moderate to fair agreement). For ROBINS-I, the ICR improved for all domains (Table 3). For ROB-NRSE, the ICR improved for all domains except “bias due to confounding”, for which there was no improvement after training and guidance (Table 4). For both tools, both the IRR and the ICR of the overall bias assessment improved after training and guidance.
The evaluator burden (time taken to read the article, adjudicate, and reach consensus) decreased after training and guidance for ROBINS-I (before: 48.45 min [95% CI 45.61, 51.29] vs. after: 35.6 min [95% CI 32.77, 38.33]), whereas it increased slightly for ROB-NRSE (before: 36.98 min [95% CI 34.80, 39.16] vs. after: 40.5 min [95% CI 37.30, 43.66]).
Conclusions: In our cross-sectional study, the IRR and ICR of ROBINS-I and ROB-NRSE improved overall after training and customized guidance. When conducting systematic reviews of non-randomized studies, we highly recommend providing reviewers with additional training and customized guidance before ROB assessments.
Patient or healthcare consumer involvement: Patients or healthcare consumers were not involved in this project.