Impact of training and guidance on the inter-rater and inter-consensus reliability of risk of bias instruments for non-randomized studies

Authors
Jeyaraman MM1, Robson RC2, Pollock M3, Copstein L1, Balijepalli C4, Hofer K5, Xia J6, Al-Yousif N1, Mansour S7, Fazeli MS5, Ansari MT8, Tricco AC9, Rabbani R1, Abou-Setta AM1
1University of Manitoba
2Li Ka Shing Knowledge Institute, St. Michael's Hospital
3Institute of Health Economics
4Pharmalytics Group
5Evidinno Outcomes Research Inc.
6Nottingham Ningbo GRADE Centre
7University of Montréal
8University of Ottawa
9University of Toronto
Abstract
Background: In 2016, a risk of bias (ROB) tool (Risk of Bias in Non-Randomized Studies of Interventions [ROBINS-I]) was developed by the Cochrane Bias Methods group. In 2019, this tool was adapted for non-randomized studies of exposures (Risk of Bias Instrument for Non-Randomized Studies of Exposures [ROB-NRSE]).
Objectives: To evaluate the impact of training and customized guidance on inter-rater reliability (IRR), inter-consensus reliability (ICR; comparison of consensus assessments across reviewer pairs), and evaluator burden of ROBINS-I and ROB-NRSE.
Methods: An international team of seven reviewers from six review centers appraised the ROB using either ROBINS-I (n=44) or ROB-NRSE (n=44) in two stages. Stage one was ROB assessments before training or customized guidance, and stage two was ROB assessments after training and customized guidance. Two pairs of reviewers independently assessed the same sample of study publications in both stages. After completion, each pair resolved conflicts through consensus. Reviewers also recorded the time taken for completion of each step. For analysis of the IRR and ICR, we used Gwet’s AC1 statistic. Agreements among the reviewers were categorized as: poor (0-0.09), slight (0.10-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), near perfect (0.81-0.99), or perfect (1.00).
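As an illustration of the agreement statistic named in the Methods, the following is a minimal sketch of Gwet's AC1 for two raters on a nominal scale, together with the category labels listed above. It is not the authors' analysis code. The function names, the example ratings, and the judgement labels ("low", "moderate", "serious", "critical") are hypothetical. The handling of values that fall between the published bands (for example 0.205) and of negative values is also an assumption.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories):
    """Gwet's AC1 for two raters rating the same items on a nominal scale."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")

    # Observed agreement: share of items on which the two raters match.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Average of the two raters' marginal proportions for each category.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    pi = [(count_a[c] + count_b[c]) / (2 * n) for c in categories]

    # Chance agreement as defined for AC1.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


def agreement_category(ac1):
    """Map an AC1 value to the labels used in the abstract.

    Assumption: band edges are treated as upper-inclusive cut-offs, and
    negative values are labelled "poor".
    """
    if ac1 >= 1.0:
        return "perfect"
    if ac1 > 0.80:
        return "near perfect"
    if ac1 > 0.60:
        return "substantial"
    if ac1 > 0.40:
        return "moderate"
    if ac1 > 0.20:
        return "fair"
    if ac1 >= 0.10:
        return "slight"
    return "poor"


# Hypothetical judgements from one reviewer pair for a single domain.
categories = ["low", "moderate", "serious", "critical"]
reviewer_1 = ["low", "low", "moderate", "serious", "moderate",
              "low", "serious", "moderate", "low", "low"]
reviewer_2 = ["low", "moderate", "moderate", "serious", "moderate",
              "low", "moderate", "moderate", "low", "low"]

ac1 = gwet_ac1(reviewer_1, reviewer_2, categories)
print(round(ac1, 3), agreement_category(ac1))
```

The same function could in principle be applied to the ICR by passing the consensus judgements of the two reviewer pairs in place of the individual reviewers' judgements.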
Results: For ROBINS-I, the IRR (Table 1) improved after training and customized guidance for all domains except “bias in classification of interventions”, which showed a decrease in IRR (from moderate to slight agreement). For ROB-NRSE, the IRR for all domains showed improvement after training and customized guidance (Table 2), except for the “bias due to missing data” domain, for which there were no improvements, and the “bias in classification of exposures” domain, for which there was a slight decrease in IRR (from moderate to fair agreement). For ROBINS-I, the ICR improved for all domains (Table 3). For ROB-NRSE, all domains improved except for “bias due to confounding”, for which there was no improvement after guidance and training (Table 4). For both tools, the overall bias assessments for both IRR and ICR showed improvements after training and guidance.
The evaluator burden (time taken to read article + adjudication + consensus) decreased after guidance and training for ROBINS-I (before training and guidance: 48.45 min [95% CI 45.61, 51.29] vs. after training and guidance: 35.6 min [95% CI 32.77, 38.33]), whereas there was a slight increase for ROB-NRSE (before training and guidance: 36.98 min [95% CI 34.80, 39.16] vs. after training and guidance: 40.5 min [95% CI 37.30, 43.66]).
Conclusions: In our cross-sectional study, the IRR and ICR of ROBINS-I and ROB-NRSE improved overall after training and customized guidance. For systematic reviews of non-randomized studies, we highly recommend giving reviewers additional training and customized guidance before they begin ROB assessments.
Patient or healthcare consumer involvement: Patients or healthcare consumers were not involved in this project.