Inter-rater reliability, inter-consensus reliability and evaluator burden of ROBINS-E and ROBINS-I: a cross-sectional study

Authors
Jeyaraman MM1, Rabbani R1, Robson R2, Copstein L1, Al-Yousif N1, Xia J3, Pollock M4, Hofer K5, Balijepalli C1, Mansour S6, Bond K4, Fazeli M5, Ansari M7, Tricco A8, Abou-Setta AM1
1George and Fay Yee Center for Healthcare Innovation, University of Manitoba
2Li Ka Shing Knowledge Institute, St. Michael's Hospital
3Nottingham Ningbo GRADE Centre
4Institute of Health Economics
5Evidinno
6University of Montreal
7University of Ottawa
8Dalla Lana School of Public Health, University of Toronto
Abstract
Background: recently, a 'Risk of bias' (RoB) tool was developed for non-randomized studies (NRS) of interventions (ROBINS-I), which was later modified and adapted for NRS of environmental/nutritional exposures (ROBINS-E). However, the inter-rater reliability (IRR) and inter-consensus reliability (ICR) of these tools have yet to be independently verified.

Objectives: the objectives of our study are to establish the IRR, ICR, and evaluator burden of ROBINS-I and ROBINS-E.

Methods: an international team of evaluators from six participating centres appraised the RoB of a sample of NRS of interventions using ROBINS-I (n = 44) or of exposures using ROBINS-E (n = 44). Evaluators were paired into teams that reviewed the same sample of study publications, to allow evaluation of ICR. After completing their individual adjudications, each pair of evaluators resolved conflicts through consensus. Evaluators also tracked the time taken to complete each step. For the analysis of IRR and ICR, we used Gwet's AC1 statistic. We categorized agreement among evaluators as follows: poor (< 0), slight (0.00 to 0.20), fair (0.21 to 0.40), moderate (0.41 to 0.60), substantial (0.61 to 0.80), near perfect (0.81 to 0.99), or perfect (1.00). To assess evaluator burden, we analyzed the average time taken for individual adjudications and for the consensus process. We used Microsoft Excel, Review Manager 5.3, and SAS 9.4 for data management and analysis.
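For illustration, the sketch below shows how Gwet's AC1 could be computed for a pair of raters; this is a hypothetical Python helper with invented data, not the SAS 9.4 code used in the study.

```python
# Minimal sketch of Gwet's AC1 for two raters (hypothetical helper,
# not the study's SAS implementation).
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """AC1 agreement coefficient for two raters adjudicating the same items.

    ratings_a, ratings_b: equal-length sequences of categorical judgements,
    e.g. RoB levels such as 'low', 'moderate', 'serious', 'critical'.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two observed categories")

    # Observed agreement: proportion of items adjudicated identically.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement, from the mean marginal proportion pi of each
    # category: p_e = (1 / (q - 1)) * sum of pi * (1 - pi).
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = [counts[c] / (2 * n) for c in categories]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)

    return (p_a - p_e) / (1 - p_e)

# Invented example: two raters' 'bias due to confounding' adjudications
# for six studies.
rater_1 = ["low", "moderate", "serious", "low", "critical", "moderate"]
rater_2 = ["low", "serious", "serious", "low", "critical", "low"]
print(f"AC1 = {gwet_ac1(rater_1, rater_2):.2f}")  # AC1 = 0.56
```

On the invented data above, AC1 is approximately 0.56, which would fall in the 'moderate' band (0.41 to 0.60) of the categorization described above.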

Results: for both ROBINS-I and ROBINS-E, the IRR (Table 1) indicated slight agreement for evaluating 'bias due to confounding'. For ROBINS-I, agreement for the remaining domains ranged from fair to substantial. For ROBINS-E, agreement for the remaining domains ranged from poor to moderate, except for the domain 'bias in measurement of outcomes', for which there was near-perfect agreement. The overall bias assessments showed poor agreement for ROBINS-I and slight agreement for ROBINS-E.

For ICR, agreement ranged from slight to substantial for ROBINS-I (Table 2) and from poor to perfect for ROBINS-E (Table 2). The overall bias assessments showed slight agreement for ROBINS-I and poor agreement for ROBINS-E.

With regard to evaluator burden, the average time taken (reading the study report + adjudication + consensus) by the evaluators was 42.7 ± 7.7 minutes for ROBINS-I and 48.0 ± 8.3 minutes for ROBINS-E. As an extension of this project, we are currently investigating whether training and additional supportive material would improve the IRR and ICR of both tools.

Conclusions: overall, ROBINS-I had better IRR and ICR than ROBINS-E. Assessment times for the two tools were similar. Measures to increase agreement between raters (e.g. detailed training, supportive materials) are required.

Patient or healthcare consumer involvement: healthcare consumers were not involved in this methods project.