Computed vs. effective intercoder reliability for systematic reviews

Authors
Floyd JA, Moulton RA, Medler SM
Abstract
Introduction: Although systematic reviews have the potential to produce valid summaries of research findings, results are only as sound as the procedures used, including intercoder reliability (ICR) procedures. When establishing ICR, there are two key considerations: (a) the choice of statistics used to calculate ICR, and (b) the decrease in effective ICR when coding decisions are not independent. The effective ICR for dependent (e.g., nested) decisions is the product of their independent ICR coefficients.
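
As a minimal illustration of this product rule, consider the Python sketch below; the three coefficients are hypothetical, not values from this study.

    # Effective ICR across dependent (nested) coding decisions is the
    # product of the independent per-decision ICR coefficients.
    independent_icrs = [0.90, 0.85, 0.88]  # hypothetical per-decision coefficients

    effective_icr = 1.0
    for icr in independent_icrs:
        effective_icr *= icr

    print(round(effective_icr, 2))  # 0.67, well below any single coefficient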

Objectives: The purpose of this study was to determine the effective ICR for each research question posed in a formal meta-analysis on sleep and aging.

Methods: Following extensive coder training on four prototype studies, a purposive sample of 20 studies was selected to maximize the range of coding decisions necessary for completion of the project. The often-used percentage-of-agreement statistic was not used to examine ICR because it can overestimate reliability. Instead, kappa and the intraclass correlation coefficient were used to estimate ICR for categorical and continuous variables, respectively.
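
A minimal sketch of both calculations for two coders is shown below; the data, the scikit-learn dependency, and the choice of the one-way random-effects ICC, ICC(1,1), are illustrative assumptions rather than details from the study.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    # Kappa for a categorical variable (e.g., design type) coded by two coders.
    coder_a = ["RCT", "cohort", "RCT", "case-control", "cohort"]
    coder_b = ["RCT", "cohort", "cohort", "case-control", "cohort"]
    kappa = cohen_kappa_score(coder_a, coder_b)

    # One-way random-effects ICC(1,1) for a continuous variable:
    # rows are studies, columns are coders.
    ratings = np.array([[7.0, 7.5],
                        [5.0, 5.5],
                        [9.0, 8.5],
                        [6.0, 6.0]])
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    msb = k * np.sum((row_means - ratings.mean()) ** 2) / (n - 1)      # between-study mean square
    msw = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))  # within-study mean square
    icc = (msb - msw) / (msb + (k - 1) * msw)

    print(round(kappa, 2), round(icc, 2))  # 0.69 0.96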

Results: The overall mean ICR was .86, with a range of .31 to 1.00 for individual coded variables. Because of low computed ICR for some variables and the lack of independence among coding decisions, the mean effective ICRs for the four major research questions ranged from .58 to .74. The lowest mean effective ICR (.58) was found for a research question about the relationship between effect sizes obtained and study methods used. Because very diverse methods are used to study sleep and aging, coders must distinguish among many different types of design, sampling, measurement, and analysis strategies. Typical approaches for increasing ICR, i.e., additional training and rewording of items, were therefore used extensively to raise the effective ICR for coding methodological variation in sleep research reports.
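
The compounding behind these figures can be checked with a back-of-the-envelope calculation; the uniform coefficient and decision depths below are hypothetical, chosen only to show how chains of nested decisions yield effective ICRs in roughly the reported range.

    # Even a uniformly "good" per-decision ICR decays quickly when
    # decisions are nested: effective ICR = mean_icr ** depth.
    mean_icr = 0.86
    for depth in range(1, 5):
        print(depth, round(mean_icr ** depth, 2))
    # depth 1: 0.86, depth 2: 0.74, depth 3: 0.64, depth 4: 0.55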

Discussion: The results of this reliability study show the importance of not relying on a single mean ICR, which assumes independence among all coding decisions, when judging ICR for a systematic review. Doing so can mask unreliable retrieval of information relevant to specific research questions. Because effective ICRs were not in the "excellent" range for all research questions, it seemed desirable to go beyond the usual approaches to increasing ICR. A double-coding approach, followed by discussion to resolve differences, has been adopted for coding all key variables in the meta-analysis. The combination of double-coding and traditional approaches to increasing ICR has resulted in effective ICRs in the "excellent" range for all research questions.

Funding: This research is supported by the National Institute of Nursing Research, R01 NR03880.