Heterogeneity in meta-analysis: misconceiving I²

2008 Freiburg

Rücker G, Schwarzer G, Schumacher M

Background: Measures of statistical heterogeneity in meta-analysis include the between-study variance tau², estimated from a random-effects model, and I², expressed as a percentage. Although Higgins and Thompson (Stat. Med. 21(2002):1539–1558) carefully distinguished between these measures and described their properties, clinicians seem particularly prone to misinterpreting I². The Cochrane Handbook for Systematic Reviews of Interventions (Version 4.2.6, page 138) states ‘‘A value greater than 50% may be considered as substantial heterogeneity’’. Some reviewers conclude that studies must not be pooled if I² > 50%. However, in contrast to tau², the statistic I², interpreted as the percentage of variability due to between-study heterogeneity rather than sampling error, depends on the precision of the studies.
Objectives: To illustrate this dependence, we present a simulation study based on a published meta-analysis.
Methods: Sample sizes are ‘inflated’ using the random-effects model. Given an inflation factor k, a corresponding k-inflated trial is created for each trial, with its sampling variance multiplied by 1/k. Sample size inflation based on the random-effects model implies not only that precision increases with k, but also that the study means tend to theoretical values that follow a normal distribution.
Results: As precision increases, estimates of tau² vary only randomly, whereas the values of I² increase rapidly to nearly 100%. It therefore seems questionable to interpret statistical heterogeneity, measured in terms of I² and thus dependent on precision, as clinical heterogeneity. This is comparable to the notoriously wrong interpretation of a significant p-value as evidence of clinical relevance.
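The inflation procedure described under Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the baseline sampling variances, the true mean `mu`, and the true `tau2` are hypothetical values rather than the published meta-analysis, and the DerSimonian-Laird moment estimator is assumed for tau², since the abstract does not name an estimator. Each k-inflated trial draws a study-specific true effect from N(mu, tau2) and then an observed effect with sampling variance v/k.

```python
import random

# Hypothetical baseline sampling variances for 10 trials (not from the abstract).
BASE_VARIANCES = [0.04, 0.06, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25]


def dersimonian_laird(effects, variances):
    """Return (tau2, I2) using the DerSimonian-Laird moment estimator.

    I2 is returned as a proportion in [0, 1], not as a percentage.
    """
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return tau2, i2


def simulate(k, base_variances, mu, tau2, n_rep, rng):
    """Average the tau2 and I2 estimates over n_rep k-inflated meta-analyses."""
    variances = [v / k for v in base_variances]  # sampling variance times 1/k
    tau2_sum = 0.0
    i2_sum = 0.0
    for _ in range(n_rep):
        effects = []
        for v in variances:
            theta = rng.gauss(mu, tau2 ** 0.5)           # study-specific true effect
            effects.append(rng.gauss(theta, v ** 0.5))   # observed effect
        est_tau2, est_i2 = dersimonian_laird(effects, variances)
        tau2_sum += est_tau2
        i2_sum += est_i2
    return tau2_sum / n_rep, i2_sum / n_rep


rng = random.Random(2008)
for k in (1, 10, 100, 1000):
    mean_tau2, mean_i2 = simulate(k, BASE_VARIANCES, mu=0.3, tau2=0.04, n_rep=200, rng=rng)
    print(f"k={k:5d}  mean tau2={mean_tau2:.4f}  mean I2={100 * mean_i2:.1f}%")
```

Under these assumptions, the printed mean tau² should stay near the true value of 0.04 for every k, while the mean I² should climb towards 100% as k grows, in line with the Results.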
Conclusions: There are three kinds of heterogeneity that should not be confused: (i) statistical heterogeneity, quantified on the outcome measurement scale; (ii) clinically relevant heterogeneity on the same scale; and (iii) clinical heterogeneity, measured in baseline covariates (not necessarily reflected on the outcome scale). Heterogeneity on the outcome scale should preferably be measured in terms of tau², which depends neither on the number of studies nor on their size. In addition, because tau has the dimension of the outcome scale, it is possible to define a threshold of clinically relevant heterogeneity on this scale in advance.
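The contrast between tau² and I² can be made explicit with a short worked relation. It rests on an approximation: the within-study sampling variances are summarized by a single "typical" value, written here as \sigma^2, following Higgins and Thompson. The numerical values are hypothetical and chosen only for illustration.

```latex
% I^2 from Cochran's Q with df = (number of studies) - 1:
I^2 = \max\!\left(0,\ \frac{Q - \mathrm{df}}{Q}\right)

% Approximate interpretation, with \sigma^2 a typical within-study variance:
I^2 \approx \frac{\tau^2}{\tau^2 + \sigma^2}

% After k-fold inflation the sampling variance becomes \sigma^2 / k, so
I^2(k) \approx \frac{\tau^2}{\tau^2 + \sigma^2 / k} \longrightarrow 1
\quad \text{as } k \to \infty, \text{ while } \tau^2 \text{ is unchanged.}

% Hypothetical example with \tau^2 = 0.04 and \sigma^2 = 0.10:
% k = 1:   0.04 / 0.140 \approx 0.29
% k = 10:  0.04 / 0.050 = 0.80
% k = 100: 0.04 / 0.041 \approx 0.98
```

In this example the between-study variance is identical in all three cases; only the precision of the trials changes, yet I² moves from below the 50% mark to nearly 100%.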