HETEROGENEITY
9.1. Overview
The term “heterogeneity” refers to the dispersion of true effects across studies.
Typically, the studies in a meta-analysis will differ from each other in various
ways. Each study is based on a unique population, and the impact of any
intervention will typically be larger in some populations and smaller in others.
The specifics of the intervention may vary from study to study, the scale used
to assess outcome may vary from study to study, and so on. Each of these
factors may have an impact on the effect size. One goal of the analysis will
be to determine how much the effect size varies across studies, and this
variation is called heterogeneity (Ades, Lu, & Higgins, 2005; P. Glasziou &
Sanders, 2002; J. Higgins, Thompson, Deeks, & Altman, 2002; J. P. Higgins
et al., 2009; Keefe & Strom, 2009; Thompson, 1994).
The same ideas apply when we turn to meta-analysis. For example, consider
the following.
Castells et al. (2011) conducted a meta-analysis of seventeen studies to
assess the impact of methylphenidate in adults with Attention Deficit
Hyperactivity Disorder (ADHD). Patients with this disorder have trouble
performing cognitive tasks, and it was hypothesized that the drug would
improve their cognitive function. Patients were randomized to receive either
the drug or a placebo, and then tested on measures of cognitive function. The
effect size was the standardized mean difference between groups on the
measure of cognitive function.
It turns out that the mean effect size is 0.50. On average, across all
comparable populations, the drug increases cognitive functioning by half a
standard deviation. But to understand the potential utility of the drug we
also need to ask about heterogeneity.
We might make the following decisions about the utility of the drug in
the three cases.
A. We can expect to see pretty much the same effect in all populations.
B. The impact will vary somewhat across populations, but from a clinical
perspective we can still talk about a common effect size.
C. The impact varies substantially across populations. It would be important
to establish where the impact is trivial, moderate, and high, so that we can
target this intervention more effectively. However, since the impact is
always positive, we could use this intervention immediately.
These judgments are subjective. For example, we can discuss whether to
recommend the intervention in case C, where the effect will be trivial in some
populations. What is clear, though, is that any discussion of the potential
utility of the drug should be based on this type of information.
The true effect size is the effect size that we would see if we could somehow
enroll the entire population in the study. The observed effect size serves as
an estimate of the true effect size but invariably falls above or below it due
to sampling error.
The variance of observed effects tends to be larger than the variance of
true effects. To understand why, consider what would happen if we ran five
studies based on the same population, and computed the effect size in each.
The true effect size is the same in all five studies (all studies are estimating
the effect size in the same population) and so the variance of true effects is
zero. Yet, the observed effects will differ from each other because of
sampling error, and so the variance of the observed effects will be greater than
zero. While this is most intuitive in the case when the variance of true effects
is zero, it applies also when the true effects vary. The variance of observed
effects tends to exceed the variance of true effects.
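This point can be checked with a small simulation. The sketch below (all values are illustrative, none are taken from the chapter's analyses) draws five studies from a single population: every study estimates the same true effect, so the variance of true effects is exactly zero, yet the observed effects still vary because of sampling error.

```python
import random
import statistics

random.seed(1)  # reproducible illustration

# One population with a true standardized mean difference of 0.50.
# All five studies estimate the SAME true effect, so the variance of
# true effects is exactly zero. (Values are illustrative only.)
true_effect = 0.50
n_per_arm = 40

observed = []
for _ in range(5):
    control = [random.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    treated = [random.gauss(true_effect, 1.0) for _ in range(n_per_arm)]
    # Crude effect estimate: difference in means (both arms are built
    # with unit SD, so no pooled-SD step is needed for this sketch).
    observed.append(statistics.mean(treated) - statistics.mean(control))

var_true = 0.0                            # true effects do not vary
var_obs = statistics.variance(observed)   # observed effects do
print(var_true, round(var_obs, 4))
```

However the seed is chosen, the observed variance comes out greater than zero, which is the point of the passage above.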
The ADHD analysis serves as a case in point. Figure 24 shows two plots.
The inner plot shows the dispersion of true effects, while the outer plot shows
the dispersion of observed effects. We see the outer plot, but we care about
the inner plot since the inner plot tells us how much the effect size really varies
across populations.
Below, I explain what each statistic means, and then focus on the ones that are
relevant to this question.
On the pages that follow, I address various issues, including the following.
9.2.1. Mistake
Heterogeneity refers to the fact that the true effect size varies across studies.
Some researchers believe that heterogeneity diminishes the utility of the
analysis. In an extreme version of this idea, some have asserted that when the
effect sizes are heterogeneous, it is a bad idea to perform a meta-analysis at
all. The truth is more complicated.
9.2.2. Details
Heterogeneity is not inherently good or bad, but it does affect what we can
learn from the analysis. If our goal in the analysis is to report that the
intervention increases scores by a certain value, then heterogeneity is indeed
a problem. In the absence of heterogeneity, we can report a common effect
size that applies to all populations. In the presence of heterogeneity, there is
no common effect size and so we cannot meet this goal.
However, in the presence of heterogeneity we can assess the extent of
heterogeneity and report, for example, that the effect size is as low as 0.05 in
some populations and as high as 0.95 in others. If this is the true state of
affairs, then this should be the goal of the analysis.
Figure 25 | High dose vs. standard dose of statins | Risk ratio < 1 favors high dose
Figure 26 | Methylphenidate for adults with ADHD | Effect size > 0 favors treatment
Second, I said that when heterogeneity is trivial, the mean effect size
provides definitive information about all comparable studies. This statement
comes with some important caveats.
A. This refers to the true heterogeneity, not the estimated heterogeneity. The
fact that heterogeneity is estimated as being trivial (or zero) does not
necessarily mean that the true heterogeneity is trivial.
B. The description of heterogeneity as being trivial or substantive refers to
the practical impact of the intervention rather than some statistical index.
The researcher (or reader) would need to decide what amount of
dispersion is of practical importance.
C. The statement that the mean effect size applies to all comparable studies
is more useful in theory than in practice. In practice, it may not be clear
what studies are comparable to those in the analysis.
The good folks in the town of New Cuyama erected a sign that captured
some key statistics. The population is 562, the town is 2150 feet above sea
level, and the town was established in the year 1951. They summed these
statistics and reported the total as 4663.
Dr. Laird said that this would be an example where people had gone too
far. But in most cases, heterogeneity is not a problem if we treat it
appropriately.
Summary
9.3.1. Mistake
The prediction interval addresses the question we intend to ask when we ask
about heterogeneity. It tells us how the true effect size varies across
populations, and it does so on a scale that allows us to address the utility of
the intervention. The mistake that researchers make is that they neglect to
report this interval.
9.3.2. Details
The following examples show how the prediction interval addresses the issue
of heterogeneity in a concise and intuitive format.
Figure 28 | Methylphenidate for adults with ADHD | Effect size > 0 favors treatment
• Researchers typically report statistics such as Q, I2, and T2, but none of
these tells us how much the effect size varies. Here, Q is 30.106 with 16
degrees of freedom, I2 is 47%, and T2 is 0.039. Based on this information,
few readers would have any sense of the dispersion in effects.
• By contrast, the prediction interval reports the extent of the dispersion in
the same units as the effect size. The effect size varies over roughly 90
points (in d units) and we understand what that means.
• Additionally, the prediction interval reports the dispersion using absolute
values. It tells us not only that the effects vary over roughly 90 points, but
also that the specific range of values is 0.05 to 0.95 (rather than −0.45 to
+0.45, for example). The treatment is very helpful in some cases and
minimally helpful in others, but there are no populations within the
prediction interval where the treatment is likely to be harmful.
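As a rough check on these numbers, both quantities can be recomputed from the reported statistics. The sketch below uses the standard definition I2 = (Q − df) / Q; the naive mean ± 1.96T range it prints is narrower than the reported 0.05 to 0.95 because, as discussed later in the chapter, the published interval also allows for error in estimating the mean and T.

```python
import math

# Statistics reported for the ADHD analysis.
Q, df, T2, mean_d = 30.106, 16, 0.039, 0.50

# I-squared: the proportion of observed variance that reflects
# variation in true effects rather than sampling error.
I2 = (Q - df) / Q
print(round(100 * I2))          # about 47 (%)

# Naive dispersion range, mean +/- 1.96 * T. This ignores the error
# in estimating the mean and T, so it is narrower than the published
# prediction interval of roughly 0.05 to 0.95.
T = math.sqrt(T2)
print(round(mean_d - 1.96 * T, 2), round(mean_d + 1.96 * T, 2))  # 0.11 0.89
```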
• The prediction interval reports the extent of the dispersion in the same
units as the effect size (mmHg), and we understand what a range of 7
points means on this scale.
• The prediction interval reports the dispersion using absolute values. It
tells us not only that the effects vary over roughly 7 mmHg, but also, as line
[C] shows, that the treatment is helpful (less than zero) in roughly 60% of
populations and harmful (greater than zero) in the other 40%.
Figure 29 | GLP-1 mimetics and diastolic BP | Mean difference < 0 favors treatment
be known, and it would not be possible to have this discussion (see section
9.5).
The prediction interval speaks to the dispersion in effects, and for that reason
only applies when the estimate of the variance (T2) is greater than zero. When
the estimate of T2 is zero, we generally would report the mean and confidence
interval, but not the prediction interval.
Figure 30 | High dose vs. standard dose of statins | Risk ratio < 1 favors high dose
In this analysis, τ2, the variance of true effects, was estimated as zero.
When τ2 is estimated as zero we can generally assume that this is an
underestimate and the actual value of τ2 is positive. Nevertheless, we assume
that the true variance is trivial, and proceed accordingly. Here we would
report that the mean effect size in the universe of comparable populations falls
in the interval 0.786 to 0.917, and that there is no evidence that the effect size
varies across studies.
As always, the confidence interval is an index of precision, not an index
of dispersion. The fact that the confidence interval is 0.786 to 0.917 does not
tell us that the effect size varies from 0.786 in some populations to 0.917 in
others. Rather, we assume that the true effect size is roughly the same in all
populations. This common effect size is assumed to fall somewhere in this
range. Since we assume that the effect size is roughly the same for all
populations, we omit the prediction interval [C].
I describe the prediction interval by reporting (for example) that the effect
size ranges from 0.05 in some populations to 0.95 in others. To be clear, this
is not simply a report of the lowest and highest effects. Rather, the basic
approach to computing prediction intervals is to use the mean plus or minus
two standard deviations, which is the same approach we would take in a
primary study. However, there are some technical issues that we need to
address. For all effect-size indices we need to expand the intervals to take
account of the fact that the mean and standard deviation are estimated with
error. For some effect-size indices we need to transform the values into
another metric before computing the intervals.
In Appendix VII, I present the formulas for computing prediction
intervals that address both issues. As a practical matter, it is much simpler
to use a spreadsheet for the computations. The spreadsheet may be downloaded
from the book's web site and used as an adjunct to any computer program, since
it requires the user to enter only four items (the number of studies, the mean
effect size, the upper limit of the confidence interval, and T2).
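A minimal sketch of that computation, assuming the Higgins, Thompson, and Spiegelhalter formulation of the prediction interval; the function name, the upper confidence limit of 0.62, and the t critical value are my illustrative assumptions, not values taken from the book.

```python
import math

def prediction_interval(mean, ci_upper, T2, t_crit):
    """95% prediction interval, one common formulation (Higgins,
    Thompson & Spiegelhalter): mean +/- t * sqrt(T2 + SE^2), where
    SE is recovered from the upper 95% confidence limit and t is the
    critical value with k - 2 degrees of freedom.
    """
    se = (ci_upper - mean) / 1.96
    half = t_crit * math.sqrt(T2 + se * se)
    return mean - half, mean + half

# T2 matches the ADHD analysis (k = 17 studies, so df = 15 and
# t = 2.131); the upper confidence limit of 0.62 is an illustrative
# assumption, not a value taken from the book.
lo, hi = prediction_interval(mean=0.50, ci_upper=0.62, T2=0.039, t_crit=2.131)
print(round(lo, 2), round(hi, 2))   # roughly 0.06 to 0.94
```

With these inputs the interval lands close to the 0.05 to 0.95 range quoted in the text, which is why the spreadsheet needs only those four items.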
9.3.8. Some caveats regarding the prediction interval
All the analyses we perform as part of a meta-analysis (or any analysis, for
that matter) require that some assumptions be met. If these assumptions are
violated, the results may not be reliable. In the case of prediction intervals,
we need to keep the following in mind.
The interval will be reasonably accurate if it is based on enough data.
The minimum number of studies needed to compute a useful prediction
interval would depend on the extent of heterogeneity, but would likely be at
least ten in many cases (Hedges & Vevea, 1998). It would be reasonable to
have more faith in the accuracy of the interval as the number of studies
increases.
When computing the prediction interval, we typically assume that the
effects are normally distributed. However, in practice this will not always be
the case. For example, Hackshaw, Law, and Wald (1997) looked at the
However, we also have the option of constructing a normal curve for the
prediction interval, as in Figure 31, which is also based on the ADHD
analysis. In this figure line [C] denotes the part of the curve which captures
the effect size in some 95% of all populations. The sections of the plot to the
left and right of line [C] correspond to the 5% of effects that fall outside the
95% prediction interval. Line [C] in Figure 31 is the same as line [C] in Figure
28. However, Figure 31 highlights the fact that most populations will have
an effect size toward the center of the curve, with relatively few near the
extremes.
The web site includes an Excel spreadsheet that can be used to create this
plot. To create the plot, the user needs to enter only the mean effect size, the
upper limit of the confidence interval, Tau-squared, and the number of
studies. Since all programs report these values, the spreadsheet can be used
as an adjunct to any software for meta-analysis.
As noted above, the prediction interval will not be reliable when based on a
small number of studies. To be clear, the problem of trying to estimate the
prediction interval with too few studies applies also to the other indices,
including T2, T, and I2. So, if we are concerned that we do not have enough
studies, switching to one of those indices is not a useful option. Ironically,
the poor precision of T2 and I2 causes few practical problems, because people do
not actually use those values in any meaningful way. By contrast, the
prediction interval does present information in an intuitive format, and so
reporting incorrect values for this interval can have real repercussions. For
that reason, it might be best to only report the interval when we have enough
studies to ensure that the estimate is reasonably precise.
Summary
The one statistic that does provide this information is the prediction
interval. The prediction interval tells us the range of effects in the same
metric that we use for the effect size, so that we understand the range of
dispersion. Critically, it tells us the range of effects on an absolute scale,
so we know (for example) if the impact ranges from moderate to large, or
from trivial to moderate, or from harmful to helpful.
9.4.1. Mistake
9.4.2. Details
The confidence interval and the prediction interval are two entirely separate
indices. They address two entirely distinct issues.
• One goal is to estimate the mean effect size. The confidence interval is
an index of precision, and tells us how precisely we have estimated the
mean. A confidence interval of 40 to 60 tells us that the mean effect size
in the universe of comparable populations falls somewhere in this range.
(More accurately, in 95% of all meta-analyses the mean effect size will
fall within the confidence interval).
• A second goal is to estimate the dispersion of effects. The prediction
interval is an index of dispersion. A prediction interval of 25 to 75 tells
us that the true effect size will be as low as 25 in some populations, and
as high as 75 in others.
At the bottom of the plot are two diamonds. The first diamond shows the
confidence interval for the fixed-effect model, while the second diamond
shows the confidence interval for the random-effects model. The first
diamond has a width of 7.5 points while the second has a width of 20 points.
Researchers sometimes assume that the span for the random-effects model
tells us that the effects are dispersed over this (wider) range. This is incorrect
– both diamonds speak only to the precision of the estimate for the mean.
• The confidence interval labeled “FE” is based on the standard error for
the fixed-effect model or the fixed-effects model. If all studies are
sampled from the same population (fixed effect) or if we are reporting the
mean for the studies in the analysis only and not for a wider universe of
comparable studies (fixed effects), in 95% of all analyses this confidence
interval will include the true effect size for the population(s) in question.
This interval has a width of 7.5 points. This is also labeled [A] in keeping
with the conventions of this volume (see section 5).
• The confidence interval labeled “RE” is based on the standard error for
the random-effects model. If the studies are sampled from different
populations, and we are generalizing to the universe of comparable
populations, in 95% of all analyses this confidence interval will include
the true mean effect size for the universe. This interval has a width of 20
points. This is also labeled [B] in keeping with the conventions of this
volume.
• The confidence interval for the fixed-effect model [A] tells us that the
mean prevalence in this set of thirty studies falls in the range of 0.235 to
0.257.
• The confidence interval for the random-effects model [B] tells us that the
mean prevalence in the universe of comparable populations falls in the
range of 0.194 to 0.272.
• The prediction interval [C] tells us that the prevalence in any single
population is as low as 0.082 in some, and as high as 0.500 in others.
In this example, the random-effects confidence interval [B] spans eight points
while the prediction interval [C] spans forty-two points. Clearly, to conflate
one with the other would be a serious mistake.
Taylor, Smith, Gee, and Nielsen (2012) looked at the impact of augmenting
clozapine with a second antipsychotic (Figure 34). The effect size index is
the standardized mean difference (d).
• The confidence interval for the fixed-effect model extends 0.151 on either
side of the mean [A]. This tells us that the mean effect in this specific set
of fifteen studies falls in the range of −0.349 to −0.052.
• The confidence interval for the random-effects model extends 0.213 on
either side of the mean [B]. This tells us that the mean effect in the
universe of comparable populations falls in the range of −0.452 to −0.026.
• The prediction interval extends 0.590 on either side of the mean [C]. This
tells us that the effect size in any one population could be as low as
−0.83 (improving function by 0.83 units) or as high as +0.35 (harming
function by 0.35 units).
We can say that the mean effect is “Helpful” on average since the
confidence interval for the mean falls entirely to the left of zero. However, in
any single population the effect could be either helpful or harmful since the
prediction interval includes values on both sides of zero. What should be
clear is that the confidence interval and the prediction interval address
two entirely distinct issues, and to conflate one with the other would be a
serious mistake.
Figure 35 | GLP-1 mimetics and diastolic BP | Mean difference < 0 favors treatment
• The prediction interval extends 3.65 units on either side of the mean [C].
This tells us that the effect size in any given population will usually fall
within 3.65 units of the mean, in the range of −4.08 to +3.13.
9.4.7. Formulas
The confidence interval is based on the mean effect size and the standard
error of the mean effect size. By contrast, the prediction interval is based on
the mean effect size and the standard deviation of the effect size. The
confidence interval for the mean may be computed as
CI = M ± 1.96(SE) , (5)
where M is the sample mean and SE is the standard error of the mean. By
contrast, the prediction interval may be computed as
PI = M ± 1.96(T) . (6)
The formula for the confidence interval (5) is the same for the fixed-
effect and the random-effects model, in that both are based on the mean and
the standard error of the mean. Where they differ is in the computation of the
standard error (SE). For the fixed-effect model, the SE reflects sampling error
based on within-study variance, whereas for the random-effects model, the
SE reflects sampling error based on within-study variance and between-study
variance. In the case where the effect size is the score in one group and the
within-study variance is the same for all studies, the standard error for the
fixed-effect model is

SE = √(V / N) , (7)

where V is the within-study variance and N is the total sample size, while for
the random-effects model it is

SE = √(V / N + T2 / k) , (8)

where k is the number of studies.
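Equations (7) and (8) can be illustrated with a short numeric sketch (all values below are invented for illustration), showing how the between-study component T2/k widens the random-effects standard error.

```python
import math

# Illustrative values only: V is the common within-study variance,
# N the total sample size, T2 the between-study variance, and k the
# number of studies.
V, N, T2, k = 1.0, 400, 0.04, 10

se_fixed = math.sqrt(V / N)            # equation (7)
se_random = math.sqrt(V / N + T2 / k)  # equation (8)

# The random-effects SE can never be smaller, since it adds the
# non-negative between-study component T2 / k.
print(round(se_fixed, 3), round(se_random, 3))   # 0.05 0.081
```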
Summary
9.5.1. Mistake
It is widely believed that the I2 statistic tells us how much the effect size varies
across studies. In some cases, this belief is codified, with I2 values of 25%,
50%, and 75% taken to reflect low, moderate, and high amounts of dispersion.
While this interpretation of I2 is ubiquitous, it is nevertheless incorrect, and
reflects a fundamental misunderstanding of this index.
9.5.2. Details
• The prediction interval, which corresponds to line [C] in the plot, tells us
that the true effect size in 95% of all populations will fall in the
approximate range of 0.10 to 0.90. This is what we have in mind when
we ask about heterogeneity.
• By contrast, the I2 statistic tells us about the relationship between the two
distributions. Concretely, I2 is 47%, which tells us that the variance of
true effects (the inner curve) is 47% as large as the variance of observed
effects (the outer curve). This information is relevant for other purposes,
but is tangential to the question of how much the effect size varies.
I present two sets of examples to illustrate this point. The first set uses
the standardized mean difference as the effect size index. The second set uses
the risk ratio as the effect size index. Aside from that, the two sets of
examples are parallel to each other, and the reader should feel free to focus
on either one.
In the Crime analysis (bottom panel), I2 is 92% and the effects vary over 40
points. Thus, the higher value of I2 corresponds to the smaller amount of
dispersion.
The fact that the higher value of I2 corresponds to the smaller amount of
dispersion will be confusing to researchers who assume that I2 tells us how
much the effect size varies. However, it will make sense for researchers who
understand that I2 is a proportion, not an absolute value. This becomes clear
with reference to Figure 38. This is similar to Figure 37, but now each panel
has two curves rather than one. The inner curve is identical to the one in the
prior plot, and corresponds to the dispersion of true effects. But here, we have
added an outer curve which corresponds to the dispersion of observed effects.
The top panel in Figure 38 shows the ADHD analysis. To quantify the
difference between the inner and outer curves we can pick any point on the
distribution and compare the width of one curve vs. the other. At line [C] the
inner curve covers 77 points, whereas the outer curve covers 113 points. The
ratio of inner to outer is thus 68% in linear units or 47% in squared units.
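The arithmetic is easy to verify: squaring the linear ratio of the two curve widths reported above recovers the stated I2 up to rounding.

```python
# Curve widths for the ADHD analysis, as reported in the text: the
# inner (true-effects) curve covers 77 points and the outer
# (observed-effects) curve covers 113 points at line [C].
inner, outer = 77, 113

ratio_linear = inner / outer       # ratio of standard deviations
ratio_squared = ratio_linear ** 2  # ratio of variances, i.e., I2
print(round(ratio_linear, 2), round(ratio_squared, 2))  # 0.68 0.46 (~47%)
```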
Figure 38 | I2 and Prediction interval for two meta-analyses
In the top panel the true effect size varies from roughly 0.10 in some populations
to 0.90 in others, as indicated by line [C]. In the bottom panel the true effect
size varies from −0.10 in some populations to +0.30 in others, as indicated by
line [C]. When we are asking about the utility of an intervention, we almost
invariably are interested in the amount of variance, not the proportion. As
such, we are asking about the prediction interval, and not about I2.
Finally, it might be helpful to show the relationship between these
numbers and the actual forest plot for the two analyses.
92%). However, it should be clear from Figure 41 that the opposite is true,
since the distribution of effects for the Stents analysis is obviously wider than
the distribution of effects for the Smoking analysis.
In each panel, line [C] corresponds to the prediction interval, which tells
us the dispersion of true effects in the metric of the effect-size index. In the
Stents analysis (top panel) I2 is 56% and the effects vary over 86 points. In
the Smoking analysis (bottom panel) I2 is 92% and the effects vary over 25
points. Thus, the higher value of I2 corresponds to the smaller amount of
dispersion.
The fact that the higher value of I2 corresponds to the smaller amount of
dispersion will be confusing to researchers who assume that I2 tells us how
much the effect size varies. However, it will make sense for researchers who
understand that I2 is a proportion, not an absolute value. This becomes clear
with reference to Figure 42. This is similar to Figure 41, but now each panel
has two curves rather than one. The inner curve is identical to the one in the
prior plot, and corresponds to the dispersion of true effects. But here, we have
added an outer curve which corresponds to the dispersion of observed effects.
The top panel in Figure 42 shows the Stents analysis. To quantify the
difference between the inner and outer curves we can pick any point on the
distribution and compare the width of one curve vs. the other. At line [C] the
inner curve covers 86 points, whereas the outer curve covers 140 points. The
ratio of inner to outer in squared units in the log metric is 56%. This is the
meaning of I2, which is defined as the ratio of true to total variance (Appendix
VIII).
Similarly, the bottom panel in Figure 42 shows the Smoking analysis.
To quantify the difference between the inner and outer curves we can pick
any point on the distribution and compare the width of one curve vs. the other.
At line [C] the inner curve covers 25 points, whereas the outer curve covers
27 points. The ratio of inner to outer in squared units in the log metric is 92%
(Appendix VIII). This is the meaning of I2, which is defined as the ratio of
true to total variance.
Figure 42 | I2 and Prediction interval for two meta-analyses
inner curve to the outer curve. In the top panel the ratio is 56% and in the
bottom panel the ratio is 92%. (In the bottom panel the two lines are so close
to each other that they might appear to be a single line.) This is what I2 tells us.
However, if we want to know how much the effect size varies, the answer
is provided by the width of the inner curve on the metric of the analysis. In
the top panel the true risk ratio varies from roughly 0.08 in some populations
to 0.96 in others, as indicated by line [C]. In the bottom panel the true effect
size varies from 0.76 in some populations to 1.01 in others, as indicated by
line [C]. This is what the prediction interval tells us. When we are asking
about the utility of an intervention, we almost invariably are interested in the
amount of variance, not the proportion. As such, we are asking about the
prediction interval, and not about I2.
Finally, it might be helpful to show the relationship between these
numbers and the actual forest plot for the two analyses.
Figure 43 shows the Stents analysis. The general sense conveyed by the
plot is that there is substantial dispersion in the observed effects, but also
substantial sampling error (as reflected in the width of the confidence
intervals about most of the effect sizes). The sampling error can explain some
44% of the
observed variance, and the remaining 56% reflects variance in true effects.
This 56%, the ratio of true to total variance, is I2. As a separate matter, if we
want to know the dispersion of effects on an absolute scale we turn to line
[C]. This corresponds to the prediction interval, and tells us that true effects
vary from around 0.08 in some populations to 0.96 in others. This is the same
as line [C] in the top panel of Figure 42.
Cunill, Castells, Tobias, and Capellà (2016) looked at the impact of drugs on
ADHD. They write “Between-study heterogeneity was assessed using
Cochran’s Q test (Cochran 1954) jointly with the I2 index (Higgins et al.
2003), which enables the percentage of variation in the combined estimate
that can be attributed to heterogeneity to be established (< 25%: low
heterogeneity; 25 to 50 %: moderate; 50-75%: high; > 75%: very high).” The
first part of the sentence defines I2 as a percentage of variance. The part in
parentheses suggests that I2 is an index of absolute variance (e.g., “low
heterogeneity”). These are two different things. If I2 is the first (which it is)
then logically it cannot also be the second.
The focus of this paper is on the heterogeneity in effects, and so the fact that they
use the wrong index to discuss heterogeneity is especially problematic.
9.5.8. In context
While I2 does not tell us how much the effect size varies, it is useful for the
following purposes (Borenstein et al., 2017; J. P. Higgins & Thompson, 2002;
J. P. Higgins et al., 2003).
The original papers on I2 are (J. P. Higgins & Thompson, 2002; J. P. Higgins
et al., 2003). For a more detailed discussion of the issues raised in this section,
see (Borenstein et al., 2017). For related papers see (Borenstein, 2019; Coory,
2010; J. P. Higgins, 2008; Huedo-Medina, Sanchez-Meca, Marin-Martinez,
& Botella, 2006; Ioannidis, 2008a; Patsopoulos, Evangelou, & Ioannidis,
2008; Rucker, Schwarzer, Carpenter, & Schumacher, 2008).
Summary
When we ask about heterogeneity, we intend to ask how much the true
effect size varies across studies. This question is addressed by the
prediction interval which tells us (for example) that the true effect size in
most populations will fall in the range of 0.05 to 0.95. It is not addressed
by the I2 statistic. The I2 statistic tells us what proportion of the variance
in observed effects reflects variation in true effects, rather than sampling
error. It does not tell us how much variation there is.
9.6.1. Mistake
9.6.2. Details
Since I2 does not tell us how much the effects vary, it obviously cannot
be used to classify analyses as having a low, moderate, or high amount of
variation. However, there is an additional point to be made. Let us assume
for a moment that I2 actually told us the amount of variation. What does it
mean to say that a particular amount of dispersion is low, moderate, or high,
unless we put that dispersion in the context of a specific outcome? Consider
the following four examples.
Islam et al. (2017) published the protocol for a meta-analysis to assess the
prevalence of pelvic-floor disorders in women in low and middle-income
countries. The effect size index is the prevalence of the disorder. They plan
to use values of I2 to classify the heterogeneity as being low, moderate, or
high.
9.6.7. In context
Summary
Second, the idea that we can classify heterogeneity into categories without
additional context is silly, since an amount of heterogeneity that would be
considered high in one context would be considered low in another.
9.7.1. Mistake
9.7.2. Details
If there are many studies (and/or large studies) the p-value might be
statistically significant even if the amount of heterogeneity is trivial.
Conversely, if there are few studies (and/or small studies) the p-value might
not be statistically significant even if the amount of heterogeneity is
substantial. For this reason, the p-value cannot serve as a surrogate for the
amount of variation.
Two examples will make this clear.
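The book's worked examples are not reproduced in this excerpt, but the arithmetic behind the point can be sketched. Assuming a chi-square test on Q with k − 1 degrees of freedom (the helper function and all numbers below are invented for illustration; it uses the exact Poisson identity, which holds for even degrees of freedom), two analyses with identical proportional dispersion give very different p-values once the number of studies grows.

```python
import math

def chi2_sf_even_df(q, df):
    """P(X > q) for a chi-square variable with EVEN df, computed via
    the exact identity with the Poisson CDF (no SciPy needed)."""
    assert df % 2 == 0, "identity holds for even df only"
    lam = q / 2.0
    term = math.exp(-lam)
    total = term
    for i in range(1, df // 2):
        term *= lam / i
        total += term
    return total

# Two hypothetical analyses with identical proportional dispersion
# (Q / df = 1.4 in both), differing only in the number of studies
# (k = df + 1).
p_small = chi2_sf_even_df(14.0, 10)    # k = 11 studies
p_large = chi2_sf_even_df(140.0, 100)  # k = 101 studies
print(round(p_small, 3), round(p_large, 4))
```

With 11 studies the test is non-significant (p ≈ 0.17); with 101 studies and the same Q/df ratio it is clearly significant, even though the dispersion per study is unchanged.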
Summary
9.8.1. Mistake
9.8.2. Details
If there are many studies (and/or large studies) the Q-value might be high
even if the amount of observed heterogeneity is trivial. Conversely, if there
are few studies (and/or small studies) the Q-value might be low even if the
amount of heterogeneity is substantial. For this reason, the Q-value cannot
serve as a surrogate for the amount of variation.
To assume that the Q-value tells us something about the extent of
dispersion in a meta-analysis is analogous to assuming that the sum of squares
tells us something about the extent of dispersion in a primary study. In a
primary study, the sum of squares (by itself) does not provide that
information. In a meta-analysis the value of Q (by itself) does not provide
that information.
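The analogy can be made concrete with a short sketch (hypothetical data; `cochran_q` is an illustrative helper): the per-study spread is held fixed while the number of studies grows, and Q grows in proportion, exactly as a raw sum of squares grows with sample size.

```python
# Hypothetical illustration: with the per-study spread held fixed, Q grows in
# proportion to the number of studies, just as a sum of squares grows with n.
import numpy as np

def cochran_q(effects, variances):
    """Cochran's Q statistic (illustrative helper)."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    mean = np.sum(w * y) / np.sum(w)
    return float(np.sum(w * (y - mean) ** 2))

# Effects alternate between 0.49 and 0.51 -- a trivial spread -- and every
# study has the same (high) precision; only the number of studies changes.
qs = {k: cochran_q(np.tile([0.49, 0.51], k // 2), np.full(k, 0.0001))
      for k in (10, 50, 200)}
print(qs)  # Q scales with k even though the dispersion is unchanged
```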
The two examples in the immediately prior section (9.7) can serve here
as well.
If researchers were to judge by the Q-value alone, they might assume that
there was an exceptional amount of heterogeneity.
However, that is not the case here. In fact, the amount of heterogeneity
is modest. The prediction interval [C] is 0.75 to 1.02. This tells us that in
some populations, the treatment reduces the risk of a bad outcome by 25%,
while in others it increases the risk of a bad outcome by 2%.
The Q-value is a function of (1) the amount of observed dispersion, (2)
the number of studies, and (3) the precision of those studies. In this case, our
best estimate is that there is only modest dispersion, but the Q-value is high
primarily because there are many studies, and many of these are precise.
The Q-value does provide one item of information about the heterogeneity.
If Q is less than the degrees of freedom (the number of studies minus one),
the variance will be estimated as zero. Conversely, if Q exceeds the degrees
of freedom, the variance will be estimated as positive. However, that is the
only information we can get directly from Q and the degrees of freedom. To
press Q into service as an index of dispersion would be a mistake.
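That single fact can be seen in the usual DerSimonian–Laird estimator of the between-study variance. The sketch below uses hypothetical data, and `dersimonian_laird` is an illustrative helper: the estimate is truncated at zero exactly when Q falls at or below the degrees of freedom.

```python
# Sketch of the DerSimonian-Laird estimator of tau-squared (hypothetical
# data), showing the one fact Q gives us directly: the estimate is zero
# exactly when Q falls at or below the degrees of freedom.
import numpy as np

def dersimonian_laird(effects, variances):
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    mean = np.sum(w * y) / np.sum(w)
    q = float(np.sum(w * (y - mean) ** 2))
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)   # truncated at zero when Q <= df
    return q, df, tau2

# Q below df: the between-study variance is estimated as zero.
q1, df1, t1 = dersimonian_laird([0.50, 0.51, 0.49], [0.04, 0.04, 0.04])

# Q above df: the between-study variance is estimated as positive.
q2, df2, t2 = dersimonian_laird([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
```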
Summary
9.9.1. Mistake
9.9.2. Details
σ²(τ̂²) = 2(V_M + τ²)² / (k − 1),    (9)

where V_M is the within-study error variance (assumed to be the same for all
studies), τ² is the true between-study variance, and k is the number of studies.
It follows that if V_M and/or τ² are non-trivial, the estimate of τ² will have
poor precision unless we have a substantial number of studies.
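A numerical reading of equation (9) makes the point. The within-study and between-study variances below are hypothetical values chosen for illustration: the standard error of the τ² estimate shrinks only slowly as studies are added.

```python
# Numerical reading of equation (9) in the equal-variance case described
# above: the standard error of the tau-squared estimate shrinks only slowly
# as studies are added, so a small meta-analysis estimates it poorly.
import math

def se_tau2(v_m, tau2, k):
    """Standard error of the tau-squared estimate, per equation (9)."""
    return math.sqrt(2 * (v_m + tau2) ** 2 / (k - 1))

# With within-study variance 0.05 and true tau-squared 0.05 (hypothetical
# values), even 40 studies leave the standard error near half the estimate.
for k in (5, 10, 40):
    print(k, round(se_tau2(0.05, 0.05, k), 4))
```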
The same issue applies to all the statistics that we employ to quantify
heterogeneity, including T², T, I², and the prediction interval. Thus, we cannot
mitigate this problem by switching to an alternate index. When we expect
that the heterogeneity is non-trivial and we have a small number of studies,
the best course of action is to report the extent to which our estimates are
unreliable.
Ironically, while this lack of precision affects all the statistics, the
practical implications of this problem are most serious for the prediction
interval. Since researchers generally misinterpret the meaning of I² and T², if
we estimate these values incorrectly, there is little additional harm done. By
contrast, researchers do understand the prediction interval, and if this interval
is wrong, researchers may reach the wrong conclusions. For this reason, it is
probably best to report the prediction interval only if it is based on at least ten
studies.
Summary
9.10.1. Mistake
Some computer programs report statistics for Q, I², and T² on the line for the
fixed-effect analysis. Researchers sometimes assume that these statistics
apply to the fixed-effect analysis, and then wonder where they can find these
values for the random-effects analysis. This is a mistake.
9.10.2. Details
Only one Q-value is computed in a meta-analysis. From this value we generate
various statistics, some of which apply to the fixed-effect model and some of
which apply to the random-effects model.
The p-value applies to the fixed-effect model. This model requires that
all studies share a common effect size, and if the p-value is statistically
significant we conclude that this assumption has been violated.
While the p-value applies to the fixed-effect model, all estimates of
variance (T², T, and I²) apply to the random-effects model. Importantly, these
estimates apply only to the random-effects model, since under the fixed-effect
model these are all zero by definition.
Some computer programs display these statistics adjacent to the fixed-effect
estimates because the statistics are computed using a model in which T² is
zero, and this happens to correspond to the weights used for the fixed-effect
model. The decision to display these statistics in one section or the other is
of no consequence.
Summary
When we discuss the utility of the drug, this is what we have in mind.
Some might suggest that the drug should be recommended for general use
only if the dispersion looks like Figure 53, while others might suggest that it
should be recommended immediately even if the dispersion looks like Figure
54 or Figure 55. What should be clear, though, is that this discussion should
be based on the dispersion represented in these figures.
The one statistic that directly addresses this dispersion is the prediction
interval. In this example the prediction interval is 0.05 to 0.95. This tells us
that the effect size varies from as low as 0.05 in some populations to as much
as 0.95 in others (corresponding roughly to Figure 55). The prediction
interval addresses this question on the same scale as the effect size, so the
information is unambiguous. It tells us not only that the effects vary over a
range of 0.90, but also that they run from 0.05 to 0.95 rather than (for
example) from −0.45 to +0.45 or from 0.50 to 1.40.
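A sketch of how such an interval is typically computed, using the usual t-based formula (summary mean ± t with k − 2 degrees of freedom times the square root of T² plus the variance of the summary mean). The study count of 17 matches the ADHD example, but the τ² and summary-variance values below are hypothetical, chosen only to reproduce an interval of roughly 0.05 to 0.95.

```python
# Illustrative computation of a random-effects prediction interval using the
# standard t-based formula; tau2 and var_mean below are hypothetical values.
import math
from scipy.stats import t

def prediction_interval(mean, tau2, var_mean, k, level=0.95):
    """Two-sided prediction interval for the true effect in a new population."""
    half = t.ppf(0.5 + level / 2, k - 2) * math.sqrt(tau2 + var_mean)
    return mean - half, mean + half

lo, hi = prediction_interval(mean=0.50, tau2=0.04, var_mean=0.002, k=17)
print(f"{lo:.2f} to {hi:.2f}")
```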