Evidence-Based Policy: In Search of a Method

Ray Pawson

Evaluation, Vol. 8(2): 157–181
Copyright © 2002 SAGE Publications (London, Thousand Oaks and New Delhi)
[1356–3890 (200204)8:2; 157–181; 024512]
Introduction
The topic here is ‘learning from the past’ and the contribution that empirical
research can make. Whether policy makers actually learn from or simply repeat
past mistakes is something of a moot point. What is clear is that policy research
has much to gain by following the sequence whereby social interventions are
mounted in trying, trying, and then trying again to tackle the stubborn problems
that confront modern society. This is the raison d’être behind the current explo-
sion of interest in evidence-based policy (EBP). Few major public initiatives these
days are mounted without a sustained attempt to evaluate them. Rival policy ideas
are thus run through endless trials with plenty of error and it is often difficult to
know which of them really have withstood the test of time. It is arguable, there-
fore, that the prime function of evaluation research should be to take the longer
view. By building a systematic evidence base that captures the ebb and flow of
programme ideas we might be able to adjudicate between contending policy claims
and so capture a progressive understanding of ‘what works’. Such ‘resolutions of
knowledge disputes’, to use Donald Campbell’s (1974) phrase, are the aim of the
research strategies variously known as ‘meta-analysis’, ‘review’ and ‘synthesis’.
Standing between this proud objective and its actual accomplishment lies the
minefield of social science methodology. Evaluation research has come to under-
stand that there is no one ‘gold standard’ method for evaluating single social
programmes. My starting assumption is that exactly the same maxim will come
to apply to the more global ambitions of systematic review. It seems highly likely
that EBP will evolve a variety of methodological strategies (Davies, 2000). This
article seeks to extend that range by offering a critique of the existing orthodoxy,
showing that the lessons currently learnt from the past are somewhat limited.
These omissions are then gathered together in a second part of the article to form
the objectives of a new model of EBP, which I call ‘realist synthesis’.
The present article offers firstly a brief recapitulation of the overwhelming case
for the evaluation movement to dwell on and ponder evidence from previous
research.
It goes on to consider the main current methods for doing so, which have been
split into two perspectives: ‘numerical meta-analysis’ and ‘narrative review’. A
detailed critique of these extant approaches is offered. The two perspectives
operate with different philosophies and methods and are often presented in
counterpoint. There is a heavy whiff of the ‘paradigm wars’ in the literature
surrounding them and much mud slinging on behalf of the ‘quantitative’ versus the
‘qualitative’, ‘positivism’ versus ‘phenomenology’, ‘outcomes’ versus ‘process’ and
so on. This article eschews any interest in these old skirmishes and my critique is,
in fact, aimed at a common ambition of the two approaches. In their different ways,
both aim to review a ‘family of programmes’ with the goal of selecting out the ‘best
buys’ to be developed in future policy making. I want to argue that, despite the
arithmetic clarity and narrative plenitude of their analysis, they do not achieve
decisive results. The former approach makes no effort to understand how
programmes work and so any generalizations issued are insensitive to differences
in programme theories, subjects and circumstances that will crop up in future
applications of the favoured interventions. The latter approach is much more
attuned to the contingent conditions that make for programme success but has no
formal method of abstracting beyond these specifics of time and place, and so is
weak in its ability to deliver transferable lessons. Both approaches thus struggle to
offer congruent advice to the policy architect, who will always be looking to
develop new twists to a body of initiatives, as well as seeking their application in
fresh fields and to different populations.
‘Meta’ is Better
Whence EBP? The case for using systematic review in policy research rests on a
stunningly obvious point about the timing of research vis-à-vis policy: in order
to inform policy, the research must come before the policy. To figure this out does
[Figure 1 (diagram not reproduced). Panel (i): research starts here ... research ends here. Panel (ii), ‘Meta-analysis/Synthesis/Review’: research ends here ... research starts here, with a feedback loop.]
Much ink and vast amounts of frustration have flowed in response to this
sequencing. There is no need for me to repeat all the epithets about ‘quick and
dirty’, ‘breathless’ and ‘brownie point’ evaluations. The key point I wish to under-
score here is that, under the traditional running order, programme design is often
a research-free zone. Furthermore, even if an evaluation manages to be
‘painstaking and clean’ under present conditions, it is still often difficult to trans-
late the research results into policy action. The surrounding realpolitik means
that, within the duration of an evaluation, the direction of the political wind may well
change, with the fundamental programme philosophy being (temporarily)
discredited, and thus not deemed worthy of further research funding. Moreover,
given the turnover in and the career ambitions of policy makers and practitioners,
there is always a burgeoning new wave of programme ideas waiting their turn for
development and evaluation. Under such a regime, we never get to Campbellian
‘resolution of knowledge disputes’ because there is rarely a complete revolution
of the ‘policy-into-research-into-policy’ cycle.
Such is the case for the prosecution. Let us now turn to a key manoeuvre in
the defence of using the evidence base in programme design. The remedy often
suggested for all this misplaced, misspent research effort is to put research in its
appropriate station (at the end of the line) and to push many more scholars back
where they belong (in the library). This strategy is illustrated in the lower portion
of Figure 1. The information flow therein draws out the basic logic of systematic
review, which takes as its starting point the idea that there is nothing entirely new
in the world of policy making and programme architecture. In the era of global
social policy, international programmes and cross-continental evaluation
societies, one can find few policy initiatives that have not been tried and tried
again, and researched and researched again. Thus, the argument goes, if we begin
inquiry at that point where many similar programmes have run their course and
the ink has well and truly dried on all of the research reports thereupon, we may
then be in a better position to offer evidence-based wisdom on what works and
what does not.
On this model, the key driver of research application is thus the feedback loop
(see Figure 1) from past to present programming. To be sure, this bygone
evidence might not quite correspond to any current intervention. But since policy
initiatives are by nature mutable and bend according to the local circumstances
of implementation, then even real-time research (as we have just noted) has
trouble keeping apace.
Like all of the best ideas, the big idea here is a simple one – that research
should attempt to pass on collective wisdom about the successes and failures of
previous initiatives in particular policy domains. The prize is also a big one in that
such an endeavour could provide the antidote to policy making’s frequent lapses
into crowd pleasing, political pandering, window dressing and god acting. I
should add that the apparatus for carrying out systematic reviews is also by now
a big one, the recent mushrooming of national centres and international consortia
for EBP being the biggest single change on the applied research horizon for many
a year.2
Our scene is thus set. I turn now to an examination of the methodology of EBP.
Numerical Meta-analysis
The numerical strategy of EBP, often just referred to as ‘meta-analysis’, is based
on a three-step model: ‘classify’, ‘tally’ and ‘compare’. The basic locus of analysis
is a particular ‘family of programmes’ targeted at a specific problem. Attention
is thus narrowed to the initiatives developed within a particular policy domain
(be it ‘HIV/AIDS-prevention schemes’ or ‘road-safety interventions’ or ‘neigh-
bourhood-watch initiatives’ or ‘mental health programmes’ or whatever). The
analysis begins with the identification of sub-types of the family, with the classifi-
cation based normally on alternative ‘modes of delivery’ of that programme.
Since most policy making is beset with rival assertions on the best means to
particular ends, meta-evaluation promises a method of comparing these contend-
ing claims. This is accomplished by compiling a database examining existing
research on programmes comprising each sub-type, and scrutinizing each case for
a measure of its impact (net effect). The overall comparison is made by calcu-
lating the typical impact (mean effect) achieved by each of the sub-types within
the overall family. This strategy thus provides a league table of effectiveness and
a straightforward measure of programme ‘best buy’. By following these steps,
and appending sensible caveats about future cases not sharing all of the features
of the work under inspection, the idea is to give policy architects some useful
pointers to the more promising areas for future development.
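To fix ideas, this ‘classify’, ‘tally’ and ‘compare’ logic can be written out in a few lines of code. The sketch below is purely illustrative: the sub-type labels and effect sizes are invented, and a real meta-analysis would weight studies (for instance by sample size or inverse variance) rather than take a simple unweighted mean.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical review database: (programme sub-type, net effect size per study).
# Labels and numbers are invented for illustration only.
studies = [
    ("school-based", 0.40), ("school-based", 0.30),
    ("parent training", 0.10), ("parent training", 0.22),
    ("affective education", 0.55), ("affective education", 0.85),
]

# 'Classify': group the studies by programme sub-type.
by_subtype = defaultdict(list)
for subtype, effect in studies:
    by_subtype[subtype].append(effect)

# 'Tally' and 'compare': average within each sub-type and rank the results,
# yielding the 'league table' of mean effects and the apparent 'best buy'.
league_table = sorted(
    ((subtype, mean(effects), len(effects)) for subtype, effects in by_subtype.items()),
    key=lambda row: row[1],
    reverse=True,
)

for subtype, mean_effect, n in league_table:
    print(f"{subtype:22s} n={n}  mean effect = {mean_effect:.2f}")
```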
A typical illustration of numerical strategy is provided in Table 1, which comes
from Durlak and Wells’s (1997) meta-analysis of 177 Primary Prevention Mental
Health (PPMH) programmes for children and adolescents carried out in the US.
According to the authors, PPMH programmes ‘may be defined as an intervention
intentionally designed to reduce incidence of adjustment problems in currently
normal populations as well as efforts directed at the promotion of mental health
functioning’. This broad aim of enhancing psychological well-being by promoting
the capacity to cope with potential problems can be implemented through a variety
of interventions, which were classified by the researchers; see column one of Table
1. Firstly there is the distinction between person-centred programmes (which use
counselling, social learning and instructional approaches) and environment-
centred schemes (modifying school or home conditions to prepare children for life
change). Then there are more specific interventions targeted at ‘milestones’ (such
as the transitions involved in parental divorce or teenage pregnancy and so on).
Table 1. The ‘Classify’, ‘Tally’ and ‘Compare’ Model of Meta-analysis (Durlak and Wells, 1997: 129)

Type of Program                        n     Mean effect
Environment-centred
  School-based                        15     0.35
  Parent training                     10     0.16a
Transition programmes
  Divorce                              7     0.36
  School entry/change                  8     0.39
  First-time mothers                   5     0.87
  Medical/dental procedure            26     0.46
Person-centred programmes
  Affective education
    Children 2–7                       8     0.70
    Children 7–11                     28     0.24
    Children over 11                  10     0.33
  Interpersonal problem solving
    Children 2–7                       6     0.93
    Children 7–11                     12     0.36
    Children over 11                   0     –
  Other person-centred programmes
    Behavioural approach              26     0.49
    Non-behavioural approach          16     0.25

Note: a = non-significant result. Many thanks to Joe Durlak and Anne Wells for their kind permission to reprint the above.
Melded Mechanisms
The first difficulty follows from an absolutely taken-for-granted presumption
about the appropriate locus of comparison in meta-analysis. I have referred to
this above as a ‘programme family’. This method simply assumes that the source
of family resemblance is the ‘policy domain’. In one sense this is not unreason-
able; modern society parcels up social ills by administrative domain and assem-
bles designated institutes, practitioners, funding regimes and programmes to
tackle each problem area. Moreover, it is a routine feature of problem solving
that within each domain there will be some disagreement about the best way of
tackling its trademark concerns. Should interventions be generic or targeted at
sub-populations? Should they be holistic or problem specific? Should they be
aimed at prevention or cure? Should they have an individual focus, or be insti-
tution-centred, or be area-based? Such distinctions affect the various
professional specialities and rivalries that feature in any policy sector. This in turn
creates the context for meta-evaluation, which is asked to judge which way of
tackling the domain problem is most successful. In the case at hand, we are
dealing with the activities of a professional speciality, namely ‘community
psychology’, which probably has a greater institutional coherence in the US than
the UK, but whose sub-divisions into ‘therapists’, ‘community educators’ and so
on would be recognized the world over.
This is the policy apparatus that generates the programme alternatives that
generate the meta-analysis question. Whether it creates a commensurable set of
comparisons and thus a researchable question is, however, a moot point. This
brings me to the crux of my first critique, which is to cast doubt on whether such
an assemblage of policy alternatives constitutes a comparison of ‘like with like’.
Any classification system must face the standard methodological expectations
that it be unidimensional, totally inclusive, mutually exclusive and so forth. We
have seen how the different modes of programme delivery and the range of
targets form the basis of Durlak and Wells’s classification of PPMH programmes.
But the question is, does this classification system provide us with programme
variations on a common theme that can be judged by the same measure, or are
they incommensurable interventions that should be judged in their own terms?
An insight to this issue can be gained by considering the reactions of authors
whose studies have been ‘grist to the meta-analysis mill’. Weissberg and Bell
(1997) were responsible for 3 out of the 12 studies reviewed on ‘interpersonal
problem solving for 7–11-year-olds’ and their barely statistically significant
efforts are thus to be found down in the lower section of the net-effects league
table. They protest that their three inquiries were in fact part of a ‘developmental
sequence’, which saw their intervention grow from 17 to 42 modules; as
programme conceptualization, curriculum, training and implementation
progressed, outcome success also improved across the three trials. They would like
‘work-in-progress’ not to be included in the review, and also point out that
programmes frequently outgrow their meta-analytic classification. Their ‘person-
centred’ intervention really begins to be effective, they claim, when it influences
whole teaching and curriculum regimes and thus becomes ‘environment-centred’.
The important point for the latter pair of authors is that the crux of their
programme, its ‘identifier’, is the programme theory. In their case, the proposed
mechanism for change was the simultaneous transformation of both setting and
teaching method so that the children’s thinking became more oriented to
problem solving.
Weissberg and Bell also offer some special pleading for those programmes (not
of their own) that come bottom of the PPMH meta-evaluation scale. Parenting
initiatives, uniquely amongst this programme set, attempt to secure improve-
ments in children’s mental health through the mechanisms of enhancing child-
rearing practices and increasing child-development knowledge of parents and
guardians. This constitutes a qualitatively different programme theory, which
may see its pay-off in transformations in daily domestic regimes and in long-term
development in the children’s behaviour. This theory might also be particularly
sensitive to the experience of the parent-subjects (first-time parents rather than
older mums and dads being readier to learn new tricks) and may suffer from
problems of take-up (such initiatives clearly have to be voluntary and the most
needful parents might be the hardest to reach).
Such a programme stratagem, therefore, needs rather sophisticated testing.
Ideally this would involve long-term monitoring of both family and child; it would
involve close scrutiny of parental background; and, if the intervention were intro-
duced early in the children’s lives, it would have to do without before-and-after
comparisons of their attitudes and skills. When it comes to meta-analysis, no such
flexibility of the evidence base is possible and Durlak and Wells’s studies all had
to face the single, standard test of success via measures monitoring pre–post-
intervention changes in the child’s behaviour. A potential case for the defence of
parenting initiatives thus remains, on the grounds that what is being picked up in
meta-analysis in these cases is methodological heavy-handedness rather than
programme failure.
The point here is not about ‘special pleading’; I am not attempting to defend
‘combined’ programmes or ‘parenting’ programmes or any specific members of
the PPMH family. Like the typical meta-analyst, I am not close enough to the
original studies to make judgements on the rights and wrongs of these particular
disputes. The point I am underlining is that programme sub-categories cannot
simply be taken on trust. They are not just neutral, natural labels that present
themselves to the researcher. They do not just sit side by side innocently awaiting
inspection against a common criterion. This is especially so if the sub-types follow
Oversimplified Outcomes
Such a warning looms even more vividly when we move from cause to effect, i.e.
to programme outcomes. This is the second problem with numerical meta-
analysis, concealed in that rather tight-fisted term – the ‘mean effect’. The crucial
point to recall as we cast our eyes down the outputs of meta-analysis, such as
column three in Table 1, is that the figures contained therein are means of means
of means of means! It is useful to travel up the chain of aggregation to examine
how exactly the effect calculations are performed for each sub-category of a
programme. Recall that PPMH programmes carry the broad aims of increasing
‘psychological well-being’ and tackling ‘adjustment problems’. These objectives
were tested within each evaluation by the standard method of performing before-
and-after calculations on indicators of the said concepts.
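The ‘before-and-after calculation’ typically takes the form of a standardized effect size. The review does not report which estimator each primary study used, so the following is just one plausible formulation (standardized mean change, with invented scores), offered to make the arithmetic concrete.

```python
from statistics import mean, stdev

def standardized_mean_change(pre, post):
    """One common effect-size formulation: (mean(post) - mean(pre)) / sd(pre).
    Assumed here purely for illustration; primary studies may have used other
    estimators (e.g. treatment-versus-control standardized differences)."""
    return (mean(post) - mean(pre)) / stdev(pre)

# Invented pre- and post-intervention scores on a single indicator
# (say, a self-reported adjustment scale) for one programme's subjects.
pre_scores = [10, 12, 9, 11, 13, 10, 12]
post_scores = [13, 14, 11, 13, 15, 12, 14]

print(f"effect size = {standardized_mean_change(pre_scores, post_scores):.2f}")
```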
Mental health interventions have a long pedigree and so the original evalu-
ations were able to select indicators of change from a whole battery of ‘attitude
measures’, ‘psychological tests’, ‘self-reports’, ‘behavioural observations’,
‘academic performance records’, ‘peer-approval ratings’, ‘problem-solving
vignettes’, ‘physiological assessments of stress’ and so on. As well as varying in
kind, the measurement apparatus for each intervention also had a potentially
different time dimension. Thus ‘post-test’ measures will have had different prox-
imity to the actual intervention and, in some cases but not others, were applied
in the form of third and subsequent ‘follow-ups’. Further, hidden diversity in
outcome measures follows from the possibility that certain indicators (cynics may
guess which!) may have been used but have gone unreported in the original
studies, given limitations on journal space and pressures to report successful
outcomes.
These then are the incredibly assorted raw materials through which meta-
analysis traces programme effects. Now, it is normal to observe some variation
in programme efficacy across the diverse aspects of personal change sought in an
initiative (with it being easier to shift, say, ‘self-reported attitudes’ than ‘anti-
social behaviour’ than ‘academic achievement’). It is also normal to see
programme effects change over time (with many studies showing that early
education gains may soon dissipate but that interpersonal-skills gains are more
robust, e.g. McGuire, Stein and Rosenberg, 1997). It is also normal in multi-goal
initiatives to find internal variation in success across the different programme
objectives (with the school-based programmes having potentially more leverage
on classroom-based measures about ‘discipline referrals’ and ‘personal
competence’ rather than on indicators shaped more from home and neighbour-
hood, such as ‘absenteeism’ and ‘drop-out rates’). In short, programmes always
generate multiple outcomes and much is to be learnt about how they work by
comparing their diverse impacts within and between programmes and over time.
There is little opportunity for such flexibility in meta-analysis, however,
because any one study becomes precisely that – namely ‘study x’ of, for instance,
the 15 school-based studies. Outcome measurement is a tourniquet of compres-
sion. It begins life by observing how each individual changes on a particular
variable and then brings these together as the mean effect for the programme
subjects as a whole. Initiatives normally have multiple effects, and so their various
outcomes, variously measured, are also averaged as the ‘pooled effect’ for that
particular intervention. This, in turn, is melded together with the mean effect
from ‘study y’ from the same subset, even though it may have used different indi-
cators of change. The conflation process continues until all the studies are
gathered in within the sub-type, even though by then the aggregation process
may have fetched in an even wider permutation of outcome measures. Only then
do comparisons begin as we eyeball the mean, mean, mean effects from other
sub-categories of programmes. Meta-analysis, in short, will always generate its
two-decimal-place mean effects – but since it squeezes out much ground-level
variation in the outcomes, it remains open to the charge of spurious precision.
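The compression can be made visible in a few lines of code. In the sketch below, each (invented) study reports effect sizes on several different outcome indicators; these are pooled into a single per-study effect and then averaged across the sub-type, so the spread of outcomes within and between studies, where much of the interesting variation lies, vanishes from the final figure.

```python
from statistics import mean, stdev

# Hypothetical studies from one programme sub-type. Each study reports effect
# sizes on several outcome indicators, and the indicators differ from study to study.
study_outcomes = {
    "study_x": {"self-reported attitudes": 0.60, "antisocial behaviour": 0.15, "peer rating": 0.40},
    "study_y": {"academic achievement": 0.05, "problem-solving vignette": 0.70},
    "study_z": {"teacher-rated competence": 0.45, "absenteeism": -0.10, "self-esteem": 0.55},
}

# Step 1: pool each study's outcomes into a single per-study 'pooled effect'.
pooled = {study: mean(effects.values()) for study, effects in study_outcomes.items()}

# Step 2: average the pooled effects into the sub-type 'mean effect',
# the single figure that appears in the league table.
subtype_mean = mean(pooled.values())

all_effects = [e for effects in study_outcomes.values() for e in effects.values()]
print(f"sub-type mean effect: {subtype_mean:.2f}")
print(f"underlying outcome effects range from {min(all_effects):.2f} to {max(all_effects):.2f} "
      f"(sd = {stdev(all_effects):.2f})")
```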
Concealed Contexts
My third critique assesses, by means of the ‘like with like’ test, a further element
of all social programmes, namely the subjects who, and situations which, are on
the receiving end of the initiatives. No individual-level intervention works for
everyone. No institution-level intervention works everywhere. The net effect of
any particular programme is thus made up of the balance of successes and failures
of individual subjects and locations. Thus any ‘programme outcome’ – single,
pooled or mean – depends not merely upon ‘the programme’ but also on its
subjects and its circumstances. These contextual variations are yet another
feature that is squeezed out of the picture in the aggregation process of meta-
analysis.
Some PPMH programmes, as we have seen, tackle longstanding issues of the
day like educational achievement. Transforming children’s progress in this
respect has proved difficult, not for want of programmes, but because the
educational system as a whole is part of a wider range of social and cultural
inequalities. The gains made under a specific initiative are thus always limited by
such matters as the class and racial composition of the programme subjects and,
beyond that, by the presence or absence of further educational and job oppor-
tunities. The same programme may thus succeed or fail according to how advan-
tageous is its school and community setting. At the aggregate level, this
The message, I trust, is clear: what has to be resisted in meta-analysis is the
tendency to make policy decisions by casting an eye down a net-effects
column such as that in Table 1. The contexts, mechanisms and outcomes that consti-
tute each set of programmes are so diverse that it is improbable that like gets
compared with like. So whilst it might seem objective and prudent to make policy
by numbers, the results may be quite arbitrary. This brings me to a final remark
on Durlak and Wells and to some good news (and some bad). The PPMH meta-
analysis is not in fact used to promote a case for or against any particular subset
of programmes. So, with great sense and caution, the authors avoid presenting
their conclusions as an endorsement of ‘interpersonal problem solving for the
very young’, or denunciation of ‘parenting programmes’, and so on. Indeed they
view the results in Table 1 as ‘across the board’ success and perceive that the
figures support the extension of programmes and further research in PPMH as a
whole.
Their reasoning is not so clear when it comes to step two in the argument. This
research was done against the backdrop of the US Institute of Medicine’s 1994
decisions to exclude mental health promotion from their official definitions of
preventative programmes, with a consequent decline in their professional status.
Durlak and Wells protest that their findings (ESs of 0.24 to 0.93) compare
favourably with the effect sizes reported routinely in psychological, educational
and behavioural treatments (they report that one overview of 156 such meta-
analyses came up with a mean of means of means of means of means of 0.47).
Additionally, ‘the majority of mean effects for many successful medical treat-
ments such as bypass surgery for coronary heart disease, chemotherapy to treat
certain cancers . . . also fall below 0.5’. If one compares for just a second the
nature, tractability and seriousness of the assorted problems and the colossal
differences of programme ideas, subjects, circumstances and outcome measures
involved in this little lot, one concludes that we are being persuaded, after all,
that chalk can be compared to cheese.
Let me make it abundantly clear that my case is not just against this particu-
lar example. There are, of course, more sophisticated examples of meta-analysis
than the one analysed here, not to mention many far less exacting studies. There
are families of interventions for which meta-analysis is more feasible than the
case considered here and domains where it is a complete non-starter. The most
prevalent usage of the method lies in synthesizing the results of clinical trials. In
evidence-based medicine (EBM), one is often dealing with a singular treatment
(the application of a drug) as well as unidimensional and rather reliable outcome
measures (death rates). Here meta-analysis might give important pointers on
what to do (give clot-busting agents such as streptokinase after a coronary) and
what not to do (subject women to unnecessary gynaecological treatment). Of
course, problem ‘heterogeneity’ remains in EBM but it is not of the same order
as in EBP, where interventions work through reasoning subjects rather than
blinded patients.
Even in EBP, meta-analysis is under continuous development. Since I have
space here for only a single shot at a rapidly moving target, I close this section
with some clarification of the scope of the critique. Firstly I do not suppose that
and on, with one meta-analysis study of the efficacy of a vaccine for tuberculosis
discovering that its effects improved the further the study site lay from the
equator (Colditz et al., 1995).
Rather than a league table of effect sizes, second-level meta-analysis produces
as its outcome a ‘meta-regression’ – a causal model identifying associations
between study or subject characteristic and the outcome measure. Much else in
the analysis will remain the same, however. A limited range of mediators and
moderators is likely to be selected, according both to the limits of the researchers’
imaginations (not too many have chosen line of latitude) and to the lack of
consistent information on potential mediators from primary study to primary
study. Additionally if, for example, subject characteristics are the point of
interest, the other dilemmas illustrated above, such as the melding of programme
mechanism and the pooling of outcome measures, are likely to remain evident.
The result is that meta-regression offers a more complex summary of certain
aspects of programme efficacy but one that never accomplishes the privileged
status of ‘generalization’ or ‘law’. The verdict, according to one rather authoritative
source, is:
In many a meta-analysis we have a reliable, useful, causal description but without any
causal explanation. I think that the path models have a heuristic value but often seduce
the reader and the scholar into giving them more weight than they deserve. (Thomas
Cook quoted in Hunt, 1997: 79)
From the point of view of the earlier analysis, this has to be the correct diag-
nosis. For the realist, programmes do not have causal power. Interventions offer
subjects resources, which they then accept or reject, and whether they do so
depends on their characteristics and circumstances. As I have attempted to show
with the PPMH examples, there is an almighty range of different mechanisms
and contexts at work, which produce a vast array of outcome patterns. The
problem is the classic one of the over-determination of evidence by potential
theories. This suggests that rather than being a process of interpolation and esti-
mation, systematic review should attempt the task of extrapolation and expla-
nation. This line of argument is picked up in the second part of the article.
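For concreteness, the sketch below shows the bare bones of such a second-level analysis: study-level effect sizes regressed on a study characteristic (distance from the equator, echoing the BCG example, though the numbers here are invented rather than taken from Colditz et al.). Real meta-regressions weight studies by their precision; the point of the sketch is simply that the output is an association, a ‘causal description’ rather than a causal explanation.

```python
import numpy as np

# Invented study-level data: distance of the study site from the equator
# (degrees of latitude) and the observed effect size in that study.
latitude = np.array([5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
effect = np.array([0.05, 0.10, 0.25, 0.35, 0.50, 0.60])

# Unweighted least-squares fit of effect size on the moderator.
slope, intercept = np.polyfit(latitude, effect, 1)

print(f"meta-regression line: effect = {intercept:.2f} + {slope:.3f} * latitude")
```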
Any further discussion of meta-analysis must be curtailed at this stage. Readers
interested in the latest debate on its second and even third levels should compare
the points of view of Sutton et al. (2001) and Robertson (2001). The attention
here now turns to causal interpretations of a quite different ilk.
Narrative Reviews
We move to the second broad perspective of EBP, which comes in a variety of
shapes and sizes and also goes by an assortment of names. In an attempt to use
a catch-all expression, I shall refer to them as ‘narrative reviews’. Again, I will
discuss a number of examples, but once again note that my real objective is to
capture and criticize the underlying ‘logic’ of the strategy.
The overall aim of the narrative approach is, in fact, not so different to that of
the numerical strategy; a family of programmes is examined in the hope of finding
strategy here, merely pointing out an acute form of underlying tension between
the goals of ‘revealing the essence of each case study’ and ‘effecting a comparison
of all case studies’.
A giant step on from here, within the narrative tradition, is what is sometimes
called the ‘descriptive-analytical’ method. Studies are examined in relation to a
common analytical framework, so the same template of features is applied to
each study scrutinized. A model example, from the same field of childhood
accident prevention, is to be found in the work of Towner et al. (1996). Appendix
H in that study supplies an example of a ‘data extraction form’, alas too long to
reproduce here, which is completed for each study reviewed, collecting infor-
mation as follows:
1. Author, year of publication, place;
2. Target group, age range, setting;
3. Intervention aims and content;
4. Whether programme is educational, environmental or legislative;
5. Whether alliances of stakeholders were involved in programme implemen-
tation;
6. Methodology employed;
7. Outcome measures employed;
8. Summary of important results;
9. Rating of the ‘quality of evidence’.
This represents a move from trying to capture the essence of the original studies
via an ‘abstract/summary’ to attempting to locate their key aspects on a ‘data
matrix’. The significant point, of course, is that, being a narrative approach, the
cell entries in these tabulations are composed mainly of text. This text can be as
full or as brief as research time and inclination allow. In its raw form, the infor-
mation on any single study can thus easily comprise up to two or three pages.
This may include the extraction of a key quotation from the original authors, plus
some simple tick-box information on, say, the age range of the target group. The
record of findings can range from effect sizes to the original author’s thoughts on
the policy implications of their study. Furthermore, the review may also include
the reactions of the reviewer (e.g. ‘very good evaluation – no reservations about
the judgements made’). In short, the raw data of narrative review can take the
form of quite a mélange of different types of information.
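The mixed character of such records can be illustrated with a simple data structure. The sketch below paraphrases the nine headings of the Towner et al. extraction form listed above; the field names and the abbreviated example entry are my own shorthand (drawing on the Yeaton and Bailey row of Table 2 below), not the form's actual wording.

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    """One row of a narrative review's data matrix. Fields paraphrase the nine
    headings of the Towner et al. (1996) extraction form; most entries are
    free text rather than numbers."""
    author_year_place: str
    target_group: str          # target group, age range, setting
    aims_and_content: str
    intervention_type: str     # educational, environmental or legislative
    healthy_alliances: str     # stakeholder alliances involved in implementation
    methodology: str
    outcome_measures: str
    key_results: str           # may mix effect sizes with the original authors' conclusions
    quality_rating: str        # reviewer's rating of the 'quality of evidence'

# An invented, heavily abbreviated entry, for illustration only.
example = ExtractionRecord(
    author_year_place="Yeaton and Bailey 1978, USA",
    target_group="children aged 5-9, school setting",
    aims_and_content="one-to-one roadside demonstrations of street-crossing skills",
    intervention_type="educational",
    healthy_alliances="crossing patrols, schools",
    methodology="before-and-after study",
    outcome_measures="observed crossing behaviour",
    key_results="skills improved; gains maintained at one year",
    quality_rating="reviewer judged the evidence good",
)
print(example.key_results)
```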
Normally such reviews will also provide appendices with the entire matrix on
view in a condensed form, so one can literally ‘read across’ from case to case,
comparing them directly on any of the chosen features. A tiny extract from Towner,
Dowswell and Jarvis’s (1996) summary of 56 road-safety initiatives is reproduced
in Table 2 in order to provide a glimpse of the massive information matrix.
There is little doubt that such a procedure provides an incomparable
overview of ‘what is going on’ in a family of evaluation studies. There is one
clear advantage over the numerical approach: it admits a rather more sophisti-
cated understanding of how programmes work. Meta-analysis is trapped by its
‘successionist’ understanding of causality. The working assumption is that it is
the programme ‘x’ that causes the outcome ‘y’, and the task is to see which
Table 2. The ‘Summary Matrix’ in Narrative Review – Road Safety Education (Towner, Dowswell and Jarvis, 1996)

Road-Safety Education – Experimental Programmes
(Columns: author and date of publication; country of study; injury target group, age in years; study type; type of intervention; healthy alliance; outcome measure(s); outcome(s).)

Yeaton and Bailey 1978; USA; 5–9; before-and-after study; one-to-one real-life demonstrations to teach six street-crossing skills; crossing patrol, schools; observed behaviour; skills improved from 48% to 97% and from 21% to 96%, maintained at year one.

Nishioka et al. 1991; Japan; 4–6; before-and-after study with 2 comparison groups; group training aimed at dashing-out behaviour; no details; reported behaviour; improvements in behaviour on level of training, 40% of children shown to be unsafe AFTER training.

Schioldborg 1976; Norway; preschool; before-and-after study with control group; traffic club; parents; injury rates, observed behaviour; no effect on traffic behaviour, reported 20% reduction in casualty rates and 40% in Oslo.
Little wonder, then, that there is a tendency in some systematic reviews to draw
breath and regard the completion of this sequence as ‘the job done’. On this view,
EBP is seen largely as a process of condensation, a reproduction in miniature of
the various incarnations of a policy idea. The evidence has been culled painstak-
ingly to enable a feedback loop into fresh policy making – but the latter is a task
to be completed by others.
However sympathetic one might be to the toils of information science, this
‘underlabourer’ notion of EBP must be given very short shrift. Evidence, new or
old, numerical or narrative, diffuse or condensed, never speaks for itself. The
analysis and usage of data is a sense-making exercise and not a mechanical one.
Social interventions are descriptively inexhaustible. Of the infinite number of
events, actions and thoughts that comprise a programme, rather few are recorded
in research reports and fewer still in research review. It is highly unlikely, for
instance, that the evidence base will contain data on the astrological signs of
project participants or the square footage of the project headquarters. It is rela-
tively unlikely that the political allegiance of the programme architects or size of
the intervention funding will be systematically reviewed. In other words, certain
explanations for programme success or failure are favoured or disqualified auto-
matically – according to the particular information selected for extraction.
Since it cannot help but contain the seeds of explanation, systematic review has
to acknowledge and foster an interpretative agenda. If, contrariwise, the
evidence base is somehow regarded as ‘raw data’ and the interpretative process
is left incomplete and unhinged, this will simply reproduce the existing division
of labour in which policy makers do the thinking and other stakeholders bear the
consequences. EBP has to be bolder than this or it will provide a mere decora-
tive backwash to policy making, a modern-day version of the old ivory-tower
casuistry, ‘here are some assorted lessons of world history, prime minister – now
go rule the country’.
using roadside ‘mock-ups’ rather than through diagrams and pictures. Why this
might be the case is presumably to do with the benefits of learning practical skills
‘in situ’ rather than ‘on paper’. In other words, we are not persuaded that one
particular programme configuration works better than another because of the
mere juxtaposition of its constituent properties. Rather what convinces is our
ability to draw upon an implicit, much-used and widely useful theory.
My criticism is not to do with missing evidence or wrong-headed interpretation.
I am arguing that the extraction of exemplary cases in narrative overview is not
simply a case of ‘reviewing the evidence’ but depends, to a considerable degree,
on the tacit testing of submerged theories. Why this state of affairs is so little
acknowledged is something of a mystery. Perhaps it is the old empiricist fear of
the ‘speculative’ and the ‘suprasensible’. Be that as it may, such an omission sits
very oddly with the design of interventions, which is all about the quest for
improvement in ‘programme theory’. No doubt the quest for verisimilitude in
road-safety training came from the bright idea that learning from actual practice
(if not through mistakes!) was the best plan of action. The moral of the tale,
however, is clear. The review process needs to acknowledge this vital character-
istic of all programmes and thus appreciate that the evidence base is theory-laden.
Conclusion
I will begin by clarifying what I have attempted to say and tried not to say. I have
disavowed what I think is the usual account, a preference for the numerical or
the narrative. Neither, though I might be accused of doing so, have I declared a
plague on both houses. The key point is this: there are different ways of explain-
ing why a particular programme has been a success or failure. Any particular
evaluation will thus capture only a partial account of the efficacy of an inter-
vention. In relation to the collective evaluation of whole families of programmes,
the lessons learned become even more selective.
Accordingly, any particular model of EBP will be highly truncated in its expla-
nation of what has and has not worked. In this respect, I hope to have shown that
meta-analysis ends up with de-contextualized lessons and that narrative review
concludes with over-contextualized recommendations. My main ambition has
thus been to demonstrate that plenty of room remains in the middle for an
approach that is sensitive to the local conditions of programme efficacy but then
renders such observations into transferable lessons.3
Such an approach must include all the basic informational mechanics, along
with meta-analysis’s sensitivity to outcome variation and narrative review’s
sensitivity to process variation. Bringing all of these together is a rather tough task. The
intended synthesis will not follow from some mechanical blending of existing
methods (such an uneasy compromise was highlighted in the final example).
What is required, above all, is a clear logic of how research review is to underpin
policy prognostication. The attentive reader will have already read my thoughts
on the basic ingredients of that logic. I have mentioned how systematic review
often squeezes out attention to programme ‘mechanisms’, ‘contexts’ and
‘outcome patterns’. I have already mentioned how ‘middle-range theories’ lurk
tacitly in the selection of best buys. These concepts are the staples of realist
explanation (Pawson and Tilley, 1997), which leads me to suppose there might
be promise in a method of ‘realist synthesis’.
Notes
The author would like to thank two anonymous referees for their reminder on the dangers
of over-simplification in methodological debate. All remaining over-simplifications
are, of course, my own work.
1. This article and its companion in the next issue are the preliminary sketches for a new
book seeking to establish a realist methodology for systematic reviews.
2. For details of, and access to, EBP players such as the Cochrane Collaboration,
Campbell Collaboration, EPPI-Centre etc., see the Resources section of the ESRC UK
Centre for Evidence Based Policy and Practice website at www.evidencenetwork.org.
3. I arrive at my conclusion by very different means, but it mirrors much recent thinking
about the need for a ‘multi-method’ strategy of research review and a ‘triangulated’
approach to the evidence base. It is in the area of evidence-based healthcare that this
strategy has received most attention. There are any number of calls for this amalga-
mation of approaches (Popay et al., 1998) as well as methodological pieces (Dixon-
Woods et al., 2001) on the possible roles for qualitative evidence in the combined
approaches. Indeed the ‘first attempt to include and quality-assess process evaluations
as well as outcome evaluations in a systematic way’ was published in 1999 by Harden
et al. Alas it is beyond the scope of this article to assess these efforts, which will be
the subject of a forthcoming article. A glance at the conclusion to Harden et al.’s
review (p. 129), however, will show that these authors have not pursued the realist
agenda.
References
Campbell, D. (1974) ‘Evolutionary Epistemology’, in P. Schilpp (ed.) The Philosophy of
Karl Popper. La Salle: Open Court.
Colditz, G. A., T. F. Brewer, C. S. Berkey et al. (1995) ‘Efficacy of BCG Vaccine in the
Prevention of Tuberculosis: Meta-analysis of the Published Literature’, Journal of the
American Medical Association 271: 698–702.
Davies, P. (2000) ‘The Relevance of Systematic Review to Educational Policy and
Practice’, Oxford Review of Education 26: 365–78.
Dixon-Woods, M., R. Fitzpatrick and K. Roberts (2001) ‘Including Qualitative Research
in Systematic Reviews’, Journal of Evaluation in Clinical Practice 7(2): 125–33.
Durlak, J. and A. Wells (1997) ‘Primary Prevention Mental Health Programs for Children
and Adolescents: A Meta-Analytic Review’, American Journal of Community Psychol-
ogy 25: 115–52.
Gallo, P. S. (1978) ‘Meta-analysis – a Mixed Metaphor’, American Psychologist 33: 515–17.
Harden, A., R. Weston and A. Oakley (1999) ‘A Review of the Effectiveness and Appro-
priateness of Peer-delivered Health Promotion Interventions for Young People’, EPPI-
Centre Research Report. London: Social Science Research Unit, Institute of Education,
University of London.
Hunt, M. (1997) How Science Takes Stock. New York: Russell Sage Foundation.
Kaplan, A. (1964) The Conduct of Inquiry. New York: Chandler.
McGuire, J. B., A. Stein and W. Rosenberg (1997) ‘Evidence-based Medicine and Child
Mental Health Services’, Children and Society 11(1): 89–96.
Moore, B. (1966) Social Origins of Dictatorship and Democracy. New York: Peregrine.
Pawson, R. and N. Tilley (1997) Realistic Evaluation. London: Sage.
Popay, J., A. Rogers and G. Williams (1998) ‘Rationale and Standards for the Systematic
Review of Qualitative Literature in Health Service Research’, Qualitative Health
Research 8(3): 341–51.
Ragin, C. (1987) The Comparative Method. Berkeley, CA: University of California Press.
Robertson, J. (2001) ‘A Critique of Hypertension Trials and of their Evaluation’, Journal
of Evaluation in Clinical Practice 7(2): 149–64.
Sayer, A. (2000) Realism and Social Science. London: Sage.