Evidence-Based Policy: In Search of a Method

Ray Pawson

Evaluation, Vol. 8(2): 157–181
Copyright © 2002 SAGE Publications (London, Thousand Oaks and New Delhi)
[1356–3890 (200204)8:2; 157–181; 024512]
Introduction
The topic here is ‘learning from the past’ and the contribution that empirical
research can make. Whether policy makers actually learn from or simply repeat
past mistakes is something of a moot point. What is clear is that policy research
has much to gain by following the sequence whereby social interventions are
mounted in trying, trying, and then trying again to tackle the stubborn problems
that confront modern society. This is the raison d’être behind the current explo-
sion of interest in evidence-based policy (EBP). Few major public initiatives these
days are mounted without a sustained attempt to evaluate them. Rival policy ideas
are thus run through endless trials with plenty of error and it is often difficult to
know which of them really have withstood the test of time. It is arguable, there-
fore, that the prime function of evaluation research should be to take the longer
view. By building a systematic evidence base that captures the ebb and flow of
programme ideas we might be able to adjudicate between contending policy claims
and so capture a progressive understanding of ‘what works’. Such ‘resolutions of
knowledge disputes’, to use Donald Campbell’s (1974) phrase, are the aim of the
research strategies variously known as ‘meta-analysis’, ‘review’ and ‘synthesis’.
Standing between this proud objective and its actual accomplishment lies the
minefield of social science methodology. Evaluation research has come to under-
stand that there is no one ‘gold standard’ method for evaluating single social
programmes. My starting assumption is that exactly the same maxim will come
to apply to the more global ambitions of systematic review. It seems highly likely
that EBP will evolve a variety of methodological strategies (Davies, 2000). This
article seeks to extend that range by offering a critique of the existing orthodoxy,
showing that the lessons currently learnt from the past are somewhat limited.
These omissions are then gathered together in a second part of the article to form
the objectives of a new model of EBP, which I call ‘realist synthesis’.
The present article offers firstly a brief recapitulation of the overwhelming case
for the evaluation movement to dwell on and ponder evidence from previous
research.
It goes on to consider the main current methods for doing so, which have been
split into two perspectives: ‘numerical meta-analysis’ and ‘narrative review’. A
detailed critique of these extant approaches is offered. The two perspectives
operate with different philosophies and methods and are often presented in
counterpoint. There is a heavy whiff of the ‘paradigm wars’ in the literature
surrounding them and much mud slinging on behalf of the ‘quantitative’ versus the
‘qualitative’, ‘positivism’ versus ‘phenomenology’, ‘outcomes’ versus ‘process’ and
so on. This article eschews any interest in these old skirmishes and my critique is,
in fact, aimed at a common ambition of the two approaches. In their different ways,
both aim to review a ‘family of programmes’ with the goal of selecting out the ‘best
buys’ to be developed in future policy making. I want to argue that, despite the
arithmetic clarity and narrative plenitude of their analysis, they do not achieve
decisive results. The former approach makes no effort to understand how
programmes work and so any generalizations issued are insensitive to differences
in programme theories, subjects and circumstances that will crop up in future
applications of the favoured interventions. The latter approach is much more
attuned to the contingent conditions that make for programme success but has no
formal method of abstracting beyond these specifics of time and place, and so is
weak in its ability to deliver transferable lessons. Both approaches thus struggle to
offer congruent advice to the policy architect, who will always be looking to
develop new twists to a body of initiatives, as well as seeking their application in
fresh fields and to different populations.
‘Meta’ is Better
Whence EBP? The case for using systematic review in policy research rests on a
stunningly obvious point about the timing of research vis-à-vis policy: in order
to inform policy, the research must come before the policy. To figure this out does
[Figure 1 (diagram not reproduced). Panel (i): research starts here ... research ends here. Panel (ii), ‘Meta-analysis/Synthesis/Review’: research ends here ... research starts here, with a feedback loop.]
Much ink and vast amounts of frustration have flowed in response to this
sequencing. There is no need for me to repeat all the epithets about ‘quick and
dirty’, ‘breathless’ and ‘brownie point’ evaluations. The key point I wish to under-
score here is that, under the traditional running order, programme design is often
a research-free zone. Furthermore, even if an evaluation manages to be
‘painstaking and clean’ under present conditions, it is still often difficult to trans-
late the research results into policy action. The surrounding realpolitik means
that, within the duration of an evaluation, the direction of the political wind may well
change, with the fundamental programme philosophy being (temporarily)
discredited, and thus not deemed worthy of further research funding. Moreover,
given the turnover in and the career ambitions of policy makers and practitioners,
there is always a burgeoning new wave of programme ideas waiting their turn for
development and evaluation. Under such a regime, we never get to Campbellian
‘resolution of knowledge disputes’ because there is rarely a complete revolution
of the ‘policy-into-research-into-policy’ cycle.
Such is the case for the prosecution. Let us now turn to a key manoeuvre in
the defence of using the evidence base in programme design. The remedy often
suggested for all this misplaced, misspent research effort is to put research in its
appropriate station (at the end of the line) and to push many more scholars back
where they belong (in the library). This strategy is illustrated in the lower portion
of Figure 1. The information flow therein draws out the basic logic of systematic
review, which takes as its starting point the idea that there is nothing entirely new
in the world of policy making and programme architecture. In the era of global
social policy, international programmes and cross-continental evaluation
societies, one can find few policy initiatives that have not been tried and tried
again, and researched and researched again. Thus, the argument goes, if we begin
inquiry at that point where many similar programmes have run their course and
the ink has well and truly dried on all of the research reports thereupon, we may
then be in a better position to offer evidence-based wisdom on what works and
what does not.
On this model, the key driver of research application is thus the feedback loop
(see Figure 1) from past to present programming. To be sure, this bygone
evidence might not quite correspond to any current intervention. But since policy
initiatives are by nature mutable and bend according to the local circumstances
of implementation, then even real-time research (as we have just noted) has
trouble keeping apace.
Like all of the best ideas, the big idea here is a simple one – that research
should attempt to pass on collective wisdom about the successes and failures of
previous initiatives in particular policy domains. The prize is also a big one in that
such an endeavour could provide the antidote to policy making’s frequent lapses
into crowd pleasing, political pandering, window dressing and god acting. I
should add that the apparatus for carrying out systematic reviews is also by now
a big one, the recent mushrooming of national centres and international consortia
for EBP being the biggest single change on the applied research horizon for many
a year.2
Our scene is thus set. I turn now to an examination of the methodology of EBP.
Numerical Meta-analysis
The numerical strategy of EBP, often just referred to as ‘meta-analysis’, is based
on a three-step model: ‘classify’, ‘tally’ and ‘compare’. The basic locus of analysis
is a particular ‘family of programmes’ targeted at a specific problem. Attention
is thus narrowed to the initiatives developed within a particular policy domain
(be it ‘HIV/AIDS-prevention schemes’ or ‘road-safety interventions’ or ‘neigh-
bourhood-watch initiatives’ or ‘mental health programmes’ or whatever). The
analysis begins with the identification of sub-types of the family, with the classifi-
cation based normally on alternative ‘modes of delivery’ of that programme.
Since most policy making is beset with rival assertions on the best means to
particular ends, meta-evaluation promises a method of comparing these contend-
ing claims. This is accomplished by compiling a database examining existing
research on programmes comprising each sub-type, and scrutinizing each case for
a measure of its impact (net effect). The overall comparison is made by calcu-
lating the typical impact (mean effect) achieved by each of the sub-types within
the overall family. This strategy thus provides a league table of effectiveness and
a straightforward measure of programme ‘best buy’. By following these steps,
and appending sensible caveats about future cases not sharing all of the features
of the work under inspection, the idea is to give policy architects some useful
pointers to the more promising areas for future development.
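To fix ideas, this ‘classify’, ‘tally’ and ‘compare’ logic can be written out in a few lines of code. The sketch below is purely illustrative: the sub-type labels and effect sizes are invented, and a real meta-analysis would weight studies (for instance by sample size or inverse variance) rather than take a simple unweighted mean.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical review database: (programme sub-type, net effect size per study).
# Labels and numbers are invented for illustration only.
studies = [
    ("school-based", 0.40), ("school-based", 0.30),
    ("parent training", 0.10), ("parent training", 0.22),
    ("affective education", 0.55), ("affective education", 0.85),
]

# 'Classify': group the studies by programme sub-type.
by_subtype = defaultdict(list)
for subtype, effect in studies:
    by_subtype[subtype].append(effect)

# 'Tally' and 'compare': average within each sub-type and rank the results,
# yielding the 'league table' of mean effects and the apparent 'best buy'.
league_table = sorted(
    ((subtype, mean(effects), len(effects)) for subtype, effects in by_subtype.items()),
    key=lambda row: row[1],
    reverse=True,
)

for subtype, mean_effect, n in league_table:
    print(f"{subtype:22s} n={n}  mean effect = {mean_effect:.2f}")
```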
A typical illustration of numerical strategy is provided in Table 1, which comes
from Durlak and Wells’s (1997) meta-analysis of 177 Primary Prevention Mental
Health (PPMH) programmes for children and adolescents carried out in the US.
According to the authors, PPMH programmes ‘may be defined as an intervention
intentionally designed to reduce incidence of adjustment problems in currently
normal populations as well as efforts directed at the promotion of mental health
functioning’. This broad aim of enhancing psychological well-being by promoting
the capacity to cope with potential problems can be implemented through a variety
of interventions, which were classified by the researchers; see column one of Table
1. Firstly there is the distinction between person-centred programmes (which use
counselling, social learning and instructional approaches) and environment-
centred schemes (modifying school or home conditions to prepare children for life
change). Then there are more specific interventions targeted at ‘milestones’ (such
as the transitions involved in parental divorce or teenage pregnancy and so on).
Table 1. The ‘Classify’, ‘Tally’ and ‘Compare’ Model of Meta-analysis (Durlak and Wells, 1997: 129)

Type of Program                        n     Mean effect
Environment-centred
  School-based                        15     0.35
  Parent training                     10     0.16a
Transition programmes
  Divorce                              7     0.36
  School entry/change                  8     0.39
  First-time mothers                   5     0.87
  Medical/dental procedure            26     0.46
Person-centred programmes
  Affective education
    Children 2–7                       8     0.70
    Children 7–11                     28     0.24
    Children over 11                  10     0.33
  Interpersonal problem solving
    Children 2–7                       6     0.93
    Children 7–11                     12     0.36
    Children over 11                   0     –
  Other person-centred programmes
    Behavioural approach              26     0.49
    Non-behavioural approach          16     0.25

Note: a = non-significant result. Many thanks to Joe Durlak and Anne Wells for their kind permission to reprint the above.
Melded Mechanisms
The first difficulty follows from an absolutely taken-for-granted presumption
about the appropriate locus of comparison in meta-analysis. I have referred to
this above as a ‘programme family’. This method simply assumes that the source
of family resemblance is the ‘policy domain’. In one sense this is not unreason-
able; modern society parcels up social ills by administrative domain and assem-
bles designated institutes, practitioners, funding regimes and programmes to
tackle each problem area. Moreover, it is a routine feature of problem solving
that within each domain there will be some disagreement about the best way of
tackling its trademark concerns. Should interventions be generic or targeted at
sub-populations? Should they be holistic or problem specific? Should they be
aimed at prevention or cure? Should they have an individual focus, or be insti-
tution-centred, or be area-based? Such distinctions affect the various
professional specialities and rivalries that feature in any policy sector. This in turn
creates the context for meta-evaluation, which is asked to judge which way of
tackling the domain problem is most successful. In the case at hand, we are
dealing with the activities of a professional speciality, namely ‘community
psychology’, which probably has a greater institutional coherence in the US than
the UK, but whose sub-divisions into ‘therapists’, ‘community educators’ and so
on would be recognized the world over.
This is the policy apparatus that generates the programme alternatives that
generate the meta-analysis question. Whether it creates a commensurable set of
comparisons and thus a researchable question is, however, a moot point. This
brings me to the crux of my first critique, which is to cast doubt on whether such
an assemblage of policy alternatives constitutes a comparison of ‘like with like’.
Any classification system must face the standard methodological expectations
that it be unidimensional, totally inclusive, mutually exclusive and so forth. We
have seen how the different modes of programme delivery and the range of
targets form the basis of Durlak and Wells’s classification of PPMH programmes.
But the question is, does this classification system provide us with programme
variations on a common theme that can be judged by the same measure, or are
they incommensurable interventions that should be judged in their own terms?
An insight to this issue can be gained by considering the reactions of authors
whose studies have been ‘grist to the meta-analysis mill’. Weissberg and Bell
(1997) were responsible for 3 out of the 12 studies reviewed on ‘interpersonal
problem solving for 7–11-year-olds’ and their barely statistically significant
efforts are thus to be found down in the lower section of the net-effects league
table. They protest that their three inquiries were in fact part of a ‘developmental
sequence’, which saw their intervention grow from 17 to 42 modules; as
programme conceptualization, curriculum, training and implementation
progressed, outcome success also improved across the three trials. They would like
‘work-in-progress’ not to be included in the review, and also point out that
programmes frequently outgrow their meta-analytic classification. Their ‘person-
centred’ intervention really begins to be effective, they claim, when it influences
whole teaching and curriculum regimes and thus becomes ‘environment-centred’.
The important point for the latter pair of authors is that the crux of their
programme, its ‘identifier’, is the programme theory. In their case, the proposed
mechanism for change was the simultaneous transformation of both setting and
teaching method so that the children’s thinking became more oriented to
problem solving.
Weissberg and Bell also offer some special pleading for those programmes (not
of their own) that come bottom of the PPMH meta-evaluation scale. Parenting
initiatives, uniquely amongst this programme set, attempt to secure improve-
ments in children’s mental health through the mechanisms of enhancing child-
rearing practices and increasing child-development knowledge of parents and
guardians. This constitutes a qualitatively different programme theory, which
may see its pay-off in transformations in daily domestic regimes and in long-term
development in the children’s behaviour. This theory might also be particularly
sensitive to the experience of the parent-subjects (first-time parents rather than
older mums and dads being readier to learn new tricks) and may suffer from
problems of take-up (such initiatives clearly have to be voluntary and the most
needful parents might be the hardest to reach).
Such a programme stratagem, therefore, needs rather sophisticated testing.
Ideally this would involve long-term monitoring of both family and child; it would
involve close scrutiny of parental background; and, if the intervention were intro-
duced early in the children’s lives, it would have to do without before-and-after
comparisons of their attitudes and skills. When it comes to meta-analysis, no such
flexibility of the evidence base is possible and Durlak and Wells’s studies all had
to face the single, standard test of success via measures monitoring pre–post-
intervention changes in the child’s behaviour. A potential case for the defence of
parenting initiatives thus remains, on the grounds that what is being picked up in
meta-analysis in these cases is methodological heavy-handedness rather than
programme failure.
The point here is not about ‘special pleading’; I am not attempting to defend
‘combined’ programmes or ‘parenting’ programmes or any specific members of
the PPMH family. Like the typical meta-analyst, I am not close enough to the
original studies to make judgements on the rights and wrongs of these particular
disputes. The point I am underlining is that programme sub-categories cannot
simply be taken on trust. They are not just neutral, natural labels that present
themselves to the researcher. They do not just sit side by side innocently awaiting
inspection against a common criterion. This is especially so if the sub-types follow
Oversimplified Outcomes
Such a warning looms even more vividly when we move from cause to effect, i.e.
to programme outcomes. This is the second problem with numerical meta-
analysis, concealed in that rather tight-fisted term – the ‘mean effect’. The crucial
point to recall as we cast our eyes down the outputs of meta-analysis, such as
column three in Table 1, is that the figures contained therein are means of means
of means of means! It is useful to travel up the chain of aggregation to examine
how exactly the effect calculations are performed for each sub-category of a
programme. Recall that PPMH programmes carry the broad aims of increasing
‘psychological well-being’ and tackling ‘adjustment problems’. These objectives
were tested within each evaluation by the standard method of performing before-
and-after calculations on indicators of the said concepts.
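The ‘before-and-after calculation’ typically takes the form of a standardized effect size. The review does not report which estimator each primary study used, so the following is just one plausible formulation (standardized mean change, with invented scores), offered to make the arithmetic concrete.

```python
from statistics import mean, stdev

def standardized_mean_change(pre, post):
    """One common effect-size formulation: (mean(post) - mean(pre)) / sd(pre).
    Assumed here purely for illustration; primary studies may have used other
    estimators (e.g. treatment-versus-control standardized differences)."""
    return (mean(post) - mean(pre)) / stdev(pre)

# Invented pre- and post-intervention scores on a single indicator
# (say, a self-reported adjustment scale) for one programme's subjects.
pre_scores = [10, 12, 9, 11, 13, 10, 12]
post_scores = [13, 14, 11, 13, 15, 12, 14]

print(f"effect size = {standardized_mean_change(pre_scores, post_scores):.2f}")
```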
Mental health interventions have a long pedigree and so the original evalu-
ations were able to select indicators of change from a whole battery of ‘attitude
measures’, ‘psychological tests’, ‘self-reports’, ‘behavioural observations’,
‘academic performance records’, ‘peer-approval ratings’, ‘problem-solving
vignettes’, ‘physiological assessments of stress’ and so on. As well as varying in
kind, the measurement apparatus for each intervention also had a potentially
different time dimension. Thus ‘post-test’ measures will have had different prox-
imity to the actual intervention and, in some cases but not others, were applied
in the form of third and subsequent ‘follow-ups’. Further, hidden diversity in
outcome measures follows from the possibility that certain indicators (cynics may
guess which!) may have been used but have gone unreported in the original
studies, given limitations on journal space and pressures to report successful
outcomes.
These then are the incredibly assorted raw materials through which meta-
analysis traces programme effects. Now, it is normal to observe some variation
in programme efficacy across the diverse aspects of personal change sought in an
initiative (with it being easier to shift, say, ‘self-reported attitudes’ than ‘anti-
social behaviour’ than ‘academic achievement’). It is also normal to see
programme effects change over time (with many studies showing that early
education gains may soon dissipate but that interpersonal-skills gains are more
robust, e.g. McGuire, Stein and Rosenberg, 1997). It is also normal in multi-goal
initiatives to find internal variation in success across the different programme
objectives (with the school-based programmes having potentially more leverage
on classroom-based measures about ‘discipline referrals’ and ‘personal
competence’ rather than on indicators shaped more from home and neighbour-
hood, such as ‘absenteeism’ and ‘drop-out rates’). In short, programmes always
generate multiple outcomes and much is to be learnt about how they work by
comparing their diverse impacts within and between programmes and over time.
There is little opportunity for such flexibility in meta-analysis, however,
because any one study becomes precisely that – namely ‘study x’ of, for instance,
the 15 school-based studies. Outcome measurement is a tourniquet of compres-
sion. It begins life by observing how each individual changes on a particular
variable and then brings these together as the mean effect for the programme
subjects as a whole. Initiatives normally have multiple effects, and so their various
outcomes, variously measured, are also averaged as the ‘pooled effect’ for that
particular intervention. This, in turn, is melded together with the mean effect
from ‘study y’ from the same subset, even though it may have used different indi-
cators of change. The conflation process continues until all the studies are
gathered in within the sub-type, even though by then the aggregation process
may have fetched in an even wider permutation of outcome measures. Only then
do comparisons begin as we eyeball the mean, mean, mean effects from other
sub-categories of programmes. Meta-analysis, in short, will always generate its
two-decimal-place mean effects – but since it squeezes out much ground-level
variation in the outcomes, it remains open to the charge of spurious precision.
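The compression can be made visible in a few lines of code. In the sketch below, each (invented) study reports effect sizes on several different outcome indicators; these are pooled into a single per-study effect and then averaged across the sub-type, so the spread of outcomes within and between studies, where much of the interesting variation lies, vanishes from the final figure.

```python
from statistics import mean, stdev

# Hypothetical studies from one programme sub-type. Each study reports effect
# sizes on several outcome indicators, and the indicators differ from study to study.
study_outcomes = {
    "study_x": {"self-reported attitudes": 0.60, "antisocial behaviour": 0.15, "peer rating": 0.40},
    "study_y": {"academic achievement": 0.05, "problem-solving vignette": 0.70},
    "study_z": {"teacher-rated competence": 0.45, "absenteeism": -0.10, "self-esteem": 0.55},
}

# Step 1: pool each study's outcomes into a single per-study 'pooled effect'.
pooled = {study: mean(effects.values()) for study, effects in study_outcomes.items()}

# Step 2: average the pooled effects into the sub-type 'mean effect',
# the single figure that appears in the league table.
subtype_mean = mean(pooled.values())

all_effects = [e for effects in study_outcomes.values() for e in effects.values()]
print(f"sub-type mean effect: {subtype_mean:.2f}")
print(f"underlying outcome effects range from {min(all_effects):.2f} to {max(all_effects):.2f} "
      f"(sd = {stdev(all_effects):.2f})")
```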
Concealed Contexts
My third critique assesses, by means of the ‘like with like’ test, a further element
of all social programmes, namely the subjects who, and situations which, are on
the receiving end of the initiatives. No individual-level intervention works for
everyone. No institution-level intervention works everywhere. The net effect of
any particular programme is thus made up of the balance of successes and failures
of individual subjects and locations. Thus any ‘programme outcome’ – single,
pooled or mean – depends not merely upon ‘the programme’ but also on its
subjects and its circumstances. These contextual variations are yet another
feature that is squeezed out of the picture in the aggregation process of meta-
analysis.
Some PPMH programmes, as we have seen, tackle longstanding issues of the
day like educational achievement. Transforming children’s progress in this
respect has proved difficult, not for want of programmes, but because the
educational system as a whole is part of a wider range of social and cultural
inequalities. The gains made under a specific initiative are thus always limited by
such matters as the class and racial composition of the programme subjects and,
beyond that, by the presence or absence of further educational and job oppor-
tunities. The same programme may thus succeed or fail according to how advan-
tageous is its school and community setting. At the aggregate level, this
The message, I trust, is clear: what has to be resisted in meta-analysis is the
tendency to make policy decisions by casting an eye down a net-effects
column such as that in Table 1. The contexts, mechanisms and outcomes that consti-
tute each set of programmes are so diverse that it is improbable that like gets
compared with like. So whilst it might seem objective and prudent to make policy
by numbers, the results may be quite arbitrary. This brings me to a final remark
on Durlak and Wells and to some good news (and some bad). The PPMH meta-
analysis is not in fact used to promote a case for or against any particular subset
of programmes. So, with great sense and caution, the authors avoid presenting
their conclusions as an endorsement of ‘interpersonal problem solving for the
very young’, or denunciation of ‘parenting programmes’, and so on. Indeed they
view the results in Table 1 as ‘across the board’ success and perceive that the
figures support the extension of programmes and further research in PPMH as a
whole.
Their reasoning is not so clear when it comes to step two in the argument. This
research was done against the backdrop of the US Institute of Medicine’s 1994
decisions to exclude mental health promotion from their official definitions of
preventative programmes, with a consequent decline in their professional status.
Durlak and Wells protest that their findings (ESs of 0.24 to 0.93) compare
favourably with the effect sizes reported routinely in psychological, educational
and behavioural treatments (they report that one overview of 156 such meta-
analyses came up with a mean of means of means of means of means of 0.47).
Additionally, ‘the majority of mean effects for many successful medical treat-
ments such as bypass surgery for coronary heart disease, chemotherapy to treat
certain cancers . . . also fall below 0.5’. If one compares for just a second the
nature, tractability and seriousness of the assorted problems and the colossal
differences of programme ideas, subjects, circumstances and outcome measures
involved in this little lot, one concludes that we are being persuaded, after all,
that chalk can be compared to cheese.
Let me make it abundantly clear that my case is not just against this particu-
lar example. There are, of course, more sophisticated examples of meta-analysis
than the one analysed here, not to mention many far less exacting studies. There
are families of interventions for which meta-analysis is more feasible than the
case considered here and domains where it is a complete non-starter. The most
prevalent usage of the method lies in synthesizing the results of clinical trials. In
evidence-based medicine (EBM), one is often dealing with a singular treatment
(the application of a drug) as well as unidimensional and rather reliable outcome
measures (death rates). Here meta-analysis might give important pointers on
what to do (give clot-busting agents such as streptokinase after a coronary) and
what not to do (subject women to unnecessary gynaecological treatment). Of
course, problem ‘heterogeneity’ remains in EBM but it is not of the same order
as in EBP, where interventions work through reasoning subjects rather than
blinded patients.
Even in EBP, meta-analysis is under continuous development. Since I have
space here for only a single shot at a rapidly moving target, I close this section
with some clarification of the scope of the critique. Firstly I do not suppose that
and on, with one meta-analysis study of the efficacy of a vaccine for tuberculosis
discovering that its effects improved the further the study site lay from the
equator (Colditz et al., 1995).
Rather than a league table of effect sizes, second-level meta-analysis produces
as its outcome a ‘meta-regression’ – a causal model identifying associations
between study or subject characteristic and the outcome measure. Much else in
the analysis will remain the same, however. A limited range of mediators and
moderators is likely to be selected, according both to the limits of the researchers’
imaginations (not too many have chosen line of latitude) and to the lack of
consistent information on potential mediators from primary study to primary
study. Additionally if, for example, subject characteristics are the point of
interest, the other dilemmas illustrated above, such as the melding of programme
mechanism and the pooling of outcome measures, are likely to remain evident.
The result is that meta-regression offers a more complex summary of certain
aspects of programme efficacy but one that never accomplishes the privileged
status of ‘generalization’ or ‘law’. The verdict, according to one rather authoritative
source, is:
In many a meta-analysis we have a reliable, useful, causal description but without any
causal explanation. I think that the path models have a heuristic value but often seduce
the reader and the scholar into giving them more weight than they deserve. (Thomas
Cook quoted in Hunt, 1997: 79)
From the point of view of the earlier analysis, this has to be the correct diag-
nosis. For the realist, programmes do not have causal power. Interventions offer
subjects resources, which they then accept or reject, and whether they do so
depends on their characteristics and circumstances. As I have attempted to show
with the PPMH examples, there is an almighty range of different mechanisms
and contexts at work, which produce a vast array of outcome patterns. The
problem is the classic one of the over-determination of evidence by potential
theories. This suggests that rather than being a process of interpolation and esti-
mation, systematic review should attempt the task of extrapolation and expla-
nation. This line of argument is picked up in the second part of the article.
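For concreteness, the sketch below shows the bare bones of such a second-level analysis: study-level effect sizes regressed on a study characteristic (distance from the equator, echoing the BCG example, though the numbers here are invented rather than taken from Colditz et al.). Real meta-regressions weight studies by their precision; the point of the sketch is simply that the output is an association, a ‘causal description’ rather than a causal explanation.

```python
import numpy as np

# Invented study-level data: distance of the study site from the equator
# (degrees of latitude) and the observed effect size in that study.
latitude = np.array([5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
effect = np.array([0.05, 0.10, 0.25, 0.35, 0.50, 0.60])

# Unweighted least-squares fit of effect size on the moderator.
slope, intercept = np.polyfit(latitude, effect, 1)

print(f"meta-regression line: effect = {intercept:.2f} + {slope:.3f} * latitude")
```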
Any further discussion of meta-analysis must be curtailed at this stage. Readers
interested in the latest debate on its second and even third levels should compare
the points of view of Sutton et al. (2001) and Robertson (2001). The attention
here now turns to causal interpretations of a quite different ilk.
Narrative Reviews
We move to the second broad perspective of EBP, which comes in a variety of
shapes and sizes and also goes by an assortment of names. In an attempt to use
a catch-all expression, I shall refer to them as ‘narrative reviews’. Again, I will
discuss a number of examples, but once again note that my real objective is to
capture and criticize the underlying ‘logic’ of the strategy.
The overall aim of the narrative approach is, in fact, not so different to that of
the numerical strategy; a family of programmes is examined in the hope of finding
strategy here, merely pointing out an acute form of underlying tension between
the goals of ‘revealing the essence of each case study’ and ‘effecting a comparison
of all case studies’.
A giant step on from here, within the narrative tradition, is what is sometimes
called the ‘descriptive-analytical’ method. Studies are examined in relation to a
common analytical framework, so the same template of features is applied to
each study scrutinized. A model example, from the same field of childhood
accident prevention, is to be found in the work of Towner et al. (1996). Appendix
H in that study supplies an example of a ‘data extraction form’, alas too long to
reproduce here, which is completed for each study reviewed, collecting infor-
mation as follows:
1. Author, year of publication, place;
2. Target group, age range, setting;
3. Intervention aims and content;
4. Whether programme is educational, environmental or legislative;
5. Whether alliances of stakeholders were involved in programme implemen-
tation;
6. Methodology employed;
7. Outcome measures employed;
8. Summary of important results;
9. Rating of the ‘quality of evidence’.
This represents a move from trying to capture the essence of the original studies
via an ‘abstract/summary’ to attempting to locate their key aspects on a ‘data
matrix’. The significant point, of course, is that, being a narrative approach, the
cell entries in these tabulations are composed mainly of text. This text can be as
full or as brief as research time and inclination allow. In its raw form, the infor-
mation on any single study can thus easily comprise up to two or three pages.
This may include the extraction of a key quotation from the original authors, plus
some simple tick-box information on, say, the age range of the target group. The
record of findings can range from effect sizes to the original author’s thoughts on
the policy implications of their study. Furthermore, the review may also include
the reactions of the reviewer (e.g. ‘very good evaluation – no reservations about
the judgements made’). In short, the raw data of narrative review can take the
form of quite a mélange of different types of information.
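The mixed character of such records can be illustrated with a simple data structure. The sketch below paraphrases the nine headings of the Towner et al. extraction form listed above; the field names and the abbreviated example entry are my own shorthand (drawing on the Yeaton and Bailey row of Table 2 below), not the form's actual wording.

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    """One row of a narrative review's data matrix. Fields paraphrase the nine
    headings of the Towner et al. (1996) extraction form; most entries are
    free text rather than numbers."""
    author_year_place: str
    target_group: str          # target group, age range, setting
    aims_and_content: str
    intervention_type: str     # educational, environmental or legislative
    healthy_alliances: str     # stakeholder alliances involved in implementation
    methodology: str
    outcome_measures: str
    key_results: str           # may mix effect sizes with the original authors' conclusions
    quality_rating: str        # reviewer's rating of the 'quality of evidence'

# An invented, heavily abbreviated entry, for illustration only.
example = ExtractionRecord(
    author_year_place="Yeaton and Bailey 1978, USA",
    target_group="children aged 5-9, school setting",
    aims_and_content="one-to-one roadside demonstrations of street-crossing skills",
    intervention_type="educational",
    healthy_alliances="crossing patrols, schools",
    methodology="before-and-after study",
    outcome_measures="observed crossing behaviour",
    key_results="skills improved; gains maintained at one year",
    quality_rating="reviewer judged the evidence good",
)
print(example.key_results)
```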
Normally such reviews will also provide appendices with the entire matrix on
view in a condensed form, so one can literally ‘read across’ from case to case,
comparing them directly on any of the chosen features. A tiny extract from Towner,
Dowswell and Jarvis’s (1996) summary of 56 road-safety initiatives is reproduced
in Table 2 in order to provide a glimpse of the massive information matrix.
There is little doubt that such a procedure provides an incomparable
overview of ‘what is going on’ in a family of evaluation studies. There is one
clear advantage over the numerical approach: it admits a rather more sophisti-
cated understanding of how programmes work. Meta-analysis is trapped by its
‘successionist’ understanding of causality. The working assumption is that it is
the programme ‘x’ that causes the outcome ‘y’, and the task is to see which
Table 2. The ‘Summary Matrix’ in Narrative Review – Road Safety Education (Towner, Dowswell and Jarvis, 1996)

Road-Safety Education – Experimental Programmes
(Columns: author and date of publication; country of study; injury target group, age in years; study type; type of intervention; healthy alliance; outcome measure(s); outcome(s).)

Yeaton and Bailey 1978; USA; 5–9; before-and-after study; one-to-one real-life demonstrations to teach six street-crossing skills; crossing patrol, schools; observed behaviour; skills improved from 48% to 97% and from 21% to 96%, maintained at year one.

Nishioka et al. 1991; Japan; 4–6; before-and-after study with 2 comparison groups; group training aimed at dashing-out behaviour; no details; reported behaviour; improvements in behaviour on level of training, 40% of children shown to be unsafe AFTER training.

Schioldborg 1976; Norway; preschool; before-and-after study with control group; traffic club; parents; injury rates, observed behaviour; no effect on traffic behaviour, reported 20% reduction in casualty rates and 40% in Oslo.
Little wonder, then, that there is a tendency in some systematic reviews to draw
breath and regard the completion of this sequence as ‘the job done’. On this view,
EBP is seen largely as a process of condensation, a reproduction in miniature of
the various incarnations of a policy idea. The evidence has been culled painstak-
ingly to enable a feedback loop into fresh policy making – but the latter is a task
to be completed by others.
However sympathetic one might be to the toils of information science, this
‘underlabourer’ notion of EBP must be given very short shrift. Evidence, new or
old, numerical or narrative, diffuse or condensed, never speaks for itself. The
analysis and usage of data is a sense-making exercise and not a mechanical one.
Social interventions are descriptively inexhaustible. Of the infinite number of
events, actions and thoughts that comprise a programme, rather few are recorded
in research reports and fewer still in research review. It is highly unlikely, for
instance, that the evidence base will contain data on the astrological signs of
project participants or the square footage of the project headquarters. It is rela-
tively unlikely that the political allegiance of the programme architects or size of
the intervention funding will be systematically reviewed. In other words, certain
explanations for programme success or failure are favoured or disqualified auto-
matically – according to the particular information selected for extraction.
Since it cannot help but contain the seeds of explanation, systematic review has
to acknowledge and foster an interpretative agenda. If, contrariwise, the
evidence base is somehow regarded as ‘raw data’ and the interpretative process
is left incomplete and unhinged, this will simply reproduce the existing division
of labour in which policy makers do the thinking and other stakeholders bear the
consequences. EBP has to be bolder than this or it will provide a mere decora-
tive backwash to policy making, a modern-day version of the old ivory-tower
casuistry, ‘here are some assorted lessons of world history, prime minister – now
go rule the country’.
using roadside ‘mock-ups’ rather than through diagrams and pictures. Why this
might be the case is presumably to do with the benefits of learning practical skills
‘in situ’ rather than ‘on paper’. In other words, we are not persuaded that one
particular programme configuration works better than another because of the
mere juxtaposition of its constituent properties. Rather what convinces is our
ability to draw upon an implicit, much-used and widely useful theory.
My criticism is not to do with missing evidence or wrong-headed interpretation.
I am arguing that the extraction of exemplary cases in narrative overview is not
simply a case of ‘reviewing the evidence’ but depends, to a considerable degree,
on the tacit testing of submerged theories. Why this state of affairs is so little
acknowledged is something of a mystery. Perhaps it is the old empiricist fear of
the ‘speculative’ and the ‘suprasensible’. Be that as it may, such an omission sits
very oddly with the design of interventions, which is all about the quest for
improvement in ‘programme theory’. No doubt the quest for verisimilitude in
road-safety training came from the bright idea that learning from actual practice
(if not through mistakes!) was the best plan of action. The moral of the tale,
however, is clear. The review process needs to acknowledge this vital character-
istic of all programmes and thus appreciate that the evidence base is theory-laden.
Conclusion
I will begin by clarifying what I have attempted to say and tried not to say. I have
disavowed what I think is the usual account, a preference for the numerical or
the narrative. Neither, though I might be accused of doing so, have I declared a
plague on both houses. The key point is this: there are different ways of explain-
ing why a particular programme has been a success or failure. Any particular
evaluation will thus capture only a partial account of the efficacy of an inter-
vention. In relation to the collective evaluation of whole families of programmes,
the lessons learned become even more selective.
Accordingly, any particular model of EBP will be highly truncated in its expla-
nation of what has and has not worked. In this respect, I hope to have shown that
meta-analysis ends up with de-contextualized lessons and that narrative review
concludes with over-contextualized recommendations. My main ambition has
thus been to demonstrate that plenty of room remains in the middle for an
approach that is sensitive to the local conditions of programme efficacy but then
renders such observations into transferable lessons.3
Such an approach must include all the basic informational mechanics, along
with meta-analysis’s sensitivity to outcome variation and narrative review’s
sensitivity to process variation. Bringing all of these together is a rather tough task. The
intended synthesis will not follow from some mechanical blending of existing
methods (such an uneasy compromise was highlighted in the final example).
What is required, above all, is a clear logic of how research review is to underpin
policy prognostication. The attentive reader will have already read my thoughts
on the basic ingredients of that logic. I have mentioned how systematic review
often squeezes out attention to programme ‘mechanisms’, ‘contexts’ and
‘outcome patterns’. I have already mentioned how ‘middle-range theories’ lurk
tacitly in the selection of best buys. These concepts are the staples of realist
explanation (Pawson and Tilley, 1997), which leads me to suppose there might
be promise in a method of ‘realist synthesis’.
Notes
The author would like to thank two anonymous referees for their reminder on the dangers
of over-simplification in methodological debate. All remaining over-simplifications
are, of course, my own work.
1. This article and its companion in the next issue are the preliminary sketches for a new
book seeking to establish a realist methodology for systematic reviews.
2. For details of, and access to, EBP players such as the Cochrane Collaboration,
Campbell Collaboration, EPPI-Centre etc., see the Resources section of the ESRC UK
Centre for Evidence Based Policy and Practice website at www.evidencenetwork.org.
3. I arrive at my conclusion by very different means, but it mirrors much recent thinking
about the need for a ‘multi-method’ strategy of research review and a ‘triangulated’
approach to the evidence base. It is in the area of evidence-based healthcare that this
strategy has received most attention. There are any number of calls for this amalga-
mation of approaches (Popay et al., 1998) as well as methodological pieces (Dixon-
Woods et al., 2001) on the possible roles for qualitative evidence in the combined
approaches. Indeed the ‘first attempt to include and quality-assess process evaluations
as well as outcome evaluations in a systematic way’ was published in 1999 by Harden
et al. Alas it is beyond the scope of this article to assess these efforts, which will be
the subject of a forthcoming article. A glance at the conclusion to Harden et al.’s
review (p. 129), however, will show that these authors have not pursued the realist
agenda.
References
Campbell, D. (1974) ‘Evolutionary Epistemology’, in P. Schilpp (ed.) The Philosophy of
Karl Popper. La Salle: Open Court.
Colditz, G. A., T. F. Brewer, C. S. Berkey et al. (1995) ‘Efficacy of BCG Vaccine in the
Prevention of Tuberculosis: Meta-analysis of the Published Literature’, Journal of the
American Medical Association 271: 698–702.
Davies, P. (2000) ‘The Relevance of Systematic Review to Educational Policy and
Practice’, Oxford Review of Education 26: 365–78.
Dixon-Woods, M., R. Fitzpatrick and K. Roberts (2001) ‘Including Qualitative Research
in Systematic Reviews’, Journal of Evaluation in Clinical Practice 7(2): 125–33.
Durlak, J. and A. Wells (1997) ‘Primary Prevention Mental Health Programs for Children
and Adolescents: A Meta-Analytic Review’, American Journal of Community Psychol-
ogy 25: 115–52.
Gallo, P. S. (1978) ‘Meta-analysis – a Mixed Metaphor’, American Psychologist 33: 515–17.
Harden, A., R. Weston and A. Oakley (1999) ‘A Review of the Effectiveness and Appro-
priateness of Peer-delivered Health Promotion Interventions for Young People’, EPPI-
Centre Research Report. London: Social Science Research Unit, Institute of Education,
University of London.
Hunt, M. (1997) How Science Takes Stock. New York: Russell Sage Foundation.
Kaplan, A. (1964) The Conduct of Inquiry. New York: Chandler.
McGuire, J. B., A. Stein and W. Rosenberg (1997) ‘Evidence-based Medicine and Child
Mental Health Services’, Children and Society 11(1): 89–96.
Moore, B. (1966) Social Origins of Dictatorship and Democracy. New York: Peregrine.
Pawson, R. and N. Tilley (1997) Realistic Evaluation. London: Sage.
Popay, J., A. Rogers and G. Williams (1998) ‘Rationale and Standards for the Systematic
Review of Qualitative Literature in Health Service Research’, Qualitative Health
Research 8(3): 341–51.
Ragin, C. (1987) The Comparative Method. Berkeley, CA: University of California Press.
Robertson, J. (2001) ‘A Critique of Hypertension Trials and of their Evaluation’, Journal
of Evaluation in Clinical Practice 7(2): 149–64.
Sayer, A. (2000) Realism and Social Science. London: Sage.