
Submitted to Dahler-Larsen, Peter (Editor) A research agenda for Evaluation.

Edward Elgar Publishing. To appear in 2021.


Version 2. 1st September 2020. Preprint available at SocArxiv 10.31235/osf.io/h2fxp

Designing indicators for opening up evaluation. Insights from research assessment.
Ismael Rafols1, 2, 3 and Andy Stirling2
1 Centre for Science and Technology Studies (CWTS), Leiden University
2 Science Policy Research Unit (SPRU), University of Sussex
3 Ingenio (CSIC-UPV), Universitat Politècnica de València

Abstract

The use of indicators is generally associated with a reduction of perspectival diversity in


evaluation that often facilitates making decisions along dominant framings – effectively closing
down debate. In this chapter we will argue that while this is indeed often the case, indicators
can also be used to help support more plural evaluation and foster more productively critical
debate. In order to achieve this shift, it is necessary equally to change understandings, forms
and uses of indicators in decision making. These shifts involve, first, broadening out the range
of ‘inputs’ taken into account; and second, opening up the ‘outputs', in the sense of developing
methodologies for indicator-based analyses to help in considering plural perspectives. In
practice, this means a move towards more situated and participatory use of quantitative
evidence in evaluation, a shift from universal indicators to contextualised indicating.

1. Introduction

Over recent decades, indicators have become increasingly prominent in governance across a
variety of sectors. Through most of the 20th century, indicators were used as tools to inform,
support and justify decision-making. The advent of neoliberal governance and New Public
Management (NPM) in the 1980s brought about increased use of scalar quantification. By this,
we mean that issues of interest are not just represented quantitatively, but as single (notionally
definitive) numbers. Thus in both private and public spheres, NPM aims to help ensure greater
adoption of ‘best’ choices (i.e. more productive, more efficient, higher performance)
(Desrosières, 2015). This overall increase is associated with greater use of indicators in micro-
management of organisations and staff, underpinning more explicit incentive mechanisms for
fostering self-monitoring, self-auditing and external control towards improved ‘performance’
(Rottenburg and Merry, 2015, pp. 3-7; Dahler-Larsen, 2011). The internet and growing
computational capacity further fuelled these changes through ‘metrics’ developed from big
data analytics.

This expansion and growth of indicators has been seen as problematic. By contrast with more
‘plural and conditional’ numerical ‘mappings’ of views under diverse perspectives, scalar
quantifications encode particular understandings and interests on what counts and how it is
counted. Underpinning globally burgeoning information infrastructures, such conventional
indicators tend to privilege the perspective of those with financial and professional resources
(Rottenburg and Merry, 2015, p.4). Evaluations shaped by scalar indicators are likely to

privilege particular understandings of performance – at the expense of alternatives whose value
is not so easily captured by these indicators. Application of this kind of indicator-based
evaluation is likely to result in a narrowing of activities, with those not counting for the
indicators being suppressed (task reduction) and indicators themselves becoming a goal in their
own right (goal displacement) (De Rijcke et al., 2016). Such constitutive effects resulting from
indicator-centred evaluation are often perceived by many stakeholders as highly problematic:
for example, evaluations mainly based on bibliometrics are often perceived as marginalizing
research with potential societal contributions – thus clashing with policies fostering societal
missions or challenges.

So deeply have these instruments become entrenched in contemporary governance, that it is


difficult realistically to envisage a drastic reduction in the short term. This is: first, because
existing indicators have become ‘naturalised’, fitting with the mainstream view among most
researchers that ‘research quality’ should be assessed on ‘internal’ scientific criteria rather than
by discussing the relative value of broader contributions (Weinberg, 1962). Second, because NPM
has become so strongly embedded in many governance settings (in evaluation machineries,
infrastructure and experts’ networks) that strong dependencies have developed in the
mobilisation of quantitative evidence. Third, because evaluations based on less rigidly
quantified expert judgement are not necessarily less problematic. Experts also have particular
understandings and interests (i.e. biases), and those groups with resources or social capital tend
to be relatively over-represented in panels – as shown, for example, in gender, linguistic, racial,
geographical and class biases. Moreover, experts are likely to be influenced by informal use of
mainstream indicators, even if evaluations are formally based only on qualitative expertise
(Kelly and Burrows, 2012).

Therefore, rather than advocating avoidance of indicators, most recent reform movements
propose ‘responsible uses of quantification’: where various forms of indicators and modelling
are employed to support, but not substitute, for expert judgement (Hicks et al. 2015; Wilsdon
et al., 2015; Saltelli et al., 2020). The strategy we suggest is along the lines of statactivism
(Bruno et al., 2014): counteracting traditional and conventional scalar indicators with new
forms of quantification that illuminate the inconsistencies of narrow ‘performance indicators’
and offer more plural alternatives. Thus, our proposal subverts the view of indicators from tools
of control to tools of emancipation – thinking of indicator frameworks as Trojan horses that
can be planted in evaluation processes for opening up critical debates and perspectives
(Stirling, 2016).

This chapter explores how quantification can be developed and embedded in evaluation so that
it offers ‘plural and conditional’ perspectives, both to the evaluator and associated wider debate.
Such practice is ‘plural’ because quantification accommodates multiple perspectives in
symmetrical ways. It is ‘conditional’ because the resulting numbers are not presented as
definitive and unqualified, but as inextricably dependent on their contexts. An evaluator is still
free to assess according to whichever perspectives she (or other relevant actors) can justify
as being more appropriate. But this necessity for justification – and stimulus of wider critical
scrutiny – adds crucial additional dimensions of rigour and accountability.

In suggesting this ‘plural and conditional’ approach, we follow Andy Stirling’s wider advocacy
of practices for ‘opening up’ social appraisal (the means by which society at large comes to
apprehend alternative possible choices). This contends that both quantitative and qualitative
approaches to evaluation (and appraisal more generally) can be used either for opening up or
closing down debate (Stirling, 2008). Both functions are important, each is unavoidable and

(depending on context and perspective) either can have value, but a particular emphasis is
warranted on ‘opening up’ – under arguably any view – because powerful interests and
dynamics of justification tend to introduce such a strong bias towards ‘closing down’ (Stirling,
2012). Balancing these pressures therefore becomes a matter of rigour.

Accordingly, we argue that although indicators are currently mostly used to close down debate
and endorse assessments shaped by dominant framings, alternative usages of quantification can
also help foster higher quality public debates, make injustices more visible and enable
recognition for undervalued activities (Bruno et al., 2014; Rottenburg and Merry, 2015, p. 25;
Lehtonen et al., 2016).

In particular, we propose that more ‘plural and conditional’ indicators can help in making
visible that notions of performance are intrinsically and fundamentally conditional on the
particular perspectives or assumptions through which they are framed. It is therefore not just
the resulting numbers that are important, but also a qualitative appreciation for the values and
interests embedded within them (Pielke, 2007; Stirling, 2010).

We will focus here specifically on the context of research evaluation, although associated
arguments apply across a diversity of sectors and evaluative conditions in social appraisal. The
discussion is particularly relevant in situations in which scalar indicators have been applied to
complex issues. This narrowing of vision has happened in instances of evaluation in sectors as
disparate as research, environment, education or health: for example, scientific forestry in 18th
century Prussia (Scott, 1998), police statistics in New York (Bruno et al., 2014, p. 205), clinical
practice in New Labour Britain (O’Mahony, 2017), or public health in the Global South
(Adams, 2016).

2. The uses of indicators in research evaluation

Research assessment is an arena in which growing practice of ‘governance by indicators’ has


been especially diverse and intense (Burrows, 2012). From university rankings (van Raan,
2005), to performance-based funding systems (Hicks, 2012), or to individual level assessments
(Wouters et al., 2013), the use of ‘metrics’ has become pervasive in research evaluation.

Thus, the professional incentive structure for researchers generally relies on a dominant
framing according to which research performance may satisfactorily be characterised almost
exclusively in relation to metrics of international publications. In these terms, ‘productivity’ is
associated simply with the number of publications churned out per researcher and research
‘quality’ is associated merely with the number of citations per paper. 1 As bibliometric
indicators became progressively established as a social institution and as infrastructure, the
most popular indicators, such as Journal Impact Factor and the Hirsch (h-) index, became
‘naturalized as instantiations of quality irrespective of the methodological critiques by
professional scientometricians’ (Wouters, 2014, p. 58).

1 It is worth remembering that the first (and then very controversial) studies using bibliometric indicators in
research assessments spelled out in the abstract that analysis had to be done at the group level (rather than
individual), that citations showed impact (rather than quality), that comparisons could only be made between
‘matched’ groups, and that indicators were ‘partial’ and only reliable when multiple indicators ‘converged’
(Martin and Irvine, 1983). Notice, though, that the emphasis nonetheless lay in producing a ‘convergent’ measure
of a focal notion (like ‘scientific impact’), rather than in illuminating contrasting perspectives or measures.

By implicitly insisting that research quality can be ‘measured’ in the same way all around the
world, these universalistic notions essentially assume that all research has the same purpose.
Yet research managers and evaluators have known for decades that research ‘quality’ is
understood differently across contrasting scientific communities and depends on the contextual
goals of the research (Weinberg, 1962; Roessner, 2000). Different notions of value apply, for
instance, to research variously aiming to: solve problems around local stakeholders’ living
conditions; provide policy advice on highly politicised social issues; foster public debates in
uncertain areas of technology policy; enhance understanding of divergent priorities and
interests in fields like education; or address narrow canonical disciplinary puzzles within
academic settings (Chavarro et al., 2017; Dahler-Larsen, 2019, p. 129). Notions of ‘quality’
may have as many meanings in research as in other areas of culture (Heuts and Mol, 2013;
Dahler-Larsen, 2019, p. 4).

It is for these reasons that uncontextualised uses of S&T indicators have been widely criticised
(Feller, 2002 and 2012; Weingart, 2005). Many reform initiatives have been launched,
including the San Francisco Declaration on Research Assessment (DORA), the Leiden
Manifesto (Hicks et al., 2015) and The Metric Tide (Wilsdon et al., 2015). As a result, research
assessment is an area where issues around pluralisation in the use of quantification have already
been widely discussed (Lepori et al., 2008; Barré, 2010, 2019; Rafols, 2019), with a number of
prominent experiments being carried out (Benedictus et al., 2016; Lebel and McLean, 2018).

The particular way in which most efforts have sought to improve the robustness of
measurements has lain in broadening out the range of inputs used in evaluations. In pursuing
this, analysts have reverted to an early insight (subsequently neglected in much indicator
activity) that assessments should rely on multiple sources of data that may provide ‘converging
partial indicators’ (Martin and Irvine, 1983). The broadening of inputs is facilitated by an
avalanche of technical developments. First, the possibility of using different data sources
stemming from the multiplication of traces now left across cyberspace – including new
publication databases such as Microsoft Academic or Dimensions (Visser et al., 2020), and
databases such as Altmetrics.com on uses or mentions of publications in social media or policy
documents (Wouters et al, 2019). Second, many new tools have emerged for data visualisation
(e.g. Hans Rosling’s Gapminder, commercial Tableau, or open source R statistics
visualisations2), in particular for mapping large networks such as Gephi or VOSviewer3
(Börner, 2010).

While this ‘broadening out’ of the range of data used as ‘inputs’ in evaluation is commendable,
we suggest that a second – complementary and independent – dimension should also be
considered. This focuses not on inputs to appraisal (like research evaluation), but on the
‘outputs’ to evaluators and wider policy debates – attending to the extent to which these ‘open
up’ appreciation for contrasting conceptualisations of the phenomena under scrutiny. It is here
that more ‘plural and conditional’ communication of indicators can allow evaluators more
rigorous attention to alternative strategic considerations (Stirling et al., 2007; Leach et al.,
2010).

2 See https://www.gapminder.org, https://www.tableau.com, http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
3 See https://www.vosviewer.com and https://gephi.org

3. Opening up versus closing down in appraisal processes

Let us define appraisal as ‘the ensemble of processes through which knowledges are gathered
and produced in order to inform decision-making and wider institutional commitments’ (Leach
et al., 2010). In our case, these appraisal processes are carried out through tools, methodologies
and approaches – quantitative or qualitative, analytical or participatory – that inform, and thus
strongly shape, the outcomes of evaluations.

We can distinguish two dimensions in any appraisal process, as illustrated in Figure 1 (Stirling
et al., 2007). The first dimension, the ‘range of appraisal inputs’, refers to the scope, extent and
depth with which appraisal includes few or many different types of knowledge to describe the
phenomena under scrutiny. The second dimension, the ‘effect of appraisal outputs in decision
making’, refers to the degree to which the outputs of appraisal facilitate ‘closing down’ debate
or, on the contrary, provide plural interpretations of the phenomena and thus foster ‘opening
up’ deliberation between contrasting options.

Although typically highly diverse in their potentialities, distinct cultures of practice serve to
lead different methods to occupy distinct spaces in Figure 1. Some methods build on a smaller
or larger range of inputs. Some techniques facilitate ‘closing down’ appraisal by establishing
an absolute ranking of ‘best’ choices, while others foster ‘opening up’ by allowing evaluators
to compare and contrast how different assumptions in analysis may result in divergent rankings
of options.

Along the vertical axis, one may argue (ceteris paribus) that appraisals which ‘broaden out’
the range of inputs will tend on balance to be more comprehensive and thus more robust. If
resources allow, efforts to increase breadth are thus generally desirable. This expansion of the
range of inputs is particularly important when working with indicators, given that ‘the
increasing importance of quantitative evidence leads to a situation in which only those
operations which are counted and can be counted, count at all, and that qualitative and more
complex operations will receive less and less attention’ (Rottenburg and Merry, 2015, p. 20).

Along the horizontal axis, there is less a priori basis for normative preference. Policy processes
typically yield contrasting moments in particular settings for ‘opening up’ or ‘closing down’
debates during an evaluation. These may of course be viewed differently, with important
implications for the broad families of techniques that might legitimately be preferred in any
given context. Yet (as discussed in the introduction), these specific routine dynamics in
particular areas take place against a wider backdrop in which deeper and broader pressures for
decision justification lead to a general bias towards methods for closing down (Stirling, 2019).

It is not only the case, therefore, that deliberate attention must be given to ‘opening up’ the
issues and perspectives in question prior to policy closure within any particular setting. At least
equally important, in the interests of both rigour and accountability, is that strenuous efforts
must also be made – and institutionalised – in order to balance the bias towards closure imposed
by policy incumbents. It is on this basis that one may argue, also from a standpoint of rigour,
that where indicators have been used expediently to circumvent open scrutiny or democratic
agency, a particular premium emerges across diverse political perspectives on opening up
(Dahler-Larsen, 2019, pp. 217-218).

Despite countervailing technical potentials, the cultures around methods like cost-benefit
analysis tend to lead these to consider fewer relevant issues and to provide ranked outputs that
highlight the preferable choices, thus facilitating the closing down of discussion across options.
Thus cost-benefit analyses (upper-left of Figure 1) are often used to justify infrastructural
decisions such as dams, by making some issues such as economic costs and benefits visible,
while neglecting aspects not easily amenable to quantification, such as effects on gender or
cultural identities (Leach et al., 2010).

Methods such as open hearings or unstructured interviews (upper-right of Figure 1) may rely
on small samples of views (thus are narrow), but they may have an opening up effect if they
introduce a diversity of perspectives. At the other end, consensus conferences (lower-left of
Figure 1) may provide a variety of disparate views on an issue, but, by definition, the focus on
‘consensus’ means that the output is likely to facilitate making a decision rather than further
debate. However, the position of methods in the space of ‘range of inputs vs. effect on outputs’
depends on the specific use that is made of them.

One particular way of opening up decision-making is to question the object of appraisal – i.e.
what is to be evaluated. For example, should the evaluation consider research quality according
to the immediate outputs (with indicators such as publications), intermediate outcomes (e.g.
with indicators related to use by stakeholders) or the societal impacts (with indicators such as
estimated contribution to health/wellbeing)? In the Research Quality Plus methodology, efforts
were made to make use of multiple understandings of the object of appraisal in order to judge
the quality of development research (Lebel and McLean, 2018).4 Rather than assuming the
mainstream indicator, it would be worth having an explicit discussion on this choice or keeping a
multidimensional description. Another issue concerns the units of analysis. For example,
quantitative clustering may question existing classifications: clustering of researchers might
show clusters very different from the institutional groups suggested following bureaucratic
guidelines.
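As a rough illustration of this second point, the following sketch (in Python, with invented researchers, departments and topic shares – all hypothetical) clusters researchers by the similarity of their topic profiles and cross-tabulates the emergent clusters against their formal departmental affiliations, which may or may not coincide.

```python
# Minimal sketch (hypothetical data): clustering researchers by their topic
# profiles and comparing the result with their formal departmental grouping.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy feature matrix: share of each of 12 researchers' output across 4 topics.
departments = ["Dept A"] * 6 + ["Dept B"] * 6            # bureaucratic grouping
topic_profiles = rng.dirichlet(alpha=[1, 1, 1, 1], size=12)

# Data-driven grouping: cluster researchers by similarity of topic profiles.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topic_profiles)

# The cross-tabulation shows whether emergent clusters match institutional units.
print(pd.crosstab(pd.Series(departments, name="department"),
                  pd.Series(clusters, name="cluster")))
```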

Further, one should consider the very different ways of using the same methods. Figure 2
provides the example of ‘decision analysis’ to illustrate how different designs and
implementations of a method can alter the breadth and openness of the method and thus change
its position in the scheme shown in Figure 1. For example, decision analysis may focus on
human safety as the only criterion to be considered, on the basis of scores provided by experts
without uncertainty ranges. In this case, the options can be clearly ranked and the method can
be located in the upper-left side of the graph (Stirling, 2015, p. 26). Yet, a decision analysis
process can also be implemented taking a wider range of impacts into consideration (human
safety, environmental impact, cultural impact on populations affected) and including
uncertainty ranges, given the difficulties of estimating these impacts. As a result, as shown in
the middle of Figure 2, the options are now not clearly ranked. There is thus, first, a broadening
out of inputs, as impacts beyond human safety have been considered, and, second, an opening
up of outputs because, by making uncertainties and ambiguities explicit, there is no clear
preferred option and policy debates become more relevant.
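To make this contrast concrete, here is a minimal sketch (in Python, with invented option names, criteria, scores and weights): a single-criterion decision analysis yields an unambiguous ranking, whereas adding further criteria and uncertainty ranges produces overlapping intervals and no clear ‘best’ option.

```python
# Minimal sketch with invented scores: a narrow decision analysis yields a
# clear ranking, while adding criteria and uncertainty ranges blurs it.

# Narrow framing: one criterion (human safety), point scores from experts.
safety = {"Option A": 8.2, "Option B": 7.5, "Option C": 6.9}
print(sorted(safety, key=safety.get, reverse=True))   # unambiguous 'best' option

# Broader framing: several criteria, each scored as a (low, high) interval.
scores = {
    "Option A": {"safety": (7, 9), "environment": (3, 6), "culture": (2, 7)},
    "Option B": {"safety": (6, 8), "environment": (5, 8), "culture": (4, 8)},
    "Option C": {"safety": (5, 8), "environment": (6, 9), "culture": (3, 9)},
}

# Equal weights for illustration; the weighting itself is a framing choice.
for option, criteria in scores.items():
    low = sum(lo for lo, hi in criteria.values()) / len(criteria)
    high = sum(hi for lo, hi in criteria.values()) / len(criteria)
    print(f"{option}: overall score between {low:.1f} and {high:.1f}")
# The intervals overlap, so no option is unambiguously 'best': the ranking
# now depends on assumptions that deserve explicit debate.
```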

Finally, the lower-right of Figure 2 gives the example of Multicriteria Mapping (MCM)
(Coburn and Stirling, 2016).5 This is a sophisticated hybrid quantitative-qualitative version of
decision analysis which, instead of aggregating participants’ views on the pros and cons of a
range of options, keeps each perspective separate. At every stage, MCM prioritises the agency
of participants themselves to
frame issues and define the scope of appraisal in whichever ways they judge to be appropriate
4 We thank one reviewer for suggesting this example. See also: https://www.idrc.ca/en/research-in-action/research-quality-plus.
5 See a dedicated website at https://www.multicriteriamapping.com

– thus broadening out appraisal to flexibly accommodate a full range of salient ‘inputs’.
Crucially, however, MCM also prioritises various means to visualise each perspective
separately, and so explore specific reasons for differences. For instance, a comparison between
charts shows the divergent perspectives on impacts and their uncertainties – thus highlighting
how the different values held by experts yield different assessments. This helps enable the ‘opening up’ of
the outputs of appraisal, equally for decision makers and to wider policy debates.

Whether facilitated by MCM or some other method of this kind, it is this kind of approach that
is required in order to realise the quality of ‘plural and conditional’ appraisal discussed in the
introduction. The results obtained are explicitly ‘plural’ both because: first, each perspective is
encouraged to highlight its own uncertainties concerning option orderings (rather than
aggregating a single preference); and second, because such contrasting orderings of options are
clearly associated with divergent real-world perspectives, each meaningful in different ways to
the policy debate in question. And these results are rigorously ‘conditional’, because each
ordering is clearly associated with the subjective conditions which give it meaning, with rich
qualitative information in this regard available to deepen and qualify the quantitative picture.
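The following sketch (Python, with invented perspectives and scores; emphatically not the MCM software itself) illustrates the underlying logic of keeping appraisals disaggregated: each perspective retains its own score intervals and its own ordering of options, and it is the pattern of divergence, rather than a single ‘winner’, that is reported.

```python
# Minimal illustration (invented numbers), not the MCM tool itself: each
# perspective keeps its own (low, high) scores per option instead of being
# averaged into a single aggregate ranking.
perspectives = {
    "industry expert":   {"Option A": (6, 8), "Option B": (4, 6), "Option C": (3, 5)},
    "local resident":    {"Option A": (2, 5), "Option B": (5, 8), "Option C": (4, 7)},
    "environmental NGO": {"Option A": (1, 4), "Option B": (3, 6), "Option C": (6, 9)},
}

for who, appraisal in perspectives.items():
    # Order options by the midpoint of each interval, per perspective.
    ordering = sorted(appraisal, key=lambda o: sum(appraisal[o]) / 2, reverse=True)
    print(f"{who:18s} prefers: {' > '.join(ordering)}")
# Each perspective yields a different ordering; the divergence itself is the
# output offered to evaluators and to wider debate.
```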

For the purpose of our present discussion on the use of indicators in evaluation, it is important
to observe that, not only can similar methods occupy different positions depending on how they
are implemented, but also that expert-analytic and participatory-deliberative methods are quite
evenly distributed over Figure 1. This reflects longstanding appreciation that both analytic
(often quantitative) and participatory (often qualitative) methods can – equally and in different
ways – each be used alternatively to close down or to open up the policy processes that they
inform.

This observation offers an important corollary to longstanding historical evidence for the ways
in which analytical methods in particular (which tend to be quantitative) are often used to shut
down debate and justify decisions (Porter, 1995; Rottenburg and Merry, 2015). The clarity and
prominence of this evidence can sometimes lead to the assumption that more participatory or
qualitative approaches are somehow intrinsically more suited to opening up, whilst analytical
and quantitative approaches are inexorably all about closing down. To be fair to quantitative
analysis, however, it should be pointed out that this is not necessarily the case (Stirling 2008).

Indeed, it may be that it is more the epistemic authority of a quantitative-analytic idiom in
contemporary policy cultures, that makes these techniques preferable for interests wishing to
justify closure. In cases where more qualitative and deliberative methods are used in policy
making, pressures for closure are typically barely less evident – as reflected in emphasis on
particular interpretations of analysts, or on ostensibly prescriptive ‘verdicts’ and ‘consensus’
in much participatory practice. Of course, these qualitative-deliberative methods can be used
to illuminate contrasting interpretations or perspectives. But so too can analytic-quantitative
approaches be used to map out the plural implications of diverse assumptions or framings.
Experiences such as those narrated by the Statactivism movement (Bruno et al., 2014) show
how alternative forms of quantifications can challenge incumbent perspectives and open debate.

[Figure 1 appears here: a two-dimensional scheme with the horizontal axis ‘effect of appraisal outputs on decision-making’ (from closing-down to opening-up) and the vertical axis ‘range of appraisal inputs (issues, perspectives, scenarios, methods)’ (from narrow to broad). Methods positioned in this space include cost-benefit analysis, risk assessment, structured interviews, open hearings, sensitivity analysis, q-method, citizens’ juries, consensus conferences, decision analysis, scenario workshops, narrative-based participant observation and multicriteria mapping.]

Figure 1. Schematic representation of the breadth of inputs in appraisal and the effect of outputs on
decision-making. Conventional uses of methods tend to fall in certain areas.
Source: Stirling et al. (2007, p. 57).

Figure 2. Relative position of decision analysis in terms of breadth of inputs vs. effect of outputs, under
different uses of this method. This illustrates the conditional position of methods in the graph, although
conventional uses of methods tend to fall in certain areas as shown in Figure 1. Source: Stirling et al.
(2015, p. 26).

In summary, judgements over whether methods do, or ought to, open up or close down depend
strongly on the design of appraisal, its context and the perspective under which these
are viewed. As relevant to our present focus on quantitative evaluation as to other areas of
appraisal, Stirling and colleagues proposed the notion of ‘empowering designs’ for methods
that aim at eliciting and foregrounding perspectives that are otherwise relatively marginalized.
They contend that ‘inclusion’ should go beyond the participation of excluded groups and
extend to a symmetrical analytical treatment of alternative perspectives thus facilitating
processes of negotiation between actors on the values and the politics of appraisal (Leach et al.,
2010).

In the following section, we will present some examples to illustrate how quantitative
approaches can be used for opening up in research assessment.

4. ‘Broadening out’ and ‘opening up’ research evaluation with S&T indicators

Where are indicators in research evaluation ‘positioned’ in terms of the schematic


representation of breadth of inputs vs. openness of outputs introduced in Figure 1? We contend
that they often lie in the upper-left corner, perhaps slowly moving toward the centre-left, as
illustrated in Figure 3. Conventional indicators in research evaluation are generally based on
few inputs (mainly publications) and they are generally used as information to facilitate
expediency in decisions, i.e. to close down notions of performance and associated debates.

However, following the discussion in the previous section, we will argue that S&T indicators
can play a role in fostering pluralism rather than closing down perspectives. Three types of
shifts can support more emancipatory use of S&T indicators:

• Inclusion of more analytical dimensions (broadening out) while avoiding the use of
aggregative techniques such as (simplistic) composite indicators (Figure 3 top)
• Development of contrasting indicators (opening up) for analysing the same issue, thus
facilitating reflection on appropriate framing and analytical choices (Figure 3 bottom)
• Shift to participatory dynamics (from indicators to indicating) so that quantification is
contextualised in the goals, locations and values of the specific evaluation.

[Figure 3 appears here: two panels using the same axes as Figure 1 (‘effect of appraisal outputs on decision-making’, from closing-down to opening-up, against ‘range of appraisal inputs’, from narrow to broad). Top panel: a shift from conventional S&T indicators to the inclusion of further analytical dimensions (engagement with stakeholders, numbers of PhD students, grant income, mentions in policy documents, news, blogs, social media, etc.), contrasting composite S&T indicators with multiple dimensions of S&T indicators. Bottom panel: a shift from conventional S&T indicators to contrasting S&T indicators – exploring divergent conceptualisations and creating heuristics to facilitate exploration; not about the uniquely best method but about sharing different perspectives.]

Figure 3. Illustration of types of shifts towards more plural use of S&T indicators. Top: Inclusion of
more indicators, covering different analytical dimensions, leads to ‘broadening out’ of evaluation. This
leads to ‘opening up’ when these different dimensions are shown explicitly. However, there is no
significant ‘opening up’ when this is followed by aggregative techniques such as composite indicators.
Bottom: Another route to ‘opening up’ is to create contrasting indicators of the same analytical dimension
under consideration, e.g. contrasting notions of bibliometric performance, whose convergence or
divergence of insights can then be discussed.

4.1 Broadening out by including more analytical dimensions in indicators

As discussed in section 2, the problematic use of indicators in research evaluation has led to a
backlash in the use of the more simplistic indicators such as the Journal Impact Factor. This reaction
against conventional indicators prompted the search for indicators that would capture the blind
spots of the scientometric measures, such as indicators of the social contributions (or impact)
of research (Molas-Gallart et al., 2002) or indicators of Open Science (Wouters, et al, 2019).
A parallel boom in the use of electronic platforms has led to a large expansion of data available
for assessing research activities, in particular a proliferation of indicators capturing non-
conventional aspects of researcher performance (Pontille and Torny, 2013).

However, the availability of data for broadening out does not necessarily translate into a
pluralisation of research evaluation – for example when the closing down of conventional
bibliometric indicators is substituted by the closing down of new Altmetric or Open Science
indicators that follow the same integrative productivist logic (Robinson-Garcia et al., 2018).

University ranking providers exemplify how analyses considering various dimensions do not
necessarily lead to more pluralistic understandings. Let us leave aside for a moment that the
data and the methodologies behind these rankings are, to put it mildly, rather problematic (Van
Raan, 2005). Most rankings are based on very distinct analytical dimensions, such as the quality
of education and of research, international outlook or industry income. Yet in the end, much
of the benefit for improved understanding that might arise from this broadening out of
consideration across more different dimensions, is then lost when all these dimensions are
folded into a single composite index. With contrasting equally-reasonable protocols for
aggregation typically yielding radically different index orderings, particular chosen parameter
structures will at best be arbitrary and at worst, vulnerable to gaming or capture.

This closing down in spite of richer information also occurs with more rigorous analysis like
the European Innovation Scoreboard. In 2017, it consisted of 10 analytical dimensions based
on 27 indicators (between 2 and 3 indicators per dimension) (Hollanders and Es-Sadki, 2017).
These 27 indicators were summarised in a single scalar score, effectively ‘closing down’
debates on performance by univocally emphasising a particular country as ‘most innovative’.
Such composite indices have been shown to be potentially misleading as ‘the scope for
manipulation of scoreboards by selection, weighing and aggregation is great’ (Grupp and
Mogee, 2004, p. 1382; Grupp and Schubert, 2010). Yet, as shown in Figure 4, simply
displaying the analysis in radar charts, rather than in one dimension, allows appreciation for
the ways in which ostensibly similar aggregate scores may obscure very different profiles
(compare, for example, Denmark vs. Germany, or Italy vs. Hungary).
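A minimal sketch (Python, with invented dimension scores for two hypothetical countries) of the sensitivity that Grupp and Mogee point to: two equally defensible weighting schemes for the same composite index can change which country appears ‘most innovative’.

```python
# Minimal sketch with invented scores: two equally reasonable weighting
# schemes for a composite index can reverse which country comes out on top.
# Dimension order: human resources, research systems, firm investments, sales impacts.
scores = {
    "Country X": [0.80, 0.70, 0.75, 0.30],
    "Country Y": [0.50, 0.55, 0.55, 0.90],
}
weightings = {
    "equal weights":           [0.25, 0.25, 0.25, 0.25],
    "output-oriented weights": [0.10, 0.20, 0.30, 0.40],
}

for name, w in weightings.items():
    composite = {c: round(sum(s * wi for s, wi in zip(vals, w)), 3)
                 for c, vals in scores.items()}
    winner = max(composite, key=composite.get)
    print(f"{name}: {composite} -> 'most innovative': {winner}")
# Equal weights favour Country X; output-oriented weights favour Country Y.
```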

For our focus on research assessment, the development of Altmetrics indicators is paradigmatic
in casting light on the challenges of broadening out, given the political economy of research
assessment. The initial proponents of Altmetrics were genuinely eager to pluralise research
assessment with new ‘metrics’ that could report activities invisible to conventional indicators,
such as blogging and data or code sharing (Priem et al. 2010, Priem, 2014). Indeed, in the last decade
there has been a blossoming of scientific traces in cyberspace: repositories of data, preprints
and postprints, code, databases analysing mentions of academic work in social media, etcetera.
One might thus have expected that the analysis of these traces would lead to consolidation of
new indicators of social attention.

[Figure 4 appears here: a bar chart of aggregate ‘Innovation Performance’ scores by country (EU average and member states) from the European Innovation Scoreboard 2017, followed by radar charts for Denmark, Germany, Italy and Hungary across the dimensions human resources, research systems, innovation-friendly environment, finance and support, firm investments, innovators, linkages, intellectual assets, employment impacts and sales impacts.]

Figure 4. Visualising multiple dimensions in radar charts. The European Innovation Scoreboard is a
composite index of innovation ‘performance’ that aggregates multiple analytical dimensions. However,
its aggregate nature does not allow one to see the different strengths by country. A simple radar chart makes
explicit the contrasting profiles even for countries with a similar aggregate performance, as shown by
comparing Denmark against Germany, or Italy against Hungary. Source: Rafols (2019) based on Grupp
and Schubert (2010). Data source: European Innovation Scoreboard (2017).

As a matter of fact, Altmetric.com6 has been successfully marketing data on the attention
generated by a publication in social media, as well as an indicator, the Altmetric Attention
Score, which is a composite index giving different weights to mentions in news, blogs, policy
documents, patents, Twitter, Facebook and Youtube (as shown in Figure 5). These metrics are
provided by Altmetric.com without standards for comparison; they have very irregular
coverage and the meaning of their aggregate score is very unclear. Therefore, they have not so
far been shown to be meaningful or reliable quantitative indicators for evaluation purposes
(Wilsdon et al., EC, 2016; Robinson-García et al., 2017; Sugimoto et al., 2017).

This said, it should be acknowledged that the information provided by Altmetric.com is rich
and can be very useful in tentatively exploring (by clicking tabs and digging into details),
whether and why a publication has generated interest. Not only does this help illuminate the
kinds of attention received (whether news, policy, blogs), it also allows users to search specific
instances. Thus, it is arguably mostly in non-aggregated forms (as indicators within different
dimensions like news, blogs, policy documents, patents, etc.) that Altmetric data can best be
used to pluralise understandings on the part of research policy audiences (i.e. in the diagonal
shift to multiple dimensions in Figure 3).
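As a simple illustration of the difference between the two presentations, the sketch below (Python, with made-up mention counts and purely illustrative weights that are not Altmetric.com’s actual ones) contrasts a single weighted ‘attention score’ with the disaggregated counts that keep the profile of attention visible.

```python
# Minimal sketch with illustrative (not Altmetric's actual) weights: folding
# heterogeneous mention counts into one weighted score hides the profile of
# attention that the disaggregated counts make visible.
mentions = {"news": 3, "blogs": 1, "policy documents": 2, "twitter": 40}
weights  = {"news": 8, "blogs": 5, "policy documents": 3, "twitter": 0.25}  # assumed values

weighted_score = sum(mentions[k] * weights[k] for k in mentions)
print(f"Single composite 'attention score': {weighted_score:.0f}")

# Disaggregated view: what kind of attention, and from whom, stays explicit.
for source, count in mentions.items():
    print(f"  {source:16s} {count}")
```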

Figure 5. Left: Example of the Altmetric Attention Score of an article by Altmetric.com. The score is in
the centre of the doughnut. Below the doughnut, you can see the actual counts in different dimensions
such as news, blogs, etc. By clicking on the tabs one can look up the actual news, blogs, or tweets in
which the article was mentioned. Right: The weights used for each dimension of the Altmetric Attention
Score. See details in https://dimensions.altmetric.com/details/3931894

In this sense, we view the Altmetric Attention Score as an example of one of the main challenges
in broadening out. It is precisely when a wider range of inputs is included that pressures from
managerial interests to perform a ‘hard’ composite index can lead evaluative practices unduly
to ignore problems of incommensurability in the component indicators and sensitivity of results
to aggregation protocols. Thus, in spite of undoubted good will (Priem et al., 2010), we view
Altmetrics as an interesting development for exploratory analysis (Costas et al., 2017; Noyons,

6 Altmetric.com is owned by Digital Science, a company of the Holtzbrinck group, which is also the owner of Springer-Nature.

2019), but with very questionable impact so far in research assessment as a result of a
decontextualised implementation, in the same accountability (and ‘bean counting’) tradition of
conventional bibliometrics (Barré, 2019; Rafols, 2019).

4.2 Opening up by considering contrasting indicators of the same property

Let us now turn from practices of broadening out S&T indicator inputs to the challenges of
opening up S&T evaluation (as shown in the bottom of Figure 3), without necessarily adding very large
arrays of data sources. How can it be possible to foster more plural analyses, even when
attention is dominated by particular sources – such as a specific bibliometric database? How
can quantitative studies capture and convey diverse perspectives on a given issue, even by
reference to the same body of data?

It is crucial to the distinction made here between the (related but often effectively quite
independent) qualities of ‘broadening out’ and ‘opening up’ to appreciate that ‘opening up’
can be undertaken without necessarily ‘broadening out’. All that is required is an openness to
exploring contrasting operationalisations of some single property of interest. In other words,
even with narrow inputs, tools can be developed that help evaluators scrutinize how different
conceptualisations and associated mathematical operationalisations may yield contrasting
results with the same data. By investigating how different assumptions lead to different
methods and rankings (even using only a single indicator and dataset), the analyst can provide
‘plural and conditional’ advice.

By helping to cultivate a policy culture that is more generally reflective over the importance of
uncertainty and variability, and more reflexive over the normative – ethical and political –
aspects of apparently technical analytical choices, a particular exercise in opening up may even
help to nurture a greater general attentiveness and responsiveness even to parameters that were
not included in its own analysis. Both the practice of evaluation and associated policy debates
may thereby be made more rigorous and accountable.

Let us take the core notion of ‘research quality’ (which, it may be recalled, Martin and Irvine
established in 1983 that bibliometrics could not address!). In conventional bibliometric analysis,
research quality is interpreted as referring only to the academic perceptions of value. This is
then operationalised using publication data, but in diverse ways: in terms of journal rankings
(a disciplinary list), in terms of Journal Impact Factor (JIF), in terms of citations, which in turn
can be normalised (made commensurable) according to the field of the article receiving or giving
the citation (cited-side normalised or citing-side normalised). These yield radically different
understandings of ‘quality’. A journal ranking produced by an academy (e.g. the UK
Association of Business Schools) counts publishing in the most prestigious journals of a
discipline as quality. The JIF assumes that quality is related to the citation impact of the journal.
Cited-side normalisation considers that citations rather than journals define quality and that all
citations are equal. Citing-side normalisation considers that attracting citations from fields that
cite little is more valuable.
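The contrast between these operationalisations can be made concrete with a toy calculation (Python, with invented field averages and citation counts; these are deliberately simplified versions of the two normalisation logics, not the exact formulas used in any production database): cited-side normalisation divides a paper’s citations by the average citations of its own field, while citing-side normalisation weights each incoming citation by the citing field’s propensity to cite.

```python
# Toy, deliberately simplified illustration of two normalisation logics
# (invented numbers; not the exact formulas used in evaluative bibliometrics).
field_mean_citations = {"biomedicine": 20.0, "humanities": 2.0}     # assumed field averages
field_citing_propensity = {"biomedicine": 30.0, "humanities": 5.0}  # assumed avg. references per paper

papers = [
    {"id": "P1", "field": "biomedicine",
     "citations_from": ["biomedicine"] * 15 + ["humanities"] * 3},
    {"id": "P2", "field": "humanities",
     "citations_from": ["biomedicine"] * 3 + ["humanities"] * 1},
]

for p in papers:
    raw = len(p["citations_from"])
    # Cited-side: citations relative to the average of the cited paper's field.
    cited_side = raw / field_mean_citations[p["field"]]
    # Citing-side: each citation weighted inversely to the citing field's propensity to cite.
    citing_side = sum(1.0 / field_citing_propensity[f] for f in p["citations_from"])
    print(f"{p['id']}: raw={raw}, cited-side={cited_side:.2f}, citing-side={citing_side:.2f}")
# P1: raw=18, cited-side=0.90, citing-side=1.10
# P2: raw=4,  cited-side=2.00, citing-side=0.30
# Which paper 'performs' better depends on the operationalisation chosen.
```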

While these different conceptualisations and corresponding operationalisations can be easily


understood as diverse, most people will be surprised when shown, as in Figure 6, that these
different choices lead to strikingly different results. It is generally assumed that these choices
may change the details, but the relative order of performance will remain stable. In this case,
we compared the bibliometric performance of three interdisciplinary units of Innovation
Studies with three Schools of Business and Management (Rafols, Leydesdorff, et al., 2012).

Given that some of the units under analysis were highly interdisciplinary centres, the results
were greatly affected by specific operationalisations. This lack of congruence has been
sometimes studied regarding technical issues such as field-normalisation techniques (Zitt et al.,
2005; Adams et al., 2008) or dimensions such as language (Leeuwen et al., 2001), but it is
seldom debated in evaluative applications – perhaps with the exception of the biases generated
by disciplinary (Hicks, 1999) and geographical coverage of databases (Vessuri et al., 2014;
Chavarro, 2017).

For concepts such as interdisciplinarity, which enjoy a conspicuous lack of consensus, exploring
contrasting indicators is even more important (Wang and Schneider, 2020). When results are
convergent, agreement provides robust evidence of insights (Rafols, Leydesdorff et al., 2012).
When results are divergent, interpretation is challenging and might be disconcerting against
some expectations (Digital Science, 2016). However, it should not be assumed that divergence
means that the contrasting indicators are necessarily invalid. Rather, it may be that different
interpretations of interdisciplinarity provide different insights. In this case, the actors concerned
with the evaluation need to engage with their particular understandings of ‘interdisciplinarity’
so as to choose the specific processes of operationalisation that they find relevant for their case
and context. In opening up the operationalisation of the indicators as a plural and conditional
process, we achieve the key step of moving from indicators to indicating (Marres and De Rijcke,
2020).

These bibliometric examples on operationalisations of ‘research quality’ illustrate that our


proposal for ‘opening up’ should not be seen as an impractical call for ever more inputs and
ever more outputs. One should not interpret ‘opening up’ as simply giving more indicators
– it is about adding the minimum number of indicators that will force decision-makers to
consider the relevant evaluative options rather than thoughtlessly grab the easiest naturalised
indicator (let’s say the Journal Impact Factor). Far from requiring postponement of decisions,
opening up can help avoid cost and delay caused by protracted controversies provoked by
unreasonable attempts to assert single indicators that fail to reflect, from the outset, a requisite
range of salient issues.

Where there is only one indicator (and associated implicit framing) of a property that is widely
recognised as contentious, it will often be enough to add just a second indicator that provides
a contrasting perspective in order to prompt reflection on which of the two framings and
associated indicators are more appropriate in a given evaluative context. Let us be clear that
we believe that parsimony is very important in policy indicators in order to allow transparency.
But apparent simplicity should not be achieved through suppression or conflation of relevant
evaluation dimensions.

[Figure 6 appears here: four bar charts comparing the units ISSTI, SPRU, MIoIR, Imperial, WBS and LBS on four measures: ABS Journal Ranking, Journal Impact Factor, cited-side normalised citations per publication and citing-side normalised citations per publication.]
Figure 6. Contrasting results using different measures of research performance of university units of
Science and Innovation Studies (left, in grey) and Business and Management schools (right, in black).
Source: Rafols, Leydesdorff et al. (2012).

5. From indicators to indicating: engagement for plural and conditional advice

To counter the use of indicators as rigid tools that capture only narrow understandings of the
issues evaluated and then marginalize certain options, we have proposed to build on Stirling’s
framework of ‘empowering designs’ (2007): this is to develop and apply quantification in ways
that i) broaden out the scope of knowledge gathered, and ii) have a pluralising effect (i.e. open up)
in the evaluative process. Broadening out involves considering more analytical dimensions –
opening up consists in actively fostering more critical debate, rather than closing it down. Each
can occur quite independently, with a useful degree of opening up being possible even without
a corresponding broadening out. We have also argued, on the other hand, that broadening out
without opening up (e.g. in university rankings or in Altmetrics) does not result in a significant
pluralisation of evaluation.

By taking more analytical dimensions into account, and/or by exploring contrasting


perspectives on these dimensions, we are effectively expanding the potential insights gained
from indicators (or indicating) in the evaluation. How can the evaluator come to make decisions
under these more plural circumstances? How can evaluation proceed without a clear set of
indicators?

Pielke (2007) argued that under conditions of uncertainty and lack of consensus, it is not
possible in scientific advice to separate knowledge formation (in our case: the construction of

indicators), from decision making (in our case: evaluation). Experts on the world of indicators,
Rottenburg and Merry (2015, p. 30) reached a similar insight: ‘…it is impossible to separate
the concrete processes of measuring [the construction of the indicators] from the actual use of
the indicators [in decision-making]…’

This means that the choice of ostensibly objective indicators for a given evaluation is inevitably
related to underlying intrinsically subjective ‘valuations’. This is an intuition we are all familiar
with when dealing with mundane objects such as tomatoes (Heuts and Mol, 2013) – different criteria
of ‘quality’ are applied depending on the expected usage and on preferences over taste, texture,
colour, etc. In other words, the choice of indicators is conditional on values, which is
why expert analysts should offer ‘plural and conditional’ advice – with multiple indicators and
valuations, each explicitly conditional on assumptions appropriate under relevant values and
contexts. Whilst the apparent parsimony of aggregate indices may superficially look like a
virtue in scientific advice, this may conceal an intractable and volatile complexity of hidden
contingencies (Grupp and Mogee, 2004). In this sense, a simple general heuristic of ‘opening
up’ may offer a more rigorous and robust form of parsimony (Stirling, 2010).

In a previous publication, one of us argued that from the adoption of this plural and conditional
framing, it follows that indicators have to be constructed with the participation of stakeholders
in the middle of the social world, perhaps during evaluations – what we called ‘indicators in
the wild’ (Rafols, 2019) after Callon’s ‘research in the wild’ (Callon et al., 2001). Waltman
and Van Eck (2016) proposed ‘contextualised scientometrics’ as a form of scientometrics that
would allow ‘users’ to shape the quantitative analysis with their contextual knowledge. Marres
and De Rijcke (2020) have pointed out that this shift from off-the-shelf, universal indicators to
tailored contextual indicators means that we move from a product (indicator) to a process
(indicating).

“We describe this approach as indicating to highlight something that each of the four terms
above [participatory, abductive, interactive and designed] have in common: they frame the
development and use of indicators as a process. This is key insofar as it enables us to
understand the assembly of communities of interpretation as an on-going process, one that
spreads out across the design and deployment of indicators.”

It is during this process of ‘indicating’ that closing down takes place. Thus, making efforts to
open up indicators does not mean that decisions themselves remain open. Decisions can still
be taken as needed. What is different is simply that the process of decision-making is enriched
by more reflective and explicit consideration of the rationales behind possible choices, without
expedient closure of indicators allowing the obscuring of decision accountability. Collective
deliberation might, in some cases, facilitate the construction of shared perspectives among
stakeholders. But decisions will nonetheless still likely need to take place in the face of
incommensurable perspectives and persistent stakeholder contention, each made explicit in
contrasting preferences for indicators. In this light, opening up does not impede decision
making, but merely enables decision makers to be more clear and transparent in explaining
their choices. Even though some individual decision-makers may sometimes prefer hiding
behind indicators that claim to offer the ‘best’ technocratic advice, wider public interests might
hold that this kind of rigour and accountability should be routinely expected in mature
democratic governance.

Of course, it does sometimes occur that decisions need to be made under conditions of
uncertainty. For example, bibliometric indicators of individual researchers, while informative,

are not reliable enough to fully assess fellowship candidates, and reviewers do not agree on the ranking
of candidates. Under these conditions it is advisable for assessment to recognise uncertainty
and proceed with methods that embrace it (such as partially randomized selection7) rather than
using indicators, which is likely to lead to systematic biases (e.g. favouring men, basic and
fashionable topics) and indicator gaming (De Rijcke et al., 2016).
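A minimal sketch (Python, with hypothetical candidates, scores and cutoffs) of the logic of a partially randomized procedure: candidates clearly above or below the quality thresholds are decided on their scores, while a lottery allocates the remaining grants within the middle band where scores cannot reliably discriminate.

```python
# Minimal sketch (hypothetical scores) of a partially randomized selection:
# clearly strong proposals are funded, clearly weak ones rejected, and a
# lottery decides among the middle band where scores cannot reliably
# distinguish candidates.
import random

scores = {"A": 9.1, "B": 8.8, "C": 7.4, "D": 7.2, "E": 7.1, "F": 5.0}
n_grants, fund_cutoff, reject_cutoff = 4, 8.5, 6.0

funded   = [c for c, s in scores.items() if s >= fund_cutoff]
middle   = [c for c, s in scores.items() if reject_cutoff <= s < fund_cutoff]
rejected = [c for c, s in scores.items() if s < reject_cutoff]

random.seed(42)                                          # reproducible illustration
funded += random.sample(middle, n_grants - len(funded))  # lottery in the middle band
print("Funded:", sorted(funded), "| Lottery pool was:", middle, "| Rejected:", rejected)
```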

There are experiences of this shift towards participation of decision-makers and stakeholders
in the design and use of indicators. For example, the new evaluation framework developed by
the Utrecht Medical Centre was created after a process that involved public debates (Benedictus
et al., 2016). De Rijcke et al. (2019) advocate an approach to evaluation (‘evaluative
inquiry’) that espouses adapting the methods to the particular needs of an evaluation. These
efforts towards participatory quantification require the design of new methodologies that act
as interfaces between different actors (Marres and Gerlitz, 2016). Marres and De Rijcke (2020)
emphasize the value of building on expertise in participatory methods, user studies and design
research in order to develop these methodologies.

International development is one of the policy areas that pioneered this participatory turn
(Chambers, 1995), and where examples of practices of ‘participatory statistics’ might be sought
(see edited book by Holland, 2013, showing a variety of experiences). For example, the method
Participatory Impact Pathway Analysis (PIPA) involves a variety of stakeholders in deciding,
from the outset, what are the indicators that will be used during the monitoring and evaluation
of an intervention (Douthwaite et al., 2007).

6. Conclusions

The use of indicators in evaluation (as well as in other social spheres) has become both
pervasive and problematic. Conventional indicators facilitate closing down debate in
evaluative processes by valuing an activity according to dominant analytical perspectives, for
example publication productivity and citation impact in research evaluation. Indicators thus
play performative roles, incentivising and ‘guiding’ both evaluands and evaluators towards
particular understandings of ‘good’ performance that tend to align with power.

In this chapter we have argued that while it is indeed the case that conventional quantification
using scalar indicators has this blinkering effect, indicators can also be used to help support
more plural evaluation and foster more productively critical debate. To achieve this shift
towards indicators that foster perspectival diversity, we urge greater attention to two
dimensions of design in the process of indicating. The first dimension, ‘broadening out’,
concerns the range of ‘inputs’ taken into account in evaluation. The second, ‘opening up’
relates to the ‘outputs' of quantifications, encouraging methodologies that enable attention to
plural perspectives.

We have illustrated that even analytical tools as narrow as scientometric indicators leave room
for evaluative usage that is more explicit about the dependence of analytic outputs on normative
assumptions. We have shown that this ‘opening up’ is distinct from (and complementary to) the
‘broadening out’ of the range of data inputs. We suggest that this move towards more situated
and participatory use of quantitative evidence in evaluation implies a shift away from
notionally universal indicators (as products) to more contextualised indicating (as process).

7 See initiatives by the Volkswagen Foundation at https://www.volkswagenstiftung.de/en/funding/our-funding-portfolio-at-a-glance/experiment/partially-randomized-procedure

If conventional scalar indicators hold the ‘capacity to produce constitutive effects in such a
way that conventional forms of democratic control are circumvented’ (Dahler-Larsen, 2019, p.
218), the designs of quantification proposed here aim to illuminate instead a more democratic
diversity of perspectives. We hope that these empowering designs can be creatively woven
into new policy contexts that allow quantification to challenge scalar instrumentalism and
instead help foster democratic pluralism and accountability in evaluation.

Acknowledgements
This chapter builds on previous work and discussions with colleagues in the Science Policy
Research Unit (SPRU) at the University of Sussex, Centre for Science and Technology Studies
(CWTS) at Leiden University and at Ingenio (CSIC-UPV), Universitat Politècnica de València.
An earlier, much shorter version of this manuscript was first published in the Proceedings of
the 2012 S&T Indicators Conference (Rafols, Ciarli et al., 2012) and translated into Portuguese
in the Proceedings of the 2017 Brazilian Meeting of Bibliometrics and Scientometrics.

References

Adams, V. (Ed.) (2016). Metrics: What counts in global health. Duke University Press.
Adams, J., Gurney, K., Jackson, L., 2008. Calibrating the zoom – a test of Zitt’s hypothesis.
Scientometrics 75 (1), 81–95.
Barré, R. (2010). Towards socially robust S&T indicators: indicators as debatable devices,
enabling collective learning. Research Evaluation, 19(3), 227-231.
Barré, R. (2019). Les indicateurs sont morts, vive les indicateurs! Towards a political economy
of S&T indicators: A critical overview of the past 35 years. Research Evaluation, 28(1),
2-6.
Benedictus, R., Miedema, F., & Ferguson, M. W. (2016). Fewer numbers, better science.
Nature, 538(7626), 453-455.
Börner, K. (2010). Atlas of Science: Visualizing What We Know. Cambridge, MA, and
London: MIT Press.
Bruno, I., Didier, E., & Vitale, T. (2014). Statactivism: Forms of action between disclosure and
affirmation. The Open Journal of Sociopolitical Studies, 7(2), 198-220.
Burrows, R. (2012). Living with the h-index? Metric assemblages in the contemporary
academy. The sociological review, 60(2), 355-372.
Callon, M., Lascoumes, P., & Barthe, Y. (2001). Agir dans un monde incertain: essai sur la
démocratie technique. Seuil. English translation: Callon, M., Lascoumes, P., & Barthe,
Y. (2009). Acting in an uncertain world: An essay on technical democracy (Inside
technology). MIT press.
Chambers, R. (1995). Poverty and livelihoods: Whose reality counts? Environment and
Urbanization, 7(1), 173-204.
Chavarro, D. (2017). Universalism and particularism: Explaining the emergence and
growth of regional journal indexing systems. Doctoral thesis (PhD), University of
Sussex. Accessed on 11th July 2020 at http://sro.sussex.ac.uk/id/eprint/66409/.
Chavarro, D., Tang, P., & Ràfols, I. (2017). Why researchers publish in non-mainstream
journals: Training, knowledge bridging, and gap filling. Research policy, 46(9), 1666-
1680.
Coburn, J., & Stirling, A. (2016). Multicriteria mapping manual - version 2.0. SPRU - Science
Policy Research Unit, Brighton. Accessed on 11th July 2020 at
http://sro.sussex.ac.uk/id/eprint/65615/

Costas, R., Honk, J. V., Calero-Medina, C., & Zahedi, Z. (2017). Exploring the descriptive
power of altmetrics: case study of Africa, USA and EU28 countries (2012-2014). STI
2017: Science, Technology and Innovation indicators.
Dahler-Larsen, P. (2011). The evaluation society. Stanford University Press.
Dahler-Larsen, P. (2019). Quality: from plato to performance. Springer.
Desrosières, A. (2015). Retroaction: How indicators feed back onto quantified actors. In
Rottenburg, R., Merry, S. E., Park, S. J., & Mugler, J. (Eds.), The world of
indicators: The making of governmental knowledge through quantification (pp. 329-
353). Cambridge University Press.
Digital Science. (2016). Interdisciplinary research: Methodologies for identification and
assessment. Available at https://www.mrc.ac.uk/documents/pdf/assessment-of-
interdisciplinary-research/
DORA (2013) San Francisco Declaration on Research Assessment. Available at
https://sfdora.org/read/ Accessed July 10th 2020.
Douthwaite, B., Alvarez, S., Cook, S., Davies, R., George, P., Howell, J., Mackay, R. &
Rubiano, J. (2007). Participatory impact pathways analysis: a practical application of
program theory in research-for-development. The Canadian Journal of Program
Evaluation 22 (2), 127–159.
Feller, I. (2002). Performance measurement redux. American Journal of Evaluation, 23(4),
435-452.
Feller, I. (2013). Performance measures as forms of evidence for science and technology
policy decisions. The Journal of Technology Transfer, 38(5), 565-576.
Grupp, H., & Mogee, M. E. (2004). Indicators for national science and technology policy:
How robust are composite indicators? Research Policy, 33(9), 1373-1384.
Grupp, H., & Schubert, T. (2010). Review and new evidence on composite innovation
indicators for evaluating national performance. Research Policy, 39(1), 67-78.
Heuts, F., & Mol, A. (2013). What is a good tomato? A case of valuing in practice. Valuation
Studies, 1(2), 125-146.
Hicks, D. (1999). The difficulty of achieving full coverage of international social science
literature and the bibliometric consequences. Scientometrics, 44(2), 193-215.
Hicks, D. (2012). Performance-based university research funding systems. Research policy,
41(2), 251-261.
Hicks, D., Wouters, P., Waltman, L., De Rijcke, S., & Rafols, I. (2015). Bibliometrics: the
Leiden Manifesto for research metrics. Nature, 520(7548), 429-431.
Holland, J. (Ed.). (2013). Who counts?: the power of participatory statistics. Rugby, UK:
Practical Action Publishing.
Hollanders, H. and Es-Sadki, N. (2017) European Innovation Scoreboard 2017. European
Commission, Brussels. Accessed on 22nd June 2020 at
https://ec.europa.eu/docsroom/documents/24829
Kelly, A., & Burrows, R. (2012). Measuring the value of sociology? Some notes on
performative metricization in the contemporary academy. The Sociological Review, 59,
130-150.
Leach, M., Scoones, I., & Stirling, A. (2010). Dynamic Sustainabilities: Technology,
Environment, Social Justice. London and Washington D.C.: Earthscan.
Lebel, J., & McLean, R. (2018). A better measure of research from the global south. Nature,
559, 23-26. doi: 10.1038/d41586-018-05581-4.
Lehtonen, M., Sébastien, L., & Bauler, T. (2016). The multiple roles of sustainability indicators
in informational governance: between intended use and unanticipated influence.
Current Opinion in Environmental Sustainability, 18, 1-9.

Lepori, B., Barré, R., & Filliatreau, G. (2008). New perspectives and challenges for the design
and production of S&T indicators. Research Evaluation, 17(1), 33-44.
Van Leeuwen, T. N., Moed, H. F., Tijssen, R. J., Visser, M. S., & Van Raan, A. F. (2001).
Language biases in the coverage of the Science Citation Index and its consequences for
international comparisons of national research performance. Scientometrics, 51(1),
335-346.
Martin, B. R., & Irvine, J. (1983). Assessing basic research: some partial indicators of scientific
progress in radio astronomy. Research Policy, 12(2), 61-90.
Marres, N., & Gerlitz, C. (2016). Interface methods: Renegotiating relations between digital
social research, STS and sociology. The Sociological Review, 64(1), 21-46.
Marres, N., & de Rijcke, S. (2020). From indicators to indicating interdisciplinarity: A
participatory mapping methodology for research communities in-the-making.
Quantitative Science Studies, 1041-1055.
Molas-Gallart, J., Salter, A., Patel, P., Scott, A., & Duran, X. (2002). Measuring third stream
activities. Final report to the Russell Group of Universities. Brighton: SPRU,
University of Sussex. Accessed on 11th July 2020 at http://ict-industry-
reports.com.au/wp-content/uploads/sites/4/2013/10/2002-Measuring-University-3rd-
Stream-Activities-UK-Russell-Report.pdf
Noyons, E. (2019). Measuring societal impact is as complex as ABC. Journal of data and
information science, 4(3), 6-21.
O'Mahony, S. (2017). Medicine and the McNamara fallacy. The journal of the Royal College
of Physicians of Edinburgh, 47(3), 281-287.
Pielke Jr, R. A. (2007). The honest broker: making sense of science in policy and politics.
Cambridge University Press.
Pontille, D., & Torny, D. (2013). La manufacture de l'évaluation scientifique. Réseaux, (1), 23-
61.
Priem, J., Taraborelli, D., Groth, P., & Neylon, C. (2010). Altmetrics: A manifesto. Accessed
on 22nd June 2020 at http://altmetrics.org/manifesto/
Priem, J. (2014). Altmetrics. In Cronin, B., & Sugimoto, C. R. (Eds.), Beyond
bibliometrics: Harnessing multidimensional indicators of scholarly impact (pp. 263-
287). MIT Press.
Rafols, I., Ciarli, T., Van Zwanenberg, P., & Stirling, A. (2012). Towards indicators for
opening up S&T policy. STI Indicators Conference. Available in Arxiv.
Rafols, I., Leydesdorff, L., O’Hare, A., Nightingale, P., & Stirling, A. (2012). How journal
rankings can suppress interdisciplinary research: A comparison between innovation
studies and business & management. Research Policy, 41(7), 1262-1282.
Rafols, I. (2019). S&T indicators in the wild: Contextualization and participation for
responsible metrics. Research Evaluation, 28(1), 7-22.
de Rijcke, S., Wouters, P. F., Rushforth, A. D., Franssen, T. P., & Hammarfelt, B. (2016).
Evaluation practices and effects of indicator use—a literature review. Research
Evaluation, 25(2), 161-169.
de Rijcke, S., Holtrop, T., Kaltenbrunner, W., Zuijderwijk, J., Beaulieu, A., Franssen, T., ... &
Wouters, P. (2019). Evaluative Inquiry: Engaging research evaluation analytically and
strategically. fteval Journal for Research and Technology Policy Evaluation, (48), 176-
182.
Robinson-García, N., Costas, R., Isett, K., Melkers, J., & Hicks, D. (2017). The unbearable
emptiness of tweeting—About journal articles. PloS one, 12(8), e0183551.
Robinson-García, N., van Leeuwen, T. N., & Rafols, I. (2018). Using altmetrics for
contextualised mapping of societal impact: From hits to networks. Science and Public
Policy, 45(6), 815-826.

Roessner, D. (2000). Quantitative and qualitative methods and measures in the evaluation of
research. Research Evaluation, 9(2), 125-132.
Rottenburg, R., and Merry, S. E. (2015) A world of indicators: The making of governmental
knowledge through quantification. In Rottenburg, R., Merry, S. E., Park, S. J., &
Mugler, J. (Eds.). The world of indicators: The making of governmental knowledge
through quantification. Cambridge University Press, pp. 1-33.
Saltelli, A., Bammer, G., Bruno, I., Charters, E., Di Fiore, M., Didier, E., ... & Pielke Jr, R.
(2020). Five ways to ensure that models serve society: a manifesto. Nature, 582, 482-484.
Scott, J. C. (1998) Seeing like a State: How Certain Schemes to Improve the Human Condition
Have Failed. New Haven and London: Yale University Press.
Sugimoto, C. R., Work, S., Larivière, V., & Haustein, S. (2017). Scholarly use of social media
and altmetrics: A review of the literature. Journal of the Association for Information
Science and Technology, 68(9), 2037-2062.
Stirling, A., Leach, M., Mehta, L., Scoones, I., Smith, A., Stagl, S., & Thompson, J. (2007).
Empowering designs: Towards more progressive appraisal of sustainability. STEPS
Centre, Institute of Development Studies. Accessed on 11th July 2020 at
https://opendocs.ids.ac.uk/opendocs/handle/20.500.12413/2473
Stirling, A. (2008). ‘Opening up’ and ‘closing down’: Power, participation, and pluralism
in the social appraisal of technology. Science, Technology & Human Values, 33(2),
262-294.
Stirling, A. (2010). Keep it complex. Nature, 468(7327), 1029-1031.
Stirling, A. (2012). Opening up the politics of knowledge and power in bioscience. PLoS Biol,
10(1), e1001233.
Stirling, A. (2015). Developing ‘nexus capabilities’: Towards transdisciplinary
methodologies. Discussion Paper. SPRU - Science Policy Research Unit, Brighton.
Accessed on 11th July 2020 at http://sro.sussex.ac.uk/id/eprint/69094/.
Stirling, A. (2016). Knowing doing governing: Realizing heterodyne democracies. In Voß,
J. P., & Freeman, R. (Eds.), Knowing Governance. Palgrave Studies in Science, Knowledge
and Policy. Palgrave Macmillan, London. https://doi.org/10.1057/9781137514509_12
Stirling, A. (2019). How deep is incumbency? A ‘configuring fields’ approach to redistributing
and reorienting power in socio-material change. Energy Research & Social Science, 58,
101239.
Van Raan, A. F. (2005). Fatal attraction: Conceptual and methodological problems in the
ranking of universities by bibliometric methods. Scientometrics, 62(1), 133-143.
Vessuri, H., Guédon, J. C., & Cetto, A. M. (2014). Excellence or quality? Impact of the current
competition regime on science and scientific publishing in Latin America and its
implications for development. Current sociology, 62(5), 647-665.
Visser, M., van Eck, N. J., & Waltman, L. (2020). Large-scale comparison of bibliographic
data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic.
arXiv preprint arXiv:2005.10732.
Waltman, L., & van Eck, N. J. (2016). The need for contextualized scientometric analysis: An
opinion paper. In Rafols, I., Molas-Gallart, J., Castro-Martinez, E., & Woolley, R.
(Eds.), Proceedings of the 21st Conference on Science & Technology Indicators,
València (Spain), 14-16 September 2016 (pp. 541-549). Accessed on 17 August 2020 at
http://ocs.editorial.upv.es/index.php/STI2016/STI2016/paper/viewFile/4543/2327
Weinberg, A. M. (1962). Criteria for scientific choice. Minerva, 1(2), 159-171.
Weingart, P. (2005). Impact of bibliometrics upon the science system: Inadvertent
consequences? Scientometrics, 62(1), 117-131.
Wilsdon, J., Bar-Ilan, J., Frodeman, R., Lex, E., Peters, I., & Wouters, P. F. (2017). Next-
Generation Metrics: Responsible Metrics and Evaluation for Open Science. Report of
the European Commission Expert Group on Altmetrics. Accessed on 22 June 2020 at
https://op.europa.eu/en/publication-detail/-/publication/b858d952-0a19-11e7-8a35-
01aa75ed71a1/language-en/format-PDF
Wouters, P., Glänzel, W., Gläser, J., & Rafols, I. (2013). The dilemmas of performance
indicators of individual researchers–An urgent debate in bibliometrics. ISSI Newsletter,
9(3), 48-53.
Wouters, P. (2014). The citation: From culture to infrastructure. In Cronin, B., & Sugimoto,
C. R. (Eds.), Beyond bibliometrics: Harnessing multidimensional indicators of scholarly
impact (pp. 47-66). MIT Press, Cambridge, Massachusetts.
Wouters, P., Zahedi, Z., & Costas, R. (2019). Social media metrics for new research evaluation.
In Springer handbook of science and technology indicators (pp. 687-713). Springer,
Cham.
Wouters, P., Rafols, I., Oancea, A., Kamerlin, L., Holbrook, B., & Jacob, M. (2019). Indicator
frameworks for fostering open knowledge practices in science and scholarship.
Independent Expert Report for the European Commission. Accessed on 11th July 2020
at https://op.europa.eu/en/publication-detail/-/publication/b69944d4-01f3-11ea-8c1f-
01aa75ed71a1/language-en/format-PDF/source-108756824
Zitt, M., Ramanana-Rahary, S., & Bassecoulard, E. (2005). Relativity of citation performance
and excellence measures: From cross-field to cross-scale effects of field-normalisation.
Scientometrics, 63(2), 373-401.
