Jos/ffu 017
Scalar Diversity
BOB VAN TIEL
Radboud University Nijmegen
NATALIA ZEVAKHINA
National Research University
Higher School of Economics Moscow
BART GEURTS
Radboud University Nijmegen
Abstract
We present experimental evidence showing that there is considerable variation
between the rates at which scalar expressions from different lexical scales give rise to
upper-bounded construals. We investigated two factors that might explain the vari-
ation between scalar expressions: first, the availability of the lexical scales, which we
measured on the basis of association strength, grammatical class, word frequencies and
semantic relatedness, and, secondly, the distinctness of the scalemates, which we oper-
ationalized on the basis of semantic distance and boundedness. It was found that only
the second factor had a significant effect on the rates of scalar inferences.
1 INTRODUCTION
A speaker who says (1) usually implies that she did not eat all of the
cookies. The scalar expression ‘some’, whose logical meaning is just ‘at
least some’, receives an upper-bounded interpretation and thus comes to
exclude ‘all’.
(1) I ate some of the cookies.
© The Author 2014. Published by Oxford University Press. All rights reserved.
For Permissions, please email: [email protected]
To explain this scalar inference, it is often assumed that scalar expressions
evoke lexical scales whose members are ordered in terms of informativeness.
For instance, ‘some’ evokes the scale ⟨some, all⟩, where ‘all’ is
more informative than ‘some’. A speaker who uses a less than maximally
informative scalar expression implies, at least in some situations, that
she does not believe that one of the more informative scalar expressions
would have been appropriate.
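The reasoning just outlined can be made concrete with a minimal sketch. It is not the authors' procedure: the function names are hypothetical, the scale list is a small illustrative sample, and the word-for-word substitution is a deliberate simplification of how alternatives are formed.

```python
# Illustrative lexical scales, each ordered from weaker to stronger.
# The selection is an assumption made for this sketch only.
SCALES = [
    ("some", "all"),
    ("or", "and"),
    ("sometimes", "always"),
    ("intelligent", "brilliant"),
]


def stronger_alternatives(sentence):
    """Swap each weaker scalar term in the sentence for its stronger scalemate."""
    words = sentence.rstrip(".").split()
    alternatives = []
    for weaker, stronger in SCALES:
        if weaker in words:
            swapped = [stronger if w == weaker else w for w in words]
            alternatives.append(" ".join(swapped) + ".")
    return alternatives


def scalar_inferences(sentence):
    """For each stronger alternative, state that the speaker does not believe it."""
    return ["The speaker does not believe: " + alt
            for alt in stronger_alternatives(sentence)]


print(scalar_inferences("I ate some of the cookies."))
```

Applied to example (1), the sketch yields the single alternative with ‘all’ and the corresponding upper-bounding inference; a sentence without any listed scalar term yields no inference at all.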
There is no uncontroversial definition of lexical scales. However, it is
widely assumed that lexical scales contain expressions that are ordered in
terms of informativeness and lexicalized to the same degree (e.g. Horn
1972; Gazdar 1979; Atlas & Levinson 1981). In this article, we will
confine our attention to scales that meet these minimal conditions.
This means that we will not be concerned with ranked orderings or
ad hoc scales (e.g. Hirschberg 1991; Levinson 2000). All of the example
scales in Table 1 count as lexical scales according to the traditional
Category      Examples
Adjectives    ⟨intelligent, brilliant⟩   ⟨difficult, impossible⟩
Adverbs       ⟨sometimes, always⟩        ⟨possibly, necessarily⟩
Connectives   ⟨or, and⟩
Determiners   ⟨some, all⟩                ⟨few, none⟩
Nouns         ⟨mammal, dog⟩              ⟨vehicle, car⟩
Verbs         ⟨might, must⟩              ⟨like, love⟩
consistently overlooked. Even within the classes that have been inves-
tigated, the variety of scalar expressions is limited. Apparently, the tacit
assumption underlying these experiments is that the scalar expressions in
Table 2, and especially ‘some’ and ‘or’, are representative for the entire
family of scalar expressions.
Until recently, this uniformity assumption, as we will call it, had not
been questioned, but it was put to the test by Doran and colleagues
(2009, 2012), following up on a study by the same group (Larson et al.
2009). Doran et al.’s findings suggest that there is significant variability
between the rates at which scalar terms of different grammatical cate-
gories give rise to upper-bounded inferences. However, as we will argue
in the following, there are a number of reasons for going over the same
John says:
She is intelligent.
Would you conclude from this that, according
to John, she is not brilliant?
Yes No
3.1 Experiment 1
3.1.1 Participants We posted surveys for 25 participants on Amazon’s
Mechanical Turk (mean age: 35 years; range: 21–63 years; 14 females).²
Only workers with an IP address from the USA were eligible for par-
ticipation. In addition, these workers were asked to indicate their native
language. Payment was not contingent on their response to this question.
² Mechanical Turk is a website where workers perform so-called ‘Human Intelligence Tasks’
for financial compensation. It has been shown that the quality of data gathered through Mechanical
Turk equals that of laboratory data (Schnoebelen & Kuperman 2010; Buhrmester et al. 2011;
Sprouse 2011).
3.1.3 Results and discussion One participant was excluded from the
analysis for making mistakes in three of the control items. Four out of a
total of 1250 answers were missing. Control items were answered cor-
rectly on 94% of the trials. The results for the target trials are shown in
Figure 2. It is evident from this graph that there was considerable variation
among critical items, with positive responses ranging along a continuum
from 4% (for seven adjective scales) to 100% (for ⟨cheap, free⟩
and ⟨sometimes, always⟩). The results of our first experiment thus disprove
the uniformity assumption: different scalar expressions yield
widely different rates of scalar inferences.
In this experiment, we used materials that were as neutral as possible,
which was done mainly by using pronouns instead of complex noun
phrases, but also by using generic predicates. One potential drawback of
this approach is that it may have had a disorienting effect, leaving par-
ticipants to wonder who or what these pronouns referred to, which, in
its turn, may have affected our findings. Though it is difficult to see how
³ In a pilot experiment we gauged whether the number of control items had an effect on the
results of the inference task. We presented 50 participants (mean age: 35 years; range: 18–67 years;
30 females) on Mechanical Turk with 10 of the target items included in Experiment 1 alongside 32
control items. In 16 of the control trials, the target inference was clearly valid; in the remaining 16
controls, it was clearly not valid. The results of this pilot experiment correlated almost perfectly with
the results from Experiment 1 (r = 0.97, t(8) = 11.66, P < 0.01). Apparently, the number of control
items does not have a substantial effect on the contrasts between scales.
                          Inference rate (%)   Association strength   Gramm.   Relative    Semantic      Semantic   Bounded-
Scale                       +N      −N           +N      −N           class    frequency   relatedness   distance   ness
⟨cheap, free⟩               100     93            0       0           O        0.66        .19           5.52       +B
⟨sometimes, always⟩         100     86           80      90           O        1.05        .60           5.70       +B
⟨some, all⟩                  96     89           67      87           C        0.12        .79           5.83       +B
⟨possible, certain⟩          92     93           55      31           O        0.10        .42           5.65       +B
⟨may, will⟩                  87     89           83      80           C        0.68        .51           5.41       +B
⟨difficult, impossible⟩      79     96           13      10           O        0.46        .60           6.22       +B
⟨rare, extinct⟩              79     79           40      34           O        1.05        .29           5.83       +B
⟨may, have to⟩               75     71           83      80           C        1.22        .64           5.26       +B
⟨warm, hot⟩                  75     64           70      38           O        0.28        .51           5.00       −B
⟨few, none⟩                  75     54           20      30           C        0.75        .47           5.35       +B
3.2 Experiment 2
3.2.1 Participants We posted surveys for 30 participants on Amazon’s
Mechanical Turk (mean age: 32 years; range: 21–62 years; 14 females).
Only workers with an IP address from the USA were eligible for par-
ticipation. In addition, these workers were asked to indicate their native
John says:
This student is intelligent.
Would you conclude from this that, according
to John, she is not brilliant?
Yes No
3.2.3 Results and discussion One participant was excluded from the
analysis for making mistakes in four control items. Four out of a total of
1500 answers were missing. Figure 2 shows the mean acceptance rates
for each scale.
Paired chi-square tests showed that only two scales yielded different
rates of scalar inferences in the two experiments, namely ⟨believe,
know⟩, where the rate of positive responses increased from 20% to
60% (χ²(1) = 7.42, P = 0.01), and ⟨funny, hilarious⟩, where the rate of
positive responses went from 4% to 30% (χ²(1) = 4.05, P = 0.04).
Accordingly, the product-moment correlation between the proportions
of positive answers for corresponding items in the two experiments was
high (r = 0.91, t(41) = 13.98, P < 0.01). Overall, the rates of positive
responses (42% vs. 44%) did not differ significantly across the two
experiments (χ²(1) = 0.85, P = 0.37). Paired chi-square tests showed that
there was no pair of statements for any scale that yielded significantly
different rates of positive answers (though it should be noted that there
were at most 10 observations per statement).
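As a rough sanity check on the statistics in the preceding paragraph, the following sketch recomputes a P-value from a chi-square statistic with one degree of freedom and a t statistic from a correlation coefficient. The function names are hypothetical, and the item count of 43 is an assumption inferred from the reported 41 degrees of freedom. Because the inputs are the rounded values printed above, small deviations from the reported figures are to be expected.

```python
import math


def chi2_pvalue_1df(chi2):
    """Upper-tail P-value of a chi-square statistic with 1 degree of freedom.

    With 1 df the statistic is a squared standard normal variable, so the
    tail probability reduces to the complementary error function.
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


def t_from_r(r, n_pairs):
    """t statistic and degrees of freedom for testing a Pearson r against zero."""
    df = n_pairs - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    return t, df


# Statistics as printed in the text (rounded).
print(round(chi2_pvalue_1df(7.42), 2))   # 0.01
print(round(chi2_pvalue_1df(4.05), 2))   # 0.04

# n_pairs = 43 is an assumption derived from t(41).
t, df = t_from_r(0.91, 43)
print(df, t)  # df is 41; t comes out near 14, close to the reported 13.98
```

The small gap between the recomputed t and the reported value is what one would expect if r was rounded to two decimals before being printed.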
Adding more content to the materials had a relatively small effect on
the overall results, and did not affect the general conclusions we drew
from the results of Experiment 1. This finding suggests that the general
pattern of responses is robust to changes in the sentential context. Given
our own data and Doran et al.’s, we can safely say that the uniformity
assumption is false: the rates at which scalar expressions yield upper-
bounding inferences could hardly fluctuate more.
4 EXPLAINING DIVERSITY
To compute a scalar inference, one has to assume that the speaker
considered using a stronger scalemate of the scalar expression she used
in her utterance. Otherwise it would be mistaken to infer from the
speaker’s utterance that she believes the stronger scalar expression is
inappropriate. So perhaps the variable rates of scalar inferences are
caused by differences in the availability of lexical scales.
Doran et al. (2009) provide some evidence to suggest that lexical
scales are indeed available to different degrees. As discussed in
5 AVAILABILITY
5.1 Association strength
The most straightforward measure of the availability of a lexical scale is
the strength of association between the scalar expression used in the
speaker’s utterance and its stronger scalemate. The greater the association
strength, the more likely it is that the speaker considered using the
stronger scale member. So perhaps the differential rates of scalar infer-
ences can be explained in terms of differences in association strengths.
To illustrate, consider the scalar expressions ‘warm’ and ‘big’. The reason
that scalar inferences were more frequent for ‘warm’ than for ‘big’ might
be that the strength of association between ‘warm’ and ‘hot’ is much
greater than between ‘big’ and ‘enormous’. Thus, we arrive at the fol-
lowing hypothesis:
The availability of a lexical scale ⟨α, β⟩ is an increasing function
of the strength of association of α with β.
To test this hypothesis, we need to measure the strength of associ-
ation between two scalar expressions. To this end, we conducted a
modified cloze task. A standard cloze task, like the one we used to
obtain materials for Experiment 2, consists of sentences or text fragments
with certain words removed, where participants are asked to replace the
missing words. We modified this design by underlining instead of
removing words. Participants were asked to list three alternatives to a
given sentence φ[α] by replacing the underlined scalar term with
whatever expression they saw fit. We assumed that the stronger the
association between α and β, the more likely it would be that participants
replaced α with β.
She is intelligent.
She is ______
She is ______
She is ______
5.2 Experiment 3
She is angry.
⁴ Note that the neutral version included only 41 statements, the reason being that the statements
for ⟨good, excellent⟩ and ⟨good, perfect⟩, on the one hand, and ⟨may, have to⟩ and ⟨may, will⟩, on
the other, were identical in this version of the task. In the analysis reported below, we paired the
results for these statements with the results on the inference task for ⟨good, excellent⟩ and ⟨may,
have to⟩, respectively. Changing this pairing did not have an effect on the results.
beautiful, happy, married, and so on. We ask you to tell us the first
three alternative words that occur to you when you read these
sentences. We are interested in your spontaneous responses, so
don’t think too long about it.
In the second version, the first sample alternative (here ‘beautiful’)
was replaced with a scalar term that was stronger than the highlighted
expression (namely ‘furious’). We did this to control for the possibility
that mentioning or not mentioning a stronger expression in the instruc-
tions might have an effect on the responses. More precisely, participants
5.2.3 Results and discussion Seven out of a total of 2550 answers were
missing. We annotated our results in two different ways. For each trial,
we first coded whether the participant mentioned the stronger scalar term we
used in the inference tasks. However, this measure may be too strict
because participants in the inference tasks might have computed a scalar
inference based on a different stronger scalar term. For instance, a par-
ticipant who associates ‘possible’ with ‘probable’, and computes a scalar
inference on the basis of the scale ⟨possible, probable⟩, thereby also infers
that it is not certain, even though she did not consider that particular
alternative. Therefore, we also determined for each trial in the modified
cloze task whether any stronger scalar term was mentioned. In this
measure, we did not include scalar expressions that were stronger than
the stronger scalar term we used in the inference tasks, such as ‘perfect’
for the ⟨adequate, good⟩ scale and ‘freezing’ for the ⟨cool, cold⟩ scale.
After all, someone who infers from (9a) that, according to the speaker, it
is not perfect does not necessarily infer that it is not good. Similarly for
(9b): someone who infers that it is not freezing does not necessarily infer
that it is not cold.
(9) a. It is adequate.
b. That is cool.
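The two coding schemes described above can be sketched as follows. This is an illustration only, not the annotation procedure actually used: the function name, the hand-annotated sets of stronger and overly strong terms, and the sample responses are all hypothetical.

```python
def code_trial(responses, target_stronger, stronger_terms, too_strong_terms=()):
    """Code one modified cloze trial under the strict and the lenient scheme.

    responses        -- the three alternatives a participant listed
    target_stronger  -- the stronger scalemate used in the inference tasks
    stronger_terms   -- terms annotated as stronger than the cue word (assumed
                        to come from manual annotation)
    too_strong_terms -- terms stronger than the target itself, which the
                        lenient scheme leaves out
    """
    given = {r.strip().lower() for r in responses}
    # Strict: only the scalemate from the inference task counts.
    strict = target_stronger in given
    # Lenient: any admissible stronger term counts.
    admissible = (set(stronger_terms) | {target_stronger}) - set(too_strong_terms)
    lenient = bool(given & admissible)
    return strict, lenient


# 'probable' is stronger than 'possible' but is not the target 'certain'.
print(code_trial(["probable", "unlikely", "impossible"], "certain",
                 {"probable", "certain"}))             # (False, True)

# 'perfect' is stronger than the target 'good', so it is not counted.
print(code_trial(["perfect", "fine", "okay"], "good",
                 {"good", "perfect"}, {"perfect"}))    # (False, False)
```

Averaging the strict or lenient codes over participants would then give a per-scale estimate of association strength, under the assumption that each trial contributes one binary code.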
The results of our analyses are summarized in Table 3. We start with the
strict coding scheme. We first conducted a loglinear analysis to test
whether the probability that the stronger scalar term used in the infer-
ence task was mentioned was affected by (a) whether or not the target
sentences were neutral (+N vs. −N) and (b) whether or not a stronger
scalar expression was mentioned in the instructions (+S vs. −S). A
summary of the effects of these factors is given in Table 4. Overall,
Parameter              β      SE     Z      P       R²
(Intercept)            2.80   1.73   1.62   0.104   –
Association strength   0.16   0.31   0.51   0.611   0.000
Grammatical class      0.38   0.74   0.52   0.606   0.001
Relative frequency     0.15   0.21   0.74   0.461   0.003
Semantic relatedness   0.1    0.1    0.93   0.355   0.006
Semantic distance      0.65   0.27   2.36   0.018   0.027
Boundedness            1.87   0.40   4.72   0.000   0.108
Table 5 Parameters of a mixed model with the results from Experiments 1 and 2 as dependent
variable, the strengths of association based on the lenient coding scheme (Experiment 3), open
modified cloze task. To illustrate, in the case of ‘snug’, nearly all par-
ticipants in Experiment 3 mentioned ‘tight’ as an alternative, but in
Experiments 1 and 2 the average rate of the scalar inference was only
16%; similar observations hold for ⟨pretty, beautiful⟩ and ⟨dislike,
loathe⟩. On the other hand, there was a substantial group of scales
that yielded high rates of scalar inferences, but for which stronger
scalar terms were rarely mentioned in Experiment 3, clear examples
being ⟨cheap, free⟩, ⟨hard, unsolvable⟩ and ⟨difficult, impossible⟩. In
sum, the findings of this experiment argue against the hypothesis that
rates of scalar inferences are determined by the strength of the connec-
closed grammatical classes than for open ones, and therefore it seems
plausible to suppose that lexical scales are more available when their
elements are from a closed grammatical class than from an open one.
The following hypothesis captures this explanation:
The availability of a lexical scale ⟨α, β⟩ is greater if α and β are
from a closed grammatical class.
To test this hypothesis, we subdivided the scalar expressions into
open and closed grammatical classes (Table 3). Although the average
rate of scalar inferences was higher for scales from closed (76%) than
these LSA values do not reflect how often a pair of words co-occur, but
rather how often they co-occur with the same words.
On the basis of Landauer et al.’s (1998) LSA implementation,
we obtained relatedness values for each pair of scalar terms through
pairwise, term-to-term comparisons with ‘general reading up to
first year of college’ as topic space. These relatedness values, listed in
Table 3, were used as an estimator of the results of Experiments 1 and 2.
LSA values were not a significant predictor of the rates of scalar
inferences (β = 0.01, SE = 0.01, Z < 1). We thus conclude that semantic
relatedness has no effect on the rates of scalar inferences that we
5.6 Conclusion
To compute a scalar inference, one has to assume that the speaker
considered the corresponding lexical scale. Otherwise it would be mis-
taken to attribute her choice for a weaker scalar expression to the belief
that the stronger scale member is inappropriate. Based on this observa-
tion, we hypothesized that the differential rates of scalar inferences in
Experiments 1 and 2 were caused by differences in availability. In the
foregoing sections, we operationalized the notion of availability by
means of association strength, grammatical class, word frequencies and
semantic relatedness. But none of these measures made a significant
contribution to the rates of scalar inferences. Availability thus plays at
best a marginal role in shaping the results of Experiments 1 and 2.
It might be objected that the absence of a significant contribution of
availability has a methodological cause. In our inference tasks, the ques-
tion participants had to answer contained a scale member that was
stronger than the one used in the target statement. One might suppose
that this feature caused all lexical scales to be rendered available, thereby
obviating the effect of intrinsic measures of availability like the ones
tested in the previous sections.
A number of observations speak against this explanation. First and
foremost, recall that Doran et al. (2009) made a comparison between
neutral, one-way contrastive, and two-way contrastive items. In the
neutral condition, Irene’s question did not contain scale members; in
the one-way contrastive condition, it contained one scale member that
was stronger than the one used in Sam’s answer; and in the two-way
contrastive condition, Irene, in effect, provided Sam with three scale
members to choose from. The items in our inference tasks most closely
resemble the items in Doran et al.’s one-way contrastive condition, since
both involve a question that contains a scale member stronger than the
6 DISTINCTNESS
6.1 Semantic distance
The notion of semantic distance was inspired by an observation by Horn
(1972: 90). Consider the following examples:
(11) a. Many of the senators voted against the bill.
b. Most of the senators voted against the bill.
c. All of the senators voted against the bill.
An utterance of (11a) is more likely to implicate the negation of (11c)
than the negation of (11b), since the negation of (11b) is logically stron-
ger than the negation of (11c). So whenever a listener infers that the
sentence with ‘most’ is false, she thereby also infers that the sentence with
‘all’ is false, but not vice versa. In more general terms, the likelihood of a
scalar inference is an increasing function of the relative semantic distance
between the scalar term used in the speaker’s utterance and the stronger
6.2 Experiment 4
6.2.1 Participants We posted surveys for 25 participants on Amazon’s
Mechanical Turk (mean age: 33 years; range: 20–62 years; 15 females).
Only workers with an IP address from the USA were eligible for par-
ticipation. In addition, these workers were asked to indicate their native
language. Payment was not contingent on their response to this ques-
tion. One participant was excluded from the analysis because she was
not a native speaker of English. Two participants had also participated in
Experiment 1 or 2. We included these participants in the analysis.
Excluding them would not change the statistical significance of any of
the P-values we report.
1. She is intelligent.
2. She is brilliant.
Is statement 2 stronger than statement 1?
6.2.3 Results and discussion Eight out of a total of 1250 answers were
missing. One participant was excluded from the analysis because her
mean rating for the control items exceeded two standard deviations
from the grand mean for these items. The results of the experiment
are presented in Table 3. The mean distance for the synonymous control
items was 2.81. The 95% confidence interval around this mean was
2.53–3.09. There was only one lexical scale whose mean distance fell
within that confidence interval: ⟨snug, tight⟩. This finding indicates that,
except for this outlier, participants were able to perceive a difference in
strength between scalemates.
The mean ratings on the distance task made a significant contribution
to the rates of scalar inferences (β = 0.65, SE = 0.27, Z = 2.36,
P = 0.02). This finding confirms the prediction made by the distance
hypothesis. In Section 7, we discuss the variance explained by this and
other factors.
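The relation between the coefficient, its standard error, the Z value and the P-value reported above can be checked with a short sketch. It assumes that Z is a Wald statistic (the coefficient divided by its standard error) evaluated against a standard normal distribution, which the text does not state explicitly; the function names are hypothetical.

```python
import math


def wald_z(beta, se):
    """Wald statistic: the estimated coefficient divided by its standard error."""
    return beta / se


def two_sided_p(z):
    """Two-sided P-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Values as printed in the text (rounded to two decimals).
z = wald_z(0.65, 0.27)
print(z)                  # about 2.4, close to the reported Z = 2.36
print(two_sided_p(2.36))  # about 0.018
```

Recomputing Z from the rounded coefficient and standard error gives a slightly larger value than the one reported, which is consistent with the printed figures having been rounded after the model was fitted.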
scales like ⟨some, most⟩ and ⟨sometimes, often⟩ are open even though
they consist of elements from a closed grammatical class.
It was found that bounded scales indeed licensed higher rates of scalar
inferences than non-bounded scales (62% vs. 25%). Boundedness made a
significant contribution to the rates of scalar inferences in Experiments 1
and 2 (β = 1.87, SE = 0.40, Z = 4.72, P < 0.01). The likelihood of a
scalar inference is predicted in part by the distinction between bounded
and non-bounded lexical scales. Section 7 discusses a measure of the
variance explained by boundedness.⁵
A. Target sentences
adequate/good: It → The food (5) | salary (1) | solution (1) is adequate.
allowed/obligatory: It → Copying (2) | Drinking (4) | Talking (2) is allowed.
attractive/stunning: She → That nurse (1) | This model (7) | The singer (2) is attractive.
believe/know: She believes it. The student (1) believes it will work out (1). The mother (3) believes it will happen (1). The teacher (6) believes it is true (1).
big/enormous: It → That elephant (4) | The house (1) | That tree (1) is big.
cheap/free: It → The water (2) | electricity (1) | food (5) is cheap.
content/happy: She → This child (3) |
is snug.
some/all: He saw some of them. The bartender (1) saw some of the cars (2). The nurse (1) saw some of the signs (1). The mathematician (1) saw some of the issues (1).
sometimes/always: He is sometimes inside. The assistant (1) is sometimes angry (3). The director (1) is sometimes late (2). The doctor (2) is sometimes irritable (1).
special/unique: It → That dress (1) | That painting (1) | This necklace (1) is special.
start/finish: She → The athlete (1) | dancer (2) | runner (2) started.
tired/exhausted: He → The quarterback (1) | runner (1) | worker (3) is tired.
try/succeed: He → The candidate (1) | athlete (1) | scientist (1) tried.
ugly/hideous: It → The wallpaper (2) | That sweater (1) | That painting (3) is
B. Control sentences
clean/dirty: That → The table is clean.
dangerous/harmless: It → The soldier is dangerous.
drunk/sober: He → The man is drunk.
sleepy/rich: He → The neighbor is sleepy.
tall/single: She → The gymnast is tall.
ugly/old: It → The doll is ugly.
wide/narrow: It → The street is wide.
Acknowledgements
We would like to thank Chris Cummins, Corien Bary, Ira Noveck, Judith Degen,
Matthijs Westera, Michael Franke, Paula Rubio-Fernández, Rob van der Sandt,
Sammie Tarenskeen, Yaron McNabb, Ye Tian and two anonymous reviewers for
their comments and questions. This research was supported by a grant from the
Netherlands Organization for Scientific Research (NWO), which is gratefully
acknowledged.
REFERENCES