Alvin Cheng-Hsien Chen
ABSTRACT
This study evaluates the development of L2 collocational competence in texts
written by learners of differing proficiency levels, compared to native speaker
collocation patterns from a reference corpus. We address two questions: (1) Do
learners develop their collocation competence as their proficiency grows? (2) How is
this development mediated by different aspects of collocability, i.e., exclusivity,
directionality, and dispersion? Effective quantitative metrics based on the native
corpus were assigned to each bigram type in L2 texts, covering important aspects
of collocability. Correlations between the text-based average scores of each metric
and L2 proficiency were analyzed to examine the development of collocability in
each dimension. Our results show that exclusivity increases with learner
proficiency. When directionality is considered, learners develop native-likeness in
forward-directed word selection across all levels; backward competence, however,
improves more markedly at advanced levels. Our analysis also suggests learners
start to use less deviant collocation patterns but more domain-specific bundles as
their proficiency grows.
INTRODUCTION
which have often fallen under the cover term of collocation. Collocation
competence has been suggested to be an essential component of
native-like mastery of an L2. Collocation represents an initial, yet crucial
stage, where learners start to develop their grammatical competence of
concatenating lexical items into longer sequences for more sophisticated
linguistic expression and social communication. Collocation itself,
however, is an ambiguous term which has been operationalized by
scholars from many different perspectives. A general definition of
collocation may be traced back to a general linguistic observation, which
says that some words tend to occur in the same neighborhood (Firth, 1957;
Sinclair, 1991). The recurrence of pairs of words has therefore been a
central criterion in defining collocations.
While recurrence may seem an intuitive criterion for defining
collocation, scholars differ in their approach to deriving a more restricted
set of qualifying features for collocations. For example, collocation is
sometimes used more restrictedly to refer to word combinations which
have little semantic transparency (Nation, 2001; Nesselhauf, 2005), such
as idioms (e.g., spill the beans) or fixed expressions (e.g., nuts and bolts).
Alternatively, collocations can also refer to word combinations of relative
semantic transparency, such as strong coffee, heavy smoker. They are
uniquely defined as collocations due to the fact that one of the words in
the combinations is highly constrained to this bundle with its unique
semantics (e.g., strong and heavy). Collocations can also be defined even
more broadly as word combinations that habitually co-occur, whose
semantics can be fairly transparent (Biber & Conrad, 1999; Laufer &
Waldman, 2011; Simpson-Vlach & Ellis, 2010; Sinclair, 1991): for
example, strong man or heavy load.
This study adopts this broader co-occurrence based approach to
collocation, and regards recurring word combinations as collocations.
Most importantly, we subscribe to a graded view of collocation, by
treating word combinations as bundles of varying conventionality
depending on the degree of recurrence. In other words, collocation is
considered not a categorical feature but a quantitative feature of a two-
word sequence, which is defined based on the sequence’s corpus-based
distributional properties. On this continuum one may see word
combinations whose meaning is semantically compositional based on
their parts at one end, as well as idioms or fixed expressions whose
meaning is fully opaque at the other extreme. Syntactically, collocations
can be grammatically legitimate phrases, fully predictable from phrase-
ACQUISITION OF L2 COLLOCATION COMPETENCE
Operationalizing Collocations
This Study
the linguistic units. In particular, Gablasova et al. (2017) point out that the
distributional properties of linguistic units may need to consider three
important dimensions of collocability: exclusivity, dispersion, and
directionality. Exclusivity concerns the extent to which the words’
co-occurrence frequency exceeds the frequency expected by chance.
Dispersion is the evenness of distribution of the multiword unit
in a corpus. Directionality highlights the fact that words in a bundle are
not always attracted to each other with equal strength. When learners
develop their collocation competence, they may develop their sensitivity
to this multifaceted nature of the distributional properties (Ellis, O'Donnell,
& Römer, 2014; Ellis & Ogden, 2017). Corpus data can provide relevant
distributional metrics for us to further examine the distributional
differences of the collocations.
In this study, we use the Corpus of Contemporary American English
(COCA) as our source of distributional metrics. Take the following two
bigrams, Monday night and excellent swimmer, for example. In COCA,
there are 2314 tokens of Monday night and 12 tokens of excellent swimmer.
The raw frequencies of these two bigrams may give the impression that
Monday night is more formulaic than the other. However, based on several
corpus-based quantitative metrics to be further introduced in Method,
these two bigrams can be compared more comprehensively by considering
the exclusivity, directionality, and dispersion of their distributional
properties.
To begin with, when considering the bigram’s lexical associations, we
can analyze the property of exclusivity of these bigrams in addition to their
frequencies. Based on the mutual information scores of Monday night (MI
= 7.83) and excellent swimmer (MI = 7.70), the lexical items of these two
bigrams are almost equally exclusive to each other even though their
frequencies differ by two orders of magnitude. In other words, these two
bigrams may be equally important as conventional expressions in English
in terms of the exclusivity aspect of the bigram distribution.
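As a rough illustration of why raw frequency and exclusivity can diverge, the sketch below computes a mutual information score from raw counts. It assumes the common definition MI = log2(O11 / E11), with E11 = f(W1) × f(W2) / N; the metric actually used in this study is the one introduced in Method. The function name and all counts are hypothetical placeholders, not COCA figures.

```python
import math

def mutual_information(bigram_freq, w1_freq, w2_freq, corpus_size):
    """Pointwise MI (log base 2) of a bigram W1 W2 from raw corpus counts."""
    if min(bigram_freq, w1_freq, w2_freq, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    # Expected bigram frequency if W1 and W2 were independent
    expected = w1_freq * w2_freq / corpus_size
    return math.log2(bigram_freq / expected)

# Hypothetical counts (not taken from COCA): the two bigrams differ
# 200-fold in frequency, yet both occur 40 times more often than expected.
frequent_mi = mutual_information(2000, 50_000, 100_000, 100_000_000)
rare_mi = mutual_information(10, 250, 100_000, 100_000_000)
print(round(frequent_mi, 2), round(rare_mi, 2))
```

With these invented counts the two bigrams receive the same MI score, which mirrors the point made about Monday night and excellent swimmer: exclusivity is not determined by frequency.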
Second, when adopting lexical associations with directionality (see
delta P in Collocability Metrics), we can analyze whether the lexical items
in these two bigrams are attracted to each other in a symmetrical way.
According to the delta P scores of these two bigrams (see Collocability
Metrics for a step-by-step computation), Monday night is a forward-directed
collocation, where the first word, Monday, more strongly prompts
the second word, night; in contrast, excellent swimmer is a backward-
directed collocation, where the second word, swimmer, more strongly
prompts the first word, excellent. Therefore, these two bigrams may differ
in the relative strengths of their forward-directed and backward-directed
lexical associations.
Studies we have reviewed so far (Bestgen, 2017; Bestgen & Granger,
2014; Durrant & Schmitt, 2009; Siyanova-Chanturia, 2015) seem to have
stressed mainly the first dimension, exclusivity, when analyzing the
development of L2 collocation competence. More specifically, they have
mostly adopted non-directional association measures, such as MI or t-
scores. These association measures do not address to what extent the
development of the L2 collocation knowledge may be mediated by the
directionality of collocability. It is therefore unclear whether learners
develop collocation competence differently in terms of their native-likeness
in forward and backward word selection. Learners may develop
collocation competence by using word combinations that are more native-like
in terms of forward-directed temporal relations between words. For
example, when using the word apply, learners may demonstrate forward-directed
collocation knowledge if they choose the preposition for after apply.
On the other hand, learners may develop their collocation knowledge by
using word combinations that are more native-like in terms of backward-directed
temporal relations between words. For example, given the word home,
learners may demonstrate backward-directed collocation knowledge by choosing
the preposition at before home.
Finally, Monday night and excellent swimmer may also differ in their
dispersion. According to their distribution in COCA, Monday night is
more widely dispersed across documents than
excellent swimmer: the former is found in 119 different documents in the
entire corpus while the latter is found in only 11 documents. Lexical
association measures (i.e., MI, t-score, delta P) do not capture the
degree of dispersion of a collocation, which may nevertheless play a role in
the development of L2 collocation competence. While previous studies
have identified a positive relationship between L2 proficiency and the
average MI scores of the two-word sequences used by the learners
(Bestgen, 2017; Bestgen & Granger, 2014; Durrant & Schmitt, 2009;
Siyanova-Chanturia, 2015), it remains unclear how this increase of
exclusivity in two-word sequences may be mediated by their dispersion
rates. We may wonder whether learners also develop collocation
competence by acquiring word sequences that are more domain-general
(i.e., sequences that are widely-dispersed in different documents) at the
beginning and mastering ones that are more domain-specific (i.e.,
METHOD
Data
original CEFR B2, C1, and C2 were collapsed into B2+ and the original
B1 was subdivided into B1_1 and B1_2 in order to better represent the
largest group of Asian intermediate-level learners (cf. Ishikawa, 2013).
Thus, learners were grouped into four proficiency levels: A2, B1_1, B1_2,
and B2+. Table 1 shows the distribution of texts in all levels.
Table 1
Data Preprocessing
Collocability Metrics
Table 2.
Exclusivity
Directionality
any two-word sequence, i.e. W1W2. When W2 is taken as the outcome and
W1 as the cue, a forward-directed DP can be computed using the formula
in (3); on the other hand, when W1 is taken as the outcome and W2 as the
cue, a backward-directed DP can be computed using the formula in (4).
(3) Forward Delta P of W1W2:
\[ \Delta P = P(W_2 \mid W_1) - P(W_2 \mid \neg W_1) = \frac{O_{11}}{R_1} - \frac{O_{21}}{R_2} \]
(4) Backward Delta P of W1W2:
\[ \Delta P = P(W_1 \mid W_2) - P(W_1 \mid \neg W_2) = \frac{O_{11}}{C_1} - \frac{O_{12}}{C_2} \]
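The two formulas can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual 2×2 contingency layout, in which rows distinguish W1 from other first words (totals R1, R2) and columns distinguish W2 from other second words (totals C1, C2). The function name and the counts in the usage lines are hypothetical, not values from COCA.

```python
def delta_p(o11, o12, o21, o22):
    """Return (forward, backward) Delta P for a bigram W1 W2.

    Assumed cell layout of the 2x2 contingency table:
      o11: W1 followed by W2        o12: W1 followed by another word
      o21: another word, then W2    o22: neither W1 first nor W2 second
    """
    r1, r2 = o11 + o12, o21 + o22   # row totals: first word is / is not W1
    c1, c2 = o11 + o21, o12 + o22   # column totals: second word is / is not W2
    forward = o11 / r1 - o21 / r2   # P(W2 | W1) - P(W2 | not W1)
    backward = o11 / c1 - o12 / c2  # P(W1 | W2) - P(W1 | not W2)
    return forward, backward

# Hypothetical counts: W1 is nearly always followed by W2,
# but W2 is preceded by many other words as well.
forward, backward = delta_p(80, 20, 920, 98_980)
direction = "forward-directed" if forward > backward else "backward-directed"
```

With these invented counts the forward score is much larger than the backward score, so the bigram would count as forward-directed in the sense used above.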
We generated directional DPs, forward and backward, for all bigrams
in COCA, amounting to 2,334,463 different bigram types. These adjusted
conditional probabilities can be useful indicators of the native-like
intuition in forward- or backward-directed word co-selection. For example,
according to the forward DP based on COCA, the top five words that most
likely follow the first-person pronoun I… are am, think, do, was, and have.
If the cue is different, e.g., you…, then the native-like intuition for forward
word selection may predict a different set, i.e., know, are, can, have, and
do. Similarly, a native-like intuition for backward word selection would
predict that the top five words that most likely come before home are at,
go, back, his, and come. A different cue word like house would lead to a
different set of words likely preceding the cue, i.e., the, white, ’s, a, and
my. It is hypothesized that more advanced learners may perform the
co-selection of words more similarly to native-speaker intuition. This study
goes a step further by examining whether directionality in word
co-selection plays a role.
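To make the idea of ranking likely followers of a cue word concrete, the following sketch derives forward Delta P scores for every bigram in a token list and returns the top-ranked followers of a cue. It is a toy illustration under simplifying assumptions (a single tokenized string, no sentence boundaries, hypothetical function names) and is not the pipeline used to process COCA.

```python
from collections import Counter

def forward_delta_p_table(tokens):
    """Forward Delta P for every bigram type found in a token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        first[w1] += n    # bigrams whose first word is w1 (R1)
        second[w2] += n   # bigrams whose second word is w2 (C1)
    total = sum(bigrams.values())
    scores = {}
    for (w1, w2), o11 in bigrams.items():
        r1 = first[w1]
        r2 = total - r1            # bigrams not starting with w1
        o21 = second[w2] - o11     # w2 preceded by a word other than w1
        scores[(w1, w2)] = o11 / r1 - (o21 / r2 if r2 else 0.0)
    return scores

def top_followers(scores, cue, n=5):
    """Words most strongly predicted after `cue`, ranked by forward Delta P."""
    ranked = sorted(
        ((dp, w2) for (w1, w2), dp in scores.items() if w1 == cue),
        reverse=True,
    )
    return [w2 for _, w2 in ranked[:n]]

# Toy corpus, invented for illustration only
toy_tokens = "i think i am here and i think you know".split()
toy_scores = forward_delta_p_table(toy_tokens)
followers_of_i = top_followers(toy_scores, "i")
```

A backward ranking (words likely to precede a cue) would follow the same pattern with the roles of the first and second word swapped.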
Dispersion
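The Discussion reports dispersion in terms of IDF. As a hedged sketch, the code below assumes the conventional inverse document frequency, log(N / df), where df is the number of reference-corpus documents containing the bigram and N is the total number of documents. The study's exact formulation may differ. The two document frequencies are the ones cited earlier for Monday night and excellent swimmer, while the document total is a placeholder, not the actual size of COCA.

```python
import math

def inverse_document_frequency(doc_freq, n_docs):
    """Conventional IDF: higher values mean the bigram is concentrated in fewer documents."""
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return math.log(n_docs / doc_freq)

N_DOCS = 1_000  # hypothetical document total, for illustration only

idf_monday_night = inverse_document_frequency(119, N_DOCS)
idf_excellent_swimmer = inverse_document_frequency(11, N_DOCS)
```

Under this assumption a widely dispersed bigram receives a lower IDF than a bigram confined to a few documents, so a rising text-level average would indicate more domain-specific usage.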
Unseen Rates
All the aforementioned metrics were targeted toward bigrams used by
learners that were also present in the native reference corpus. That is, the
metrics analyzed bigrams that were found in both L2 texts and COCA.
Following CollGrams (Bestgen, 2017; Bestgen & Granger, 2014), we
also considered the rate of bigrams in a learner’s production that are absent
from the reference corpus. An unseen bigram may be significant
in two important senses. On the one hand, an unseen word combination
may be an ungrammatical or incongruent sequence in English (i.e., a
deviant word combination); on the other hand, a novel combination may
suggest a learner has mastered creative use of collocation to some extent.
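The unseen rate of a single text can be sketched as follows. This is a minimal illustration that assumes the rate is computed over bigram types (distinct bigrams) rather than tokens and that the reference corpus is available as a set of attested bigrams; both are assumptions, and the function name and data are hypothetical.

```python
def unseen_rate(l2_tokens, reference_bigrams):
    """Share of bigram types in an L2 text that never occur in the reference set."""
    l2_bigrams = set(zip(l2_tokens, l2_tokens[1:]))
    if not l2_bigrams:
        return 0.0  # a text with fewer than two tokens has no bigrams
    unseen = {bigram for bigram in l2_bigrams if bigram not in reference_bigrams}
    return len(unseen) / len(l2_bigrams)

# Toy reference set and learner text, invented for illustration
reference = {("i", "am"), ("at", "home"), ("go", "home")}
learner_text = "i am at home".split()
rate = unseen_rate(learner_text, reference)  # ("am", "at") is the only unseen type
```

Note that the rate alone cannot tell a deviant combination from a creative one; it only records that the bigram is unattested in the reference data.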
Research Questions
[Figure; recoverable labels: An L2 Text; (A) Identifying Bigrams]
Table 3
Exclusivity
Directionality
Table 4
Dispersion
Unseen Rates
DISCUSSION
For dispersion, our analysis suggests that learners may start to use
more domain-specific collocation patterns at the intermediate level (i.e.,
B1_1 to B1_2) because the IDF shows the most change on average in the
transition between these two levels. Interestingly, on the other hand,
the unseen rates show a more prominent decrease in the initial learning
phases (i.e., from A2 to B1_2), suggesting that less proficient learners
begin to use fewer bigrams that are unattested in the native reference corpus
as their proficiency progresses. These two findings both point to a
CONCLUSION
REFERENCES
Ädel, A., & Erman, B. (2012). Recurrent word combinations in academic writing by native
and non-native speakers of English: A lexical bundles approach. English for Specific
Purposes, 31(2), 81–92.
Altenberg, B., & Granger, S. (2001). The grammatical and lexical patterning of MAKE in
native and non-native student writing. Applied Linguistics, 22(2), 173–194.
Appel, R., & Trofimovich, P. (2017). Transitional probability predicts native and non-
native use of formulaic sequences. International Journal of Applied Linguistics, 27(1),
24–43.
Appel, R., & Wood, D. (2016). Recurrent word combinations in EAP test-taker writing:
Differences between high- and low-proficiency levels. Language Assessment
Quarterly, 13(1), 55–71.
Bestgen, Y. (2017). Beyond single-word measures: L2 writing assessment, lexical richness
and formulaic competence. System, 69, 65–78.
Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological
competence in L2 English writing: An automated approach. Journal of Second
Language Writing, 26, 28–41.
Biber, D., & Conrad, S. (1999). Lexical bundles in conversation and academic prose. In H.
Hasselgård & S. Oksefjell (Eds.), Out of corpora: Studies in honour of Stig Johansson
(pp. 181–190). Amsterdam: Rodopi.
Biber, D., Conrad, S., & Cortes, V. (2004). If you look at ...: Lexical bundles in university
teaching and textbooks. Applied Linguistics, 25(3), 371–405.
Chen, A. C.-H. (2019). Assessing phraseological development in word sequences of
variable lengths in second language texts using directional association measures.
Language Learning, 69(2), 440–477.
Chen, Y.-H., & Baker, P. (2010). Lexical bundles in L1 and L2 academic writing.
Language Learning & Technology, 14(2), 30–49.
Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Examples
from history and biology. English for Specific Purposes, 23(4), 397–423.
Crossley, S., & Salsbury, T. L. (2011). The development of lexical bundle accuracy and
production in English second language speakers. IRAL - International Review of
Applied Linguistics in Language Teaching, 49(1), 1–26.
Davies, M. (2012). The Corpus of Contemporary American English (COCA): 560 million
words, 1990-2012. Retrieved from https://corpus.byu.edu/coca
Durrant, P., & Schmitt, N. (2009). To what extent do native and non-native writers make
use of collocations? IRAL - International Review of Applied Linguistics in Language
Teaching, 47(2), 157–177.
Ellis, N. C. (2006). Language acquisition as rational contingency learning. Applied
Linguistics, 27(1), 1–24.
Ellis, N. C., O'Donnell, M. B., & Römer, U. (2014). The processing of verb-argument
constructions is sensitive to form, function, frequency, contingency and
prototypicality. Cognitive Linguistics, 25(1), 55–98.
Ellis, N. C., & Ogden, D. C. (2017). Thinking about multiword constructions: Usage-based
approaches to acquisition and processing. Topics in Cognitive Science, 9(3), 604–620.
Ellis, N. C., Simpson-Vlach, R., & Maynard, C. (2008). Formulaic language in native and
second language speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL
Quarterly, 42(3), 375–396.
Evert, S. (2009). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus
linguistics: An international handbook (pp. 1212–1248). Berlin: Mouton De Gruyter.
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Los Angeles: Sage
Publications.
Firth, J. R. (1957). Modes of meaning. In J. R. Firth (Ed.), Papers in linguistics 1934-1951
(pp. 190–215). Oxford: Oxford University Press.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus-based language
learning research: Identifying, comparing, and interpreting the evidence. Language
Learning, 67(1), 155–179.
Gries, S. T. (2010). Dispersions and adjusted frequencies in corpora: Further explorations.
In S. T. Gries, S. Wulff, & M. Davies (Eds.), Corpus linguistic applications: Current
studies, new directions (pp. 197–212). Amsterdam: Rodopi.
Gries, S. T. (2013). 50-something years of work on collocations. International Journal of
Corpus Linguistics, 18(1), 137–166.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University
Press.
Hyland, K. (2008). As can be seen: Lexical bundles and disciplinary variation. English for
Specific Purposes, 27(1), 4–21.
Ishikawa, S. (2013). The ICNALE and sophisticated contrastive interlanguage analysis of
Asian learners of English. In S. Ishikawa (Ed.), Learner corpus studies in Asia and
the world (pp. 91–118). Kobe, Japan: Kobe University.
Jones, J., & Pashler, H. (2007). Is the mind inherently forward looking? Comparing
prediction and retrodiction. Psychonomic Bulletin & Review, 14, 295–300.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Suchomel, V.
(2014). The Sketch Engine: Ten years on. Lexicography, 1, 7–36.
Kyle, K., Crossley, S., & Berger, C. (2018). The tool for the automatic analysis of lexical
sophistication (TAALES): Version 2.0. Behavior Research Methods, 50, 1030–1046.
Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices,
tools, findings, and application. TESOL Quarterly, 49(4), 757–786.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A
corpus analysis of learners’ English. Language Learning, 61(2), 647–672.
Leńko-Szymańska, A. (2014). The acquisition of formulaic language by EFL learners: A
cross-sectional and cross-linguistic perspective. International Journal of Corpus
Linguistics, 19(2), 225–251.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language
processing. Cambridge: MIT Press.
Nation, P. (2001). Learning vocabulary in another language. Cambridge: Cambridge
University Press.
Nation, P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7), 9–13.
Nesselhauf, N. (2005). Collocations in a learner corpus. Amsterdam: John Benjamins.
Onnis, L., & Thiessen, E. (2013). Language experience changes subsequent learning.
Cognition, 126(2), 268–284.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection
and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and
Communication (pp. 191–225). London: Longman.
Simpson-Vlach, R., & Ellis, N. C. (2010). An academic formulas list: New methods in
phraseology research. Applied Linguistics, 31(4), 487–512.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Siyanova-Chanturia, A. (2015). Collocation in beginner learner writing: A longitudinal
study. System, 53, 148–160.
Wood, D. (2015). Fundamentals of formulaic language: An introduction. London:
Bloomsbury Publishing.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University
Press.
ACKNOWLEDGMENTS
The author would like to thank the anonymous reviewers of the Taiwan Journal of
TESOL for their constructive comments to help improve earlier versions of this
manuscript. This research was supported by a grant from the Taiwan Ministry of Science
and Technology (108-2410-H-003-023-MY2).
CORRESPONDENCE
PUBLISHING RECORD
Manuscript received: May 11, 2020; Revision received: August 9, 2020; Manuscript
accepted: August 24, 2020.