Learner Corpora and Computer-Aided Error Analysis
Myung-Jeong Ha
Abstract
This study focuses on an alternative approach to looking at learner language. In particular, the
present study investigates the use of English verb-noun collocations in the writing of native speakers of
Korean at intermediate levels. For this purpose, a learner corpus was compiled that consists of
19,826 words of comparison and descriptive essays. For comparison purposes, the study employed
LOCNESS, a corpus of essays by young adult native speakers of English. The main body of the study
compares a sample of the American English sub-component of LOCNESS
with the essay writing of Korean EFL university students. The most frequently occurring common verbs
in the learner corpus were retrieved, concordances for them were created, and verb-noun collocations
were extracted. Subsequently, two types of comparison were performed: Korean EFL learners were
compared with native speakers on the variety of common verb collocations they used and on the
correctness of those collocations. The data reveal that the Korean EFL learners produced
far fewer collocations than native speakers, and that errors, particularly interlingual ones,
persist at intermediate levels of proficiency.
1. Introduction
Recent developments in the field of small corpus studies, largely brought about by the personal
computer, have yielded remarkable insights into the nature and use of real language. Computer
corpora have played a major role in language-related fields, from lexicography to language teaching
and natural language processing. While the use of corpora has spread to the English for academic
purposes (EAP) field, EAP researchers and material writers mainly rely on native corpora. In fact,
learner corpora containing data produced by second language (L2) learners are rarely examined in
spite of their tremendous potential for EAP studies. Although L2 learners share many difficulties
with novice native writers, they have their own distinctive problems. The aim of this paper is to show
the usefulness of analyzing learner corpora as an effective way of operationalizing writing difficulties.
According to Gilquin, Granger, and Paquot [1], it used to be fairly difficult to form a picture of learner
EAP writing, but the analysis of learner corpora has the potential to offer a breakthrough since
researchers now have access to large databases of learner language and powerful methods of analysis.
Indeed, recent developments in corpus technology have heightened the need for exploring corpora of
learner language. This study employs error-oriented approaches to learner corpora that differ
from traditional EA (Error Analysis) studies in that they are computer-aided and involve a higher
degree of standardization. Although it is generally accepted that collocations are both indispensable and
problematic for foreign language learners and should therefore play an important role in second
language acquisition (SLA), L2 learners’ difficulties with collocations have not been discussed in
detail by EFL practitioners so far [2]. Collocations have been largely neglected by researchers,
material designers and EFL practitioners. In the field of computer-assisted language learning, with
new corpus technology becoming available, the need for more studies on collocations is obvious.
Following the trend towards collocational competence in second language learning, the present study
investigates the use of English verb collocations in the writing of native speakers of Korean. Since
restricted collocations are frequently said to be an area where L2 learners have greater difficulties but
which remains neglected, the use of restricted collocations in a learner corpus is the main focus. In
spite of some suggestions made on the teaching of collocations in recent years [3], it is not clear
which collocations in a second language should be taught and how. To provide some answers to these
questions, it is important to examine the difficulties that L2 learners have in using collocations. Since
the production of collocations is more problematic than their comprehension, the present study focuses
on collocational problems in L2 learners’ writing in order to identify the difficulties they have. The
purpose of this study is to explore how high-frequency common verbs in a learner corpus
form collocations with other words. Whereas high frequency makes a word familiar to learners, there
is always a gap between receptive and productive vocabulary. Common verbs such as make and have
have attracted much attention from proponents of the lexical approach to second language teaching [2].
Given that common verbs serve various grammatical functions and their meaning is tailored by the
company they keep, this study assumes that these common verbs may not be as easy to learn as
commonly believed. These verbs are worthy of investigation because they are not generally examined
in L2 vocabulary learning. An investigation into L2 learners’ knowledge of common verbs would shed
light on the importance of common verbs in L2 writing. This study also contributes to the growing
literature that draws on Contrastive Interlanguage Analysis (CIA) (Granger 1996) by comparing the
use of verb + noun combinations between native speaker corpora and nonnative speaker corpora.
Footnote 1: This research was supported by a 2013 Research Grant from Sangmyung University.
One of the primary contributions of computer learner corpora is that they make it possible to
explore aspects of learner language which have been difficult to investigate. These investigations are
comparative. Given that authentic learner data is compared with native speaker data, the process falls
within the domain of Contrastive Interlanguage Analysis (CIA) [11]. This study employs CIA, which is
different from contrastive analysis. Whereas contrastive analysis involves the linguistic comparison of
two or more languages, CIA involves varieties of the same language.
According to Granger [11], CIA concerns “quantitative and qualitative comparisons between L1 and
L2 and between different varieties of interlanguage.” Two types of comparison lie at the
heart of studies using CIA: comparing native language with interlanguage, and comparing different
types of interlanguage with each other.
The aim of the first type of comparison is to explore distinctive features of a particular
interlanguage. This type of comparison makes it possible to examine the phenomenon of overuse and
underuse of linguistic items. This approach has the potential to reveal different distributional patterns
from comparable native language. These patterns can explain why a text including no overt lexical
errors gives the impression that it has not been written by a native speaker. Overuse and underuse are
intended as neutral, quantitative measures of linguistic differences [12]. This method can distinguish
between L1-related and universal features of learner language and provide a picture of advanced
interlanguage and of the role of L1 transfer for different L1 backgrounds.
Studies of overuse and underuse widen the scope of traditional error analysis because it is
difficult to identify overuse and underuse without computational methods. Computational methods
reveal areas where learner language differs from native language with respect to frequency of
distribution.
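A frequency contrast of this kind can be sketched in a few lines of Python. The example below is only an illustration: the paper does not state which statistic, if any, was used, so the choice of the log-likelihood (G2) measure common in corpus comparison is an assumption, and the function names and the counts in the usage note are hypothetical.

```python
import math


def log_likelihood(count_learner, size_learner, count_native, size_native):
    """Log-likelihood (G2) for one item's frequency in two corpora.

    Assumed statistic for illustration; not taken from the study itself.
    """
    total_size = size_learner + size_native
    combined = count_learner + count_native
    # Expected counts if the item were equally frequent in both corpora.
    expected_learner = size_learner * combined / total_size
    expected_native = size_native * combined / total_size
    g2 = 0.0
    for observed, expected in (
        (count_learner, expected_learner),
        (count_native, expected_native),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def direction(count_learner, size_learner, count_native, size_native):
    """Label the learner corpus as overusing or underusing the item."""
    rate_learner = count_learner / size_learner
    rate_native = count_native / size_native
    if rate_learner > rate_native:
        return "overuse"
    if rate_learner < rate_native:
        return "underuse"
    return "equal"
```

With hypothetical counts of 30 occurrences in a 10,000-word learner corpus and 100 in a 100,000-word native corpus, `direction(30, 10_000, 100, 100_000)` returns "overuse", and the G2 value exceeds 3.84, the critical value for p < 0.05 with one degree of freedom. In keeping with the neutral sense of the terms, the labels describe a frequency difference only, not an error.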
The second type of comparison includes the comparison of different non-native speaker (NNS)
varieties. NNS-NNS comparisons make it possible to observe strategies used by all learners or by
several learner groups. In addition, these comparisons are facilitated by the design of the
sub-corpora of the International Corpus of Learner English (ICLE), which controls for relevant variables [13].
For example, a comparison of French learners’ use of indeed with that of some other learner groups
such as Norwegians reveals that the French learners overuse indeed in contrast to Norwegians, who
underuse it [14]. Similarly, Norwegians are known to overuse kind of almost as much as their Swedish
neighbors, with 44.8 occurrences per 100,000 words.
It is important to note that CIA is not restricted to ICLE or to written English only. Rather, other
researchers have adopted this method in analyzing interlanguage corpora of German, Italian, and
Norwegian. Spoken learner discourse is also being analyzed through the Louvain International Database of
Spoken English Interlanguage (LINDSEI). LINDSEI, which comprises different L1 backgrounds and an
NS reference corpus, is the spoken counterpart of ICLE.
4. Methods
In the methods which follow, the terms NNS and NS refer to the non-native speaker and
native speaker groups respectively. The computer learner corpus compiled for this study consists of
19,826 words of comparison and descriptive essays produced in an English writing class at a
university in Korea. The 46 essays investigated were written by Korean-speaking university students
of English, mainly in their second or third year. The essays include titles such as
‘my favorite place’ or ‘shopping at stores and shopping online’ and have an average length of about
400 words. The corpus contains two essays per learner.
The comparative native-speaker (NS) corpus, the Louvain Corpus of Native English Essays
(LOCNESS), was compiled at the University of Louvain-la-Neuve in Belgium and comprises essays
written by young adult NSs of English. The total number of words in the corpus is 324,304. To
compare nonnative speaker use with native English use, a 170,000-word sample from LOCNESS was
used. Because LOCNESS contains argumentative essays written by native-speaker American
university students, the sample is fully comparable to the learner corpus. Table 1 presents the exact size
of the corpora used.
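Because the two corpora differ greatly in size, raw counts cannot be compared directly. The sketch below shows one common way to put them on a shared scale, a rate per 10,000 words. The function name and the base of 10,000 are assumptions made for illustration; the learner corpus size comes from the text, and the LOCNESS sample size is the approximate figure given above rather than the exact size reported in Table 1.

```python
LEARNER_CORPUS_SIZE = 19_826   # words, as stated in the text
NS_SAMPLE_SIZE = 170_000       # words, approximate figure from the text


def per_n_words(raw_count, corpus_size, base=10_000):
    """Convert a raw count into a rate per `base` running words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * base / corpus_size


# Example: 50 occurrences in the learner corpus, expressed per 10,000 words.
learner_rate = per_n_words(50, LEARNER_CORPUS_SIZE)
print(round(learner_rate, 1))
```

A rate computed in this way for the learner corpus can then be set beside the corresponding rate for the NS sample, whatever the raw counts turn out to be.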
For this study, WordSmith Tools, a user-friendly and powerful package, was used to automate
part of the linguistic analysis.
First, a frequency wordlist was generated by WordSmith Tools [15]. As mentioned earlier, I chose
high-frequency common verbs and their collocations for analysis. According to Altenberg
and Granger [16], the following fifteen verbs appear on any corpus-based list of high-frequency
verbs: have, do, know, think, get, go, say, see, come, make, take, look, give, find, and use.
Given that the literature on high-frequency verbs points to an overuse of these verbs by EFL learners
and to the error-proneness of these verbs in L2 writing [17][18], I chose the node words from these
high-frequency verbs.
Next, I used the lemmatizing facility, which allowed all the inflectional forms
of the high-frequency verbs to be grouped. An English lemma list from Someya is available at
http://www.lexically.net/downloads/BNC_wordlists/e_lemma.text. Figure 1 gives a screenshot of the
lemmatizing facility.
In this paper, the composite set of word-forms is viewed as a lemma [4]. Figure 2 below shows an example
of lemmatized results. In this way, the verbs in the corpus were lemmatized to get the total occurrences
of each word.
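The idea of grouping inflected forms under one lemma before counting can be illustrated with a short Python sketch. It does not reproduce WordSmith Tools or the Someya lemma list: the small hand-written lemma map, the tokenizer, and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-written lemma map standing in for a full lemma list.
LEMMA_FORMS = {
    "have": {"have", "has", "had", "having"},
    "do": {"do", "does", "did", "doing", "done"},
    "go": {"go", "goes", "went", "going", "gone"},
    "make": {"make", "makes", "made", "making"},
}
# Invert the map so each word-form points to its lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEMMA_FORMS.items() for form in forms
}


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lemma_frequencies(text):
    """Count occurrences of each node lemma, summing over its word-forms."""
    counts = Counter()
    for token in tokenize(text):
        lemma = FORM_TO_LEMMA.get(token)
        if lemma is not None:
            counts[lemma] += 1
    return counts
```

A form-based count like this cannot tell a lexical verb from an auxiliary or a homographic noun, so the totals still need manual inspection of the concordance lines.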
Table 2 below lists the frequencies of occurrence of the verbs used in the study. Of the fifteen verbs,
I selected the most frequent verbs from the word lists of the learner corpus - that is, verbs that occurred
more than 50 times in the corpus. In order to study the behavior of words in texts, it is necessary to
obtain enough observations for each verb. It was thus difficult to deal with verbs with relatively low
frequencies. All the necessary word-forms that are associated with the four key verbs were taken out
and counted separately.
After choosing those verbs in the learner corpus, I then examined the frequencies of the verbs from
LOCNESS. The results are given in Table 3.
The next step was to scrutinize concordance lines to weed out irrelevant instances. Modal verbs and
noun forms were removed. I then picked out collocation errors from the KWIC (Key Word in
Context) lists with the help of the KWIC index in WordSmith Tools. Figure 3 presents the concordances of
HAVE.
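For readers without access to WordSmith Tools, the following Python sketch shows what a KWIC display does in principle. It is not the WordSmith implementation; the function name, the bracketed display format, and the context width are assumptions made for illustration.

```python
import re


def kwic(text, forms, width=30):
    """Return key-word-in-context lines for any of the given word-forms."""
    # Longest forms first so that longer alternatives are tried before shorter ones.
    alternatives = "|".join(
        re.escape(form) for form in sorted(forms, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Flatten line breaks so each hit stays on one display line.
        left = left.replace("\n", " ")
        right = right.replace("\n", " ")
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


for line in kwic(
    "Today many people want to have some healing times.",
    {"have", "has", "had", "having"},
    width=20,
):
    print(line)
```

Each line places the node word in a fixed column with its left and right context, which makes it easier to scan the noun that follows the verb and to set aside irrelevant hits by hand.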
I classified the verb-noun combinations into three categories, that is, free collocations, restricted
collocations, and idioms. On the basis of previous studies [2][19], I used the BBI dictionary of English word
combinations (BBI) and the Oxford Collocations Dictionary for Students of English (OCDE) as the
main references, supplemented by the Collins Cobuild English Dictionary (CCED). As collocation
errors were identified in the concordances, they were copied from the text and pasted into a separate
file for later analysis.
First, for the node word have, 250 verb-object-noun combinations were extracted from the
learner essays. Among them, 125 were classified as free combinations and 36 as collocations. Korean
EFL learners tend to underuse have collocations. The learner data show a very limited use
of have collocations, including have an effect, have something in common, have a holiday, have friends,
etc. In contrast, the NS corpus contains a much wider range of have collocations, such as have a sense of,
have a good idea of, have a greater effect on, have a greater respect for, etc., as Figure 4 shows.
Some of the noun collocates were shared because they occurred in both corpora, and some were
used exclusively by one group. While shared noun collocates for the node word have were effect,
something in common, friends, exclusive noun collocates used by the NS group appeared to have a far
greater variety, including sense, idea, respect, incidence, grudge, etc. In addition, it is interesting to
note that very few mistakes (3 out of 36) were identified with respect to the use of have collocations
from the learner corpus.
Second, in the case of the node word do, 71 verb-object-noun combinations were extracted from the
learner essays, of which 15 were classified as collocations and 21 as free combinations. The
learner corpus shows a very limited use of do collocations, including do shopping, do their work, and
do the task, which points to an underuse of do collocations in NNS writing. In contrast to the case of have
collocations, more than half (9 out of 15; 60%) of the do collocations produced by the Korean learners
contained one or several mistakes. For the same node word, the NS corpus contained 274 verb-object-noun
combinations, and several types of do collocations were identified
in NS writing: do business, do a favour, do their jobs, etc.
Third, for the node word go, 105 verb-object-noun combinations were extracted from the learner
essays. Among them, 80 were classified as free combinations and 22 as collocations. Examples of go
collocations in the NNS writing include go shopping, go on, go their own way, etc. More than half of the
go collocations (12 out of 22) contained mistakes. Of these 12 mistakes, the most frequent type
concerns the preposition following go, as in *go to picnic (go for/on a picnic), *go the store (go to the
store), and *going to there (go there). This seems to indicate that the Korean learners make frequent
errors in the use of prepositions or particles after common verbs. It may be suggested that one reason
for the EFL students' problems in learning English prepositions is that they usually try to learn the
meaning and use of prepositions individually without paying sufficient attention to their collocational
properties [26]. Therefore, knowing which preposition or particle should follow a verb is crucial
for Korean EFL learners. For the same node word, the NS corpus contained 264 verb-object-noun
combinations, and the following types of go collocations were identified in
NS writing: go on strike, go home, go to bed, go into details, go on for months, go through, etc. This
indicates that NS writing shows a wider range of go collocations.
Finally, in the case of the node word make, 59 verb-object-noun combinations were extracted from
the learner essays, of which 19 were classified as collocations and 39 as free combinations. The learner
data show make collocations including make money, make atmosphere, make use of, make friends, etc.
In contrast, the NS corpus, which contains 408 verb-object-noun combinations, reveals a much wider range of
make collocations, such as make a point, make demand, make money, make assumptions, etc. As with the
use of have collocations, some of the noun collocates were shared and some were used exclusively by
the NS group. The exclusive noun collocates used by the NS group include claim, demand,
assumptions, statement, etc.
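The error counts reported above for each node verb can be put side by side with a small calculation. The sketch below uses only figures stated in this section; the variable and function names are hypothetical, and no error count is reported for make, so that entry is left empty rather than estimated.

```python
# Counts reported in this section for the learner corpus.
REPORTED = {
    "have": {"collocations": 36, "with_errors": 3},
    "do": {"collocations": 15, "with_errors": 9},
    "go": {"collocations": 22, "with_errors": 12},
    "make": {"collocations": 19, "with_errors": None},  # not reported here
}


def error_rate(collocations, with_errors):
    """Share of a verb's collocations that contain at least one mistake."""
    if with_errors is None:
        return None
    return with_errors / collocations


rates = {verb: error_rate(**counts) for verb, counts in REPORTED.items()}

for verb, rate in rates.items():
    label = "n/a" if rate is None else f"{rate:.1%}"
    print(f"{verb:<5} {label}")
```

Read this way, have collocations are rare but mostly correct, whereas do and go collocations contain mistakes in more than half of the cases.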
Table 5 below summarizes the overall distribution of major types of errors.
Type | Description | Occurrences | Examples
Noun | Wrong choice of noun (or non-existent noun) | 3 | My friends and I mostly do *shop online. A number of …
Number | Noun used in singular instead of plural or vice versa | 3 | Today many people want to have some healing *times. / …see my school campus? You will make great *memory.
Usage 2 | Combination exists but is not used correctly | 3 | First, a proper friend never *go back on their friend even… / we should *go for seek items and walking / However, the customers *go out on shopping at stores
First, prepositional errors, with 37 occurrences, were the most salient. For example, *going to
there is a literal translation of the Korean expression “geogi-e ga-da.” These
results may be due to the fact that second language learners tend to work on the hypothesis that there is a
word-for-word translation equivalence between L1 and L2. Another possible reason is that Korean has
fewer words functioning as prepositions and they are used less frequently than in English. That is to
say, the major difficulties appear to be L1 related.
Secondly, determiner errors were the second most frequent type, with 25 occurrences. The result may be
explained by the fact that Korean does not have articles. Moreover, the use of English articles is a
subtle and complex phenomenon, and there is no clear L2 input or formal instruction that can help
Korean learners acquire the semantics of English articles [21].
Compared to the use of determiners and prepositions of a prepositional verb, fewer mistakes (18
occurrences) were made with respect to the type of wrong choice of verb. This finding is interesting
because the verb in a collocation has a restricted sense, which makes its correct use more difficult.
Another interesting finding is that errors influenced by direct translation from L1 seem predominant in
terms of the type of wrong choice of verb. An example of a direct translation of a Korean collocation
is:
Iron Man is fond of the limelight as to *do announcement in the presence of others and enjoy the
given situation. (NNS-w2c6)
The Korean for make an announcement is seoneon-hada whose literal meaning is ‘do
announcement’. Another example is:
However, you cannot *do actions like that when you shop on the internet. (NNS-w2c6)
This is very likely to be a translation of haengdong-hada, literally ‘perform an action’. The Korean
word hada is usually translated as do in English, and there are contexts where this would be acceptable
(such as do shopping), but frequently a different word is expected as in the following example:
Unlike Dark Knight, he can *do flight to the sky and suit is a great weapon, with armor. (NNS-
w2c8)
used with these nouns as the object: agreement, promise, word, and assurance, which means to break a
promise that one has made.
First, a proper friend never *go back on their friend even though the friend become a beggar. (NNS-
w2d12)
6. Conclusion
This study compared the use of common verb + noun combinations by Korean learners of English
with that of native speakers at university level. Its aim was to account for the quantitatively and qualitatively
different verb + noun combinations Korean learners of English produce. While the NS corpus
displayed greater variety in the use of common verb combinations, the NNS corpus showed very limited
use of verb collocations. In particular, the lexical variety of the noun collocates for each common verb
led to greater differences between the NNS and NS corpora.
Regarding sources of non-nativeness in the learner corpus, errors were attributed to three factors:
the grammatical characteristics of the L1 (Korean), direct
translation from the L1, and lack of exposure to the target language. However, this
study does not suggest that each error can be neatly assigned to one factor or another. Without interviewing
learners at the time of writing, judgments about collocational errors remain speculative. In addition,
collocational errors may show elements of more than one factor.
It is interesting to note that the Korean EFL learners made relatively few mistakes of the type
wrong choice of verb, whether in the free or the restricted collocations. It seems reasonable to suppose
that the Korean learners were more conservative and cautious in the choice of a noun collocate for a
particular verb, whereas they tended to be more subjective and even creative in the choice of a
preposition collocate for a certain verb, hence generating more mistakes in this respect. The main
conclusion that can be drawn from this study is that, to a certain degree, intermediate Korean learners
still have a problem with the use of high-frequency common verb collocations. Although students
learn high-frequency verbs very early, these verbs tend to be overlooked once they have been taught.
As discussed earlier, the students lacked knowledge of the collocational possibilities of
verbs: there is a mismatch between lexical items, as in to *do actions. Whereas previous studies have
mainly focused on the combination of two lexical items, the present study reveals that verb
collocational errors are not entirely a matter of mismatch between the verb and the noun. Rather, other types such
as prepositional errors and determiner errors were relatively more frequent among the intermediate
Korean learners. It is therefore reasonable to emphasize that the non-lexical elements belonging to
word combinations should not be overlooked.
Several limitations to this study need to be acknowledged. The sample size is relatively small
because the learner corpus used in this study was compiled from essays written by Korean learners in a
composition class. As mentioned earlier, there is some overlap with respect to sources of collocational
errors as well as types of collocational errors, and these sources and types represent tendencies rather
than absolutes. In addition, this study did not employ error annotation, which is particularly relevant for
interlanguage studies because POS taggers have been trained on native speaker corpora.
Nevertheless, the results have some pedagogical implications with respect to the teaching of
collocations. According to the extensive literature on second language acquisition, native-standard
L2 proficiency seems an unattainable goal. Therefore, vocabulary teaching should aim at a functional
proficiency that approximates native-likeness. Because the learning of verb + noun combinations requires
a lot of input exposure, it is relevant to teach their combinatorial possibilities and restrictions explicitly.
For example, it is necessary to teach which collocate (the arbitrary part, e.g., make) is to be combined
with the base (the non-arbitrary part, e.g., announcement). The first step towards this is awareness that
collocations differ from language to language. Learners should be provided with sufficient input
including the word combinations they need, and the possibilities and restrictions
on collocational uses should be pointed out to them.
Whereas studies of collocations have shown convincing results for the explicit teaching of
collocation in the classroom, there still remains the issue of which collocations should be given priority
and how they should be taught. The findings in this study shed light on this issue. As previously
discussed, not all errors are a mismatch between the verb and the noun, that is, a matter of the
collocational possibilities of the two lexical items in question. Other types of errors such as
prepositional errors as in Kimbob is made *by rice (made of) and determiner errors as in before making
*decision (make a decision) are also frequent among the Korean EFL learners. These results suggest
that teaching grammatical collocations as well as lexical collocations for the frequent node words
would be of great benefit to Korean EFL learners.
A final suggestion related to research methods is in order. It is necessary to complement learner
production data with experimental data in order to capture both aspects of competence and
performance. To conduct a full CIA study, a researcher also needs to compare the NNS production
with comparative NS data based on the criteria of comparability.
7. References
[1] Gaëtanelle Gilquin, Sylviane Granger, Magali Paquot, “Learner corpora: The missing link in EAP
pedagogy”, Journal of English for Academic Purposes, vol. 6, no. 4, pp. 319-335, 2007.
[2] Nadja Nesselhauf, “The use of collocations by advanced learners of English and some implications
for teaching”, Applied Linguistics, vol. 24, no. 2, pp. 223-242, 2003.
[3] Michael Lewis, Teaching collocation: Further developments in the Lexical Approach, Language
Teaching Publications, Hove, UK, 2000.
[4] John M. Sinclair, Corpus, concordance, collocation, Oxford University Press, Oxford, UK, 1991.
[5] Anthony Paul Cowie, “Phraseology”, In The Encyclopedia of Language and Linguistics, edited by
R.E. Asher, pp. 3168-3171, Pergamon, USA, 1994.
[6] Batia Laufer, “The development of L2 Lexis in the expression of the advanced learner”, The
Modern Language Journal, vol. 75, no. 4, pp. 440-448, 1991.
[7] Michael Lewis, “Language in the lexical approaches”, In Teaching collocation: Further
developments in the lexical approach, edited by Michael Lewis, pp. 126-154, Language Teaching
Publications, London, UK, 2001.
[8] Victoria Hasko, “Capturing the Dynamics of Second Language Development via Learner Corpus
Research: A Very Long Engagement”, Modern Language Journal, vol. 97, S1, pp. 1-10, 2013.
[9] Jennifer Thewissen, “Capturing L2 Accuracy Developmental Patterns: Insights From an
Error-Tagged EFL Learner Corpus”, Modern Language Journal, vol. 97, S1, pp. 1-25, 2013.
[10] Nina Vyatkina, “The Development of Second Language Writing Complexity in Groups and
Individuals: A Longitudinal Learner Corpus Study”, Modern Language Journal, vol. 96, no. 4, pp.
576-598, 2012.
[11] Estelle Dagneaux, Sharon Denness, Sylviane Granger, “Computer-aided error analysis”, System,
vol. 26, no. 2, pp. 163-174, 1998.
[12] Sylviane Granger, Learner English on computer, Longman, London, UK, 1998.
[13] Sylviane Granger, “The contribution of learner corpora to second language acquisition and foreign
language teaching”, Corpora and language teaching, vol. 33, no. 13, 2009.
[14] Sylviane Granger, “Computer learner corpus research: current status and future prospects”,
Language and Computers, vol. 52, no. 1, pp. 123-145, 2004.
[15] Mike Scott, WordSmith Tools version 5, Lexical Analysis Software, Liverpool, UK, 2008.
[16] Bengt Altenberg, Sylviane Granger, “The grammatical and lexical patterning of MAKE in native
and non-native student writing”, Applied Linguistics, vol. 22, no. 2, pp. 173-195, 2001.
[17] Paul Lennon, “Getting ‘easy’ verbs wrong at the advanced level”, IRAL - International Review of
Applied Linguistics in Language Teaching, vol. 34, no. 1, pp. 23-36, 1996.
[18] Peter Andrew Howarth, Phraseology in English Academic Writing, Niemeyer, Tübingen, Germany,
1996.
[19] Batia Laufer, Tina Waldman, “Verb-Noun Collocations in Second Language Writing: A Corpus
Analysis of Learners' English”, Language Learning, vol. 61, no. 2, pp. 647-672, 2011.
[20] Lynne Flowerdew, A corpus-based analysis of referential and pragmatic errors in students’
writing, Hong Kong University of Science and Technology, China, 1999.
[21] Tania Ionin, Heejeong Ko, Kenneth Wexler, “Article semantics in L2-acquisition: the role of
specificity”, Language Acquisition, vol. 12, pp. 3-69, 2004.