Doing Corpus Linguistics
Contents

PART 1 Introduction to Doing Corpus Linguistics and Register Analysis
2 Register Analysis
  2.1 Why Register?
  2.2 What Is Register Analysis?
  2.3 Describing Situational Characteristics and Identifying Variables
  2.4 Providing a Functional Interpretation
  2.5 Units of Analysis and Register Studies
  2.6 End of Chapter Exercises
PART 2 Searches in Available Corpora
3 Searching a Corpus
  3.1 KWIC
  3.2 Collocates
  3.3 N-Grams
  3.4 POS Tags
PART 3 Building Your Own Corpus, Analyzing Your Quantitative Results, and Making Sense of Data
6 Basic Statistics
  6.1 Why Do Statistical Analyses?
  6.2 Basic Terms, Concepts, and Assumptions
  6.3 How to Go About Getting the Statistical Results
Index
byu.edu/). In this way, students can use the different corpora available on
this site and have multiple opportunities to work with a single search
interface. For the larger corpus project, we have focused on one freeware
program, AntConc, developed and supported by Laurence Anthony at
Waseda University (http://www.laurenceanthony.net/software/antconc/).
We acknowledge that the corpora and software used in this book are not
the only ones, but we feel that they provide a strong foundation for cor-
pus analysis and allow students to start “doing” corpus linguistics with
these readily available and user-friendly tools.
Because a good deal of corpus work involves quantitative data analysis,
we also included some elementary statistical information (Chapter 5) and
tests (Chapter 6). Keeping with one of the guiding principles of this book,
we see this introductory information as a way to have students learn the
basics of analysis with the hope that they may apply this knowledge in
other projects or explore more advanced statistical techniques
that they can use in the future.
There are many different descriptive and theoretical frameworks that
are used in corpus linguistics. We have selected one particular frame-
work to guide the students in their interpretation of their corpus findings.
Register analysis has strongly influenced our work and we believe that
this approach to understanding language use is broad enough to encom-
pass the various types of projects that students choose to do. In Chap-
ter 2, we outline the basics of a register analysis and then ask students
to refer to this same framework when building their corpus and doing
their corpus study. We recognize the relevance of other descriptive and
theoretical agendas but feel that focusing on a single approach provides
students with more extended experience interpreting their corpus results
and motivating the significance of their findings. Without knowledge and
practice using a specific framework, we have found that students can be
quite enamored with the “button-pushing” aspects of corpus linguistics
at the expense of interpreting the results of their searches.
We also recognize the importance of reporting on research in a cohe-
sive way. To this end, we have included material dedicated to the specifics
of writing a research paper and presenting research (Chapter 8). Our goal
in this chapter is to provide both students and teachers with some guide-
lines for how to demonstrate and present their specific research projects.
In the final chapter (Chapter 9), we ask students to consider more
advanced types of corpus research with the hope that this book will
serve as an introduction to the field and encourage students to pursue
these ideas at a more advanced level and perhaps even impact the field in
significant ways.
Acknowledgments
We would like to thank the following people who have influenced this
book. We first acknowledge the work of Douglas Biber at Northern Ari-
zona University and Susan Conrad at Portland State University. Their
work in corpus linguistics and register analysis has strongly influenced
this book. Also, Mark Davies at Brigham Young University and Laurence
Anthony at Waseda University have provided free online corpora and
tools that are vital to the structure of this book. Finally, we would like
to thank our students in the previous corpus linguistics classes we have
taught. It is through these experiences that we have seen the need for such
a book.
Part 1
Introduction to Doing Corpus Linguistics and Register Analysis
Chapter 1
Linguistics, Corpus Linguistics, and Language Variation
rules (and sometimes choose not to follow rules) for specific reasons even
though they may not be able to explain the rules themselves. An impor-
tant part of linguistic study focuses on analyzing language and explaining
what may seem on the surface to be a confusing array of facts that
may not make much sense.
When many people think of language rules, they may think of the
grammar and spelling rules that they learned in school. Rules such as
“don’t end a sentence with a preposition” or “don’t start a sentence with
the word and” are rules that many people remember learning in school.
Very often people have strong opinions about these types of rules. For
example, consider the excerpt below taken from a grammar website on
whether or not to follow the grammar rule of “don’t split an infinitive.”
Even if you buy the sales pitch for language being descriptive rather
than prescriptive, splitting infinitives is at the very best inelegant and
most certainly hound-dog lazy. It is so incredibly simple to avoid
doing it with a second or two of thought that one wonders why it is
so common. There are two simple solutions.
(1) “The President decided to not attend the caucus” can be fixed
as easily as moving the infinitive: “The President decided not to
attend the caucus.” I’d argue that works fine, and not using that
simple fix is about as hound-dog lazy as a writer can get, but split
infinitives can be avoided entirely with just a bit more thought.
How about:
(2) “The President has decided he will not attend the caucus.” What
the heck is wrong with that?
It’s hound-dog lazy, I say. Where has the sense of pride gone in writers?
(https://gerryellenson.wordpress.com/2012/01/02/to-not-split-infinitives/)
Examples such as these are not uncommon. One would only have to look
at letters to the editor in newspapers or at blog posts to find many more
instances of people who have very strong opinions about the importance
of following particular grammar rules.
So far, we have looked at “rules” as doing two different things: 1)
describing implicit, naturally occurring language patterns and 2) prescrib-
ing specific, socially accepted forms of language. Although both descrip-
tive and prescriptive perspectives make reference to language rules,
prescriptive rules attempt to dictate language use while descriptive rules
provide judgment-free statements about language patterns. Both prescrip-
tive and descriptive aspects of language are useful. When writing an aca-
demic paper or formal letter, certain language conventions are expected.
Table 1.1 Common right collocates of equal and identical in the Corpus of Contempo-
rary American English (COCA)
newspaper articles, letters, and fiction writing) but also written rep-
resentations of spoken language (such as face-to-face conversations,
sitcoms, or academic lectures).
The third and fourth characteristics of corpus linguistics make reference
to the importance of computers in the analysis of language as well as dif-
ferent analytical approaches. It would be hard to imagine how one might
use a 450-million-word corpus such as COCA without using a computer
to help identify certain language features. Despite the large number of
texts and the relative ease of obtaining numerous examples, a corpus
analysis does not only involve counting things (quantitative analysis); it
also depends on finding reasons or explanations for the quantitative find-
ings. In Chapters 3 and 4, we will cover some of the specific directions
we can explore in the corpus through software programs that allow for
different types of analysis. It is important to remember that corpus meth-
ods do not just involve using computers to find relevant examples; these
methods also focus on analyzing and characterizing the examples for a
qualitative interpretation.
In addition to these four characteristics of a corpus, Elena Tognini-
Bonelli, in her book Corpus Linguistics at Work (2001), also provides
some useful ideas in defining corpus linguistics. She describes the differ-
ences between reading a single text and using corpus linguistic tools to
investigate a collection of texts (i.e., a corpus). To illustrate this differ-
ence, we will make reference to one specific type of text: a newspaper
article. A single newspaper article is generally read from start to finish
so that the reader can understand the content of the text and relate it
to other points of view on a given topic. In English, the text is read
horizontally from left to right, and all texts from this perspective can
be viewed as a single communicative event or act (see Table 1.2). If one
were to compile a collection of 300 editorials and use a corpus approach
to analyze these texts, the ways these texts are used are quite different.
The texts in a corpus are not read from start to finish in a horizontal
manner as with a single news editorial; instead, the texts are a collection
of different (but related) events and are investigated not as whole but as
fragments, in the sense that many examples of a single feature are seen
in relation to each other. In this sense, the corpus is not read horizontally
but is instead read vertically—many examples of a particular language
feature are examined at one time.
A final point to consider when looking at corpus research relates to
various views that researchers may have about corpus linguistics. Elena
Tognini-Bonelli (2001) has made a distinction between “corpus-based”
research and “corpus-driven” research. In a “corpus-based” approach,
corpus linguistic researchers are guided by previous corpus findings or by
specific issues concerning language use. That is, researchers have a very
specific idea before searching the corpus as to what linguistic item they are
Table 1.2 Text and corpus comparison (based on Tognini-Bonelli, 2001)
looking for in a corpus. We have already seen an example of this with the
split infinitive discussion above. In this case, the perspective on the perils
of using a split infinitive was outlined in the article and this perspective
served as the basis for our corpus investigation on how split infinitives are
actually used. Given this prescriptive rule against splitting an infinitive,
we decided to see what a corpus, i.e., text samples of a naturally occur-
ring discourse, would tell us about how this rule is applied by language
users. Other examples include any linguistic feature that we know that we
want to search for, such as individual words or individual grammatical
items. In all instances, however, we already had an idea a priori (before
the fact) as to what we were going to search for in the corpus.
The other approach to finding out about language use that Tognini-
Bonelli has identified is through a “corpus-driven” method. In contrast
to the method described above, researchers following a “corpus-driven”
approach do not attempt to do corpus analysis with any predetermined
or fixed set of search criteria; instead, they use specific methods to see
what types of language patterns surface from the corpus. They extract
those patterns from the texts and document them, after which they
describe them and interpret them. Examples of research following this
approach are collocation and lexical bundle studies. Lexical bundles are
the most frequently occurring four-word sequences in a register. We cannot
know ahead of time what the most frequently occurring sequences are.
Therefore, we rely on special computer programs that can extract those
patterns from the corpus for us. Researchers then analyze them grammat-
ically (Biber et al. 1999) as well as functionally (Biber et al. 2004). Not
only lexical items can be searched this way; grammatical features can be,
as well. Typically, researchers look for co-occurring grammatical patterns
in texts to characterize registers that way. We will describe this method
briefly in the last chapter as a way forward to doing corpus linguistics.
In these types of studies, however, we do not have one specific language
feature in mind a priori.
to work with corpora, interpret data, and present findings (Chapter 4).
Once these basics of corpus analysis and an analytical framework have
been addressed, readers will be ready to build their own corpora and
conduct their own research study. Part 3 of the book takes you through
the steps of building a corpus (Chapter 5) and then covers some statistical
procedures that can be used when interpreting data (Chapters 6 and 7).
Chapter 8 then provides a step-by-step process for writing up and pre-
senting your research. Because this introductory book contains some of
the basics of how to conduct a corpus research project, we do not cover
many of the relevant issues that corpus linguistics is presently address-
ing in its research. In Chapter 9, we discuss some of these issues with
the hope that this book has taught you enough about corpus research to
pursue a more advanced study of the field.
References
Biber, D., S. Conrad & R. Reppen (1998) Corpus Linguistics: Investigating Lan-
guage, Structure and Use. Cambridge: Cambridge University Press
Biber, D., S. Conrad & V. Cortes (2004) ‘If you look at . . . : Lexical Bundles in
university teaching and textbooks’, Applied Linguistics 25/3: 371–405
Biber, D., S. Johansson, G. Leech, S. Conrad & E. Finegan (1999) Longman
Grammar of Spoken and Written English, New York: Longman
Tognini-Bonelli, E. (2001) Corpus Linguistics at Work, Amsterdam: John
Benjamins
Chapter 2
Register Analysis
a variable and looks at the different linguistic features that are found in
specific situations.
In some respects, a register perspective is similar to traditional sociolin-
guistic approaches. Both sociolinguistic variation and register variation
studies are interested in how social or situational characteristics relate to
language use; however, register analysis considers a wider range of factors
that are not only due to what are traditionally viewed as “social” factors
(e.g., age, gender, socio-economic status). For example, when looking
at potential differences between speaking and writing, the communica-
tive purpose and topic are likely not as socially conditioned as are other
components accounted for in register variation such as, the relationship
between participants. Seen from this perspective, register analysis takes
into consideration a broader range of factors that may include social
factors but may also include other factors, for example, topic, purpose
of communication, and mode of communication. Another difference
between sociolinguistics and register analysis relates to the linguistic fea-
tures under investigation. Sociolinguistic studies are generally focused
on a small number of language features that vary for purely social rea-
sons. This approach allows us to understand why some people use the
word grinder and others use the word hoagie. Register variation takes
a different view of language variation by using corpora to investigate
and identify different linguistic features. A register approach also uses a
different type of analysis to investigate how different linguistic features
co-occur in given situations of use. From this perspective, the focus can
either be on specific linguistic features or on the co-occurrence of multiple
features found in particular situations of language use. Because register
variation considers how linguistic features co-occur in a given context,
a corpus linguistic approach is well suited to register analysis because
corpora provide large amounts of authentic texts for analysis. In fact, it
would be hard to see how a register analysis could be achieved without
the use of corpora. Looking at a smaller number of texts would likely not
provide a representative sample of language use to allow for a character-
ization of a given register.
The relevance of the register perspective in corpus linguistics relates
closely to the definition of corpus linguistics discussed in Chapter 1.
Recall that corpus linguistics includes both quantitative and qualitative
analysis. While the quantitative information can sometimes be fairly easy
to obtain (after all, many times all one has to do is push a few buttons
to obtain results!), proposing reasons for the quantitative information
can be more challenging. One way to interpret the quantitative results
(but not by any means the only way) is through register analysis. Because
register variation serves as the basis of much of the corpus analysis in the
remainder of the book, we will take a closer look at the steps in a register
analysis in the next section.
not necessarily common. When one reads (or hears) “Once upon a time
in a land far, far away” we understand that a fairy tale is likely to follow
because this phrase is used to identify a particular genre. From a style
perspective, the focus is more on the language employed by individual
writers or speakers, or by historical factors reflective of the language of a
particular time period. When Shakespeare wrote “What light through
yonder window breaks” in Romeo and Juliet, his choice of words not
only came from his much-celebrated ability to use language but was also
reflective of the type of language used in late 16th-century England. In
this soliloquy, Romeo does not say something like “Hey, a light came on
in the window!” We can think of a register as a variety of language, and
register analysis as a method of understanding the relationship between
language context and language use, regardless of any individual “style.”
1) Participants
2) Relations among participants
3) Channel
4) Production circumstances
5) Setting
6) Communicative purposes
7) Topic
Figure 2.1 Situational variables: participants (speaker/writer; hearer/listener; onlookers); relationships among participants (interaction; social roles; personal relationships; shared knowledge); channel (speech, writing, gestures; permanence: recorded, written vs. transience: spoken); setting (shared space and time; private vs. public; present vs. past); production circumstances (planned vs. unplanned; edited); communicative purposes (narrate; inform; entertain); topic
Table 2.1 Key situational differences between an email to a friend and an email to a
superior (boss) (Biber and Conrad 2009: 65)
Table 2.2 Situational differences between news writing and news talk
Table 2.3 Key situational differences between student presentations at a symposium and
teacher presentations in the classroom (Csomay 2015: 4)
Table 2.4 Texts for news writing and a news talk (COCA)

News writing:
A playwright hikes into the woods with his laptop, works until the battery runs down and then hikes back. A theater company sets up huge pieces of scaffolding that double as musical instruments. An author sits in a cabin, working on a new play, wearing nothing at all.
“When I’m in New York, there are neighbors across the way,” says Cassandra Medley, the playwright who takes au naturel to heart. “I just like to shed and think and feel.”
The opportunity to return to a “primeval, primordial” state is one reason that Ms. Medley—and other authors—love attending the growing network of summer retreats where writers and others who work in the theater get away from the urban grind and try to reconnect with their muses.

News talk:
A: Except that Obama is likely to give it to them. I mean, that is the case. They’re expecting Barack Obama to inject the enthusiasm into the Republican base.
B: One of the great phrases that has been used in defense of venture capitalism and Bain Capital is Schumpeter’s creative destruction. Whenever I hear Republicans say that, I want to say, you know what America has been looking for five years at a lot of destruction, creative and non-creative. They’re not going to like that defense. They’re going to like a defense that says, guess what, I can create jobs, I have a plan, we can move this thing forward, we can save our country. Treatises on the essential nature of capitalism, I think, won’t do it for Mr. Romney.
referring to oneself as well as by including verbs such as think, and that
suitability for candidacy would be expressed through verbs such as can
and have. Another relevant contextual variable is related to the mode of
production. In the spoken mode, people often refer to themselves (and
others) in the conversation. Further support for this is found in the news
writing excerpt where the two instances of a first person pronoun occur
in a direct quotation. This short analysis does not mean that first person
pronouns do not occur in written contexts, but we would not expect
them to occur at the same frequency nor for the same reasons.
Classroom teaching:
So what I’m suggesting to you then, is, is that this second dynamic, which accounts for the popularity, the contemporary popularity of civilian review, has to do with money, and civil liability, and the ways in which the behavior of law enforcement institutions can, render, municipalities liable for millions and millions and millions of dollars, uh, in, uh, civil liability lawsuits. Not only that, usual contingency, um, uh, rules, are waived in these kinds of lawsuits. All right? What that means is that usually, when you pursue a civil claim, against somebody, you ask for a hundred thousand bucks, OK? And, you get it, and your lawyer takes a third. All right? What happens if you sue a municipality and they say yeah we think you’re right but [short laugh] the situation was so much more complicated, we award, one dollar, OK? Is your lawyer gonna take thirty three cents? or in these kinds of lawsuits, right?

Student presentation:
And we found that an immature cynofields resides in the kidney that’s where we found the most cells with those characteristics and I interpreted that we found also oh . . . oh . . . a relationship for those cynofields but those were more mature. We can say that because . . . The electro-microscopy results with that we can see the morphology and chronology and this is a human cynofield with a transmission electronic microscopy of the human cynofield and we did with a zebrafish we found very similar morphology that granules are round as same as the human ones and the nucleus is big at this stage so we found the cell that looks like cynofields so now we want to study its function we study the function by migration of recommendation to the infection and then we see they change their morphology. So we know that cycles-sum in human cynofields includes information response and we inject the fish with the cycles-sum we let them live for 6 hours in order to provide an order response and then to (syll) we sacrifice the single cell suspension and within the facts analysis of photometry and those are our results. We found we use a control also and we can see in the control the populations of cynofields are in not increase as dramatically with the one that we injected we cycle-sum and it was 20% more of population of those cell that we found in this gate.
teacher uses more repetitions (hence the lower ratio number) while the
presenter is conveying the information without that many repetitions.
the seven situational factors described in Figure 2.1 would you attri-
bute to traditional sociolinguistic factors?
4. Look back on the situational analysis presented in 2.3.1 and think
about what linguistic features are worthy of closer investigation.
Before you focus on one or two features for comparison, look at the
text in Table 2.6 and identify features found in one text but not in the
other.
Notes
1 Since these text segments are uneven in length (one is 157 words long and the
other is 241 words long), we had to scale the raw frequency counts as if both
texts were 100 words long. To do this, we need to norm the feature count
with a simple statistic: (Raw feature count/actual text length)*desired text
length. We will discuss this technique more in subsequent chapters.
2 To calculate type-token ratios, we take the number of word types in a text
and divide it by the total number of word tokens (running words). If a word is
repeated, it counts as a new token but not as a new type. For example, in the following two sentences,
there are 10 tokens (i.e., number of words) and 8 types (because ‘the’ and ‘cat’
are repeated): He saw the cat. The cat was in the garage.
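The two notes above amount to a couple of one-line computations. A minimal sketch (the type count here is case-sensitive, which is one way to arrive at the 8 types in the example above):

```python
import re

def normed_count(raw_count, text_length, basis=100):
    """Note 1: scale a raw feature count as if the text were `basis` words long."""
    return raw_count / text_length * basis

def type_token_ratio(text):
    """Note 2: number of distinct word types divided by number of word tokens."""
    tokens = re.findall(r"\w+", text)  # case-sensitive: 'The' and 'the' differ
    return len(set(tokens)) / len(tokens)

# Norming: a feature occurring 5 times in a 157-word text vs. a 241-word text
print(round(normed_count(5, 157), 2))  # 3.18 occurrences per 100 words
print(round(normed_count(5, 241), 2))  # 2.07 occurrences per 100 words

# The example sentences: 10 tokens, 8 types
print(type_token_ratio("He saw the cat. The cat was in the garage."))  # 0.8
```

Norming makes the two uneven texts directly comparable: the same 5 raw hits correspond to different rates per 100 words.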
References
Biber, D. (1988) Variation across Speech and Writing, Cambridge: Cambridge
University Press
Biber, D. & S. Conrad (2009) Register, Genre and Style, Cambridge: Cambridge
University Press
Biber, D., S. Conrad, R. Reppen, P. Byrd & M. Helt (2002) ‘Speaking and writ-
ing in the university: A multidimensional comparison’, TESOL Quarterly 36:
9–48
Biber, D., S. Conrad & V. Cortes (2004) ‘If you look at . . . : Lexical Bundles in
university teaching and textbooks’, Applied Linguistics 25/3: 371–405
Biber, D., S. Johansson, G. Leech, S. Conrad & E. Finegan (1999) Longman
Grammar of Spoken and Written English, New York: Longman
Cortes, V. & E. Csomay (2015) Corpus-based Research in Applied Linguistics.
Studies in Honor of Doug Biber. Amsterdam: Benjamins
Csomay, E. (ed) (2012) Contemporary Perspectives on Discourse and Corpora.
Special issue of Corpus Linguistics and Linguistic Theory 8/1.
Csomay, E. (2013) ‘Lexical bundles in discourse structure: A corpus-based study
of classroom discourse’ Applied Linguistics 34: 369–388
Csomay, E. (2015) ‘A corpus-based analysis of linguistic variation in teacher and
student presentations in university settings’, in Cortes, V. and E. Csomay (eds)
2015: 1–23
Duranti, A. (1985) ‘Sociocultural dimensions of discourse’, in Teun van Dijk (ed)
1985: 193–230
Fortanet, I. (2004) ‘The use of ‘we’ in university lectures: Reference and func-
tion’, English for Specific Purposes 23: 45–66
Halliday, Michael A.K. (1978) Language as Social Semiotic: The Social Interpre-
tation of Language and Meaning. London: Edward Arnold
Hymes, D. (1974) Foundations of Sociolinguistics: An Ethnographic Approach,
Philadelphia: University of Pennsylvania Press
Wulff, S., U. Römer & J. Swales (2012) ‘Attended/unattended this in academic
student writing: Quantitative and qualitative perspectives’, in Csomay, E. (ed)
2012: 129–157
Part 2
Searches in Available Corpora
Chapter 3
Searching a Corpus
3.1 KWIC
3.2 Collocates
3.3 N-Grams
3.4 POS Tags
When researchers use corpora for their analyses, they are interested in
exploring the use of lexical items or certain grammatical constructions.
They may also investigate lexical or grammatical patterns to see how
variation in those patterns may relate to different contexts. In register
studies, as we saw in the previous chapters, by contextual differences we
basically mean differences in particular aspects of the situational charac-
teristics in the construction and production of a given text.
In this chapter, we will use the Corpus of Contemporary American English
(COCA—corpus.byu.edu/coca/) to illustrate the most commonly identified
units of language that researchers use for their analyses: words, collocations,
n-grams/lexical bundles for lexical patterns, and (POS) tags for grammatical
patterns. We will illustrate how to identify these units of language by pro-
viding different tasks that will give you practice in searching and analyzing
these units of language. This chapter then is divided into four sections: 1)
KWIC (keyword in context); 2) Collocations; 3) N-grams; 4) POS tags.
Before we can start, you will need to register with COCA. It is free of
charge, but as a student, you will have more search options and you will
have more opportunities for searching than the very limited number of
daily searches available to those who don’t register. Go to COCA and, at
the upper right-hand corner, click on ‘Register’. While registering does not
require you to pay any money, you may still wish to donate to the site to
support the excellent work they do.
The keyword that we are searching for is ‘say’. As we see on the chart
in Figure 3.1 (on the left hand side in the figure under ‘ALL’), our key-
word occurs a total of 426,546 times in the entire corpus; that is, on
average 918.611 times in a million words, including all five registers in
the corpus. We can also see the distributional patterns across the regis-
ters. Our interest is in three registers: spoken, newspaper, and academic
prose. It is apparent, perhaps not surprisingly, that our keyword occurs,
on average, most often in spoken discourse (1,924.90 times in a mil-
lion words) under ‘SPOKEN,’ followed by newspapers (872.18) under
‘NEWSPAPER,’ and finally academic prose (239.06) as shown under
‘ACADEMIC.’ Since we have not yet specified what part of speech cat-
egory we are looking at, and ‘say’ can be both a verb (He said he was
going to be late) and a noun (He has the final say in the matter), these
numbers include all of these options.
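The per-million rates in the chart come from a simple norming step: the raw count divided by the size of the (sub)corpus, scaled to one million words. A sketch with made-up round numbers (not COCA's actual totals):

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to a rate per million words."""
    return raw_count / corpus_size * 1_000_000

# Hypothetical: 500 hits in a 2-million-word subcorpus
print(per_million(500, 2_000_000))  # 250.0
```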
If we were interested in how our keyword is used in the immediate
textual context, we would look at the bottom part of what is displayed
in Figure 3.1 (see Figure 3.2). Now, as you see, our keyword is not just
an individual word but is placed back into its textual environment. The
program randomly selects examples from the corpus and lists each occur-
rence of our keyword together with a window of text around it. This kind
of listing is called ‘concordance lines’.
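Concordancers do this at scale, but the underlying mechanics fit in a few lines. A toy sketch (not the COCA implementation; the context window here is measured in words, and the sample text is invented):

```python
import re

def kwic(text, keyword, window=4):
    """List each occurrence of keyword with `window` words of context per side."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

sample = ("They say that corpus work is slow, but researchers say "
          "the results are worth it.")
for line in kwic(sample, "say"):
    print(line)
```

Each returned line centers the keyword in brackets, mimicking the vertical KWIC display described above.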
When we use concordance lines, we look for individual (or a group
of) pre-selected keywords to see them in context. The most common
form to display a keyword in context (KWIC) is through concordance lines.
Table 3.1 Distribution of part of speech categories following the word ‘say’
Table 3.2 Distribution of part of speech categories following the word ‘say’ in spoken
discourse and in academic prose
Table 3.3 Raw and normed frequency counts for ‘said’ and ‘stated’ in three registers
SAY/SAID STATE/STATED
• To state an action in the past, as in ‘She said that he had gone to the
party’
• To state a completed action with a result relevant in the present (pres-
ent perfect), as in ‘He has stated his rights already’
• To state action completed in the past prior to a past action (past per-
fect), as in ‘I had said that before’
• To modify a noun (adjectival function), as in ‘the stated example’
Fill in Table 3.4 below with your numbers. Are there any differences in
the functions across registers?
3.2 Collocates
As mentioned in Chapter 1, collocates are two words that occur together
more often than we would expect by chance. More specifically, the dic-
tionary defines collocate as “be habitually juxtaposed with another
with a frequency greater than chance” (Webster’s). Firth (1951) coined
the term ‘collocation’ to refer to two words that go together. Some
words, even though they may mean roughly the same thing, may not go
together. In the following example, in each instance, we want to express
the fact that something went bad. However, while we often say rancid
butter, we rarely, if at all, say *sour butter, and, in fact, that sounds
odd to some ears. The latter two words do not seem to go together in
English. The same is true for sour milk, which is a collocate, while *ran-
cid milk is not. Another example could be strong tea vs. *powerful tea
and powerful computers vs. *strong computers. There are many further
examples of two words going together. These could be partially or fully
fixed expressions and they are used in particular contexts. Although
these examples show adjective+noun combinations only, it is not the
case that only these two part of speech categories go together. Other
collocate types are, for example, noun+noun (e.g., bus stop) combina-
tions, verb+noun (e.g., spend money), verb+prepositional phrase (e.g.,
account for), and so on.
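The idea of occurring together "more often than we would expect by chance" can be made concrete with an association score such as pointwise mutual information (PMI). The sketch below is illustrative only: the toy word list and the min_count threshold are our own assumptions, not values from COCA or any corpus discussed here.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information,
    log2(P(w1,w2) / (P(w1) * P(w2))): higher scores suggest the pair
    co-occurs more often than chance alone would predict."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs; chance estimates are unreliable
        p_pair = count / (n - 1)
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return scores

# toy example: 'rancid butter' recurs as a pair; 'the' combines freely
tokens = "the rancid butter and the sour milk sat by the rancid butter".split()
scores = pmi_bigrams(tokens)
```

In a real study the scores would be computed over millions of words, and measures such as t-score or log-likelihood are common alternatives to PMI.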
to bring up all the examples where these two words occur together in
texts. Click on ‘time’, and examine the first 100 samples provided to
you. In which register does the collocation spend (v) + time occur
most frequently?
3.3 N-Grams
Most often, n-grams in linguistics are sequences of words explored as a
unit, where the value of n denotes how many words there are in that unit.
If the basic unit of analysis is a word, then we can call a word a uni-gram
(1-gram). If we have two words to consider as a unit, they are bi-grams
(2-grams), and if we have three words as a sequence in a unit, it will be a
tri-gram (3-gram), and so on.
A special computer program is often designed to process the text and
to look at sequences of words in a window of text as they emerge in a
corpus. Depending on how “big” your unit is (i.e., how many words in a
sequence you want to trace at a given point in time), the window size is
set accordingly. That is, if you want to identify bi-grams, you will capture
each two-word sequence in the corpus. If you are looking for tri-grams,
you will capture each three-word sequence, and so on. As you are doing
so, the already identified sequences are tracked and the new sequences are
constantly checked against what the program had already found. Each
time the same word-sequence is found, the program counts the frequen-
cies of that sequence. We can explore how frequently particular word
combinations (n-grams) occur together in a corpus or how they are dis-
tributed across different registers.
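The sliding-window procedure just described can be sketched in a few lines of Python; the sample sentence below is invented for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Slide an n-word window across the token list and count each
    distinct word sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# invented sample
tokens = "in a way and in a way only".split()
bigrams = count_ngrams(tokens, 2)   # every two-word sequence
trigrams = count_ngrams(tokens, 3)  # every three-word sequence
```

Each window position yields one sequence, so the program never has to store the whole corpus twice: it simply increments a counter each time a sequence recurs.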
If you know ahead of time what sequences you are looking for, you
can simply type the sequence into the search engine. In this case, you
are not looking for whatever n-grams may emerge from that corpus;
instead, you are looking for a pre-defined sequence of words (that
may, or may not, have been extracted from a previous corpus). For
example, if you type in the word baby, you will see that it occurs over sixty
thousand times in the COCA corpus. But you picked the word ahead of
time, so you knew what to look for. If you are interested in the sequence
a baby, it occurs more than ten thousand times in the same corpus, and if
you are interested in the sequence like a baby, you will see that it occurs
more than six hundred times in COCA. At the same time, the four-word
sequence sleep like a baby only appears twenty-five times in the same cor-
pus. In all of these instances, however, you have typed in the words that
you were interested in.
In contrast, if you don’t know ahead of time what you want to look
for, but you are interested in the possibilities a corpus may have, you can
either design a new computer program for yourself and run your own
data, or run the n-gram program already available to you (e.g., through
A random sample text will appear in the text box that will then be ana-
lyzed for its vocabulary make-up. Choose ‘word’ versus phrase and click
on the ‘search’ button. On the right hand side, you will find the same text
but now marked up with different colors. The color each word gets will
depend on where the word ranks as a frequently occurring word in the
database. (See Figure 3.4.)
As you see, there are three frequency bands. Words marked with the
color blue are among the top 500 most frequently occurring words in the
corpus. The green ones rank as the top 501–3,000 most frequently occurring
words in the corpus, and the yellow ones mark less commonly used
vocabulary (ranked beyond 3,000). Be careful not to interpret these as
frequencies because they are rank numbers. The most interesting words,
however, are the red ones, which tell you which words are academic words.
These words are distributed especially widely across academic texts.
More precisely, the following explanation is given on the same website:
Figure 3.4 Text analysis based on vocabulary frequency in Word and Phrase
Searching a Corpus 45
If you click on any of the words in the list, it gives you information about
that one word in terms of its distributional patterns across the different reg-
isters, provides a definition of the word and its collocates, and, finally, pulls
up examples from the COCA corpus in concordance lines. (See Figure 3.7.)
3.3.3 Tri-Grams
Any three words occurring together in the same sequence are known as
tri-grams. Some are complete structural or semantic units such as by the
way or in a way, and can be treated as fixed expressions with a specific
meaning. Others may be a semi-structural unit such as by way of, which is
missing one element coming after it in the phrase, or have a gap in meaning
such as what would you. Most recent studies have explored the possibilities
Step 1: Click on ‘chart’ and ‘search’. This will give you the distribu-
tional patterns of this construction (with any nouns) following
the tri-gram. Then click on KWIC and select ‘Spoken’ from the
choices. This will provide you with examples from spoken dis-
course. Once you have 100 randomly selected samples, classify
the nouns into semantic categories (you may want to go to Biber
et al.’s [1999] Longman Grammar of English or some other
descriptive grammar of English for examples of semantic catego-
ries of nouns).
Step 2: Follow the same procedures as in Step 1 but now select ‘Aca-
demic’ discourse to get examples from that register. Classify the
nouns into semantic categories. Is there a difference between spo-
ken and academic discourse in the kinds of nouns that are used
after by way of?
Figure 3.9 Distributional patterns of the four-gram ‘on the other hand’ in COCA
column under spoken. This way all your examples will be from the spo-
ken sub-registers in COCA. (See Figure 3.10.)
Take each one of the 100 samples you get this way and classify each
bundle according to its position in the sentence. You can use the follow-
ing categories: a) sentence initially; that is, when the bundle is the first
four words in the sentence or utterance (for spoken); b) sentence finally;
that is, when the bundle is the last four words in the sentence or utterance
(for spoken); and c) neither sentence initially nor sentence finally; that is,
when neither a) or b) applies. Use Table 3.5 to record the results.
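Once you have the samples as plain text, a classification like this can also be scripted. The helper below is our own sketch, not a feature of COCA; it assumes simple space-separated sentences with only light punctuation.

```python
def bundle_position(sentence, bundle="on the other hand"):
    """Classify where a four-word bundle sits in a sentence:
    'initial', 'final', or 'medial' (neither initial nor final)."""
    # strip common punctuation off each word before comparing
    words = [w.strip(".,?!;:") for w in sentence.lower().split()]
    b = bundle.split()
    if words[:len(b)] == b:
        return "initial"
    if words[-len(b):] == b:
        return "final"
    return "medial"

# invented examples
bundle_position("On the other hand, prices fell.")           # initial
bundle_position("Prices, on the other hand, fell sharply.")  # medial
```

Hand-checking a sample of the automatic classifications is still advisable, since punctuation and utterance boundaries in spoken transcripts can be messy.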
As a next step, do the same with academic prose to get your sam-
ples from that register. Then classify each of the 100 examples as one of
the three categories above. Finally, calculate the percent value for each.
(Note: Since you had 100 observations for each register, the percent values
and the raw counts are the same.) Reset the sample size to 200, run it
again, and see whether your results are similar.
As a final note to this section, the longer the n-gram, or the
word-sequence, the less frequently it will occur simply because n-grams
are embedded in one another. For example, in the four-word sequence
on the other hand, the three-word sequences of on the other and the
other hand are both present. These would be counted as two separate
three-word sequences.
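This embedding is easy to verify programmatically; the small helper below is our own illustration.

```python
def embedded_ngrams(sequence, n):
    """List every n-word sub-sequence embedded in a longer sequence."""
    words = sequence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

embedded_ngrams("on the other hand", 3)
# ['on the other', 'the other hand']
```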
Figure 3.10 Concordance lines for the four-gram ‘on the other hand’ in COCA
Table 3.5 Distribution of the sentence position of ‘on the other hand’ in spoken and
written discourse
in he, she, it, their, him, etc.; determiners [d], such as this, that, etc.; or
conjunctions (both clausal and phrasal), such as and, but, because, etc.,
and many more.
Whether a POS belongs to the open or closed class, there are endless
possibilities to search for association patterns, as shown in Chapter 2.
As we have also seen in that chapter, the co-occurring patterns of these
categories are the most interesting types of studies from the perspective of
register variation because they are able to provide us with more compre-
hensive and detailed analyses of texts. Some corpora (e.g., COCA)
have POS tags attached to each word, while others (e.g., the Michigan
Corpus of Academic Spoken English, MICASE, and the Michigan Corpus
of Undergraduate Student Papers, MICUSP) contain only the actual
words, without tags. Some scholars find searching on POS tags more
difficult; others write their own computer programs to process and
count the different grammatical patterns through those tags.
when you compare the frequencies of these words in spoken and written
discourse, you want to compare the instances in both cases only when
they are used as verbs. Otherwise, you would be comparing apples with
oranges. Luckily, COCA is marked up with POS tags, and so you could
just uncheck the boxes for all other possible POS categories (see ‘Part of
Speech’ buttons) that you are not interested in when searching for the
verb function in your comparison.
Second, let’s have a look at how this would broaden your options of
looking for patterns—that is, how this will give you more options and
more flexibility in your search. These main POS categories identify the
word as you type it into the search box—that is, orthographically only.
Through the tags, however, we are able to look for variation within POS
categories as well. The tags, for example, allow you to look for a given
word in different word classes, such as can as a noun and can as a modal
verb. Another example of what tags can do could be if you would like to
look at a specific verb but you are only interested in the past tense forms,
or if you want to search for exact examples of a particular noun but
only in its plural form. To do this type of search, you can use POS codes,
such as [v?d*] (e.g., ate, drank, looked, etc.) and [*nn2*] (e.g., houses or
backyards), respectively.
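Outside a web interface, the same kind of tag-based search can be run over any POS-tagged text. The sketch below assumes a hypothetical one-token-per-pair "word_TAG" format with CLAWS-style tags (the tag set used in COCA); the tagged sentence itself is invented.

```python
import re

# invented sentence in a hypothetical "word_TAG" format, with
# CLAWS-style tags ('vvd' = past-tense lexical verb,
# 'nn2' = plural common noun)
tagged = "they_pp looked_vvd at_ii the_at houses_nn2 and_cc drank_vvd tea_nn1"

past_verbs = re.findall(r"(\w+)_v\wd\b", tagged)    # rough analogue of [v?d*]
plural_nouns = re.findall(r"(\w+)_nn2\b", tagged)   # rough analogue of [*nn2*]
```

The two regular expressions play the role of the [v?d*] and [*nn2*] codes: the tag, not the spelling of the word, is what the pattern matches.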
Now click on ‘COCA’ on the right hand side of the wordandphrase.info
site, enter, log in, and type [v?d*] in the search window. This search
will provide you with all the past tense verbs in the corpus ranked by fre-
quency. Now, if you try [*nn2*], what do you think you will get? (See
all POS tags used in COCA via the help menu.)
The third possibility is to see how often a specific word occurs in
texts in its different word forms. In the previous examples, we looked
at say as a verb. If you did the same search as in ‘WordandPhrase’ but
in COCA, you would be typing the following string into the search box:
say.[v*]. This indicates that you want to search and see the word say in
this form and when it is a verb. What if you want to find out how often
the verb say is used in the past tense, because you would like to compare
the past tense use of say across four registers, or to see the syntactic
position of the past tense use of say? Type the following into
the search box: [say].[v*] (note the brackets around say). What kind of
information do you get? This time, if the word is in square brackets, you
work with the lemma of the word (i.e., the base form of the word), but
you allow the inflections to be listed as well (such as past tense
markers, the third person ‘s’ marker, etc.). What you see then is that the program
divides the frequencies of the lexical verb say into the different verb
forms, including past tense and third person singular, etc., and reports
on the results accordingly. Click on the link ‘SAID’ which will pull up
all the examples of the word with this form, should you need textual
examples.
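The grouping of surface forms under a base form can be sketched as follows. The tiny hand-made lemma dictionary below is a stand-in assumption; a real analysis would use a full lemmatizer or a lemmatized corpus.

```python
from collections import Counter

# tiny hand-made lemma dictionary; a real analysis would use a full
# lemmatizer or a lemmatized corpus
LEMMAS = {"say": "say", "says": "say", "said": "say", "saying": "say"}

def lemma_frequencies(tokens, lemma):
    """Count each surface form belonging to a lemma, the way a search
    on [say].[v*] groups 'say', 'says', 'said', ... under one base form."""
    return Counter(t for t in tokens if LEMMAS.get(t) == lemma)

# invented sample
tokens = "he said it and she says it and they say it".split()
freqs = lemma_frequencies(tokens, "say")  # counts per surface form
```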
The fourth option could be that you have an n-gram (e.g., a lexical
bundle) and you want to see what kinds of words precede or follow that
sequence. Let’s try this with if you look at.
Step 1: Go to COCA. Type in the search box if you look at [nn*], click
on ‘chart,’ and then ‘KWIC.’ This will give you all the possible
nouns following the bundle if you look at in the concordance lines
and the frequency of this sequence in all of the registers in COCA.
Step 2: Go to a descriptive English reference grammar such as Biber et al.
(1999) Longman Grammar of English or Biber et al. (2002) Student
Grammar of Spoken and Written English and make a list of
the categories for nouns (pp. 56–64 in the 2002 book). Now, look
at the nouns that follow the lexical bundle you searched for, and
classify the nouns in the categories you have set up. Can you see
any patterns? That is, can you see any one semantic category that
seems to occur more often than others? Report on your results.
Step 3: Take 30 randomly selected samples for the newspaper listing and
30 randomly selected samples from the spoken corpora listing.
Compare the kinds of nouns following the bundle if you look
at. Based on the examples you see, can you draw any conclu-
sion whether, in your sample, there is evidence that one register
may use more of certain types of nouns than the other with this
bundle?
Step 4: Finally, just for fun, type in the search box if you look at [n*], and
click on ‘chart’ and then ‘KWIC.’ How did your list change from
the one you generated by doing Step 1 above?
Note
1 Please note that these numbers are normed counts. Chapter 8 explains how
normed counts are calculated.
References
Biber, D. (2009) ‘A corpus-driven approach to formulaic language in English:
Multi-word patterns in speech and writing’, International Journal of Corpus
Linguistics 14: 275–311
Biber, D. & F. Barbieri (2007) ‘Lexical bundles in university spoken and written
registers’, English for Specific Purposes 26: 263–286
Biber, D. & S. Conrad (1999) ‘Lexical bundles in conversation and academic
prose’, in Hasselgard, H. & S. Oksefjell (eds) 1999: 81–190
Biber, D., S. Conrad & G. Leech (2002) Longman Student Grammar of Spoken
and Written English. New York: Longman
Biber, D., S. Conrad & V. Cortes (2004) ‘If you look at . . . : Lexical Bundles in
university teaching and textbooks’, Applied Linguistics 25/3: 371–405
Biber, D., S. Johansson, G. Leech, S. Conrad & E. Finegan (1999) Longman
Grammar of Spoken and Written English, New York: Longman
Connor-Linton, J. (2012, March) ‘Multiple perspectives on analysis of webtexts’,
Colloquium presentation at Georgetown University Round Table, Georgetown
Cortes, V. (2014, September). ‘Analyzing the structures, semantic prosodies,
and semantic preferences of lexical bundles in research article introductions’,
Paper presented at the Conference for the American Association of Corpus
Linguistics (AACL), Flagstaff, Arizona
Cortes, V. (2015) ‘Situating lexical bundles in the formulaic language spectrum’,
in Cortes and Csomay (eds) 2015: 197–216
Cortes, V. & E. Csomay (2015) Corpus-based Research in Applied Linguistics.
Studies in Honor of Doug Biber. Amsterdam: Benjamins
Csomay, E. & V. Cortes (2014, September). ‘Lexical bundles in cybertexts’ Paper
presented at the Conference for the American Association of Corpus Linguis-
tics (AACL), Flagstaff, Arizona
Firth, J.R. (1951) Modes of Meaning. Essays and Studies of the English Associa-
tion [NS 4], 118–149
Hasselgard, H. & S. Oksefjell (eds) (1999) Out of Corpora. Amsterdam: Rodopi
Chapter 4
I’ve literally spent more holidays with the animals than I have with
my own family. (example taken from COCA)
There’s a story about this in this book and it blows my mind, literally.
(example taken from COCA)
Using COCA, determine the variation in use of both the literal and figura-
tive sense of the word literally by completing the following steps:
Step 1: Develop and state your method for determining whether the word
falls into the “literal” category or the “non-literal” category. Do
you find any other senses of the word that do not fit into your
two categories? If so, describe the extra category or categories
you have found and provide examples to support each category.
Projects 61
Step 1: Using COCA, identify at least six cases where this suffix is used
with a noun. For each, note its first appearance; then note when
each term was most frequently used and whether it is still used
today.
Step 2: Using both the British National Corpus (BYU-BNC) and the Stra-
thy Corpus, determine whether this suffix is found in American
English only or whether it is also found in British and/or Canadian
English.
Step 3: Interpret your results: What are some possible reasons for your
findings? Are there other examples of prefixes or suffixes that are
specific to varieties of English?
In this project, we will take a close look at some of the clichés Har-
graves mentions in his book. Complete the following steps:
What do these frequencies tell you about the relationship between these
clichés and American and British English?
1. Are there specific registers that use these phrases more frequently
than other registers? What might be some reasons for any differences
you find?
2. Given your working definition of a cliché, would you define any or
all of the six examples as clichés? Are there other examples that are
more representative of clichés?
categorically deeply entirely far-reaching massively
You will use COCA, the British National Corpus (BYU-BNC), and
GloWbE. Complete the following steps in the project.
Step 1: For both COCA and the British National Corpus (BYU-BNC),
determine the most common word that follows each of the terms
above. You can do this by using the “COLLOCATES” function;
set the span to “0” on the left pull down menu and “1” on the
right pull down menu.
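A span of 0 on the left and 1 on the right amounts to counting the word immediately following each search term. A minimal sketch, with an invented token list:

```python
from collections import Counter

def right_collocates(tokens, node):
    """Count the word immediately following each occurrence of a node
    word, i.e., a collocate span of 0 on the left and 1 on the right."""
    return Counter(tokens[i + 1] for i, t in enumerate(tokens[:-1]) if t == node)

# invented token list
tokens = ("deeply rooted beliefs and deeply held views "
          "and deeply held convictions").split()
right_collocates(tokens, "deeply").most_common(1)  # [('held', 2)]
```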
In this project, you will use both COHA and COCA to investigate these
different meanings of the word sustainable (and its noun counterpart,
sustainability) over time and across registers. Complete the following
steps:
Step 1: Using COHA, note the first 50 occurrences of the adjective sus-
tainable. For each use, provide the date of occurrence and note
which of the three definitions provided above best fit with the
occurrence of the word. Make sure to provide examples from the
corpus to support your analysis of their meanings. Is one use of
sustainable more prevalent than other uses of it? Is there a ten-
dency for the meaning to change over time?
Step 2: Using COCA, note the register distribution of the adjective sus-
tainable. In which registers is sustainable most common? In which
registers is sustainable less common? Are there specific meanings
of sustainable that are representative of specific registers? Pro-
vide some reasons for any register or meaning differences that
you find. Make sure to support your analysis with examples from
the corpus.
Step 3: This part of the project asks you to look at the meanings and
register distribution of the noun sustainability. According to the
online site “Environmental Leader”:
one for third person singular subjects (is). Note that for simple past tense,
there is no difference for the regular verbs and only two distinctions for
the verb be (was for first person and third person singular and were for
all other subjects).
Sometimes certain types of nouns—collective nouns—are ambiguous
with respect to the rule of subject verb agreement. Consider, for example,
the two sentences below:
• audience
• committee
• crowd
• team
• family
Step 1: Using only the verb be (e.g., is, are), determine if there is a
different distribution of singular (is) or plural (are) agreement for
each of the five nouns above. Are there general patterns found
in all five collective nouns in subject position or do the nouns
show different patterns? What specific search terms did you use
to obtain your results?
Step 2: Do you find the same patterns with verbs other than be? What
specific search terms did you use to obtain your results?
Step 3: Do you notice any register differences in the agreement patterns?
That is to say, do some registers use one agreement pattern more
or less frequently than another register? Are there register differ-
ences with respect to any of the specific collective nouns? That is
to say, are some collective nouns in subject position more frequent
in specific registers than other collective nouns?
Step 4: How would you interpret your findings? Are there specific reasons
for any similarities or differences you have found?
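If you export your corpus samples as plain text, the tally in Step 1 can be sketched with a simple pattern match. The sample sentences below are invented, and a real analysis would also need to handle adverbs and other material intervening between the noun and the verb.

```python
import re

def agreement_counts(text, noun):
    """Count '<noun> is' (singular agreement) vs '<noun> are' (plural
    agreement), a rough stand-in for the corresponding corpus search."""
    singular = len(re.findall(rf"\b{noun} is\b", text, flags=re.IGNORECASE))
    plural = len(re.findall(rf"\b{noun} are\b", text, flags=re.IGNORECASE))
    return {"is": singular, "are": plural}

# invented sample text
sample = ("The team is winning. The team are arguing among themselves. "
          "My family is here.")
agreement_counts(sample, "team")  # {'is': 1, 'are': 1}
```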
Remember:
(a) My father is up all night going to the loo which keeps both him
and my mother awake.
Step 1: Determine the search term(s) you will use to find examples of
going to followed by a verb (e.g., going to [v*]).
Step 2: Use COHA and determine when this use came into the language.
What patterns of development do you find? Are there certain
verbs that tend to occur with going to? Have these verbs changed
over time?
Step 3: Are there differences in the way the two forms of going to and
gonna are used in COHA? What are some possible reasons for
any differences you may find?
Step 4: Using COCA, do you see any register differences in the use of
going to and gonna as a modal verb? If so, what are the differ-
ences you see? Try to provide some explanation for any register
differences you may find.
(a) Chekesha was third at district last year and barely missed going
to regionals.
(b) The study subjects were asked to estimate the surface normal at
many points on the drawings.
(examples from COCA)
Note that the infinitive clause is not a possible complement of miss (*. . .
missed to go to regionals) and the gerund clause is not a possible comple-
ment of ask (*. . . asked estimating the surface normal at many points on
the drawings). There are other verbs that allow both gerund and infinitive
clauses as complements. The verb start, for example, allows both, as seen
in (c) and (d).
(c) When you start looking for things in love, there’s nothing left
to find.
(d) Hair can start to lose its luster and look dull and ashy. . .
(examples from COCA)
In this project, you will look at three verbs (begin, continue, and start)
that allow both gerund and infinitive clauses in COCA. Complete all
steps in the analysis.
Step 1: Come up with search terms that will allow you to find both of the
complementation patterns. (The “POS” function will be helpful
for this exercise.) What search terms did you use?
Step 2: Using COCA, report on the complementation patterns of the
three verbs (begin, continue, and start). How do the three verbs
compare in their complementation patterns?
Step 3: For each of the three verbs, determine whether there are register
differences in the patterns.
Step 4: What reasons account for the variation of complementation pat-
terns in these three verbs?
In your answer, you should also include a discussion of begin and start.
One might expect that because begin and start are virtually synonymous,
they might pattern the same. Is this true? If they do not pattern the same,
what are some possible reasons for the difference?
Step 1: Using COCA, provide the overall frequency of all three coordina-
tors with all four grammatical categories (there will be a total of
12). You can use the “POS List” drop down menu for your search
terms. For example, to find “Verb and Verb” you would use the
search term “[v*] and [v*]”.
Step 2: Rank the grammatical categories with each coordinator from
most frequent to least frequent. Do all coordinators pattern the
same or do they pattern differently?
Step 3: For each of the 12 coordinator and grammatical category types,
do you find any register differences? What are some potential rea-
sons for any register differences you find?
This letter (as well as Abby’s response to it) illustrates the social impact
that synonymous words may have on some people. “Thrifty in Texas”
seeks advice to help his wife understand that their friend Sally did not
intend to offend her. Abby's response indicates that Sally merely chose
the incorrect word and perhaps should have selected a synonym of
“cheap” that indicated the positive aspects of saving money. This project
will consider 1) what a corpus might tell us about the meanings of the
adjectives cheap, frugal, and thrifty; and 2) whether these three synonyms
share similar syntactic characteristics. Complete all steps in the analysis.
References
Azar, B. & Hagen, S. (2009) Understanding and Using English Grammar (4th
ed.). New York: Pearson Longman.
Biber, D. (1995) Dimensions of Register Variation: A Cross-linguistic Perspective.
Cambridge: Cambridge University Press.
Hargraves, O. (2014) It's Been Said Before: A Guide to the Use and Abuse of
Clichés. New York: Oxford University Press.
Swann, M. (2005) Practical English Usage (3rd ed.). Oxford: Oxford University
Press.
Part 3
This chapter will take you through the steps to complete a corpus proj-
ect. By reference to a specific research question, you will learn how to
build your own corpus and then analyze it using both the register analysis
approach covered in Chapter 2 and the corpus software programs cov-
ered in Chapter 3. You will also learn how to use AntConc in this chapter.
researchers than to try to build one on your own. Building a good corpus
involves a number of steps that are described below.
Before covering the steps in corpus building, we should acknowledge
potential copyright issues. Chances are, you will draw the texts for
your corpus from the World Wide Web. In doing so, you will need to
carefully consider your selection of materials and
the potential copyright infringement issues that relate to compiling and
storing digital texts. Additionally, it is important to take into account
the country in which the corpus materials are used. Different countries
have different copyright rules. What might be considered a copyright
infringement in one country may not be considered so in another coun-
try. If you are using the corpus for educational purposes and do not
plan on selling the corpus or any information that would result from
an analysis of the corpus (e.g., in publications), the likelihood of being
prosecuted as a copyright violator is usually small. Nevertheless, you
should take into account the following guidelines when building your
own corpora:
• Make sure that your corpus is used for private study and research for
a class or in some other educational context.
• Research presentation or papers that result from the research should
not contain large amounts of text from the corpus. Concordance
lines and short language samples (e.g., fewer than 25 words) are pref-
erable over larger stretches of text.
• When compiling a corpus using resources from the World Wide Web,
only use texts that are available to the public at no additional cost.
• Make sure that your corpus is not used for any commercial purposes.
time thinking seriously about what you want to research and the reasons
for conducting the research; i.e., the motivation for the study. Selecting
an appropriate research issue is not a trivial matter. A well-motivated
research project not only contributes to the knowledge of the field but it
will also hold your interest for the duration of the project (and perhaps
even beyond the duration of the project). The corpus you will build will
depend on your research topic. If you are not clear what you want to
investigate, it will be difficult to build a relevant corpus for the purpose
of your research!
When deciding on a topic, you should choose a topic that is not only
interesting to you but also potentially relevant to others. In other words,
if you were to tell someone about your research topic and they were to
say “So what?” you should have a well-reasoned response. To answer
this question, you should choose a topic that you feel will contribute to
the understanding of how and why language forms may vary in certain
contexts. Some possible general topics could be:
in the corpus. A project that considers how news writing changes in dif-
ferent time periods would, obviously, require a corpus that includes
newspaper articles written at different periods of time. In order to build
a corpus to address this issue, you would need to make sure that there
is an adequate number of newspaper articles that have been written at
different time periods. Additionally, a corpus addressing this issue would
need to have sub-corpora of relatively equal size. As illustrated in some of
the corpus projects that compared COCA with the BYU-BNC corpus, the
unequal sizes of these two corpora did not allow for straight frequency
comparisons between the two corpora. Thus, corpus “balance” is a key
aspect of reliable corpus building. Note that the balance should consider
not only the number of texts in each sub-corpus but should also consider
the word count of the sub-corpora.
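Because corpora and sub-corpora differ in size, raw counts cannot be compared directly; normed counts (explained in Chapter 8) put them on a common scale. A minimal sketch, with invented numbers:

```python
def normed_frequency(raw_count, corpus_size, per=1_000_000):
    """Convert a raw count into a rate per `per` words so that corpora
    of different sizes can be compared directly."""
    return raw_count / corpus_size * per

# invented numbers: the same raw count is twice the rate
# in a corpus half the size
normed_frequency(500, 10_000_000)  # 50.0 per million words
normed_frequency(500, 5_000_000)   # 100.0 per million words
```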
Frequency comparisons are done on the basis of the number of
words, not by the number of texts. If your specialized corpus also con-
tains a sub-corpus component, then you should take care to make the
sub-corpora of fairly equal sizes. Another issue related to corpus balance
in your corpus relates to text types. A news writing corpus would need to
include the various types of news texts—sports and lifestyle news as well
as state, local, national, and international news. If only one of these types
of texts is included then the sample might not account for variation in
the different types of news texts. A balanced news writing corpus would
either include texts of sports, lifestyle, and general news texts or would
select only one of these text types for analysis.
Table 5.1 below lists examples of projects and the types of corpora that
would need to be built to address specific research topics. Notice that
some of the corpora consist of sub-corpora that are investigated sepa-
rately in order to determine possible variation. Corpora looking at gen-
der differences in song lyrics, fitness articles, or romance novels include
sub-corpora of texts written by or intended for different genders. The
same sub-corpora approach is also seen in projects that investigate lan-
guage variation in blog posts by Christians and Atheists and news talk
shows from different political perspectives or newspaper articles written
in different countries. Other types of corpora do not have sub-corpora
attached to them. For example, a project that compares the language of
Facebook posts to different registers would compare the linguistic fea-
tures of the “Facebook Corpus” with different registers of use in existing
corpora.
Once you have located relevant texts that can be used to build a bal-
anced specialized corpus to address a specific research issue, you will need
to prepare the text to be read by a software program such as AntConc.
Different types of texts have different types of character encoding associ-
ated with them. If you use texts from the World Wide Web, the texts will
likely be in Hypertext Mark-up Language (HTML). A text that is read
Building Your Own Corpus 81
Table 5.1 Examples of projects and the corpora needed to address them

Gender and song lyrics | Song lyrics written by men and women
Facebook as a register? | Facebook posts (compared to registers in COCA)
Gender and romance novel authors | A corpus of romance novels written by men and women
Gender differences in fitness articles | A corpus of fitness articles written by men and by women
Variation in English: Newspaper language in the United States and in China | A corpus of English newspaper articles written in the United States and in China (in English)
Language differences in religious and non-religious blogs | A corpus of blogs written by Christians and Atheists
Simplification of daily news reports for the ESL learner | A corpus of newspaper articles written for the general public and for second language learners of English
Comparison of language in Pakistani, British, and American newspaper English | A corpus of newspaper articles in American, British, and Pakistani English
Linguistic bias in the media | A corpus of news talk shows by Rachel Maddow and Glenn Beck
You are not very good at parking. You’re just not. Unless you happen
to be a middle-aged gentleman from China called Han Yue. If you are
indeed Mr Yue, then (a) welcome to TopGear.com, and (b) congratu-
lations for obliterating the record for the tightest parallel park . . . in
the wooooorld. Again.
(TopGear 2014)
If this file were to be a part of your corpus and you were to save this text
in an HTML format, the HTML code would be a part of the text and
would include information related to web and text design that is not seen
in the text. The text that is not part of the actual text to be examined is often called “meta-text,” and it is typically placed between angled brackets < . . . >. The
file would look something like this:
for obliterating the record for the tightest parallel park . . . in the
wooooorld. Again.</span></p><p><span style="background-color:
#888888;"> (source:http://www.topgear.com/uk/car-news/parking-
world-record-video-2014-20-11)</span></p>
As you can see from the HTML example above, in addition to the actual
text information in the file, there is also extra material that is not rel-
evant to the text to be analyzed. A similar type of extra information is
also present in other types of files, such as Microsoft Word documents.
In order to take this superfluous information out of the text, you will
need to convert any text that you collect into a text file (a file with the
extension “.txt”). The “.txt” format removes all of the mark-up language
found in many other file extensions and allows a software program such
as AntConc to find textual patterns instead of other patterns related to
format or font type. There are different ways you can convert a file into
a text file. If you are collecting texts from the World Wide Web, you can
cut and paste each text into a text file and then use the ‘save as’ option in
order to ensure the file is saved in a plain text format. This same method
works if you have a Microsoft Word document. If you are dealing with
many texts (as we hope you are, in order to build an adequate and repre-
sentative corpus), saving each file into a different format can be tedious
and time consuming. Alternatively, a number of file conversion programs are available on the World Wide Web free of charge, such as AntConverter. These programs will let you convert HTML, .doc/.docx, or .pdf files into .txt files in a single step.
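If you prefer to automate the cut-and-paste route yourself, the stripping can be scripted. The sketch below uses only Python's standard library and is a minimal illustration of the idea, not a replacement for a dedicated converter:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the readable text of a page, dropping all mark-up."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0  # depth inside <script>/<style>, whose content is not text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def html_to_txt(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

# A fragment like the one shown above, reduced to plain text:
sample = '<p><span style="background-color: #888888;">wooooorld. Again.</span></p>'
print(html_to_txt(sample))  # wooooorld. Again.
```

The result can then be written out with `open("text001.txt", "w", encoding="utf-8")` so that AntConc sees only the words.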
A final and extremely important aspect of corpus building involves
naming and placing your files so you are able to identify them. To return
to the fictitious news writing corpus project described above, a corpus used to analyze this issue may have 600 total texts, with 200 from each of three different time periods. For the purposes of analysis, you would want to have each set of 200 texts available as a sub-corpus so you could load each of the sub-corpora separately to see if the files in one sub-corpus varied from those in another. This would mean that you would need a way to know which file goes with which sub-corpus. One way to go about doing this
is to come up with a simple coding scheme that allows you to clearly iden-
tify each text as a separate text but also as part of a larger group of texts.
As a simple example, a proposed coding scheme would use a six-number
system in which the first two numbers provide information on the time
period of the text and the last four numbers provide the number of the
text in each time period.
This coding scheme would allow you to clearly identify the text with
“010001” being the first text in time period A, “010002” being the sec-
ond text in time period A, and so on. In order to ensure that the text
numbers relate to the characteristics of each text, each text will also have
the relevant header information described above. Note that this entire
corpus—call it A Historical Corpus of American News Writing—would
consist of three sub-corpora related to each of the three time periods. All
of the files in a single time period would be available in a single folder so
that each sub-corpus could be loaded separately. Depending on different
research questions, the corpus could also be loaded with all three time
periods. Note that if the files follow a consistent labeling practice, you can easily determine the time period of any file from its name.
An alternative way to name files would be to use transparent file names
with ‘word strings’ instead of numbers. This way, the file names are immediately transparent, and information about the extra-textual features of
the files can be accessed easily. If you choose to do this, you will need
to make sure that the filename length is the same even though you may
not have information in a particular category (for easier processing). For
example, “news_15_election_00” would mean a news text from 2015
about an election in 2000.
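Either scheme can be generated and decoded programmatically. Here is a minimal sketch of the six-number system (the function names are ours, chosen for illustration):

```python
def make_filename(period, number):
    """Six-digit code: two digits for the time period, four for the text number."""
    return f"{period:02d}{number:04d}"

def parse_filename(code):
    """Recover (time period, text number) from a coded file name."""
    return int(code[:2]), int(code[2:6])

print(make_filename(1, 1))       # 010001 (first text in time period A)
print(make_filename(1, 2))       # 010002 (second text in time period A)
print(parse_filename("030200"))  # (3, 200) (text 200 in time period C)
```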
Depending on your research question and the design of your corpus,
you may also want to include specific information on the situational char-
acteristics of each text in individual text files in your corpus. Specific
information such as the length of each text (by number of words), the
topic, or the type of each text (if you included different types within a
specific genre) can also be included in individual text files. This type of
information is not a part of the text analysis but it can provide important
interpretive information used in the functional analysis of your results. In
a corpus of general song lyrics, you would want to include lyrics from dif-
ferent types of music (rock, rap, country, popular music, etc.) in order to
achieve balance in your corpus. To be able to identify these different types
of song lyrics, you could either come up with a system of naming each
file (as described above) or you could include some of this information in
each text file. Because you do not want this information counted as part
of the linguistic characteristics of your text, you can put this or any other
relevant information that you do not want to be a part of the linguistic
analysis into angled brackets (< >). This type of information is included
in the “headers” of the text but will not be read by the concordance
software. Thus, each individual text file can include relevant header and other extra-textual information as well as the text itself.
Figure 5.1 below is taken from a corpus of argumentative writing by
university students who do not speak English as a first language. There are
six different headers that specify (in order): 1) gender; 2) age; 3) native lan-
guage; 4) degree program; 5) location of university; and 6) topic of essay.
In order to ensure that the header information is not included in the
text analysis, the software program needs to know that anything in
the angled brackets should be ignored in the text analysis. This can be
achieved by indicating that anything in the angled brackets should be con-
sidered extra information, or “tags.” An example of this is provided for
AntConc (see more on this in the next section) as seen in Figure 5.2 below.
The left bracket is identified as the start of the tag information and the
right bracket is used as the end of the tagged information. This essentially
tells the software program to ignore all the information that is included
between the brackets as it does the textual processing and analysis.
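To see what separating header tags from text amounts to, here is an illustrative Python sketch. The field names are our own labels for the six categories shown in Figure 5.1 (the files themselves contain only the bracketed values, in a fixed order); AntConc simply ignores the tagged material:

```python
import re

# Our labels for the six header categories in Figure 5.1, in order.
FIELDS = ["gender", "age", "native_language", "degree", "university", "topic"]

def split_header(raw):
    """Separate the <...> header tags from the text to be analyzed."""
    tags = re.findall(r"<([^<>]+)>", raw)
    body = re.sub(r"<[^<>]+>", "", raw).strip()
    return dict(zip(FIELDS, tags)), body

raw = "<Female>\n<21>\n<Thai>\n<BA>\n<LRU>\n<Thai women>\nEssay text begins here."
header, body = split_header(raw)
print(header["gender"], header["topic"])  # Female Thai women
print(body)                               # Essay text begins here.
```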
In this section, we have outlined the procedure for building a specialized corpus to answer a specific research concern, including the basic issues in designing the corpus so that it addresses that concern. We have also covered methods for saving texts in a format that is readable by the software program. Additionally, we have illustrated how separate text files can be uniquely identified so that different parts of the corpus (the sub-corpora) can be loaded and compared to each other. Finally, we have considered ways of adding extra information to each file so that specific characteristics of each text in the corpus can be identified.
Figure 5.1 Sample header information in a student essay file
<Female>
<21>
<Thai>
<BA>
<LRU>
<Thai women>
However you go about constructing your corpus, you should use the following questions to guide the process:
each one of these as they are very useful, we will mainly focus on two here: AntWordProfiler for lexical analyses and AntConc for lexical as well as grammatical analyses, as these are the most pertinent for analyzing your corpus.
5.4.1 AntWordProfiler
The function of Anthony’s word profiler is very similar to what we saw
with WordandPhrase, except for two main differences: 1) You can use as
many texts as you want at once for an analysis; and 2) instead of using
COCA as the background or monitor corpus, this one uses two other
word lists (General Service List by Michael West, 1953, and Nation’s
academic word list) on which vocabulary frequency bands are based. (See
Figure 5.3.)
Download the Help file from www.laurenceanthony.net/software/
antwordprofiler/releases/AntWordProfiler141/help.pdf. After reading it,
answer the following questions:
Figure 5.3 AntWordProfiler using three word lists for vocabulary frequency comparison
1. What kind of information can you get to know about your text(s)
through the Vocabulary Profile Tool?
2. What kind of activities can you do through the File Viewer and
Editor tool?
3. What would the different menu options do?
Project 5.1: Let’s say you are interested in finding out about the differ-
ences in the way vocabulary is used in a Wikipedia page and your own
term paper on the same topic. Take one of the papers that you have writ-
ten for another class and save it as a text file. Then search for the same
topical area on Wikipedia, and copy the text, saving it into a text file.
Read both texts into AntWordProfiler, and run the program twice,
once on each individual file. Note the differences you see between your
text and the Wikipedia text in the following areas:1
a. Number of lines
b. Number of word types
c. Number of word tokens
d. Number of word tokens that fall into each of the three
vocabulary bands
e. Percentage of word coverage from the various vocabulary bands
If you remember, at the WordandPhrase site, each text was marked with
a different color depending on the frequency band it belonged to. Can
we achieve that kind of marking with this tool? If so, how? Would
you modify your text to use less or more frequent words? In what
situation(s) do you think modifying words in a text this way could be
a useful tool? What other kinds of information can you obtain through
this tool?
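If you would like a feel for what the basic counts involve, a rough Python approximation of the line, token, and type counts is sketched below (real profilers tokenize far more carefully, so treat this only as an illustration of the concepts):

```python
def profile(text):
    """Rough line, token, and type counts of the kind a word profiler reports."""
    lines = text.count("\n") + 1
    tokens = [w.strip(".,!?;:'\"").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    return {"lines": lines, "tokens": len(tokens), "types": len(set(tokens))}

# The two-sentence example from the chapter's endnote:
print(profile("He saw the cat. The cat was black."))
# {'lines': 1, 'tokens': 8, 'types': 6}
```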
5.4.2 AntConc
The function of Anthony’s concordance program is very similar to what
we saw at the main interface of COCA. Using your own corpus, you
should be able to do KWIC searches through the concordance lines, as well as other types of lexical and grammatical analyses of your own texts.
Once again, download the Help file to get an overview of what is possible with this particular program (Anthony 2014).
Clearly, this program is capable of facilitating some of the same kinds of analyses COCA did, but with your own texts. Among those analyses are: KWIC (keyword in context), n-grams, collocates in a particular text, and word lists. In addition, this program is able to show you how a word (or collocate, lexical bundle, or any n-gram) spreads within each of your texts (a concordance plot), something we are generally unable to do with already existing corpora simply because the tool is not available for them.
Read in (i.e., upload) your corpus through the ‘File’ menu (‘Open files’), type any search word that you would like to find out about in your text(s) into the search box, and hit the ‘start’ button to get a KWIC concordance line. (See Figure 5.4.)
If you press the ‘Sort’ button, the words following the keyword will be
in alphabetical order. It is important to keep in mind that the colors in
AntConc do not denote part of speech categories as they do in COCA;
they simply show first and second and third place after the keyword. (See
Figure 5.5.)
If you click on your keyword in the concordance lines, it will show you
the larger textual window (see Figure 5.6).
As mentioned above, if you click on the ‘Concordance plot’ tab, you
will get a view of the spread of your keyword in each of the files you
uploaded as part of your corpus (see Figure 5.7).
Figure 5.4 Searching your own corpus: Concordance lines in AntConc for the word ‘and’
Figure 5.5 Sorting in AntConc
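The KWIC display itself is conceptually simple. The following sketch is a bare-bones illustration of what a concordancer does, minus AntConc's sorting, coloring, and regular-expression support:

```python
def kwic(text, keyword, window=4):
    """Keyword-in-context: list every hit with `window` words of context per side."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower().strip(".,!?;:'\"") == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + window + 1])
            hits.append(f"{left} [{w}] {right}".strip())
    return hits

text = "You are not very good at parking. You're just not."
for line in kwic(text, "not", window=3):
    print(line)
# You are [not] very good at
# parking. You're just [not.]
```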
Note
1 Word tokens are each running word in a text, while word types are each distinct word in a text. For example, in the following text (two sentences), there are eight word tokens and six word types (the and cat are repeated, so they count as one type each): He saw the cat. The cat was black.
References
Anthony, Laurence (2014) AntConc (Windows, Macintosh OS X, and Linux)
http://www.laurenceanthony.net/software/antconc/releases/AntConc343/
help.pdf
TopGear (2014, November 20) http://www.topgear.com/uk/car-news/parking-
world-record-video-2014-20-11
Chapter 6
Basic Statistics
Example
You read in an educational journal article that lectures in small classes
are more ‘personable’ than lectures in large classes. As there is no linguistic evidence provided in the article for this claim, you want to find out for yourself.
You decide that you will use first person pronouns (I, we) as a measure
of ‘personable.’ You are interested in whether the frequency of first per-
son pronouns (I, we—and all their variants) changes at all when you
Variable Scales
“Nominal scales” (also called categorical, discrete, discontinuous scales)
are variables measuring categories. They are used in naming and cat-
egorizing data in a variable, usually in the form of identity groups, or
memberships. The variable could occur naturally (e.g., sex, nationality)
or artificially (e.g., experimental and control groups), but in all cases there is a limited number of categories. They represent non-numeric categories (e.g., religion, L1, ethnicity). When numbers are assigned to these categories, the numbers carry no numeric value; they are only category identifiers (e.g., there are two sexes: 1 = male, and 2 = female).
“Ordinal scales” are used to order or rank data. There is no fixed inter-
val, or numeric relationships in the data other than one is “greater than”
or “lesser than” the other. No fixed interval means that we don’t know
whether the difference between 1 and 2 is the same as between 4 and
We are unable to tell which text has more nouns because it is only
in relation to the verbs that we might have more nouns. That is, ratios
measure how common one thing is, but only in relation to another, potentially unrelated, thing.
Observations
Observations are individual objects that you are characterizing. They
provide the unit of analysis that will make up your data. For register
studies, an observation is typically each text that you enter into your
database. For other linguistic studies, it could be the individual linguistic
feature you are considering.
Example
Let’s assume you are interested in how complement clauses are used by
younger and older generations and also how they are used by people with
different educational backgrounds. You are using a corpus to look for pat-
terns. Take 100 instances of complement clauses and mark each for who
uses them in terms of age and educational background (hopefully, this
information will be available in the corpus). You can use other contextual
variables as well, but the focus should be on the two variables you iden-
tified. Instead of listing your findings in a table exemplified by Table 6.1
below, you list the individual cases in a table exemplified in Table 6.2 below.
Table 6.2 is preferable because, given the way the information is presented in Table 6.1, we are unable to distinguish between the two variables. That is, we cannot separate the effect of one independent variable from that of the other. A research design is confounded when the variables measure the same thing; that is, based on the data in Table 6.1 above, we cannot say whether the frequency of complements is due to age or to level of education.
Undergrad 50 10 2 2
MA 5 20 10 10
PhD 0 3 20 3
2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5
Yes, the mode is 3, because that is the score that occurs most frequently
(six times), versus 2 (twice), 4 (twice), and 5 (once).
There are, however, problems with using mode as a measure of central
tendency, namely:
• If there is no single most frequent score (for instance, if 2 and 4 above occurred with the same highest frequency, we could not tell what the mode is), there is no unique mode.
• If every ranked score in the dataset occurs only once (i.e., no score has a frequency higher than one), there is no mode.
• The mode is too sensitive to chance scores (when a mistake is made
in entering the scores).
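The caveats above can be captured in a few lines of Python. This sketch returns None whenever no (unique) mode exists:

```python
from collections import Counter

def mode_or_none(scores):
    """Return the mode, or None when no single mode exists."""
    counts = Counter(scores).most_common()
    if counts[0][1] == 1:
        return None  # every score occurs just once: no mode
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # two or more scores tie for most frequent: no unique mode
    return counts[0][0]

print(mode_or_none([2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5]))  # 3 (occurs six times)
print(mode_or_none([2, 2, 4, 4, 5]))                    # None (2 and 4 tie)
print(mode_or_none([1, 2, 3, 4, 5]))                    # None (all occur once)
```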
17 18 | 19 20 20 20 21 | 22 23
25th 50th 75th
Boxplots are typically used as visuals to show the range of scores (mini-
mum and maximum), the 25th, the 50th (median), and the 75th percen-
tile. A boxplot is also able to show outliers in the dataset. In the example
below, we display the use of nouns by teachers and students in the corpus.
25th percentile (on box plot: lowest line where the box starts)
50th percentile (median—on box plot: thick black line in box)
75th percentile (on box plot: highest line where the box finishes)
The range, the percentile figures, and the outlier are in Figure 6.1. Box-
plots do not tell you the individual scores.
“Mean” (X̄, or “x bar”) only works for interval scales. Add up all the values (x) and divide by the number of cases or observations (N) to get the arithmetic average of all scores:

X̄ = Σx / N

That is, X̄ equals the sum of x divided by N.
Figure 6.1 Boxplot showing the minimum and maximum scores and the 25th, 50th, and 75th percentiles
While the mean is the best measure of central tendency for interval
scores, it is at times problematic because it is too sensitive to extreme
scores. If extreme scores enter the equation, they throw the mean off so much that it can no longer serve as the measure of central tendency.
Example
Let’s say you are interested in finding out whether undergraduate stu-
dents majoring in natural sciences use fewer ‘hedges’ than students in
the humanities or in the social sciences. You look at a corpus of student
presentations that includes nine presentations from each area. The data
below shows the mean scores for each of the presenters in each of the three areas. It illustrates how much one score in the dataset can change the value of the mean, which is the measure of central tendency.
In group A, you have the mean scores for social science students.
A: 0 1 2 20 20 20 97 98 99 (357/9)
In group B, the mean scores are listed for natural sciences students.
B: 17 18 19 20 20 20 21 22 23 (180/9)
In group C, the mean scores are listed for humanities students.
C: 17 18 19 20 20 20 21 22 99 (256/9)
As you can see, the mode (most frequent scores) and median (the cen-
tral score after rank ordering all scores) are the same for all three groups.
However, the mean becomes vastly different depending on the actual
scores in the dataset. In Group A, the scores vary a great deal. Social sci-
ences students use hedges in an idiosyncratic way; that is, it really depends
on the individual. Some students use none or very few, and some use a
lot! When this is true, the mean is relatively high (especially in compari-
son with the others). In Group B, the scores are not going into extremes.
Instead, they are pretty evenly distributed. That is, the students in this
group more or less use hedges the same way, or at least very similarly. In
Group C, students overall use hedges similarly but there is one student
who hedges a lot. That one student changes the mean score dramatically.
The two mean scores in Groups B and C are very different (20 in Group B and 28.4 in Group C), and that difference is due to only one score. In the mean score, one outlier makes a big difference.
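You can verify these figures with Python's statistics module (Group C is the third group in the example above):

```python
from statistics import mean, median, mode

groups = {
    "Group A (Social Sciences)":  [0, 1, 2, 20, 20, 20, 97, 98, 99],
    "Group B (Natural Sciences)": [17, 18, 19, 20, 20, 20, 21, 22, 23],
    "Group C":                    [17, 18, 19, 20, 20, 20, 21, 22, 99],
}
for name, scores in groups.items():
    print(f"{name}: mean = {mean(scores):.2f}, "
          f"median = {median(scores)}, mode = {mode(scores)}")
# Group A (Social Sciences): mean = 39.67, median = 20, mode = 20
# Group B (Natural Sciences): mean = 20.00, median = 20, mode = 20
# Group C: mean = 28.44, median = 20, mode = 20
```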
The characteristic of a normal distribution is that the mode, the median, and the mean are identical. Group B above is an example. And if you look at the variability of the scores, you can see that they are spread evenly and in order. There are no outliers or extreme scores in the dataset. The scores are simply normally distributed.
For example, the range of scores for Group A in the example above is
99, for Group B it is 6, and for Group C it is 82. The problem here is the
same as with the mean scores, as it changes drastically when you have
more extreme scores (as you see in the examples). Since it is unstable,
it is rarely used for statistical reporting, but range could be informative
as a piece of additional information (e.g., to see whether there is an
outlier).
“Quartile” (interquartile) or percentile measures tell us how the scores
are spread in different intervals in the dataset. As outlined above, the
median is a measure of central tendency, and by adding the interquartile
figures (the percentile figures), we are able to see the spread as well. Once
again, the 25th percentile tells us what scores we would get for a quarter
of our data, the 50th percentile tells us what score we would get for half
of the data and the 75th percentile refers to the score we would get for
three-quarters of the data.
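For the nine scores used in the percentile illustration above (the Group B data), Python's statistics module reproduces the quartile boundaries. Note that different programs, SPSS included, may use slightly different interpolation methods, so small discrepancies are normal:

```python
from statistics import quantiles

scores = [17, 18, 19, 20, 20, 20, 21, 22, 23]  # Group B from the example above
q25, q50, q75 = quantiles(scores, n=4)  # Python's default 'exclusive' method
print(q25, q50, q75)  # 18.5 20.0 21.5
```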
“Variance” summarizes the distance (i.e., how far) individual scores
are from the mean. Let’s say our mean is 93.5 (X̄ = 93.5). If we have a score of 89 (x = 89), that score is 4.5 points away from the mean, and that is its deviation from the mean. In this instance, we have discussed only one score. However, what we want is a
measure that takes the distribution and deviation of all scores in the data-
set into account. This is the variance.
To compute the variance, take the deviation of each individual score from the mean, square each deviation, add the squared deviations up (their total is often called the ‘sum of squares’), and average them over the dataset by dividing by the number of observations minus one. As a formula, it looks like this:
variance = Σ(x − X̄)² / (N − 1)

where x is each individual score and X̄ is the group mean.
Example
Imagine you would like to find out whether one class is more interactive
than another. As Csomay (2002) did, you define interactivity by the num-
ber of turns taken in a class and by how long those turns are (in terms of
number of words). You look at two lecture segments to compare. Each
has 5 turns, and for simplicity, each segment has a total of 150 words in it. Here are the numbers:
Lecture #1: 5 turns, a total of 150 words, average turn length 30 words, every turn exactly 30 words long.
X̄ = 30
Lecture #2: 5 turns, a total of 150 words, average turn length 30
words, turn length varies for each turn.
Turn 1: 2 words
Turn 2: 140 words
Turn 3: 2 words
Turn 4: 3 words
Turn 5: 3 words
Total = 5 turns, 150 words
Average turn length: 30
X̄ = 30
In both instances, the average turn length is 30 words, which is the mea-
sure of central tendency. But it is clear that one lecture is very differ-
ent from another in terms of turn length measures. By calculating the
standard deviation for each, we are able to tell the spread in the scores;
that is, whether the scores are close to each other or they vary, and if the
latter, how much they vary (in terms of magnitude measured by a single
number). For Lecture #1, the standard deviation is 0, and for Lecture #2,
it is 61.49. A standard deviation of zero says that there is no variation in the scores at all, and 61.49, being very high, tells us that there is great variation in the scores.
Does this tell us which lecture is more interactive? If we think that rela-
tively shorter turns are making the class more interactive, then Lecture
#1 is more interactive. If we think that longer stretches of turns coupled with two- or three-word turns are more interactive, then it is Lecture #2.
Lecture #1 looks to be the best candidate simply because the number of
turns and the turn length measure together tell us that people would have
more opportunity to express actual ideas rather than just agree to what is
happening with one or two words at a time (see Csomay 2012 for short
turn content).
In sum, the larger the standard deviation, the wider the range of dis-
tribution is away from the measure of central tendency. The smaller the
standard deviation, the more similar the scores are, and the more tightly
the values are clustered around the mean.
To calculate the standard deviation, all you need to do is take the square root of the variance (explained above); that is, standard deviation = √variance. For the scores above in Lecture #2, the squared deviations sum to 15,126, the variance is 15,126 / 4 = 3,781.5, and the standard deviation is √3,781.5 = 61.49.
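As a quick check of the Lecture #2 arithmetic in Python:

```python
from statistics import variance, stdev

turns = [2, 140, 2, 3, 3]  # turn lengths (in words) for Lecture #2
print(variance(turns))         # 3781.5 (sum of squares 15126 divided by N - 1 = 4)
print(round(stdev(turns), 2))  # 61.49  (the square root of the variance)
```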
Why do parametric tests? Parametric tests are more powerful than non-parametric tests because a) they have predictive power (i.e., we can predict that if we followed the same procedures and did the study the same way, we would gain the same results), and therefore b) the results are generalizable (i.e., we can generalize that the results hold for the larger population the samples are drawn from).
(a) What group are you dealing with? What collection of things (not
only people)?
(b) What happens? What are the outcomes? What result are you looking for?
(c) What influences the results?
(a) They are phrased in the form of statements (rather than questions).
(b) Their statements show specific outcomes.
(c) They need to be testable.
In looking for differences, our null hypothesis will be stating that there
is no difference between two or more groups (independent variables) with
respect to some measure (dependent variable). (These are typically para-
metric tests.)
Alternative hypotheses:
Type 1 error: The researcher rejects the null hypothesis when it should not have been rejected.
Type 2 error: The researcher accepts the null hypothesis when it
should have been rejected.
The probability value (p, which is evaluated against a significance level, alpha) basically tells you how certain you can be that you are not committing a Type 1 error. When the probability level is very low (p < .001), we can feel confident that we are not committing the Type 1 error described above, and that our sample group of students differs from other groups who may have taken the test in the past or who might take it in the future (the population). We test whether the data from that sample ‘fit’ with those of the population. A p < .05 tells us that there are fewer than 5 chances in 100 that we are wrong in rejecting the H0. That is, we can have confidence in rejecting the H0.
Two-Tailed Test/Hypothesis
In two-tailed tests, we specify no direction for the null hypothesis ahead
of time (that is, whether our scores will be higher or lower than more
typical scores). We just say that they will not be different (and then reject
that if significant). (See first example above.)
One-Tailed Test/Hypothesis
We have a good reason to believe that we will find a difference between
the means based on previous findings. The one-tailed tests will specify the
direction of the predicted difference. In a positive directional hypothesis,
we expect the group to perform better than the population. (See second
example above.) In a negative directional hypothesis, the sample group
will perform worse than the population.
One crucial remark: we cannot repeat tests as often as we may want to. The statistical tests that we introduce in this book are not exploratory statistics; they belong to experimental designs that test hypotheses, and they are a one-time deal only. Steps for hypothesis testing:
be the time the text was produced: whether the text comes from the year
1920 or 2015. These are your independent variables, and depending on
your research question, you will manipulate these to see if there is varia-
tion in the dependent variable.
When we characterize individual speakers’ ways of using certain language features, the unit of analysis is still the text produced by those speakers (because the language was produced and transcribed). This may not be as obvious as in the case above, because each text is now closely associated with an individual speaker who has certain characteristics. Yet it is the text produced by that speaker that serves as the basis for comparison.
Finally, when we look at characteristics of individual linguistic features (e.g., article type in subject and object positions), our unit of analysis is each instance of that feature. We then characterize each observation for its features, which in this case would be syntactic position and type of article.
We need to enter two variables in this view: one will be the dependent
variable and the other will be the independent one. Remember, all vari-
ables and their characteristics will need to be entered here. The names of the variables will be ‘hedge’ and ‘discipline’, respectively. So let’s enter
those and see what characteristics each has. (See Figure 6.3.)
We need to focus on some of these headings, but not all. For example,
those that seem less important are ‘Width,’ which determines how wide
the cell is in your data view, and ‘Columns,’ which determines how many
columns there are. ‘Align’ is also less important as it sets how you would
like to see the text aligned in the data view (to the left, the middle, or to
the right), and ‘Role’ is what role you assign this variable in the dataset
(it will all be input for us). We really do not need to worry much about
these tabs. However, we do need to know more about all the others:
‘Type,’ ‘Decimals,’ ‘Label,’ ‘Values,’ ‘Missing,’ and ‘Measure.’ We will go
through each of these one by one:
Type: Numeric (whether it has a numeric value or not—see nominal
independent variables above).
Decimals: You can set the number of decimals you want to see. For
interval scores, we typically use two decimal points and for nominal
scores, we use 0 decimal points (since they have no numeric value, they
do not and will not have any fractions).
Label: SPSS takes very short names for variable names and only in one-
word strings. Labels, then, provide you with the opportunity to give lon-
ger names that could be used as the labels for your output results as well.
Values: These are the values that you can assign to the levels. For
hedges, we will not have any values assigned. But for the nominal vari-
ables, as we mentioned above, we have 1 = Social Sciences, 2 = Natural
Sciences, and 3 = Humanities. As you enter each one, make sure you hit ‘Add,’ or else it will not be added to the list. (See Figure 6.4.)
Missing: It is best not to have any missing data, because missing values will affect the calculations and the results; otherwise, we need to configure the software program in more sophisticated ways so that missing values are excluded.
Measure: In this area, you will need to determine what kind of variable
you have. In our example, since hedges are interval variables, we will
choose ‘Scale,’ and since discipline is a nominal variable, we will choose
‘Nominal.’ (See Figure 6.5.)
Before we turn to our Data View, let’s add one more variable, so we can
keep track of our observations. The filenames will be portrayed as a string
variable called ‘text_number’ (we really are not including this as a variable
in any calculations; it is more like a reference for us to know which text file
the data is coming from). So it will be string, and it will be a nominal type
of data (all strings are nominal). (See Figure 6.6.)
Now that this is all set, let’s turn to our Data view to see how the data
will need to be entered. First, as we see in Figure 6.7, we have both vari-
ables and each in a column. In SPSS, all variables are in columns, and
each observation is a different row. So in our case, the dependent and
independent variables are in the columns, and each text (each represent-
ing a presentation) will be in a different row.
Descriptive Statistics
In order to get information in a stratified manner for your levels, you
want to give the following command. On the top bar with ‘File,’ ‘Edit,’
‘View,’ choose the following set: Analyze → Descriptive Statistics →
Explore to get to the window shown in Figure 6.8.
Following our case study, as you see, your dependent variable (hedges)
needs to be under ‘Dependent List’ and your independent variable
(discipline) needs to be under ‘Factor List.’ This way, your descriptive
statistics will be calculated for each level (i.e., for each of your disciplines)
Figure 6.8 Descriptive statistics through ‘Explore’ in SPSS
Table 6.3 Descriptive statistics for ‘hedges’ in three disciplines through SPSS
Descriptives
versus giving just one mean score for the entire dataset. Run the statistics, and see to what extent the results from SPSS match the descriptive statistics we calculated earlier (they really should!). If you only want the numbers, click on 'Statistics'; if you want a boxplot (described in this chapter) as well as the numbers, click on both. Explore what further options you may have by clicking on the 'Options' button at the upper right-hand side.
If you only wanted the numbers, it should look like the details in
Table 6.3 listing all the necessary descriptive statistics for each of the
disciplinary areas.
We believe the numbers generated by SPSS match the hand calculations
we made in this chapter. In the next chapter, we will get serious and look
at four different statistical tests that we can apply to our datasets.
Note
1 To calculate the normed counts, we typically take the frequency of the linguistic feature itself, divide it by the total number of words in the given text, and multiply it by a number (typically 1,000) for each observation. That is, let's say Text 1 has 45 first person pronouns and is 1,553 words long. Then we calculate the normed count to 1,000 words (as if the text were that long) by (45/1,553)*1,000 and we get 28.98. This way, we translated a raw frequency into an interval score. Also, if texts have different lengths, this way we can actually compare the numbers. If Text 2 has 125 instances of the feature and the text is 3,552 words long, then the normed count will be (125/3,552)*1,000, or 35.19.
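The note's arithmetic can be wrapped in a small Python helper (the counts and text lengths are the note's own example figures):

```python
def normed_count(raw_count, text_length, base=1000):
    """Normalize a raw frequency to a rate per `base` words."""
    return raw_count / text_length * base

# Text 1: 45 first person pronouns in 1,553 words
text1 = normed_count(45, 1553)   # ~28.98 per 1,000 words

# Text 2: 125 occurrences in 3,552 words
text2 = normed_count(125, 3552)  # ~35.19 per 1,000 words

print(round(text1, 2), round(text2, 2))
```

Because both scores are now rates per 1,000 words, texts of different lengths can be compared directly.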
References
Csomay, E. (2002) ‘Variation in academic lectures: Interactivity and level of
instruction’, in Reppen, R. et al. (eds) 2002: 203–224
Csomay, E. (2012) ‘A corpus-based look at short turns in university classroom
interaction’, in Csomay, E. (ed) 2012: 103–128
Csomay, E. (ed) (2012) Contemporary Perspectives on Discourse and Corpora.
Special issue of Corpus Linguistics and Linguistic Theory. 8/1
Hatch, E. & A. Lazaraton (1991) The Research Manual. Design and Statistics for
Applied Linguistics. Boston: Heinle & Heinle
Reppen, R., Fitzmaurice, S., & D. Biber (2002) Using Corpora to Explore Lin-
guistic Variation. Amsterdam: Benjamins
Chapter 7
variability within the groups or across the groups. If the ratio of these
two is small—that is, if the “across” group variation is small relative
to the “within” group variation—there is no statistical difference. If,
however, the “across” group variation is large relative to the “within”
group variation, there is a statistically significant difference across the
groups. That is, the larger the ratio of the "across" group variability to the "within" group variability (the F score), the more likely it is that the difference between the means across the groups is significant.
Assumptions and requirements with ANOVA:
Example
As an example, let us say you are investigating university classrooms
as your context. You analyze your context for all the situational vari-
ables outlined in Chapter 2, and you realize that discipline may a situ-
ational variables in the academic context that may have an effect on
how language is used in the classrooms. In fact, you have read earlier
that the use of pronouns may vary depending on the discipline. Based
on your readings, you also know that first person pronouns are more
apparent in spoken discourse, and have been associated with situations
where the discourse is produced under more involved kind of produc-
tion circumstances (e.g., where the participants share the same physical
space, allowing for the potential of immediate involvement in interac-
tion). Knowing all of this, you isolate this one pronoun type because you
are interested in the use of first person pronouns (I, me, we, us). More
specifically, you would like to find out whether there is a significant dif-
ference in the use of first person pronouns in different disciplines (more
than two).
Statistical Tests 119
1. How does the use of first person pronouns differ across disciplines?
OR
2. Is there a difference in first person pronoun use across disciplines?
The dependent variable is the composite normed score for the first person
pronouns as listed above (use normed counts—an interval score), and
the one independent variable is discipline with three levels (nominal vari-
able). The three levels are the three disciplinary areas: Business, Humani-
ties, and Natural Sciences.
You formulate your hypothesis:
Table 7.1 First person pronoun data in SPSS data view format

Text         First person pronouns   Discipline
Text 1       3                       1
Text 2       7                       2
Text 3       8                       3
Text 4       10                      3
Text 5       12                      3
Text 6 . . . 3                       1
Step 1: Calculate the mean score for each group and for the entire dataset.
Step 2: Calculate distances across scores (and square them).
Step 3: Calculate degrees of freedom.
Step 4: Calculate mean sum of squares.
Step 5: Calculate F score.
Step 6: Determine whether the F score is significant.
Step 7: Calculate strength of association.
Step 8: Locate the differences with post-hoc tests.
Step 1: Calculate the mean score for each group and for the entire dataset.
Business   Humanities   Natural Sciences
3          7            8
3          8            10
4          9            12
4          7            8
5          8            16
5          9            18
In our calculations, we will mostly use the second type of distance mea-
sure. In looking at how the scores are dispersed, we need to calculate
a) the distance between the individual score and its own group’s mean,
b) the distance between the group mean and the mean for the grand total,
and c) the distance between the individual score and the mean for the
grand total.
We will work with the following terminology: within sum of squares
(SSW) (the sum of squares within each group), between sum of squares
(SSB) (the sum of squares across groups), total sum of squares (SST) (the
sum of squares for the entire dataset), degree of freedom within (DfW)
(degree of freedom within each group) and degree of freedom between
(DfB) (degree of freedom across groups).
a) Within each group: How far is each score from its own group’s
mean?
SSW = Σ (x − X̄group)²

SSW = 96
b) Between groups: How far is each group from the total mean?
SSB = Σ N (X̄group − X̄total)²

SSB = 192
To calculate the total sum of squares, take each score (x) minus the total mean (X̄total), square it, and sum it up (see Table 7.4).

SST = Σ (x − X̄total)²

SST = 288
DfW = N − Ngroups (here, 18 − 3 = 15)
DfB = Ngroups − 1 (here, 3 − 1 = 2)
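As a check on the steps above, the sums of squares, degrees of freedom, F score, and strength of association for the worked example can be recomputed in a short Python sketch (plain Python, no statistics library; the scores are the ones from the grouped table above):

```python
# Normed first person pronoun scores from the worked example
groups = {
    "Business":         [3, 3, 4, 4, 5, 5],
    "Humanities":       [7, 8, 9, 7, 8, 9],
    "Natural Sciences": [8, 10, 12, 8, 16, 18],
}

def mean(xs):
    return sum(xs) / len(xs)

scores = [x for g in groups.values() for x in g]
grand_mean = mean(scores)                       # 8.0

# Step 2: sums of squared distances
ssw = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)         # within: 96
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())  # between: 192
sst = sum((x - grand_mean) ** 2 for x in scores)                          # total: 288 = SSW + SSB

# Steps 3-5: degrees of freedom, mean squares, and the F score
dfw = len(scores) - len(groups)       # 18 - 3 = 15
dfb = len(groups) - 1                 # 3 - 1 = 2
f_score = (ssb / dfb) / (ssw / dfw)   # (192/2) / (96/15) = 15.0

# Step 7: strength of association
r_squared = ssb / sst                 # 192/288, about .67
```

Note how SST equals SSW plus SSB, which is a useful sanity check on any hand calculation.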
This means that the variation of scores across the groups is larger than
the variation within the groups. That is, we can reject the null hypothesis,
which states that “there is no statistically significant difference across
the disciplines in the use of first person pronouns.” Hence, we accept the
alternative hypothesis, which states that “there is a statistically significant
difference in the use of first person pronouns across disciplines.” What
we do not know yet, however, is how strong the association is between
the two variables (dependent and independent) and where the differences
lie. Below is Table 7.5, showing how the statistical program SPSS reports
on the Descriptive statistics and on the One-way ANOVA results.
Table 7.6 above has all the numbers we have previously calculated by hand.
R² = SSB / SST

In our example,

R² = 192/288 = .666
R² = .666 means that 66.6% of the variance in first person pronoun use can be accounted for by the discipline. That is, if you know the discipline, you can predict the use of pronouns more than half the time. Or, by knowing the first person pronoun score, we are able to predict which discipline it comes from with quite good certainty, more than half the time.
With identifying the F score’s significance, we can only say that there
is a statistically significant difference in the use of, in our case, first per-
son pronouns. What we cannot say is where the statistically significant
differences are exactly. In order to do so, we can use a range of post-
hoc tests, including Scheffe, Tukey, Bonferroni, Duncan, or LSD. We are
using Scheffe for the current question and dataset to illustrate how this
works. Table 7.7 was, again, created by SPSS (Statistical Package for
Social Sciences).
As Table 7.7 shows, the mean difference across each of the three dis-
ciplines is statistically significant. To determine where the significant dif-
ferences actually lie and which direction they go, we need to look at each
pair from this table to say the following:
(I) Discipline   (J) Discipline   Mean Difference (I−J)   Std. Error   Sig.   95% Confidence Interval

*The mean difference is significant at the .05 level (as we also see in the Sig. column, all values are below p < .05)
The rest of the information in the table is a repetition of this but with
reversed direction. If we look at our original mean scores, it is true that
the Business mean was 4, the Humanities mean was 8, and the Natural
Sciences mean was 12. And now we know that these differences are, in
fact, statistically significant.
Interpretation
Based on previous readings we know that first person pronouns are
typically associated with a communicative context where language is
produced in a shared physical space, and under involved production
circumstances allowing for the potential of interaction. We also know
through the situational analysis that disciplines may differ in the way the
material is presented, and so we want to know to what extent first person
pronouns would be an indicator of such difference. The statistical results
in our mini-study showed that there is a statistically significant difference across disciplines, and that significantly more first person pronouns are used in Natural Sciences than in either of the other two disciplines. In addition, the results also show that Business uses significantly fewer first person pronouns than Humanities. These results indicate that language features in Natural Sciences classrooms seem to be similar to those in spoken (rather than written) discourse, which is associated with discourse produced under involved production circumstances suggesting interaction. The fact that Business showed the lowest number of personal pronouns may be attributed to less interaction in the classroom, and perhaps more teacher talk.
Example
As you have attended classes at the university, you have noticed that teachers talk differently not only in classes from different disciplines (as we saw in the example before), but also in classes at different educational levels. Your primary interest is discipline, but it seems that level of instruction may also be a variable that could intervene in the variability of the data, and you are hoping that it does not affect your previous findings. You do not know whether the language change is attributable to only one of the variables or to the two together. In your situational analysis, then, you take discipline as your main variable, and level of instruction as another, intervening variable. As for the teacher 'talking differently', you continue to believe, based on your previous readings, that first person pronoun use is what makes the difference.
1. How does the use of first person pronouns differ across disciplines
and levels of instruction? OR
2. Is there a difference in first person pronoun use across disciplines or
across levels of instruction?
Interpretation
We set out to investigate whether discipline accounts for the variability in
our data. At the same time, we also realized that in the academic context,
an intervening variable, i.e., level of instruction, may have an effect on
this variability as well. We tested this by running a Two-Way ANOVA to
see what effect level of instruction may have. Based on the results above,
we can conclude that discipline alone accounts for the variability in the
data, and level has no effect on this variability. This means, that there is
a consistent change in the use of first person pronouns across disciplines,
irrespective of whether the class is an undergraduate lower or upper divi-
sion class, or a graduate class. Post hoc tests could identify where exactly
the differences are. It turns out that first person pronouns are used most
frequently in Education classes, indicating a more interactive class than
in the other two disciplines.
A word of warning: It must be noted that if the interaction effect is significant (that is, if p < .05 in the line with the two variables juxtaposed, Variable 1 * Variable 2), we cannot isolate any of the variables as significant even if they show a significant result on their own. Consider the following results (see Table 7.10) in terms of significance level, instead of what we had before (we are just doing this for the sake of the exercise; the numbers are not true results):
In this dataset, each of the independent variables significantly marks the variation in first person pronoun use. At the same time, the interaction measure (Discipline * Level) is also significant with p < .05.
This means that neither discipline nor level of instruction alone is respon-
sible for the variability in the use of first person pronouns. Instead, the
two variables together cause the change in the dataset. In other words,
we cannot say that, for example, Natural Sciences consistently use more
first person pronouns than Humanities, because their use of pronouns
7.2.1 Chi-square
With Chi-square tests, both the dependent and the independent variables
can be nominal data. The results of non-parametric tests, like Chi-square, cannot be generalized to the population the sample was drawn from, but we can ask questions related to the given dataset.
With Chi-square, we also have One-way and Two-way designs. The dif-
ference between the two is simply how many levels each variable has.
A One-way design (see Table 7.11 below) would have a dependent nomi-
nal variable with no levels, and an independent variable with potentially
two or more levels. For example, if you wanted to compare the raw fre-
quency scores of a particular type of relative clause across registers, you
would use a one-way design, as shown in Table 7.11.
All in all, a one-way Chi-square design would have one variable with
no levels (e.g., relative clauses) and one variable with two or more levels
(registers). On the other hand, a two-way design would have a dependent
nominal variable with two or more levels and an independent nominal
variable with two or more levels. Table 7.12 shows a two-way Chi-square
design: one variable with two or more levels (e.g., articles: a, an, the, zero
article) and one variable with two or more levels (position: subject or
object).
Wh-relative: 15, 5, 10, 25 (raw frequencies across registers, from Table 7.11)
Example
Imagine that you would like to find out about the relationship between
article type (a, an, the, zero article) and position (subject or object). Here
are the steps you need to take:
• For a 1x2 or 2x2 table, expected frequency values in each cell must be at least 5.
• For a 2x3 table, expected frequencies should be at least 2.
• For a 2x4 or 3x3 table, if all frequencies but one are at least 5 and if
the one small cell is at least 1, chi-square is still a good approximation.
If you are worried about the frequencies in the cells, you could collapse categories where it makes sense. In the example above, the two types of indefinite articles (a/an) can be collapsed, since their use depends on the word following them and will not affect the syntactic position. Table 7.13 shows the crosstab of frequencies.
          the   a    an   zero
subject   54    27   8    19
object    92    49   3    44

If the article distribution were the same in subject and object positions, we would get an equal number of them across article types. In other words, if there were no relationship between the article type and the position, we would expect an even distribution of the frequencies. The question, then, is: Is there a relationship between article type and clause position? We calculate what we would expect if there were no relationship, and compare that with the existing dataset.
First, we calculate the row and column totals.
Second, we calculate the expected value for each cell by taking the row
total and the column total, multiply the two, and divide it by the grand
total. Below is the formula.
f_expected = (Row total × Column total) / N
For example, to calculate the expected value for the first cell (‘the’ in
subject position) take 108 (Row total) times 146 (column total) divided
by 296 (N) = 53.27. Do the same for each cell. (Figure 7.15)
Finally, we can calculate the Chi-square ( χ 2): Deduct the expected
value from the observed value, square it, and divide it by the expected
value, and add it all up (Figure 7.16).
          the   a/an   zero   Total
Subject   54    35     19     108
Object    92    52     44     188

Chi-square:

χ² = Σ (f_observed − f_expected)² / f_expected

χ² = 1.64
Degree of freedom: df = (number of rows − 1) × (number of columns − 1) = (2 − 1) × (3 − 1) = 2.
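The expected frequencies and the chi-square value can be recomputed from the collapsed crosstab in a short Python sketch (plain Python; the observed counts are the chapter's own):

```python
# Observed frequencies from the collapsed crosstab
observed = {
    "subject": {"the": 54, "a/an": 35, "zero": 19},
    "object":  {"the": 92, "a/an": 52, "zero": 44},
}

row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for col, f in cells.items():
        col_totals[col] = col_totals.get(col, 0) + f
n = sum(row_totals.values())  # 296

# Expected frequency for each cell: row total x column total / N
expected = {
    row: {col: row_totals[row] * col_totals[col] / n for col in cells}
    for row, cells in observed.items()
}
# e.g. expected["subject"]["the"] = 108 * 146 / 296, about 53.27

# Chi-square: sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (observed[row][col] - expected[row][col]) ** 2 / expected[row][col]
    for row in observed for col in observed[row]
)
# about 1.63 (the chapter's hand-rounded expected values give 1.64)

df = (len(observed) - 1) * (len(col_totals) - 1)  # (2-1) * (3-1) = 2
```

The small difference from the chapter's 1.64 comes only from rounding the expected values to two decimals during the hand calculation.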
Interpretation
What does this mean? It means that we cannot reject the null hypothesis
stating that there is no relationship between the article type and position.
In other words, any of the articles could pretty much randomly occur in
any position, as there is no relationship between the position and the type
of article used. While this example may not have direct relevance to register studies, we could follow up with one. Instead of looking at the potential relationship between article type and syntactic position, the focus of the investigation would be to see whether one type of article, when in a certain position, occurs more often in one register than in another.
7.2.2 Correlation
Among the three different types of correlations (Pearson, Spearman Rank
Order, and Point-biserial), Pearson correlation is the most frequently used
statistical procedure in corpus studies. With Pearson, we need interval
data for both the dependent and independent variables.
Example
You noticed that I mean and ok often come as a package in spoken discourse. You also noticed that both teachers and students use them, but you don't know whether it's the same when they are presenting in front of an audience. Let's assume you would like to find out whether there is a relationship in the use of "I mean" and "ok." You have a small corpus of presentations comprising two sub-corpora: teacher presentations and student presentations. Let's say the mean score for I mean used by teachers is 39.1, and for students, it is 42.5.
There is too much overlap between the two types of presentations in
terms of “I mean” use. That is, if we did a difference type of test (e.g.,
like ANOVA), there would not be a significant difference between teacher
and student presentations in terms of the use of “I mean.” In other words,
we would not be able to predict whose presentation it is by knowing the
“I mean” count.
In contrast, the mean score for ok use for teachers is 150, and for
students, it is 328. There is no overlap between the two types of presenta-
tions in terms of the use of “ok.” That is, in a difference type of test (e.g.,
One-way ANOVA), this would show a significant difference between
teacher and student presentations in terms of “ok” use. That is, we could
predict who gives the presentation by knowing the “ok” count.
[Scatter plot: counts of "ok" per 1,000 words (x-axis) against counts of "I mean" per 1,000 words (y-axis); data points (ok, I mean): (2, 3), (5, 15), (7, 12), (7, 18), (10, 20), (10, 27), (12, 22), (15, 25), (17, 27), (20, 30)]

Figure 7.2 Overlap between 'I mean' and 'ok' (78% overlap)
the amount of variance in” one variable “which is accounted for by the
other variable” or the other way round (Hatch and Lazaraton 1991:
441). To translate this to our study here, it sounds like this: The amount
of variance in the use of “I mean” is accounted for by the use of “ok.”
What is a strong overlap and what is a weak one is hard to tell without
knowing the question. If you wish to show that one text is very similar
to another, the higher the overlap the better. In general though, in social
science research if the overlap is over 25%, it is considered very high. But
since it is genuine continuous data, there is no need for a cut-off point.
The degree will depend on how the disciplines regard this as strong or
not, and that is why no significance level is necessary.
Some useful hints:
In this section, we only looked at how two linguistic variables may relate
to one another (or what relationship they may have) but we can look at
more than two at once. It is almost like going to a party where you try to
figure out who is hanging out with whom and what characteristics they
have. In any case, if you look at correlations of more linguistic variables
at once, you can start characterizing texts for their comprehensive lin-
guistic make-up. We will briefly discuss this and point you in that direc-
tion in the last chapter of this book.
Interpretation
The interpretation here is simple: when one language feature occurs, the
other one does as well. That is, there is a positive relationship between
the two variables and so we can predict that if there is a high number of
one, there will be a high number of the other as well. This kind of study becomes more interesting when we look at more than just two linguistic features co-occurring with one another, and when we are able to detect how a number of features, thrown in the same pot and having an effect on one another, will behave. This will be discussed further in Chapter 9 as we look ahead.
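Pearson's r itself is straightforward to compute. The sketch below is plain Python; the paired counts are read off the chapter's scatter plot of "ok" against "I mean" and are used here only for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two interval variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Paired normed counts per presentation, read off the scatter plot: (ok, I mean)
ok     = [2, 5, 7, 7, 10, 10, 12, 15, 17, 20]
i_mean = [3, 15, 12, 18, 20, 27, 22, 25, 27, 30]

r = pearson_r(ok, i_mean)    # about 0.89: a strong positive correlation
print(round(r, 2))
r_squared = r ** 2           # about 0.79, in line with the ~78% overlap reported
```

Squaring r gives the shared variance, which is exactly the "overlap" interpretation quoted from Hatch and Lazaraton in this section.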
One-Way ANOVA
To run a One-Way ANOVA test in SPSS, from the tabs select Analyze →
Compare Means → One-Way ANOVA. Again, your 'Dependent List' will contain the dependent variable. Although a One-Way ANOVA tests only one dependent variable, if you want to run the test on more than one dependent variable at the same time (e.g., you want to see variation in hedges and also in noun use), instead of opening the window for each individually, you can list all of them under the dependent list. The program will take them one by one and run the test on each separately.
Your independent variable with multiple levels will go into the “Factor”
window.
Two-Way ANOVA
To run a Two-Way ANOVA test in SPSS, from the tabs select Analyze →
General Linear Model → Univariate. Again, you will put your dependent
variable in the ‘Dependent Variable’ field, and the independent variable
will come under ‘Fixed Factor(s)’ and your intervening variable (your sec-
ond independent variable) will come under ‘Random factor(s).’ You can
also determine what ‘Posthoc’ test you may want to use in case only one
variable significantly accounts for the variability of the data.
Chi-Square
To obtain Chi-square results, go to Analyze → Non-parametric tests →
Legacy Dialogs → Chi-Square. Again, your variables will go into the
appropriate boxes and you can hit “run”.
Pearson Correlation
To obtain Pearson correlation results, go to Analyze → Correlate →
Bivariate Correlations, click on ‘Pearson.’ Again, your variables will go
into the appropriate boxes, and you need to decide whether you want a
one-tailed or a two-tailed analysis (see Chapter 6). Again, hit “run”.
Notes
1 Statistical Package for the Social Sciences.
2 We must be careful as to how we select them in the corpus, though—and the
best way to do so is to get a set of randomly selected relative clause sentences
and continue our classification based on that.
Reference
Hatch, E. & A. Lazaraton (1991) The Research Manual. Design and Statistics for
Applied Linguistics. Boston: Heinle & Heinle.
Chapter 8
Table 8.1 A corpus of problem solution essays written by Thai speakers of English
                  Individual   Collaborative
# of texts        102          51
# of words        14,124       9,284
# of words/text   138.47       182.03
conditions. The research question for this study is: Do collaborative and
individual texts differ in terms of their lexico-grammatical features? The
texts from this corpus were collected under two different conditions:
First, the students were placed into pairs and asked to write a problem
solution paragraph collaboratively. Later in the semester, each student
had to write a problem solution paragraph as part of an in-class examina-
tion. The essays were then typed and saved as text files with headers and
codes to show the authors and topics of the essays. The corpus was also
divided into two sub-corpora: one consisting of collaboratively written
essays and another consisting of individual essays. A description of the
corpus is shown in Table 8.1.
This table provides the number of individual and collaborative essays,
the total number of words in each sub-corpus, and the average number
of words for the individual and collaborative texts. We will return to this
table when we discuss the analysis of linguistic features in section 8.3.
However, before looking at how to search for linguistic features, we will
first describe the method for completing a situational analysis.
Figure 8.1 Word list in a corpus of problem solution essays written by Thai speakers of English (Sub-corpus A and Sub-corpus B)
list function above) to longer sequences. Figure 8.2 shows the most fre-
quent four-word n-grams in the individual problem solution texts. These
n-grams are found by using the “Clusters/N-gram” tab and then specify-
ing the n-gram size (4) as well as the minimum frequency (5), using the
tools found at the bottom of the screen (as shown in Figure 8.2). Note
that you can change the size of the n-grams to look at bigrams, trigrams,
and even longer sequences of words if you choose to do so.
The second general method to use is the “corpus-based method.” If
you choose to take a “corpus-based” approach, you already have an idea
of the linguistic features that will be the focus of your research. This
approach is used when previous corpus information determines the
linguistic features used in the analysis. For example, previous corpus
Doing Corpus Linguistics 147
Figure 8.2 Four-grams in a corpus of problem solution essays written by Thai speakers
of English
research has illustrated that face-to-face conversation has many first and
second person pronouns. The high number of these pronouns can be
understood by reference to the interactional nature of face-to-face con-
versation where speakers often make reference to themselves as well as
to the other participants of the conversation. This corpus-informed fact
can serve as a reason to look for the frequency of first and second person
pronouns in corpora that do not involve face-to-face conversation but
that also include examples of spoken language in other contexts (such
as academic lectures, where the purpose of communication is primarily
informational). If one were to find a high number of first and second per-
son pronouns in academic lectures, this may be indicative of the interac-
tional nature or style of academic lectures. In this sense, corpus-informed
searches can be attributed to the fact that certain linguistic features have
feature) occurs per x number of words. In this case, we will use 1000 as the reference number. The results would show that individual "the" occurs 51.40 times per 1000 words (726/14,124 x 1000 = 51.40) and collaborative "the" occurs 45.77 times per 1000 words (425/9,284 x 1000 = 45.77).
When considering the frequency comparison of the word “the” in the two
sub-corpora, we can then compare 51.40 (individual) with 45.77 (collab-
orative) and note that the frequency differences between these two corpora
are not so great. If we look at “to,” we would calculate the normed count
in individual texts at 37.09 (524/14,124 x 1000 = 37.09) and collaborative
“to” would be 46.53 (432/9,284 x 1000 = 46.53). In this case, the normed
count in collaborative texts is actually higher than in the individual texts
even though the raw count is lower.
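The normed comparisons in this paragraph can be scripted so the same calculation runs for any word; the raw counts and sub-corpus sizes are the ones quoted above:

```python
# Raw counts of "the" and "to" and sub-corpus sizes from Table 8.1
sizes = {"individual": 14124, "collaborative": 9284}
raw = {
    "the": {"individual": 726, "collaborative": 425},
    "to":  {"individual": 524, "collaborative": 432},
}

# Rate per 1,000 words for each word in each sub-corpus
per_1000 = {
    word: {sub: count / sizes[sub] * 1000 for sub, count in subs.items()}
    for word, subs in raw.items()
}
# "the": ~51.40 (individual) vs ~45.78 (collaborative)
# "to":  ~37.10 (individual) vs ~46.53 (collaborative); the collaborative
#        normed rate is higher even though its raw count is lower
```

The case of "to" is the reason norming matters: raw counts alone would have pointed in the opposite direction.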
In a linguistic analysis, it is also worth noting not only the potential frequency differences in shared words across the two types of texts but also the use of words that differ between the texts. In this small sample, we not
only see differences in nouns (“English” is the third most frequent word
in the individual corpus and “students” is the fourth most frequent word
in the collaborative corpus) but also differences in function words (“in” is
the fourth most frequent word in the individual texts and “of” is the fifth
most frequent word in the collaborative texts). Some differences might be
related to topic (as with the nouns) but other differences might be related
to the production circumstances or relations among participants. Only a
closer examination of these features in the corpus can provide us with evi-
dence to support the analysis. Furthermore, Table 8.4 only shows the five
most common words; looking at longer word lists would likely show other
differences in the corpus that are worthy of analysis and interpretation.
In addition to word lists, the n-gram function can show us potential
variation in corpora. As discussed in Chapter 3, n-grams are contiguous
sequences of words that can vary in length depending on the interest of
the researcher. In Table 8.5, we provide the five most frequent four-grams
in the individual and collaborative texts. We have included both the raw
frequency counts as well as the counts that are normed to 1000 words in
parentheses.
Table 8.4 The five most frequent words in a corpus of problem solution essays written
by Thai speakers of English (both raw and normed counts to 1000 words)
Table 8.5 The five most frequent four-grams in a corpus of problem solution essays
written by Thai speakers of English (both raw and normed counts to 1000 words)
There are many different types of searches that you can do with your cor-
pus. As mentioned in Chapter 5, you should feel encouraged to explore
the AntConc program (as well as related literature such as the "read me" files on the AntConc website) to learn about the program. New tools
are frequently available so you should visit the site from time to time to
learn about the new functions and to download the latest version of the
program.
For example, an important consideration in understanding the use
of any feature (including both word lists and n-grams) relates to the
distribution (or dispersion) of a feature in the corpus. A given feature
may be frequent, but it is important to make sure that the feature is
not used in a few texts at a very high frequency. Although not in a sta-
tistical sense as we described dispersion in Chapter 7, you can check
the visual of the distribution in AntConc by using the ‘Concordance
Plot’ option (see Chapter 5 for details). This function will show you
how many different files the given feature occurs in as well as how
many times the feature occurs in a single file. This is an easy method
to provide a visual representation of the distributional patterns of the
feature that you are looking at. Distributional patterns like these can
be very helpful in interpreting your results. Seeing the distributional
patterns can also help in examining whether your findings for a given
feature are, in fact, spread in your corpus or are found in a limited
number of texts only. If the latter, you may need to be aware that that
language feature is probably used in an idiosyncratic way; that is, it
is used only by one or two participants or in only a few of the texts
(depending on what your unit of analysis is).
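AntConc's Concordance Plot does this check in the GUI. Purely as an illustration of the underlying idea (this is not AntConc's code, and the function name and folder layout are our own assumptions, with one plain-text file per corpus text), a per-file count can be sketched as:

```python
import os
import re


def feature_distribution(corpus_dir, feature):
    """Count how often `feature` occurs (as a whole word) in each .txt file."""
    pattern = re.compile(r"\b" + re.escape(feature) + r"\b", re.IGNORECASE)
    counts = {}
    for name in sorted(os.listdir(corpus_dir)):
        if name.endswith(".txt"):
            with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
                counts[name] = len(pattern.findall(f.read()))
    return counts


# counts = feature_distribution("my_corpus", "problem")   # hypothetical folder
# range_ = sum(1 for c in counts.values() if c > 0)       # files containing the feature
```

A feature that is frequent overall but occurs in only one or two files would show up here as a dictionary with a few large values and many zeros.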
Chapter 9
A Way Forward
Beyond simply illustrating how searches can be done with a corpus, the
purpose of this book is to show how a complete corpus-based project can
be carried out, including some of the technical aspects and some basic statistical analyses. As we have discussed throughout the book, one can follow either (a) a corpus-driven or (b) a corpus-based approach to linguistic analysis. In this book, we have attempted to illustrate both.
We showed how we can do a corpus-based study with already-identified
language features, whether doing a lexical study or searching for the use
of particular grammatical patterns. We have also illustrated the notion
of a corpus-driven study, as we extracted lexical items (n-grams) from a
small corpus. While it is relatively easy to carry out lexical studies with
corpus-driven approaches (whether you rely on existing corpora or ana-
lyze your own corpus), as available tools allow you to extract lexical pat-
terns from the texts, it is quite difficult to apply corpus-driven approaches
to do a full lexico-grammatical analysis of texts. The main reason for this difficulty is that texts need to be grammatically tagged so that grammatical categories can be extracted from corpora in the same way that specific lexical items are. Tagged corpora can include not only broad grammatical categories (such as nouns, verbs, and adjectives) but also sub-categories within these word types (such as concrete and abstract nouns, private and suasive verbs, and attributive and predicative adjectives). Some tagging software is available, but there is an increasing
need for corpus researchers to gain computational and statistical skills to
carry out more in-depth analyses. If you don’t have such skills (yet), per-
haps the best solution is to continue doing corpus-based studies. You can
continue to look for lexico-grammatical patterns that you find interest-
ing, or you can carry out corpus-based studies that rely on the results of
previous, corpus-driven studies. For the latter, you would use the findings
and apply them to new datasets. Without access to tagged corpora and
advanced programming and statistical knowledge, your corpus-driven
research will be limited to focusing on word lists and n-grams.
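N-gram extraction of the kind described above is exactly what tools like AntConc automate. Purely as an illustration of the underlying idea (not AntConc's own implementation; the token list here is an invented example), contiguous four-grams can be counted like this:

```python
from collections import Counter


def ngrams(tokens, n=4):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "on the other hand it is on the other hand".split()
counts = Counter(" ".join(g) for g in ngrams(tokens))
print(counts.most_common(1))  # → [('on the other hand', 2)]
```

Because this needs only the raw word forms, it works on untagged text, which is why word-list and n-gram studies remain feasible without the tagging and programming skills discussed above.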