LONGMAN GROUP LIMITED
London
Associated companies, branches and representatives throughout the world

© Longman Group Ltd. 1979

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the Copyright owner.

Oller, John W.
Language tests at school. - (Applied linguistics and language study.)
1. Language and languages - Ability testing
I. Title II. Series
407'.6 P53 78-41005

ISBN 0-582-55365-2
ISBN 0-582-55294-X Pbk

First published 1979

Printed in Great Britain by Butler & Tanner Ltd, Frome and London.

ACKNOWLEDGEMENTS

We are grateful to the following for permission to reproduce copyright material: Longman Group Ltd., for an extract from Bridge Series: Oliver Twist edited by Latif Doss; the author, John Pickford, for his review of 'The Scandaroon' by Henry Williamson from Bookcase, broadcast on the BBC World Service January 6th 1973, read by John Pickford; Science Research Associates Inc., for extracts from 'A Pig Can Jig' by Donald Rasmussen and Lynn Goldberg in The SRA Reading Program - Level A Basic Reading Series © 1964, 1970 Donald E. Rasmussen and Lenina Goldberg, reprinted by permission of the publisher Science Research Associates Inc.; Board of Education of the City of New York, from 'New York City Language Assessment Battery'; reproduced from the Bilingual Syntax Measure by permission, copyright © 1975 by Harcourt Brace Jovanovich, Inc., all rights reserved; Center for Bilingual Education, Northwest Regional Educational Laboratory, from 'Oral Language Tests for Bilingual Students'; McGraw-Hill Book Company, from 'Testing English as a Second Language' by Harris; McGraw-Hill Book Company, from 'Language Testing' by Lado; Language Learning (North University Building), from 'Problems in Foreign Language Testing'; Newbury House Publishers, from 'Oral Interview' by Ilyin; 'James Language Dominance Test', copyright 1974 by Peter James, published by Teaching Resources Corporation, Boston, Massachusetts, U.S.A.; 'Black American Cultural Attitude Scale', copyright 1973 by Perry Alan Zirkel, Ph.D., published by Teaching Resources Corporation, Boston, Massachusetts, U.S.A.

Contents

Chapter 1  Introduction  1
  A. What is a language test?  1
  B. What is language testing research about?  3
  C. Organization of this book  6
  Key Points  11
  Discussion Questions  12
  Suggested Readings  13

PART ONE: THEORY AND RESEARCH BASES FOR PRAGMATIC LANGUAGE TESTING

Chapter 2  Language Skill as a Pragmatic Expectancy Grammar  16
  A. What is pragmatics about?  16
  B. The factive aspect of language use  19
  C. The emotive aspect  26
  D. Language learning as grammar construction and modification  28
  E. Tests that invoke the learner's grammar  32
  Key Points  33
  Discussion Questions  35
  Suggested Readings  35

Chapter 3  Discrete Point, Integrative, or Pragmatic Tests  36
  A. Discrete point versus integrative testing  36
  B. A definition of pragmatic tests  38
  C. Dictation and cloze procedure as examples of pragmatic tests  39
  D. Other examples of pragmatic tests  44
  E. Research on the validity of pragmatic tests  50
    1. The meaning of correlation  52
    2. Correlations between different language tests  57
    3. Error analysis as an independent source of validity data  64
  Key Points  70
  Discussion Questions  71
  Suggested Readings  73

Chapter 4  Multilingual Assessment  74
  A. Need  74
  B. Multilingualism versus multidialectalism  77
  C. Factive and emotive aspects of multilingualism  80
  D. On test biases  84
  E. Translating tests or items  88
  F. Dominance and proficiency  93
  G. Tentative suggestions  98
  Key Points  100
  Discussion Questions  102
  Suggested Readings  104

Chapter 5  Measuring Attitudes and Motivations  105
  A. The need for validating affective measures  105
  B. Hypothesized relationships between affective variables and the use and learning of language  112
  C. Direct and indirect measures of affect  121
  D. Observed relationships to achievement and remaining puzzles  138
  Key Points  143
  Discussion Questions  145
  Suggested Readings  147

PART TWO: THEORIES AND METHODS OF DISCRETE POINT TESTING

Chapter 6  Syntactic Linguistics as a Source for Discrete Point Methods  150
  A. From theory to practice, exclusively?  150
  B. Meaning-less structural analysis  152
  C. Pattern drills without meaning  157
  D. From discrete point teaching to discrete point testing  165
  E. Contrastive linguistics  169
  F. Discrete elements of discrete aspects of discrete components of discrete skills - a problem of numbers  172
  Key Points  177
  Discussion Questions  178
  Suggested Readings  180

Chapter 7  Statistical Traps  181
  A. Sampling theory and test construction  181
  B. Two common misinterpretations of correlations  187
  C. Statistical procedures as the final criterion for item selection  196
  D. Referencing tests against non-native performance  199
  Key Points  204
  Discussion Questions  206
  Suggested Readings  208

Chapter 8  Discrete Point Tests  209
  A. What they attempt to do  209
  B. Theoretical problems in isolating pieces of a system  212
  C. Examples of discrete point items  217
  D. A proposed reconciliation with pragmatic testing theory  227
  Key Points  229
  Discussion Questions  230
  Suggested Readings  230

Chapter 9  Multiple Choice Tests  231
  A. Is there any other way to ask a question?  231
  B. Discrete point and integrative multiple choice tests  233
  C. About writing items  237
  D. Item analysis and its interpretation  245
  E. Minimal recommended steps for multiple choice test preparation  255
  F. On the instructional value of multiple choice tests  256
  Key Points  257
  Discussion Questions  258
  Suggested Readings  259

PART THREE: PRACTICAL RECOMMENDATIONS FOR LANGUAGE TESTING

Chapter 10  Dictation and Closely Related Auditory Tasks  262
  A. Which dictation and other auditory tasks are pragmatic?  263
  B. What makes dictation work?  265
  C. How can dictation be done?  267
  Key Points  298
  Discussion Questions  300
  Suggested Readings  302

Chapter 11  Tests of Productive Oral Communication  303
  A. Prerequisites for pragmatic speaking tests  304
  B. The Bilingual Syntax Measure  308
  C. The Ilyin Oral Interview and the Upshur Oral Communication Test  314
  D. The Foreign Service Institute Oral Interview  320
  E. Other pragmatic speaking tasks  326
  Key Points  335
  Discussion Questions  336
  Suggested Readings  338

Chapter 12  Varieties of Cloze Procedure  340
  A. What is the cloze procedure?  341
  B. Cloze tests as pragmatic tasks  345
  C. Applications of cloze procedure  348
  D. How to make and use cloze tests  363
  Key Points  375
  Discussion Questions  377
  Suggested Readings  379

Chapter 13  Essays and Related Writing Tasks  381
  A. Why essays?  381
  B. Examples of pragmatic writing tasks  383
  C. Scoring for conformity to correct prose  385
  D. Rating content and organization  392
  E. Interpreting protocols  394
  Key Points  398
  Discussion Questions  399
  Suggested Readings  400

Chapter 14  Inventing New Tests in Relation to a Coherent Curriculum  401
  A. Why language skills in a school curriculum?  401
  B. The ultimate problem of test validity  403
  C. A model: the Mount Gravatt reading program  408
  D. Guidelines and checks for new testing procedures  415
  Key Points  418
  Discussion Questions  419
  Suggested Readings  421

Appendix  The Factorial Structure of Language Proficiency: Divisible or Not?  423
  A. Three empirically testable alternatives  424
  B. The empirical method  426
  C. Data from second language studies  428
  D. The Carbondale Project, 1976-7  431
  E. Data from first language studies  451
  F. Directions for further empirical research  456

References  459
Index  479

LIST OF FIGURES

Figure 1  A cartoon drawing illustrating the style of the Bilingual Syntax Measure  48
Figure 2  A hypothetical view of the amount of variance in learning to be accounted for by emotive versus factive sorts of information  83
Figure 3  A dominance scale in relation to proficiency scales  99
Figure 4  Example of a Likert-type attitude scale intended for children  137
Figure 5  Componential breakdown of language proficiency proposed by Harris (1969, p. 11)  173
Figure 6  Componential analysis of language skills as a framework for test construction from Cooper (1968, 1972, p. 337)  173
Figure 7  'Language assessment domains' as defined by Silverman et al (1976, p. 21)  174
Figure 8  Schematic representation of constructs posited by a componential analysis of language skills based on discrete point test theory, from Oller (1976c, p. 150)  175
Figure 9  The ship/sheep contrast by Lado (1961, p. 57) and Harris (1969, p. 34)  221
Figure 10  The watching/washing contrast, Lado (1961, p. 57)  222
Figure 11  The pin/pen/pan contrast, Lado (1961, p. 58)  222
Figure 12  The ship/jeep/sheep contrast, Lado (1961, p. 58)  222
Figure 13  'Who is watching the dishes?' (Lado, 1961, p. 59)  222
Figure 14  Pictures from the James Language Dominance Test  309
Figure 15  Pictures from the New York City Language Assessment Battery, Listening and Speaking Subtest  309
Figure 16  Pictures 5, 6 and 7 from the Bilingual Syntax Measure  311
Figure 17  Sample pictures for the Orientation section of the Ilyin Oral Interview (1976)  315
Figure 18  Items from the Upshur Oral Communication Test  318
Figure 19  Some examples of visual closure - seeing the overall pattern or Gestalt  342

LIST OF TABLES

Table 1  Intercorrelations of the Part Scores on the Test of English as a Foreign Language Averaged over Forms Administered through April, 1967  190
Table 2  Intercorrelations of the Part Scores on the Test of English as a Foreign Language Averaged over Administrations from October, 1966 through June, 1971  191
Table 3  Response Frequency Distribution Example One  253
Table 4  Response Frequency Distribution Example Two  254
Table 5  Intercorrelations between Two Dictations, Spelling Scores on the Same Dictations, and Four Other Parts of the UCLA English as a Second Language Placement Examination Form 2D Administered to 145 Foreign Students in the Spring of 1971  281

TABLES IN THE APPENDIX

Table 1  Principal Components Analysis over Twenty-two Scores on Language Processing Tasks Requiring Listening, Speaking, Reading, and Writing as well as Specific Grammatical Decisions  435
Table 2  Varimax Rotated Solution for Twenty-two Language Scores  437
Table 3  Principal Components Analysis over Sixteen Listening Scores  439
Table 4  Varimax Rotated Solution for Sixteen Listening Scores  440
Table 5  Principal Components Analysis over Twenty-seven Speaking Scores  443
Table 6  Varimax Rotated Solution over Twenty-seven Speaking Scores  444
Table 7  Principal Components Solution over Twenty Reading Scores  446
Table 8  Varimax Rotated Solution over Twenty Reading Scores  447
Table 9  Principal Components Analysis over Eighteen Writing Scores  449
Table 10  Varimax Rotated Solution over Eighteen Writing Scores  450
Table 11  Principal Components Analysis over Twenty-three Grammar Scores  452
Table 12  Varimax Rotated Solution over Twenty-three Grammar Scores  453

Acknowledgements
George Miller once said that it is an 'ill-kept secret' that many
people other than the immediate author are involved in the writing of
a book. He said that the one he was prefacing was no exception, and
neither is this one. I want to thank all of those many people who have
contributed to the inspiration and energy required to compile, edit,
write, and re-write many times the material contained in this book. I
cannot begin to mention all of the colleagues, students, and friends
who shared with me their time, talent, and patience along with all of
the other little joys and minor agonies that go into the writing of a
book. Neither can I mention the teachers of the extended classroom
whose ideas have influenced the ones I have tried to express here. For
all of them this is probably a good thing, because it is not likely that
any of them would want to own some of the distillations of their ideas
which find expression here.
Allowing the same discretion to my closer mentors and collaborators, I do want to mention some of them. My debt to my father and to his Spanish program published by Encyclopaedia Britannica Educational Corporation will be obvious to anyone who has used and understood the pragmatic principles so well exemplified there. I also want to thank the staff at Educational Testing Service, and the other members of the Committee of Examiners for the Test of English as a Foreign Language, who were kind enough both to tolerate my vigorous criticisms of that test and to help fill in the many lacunae in my still limited understanding of the business of tests and measurement. I am especially indebted to the incisive thinking and challenging communications of John A. Upshur of the University of Michigan, who chaired that committee for three of the four years that I served on it. Alan Hudson of the University of New Mexico and Robert Gardner of the University of Western Ontario stimulated my interest in much of the material on attitudes and sociolinguistics which has found its way into this book. Similarly, doctoral research
by Douglas Stevenson, Annabelle Scoon, Rose Ann Wallace,
Frances Hinofotis, and Herta Teitelbaum has had a significant
impact on recommendations contained here. Work in Alaska with
Eskimo children by Virginia Streiff, in the African primary schools by
John Hofman, in Papua New Guinea by Jonathon Anderson, in
Australia by Norman Hart and Richard Walker, and in Canada and
Europe by John McLeod has fundamentally influenced what is said
concerning the testing of children.
In addition to the acknowledgements due to people for contributing to the thinking embodied here, I also feel a debt of gratitude towards those colleagues who indirectly contributed to the development of this book by making it possible to devote concentrated periods of time to thinking and studying on the topics discussed here. In the spring of 1974, Russell Campbell, now Chairman of the TESL group at UCLA, contributed to the initial work on this text by inviting me to give a series of lectures on pragmatics and language testing at the American University English Language Institute in Cairo, Egypt. Then, in the fall semester of 1975, a six week visit to the University of New Mexico by Richard Walker, Deputy Director of Mount Gravatt College of Advanced Education in Brisbane, Australia, resulted in a stimulating exchange with several centers of activity in the world where child language development is being seriously studied in relation to reading curricula. The possibility of developing tests to assess the suitability of reading materials to a given group of children and the discussion of the relation of language to curriculum in general is very much a product of the dialogue that ensued.
More recently, the work on this book has been pushed on to
completion thanks to a grant from the Center for English as a Second
Language and the Department of Linguistics at Southern Illinois
University in Carbondale. Without that financial support and the encouragement of Patricia Carrell, Charles Parish, Richard Daesch,
Kyle Perkins, and others among the faculty and students there, it is
doubtful that the work could have been completed.
Finally, I want to thank the students in the testing courses at
Southern Illinois University and at Concordia University in
Montreal who read all or parts of the manuscript during various
stages of development. Their comments, criticisms, and encouragement have sped the completion of the work and have improved the
product immensely.
Preface
It is, in retrospect, with remarkable speed that the main principles and assumptions have become accepted of what can be called the teaching and learning of language as communication. Acceptance, of course, does not imply practical implementation, but distinctions between language as form and language as function, the meaning potential of language as discourse and the role of the learner as a negotiator of interpretations, the match to be made between the integrated skills of communicative actuality and the procedures of the classroom, among many others, have all been widely announced, although not yet adequately described. Nonetheless, we are certain enough of the plausibility of this orientation to teaching and learning to suggest types of exercise and pedagogic procedures which are attuned to these principles and assumptions. Courses are being designed, and textbooks written, with a communicative goal, even if, as experiments, they are necessarily partial in the selection of principles they choose to emphasise.
Two matters are, however, conspicuously lacking. They are connected, in that the second is an ingredient of the first. Discussion of a communicative approach has been very largely concentrated on syllabus specifications and, to a lesser extent, on the design of exercise types, rather than on any coherent and consistent view of what we can call the communicative curriculum. Rather than examine the necessary interdependence of the traditional curriculum components - purposes, methods and evaluations - from the standpoint of a communicative view of language and language learning, we have been happy to look at the components singly, and at some much more than others. Indeed, and this is the second matter, evaluation has hardly been looked at at all, either in terms of assessment of the communicative abilities of the learner or the efficacy of the programme he is following. There are many current examples, involving materials and methods aimed at developing communicative interaction among learners, which are preceded,
interwoven with or followed by evaluation instruments totally at
odds with the view of language taken by the materials and the
practice with which they are connected. Partly because we have not
taken the curricular view, and partly because we have directed our
innovations towards animation rather than evaluation, teaching and
testing are out of joint. Teachers are unwilling to adopt novel
materials because they can see that they no longer emphasise
exclusively the formal items of language structure which make up the
'psychometric-structuralist' (in Spolsky's phrase) tests another
generation of applied linguists have urged them to use. Evaluators of
programmes expect satisfaction in terms of this testing paradigm
even at points within the programme when quite other aspects of
communicative ability are being encouraged. Clearly, this is an
undesirable and unproductive state of affairs.
It is to these twin matters of communication and curriculum that
John Oller's major contribution to the Applied Linguistics and
Language Study Series is addressed. He poses two questions: how can
language testing relate to a pragmatic view of language as communication and how can language testing relate to educational
measurement in general?
Chapter 1 takes up both issues; in a new departure for this Series
John Oller shows how language testing has general relevance for all
educationalists, not just those concerned with language. Indeed, he
hints here at a topic he takes up later, namely the linguistic basis of
tests of intelligence, achievement and aptitude. Language testing, as a
branch of applied linguistics, has cross-curricular relevance for the
learner at school. The major emphasis, however, remains the
connection to be made between evaluation, variable learner
characteristics, and a psycho-socio-linguistic perspective on 'doing'
language-based tasks.
This latter perspective is the substance of the four Chapters in Part One of the book. Beginning from a definition of communicative proficiency in terms of 'accuracy' in a learner's 'expectancy grammar' (by which Oller refers to the learner's predictive competence in formal, functional and strategic terms) he proceeds to characterise communication as a functional, context-bound and culturally-specific use of language involving an integrated view of receptive and productive skills. It is against such a yardstick that he is able, both in Chapters 3 and 4 of Part One, and throughout Part Two, to offer a close, detailed and well-founded critical assessment of the theories and methods of discrete point testing. Such an approach to testing,
Oller concludes, is a natural corollary of a view of language as form
and usage, rather than of process and use. If the view of language
changes to one concerned with the communicative properties of
language in use, then our ways of evaluating learners' competences to
communicate must also change.
In following Spolsky's shift towards the 'integrative-sociolinguistic' view of language testing, however, John Oller does not avoid the frequently-raised objection that although such tests gain in apparent validity, they do so at a cost of reliability in scoring and handling. The immensely valuable practical recommendations for pragmatically-orientated language tests in Part Three of the book constantly return to this objection, and show that it can be effectively countered. What is more, and this is a strong practical theme throughout the argument of the book, it is necessary to invoke a third criterion in language testing, that of classroom utility. Much discrete point testing, he argues, is not only premissed on an untenable view of language for the teacher of communication, but in requiring time-consuming and often arcane pre-testing, statistical evaluation and rewriting techniques, poses quite impractical burdens on the classroom teacher. What is needed are effective testing procedures, linked to the needs of particular instructional programmes, reflecting a communicative view of language learning and teaching, but which are within the design and administrative powers of the teacher. Pragmatic tests must be reliable and valid: they need also to be practicable and to be accessible without presupposing technical expertise. If, as the examples in Part Three of the book show, they can be made to relate and be relevant to other subjects in the curriculum than merely language alone, then the evaluation of pragmatic and communicative competence has indeed cross-curricular significance.
Finally, a word on the book's organisation; although it is lengthy, the book has clear-cut divisions: the Introduction in Chapter 1 provides an overview; Part One defines the requirements on pragmatic testing; Part Two defines and critically assesses current and overwhelmingly popular discrete point tests, and the concluding Part Three exemplifies and justifies, in practical and technical terms, the shape of alternative pragmatic tests. Each Chapter is completed by a list of Key Points, Discussion Questions, and Suggested Readings, thus providing the valuable and necessary working apparatus to sustain the extended and well-illustrated argument.
Christopher N Candlin, General Editor
Lancaster, July 1978.
Author's Preface
A simple way to find out something about how well a person knows a language (or more than one language) is to ask him. Another is to hire a professional psychometrist to construct a more sophisticated test. Neither of these alternatives, however, is apt to satisfy the needs of the language teacher in the classroom or of any other educator, whether in a multilingual context or not. The first method is too subject to error, and the second is too complicated and expensive. Somewhere between the extreme simplicity of just asking and the development of standardized tests there ought to be reasonable procedures that the classroom teacher could use confidently. This book suggests that many such usable, practical classroom testing procedures exist, and it attempts to provide language teachers and educators in bilingual programs or other multilingual contexts access to those procedures.
There are several textbooks about language testing already on the
market. All of them are intended primarily for teachers of foreign
languages or English as a second language, and yet they are generally
based on techniques of testing that were not developed for classroom
purposes but for institutional standardized testing. The pioneering
volume by Robert Lado, Language Testing (1961), the excellent book
by Rebecca Valette, Modern Language Testing: A Handbook (1967),
the equally useful book by David Harris, Testing English as a Second
Language (1969), Foreign Language Testing: Theory and Practice
(1972) by John Clark, Testing and Experimental Methods (1977) by
J. P. B. Allen and Alan Davies, and Valette's 1977 revision of Modern
Language Testing all rely heavily (though not exclusively) on
techniques and methods of constructing multiple choice tests
developed to serve the needs of mass production.
Further, the books and manuals oriented toward multilingual education such as Oral Language Tests for Bilingual Students (1976) by R. Silverman, et al., are typically aimed at standardized published
tests. It would seem that all of the previously published books
attempt to address classroom needs for assessing proficiency in one or
more languages by extending to the classroom the techniques of
standardized testing. The typical test format discussed is generally the
multiple-choice discrete-item type. However, such techniques are
difficult and often impracticable for classroom use. While Valette,
Harris, and Clark briefly discuss some of the so-called 'integrative' tests like dictation (especially Valette), composition, and oral
interview, for the most part they concentrate on the complex tasks of
writing, pre-testing, statistically evaluating, and re-writing discrete
point multiple-choice items.
The emphasis is reversed in this book. We concentrate here on
pragmatic testing procedures which generally do not require pretesting, statistical evaluation, or re-writing before they can be applied
in the classroom or some other educational context. Such tests can be
shown to be as appropriate to monolingual contexts as they are to
multilingual and multicultural educational settings.1
Most teachers whether in a foreign language classroom or in a
multilingual school do not have the time nor the technical
background necessary for multiple choice test development, much
less for the work that goes into the standardization of such tests.
Therefore, this book focusses on how to make, give, and evaluate valid
and reliable language tests of a pragmatic sort. Theoretical and
empirical reasons are given, however, to establish the practical
foundation and to show why teachers and educators can confidently
use the recommended testing procedures without a great deal of
prerequisite technical training. Although such training is desirable in
its own right, and is essential to the researcher in psychometry, psycholinguistics, sociolinguistics, or education per se, this book is meant as a handbook for those many teachers and educators who do
not have the time to master fully (even if that were possible) the highly
technical fields of statistics, research design, and applied linguistics.
The book is addressed to educators at the consumer end of
educational research. It tries to provide practical information
without presupposing technical expertise. Practical examples of
testing procedures are given wherever they are appropriate.
1 Since it is believed that a multilingual context is normally also multicultural, and since
it is also the case that language and culture are mutually dependent and inseparable,
the term 'multilingual' is often used as an abbreviation for the longer term
'multilingual-multicultural' in spite of the fact that the latter term is often preferred by
many authors these days.
The main criterion of success for the book is whether or not it is
useful to educators. If it also serves some psychometrists, linguists
and researchers, so much the better. It is hoped that it will fill an
important gap in the resources available to language teachers and
educators in multilingual and in monolingual contexts. Thoughtful
suggestions and criticism are invited and will be seriously weighed in
the preparation of future editions or revisions.
John Oller
Albuquerque, New Mexico, 1979

1
Introduction
A. What is a language test?
B. What is language testing research
about?
C. Organization of this book
This introduction discusses language testing in relation to education
in general. It is demonstrated that many tests which are not
traditionally thought of as language tests may actually be tests of
language more than of anything else. Further, it is claimed that this is
true both for students who are learning the language of the school as a
second language, and for students who are native speakers of the
language used at school. The correctness of this view is not a matter
that can be decided by preferences. It is an empirical issue, and
substantiating evidence is presented throughout this book. The main
point of this chapter is to survey the overall topography of the crucial issues and to consider some of the angles from which the salient points of interest can be seen in bold relief. The overall organization
of the rest of the book is also presented here.
A. What is a language test?
When the term 'language test' is mentioned, most people probably
have visions of students in a foreign language classroom poring over
a written examination. This interpretation of the term is likely
because most educated persons and most educators have had such an
experience at one time or another. Though only a few may have really
learned a second language (and practically none of them in a
classroom context), many have at least made some attempt. For
them, a language test is a device that tries to assess how much has
been learned in a foreign language course, or some part of a course.
But written examinations in foreign language classrooms are only
one of the many forms that language tests take in the schools. For any student whose native language or language variety is not used in the schools, many tests not traditionally thought of as language tests may be primarily tests of language ability. For learners who speak a minority language, whether it is Spanish, Italian, German, Navajo, Digueno, Black English, or whatever language, coping with any test in the school context may be largely a matter of coping with a language test. There are, moreover, language aspects to tests in general even for the student who is a native speaker of the language of the test.
In one way or another, practically every kind of significant testing of
human beings depends on a surreptitious test of ability to use a
particular language. Consider the fact that the psychological construct of 'intelligence' or IQ, at least insofar as it can be measured,
may be no more than language proficiency. In any case, substantial
research (see Oller and Perkins, 1978) indicates that language ability
probably accounts for the lion's share of variability in IQ tests.
It remains to be proved that there is some unique and meaningful
variability that can be associated with other aspects of intelligence
once the language factor has been removed. And, conversely, it is
apparently the case that the bulk of variability in language
proficiency test scores is indistinguishable from the variability
produced by IQ tests. In Chapter 3 we will return to consider what
meaningful factors language skill might consist of (also see the
Appendix on this question). It has not yet been shown conclusively
that there are unique portions of variability in language test scores
that are not attributable to a general intelligence factor.
Psycholinguistic and sociolinguistic research rely on language test
results in obvious ways. As Upshur (1976) has pointed out, so does
research on the nature of psychologically real grammars. Whether
the criterion measure is a reaction time to a heard stimulus, the
accuracy of attempts to produce, interpret, or recall a certain type of
verbal material, or the amount of time required to do so, or a mark on a scale indicating agreement or disagreement with a given statement,
language proficiency (even if it is not the object of interest per se) is
probably a major factor in the design, and may often be the major
factor.
The renowned Russian psychologist, A. R. Luria (1959) has argued that even motor tasks as simple as pushing or not pushing a button in response to a red or green light may be rather directly related to language skill in very young children. On a much broader scale, achievement testing may be much more a problem of language testing than is commonly thought.
For all of these reasons the problems of language testing are a large subset of the problems of educational measurement in general. The methods and findings of language testing research are of crucial importance to research concerning psychologically real grammatical systems, and to all other areas that must of necessity make assumptions about the nature of such systems. Hence, all areas of educational measurement are either directly or indirectly implicated. Neither intelligence measurement, achievement testing, aptitude assessment, personality gauging, attitude measurement, nor just plain classroom evaluation can be done without language testing. It therefore seems reasonable to suggest that educational testing in general can be done better if it takes the findings of language testing research into account.
B. What is language testing research about?
In general, the subject matter of language testing research is the use
and learning of language. Within educational contexts, the domain of
foreign language teaching is a special case of interest. Multilingual
delivery of curricula is another very important case of interest.
However, the phenomena of interest to research in language testing
are yet more pervasive.
Pertinent questions of the broader domain include: (1) How can
levels of language proficiency, stages of language acquisition (first or
second), degrees of bilingualism, or language competence be defined?
(2) How are earlier stages of language learning different from later
stages, and how can known or hypothesized differences be
demonstrated by testing procedures? (3) How can the effects of
instructional programs or techniques (or environmental changes in
general) be demonstrated empirically? (4) How are levels of language
proficiency and the concomitant social interactions that they allow or
deny related to the acquisition of knowledge in an educational
setting? This is not to say that these questions have been or even will
be answered by language testing research, but that they are indicative
of some of the kinds of issues that such research is in a unique position
to grapple with.
Three angles of approach can be discerned in the literature on
language testing research. First, language tests may be examined as
tests per se. Second, it is possible to investigate learner characteristics
using language tests as elicitation procedures. Third, specific
hypotheses about psycholinguistic and sociolinguistic factors in the
performance of language based tasks may be investigated using
language tests as research tools.
It is important to note that regarding a subject matter from one
angle rather than another does not change the nature of the subject
matter, and neither does it ensure that what can be seen from one
angle will be interpretable fully without recourse to other available
vantage points. The fact is that in language testing research, it is never
actually possible to decide to investigate test characteristics, or
learner traits, or psycholinguistic and sociolinguistic constraints on
test materials without making important assumptions about all three,
regardless of which happens to be in focus at the moment.
In this book, we will be concerned with the findings of research
from all three angles of approach. When the focus is on the tests themselves, questions of validity, reliability, practicality, and instructional value will be considered. The validity of a test is related to how well the test does what it is supposed to do, namely, to inform
us about the examinee's progress toward some goal in a curriculum or
course of study, or to differentiate levels of ability among various
examinees on some task. Validity questions are about what a test
actually measures in relation to what it is supposed to measure.
The reliability of a test is a matter of how consistently it produces
similar results on different occasions under similar circumstances.
Questions of reliability have to do with how consistently a test does
what it is supposed to do, and thus cannot be strictly separated from validity questions. Moreover, a test cannot be any more valid than it is reliable.
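The sense of this last remark can be made exact in classical test theory. The inequality below is not stated in the text here; it is the standard psychometric result (an addition of this presentation, not the author's formulation) that licenses the claim:

\[
r_{XY} \;\le\; \sqrt{r_{XX'}\, r_{YY'}} \;\le\; \sqrt{r_{XX'}}
\]

where \(r_{XY}\) is the validity coefficient (the correlation of test X with a criterion Y), and \(r_{XX'}\) and \(r_{YY'}\) are the reliability coefficients of the test and the criterion. For example, a test with reliability .64 cannot correlate with any criterion above \(\sqrt{.64} = .80\), however well the criterion is chosen.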
A test's practicality must be determined in relation to the cost in
terms of materials, time, and effort that it requires. This must include
the preparation, administration, scoring, and interpretation of the
test.
Finally, the instructional value of a test pertains to how easily it can be fitted into an educational program, whether the latter involves teaching a foreign language, teaching language arts to native
speakers, or verbally imparting subject matter in a monolingual or
multilingual school setting.
When the focus of language testing research is on learner
characteristics, the tests themselves may be viewed as elicitation
procedures for data to be subsequently analyzed. In this case, scores
on a test may be treated as summary statistics indicating various positions on a developmental scale, or individual performances may be analyzed in a more detailed way in an attempt to diagnose specific aspects of learner development. The results of the latter sort of
analysis, often referred to as 'error analysis' (Richards, 1970a, 1971)
may subsequently enter into the process of prescribing therapeutic
intervention - possibly a classroom procedure.
When the focus of language testing research is the verbal material
in the test itself, questions usually relate to the psychologically real
grammatical constraints on particular phonological (or graphological) sequences, syllable structures, vocabulary items, phrases, clauses,
and higher level patterns of discourse. Sociological constraints may also be investigated with respect to the negotiability of those elements
and sequences of them in interactional exchanges between human
beings or groups of them.
For any of the stated purposes of research, and of course there are others which are not mentioned, the tests may be relatively formal devices or informal elicitation procedures. They may require the production or comprehension of verbal sequences, or both. The language material may be heard, spoken, read, written (or possibly merely thought), or some combination of these. The task may require recognition only, or imitation, or delayed recall, memorization, meaningful conversational response, learning and long term storage, or some combination of these.
Ultimately, any attempt to apply the results of language testing
research must consider the total spectrum of tests qua tests, learner
characteristics, and the psychological and sociological constraints on
test materials. Inferences concerning psychologically real grammars
cannot be meaningful apart from the results of language tests viewed from all three angles of research outlined above. Whether or not a particular language test is valid (or better, the degree to which it is
valid or not valid), whether or not an achievement test or aptitude
test, or personality inventory, or IQ test, or whatever other sort of test
one chooses to consider is a language test, is dependent on what
language competence really is and what sorts of verbal sequences
present a challenge to that competence. This is essentially the
question that Spolsky (1968) raised in the paper entitled: 'What does
it mean to know a language? Or, How do you get someone to perform
his competence?'
C. Organization of this book
A good way, perhaps the only acceptable way, to develop a test of a given ability is to start with a clear definition of the capacity in question. Chapter 2 begins Part One on Theory and Research Bases
for Pragmatic Language Testing by proposing a definition for
language proficiency. It introduces the notion of an expectancy
grammar as a way of characterizing the psychologically real system
that governs the use of a language in an individual who knows that
language. Although it is acknowledged that the details of such a
system are just beginning to be understood, certain pervasive
characteristics of expectancy systems can be helpful in explaining
why certain kinds of language tests apparently work as well as they
do, and how to devise other effective testing procedures that take
account of those salient characteristics of functional language
proficiency.
In Chapter 3, it is hypothesized that a valid language test must
press the learner's internalized expectancy system into action and
must further challenge its limits of efficient functioning in order to
discriminate among degrees of efficiency. Although it is suggested
that a statistical average of native performance on a language test is
usually a reasonable upper bound on attainable proficiency, it is
almost always possible and is sometimes essential to discriminate
degrees of proficiency among native speakers, e.g. at various stages of
child language learning, or among children or adults learning to read,
or among language learners at any stage engaged in normal
inferential discourse processing tasks. Criterion referenced testing (where passing the test or some portion of it means being able to perform the task at some predetermined criterion level of adequacy, which may be native-like performance in some cases) is also discussed.
Pragmatic language tests are defined and exemplified as tasks that
require the meaningful processing of sequences of elements in the
target language (or tested language) under normal time constraints.
It is claimed that time is always involved in the processing of verbal
sequences.
Chapter 4 extends the discussion to questions often raised in
reference to bilingual-bicultural programs and other multilingual
contexts in general. The implications of the now famous Lau versus
Nichols case are discussed. Special attention is given to the role of
socio-cultural attitudes in language acquisition and in education in
general. It is hypothesized that other things being equal, attitudes
INTRODUCTION
7
expressed and perceived in the schools probably account for more variance in rate and amount of learning than do educational methodologies related to the transmission of the traditionally conceived curriculum. Some of the special problems that arise in multilingual contexts are considered, such as cultural bias in tests, difficulties in translating test items, and methods of assessing language dominance.
Chapter 5 concludes Part One with a discussion of the measurement of attitudes and motivations. It discusses in some detail questions related to the hypothesized relationship between attitudes
and language learning (first or second), and considers such variables
as the context in which the language learning takes place, and the
types of measurement techniques that have been used in previous
research. Several hypotheses are offered concerning the relationship
between attitudes, motivations, and achievement in education.
Certain puzzling facts about apparent interacting influences in
multilingual contexts are noted.
Part Two, Theories and Methods of Discrete Point Testing, takes up
some of the more traditional and probably less theoretically sound
ways of approaching language testing. Chapter 6 discusses some of
the difficulties associated with testing procedures that grew out of
contrastive linguistics, syntax based structure drills, and certain
assumptions about language structure and the learning of language
from early versions of transformational linguistics.
Pitfalls in relying too heavily on statistics for guiding test
development are discussed in Chapter 7. It is shown that different
theoretical assumptions may result in contradictory interpretations of the same statistics. Thus it is argued that such statistical techniques as are normally applied in test development, though helpful if used with care, should not be the chief criterion for deciding test format.
An understanding of language use and language learning must take
priority in guiding format decisions.
Chapter 8 shows how discrete point language tests may produce
distorted estimates of language proficiency. In fact, it is claimed that
some discrete point tests are probably most appropriate as measures
of the sorts of artificial grammars that learners are sometimes
encouraged to internalize on the basis of artificial contexts of certain
syntax dominated classroom methods. In order to measure communicative effectiveness for real-life settings in and out of the classroom, it is reasoned that the language tests used in the classroom (or in any educational context) must reflect certain crucial properties
of the normal use of language in ways that some discrete point tests
apparently cannot. Examples of discrete point items which attempt to
examine the pieces of language structure apart from some of their
systematic interrelationships are considered. The chapter concludes
by proposing that the diagnostic aims of discrete point tests can in
fact be achieved by so-called integrative or pragmatic tests. Hence, a
reconciliation between the apparently irreconcilable theoretical
positions is possible.
In conclusion to Part Two on discrete point testing, Chapter 9
provides a natural bridge to Part Three, Practical Recommendations
for Language Testing, by discussing multiple choice testing
procedures which may be of the discrete point type, or the integrative
type, or anywhere on the continuum in between the two extremes.
However, regardless of the theoretical bent of the test writer, multiple
choice tests require considerable technical skill and a good deal of
energy to prepare. They are in some respects less practical than some
of the pragmatic procedures recommended in Part Three precisely
because of the technical skills and the effort necessary to their
preparation. Multiple choice tests need to be critiqued by some native speaker other than the test writer. This is necessary to avoid the pitfalls of ambiguities and subtle differences of interpretation that may not be obvious to the test writer. The items need to be pretested, preferably on some group other than the population which will ultimately be tested with the finished product (this is often not feasible in classroom situations). Then, the items need to be
statistically analyzed so that non-functional or weak items can be
revised before they are used and interpreted in ways that affect
learners. In some cases, recycling through the whole procedure is
necessary even though all the steps of test development may have
been quite carefully executed. Because of these complexities and costs
of test development, multiple choice tests are not always suitable for
meeting the needs of classroom testing, or for broader institutional
purposes in some cases.
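To make the screening step just described concrete, here is a minimal sketch of the kind of item analysis alluded to (the topic is treated fully in Chapter 9). It is not taken from the book: the function name, the 0/1 scoring convention, and the simple passers-minus-failers discrimination index are illustrative assumptions only.

    # A sketch of the statistical screening described above, assuming each
    # examinee's responses have already been scored 1 (correct) or 0 (wrong).

    def item_statistics(responses):
        """responses: one list of 0/1 item scores per examinee."""
        n_examinees = len(responses)
        n_items = len(responses[0])
        totals = [sum(r) for r in responses]    # total score per examinee

        stats = []
        for i in range(n_items):
            item = [r[i] for r in responses]
            facility = sum(item) / n_examinees  # proportion answering correctly
            # Crude discrimination index: mean total score of those who got
            # the item right minus mean total score of those who got it wrong.
            right = [t for t, x in zip(totals, item) if x == 1]
            wrong = [t for t, x in zip(totals, item) if x == 0]
            if right and wrong:
                discrimination = sum(right) / len(right) - sum(wrong) / len(wrong)
            else:
                discrimination = 0.0            # everyone right, or everyone wrong
            stats.append((i + 1, facility, discrimination))
        return stats

Items with facility near 0 or 1 (nearly everyone misses or passes them), and items with low or negative discrimination, are the 'non-functional or weak items' that would be revised or discarded before scores are interpreted in ways that affect learners.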
The reader who is interested mainly in classroom applications of
pragmatic testing procedures may want to begin reading at Chapter 10 in Part Three. However, unless the material in the early chapters
(especially 2 through 9) is fairly well grasped, the basis for many of the
recommendations in Part Three will probably not be appreciated
fully. For instance, many educators seem to have acquired the
impression that certain pragmatic language tests, such as those based
on the cloze procedure for example (see Chapter 12), are 'quick and
dirty' methods of acquiring information about language proficiency.
This idea, however, is apparently based only on intuitions and is
disconfirmed by the research discussed in Part One. Pragmatic tests
are typically better on the whole than any other procedures that have
been carefully studied. Whereas the prevailing techniques of
language testing that educators are apt to be most familiar with are
based on the discrete point theories, these methods are rejected in
Part Two. Hence, were the reader to skip over to Part Three
immediately, he might be left in a quandary as to why the pragmatic
testing techniques discussed there are recommended instead of the
more familiar discrete point (and typically multiple choice) tests.
Although pragmatic testing procedures are in some cases
deceptively simple to apply, they probably provide more accurate
information concerning language proficiency (and even specific
achievement objectives) than the more familiar tests produced on the
basis of discrete point theory. Moreover, not only are pragmatic tests
apparently more valid, but they are more practicable. It simply takes
less premeditation, and less time and effort to prepare and use
pragmatic tests. This is not to say that great care and attention is not
necessary to the use of pragmatic testing procedures, quite the
contrary. It is rather to say that hour for hour and dollar for dollar
the return on work and money expended in pragmatic testing will
probably offer higher dividends to the learner, the educator, and the
taxpayer. Clearly, much more research is needed on both pragmatic
and discrete point testing procedures, and many suggestions for
possible studies are offered throughout the text.
Chapter 10 discusses some of the practical classroom applications
of the procedure of dictating material in the target language.
Variations of the technique which are also discussed include the
procedure of 'elicited imitation' employed with monolingual,
bilingual, and bidialectal children.
In Chapter 11 attention is focussed on a variety of procedures requiring the use of productive oral language skills. Among the ones discussed are reading aloud, story retelling, dialogue dramatization, and conversational interview techniques; the Foreign Service Institute Oral Interview, the Ilyin Oral Interview, the Upshur Oral Communication Test, and the Bilingual Syntax Measure are also discussed specifically.
The increasingly widely used cloze procedure and variations of it are considered in Chapter 12. Basically the procedure consists of deleting words from prose (or auditorily presented material) and asking the examinee to try to replace the missing words. Because of the simplicity of application and the demonstrated validity of the technique, it has become quite popular in recent years. However, it is probably not any more applicable to classroom purposes than some of the procedures discussed in other chapters. The cloze procedure is sometimes construed to be a measure of reading ability, though it may be just as much a measure of writing, listening and speaking ability (see Chapter 3, and the Appendix).
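As a concrete illustration of the deletion step just described, the sketch below builds a fixed-ratio cloze test by blanking every nth word. It is an illustrative assumption, not a procedure taken from the book: the function name, the choice of n = 7, and the sample passage are invented for the example (Chapter 12 discusses the actual options for deletion rates and scoring).

    # A minimal fixed-ratio cloze construction: every nth word is deleted
    # and replaced with a uniform blank; the deleted words become the key.

    def make_cloze(text, n=7, blank="______"):
        words = text.split()
        key = []
        for idx in range(n - 1, len(words), n):
            key.append(words[idx])   # keep the deleted word for scoring
            words[idx] = blank
        return " ".join(words), key

    passage = ("The boys went down to the river to fish. They took their "
               "poles and a basket with bread and cheese for their lunch.")
    test, answers = make_cloze(passage)
    # 'test' is the passage with every seventh word blanked;
    # 'answers' lists the exact words removed, in order.

Scoring may count only the exact deleted word as correct, or accept any contextually appropriate replacement; both conventions appear in the cloze literature, and the choice affects interpretation of the scores.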
Chapter 13 looks at writing tasks per se. It gives considerable space
to ways of approaching the difficulties of scoring relatively free essays. Alternative testing methods considered include various
methods of increasing the constraints on the range of material that
the examinee may produce in response to the test procedure. Controls
range from suggesting topics for an essay to asking examinees to
rephrase heard or read material after a time lapse. Obviously, many
other control techniques are possible, and variations on the cloze
procedure can be used to construct many of them.
Chapter 14 considers ways in which testing procedures can be related to curricula. In particular it asks, 'How can effective testing procedures be invented or adapted to the needs of an instructional program?' Some general guidelines are tentatively suggested for both developing or adapting testing procedures and for studying their effectiveness in relation to particular educational objectives. To illustrate one of the ways that curriculum (learning, teaching, and
testing) can be related to a comprehensive sort of validation research,
the Mount Gravatt reading research project is discussed. This project
provides a rich source of data concerning preschool children, and
children in the early grades. By carefully studying the kinds of
language games that children at various age levels can play and win
(Upshur, 1973), that is, the kinds of things that they can explore
verbally and with success, Norman Hart, Richard Walker, and their
colleagues have provided a model for relating theories of language learning via careful research to the practical educational task of teaching reading. There are, of course, spin-off benefits to all other areas of the curriculum because of the fundamental part played by language use in every area of the educational process. It is strongly urged that language testing procedures, especially for assessing the language skills of children, be carefully examined in the light of such research.
Throughout the text, wherever technical projects are referred to, details of a technical sort are either omitted or are explained in non-technical language. More complete research reports are often referred to in the text (also see the Appendix) and should be consulted
by anyone interested in applying the recommendations contained
here to the testing of specific research hypotheses. However, for
classroom purposes (where at least some of the technicalities of
research design are luxuries) the suggestions offered here are intended
to be sufficient. Additional readings, usually of a non-technical sort,
are suggested at the end of each chapter. A complete list of technical
reports and other works referred to in the text is included at the end of
the book in a separate Bibliography. The latter includes all of the
Suggested Readings at the end of each chapter, plus many items not
given in the Suggested Readings lists. An Appendix reporting on a recent empirical study of many of the pressing questions raised in the
body of the text is included at the end. The fundamental question
addressed there is whether or not language proficiency can be parsed
up into components, skills, and the like. Further, it is asked whether
language proficiency is distinct from IQ, achievement, and other
educational constructs. The Appendix is not included in the body of
the text precisely because it is somewhat technical.
It is to be expected that a book dealing with a subject matter that is
changing as rapidly as the field of language testing research should
soon be outdated. However, it seems that most current research is
pointing toward the refinement of existing pragmatic testing
procedures and the discovery of new ones and new applications. It
seems unlikely that there will be a substantial return to the strong
versions of discrete point theories and methods of the 1960s and early
1970s. In any event the emphasis here is on pragmatic language
testing because it is believed that such procedures offer a richer yield
of information.
KEY POINTS
1. Any test that challenges the language ability of an examinee can, at least in part, be construed as a language test. This is especially true for examinees who do not know or normally use the language variety of the test, but is true in a broader sense for native speakers of the language of the test.
2. It is not known to what extent language ability may be co-extensive with IQ, but there is evidence that the relationship is a very strong one. Hence, IQ tests (and many other varieties of tests as well) may be tests of language ability more than they are tests of anything else.
3. Language testing is crucial to the investigation of psychologically real grammars, to research in all aspects of distinctively human symbolic
I
F
12
INTRODUCTION
LANGUAGE TESTS AT SCHOOL
behavior, and to educational measurement in general.
4. Language testing research may focus on tests, learners, or constraints on verbal materials.
5. Among the questions of interest to language testing research are: (a) how to operationally define levels of language proficiency, stages of language learning, degrees of bilingualism, or linguistic competence; (b) how to differentiate stages of learning; (c) how to measure possible effects of instruction (or other environmental factors) on language learning; (d) how language ability is related to the acquisition of knowledge in an educational setting.
6. When the research is focussed on tests, validity, reliability, practicality,
and instructional value are among the factors of interest.
7. When the focus is on learners and their developing language systems,
tests may be viewed as elicitation procedures. Data elicited may then be
analyzed with a view toward providing detailed descriptions of learner
systems, and/or diagnosis of teaching procedures (or other therapeutic
interventions) to facilitate learning.
8. When research is directed toward the verbal material in a given test or
testing procedure, the effects of psychological or sociological constraints
built into the verbal sequences themselves (or constraints which are at
least implicit to language users) are at issue.
9. From all of the foregoing it follows that the results of language tests, and the findings of language testing research, are highly relevant to psychological, sociological, and educational measurement in general.
DISCUSSION QUESTIONS
1. What tests are used in your school that require the comprehension or production of complex sequences of material in a language? Reading achievement tests? Aptitude tests? Personality inventories? Verbal IQ tests? Others? What evidence exists to show that the tests are really measures of different things?
2. Are tests in your school sometimes used for examinees whose native
language (or language variety) is not the language (or language variety)
used in the test? How do you think such tests ought to be interpreted?
3. Is it possible that a non-verbal test of IQ could have a language factor
unintentionally built into it? How are the instructions given? What
strategies do you think children or examinees must follow in order to do
the items on the test? Are any of those strategies related to their ability to
code information verbally? To give subvocal commands?
4. In what ways do teachers normally do language testing (unintentionally)
in their routine activities? Consider the kinds of instructions children or
adults must execute in the classroom.
5. Take any standardized test used in any school setting. Analyze it for its level of verbal complexity. What instructions does it use? Are they more or less complex than the tasks which they define or explain? For what age level of children or for what proficiency level of second language learners would such instructions be appropriate? Is guessing necessary in order to understand the instructions of the test?
6. Consider any educational research project that you know of or have access to. What sorts of measurements did the research use? Was there a testing technique? An observational or rating procedure? A way of recording behaviors? Did language figure in the measurements taken?
7. Why do you think language might or might not be related to capacity to perform motor tasks, particularly in young children? Consider finding your way around town, or around a new building, or around the house. Do you ever use verbal cues to guide your own stops, starts, turns? Subvocal ones? How about in a strange place or when you are very tired? Do you ever ask yourself things like, Now what did I come in here for?
8. Can you conceive of any way to operationalize notions like language competence, degree of bilingualism, stages of learning, effectiveness of language teaching, rate of learning, level of proficiency without language tests?
9. If you were to rank the criteria of validity, reliability, practicality and instructional value in their order of importance, what order would you put them in? Consider the fact that validity without practicality is certainly possible. The same is true for validity without instructional value. How about instructional value without validity?
10. Do you consider the concept of intelligence or IQ to be a useful theoretical construct? Do you believe that researchers and theorists know what they mean by the term apart from some test score? How about grammatical knowledge? Is it the same sort of construct?
11. Can you think of any way(s) that time is normally involved in a task like reading a novel - when no one is holding a stop-watch?
SUGGESTED READINGS
1. George A. Miller, 'The Psycholinguists,' Encounter 23, 1964, 29-37. Reprinted in Charles E. Osgood and Thomas A. Sebeok (eds.) Psycholinguistics: A Survey of Theory and Research Problems. Bloomington, Indiana: Indiana University, 1965.
2. John W. Oller, Jr. and Kyle Perkins, Language in Education: Testing the
Tests. Rowley, Massachusetts: Newbury House, 1978.
3. Bernard Spolsky, 'Introduction: Linguists and Language Testers' in
Advances in Language Testing: Series 2, Approaches to Language
Testing. Arlington, Virginia: Center for Applied Linguistics, 1978, v-x.
2
Language Skill
as a Pragmatic Expectancy
Grammar
A. What is pragmatics about?
B. The factive aspect of language use
C. The emotive aspect
D. Language learning as grammar construction and modification
E. Tests that invoke the learner's grammar
Understanding what is to be tested is prerequisite to good testing of any sort. In this chapter, the object of interest is language as it is used for communicative purposes - for getting and giving information about (or for bringing about changes in) facts or states of affairs, and for expressing attitudes toward those facts or states of affairs. The notion of expectancy is introduced as a key to understanding the nature of psychologically real processes that underlie language use. It is suggested that expectancy generating systems are constructed and modified in the course of language acquisition. Language proficiency is thus characterized as consisting of such an expectancy generating system. Therefore, it is claimed that for a proposed measure to qualify as a language test, it must invoke the expectancy system or grammar of the examinee.
A. What is pragmatics about?
The newscaster in Albuquerque who smiled cheerfully while speaking of traffic fatalities, floods, and other calamities was expressing an entirely different attitude toward the facts he was referring to than was probably held by the friends and relatives of the victims, not to
mention the more compassionate strangers who were undoubtedly
watching him on the television. There might not have been any
disagreement about what the facts were, but the potent contrast in the
attitudes of the newscaster and others probably accounts for the
brevity of the man's career as the station's anchorman.
Thus, two aspects of language use need to be distinguished.
Language is usually used to convey information about people, things,
events, ideas, states of affairs, and attitudes toward all of the
foregoing. It is possible for two or more people to agree entirely about
the facts referred to or the assumptions implied by a certain statement
but to disagree markedly in their attitudes toward those facts. The
newscaster and his viewers probably disagreed very little or not at all
concerning the facts he was speaking of. It was his manner of
speaking (including his choice of words) and the attitude conveyed by
it that probably shortened his career.
Linguistic analysis has traditionally been concerned mainly with what might be called the factive (or cognitive) aspect of language use. The physical stuff of language which codes factive information usually consists of sequences of distinctive sounds which combine to form syllables, which form words, which get hooked together in highly constrained ways to form phrases, which make up clauses, which also combine in highly restricted ways to yield the incredible diversity of human discourse. By contrast, the physical stuff of language which codes emotive (or affective, attitudinal) information usually consists of facial expression, tone of voice, and gesture. Psychologists and sociologists have often been interested more in the emotive aspect of language than in the cognitive complexities of the factive aspect. Cognitive psychology and linguistics, along with philosophy and logic, on the other hand, have concentrated on the latter.
Although the two aspects are intricately interrelated, it is often
useful and sometimes essential to distinguish them. Consider, for
instance, the statement that Some of Richard's lies have been
discovered. This remark could be taken to mean that there is a certain
person named Richard (whom we may infer to be a male human),
who is guilty of lying on more than one or two occasions, and some of
whose lies have been found out. In addition, the remark implies that
there are other lies told by Richard which may be uncovered later.
Such a statement relates in systematic ways to a speaker's asserted
beliefs concerning certain states of affairs. Of course, the speaker may
be lying, or sincere but mistaken, or sincere and correct, and these are
only some of the many possibilities. In any case, however, as persons who know English, we understand the remark about Richard partly by inferring the sorts of facts it would take to make such a statement true. Such inferences are not perfectly understood, but there is no doubt that language users make them. A speaker (or writer) must make them in order to know what his listener (or reader) will probably understand, and the listener (or reader) must make them in order to know what a speaker (or writer) means.
In addition to the factive information coded in the words and phrases of the statement, a person who utters that statement may convey attitudes toward the asserted or implied states of affairs, and may further code information concerning the way the speaker thinks the listener should feel about those states of affairs. For instance, the speaker may appear to hate Richard, and to despise his lies (both those that have already been discovered and the others not yet found out), or he may appear detached and impersonal. In speaking, such emotive effects are achieved largely by facial expression, tone of voice, and gesture, but they may also be achieved in writing by describing the manner in which a statement is made or by skillful choice of words. The latter, of course, is effective either in spoken or written discourse as a device for coding emotive information. Notice the change in the emotive aspect if the word lies is replaced by half-truths: Some of Richard's half-truths have been discovered. The disapproval is weakened still further if we say: Some of Richard's mistakes have been discovered, and further still if we change mistakes to errors of judgement.
In the normal use of language it is possible to distinguish two major kinds of context. First, there is the physical stuff of language which is organized into a more or less linear arrangement of verbal elements skillfully and intricately interrelated with a sequence of rather precisely timed changes in tone of voice, facial expression, body posture, and so on. To call attention to the fact that in human beings even the latter so-called 'paralinguistic' devices of communication are an integral part of language use, we may refer to the verbal and gestural aspects of language in use as constituting linguistic context. With reference to speech it is possible to decompose linguistic context into verbal and gestural contexts. With reference to writing, the terms linguistic context and verbal context may be used interchangeably.
A second major type of context has to do with the world, outside of language, as it is perceived by language users in relation to themselves and valued other persons or groups. We will use the term
extralinguistic context to refer to states of affairs constituted by things, events, people, ideas, relationships, feelings, perceptions, memories, and so forth. It may be useful to distinguish objective aspects of extralinguistic context from subjective aspects. On the one hand, there is the world of existing things, events, persons, and so forth, and on the other, there is the world of self-concept, other-concept, interpersonal relationships, group relationships, and so on. In a sense, the two worlds are part of a single totality for any individual, but they are not necessarily so closely related. Otherwise, there would be no need for such terms as schizophrenia, or paranoia.
Neither linguistic nor extralinguistic contexts are simple in themselves, but what complicates matters still further and makes meaningful communication possible is that there are systematic correspondences between linguistic contexts and extralinguistic ones. That is, sequences of linguistic elements in normal uses of language are not haphazard in their relation to people, things, events, ideas, relationships, attitudes, etc., but are systematically related to states of affairs outside of language. Thus we may say that linguistic contexts are pragmatically mapped onto extralinguistic contexts, and vice versa.
We can now offer a definition of pragmatics. Briefly, it addresses the question: how do utterances relate to human experience outside of language? It is concerned with the relationships between linguistic contexts and extralinguistic contexts. It embraces the traditional subject matter of psycholinguistics and also that of sociolinguistics. Pragmatics is about how people communicate information about facts and feelings to other people, or how they merely express themselves and their feelings through the use of language for no particular audience, except possibly an omniscient God. It is about how meaning is both coded and in a sense invented in the normal intercourse of words and experience (to borrow a metaphor from Dewey, 1929).
B. The factive aspect of language use
Language, when it is used to convey information about facts, is
always an abbreviation for a richer conceptualization. We know
more about objects, events, people, relationships, and states of affairs
than we are ever fully able to express in words. Consider the difficulty
of saying all you know about the familiar face of a friend. The fact is
that your best effort would probably fail to convey enough
information to enable someone else to single out your friend in a large crowd. This simply illustrates the fact that you know more than you are able to say.
Here is another example. Not long ago, a person whom I know very well was involved in an accident. He was riding a ten-speed bicycle around a corner when he was hit head-on by a pick-up. The driver of the truck cut the corner at about thirty miles an hour leaving no room for the cyclist to pass. The collision was inevitable. Blood gushed from a three inch gash in the top of his head and a blunt handlebar was rammed nearly to the bone in his left thigh. From this description you have some vivid impressions about the events referred to. However, no one needs to point out the fact that you do not know as much about the events referred to as the person who experienced them. Some of what you do know is a result of the linguistic context of this paragraph, and some of what you know is the result of inferences that you have correctly made concerning what it probably feels like to have a blunt object rammed into your thigh, or to have a three inch gash in your head, but no matter how much you are told or are able to infer it will undoubtedly fall short of the information that is available to the person who experienced the events in his own flesh. Our words are successful in conveying only part of the information that we possess.
Whenever we say anything at all we leave a great deal more unsaid. We depend largely for the effect of our communications not only on what we say but also on the creative ability of our listeners to fill in what we have left unsaid. The fact is that a normal listener supplies a great deal of information by creative inference and in a very important sense is always anticipating what the speaker will say next. Similarly, the speaker is always anticipating what the listener will infer and is correcting his output on the basis of feedback received from the listener. Of course, some language users are more skillful in such things than others.
We are practically always a jump or two ahead of the person that we are listening to, and sometimes we even outrun our own tongues when we are speaking. It is not unusual in a speech error for a speaker to say a word several syllables ahead of what he intended to say, nor is it uncommon for a listener to take a wrong turn in his thinking and fail to understand correctly, simply because he was expecting something else to be said.
It has been shown repeatedly that tampering with the speaker's own feedback of what he is saying has striking debilitating effects
(Chase, Sutton, and First, 1959). The typical experiment illustrating this involves delayed auditory feedback or sidetone. The speaker's voice is recorded on a tape and played back a fraction of a second later into a set of headphones which the speaker is wearing. The result is that the speaker hears not what he is saying, but what he has just said a fraction of a second earlier. He invariably stutters and distorts syllables almost beyond recognition. The problem is that he is trying to compensate for what he hears himself saying in relation to what he expects to hear. After some practice, it is possible for the speaker to ignore the delayed auditory feedback and to speak normally by attending instead to the so-called kinesthetic feedback of the movements of the vocal apparatus and presumably the bone-conducted vibrations of the voice.
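For readers who find a computational sketch helpful, the delayed-sidetone arrangement can be mimicked in a few lines of Python. The sketch below is ours, not part of the Chase, Sutton, and First procedure; the sample rate and the delay figure are illustrative assumptions.

# A minimal sketch (illustrative, not the original apparatus): the
# "speaker's voice" is a sequence of samples, and the headphone feed
# is the same sequence shifted by a fraction of a second.

SAMPLE_RATE = 16000          # samples per second (assumed)
DELAY_SECONDS = 0.2          # a "fraction of a second" of delay (assumed)

def delayed_feedback(signal, sample_rate=SAMPLE_RATE, delay=DELAY_SECONDS):
    """Return what the speaker hears: silence for the first `delay`
    seconds, then everything he said `delay` seconds earlier."""
    shift = int(sample_rate * delay)
    return [0.0] * shift + list(signal[:len(signal) - shift])

# At any instant t the speaker produces signal[t] but hears
# delayed[t] == signal[t - shift]; the mismatch between the two is
# what disrupts his expectancy-based monitoring of his own speech.
voice = [float(i) for i in range(10)]   # stand-in for audio samples
print(delayed_feedback(voice, sample_rate=10, delay=0.3))
# -> [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]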
The pervasive importance of expectations in the processing of all sorts of information is well illustrated in the following remark by the world renowned neurophysiologist, Karl Lashley:
... The organization of language seems to me to be characteristic of almost all other cerebral activity. There is a series of hierarchies of organization; the order of vocal movements in pronouncing the word, the order of words in the sentence, the order of sentences in the paragraph, the rational order of paragraphs in a discourse. Not only speech, but all skilled acts seem to involve the same problems of serial ordering, even down to the temporal coordinations of muscular contractions in such a movement as reaching and grasping (1951, p. 187).
A major aspect of language use that a good theory must explain is that there is, in Lashley's words, 'a series of hierarchies of organization.' That is, there are units that combine with each other to form higher level units. For instance, the letters in a written word combine to form the written word itself. The word is not a letter and the letter is not a word, but the one unit acts as a building block for the other, something like the way atoms combine to form molecules. Of course, atoms consist of their own more elementary building blocks and molecules combine in complex ways to become the building blocks of a great diversity of higher order substances.
Words make phrases, and the phrases carry new and different meanings which are not part of the separate words of which they are made. For instance, consider the meanings of the words head, red, the, and beautiful. Now consider their meanings again in the phrase the beautiful redhead as in the sentence, She's the beautiful redhead I've been telling you about. At each higher level in the hierarchy, as John
Dewey (1929) put it, new meanings are bred from the copulating forms. This in a nutshell is the basis of the marvelous complexity and novelty of language as an instrument for coding information and for conceptualizing.
Noam Chomsky, eminent professor of linguistics at the Massachusetts Institute of Technology, is mainly responsible for the emphasis in modern linguistics on the characteristic novelty of sentences. He has argued convincingly (cf. especially Chomsky, 1972) that novelty is the rule rather than the exception in the everyday use of language. If a sentence is more than part of a ritual verbal pattern (such as, 'Hello. How are you?'), it is probably a novel concoction of the speaker which probably has never been heard or said by him before. As George Miller (1964) has pointed out, a conservative estimate of the number of possible twenty-word sentences in English is on the order of the number of seconds in one hundred million centuries. Although sentences may share certain structural features, any particular one that happens to be uttered is probably invented new on the spot. The probability that it has been heard before and memorized is very slight.
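The magnitude Miller invokes is easy to check. The short computation below is ours, offered only to make the arithmetic concrete; the 'ten admissible choices per position' figure in the second step is an assumption for illustration, not Miller's estimate.

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60     # about 3.16e7 seconds
centuries = 100_000_000                       # one hundred million centuries
seconds = centuries * 100 * SECONDS_PER_YEAR
print(f"{seconds:.2e} seconds")               # about 3.16e17

# Even a toy combinatorial model reaches such magnitudes: with only
# ten admissible word choices at each of twenty positions (an assumed
# figure, for illustration) there are already 10**20 distinct strings,
# more than the seconds figure above.
print(f"{10**20:.2e} candidate twenty-word strings")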
The novelty of language, however, is a kind of freedom within limits. When the limits on the creativity allowed by language are violated, many versions of nonsense result. They may range from unpronounceable sequences like gbntmbwk (unpronounceable in English at least) to pronounceable nonsense such as nox ems glerf onmo kebs (from Osgood, 1955). They may be syntactically acceptable but semantically strange concoctions like the much overused example of Jabberwocky or Chomsky's (now trite) illustrative sentence, Colorless green ideas sleep furiously.1
A less well known passage of nonsense was invented by Samuel Foote, one of the best known humorists of the nineteenth century, in order to prove a point about the organization of memory. Foote had been attending a series of lectures by Charles Macklin on oratory. On one particular evening, Mr Macklin boasted that he had mastered the principles of memorization so thoroughly that he could repeat any paragraph by rote after having read it only once. At the end of the lecture, the unpredictable Foote handed Mr Macklin a piece of paper on which he had composed a brief paragraph during the lecture. He asked Mr Macklin to kindly read it aloud once to the audience and then to repeat it from memory. So Mr Macklin read:
So she went into the garden to cut a cabbage leaf to make an apple pie; and at the same time a great she-bear coming up the street pops its head in the shop. 'What! No soap!' So he died, and she very imprudently married the barber: and there were present the Picninnies, the Joblillies, and the Garcelies, and the Great Panjandrum himself, with the little round button at the top, and they all fell to playing the game of catch-as-catch-can, till the gunpowder ran out the heels of their boots (Samuel Foote, ca. 1854; see Cooke, 1902, p. 221f).2
The incident probably improved Mr Macklin's modesty and it surely instructed him on the importance of the reconstructive aspects of verbal memory. We don't just happen to remember things in all their detail; rather, we remember a kind of skeleton, or possibly a whole hierarchy of skeletons, to which we attach the flesh of detail by a creative and reconstructive process. That process, like all verbal and cognitive activities, is governed largely by what we have learned to expect. The fact is that she-bears rarely pop their heads into barber shops, nor do people cut cabbage leaves to make apple pies. For such reasons, Foote's prose is difficult to remember: the contexts, outside of language, which are mapped by his choice of words are odd contexts in themselves. Otherwise, the word sequences are grammatical enough.
Perhaps the importance of our normal expectancies concerning words and what they mean is best illustrated by nonsense which violates those expectancies. The sequence gbntmbwk forces on our attention things that we know only subconsciously about our language - for example, the fact that g cannot immediately precede b at the beginning of a word, and that syllables in English must have a vowel sound in them somewhere (unless shhhh! is a syllable). These are facts we know because we have acquired an expectancy grammar for English.
1 At one Summer Linguistics Institute, someone had bumper stickers printed up with Chomsky's famous sentence. One of them found its way into the hands of my brother, D. K. Oller, and eventually onto my bumper to the consternation of many Los Angeles motorists.
2 I am indebted to my father, John Oller, Sr., for this illustration. He used it often in his talks on language teaching to show the importance of meaningful sequence to recall and learning. He often attributed the prose to Mark Twain, and it is possible that Twain used this same piece of wit to debunk a supposed memory expert in a circus contest as my father often claimed. I have not been able to document the story about Twain, though it seems characteristic of him. No doubt he knew of Foote and may have consciously imitated him.
Our internalized grammar tells us that glerf is a possible word in English. It is pronounceable and is parallel to words that exist in the language such as glide, serf, and slurp; still, glerf is not an English word. Our grammatical expectancies are not completely violated by Lewis Carroll's phrase, the frumious bandersnatch, but we recognize this as a novel creation. Our internalized grammar causes us to suppose that frumious must be an adjective that modifies the noun bandersnatch. We may even imagine a kind of beast that, in the context, might be referred to as a frumious bandersnatch. Our inferential construction may or may not resemble anything that Carroll had in mind, if in fact he had any 'thing' in mind at all. The inference here is similar to supposing that Foote was referring to Macklin himself when he chose the phrase the Great Panjandrum himself, with the little round button at the top. In either case, similar grammatical expectancies are employed.
But it may be objected that what we are referring to here as grammatical involves more than what is traditionally subsumed under the heading grammar. However, we are not concerned here with grammar in the traditional sense as being something entirely abstract and unrelated to persons who know languages. Rather, we are concerned with the psychological realities of linguistic knowledge as it is internalized in whatever ways by real human beings. By this definition of grammar, the language user's knowledge of how to map utterances pragmatically onto contexts outside of language and vice versa (that is, how to map contexts onto utterances) must be incorporated into the grammatical system. To illustrate, the word horse is effective in communicative exchanges if it is related to the right sort of animal. Pointing to a giraffe and calling it a horse is not an error in syntax, nor even an error in semantics (the speaker and listener may both know the intended meaning). It is the pragmatic mapping of a particular exemplar of the category GIRAFFE (as an object or real world thing, not as a word) that is incorrect. In an important sense, such an error is a grammatical error.
The term expectancy grammar calls attention to the peculiarly sequential organization of language in actual use. Natural language is perhaps the best known example of the complex organization of elements into sequences and classes, and sequences of classes which are composed of other sequences of classes and so forth. The term pragmatic expectancy grammar further calls attention to the fact that the sequences of classes of elements, and hierarchies of them, which constitute a language are available to the language user in real life situations because they are somehow indexed with reference to their appropriateness to extralinguistic contexts.
In the normal use of language, no matter what level of language or mode of processing we think of, it is always possible to predict partially what will come next in any given sequence of elements. The elements may be sounds, syllables, words, phrases, sentences, paragraphs, or larger units of discourse. The mode of processing may be listening, speaking, reading, writing, or thinking, or some combination of these. In the meaningful use of language, some sort of pragmatic expectancy grammar must function in all cases.
A wide variety of research has shown that the more grammatically predictable a sequence of linguistic elements is, the more readily it can be processed. For instance, a sequence of nonsensical syllables as in the example from Osgood, Nox ems glerf onmo kebs, is more difficult than the same sequence with a more obvious structure imposed on it, as in The nox ems glerfed the onmo kebs. But the latter is still more difficult to process than, The bad boys chased the pretty girls. It is easy to see that the gradation from nonsense to completely acceptable sequences of meaningful prose can vary by much finer degrees, but these examples serve to illustrate that as sequences of linguistic elements become increasingly more predictable in terms of grammatical organization, they become easier to handle.
Not only are less constrained sequences more difficult than more
constrained ones, but this generalization holds true regardless of
which of the four traditionally recognized skills we are speaking of. It
is also true for learning. In fact, there is considerable evidence to
suggest that as organizational constraints on linguistic sequences are
increased, ease of processing (whether perceiving, producing,
learning, recalling, etc.) increases at an accelerating rate, almost
exponentially. It is as though our learned expectations enable us to lie
in wait for elements in a highly constrained linguistic context and
make much shorter work of them than would be possible if they took
us by surprise.
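The point can be given a rough computational analogue. The sketch below is our illustration, not a procedure proposed in this book: it trains a tiny bigram model on an invented fragment of text and scores sentences by the familiarity of their word-to-word transitions. The constrained sentence scores high; the Osgood nonsense bottoms out at an arbitrary floor value.

from collections import Counter

# An invented training fragment (an assumption, for illustration).
training = ("the bad boys chased the pretty girls . "
            "the pretty girls liked the bad boys .").split()

bigrams = Counter(zip(training, training[1:]))   # transition counts
unigrams = Counter(training[:-1])                # first-word counts

def mean_transition_prob(sentence, floor=0.001):
    """Average P(next word | current word), with a small floor for
    transitions the model has never seen."""
    words = sentence.split()
    probs = [bigrams[(a, b)] / unigrams[a] if unigrams[a] else floor
             for a, b in zip(words, words[1:])]
    return sum(max(p, floor) for p in probs) / len(probs)

print(mean_transition_prob("the bad boys chased the pretty girls"))  # 0.75
print(mean_transition_prob("nox ems glerf onmo kebs"))               # 0.001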
As we have been arguing throughout this chapter, the constraints
on what may follow in a given sequence of linguistic elements go far
beyond the traditionally recognized grammatical ones, and they
operate in every aspect of our cognition. In his treatise on thinking,
John Dewey (1910) argued that the 'central factor in thinking' is an
element of expectancy. He gives an example of a man strolling along
on a warm day. Suddenly, the man notices that it has become cool.
It occurs to him that it is probably going to rain; looking up, he sees a
dark cloud between himself and the sun. He then quickens his steps
(p. 61). Dewey goes on to define thinking as 'that operation in which
present facts suggest other facts (or truths) in such a way as to induce belief in the latter upon the ground or the warrant of the former' (p. 8f).

C. The emotive aspect

To this point, we have been concerned primarily with the factive aspect of language and cognition. However, much of what has been said applies as well to the emotive aspect of language use. Nonetheless there are contrasts in the coding of the two types of information. While factive information is coded primarily in distinctly verbal sequences, emotive information is coded primarily in gestures, tone of voice, facial expression, and the like. Whereas verbal sequences consist of a finite set of distinctive sounds (or features of sounds), syllables, words, idioms, and collocations, and generally of discrete and countable sequences of elements, the emotive coding devices are typically non-discrete and are more or less continuously variable.
For example, a strongly articulated p in the word pat hardly changes the meaning of the word, nor does it serve much better to distinguish a pat on the head from a cat in the garage. Shouting the word garage does not imply a larger garage, nor would whispering it necessarily change the meaning of the word in terms of its factive value. Either you have a garage to talk about or you don't, and there isn't much point in distinguishing cases in between the two extremes. With emotive information things are different. A wildly brandished fist is a stronger statement than a mere clenched fist. A loud shout means a stronger degree of emotion than a softer tone. In the kinds of devices typically used to code emotive information, variability in the strength of the symbol is analogically related to similar variability in the attitude that is symbolized.
In both speaking and writing, choice of words also figures largely in the coding of attitudinal information. Consider the differences in the attitudes elicited by the following sentences: (1) Some people say it is better to explain our point of view as well as give the news; (2) Some people say it is better to include some propaganda as well as give the news. Clearly, the presuppositions and implications of the two sentences are somewhat different, but they could conceivably be used in reference to exactly the same immediate states of affairs or extralinguistic contexts. Of the people polled in a certain study, 42.8% agreed with the first, while only 24.7% agreed with the second (Copi, 1972, p. 70). The 18.1% difference is apparently attributable to the difference between explaining and propagandizing.
Although the facts referred to by such terms as exaggerating and
lying may be the same facts in certain practical cases, the attitudes
expressed toward those facts by selecting one or the other term are
quite different. Indeed, the accompanying facial expression and tone
of voice may convey attitudinal information so forcefully as to even
contradict the factive claims of a statement. For instance, the teacher
who says to a young child in an irritated tone of voice to 'Never mind
about the spilled glue! It won't be any trouble to clean it up!' conveys
one message factively and a very different one emotively. In this case,
as in most such cases, the tone of voice is apt to speak louder than the
words. Somehow we are more willing to believe a person's manner of
speaking than whatever his words purport to say.
It is as if emotively coded messages were higher ranking and
therefore more authoritative messages. As Watzlawick, Beavin, and
Jackson (1967) point out in their book on the Pragmatics of Human
Communication, it is part of the function of emotive messages to
provide instructions concerning the interpretation of factively coded
information. Whereas the latter can usually be translated into
propositional forms such as 'This is what I believe is true', or 'This is
what I believe is not true', the former can usually be translated into
propositional forms about interpersonal relationships, or about how
certain factive statements are to be read. For instance, a facetious
remark may be said in such a manner that the speaker implies 'Take
this remark as a joke' or 'I don't really believe this, and you shouldn't
either.' At the same time, people are normally coding information
emotively about the way they see each other as persons. Such
messages can usually be translated into such propositional forms as
'This is the way I see you' or 'This is the way I see myself' or 'This is
the way I see you seeing me' and so on.
Although attitudes toward the self, toward others, and toward the things that the self and others say may be more difficult to pin down than are tangible states of affairs, they are nonetheless real. In fact, Watzlawick et al. contend that emotive messages concerning such abstract aspects of interpersonal realities are probably much more important to the success of communicative exchanges than the factively coded messages themselves. If the self in relationship to others is satisfactorily defined, and if the significant others in interactional relationships confirm one's definition of self and others, communication concerning factive information can take place. Otherwise, relationship struggles ensue. Marital strife over whether or not one party loves the other, children's disputes about who said
what and whether or not he or she meant it, labor and management
disagreements about fair wages, and the arms race between the major
world powers, are all examples of breakdowns in factive communication once relationship struggles begin.
What is very interesting for a theory of pragmatic expectancy grammar is that in normal communication, ways of expressing attitudes are nearly perfectly coordinated with ways of expressing factive information. As a person speaks, boundaries between linguistic segments are nearly perfectly synchronized with changes in bodily postures, gestures, tone of voice, and the like. Research by Condon and Ogston (1971) has shown that the coordination of gestures and verbal output is so finely grained that even the smallest movements of the hands and fingers are nearly perfectly coincident with boundaries in linguistic segments clear down to the level of the phoneme. Moreover, through sound recordings and high resolution motion photography they have been able to demonstrate that when the body movements and facial gestures of a speaker and hearer are segmented and displayed consecutively, the speaker and hearer look 'like puppets moved by the same set of strings' (p. 158).
The demonstrated coordination of mechanisms that usually code factive information and devices that usually code emotive information shows that the anticipatory planning of the speaker and the expectations of the listener must be in close harmony in normal communication. Moreover, from the fact that they are so synchronized we may infer something of the complexity of the advance planning and hypothesizing that normal internalized grammatical systems must enable language users to accomplish. Static grammatical devices which do not incorporate an element of real time would seem hard put to explain some of the empirical facts which demand explanation. Some sort of expectancy grammar, or a system incorporating temporal constraints on linguistic contexts, seems to be required.
D. Language learning as grammar construction and modification
In a sense language is something that we learn, and in another it is a medium through which learning occurs. Colin Cherry (1957) has said that we never feel we have fully grasped an idea until we have 'jumped on it with both verbal feet.' This seems to imply that language is not just a means of expressing ideas that we already have, but rather that it is a means of discovering ideas that we have not yet fully discovered. John Dewey argued that language was not just a means of 'expressing antecedent thought', rather that it was a basis for the very act of creative thinking itself. He wryly observed that the things that a person says often surprise himself more than anyone else. Alice in Through the Looking Glass seems to have the same thought instilled in her own creative imagination through the genius of Lewis Carroll. She asks, 'How can I know what I am going to say until I have already said it?'
Because of the nature of human limitations and because of the complexities of our universe of experience, in order to cope with the vastness of the diversity, the mind categorizes and systematizes elements into hierarchies and sequences of them. Not only is the universe of experience more complex than we can perceive it to be at a given moment of time, but the depths of our memories have registered untold millions of details about previous experience that are beyond the grasp of our present consciousness.
Our immediate awareness can be thought of as an interface
between external reality and the mind. It is like a corridor of activity
where incoming elements of experience are processed and where the
highly complex activities of thinking and language communication
are effected. The whole of our cognitive experience may be compared
to a more or less constant stream of complex and interrelated objects
passing back and forth through this center of activity.
Because of the connections and interrelationships between incoming elements, and since they tend to cluster together in predictable ways, we learn to expect certain kinds of things to follow from certain others. When you turn the corner on the street where you live you expect to see certain familiar buildings, yards, trees, and possibly your neighbor's dog with teeth bared. When someone speaks to you and you turn in his direction, you expect to see him by looking in the direction of the sound you have just heard. These sorts of expectations, whether they are learned or innate, are so commonplace that they seem trivial. They are not, however. Imagine the shock of having to face a world in which such expectations stopped being correct. Think what it would be like to walk into your living room and find yourself in a strange place. Imagine walking toward someone and getting farther from him with every step. The violations of our commonest expectations are horror-movie material that make earthquakes and hurricanes seem like Disneyland.
Man's great advantage over other organisms, which are also prisoners of time and space, is his ability to learn and use language to
systematize and organize experience more effectively. Through the use of language we may broaden or narrow the focus of our attention much the way we adjust the focus of our vision. We may think in terms of this sentence, or today, or this school year, or our lifetime, or known history, and so on. Regardless of how broad or narrow our perspective, there is a sequence of elements attended to by our consciousness within that perspective. The sequence itself may consist of relatively simple elements, or sets of interrelated and highly structured elements, but there must be a sequence because the totality of even a relatively simple aspect of our universe is too complex to be taken in at one gulp. We must deal with certain things ahead of others. In a sense, we must take in elements single file at a given rate, so that within the span of immediate consciousness, the number of elements being processed does not exceed certain limits.
In a characteristic masterpiece of a publication, George Miller (1956) presented a considerable amount of evidence from a wide variety of sources suggesting that the number of separate things that our consciousness can handle at any one time is somewhere in the neighborhood of seven, plus or minus one or two. He also pointed out that human beings overcome this limitation in part by what he calls 'chunking'. By treating sequences or clusters of elements as unitary chunks (or members of paradigms or classes) the mind constructs a richer cognitive system. In other words, by setting up useful categories of sequences, and categories of sequences of categories, our capacity to have correct expectations is enhanced - that is, we are enabled to have correct expectations about more objects, or more complex sorts of objects (in the most abstract sense of 'object') without any greater cost to the cognitive system.
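A trivial sketch in Python (ours; the telephone-number grouping is an assumed example, not Miller's) shows where the economy of chunking lies: not in the raw information, which is identical, but in the number of units attended to.

digits = list("5052771234567")
print(len(digits), "units as isolated digits")   # 13: beyond the 7 +/- 2 span

# The same symbols regrouped into familiar higher-order units.
chunks = ["505", "277", "1234", "567"]
print(len(chunks), "units after chunking")       # 4: well within the span

# The sequence itself is unchanged; only the unit of attention differs.
assert "".join(chunks) == "".join(digits)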
All of this is merely a way of talking about learning. As sequences of elements at one level are organized into classes at a higher order of abstraction, the organism can be said to be constructing an appropriate expectancy grammar, or learning. A universal consequence of the construction and modification of an appropriate expectancy grammar is that the processing of sequences of elements that conform to the constraints of the grammar is thus enhanced. Moreover, it may be hypothesized that highly organized sequences of elements that are presented in contexts where the basis for the organization can be discovered will be more conducive to the construction of an appropriate expectancy grammar than the presentation of similar sequences without appropriate sorts of context.
We are drawn to the generalization that there is an extremely important parallel between the normal use of language and the learning of a language. The learner is never quite in the position of having no expectations to begin with. Even the newborn infant apparently has certain innate expectancies, e.g., that sucking its mother's breast will produce a desired effect. In fact, experiments by Bower (1971, 1974) seem to point to the conclusion that an infant is born with certain expectations of a much more specific sort - for example, the expectation that a seen object should have some tangible solidity to it. He proved that infants at surprisingly early ages were astonished when they passed their hands through the space occupied by what appeared to be a tangible object. However, his experiments show that infants apparently have to learn to expect entities (such as mother) to appear in only one place at one time. They also seem to have to learn that a percept of a moving object is caused by the same object as the percept of that same moving object when it comes to rest.
The problem, it would seem, from an educational point of view is how to take advantage of the expectancies that a learner has already acquired in trying to teach new material. The question is, what does the learner already know, and how can that knowledge be optimally utilized in the presentation of new material? It has been demonstrated many times over that learning of verbal material is enhanced if the meaningfulness of the material is maximized from the learner's point of view. An unpronounceable sequence of letters like gbntmbwk is more difficult to learn and to recall than say, nox ems glerf, in spite of the fact that the latter is a longer sequence of letters. The latter is easier because it conforms to some of the expectations that English speakers have acquired concerning phonological and graphological elements. A phrase like colorless green ideas conforms less well to our acquired expectancies than beautiful fall colors. Given appropriate contexts for the latter and the lack of them for the most part for the former, the latter should also be easier to learn to use appropriately than the former. A nonsensical passage like the one Mr Foote invented to stump Mr Macklin would be more difficult to learn than normal prose. The reason is simple enough. Learners know more about normal prose before the learning task begins.
Language programs that employ fully contextualized and
maximally meaningful language necessarily optimize the learner's
ability to use previously acquired expectancies to help discover the
pragmatic mappings of utterances in the new language onto
extralinguistic contexts. Hence, they would seem to be superior to
programs that expect learners to acquire the ability to use a language
on the basis of disconnected lists of sentences in the form of pattern
drills, many of which are not only unrelated to meaningful
extralinguistic contexts, but which are intrinsically unrelatable.
If one carefully examines language teaching methods and language learning settings which seem to be conducive to success in acquiring facility in the language, they all seem to have certain things in common. Whether a learner succeeds in acquiring a first language because he was born in the culture where that language was used, or was transported there and forced to learn it as a second language; whether a learner acquires a second language by hiring a tutor and speaking the language incessantly, or by marrying a tutor, or by merely maintaining a prolonged relationship with someone who speaks the language; whether the learner acquires the language through the command approach used successfully by J. J. Asher (1969, 1974), or the silent method (Gattegno, 1963), or through a set of films of communicative exchanges (Oller, 1963-65), or by joining in a bilingual education experiment (Lambert and Tucker, 1972), certain sorts of data and motivations to attend to them are always present. The learner must be exposed to linguistic contexts in their peculiar pragmatic relationships to extralinguistic contexts, and the learner must be motivated to communicate with people in the target language by discovering those pragmatic relationships.
Although we have said little about education in a broader sense,
everything said to this point has a broader application. In effect, the
hypothesis concerning pragmatic expectancy grammar as a basis for
explaining success and failure in' language learning and 'language
teaching can be extended to all other areas of the school curriculum in
which language plays a large part. We will return to this issue in
Chapter 14 where we discuss reading curricula and other language
based parts of curricula in general. In particular we will examine
research into the developing language skills of children in Brisbane,
Australia (Hart, Walker, and Gray, 1977).
E. Tests that invoke the learner's grammar
When viewed from the vantage point assumed in this chapter, language testing is primarily a task of assessing the efficiency of the pragmatic expectancy grammar the learner is in the process of constructing. In order for a language test to achieve validity in terms of the theoretical construct of a pragmatic expectancy grammar, it will have to invoke and challenge the efficiency of the learner's developing grammar. We can be more explicit. Two closely interrelated criteria of construct validity may be imposed on language tests: first, they must cause the learner to process (either produce or comprehend, or possibly to comprehend, store, and recall, or some other combination) temporal sequences of elements in the language that conform to normal contextual constraints (linguistic and extralinguistic); second, they must require the learner to understand the pragmatic interrelationship of linguistic contexts and extralinguistic contexts.
The two validity requirements just stated are like two sides of the
same coin. The first emphasizes the sequential constraints specified by
the grammar, and the second emphasizes the function of the grammar
in relating sequences of elements in the language to states of affairs
outside of language. In subsequent chapters we will often refer to
these validity requirements as the pragmatic naturalness criteria. We
will explore ways of accomplishing such assessment in Chapter 3, and
in greater detail in Part Three which includes Chapters 10 through 14.
Techniques that fail to meet the naturalness criteria are discussed in
Part Two - especially in Chapter 8. Multiple choice testing
procedures are discussed in Chapter 9.
KEY POINTS
1. To understand the problem of constructing valid language tests, it is essential to understand the nature of the skill to be tested.
2. Two aspects of language in use need to be distinguished: the factive (or cognitive) aspect of language use has to do with the coding of information about states of affairs by using words, phrases, clauses, and discourse; the emotive (or affective) aspect of language use has to do with the coding of information about attitudes and interpersonal relationships by using facial expression, gesture, tone of voice, and choice of words. These two aspects of language use are intricately interrelated.
3. Two major kinds of context are distinguished: linguistic context consists
of verbal and gestural aspects; and extralinguistic context similarly
consists of objective and subjective aspects.
4. The systematic correspondences between linguistic and extralinguistic
contexts are referred to as pragmatic mappings.
5. Pragmatics asks how utterances (and of course other forms of language in use) are related to human experience.
6. In relation to the factive aspect of coding information about states of affairs outside of language, it is asserted that language is always an abbreviation for a much more complete and detailed sort of knowledge.
7. An important aspect of the coding of information in language is the anticipatory planning of the speaker and the advance hypothesizing of the listener concerning what is likely to be said next.
8. A pragmatic expectancy grammar is defined as a psychologically real system that sequentially orders linguistic elements in time and in relation to extralinguistic contexts in meaningful ways.
9. As linguistic sequences become more highly constrained by grammatical organization of the sorts illustrated, they become easier to process.
10. Whereas coding devices for factive information are typically digital (either on or off, present or absent), coding devices for emotive information are usually analogical (continuously variable). A tone of voice which indicates excitement may vary with the degree of excitement, but a digital device for, say, referring to a pair of glasses cannot be whispered to indicate very thin corrective lenses and shouted to indicate thick ones. The word eyeglasses does not have such a continuous variability of meaning, but a wild-eyed shout probably does mean a greater degree of intensity than a slightly raised voice.
11. Where there is a conflict between emotively coded information and
factive level information, the former usually overrides the latter.
12. When relationship struggles begin, factive level communication usually
ends. Examples are the wage-price spiral and the arms race.
13. The coding of factive and emotive information is very precisely
synchronized, and the gestural movements of speaker and listener in a
typical communicative exchange are also timed in surprisingly accurate
cadence.
14. Some sort of grammatical system incorporating the element of real time
and capable of substantial anticipatory-expectancy activity seems
required to explain well known facts of normal language use.
15. Language is both an object and a tool of learning. Cherry suggests that
we not only express ideas in words, but that we in fact discover them by
putting them into words.
16. Language learning is construed as a process of constructing an
appropriate expectancy generating system. Learning is enhancing one's
capacity to have correct expectations about the nature of experience.
17. It is hypothesized that language teaching programs (and by implication
educational programs in general) will be more effective if they optimize
the learner's opportunities to take advantage of previously acquired
expectancies in acquiring new knowledge.
18. It is further hypothesized that the data necessary to language acquisition are what are referred to in this book as pragmatic mappings - i.e., the systematic correspondences between linguistic and extralinguistic contexts. In addition to opportunity, the only other apparent necessity is sufficient motivation to operate on the requisite data in appropriate ways.
19. Valid language tests are defined as those tests which meet the pragmatic naturalness criteria - namely, those which invoke and challenge the efficiency of the learner's expectancy grammar, first by causing the learner to process temporal sequences in the language that conform to normal contextual constraints, and second by requiring the learner to understand the systematic correspondences of linguistic contexts and extralinguistic contexts.
DISCUSSION QUESTIONS
1. Why is it so important to understand the nature of the skill you are trying
to test? Can you think of examples of tests that have been used for
educational or other decisions but which were not related to a careful
consideration of the skill or knowledge they purported to assess? Study
closely a test that is used in your school or that you have taken at some
time in the course of your educational experience. How can you tell if the
test is a measure of what it purports to measure? Does the label on the
test really tell you what it measures?
2. Look for examples in your own experience illustrating the importance of grammatically based expectancies. Riddles, puns, jokes, and parlor games are good sources. Speech errors are equally good illustrations. Consider the example of the little girl who was asked by an adult where she got her ice cream. She replied, 'All over me,' as she looked sheepishly at the vanilla and chocolate stains all over her dress. How did her expectations differ from those of the adult who asked the question?
3. Keep track of listening or reading errors where you took a wrong turn in your thinking and had to do some retreading farther down the line. Discuss the source of such wrong turns.
4. Consider the sentences: (a) The boy was bucked off by the pony, and (b) The boy was bucked off by the barn (example from Woods, 1970). Why does the second sentence require a mental double-take? Note similar examples in your reading for the next few days. Write down examples and be prepared to discuss them with your class.
SUGGESTED READINGS
1. George A. Miller, 'The Magical Number Seven Plus or Minus Two: Some Limits on Our Capacity for Processing Information,' Psychological Review 63, 1956, 81-97.
2. Donald A. Norman, 'In Retrospect,' Memory and Attention. New York: Wiley, 1969, pp. 177-181.
3. Part VI of Focus on the Learner. Rowley, Mass.: Newbury House, 1973, pp. 265-300.
4. Bernard Spolsky, 'What Does It Mean to Know a Language or How Do
You Get Someone to Perform His Competence?' In J. W. Oller, Jr. and
J. C. Richards (eds.) Focus on the Learner. Rowley, Mass.: Newbury
House, 1973, 164-76.
3
Discrete Point,
Integrative,
or Pragmatic Tests
A. Discrete point versus integrative
testing
B. A definition of pragmatic tests
C. Dictation and cloze procedure as
examples of pragmatic tests
D. Other examples of pragmatic tests
E. Research on the validity of
pragmatic tests
1. The meaning of correlation
2. Correlations between different
language tests
3. Error analysis as an
independent source of validity
data
Not all that glitters is gold, and not everything that goes by the name
is twenty-four karat. Neither are all tests which are called language
tests necessarily worthy of the name, and some are better than others.
This chapter deals with three classes of tests that are called measures
of language - but it will be argued that they are not equal in
effectiveness. It is claimed that only tests which meet the pragmatic
naturalness criteria defined in Chapter 2 are language tests in the most
fundamental sense of what language is and how it functions.
A. Discrete point versus integrative testing

In recent years, a body of literature on language testing has developed which distinguishes two major categories of tests. John Carroll (1961, see the Suggested Readings at the end of this chapter) was the person credited with first proposing the distinction between discrete point and integrative language tests. Although the types are not always different for practical purposes, the theoretical bases of the two approaches contrast markedly, and the predictions concerning the effects and relative validity of different testing procedures also differ in fundamental ways depending on which of the two approaches one selects. The contrast between these two philosophies, of course, is not limited to language testing per se, but can be seen throughout the whole spectrum of educational endeavor.

Traditionally, a discrete point test is one that attempts to focus attention on one point of grammar at a time. Each test item is aimed at one and only one element of a particular component of a grammar (or perhaps we should say hypothesized grammar), such as phonology, syntax, or vocabulary. Moreover, a discrete point test purports to assess only one skill at a time (e.g., listening, or speaking, or reading, or writing) and only one aspect of a skill (e.g., productive versus receptive or oral versus visual). Within each skill, aspect, and component, discrete items supposedly focus on precisely one and only one phoneme, morpheme, lexical item, grammatical rule, or whatever the appropriate element may be. (See Lado, 1961, in Suggested Readings at the end of this chapter.) For instance, a phonological discrete item might require an examinee to distinguish between minimal pairs, e.g., pill versus peel, auditorily presented. An example of a morphological item might be one which requires the selection of an appropriate suffix such as -ness or -ity to form a noun from an adjective like secure, or sure. An example of a syntactic item might be a fill-in-the-blank type where the examinee must supply the suffix -s as in He walk___ to town each morning now that he lives in the city.¹

The concept of an integrative test was born in contrast with the definition of a discrete point test. If discrete items take language skill apart, integrative tests put it back together. Whereas discrete items attempt to test knowledge of language one bit at a time, integrative tests attempt to assess a learner's capacity to use many bits all at the same time, and possibly while exercising several presumed components of a grammatical system, and perhaps more than one of the traditionally recognized skills or aspects of skills.

However, to base a definition of integrative language testing on what would appear to be its logical antithesis and in fact its competing predecessor is to assume a fairly limiting point of view. It is possible to look to other sources for a theoretical basis and rationale for so-called integrative tests.

¹ Other discrete item examples are offered in Chapter 8 where we return to the topic of discrete point tests and examine them in greater detail.
B. A definition of pragmatic tests
The term pragmatic test has sometimes been used interchangeably
with the term integrative test in order to call attention to the
possibility of relating integrative language testing procedures to a
theory of pragmatics, or pragmatic expectancy grammar. Whereas
integrative testing has been somewhat loosely defined in terms of
what discrete point testing is not, it is possible to be somewhat more
precise in saying what a pragmatic test is: it is any procedure or task
that causes the learner to process sequences of elements in a language
that conform to the normal contextual constraints of that language, and which requires the learner to relate sequences of linguistic elements via pragmatic mappings to extralinguistic context.
Integrative tests are often pragmatic in this sense, and pragmatic
tests are always integrative. There is no ordinary discourse situation
in which a learner might be asked to listen to and distinguish between
isolated minimal pairs of phonological contrasts. There is no normal
language use context in which one's attention would be focussed on
the syntactic rules involved in placing appropriate suffixes on verb
stems or in moving the agent of an active declarative sentence from
the front of the sentence to the end in order to form a passive (e.g.,
The dog bit John in the active form becoming John was bitten by the
dog in the passive). Thus, discrete point tests cannot be pragmatic, and
conversely, pragmatic tests cannot be discrete point tests. Therefore,
pragmatic tests must be integrative.
But integrative language tasks can be conceived which do not meet
one or both of the naturalness criteria which we have imposed in our
definition of pragmatic tests. If a test merely requires an examinee to
use more than one of the four traditionally recognized skills and/or
one or more of the traditionally-recognized components of grammar,
it must be considered integrative. But to qualify as a pragmatic test,
more is required.
In order for a test user to say something meaningful (valid) about
the efficiency of a learner's developing grammatical system, the
pragmatic naturalness criteria require that the test invoke and
challenge that developing grammatical system. This requires
processing sequences of elements in the target language (even if it is
the learner's first and only language) subject to temporal contextual
constraints. In addition, the tasks must be such that for examinees to
do them, linguistic sequences must be related to extralinguistic
contexts in meaningful ways.
Examples of tasks that do not qualify as pragmatic tests include all discrete point tests; the rote recital of sequences of material without attention to meaning; and the manipulation of sequences of verbal elements, possibly in complex ways, but in ways that do not require
awareness of meaning. In brief, if the task does not require attention
to meaning in temporally constrained sequences of linguistic elements,
it cannot be construed as a pragmatic language test. Moreover, the
constraints must be of the type that are found in normal uses of the
language, not merely in some classroom setting that may have been
contrived according to some theory of how languages should be
taught. Ultimately, the question of whether or not a task is pragmatic
is an empirical one. It cannot be decided by theory-based preferences or opinion polls.
C. Dictation and cloze procedure as examples of pragmatic tests
The traditional dictation, rooted in the distant past of language
teaching, is an interesting example of a pragmatic language testing
procedure. If the sequences of words or phrases to be dictated are
selected from normal prose, or dialogue, or some other natural form
of discourse (or perhaps if the sequences are carefully contrived to
mirror normal discourse, as in well-written fiction) and if the material
is presented orally in sequences that are long enough to challenge the
short term memory of the learners, a simple traditional dictation
meets the naturalness requirements for pragmatic language tests.
First, such a task requires the processing of temporally constrained
sequences of material in the language and second, the task of dividing
up the stream of speech and writing down what is heard requires
understanding the meaning of the material - i.e., relating the
linguistic context (which in a sense is given) to the extralinguistic
context (which must be inferred).
Although an inspection of the results of dictation tests with
appropriate statistical procedures (as we will see below) shows the
technique to be very reliable and highly valid, it has not always been
looked on with favor by the experts. For example, Robert Lado
(1961) said:
Dictation ... on critical inspection ... appears to measure very little of language. Since the word order is given ... it does not test word order. Since the words are given ... it does not test vocabulary. It hardly tests the aural perception of the examiner's pronunciation because the words can in many cases be identified by context. ... The student is less likely to hear the sounds incorrectly in the slow reading of the words which is necessary for dictation (p. 34).

Other authors have tended to follow Lado's lead:

As a testing device ... dictation must be regarded as generally uneconomical and imprecise (Harris, 1969, p. 5).

Some teachers argue that dictation is a test of auditory comprehension, but surely this is a very indirect and inadequate test of such an important skill (Anderson, 1953, p. 43).

Dictation is primarily a test of spelling (Somaratne, 1957, p. 48).

More recently, J. B. Heaton (1975), though he cites some of the up-to-date research on dictation in his bibliography, devotes less than two pages to dictation as a testing procedure and concludes that

dictation ... as a testing device measures too many different language features to be effective in providing a means of assessing any one skill (p. 186).

Davies (1977) offers much the same criticism of dictation. He suggests that it is too imprecise in diagnostic information, and further that it is apt to have an unfortunate 'washback' effect (namely, in taking on 'the aura of language goals'). Therefore, he argues

it may be desirable to abandon such well-worn and suspect techniques for less familiar and less coherent ones (p. 66).

In the rest of the book edited by Allen and Davies (1977) there is only one other mention of dictation. Ingram (1977) in the same volume pegs dictation as a rather weak sort of spelling test (see p. 20).

If we were to rely on an opinion poll, the weight of the evidence would seem to be against dictation as a useful language testing procedure. However, the validity of a testing procedure is hardly the sort of question that can be answered by taking a vote.

Is it really necessary to read the material very slowly as is implied by Lado's remarks? The answer is no. It is possible to read slowly, but it is not necessary to do so. In fact, unless the material is presented in sequences long enough to challenge the learner's short term memory, and quickly enough to simulate the normal temporal nature of speech sequences, then perhaps dictation would become a test of spelling as Somaratne and Ingram suggest. However, it is not even necessary to count spelling as a criterion for correctness. Somaratne's remark seems to imply that one must, but research shows that one shouldn't. We will return to this question in particular, namely, the scoring of dictation, and other practical questions in Chapter 10.

The view that a language learner can take dictation (which is presented in reasonably long bursts, say, five or more words between pauses, and where each burst is given at a conversational rate) without doing some very active and creative processing is credible only from the vantage point of the naive examiner who thinks that the learner automatically knows what the examiner knows about the material being dictated. As the famous Swiss linguist pointed out three quarters of a century ago,

... the main characteristic of the sound chain is that it is linear. Considered by itself it is only a line, a continuous ribbon along which the ear perceives no self-sufficient and clear-cut division ... (quoted from lectures compiled by de Saussure's students, Bally, Sechehaye, and Riedlinger, 1959, pp. 103-104).

To prove that the words of a dictation are not necessarily 'given' from the learner's point of view, one only needs to try to write dictation in an unknown language. The reader may try this test: have a speaker of Yoruba, Thai, Mandarin, Serbian or some other language which you do not know say a few short sentences with pauses between them long enough for you to write them down or attempt to repeat them. Try something simple like Say, man, what's happening, or How's life been treating you lately, at a conversational rate. If the proof is not convincing, consider the kinds of errors that non-native speakers of English make in taking dictation.

In a research report circulated in 1973, Johansson gave examples of vocabulary errors: eliquants, elephants, and elekvants for the word eloquence. It is possible that the first rendition is a spelling error, but that possibility does not exist for the other renditions. At the phrase level, consider of appearance for of the period, person in facts for pertinent facts, less than justice, lasting justice, last in justice for just injustice. Or when a foreign student at UCLA writes, to find particle man living better and mean help man and boy tellable damage instead of to find practical means of feeding people better and means of helping them avoid the terrible damage of windstorms, does it make sense to say that the words and their order were 'given'?

Though much research remains to be done to understand better what learners are doing when they take dictation, it is clear from the above examples that whatever mental processes they are performing
must be active and creative. There is much evidence to suggest that there are fundamental parallels between tasks like taking dictation and using language in a wide variety of other ways. Among closely related testing procedures are sentence repetition tasks (or 'elicited imitation') which have been used in the testing of children for proficiency in one or more languages or language varieties. We return to this topic in detail in Chapter 10.

All of the research seems to indicate that in order for examinees to take dictation, or to repeat utterances that challenge their short term memory, it is necessary not only to make the appropriate
discriminations in dividing up the continuum of speech, but also to
understand the meaning of what is said.
Another example of a pragmatic language testing procedure is the cloze technique. The best known variety of this technique is the sort of test that is constructed by deleting every fifth, sixth, or seventh word from a passage of prose. Typically each deleted word is replaced by a blank of standard length, and the task set the examinee is to fill in the blanks by restoring the missing words. Other varieties of the procedure involve deleting specific vocabulary items, parts of speech, affixes, or particular types of grammatical markers.
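Since the standard variety of the procedure is mechanical, it can be expressed as a short program. The following sketch (in Python, offered purely as an illustration; the function name, the every-fifth-word default, and the use of the horse sentence discussed later in this chapter are our own choices, not part of the technique itself) mutilates a passage and keeps the deleted words as an exact-word scoring key:

    def make_cloze(passage, n=5, blank="______"):
        # Delete every nth word of the passage, as in a standard cloze test.
        # Returns the mutilated text plus the deleted words, which serve
        # as the key under the exact-word scoring method.
        words = passage.split()
        deleted = []
        for i in range(n - 1, len(words), n):
            deleted.append(words[i])
            words[i] = blank
        return " ".join(words), deleted

    text, key = make_cloze("A horse was fast when he was tied to a hitching "
                           "post, and the same animal was also fast when he "
                           "won a horse-race.")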
The word cloze was invented by Wilson Taylor (1953) to call attention to the fact that when an examinee fills in the gaps in a passage of prose, he is doing something similar to what Gestalt psychologists call 'closure', a process related to the perception of incomplete geometric figures, for example. Taylor considered words deleted from prose to present a special kind of closure problem. From what is known of the grammatical knowledge the examinee brings to bear in solving such a closure problem, we can appreciate the fact that the problem is a very special sort of closure.
Like dictation, cloze tests meet both of the naturalness criteria for pragmatic language tests. In order to give correct responses (whether the standard of correctness is the exact word that originally appeared at a particular point, or any other word that fully fits the context of the passage), the learner must operate ___ the basis of both immediate and long-range ___ constraints. Whereas some of the
blanks in a cloze test (say of the standard variety deleting every nth
word) can be filled by attending only to a few words on either side of
the blank, as in the first blank in the preceding sentence, other blanks
in a typical cloze passage require attention to longer stretches of
linguistic context. They often require inferences about extralinguistic
context, as in the case of the second blank in the preceding sentence.
The word on seems to be required in the first blank by the words
operate and the basis of, without any additional information.
However, unless long range constraints are taken into account, the
second blank offers many possibilities. If the examinee attended only
to such constraints as are afforded by the words from operate
onward, it could be filled by such words as missile, legal, or leadership.
The intended word was contextual. Other alternatives which might
have occurred to the reader, and which are in the general semantic
target area might include temporal, verbal, extralinguistic, grammatical, pragmatic, linguistic, psycholinguistic, sociolinguistic, psychological, semantic, and so on.
In taking a cloze test, the examinee must utilize information that is
inferred about the facts, events, ideas, relationships, states of affairs,
social settings and the like that are pragmatically mapped by the
linguistic sequences contained in the passage. Examples of cases
where extralinguistic context and the linguistic context of the passage
are interrelated are obvious in so-called deictic words such as here
and now, then and there, this and that, pronouns that refer to persons
or things, tense indicators, aspect markers on verbs, adverbs of time
and place, determiners and demonstratives in general, and a host of
others.
For a simple example, consider the sentence, A horse was fast when
he was tied to a hitching post, and the same animal was also fast when he
won a horse-race. If such a sentence were part of a larger context, say
on the difficulties of the English language, and if we deleted the first a,
the blank could scarcely be filled with the definite article the because
no horse has been mentioned up to that point. On the other hand, if
we deleted the the before the words same animal, the indefinite article
could not be used because of the fact that the horse referred to by the
phrase A horse at the beginning of the sentence is the same horse
referred to by the phrase the same animal. This is an example of a
pragmatic constraint. Consider the oddity of saying, The horse was
fast when he was tied to a hitching post, and a same animal was also fast
when he won a horse-race.
Even though the pragmatic mapping constraints involved in
normal discourse are only partially understood by the theoreticians,
and though they cannot be precisely characterized in terms of
grammatical systems (at least not yet), the fact that they exist is well known, and the fact that they can be tested by such pragmatic procedures as the cloze technique has been demonstrated (see
Chapter 12).
All sorts of deletions of so-called content words (e.g., nouns, adjectives, verbs, and adverbs), and especially grammatical connectors such as subordinating conjunctions, negatives, and a great
many others carry with them constraints that may range backward or
forward across several sentences or more. Such linguistic elements
may entail restrictions that influence items that are widely separated
in the passage. This places a strain on short term memory which
presses the learner's pragmatic expectancy grammar into operation.
The accuracy with which the learner is able to supply correct
responses can therefore be taken as an index of the efficiency of the
learner's developing grammatical system. Ways of constructing,
administering, scoring, and interpreting cloze tests and a variety of
related procedures for acquiring such indices are discussed in
Chapter 12.
D. Other examples of pragmatic tests
Pragmatic testing procedures are potentially innumerable. The
techniques discussed so far, dictation, cloze, and variations of them,
by no means exhaust the possibilities. Probably they do not even
begin to indicate the range of reasonable possibilities to be explored.
There is always a danger that minor empirical advances, in educational research in particular, may lead to excessive dependence
on procedures that are associated with the progress. However, in
spite of the fact that some of the pragmatic procedures thus far
investigated do appear to work substantially better than their discrete
point predecessors, there is little doubt that pragmatic tests can also
be refined and expanded. It is important that the procedures which now exist and which have been studied should not limit our vision concerning other possibilities. Rather, they should serve as
guideposts for subsequent refinement and development of still more
effective and more informative testing procedures.
Therefore, the point of this section (and in a broader sense, this
entire book) is not to provide a comprehensive list of possible
pragmatic testing procedures, but rather to illustrate some of the
possible types of procedures that meet the naturalness criteria
concerning the temporal constraints on language in use, and the
pragmatic mapping of linguistic contexts onto extralinguistic ones.
Below, in section E of this chapter, we will discuss evidence
concerning the validity of pragmatic tests. (Also, see the Appendix.)
Combined cloze and dictation. The examinee reads material from
which certain portions have been deleted and simultaneously (or
subsequently) hears the same material without deletions either live or
on tape. The examinee's task is to fill in the missing portions the same
as in the usual cloze procedure, but he has the added support of the
auditory signal to help him fill in the missing portions. Many
variations on this procedure are possible. Single words, or even parts
of words, or sequences of words, or even whole sentences or longer
segments may be deleted. The less material one deletes, presumably,
the more the task resembles the standard cloze procedure, and the
more one deletes, the more the task looks like a standard dictation.
Oral cloze procedure. Instead of presenting a cloze passage in a
written format, it is possible to use a carefully prepared tape
recording of the material with numbers read in for the blanks, or with
pauses where blanks occur. Or, it is possible merely to read the
material up to the blank, give the examinee the opportunity to guess
the missing word, record the response, and at that point either tell the
examinee the right answer (i.e., the missing word), or simply go on
without any feedback as to the correctness of the examinee's
response. Another procedure is to arrange the deletions so that they
always come at the end of a clause or sentence. Any of these oral cloze
techniques have the advantage of being usable with non-literate
populations.
Dictation with interfering noise. Several varieties of this procedure
have been used, and for a wide range of purposes. The best known
examples are the versions of the Spolsky-Gradman noise tests used
with non-native speakers of English. The procedure simply involves
superimposing white noise (a wide spectrum of random noise
sounding roughly like radio static or a shhhhshing sound at a
constant level) onto taped verbal material. If the linguistic context
under the noise is fully meaningful and subject to the normal
extralinguistic constraints, this procedure qualifies as a pragmatic
testing technique. Variations include noise throughout the material
versus noise over certain portions only. It is argued, in any event, that
the noise constitutes a situation somewhat parallel to many of the
everyday contexts where language is used in less than ideal acoustic
conditions, e.g., trying to have a conversation in someone's
livingroom when the television and air conditioner are producing a
high level of competing noise, or trying to talk to or hear someone else
in the crowded lobby of a hotel, or trying to hear a message over a public address system in a busy air terminal, etc.
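As a rough modern illustration of what 'superimposing white noise at a constant level' amounts to, consider the sketch below. It is not the Spolsky-Gradman procedure itself; the function name, the NumPy rendering, and the decibel signal-to-noise parameter are all our own assumptions:

    import numpy as np

    def add_white_noise(speech, snr_db=10.0, seed=0):
        # speech: a 1-D NumPy array of audio samples.
        # Returns the speech with constant-level Gaussian white noise
        # mixed in at the requested speech-to-noise ratio (in decibels).
        rng = np.random.default_rng(seed)
        noise = rng.standard_normal(len(speech))
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2)
        # Scale the noise so that the power ratio of speech to scaled
        # noise equals snr_db on the decibel scale.
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise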
Paraphrase recognition. In one version, examinees are asked to
read a sentence and then to select from four or five alternatives the
best paraphrase for the given sentence. The task may be made
somewhat more difficult by having examinees read a paragraph or
longer passage and then select from several alternatives the one which
best represents the central meaning or idea of the given passage. This
task is somewhat similar to telling what a conversation was about, or
what the main ideas of a speech were, and the like. Typically, such
tests are interpreted as being tests of reading comprehension.
However, they are pragmatic language tests inasmuch as they meet
the naturalness criteria related to meaning and temporal constraints.
A paraphrase recognition task may be either in a written format or an oral format or some combination of them. An example of an oral
format comes from the Test of English as a Foreign Language
produced by Educational Testing Service, Princeton, New Jersey.
Examinees hear a sentence like, John dropped the letter in the mailbox. Then they must choose between (a) John sent the letter; (b) John opened the letter; (c) John lost the letter; (d) John destroyed the letter.²
Of course, considerably more complicated items are possible. The
discrete point theorist might object that since the first stimulus is
presented auditorily and since the choices are then presented in a
written format, it becomes problematic to say what the test is a test of
- whether listening comprehension, or reading comprehension, or
both. This is an issue that we will return to in Chapters 8 and 9, and
which will be addressed briefly in the section on the validity of
pragmatic tests below. Also, see the Appendix.
Question answering. In one section of the TOEFL, examinees are required to select the best answer from a set of written alternatives to an auditorily presented question (either on record or tape). For instance, the examinee might hear, When did Tom come here? In the test booklet he reads, (a) By taxi; (b) Yes, he did; (c) To study history; and (d) Last night. He must mark on his answer sheet the letter corresponding to the best answer to the given question.
² This example and subsequent ones from the TOEFL are based on mimeographed hand-outs prepared by the staff at Educational Testing Service to describe the new format of the TOEFL in relation to the format used from 1961-1975.
A slightly different question answering task appears in a different
section of the test. The examinee hears a dialogue such as:
MAN'S VOICE: Hello Mary. This is Mr Smith at the office. Is
Bill feeling any better today?
WOMAN'S VOICE: Oh, yes, Mr Smith. He's feeling much better
now. But the doctor says he'll have to stay in
bed until Monday.
THIRD VOICE: Where is Bill now?
Possible answers from which the examinee must choose include: (a) At the office; (b) On his way to work; (c) Home in bed; and (d) Away on vacation.
Perhaps the preceding example, and other multiple choice
examples may seem somewhat contrived. For this and other reasons
to be discussed in Chapter 9, good items of the preceding type are
quite difficult to prepare. Other formats which allow the examinee to
supply answers to questions concerning less contrived contexts may
be more suitable for classroom applications. For instance, sections of
a television or radio broadcast in the target language may be taped.
Questions formed in relation to those passages could be used as part
of an interview technique aimed at testing oral skills.
A colorful, interesting, and potentially pragmatic testing technique
is the Bilingual Syntax Measure (Burt, Dulay, and Hernandez, 1975).
It is based on questions concerning colorful cartoon style pictures
like the one shown in Figure 1, on page 48.
The test is intended for children between the ages of four and
nine, from kindergarten through second grade. Although the authors
of the test have devised a scoring procedure that is essentially aimed
at assessing control of less than twenty so-called functors (morphological and syntactic markers like the plural endings on nouns, or
tense markers on verbs), the procedure itself is highly pragmatic.
First, questions are asked in relation to specific extralinguistic
contexts in ways that require the processing of sequences of elements
in English, or Spanish, or possibly some other language. Second,
those meaningful sequences of linguistic elements in the form of
questions must be related to the given extralinguistic contexts in
meaningful ways.
For instance, in relation to a picture such as the one shown in
Figure 1, the child might be asked something like, How come he's so
skinny? The questioner indicates the guy pushing the wheelbarrow.
The situation is natural enough and seems likely to motivate a child to
want to respond. We return to the Bilingual Syntax Measure and a number of related procedures in Chapter 11.

Figure 1. A cartoon drawing illustrating the style of the Bilingual Syntax Measure.

Oral interview. In addition to asking specific questions about pictured or real situations, oral tests may take a variety of other forms. In effect, every opportunity a learner is given to talk in an educational setting can be considered a kind of oral language test. The score on such a test may be only the subjective impression that it makes on the teacher (or another evaluator), or it may be based on some more detailed plan of counting errors. Surprisingly perhaps, the so-called objective procedures are not necessarily more reliable. In fact, they may be less reliable in some cases. Certain aspects of language performances may simply lend themselves more to subjective judgement than they do to quantification by formula. For instance, Richards (1970b) has shown that naive native speakers are fairly reliable judges of word frequencies. Also, it has been known for a long time that subjective rankings of passages of prose are sometimes more reliable than rankings (for relative difficulty) based on readability formulas (Klare, 1974).

An institutional technique that has been fairly well standardized by the Foreign Service Institute uses a training procedure for judges who are taught to conduct interviews and to judge performance on the basis of carefully thought-out rating scales. This procedure is discussed along with the Ilyin Oral Interview (Ilyin, 1972) and Upshur's Oral Communication Test (no date), in Chapter 11.
Composition or essay writing. Most free writing tasks necessarily
qualify as pragmatic tests. Because it is frequently difficult to judge
examinees relative to one another when they may have attempted to
say entirely different sorts of things, and because it is also difficult to
say what constitutes an error in writing, various modified writing
tasks have been used. For example, there is the so-called dehydrated
sentence, or dehydrated essay. The examinee is given a telegraphic
message and is asked to expand it. An instance of the dehydrated
sentence is child/ride/bicycle/off embankment/last month. A dehydrated narrative might continue, was taken to hospital/lingered near
death/family reunited/back to school/two weeks in hospital.
Writing tasks may range from the extreme case of allowing
examinees to select their own topic and to develop it, to maximally
controlled tasks like filling in blanks in a pre-selected (or even
contrived) passage prepared by the teacher or examiner. The blanks
might require open-ended responses on the order of whole
paragraphs, or sentences, or phrases, or words. In the last case, we
have arrived back at a rather obvious form of cloze procedure.
Another version of a fairly controlled writing task involves either
listening to or reading a passage and then trying to reproduce it from
recall. If the original material is auditorily presented, the task
becomes a special variety of dictation. This procedure and a variety of
others are discussed in greater detail in Chapter 13.
Narration. One of the techniques sometimes used successfully to
elicit relatively spontaneous speech samples is to ask subjects to talk
about a frightening experience or an accident where they were almost
'shaded out of the picture' (Paul Anisman, personal communication).
With very young children, story re-telling, which is a special version
of narration, has been used. It is important that such tasks seem
natural to the child, however, in order to get a realistic attempt from
the examinee. For instance, it is important that the person to whom the child is expected to re-tell the story is not the same person who has just told the story in the first place (he obviously knows it). It should rather be someone who has not (as far as the child is aware) heard the story before - or at least not the child's version.

Translation. Although translation, like other pragmatic procedures, has not been favored by the testing experts in recent years, it still remains, in at least some of its varieties, a viable pragmatic procedure. It deserves more research. It would appear from the study by Swain, Dumas, and Naiman (1974) that if it is used in ways that approximate its normal application in real life contexts, it can provide valuable information about language proficiency. If the sequences of verbal material are long enough to challenge the short term memory of the examinees, it would appear that the technique is a special kind of pragmatic paraphrase task.

E. Research on the validity of pragmatic tests

We have defined language use and language learning in relation to the theoretical construct of a pragmatic expectancy grammar. Language use is viewed as a process of interacting plans and hypotheses concerning the pragmatic mapping of linguistic contexts onto extralinguistic ones. Language learning is viewed as a process of developing such an expectancy system. Further, it is claimed that a language test must invoke and challenge the expectancy system of the learner in order to assess its efficiency. In all of this discussion, we are concerned with what may be called the construct validity of pragmatic language tests. If they were to stand completely alone, such considerations would fall far short of satisfactorily demonstrating the validity of pragmatic language tests. Empirical tests must be applied to the tests themselves to determine whether or not they are good tests according to some purpose or range of purposes (see Oller and Perkins, 1978).

In addition to construct validity, which is related to the question of whether the test meets certain theoretical requirements, there is the matter of so-called content validity and of concurrent validity. Content validity is related to the question of whether the test requires the examinee to perform tasks that are really the same as or fundamentally similar to the sorts of tasks one normally performs in exhibiting the skill or ability that the test purports to measure. For instance, we might ask of a test that purports to measure listening comprehension for adult foreign students in American universities: does the test require the learner to do the sort of thing that it supposedly measures his ability to do? Or, for a test that purports to measure the degree of dominance of bilingual children in classroom contexts that require listening and speaking, we might ask: does the test require the children to say and do things that are similar in some fundamental way to what they are normally required to do in the classroom? These are questions about content validity.

With respect to concurrent validity, the question of interest is to what extent do tests that purport to measure the same skill(s), or component(s) of a skill (or skills), correlate statistically with each other? Below, we will digress briefly to consider the meaning of statistical correlation. An example of a question concerning concurrent validity would be: do several tests that purport to measure the same thing actually correlate more highly with each other than with a set of tests that purport to measure something different? For instance, do language tests correlate more highly with each other than with tests that are labeled IQ tests? And vice versa. Do tests which are labeled tests of listening comprehension correlate better with each other than they do with tests that purport to measure reading comprehension? And vice versa.

A special set of questions about concurrent validity relate to the matter of test reliability. In the general sense, concurrent validity is about whether or not tests that purport to do the same thing actually do accomplish the same thing (or better, the degree to which they accomplish the same thing). Reliability of tests can be taken as a special case of concurrent validity. If all of the items on a test labeled as a test of writing ability are supposed to measure writing ability, then there should be a high degree of consistency of performance on the various items on that test. There may be differences of difficulty level, but presumably the type of skill to be assessed should be the same. This is like saying there should be a high degree of concurrent validity among items (or tests) that purport to measure the same thing. In order for a test to have a high degree of validity of any sort, it can be shown that it must first have a high degree of reliability.

In addition to these empirical (and statistically determined) requirements, a good test must also be practical and, for educational purposes, we might want to add that it should also have instructional value. By being practical we mean that it should be usable within the limits of time and budget available. It should have a high degree of cost effectiveness.
By having instructional value we mean that it ought to be possible
to use the test to enhance the delivery of instruction to student
populations. This may be accomplished in a foreign language
classroom by diagnosing student progress (and teacher effectiveness)
in more specific ways. In some cases the test itself becomes a teaching
procedure in the most obvious sense. In multilingual contexts better
knowledge of student abilities to process information coded verbally
in one or more languages can help motivate curricular decisions.
Indeed, in monolingual contexts curricular decisions need to be
related as much as is possible to the communication skills of students
(see Chapter 14).
It has been facetiously observed that what we are concerned with
when we add the requirements of practicality and instructional value
is something we might call true validity, or valid validity. With so
many kinds of validity being discussed in the literature today, it does
not seem entirely inappropriate to ask somewhat idealistically (and
sad to say, not superfluously) for a valid variety of validity that
teachers and educators may at least aim for. Toward this end we
might examine the results of theoretical investigations of construct
validity, practical analyses of the content of tests, and careful study of
the intercorrelations among a wide variety of testing procedures to
address questions of concurrent validity.³

³ Another variety of validity sometimes referred to in the literature is face validity. Harris (1969) defines it as 'simply the way the test looks - to the examinees, test administrators, educators, and the like' (p. 21). Since these kinds of opinions are often based on mere experiences with things that have been called tests of such and such a skill in the past, Harris notes that they are not a very important part of determining the validity of tests. Such opinions are ultimately important only to the extent that they affect performance on the test. Where judgements of face validity can be shown to be ill-informed, they should not serve as a basis for the evaluation of testing procedures at all.
We will return to the matter of validity of pragmatic tests and their patterns of interrelationship as determined by concurrent validity studies after a brief digression to consider the meaning of correlation in the statistical sense of the term. The reader who has some background in statistics or in the mathematics underlying statistical correlation may want to skip over the next eleven paragraphs and go directly to the discussion of results of statistical correlations between various tests that have been devised to assess language skills.
1. The meaning of correlation. The purpose here is not to teach the reader to apply correlation necessarily, but to help the non-statistically trained reader to understand the meaning of correlation
enough to appreciate some of the interesting findings of recent
research on the reliability and validity of various language testing
techniques. There are many excellent texts that deal with correlation
more thoroughly and with its application to research designs. The
interested reader may want to consult one of the many available
references.⁴ No attempt is made here to achieve any sort of
mathematical rigor - and perhaps it is worth noting that most
practical applications of statistical procedures do not conform to all
of the niceties necessary for mathematical precision attainable in theory (see Nunnally, 1967, pp. 7-10, for a discussion of this point).
Few researchers, however, would therefore deny the usefulness of the
applications.

⁴ An excellent text written principally for educators is Merle Tate, Statistics in Education and Psychology: A First Course. New York: Macmillan, 1965, especially Chapter VII. Or, see Nunnally (1967), or Kerlinger and Pedhazur (1973).
Here we are concerned with simple correlation, also known as
Pearson product-moment .correlation. To understand the meaning
of this statistic, it is first necessary to understand the simpler statistics
of the arithmetic mean, the variance, and the standard deviation on
which it is based. The arithmetic mean of a set of scores is computed
by adding up all of the scores in the set of interest and dividing by the
number of scores in the set. This procedure provides a measure of
central tendency of the scores. It is like an answer to the question, if
we were to take all the amounts of whatever the test measures and
distribute an equal amount to each examinee, how much would each
one get with none left over? Whereas the mean is an index of where
the true algebraic center of the scores is, the variance is an index of
how much scores tend to differ from that central point.
Since the true degree of variability of possible scores on a test tends
to be somewhat larger than the variability of scores made by a given
group of examinees, the computation of test variance must correct for
this bias. Without going into any detail, it has been proved
mathematically that the best estimate of true test variance can be
made as follows: first, subtract the mean score from each of the scores
on the test and record each of the resulting deviations from the mean
(the deviations will be positive quantities for scores larger than the
mean, and negative quantities for scores less than the mean); second,
square each of the deviations (i.e., multiply each deviation by itself)
and record the result each time; third, add up all of the squares (note
that all of the quantities must be either zero or a positive value since
the square of a negative value is always a positive value); fourth,
divide the sum of squares by the number of scores minus one (the
subtraction of one at this point is the correction of estimate bias noted
at the beginning of this paragraph). The result is the best estimate of
the true variance in the population sampled.
The standard deviation of the same set of scores is simply the
square root of the variance (i.e., the positive number which times
itself equals the variance). Hence, the standard deviation and the
variance are interconvertible values (the one can be easily derived
from the other). Each of them provides an index of the overall tendency
of the scores to vary from the mathematically defined central quantity
(the mean). Conceptually, computing the standard deviation is
something like answering the question: if we added to the mean and
subtracted from the mean amounts of whatever the test measures, how
much would we have to add and subtract on the average to obtain the
original set of scores? It can be shown mathematically that for normal
distributions of scores, the mean and the standard deviation tell
everything there is to know about the distribution of scores. The
mean defines the central point about which the scores tend to cluster
and their tendency to vary from that central point is the standard
deviation.
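For readers who follow a program more easily than a verbal recipe, the three statistics just described can be sketched as follows (Python, purely for illustration; the function names are our own):

    from math import sqrt

    def mean(scores):
        # The arithmetic mean: the sum of all scores divided by their number.
        return sum(scores) / len(scores)

    def variance(scores):
        # The sum of squared deviations from the mean, divided by one less
        # than the number of scores (the correction for estimate bias
        # noted above).
        m = mean(scores)
        return sum((x - m) ** 2 for x in scores) / (len(scores) - 1)

    def standard_deviation(scores):
        # The positive square root of the variance.
        return sqrt(variance(scores))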
We can now say what Pearson product-moment correlation means.
Simply stated, it is an index of the tendency for the scores of a group
of examinees on one test to covary (that is, to differ from their
respective mean in similar direction and magnitude) with the scores
of the same group of examinees on another test. If, for example, the
examinees who tend to make high scores on a certain cloze test also
tend to make high scores on a reading comprehension test, and if
those who tend to make low scores on the reading test also tend to
make low scores on the cloze, the two tests are positively correlated.
The square of the correlation between any two tests is an index of
the variance overlap between them. Perfect correlation will result if
the scores of examinees on two tests differ exactly in proportion to
each other from their respective means.
One of the conceptually simplest ways to compute the product-moment correlation between two sets of test scores is as follows: first,
compute the standard deviation for each test; second, for each
examinee, compute the deviation from the mean on the first test and
the deviation from the mean on the second test; third, multiply the
deviation from the mean on test one times the deviation from the
mean on test two for each examinee (whether the value of the
DISCRETE POINT, INTEGRATIVE OR PRAGMATIC TESTS
55
deviation is positive or negative is important in this case because it is
possible to get negative values on this operation); fourth, add up the
products of deviations from step three (note that the resulting
quantity is conceptually similar to the sum of squares of deviations in
the computation of the variance of a single set of scores); finally,
divide the quantity from step four by the standard deviation of test
one times the standard deviation of test two times one less than the
number of examinees. The resulting value is the correlation between
the two tests.
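The five steps translate just as directly into code; this sketch reuses the functions defined in the previous sketch and is, again, illustrative only:

    def correlation(test_one, test_two):
        # Pearson product-moment correlation of paired scores made by the
        # same examinees on two tests.
        m1, m2 = mean(test_one), mean(test_two)
        s1, s2 = standard_deviation(test_one), standard_deviation(test_two)
        # Sum the products of paired deviations from the respective means;
        # the sign of each deviation matters here.
        total = sum((x - m1) * (y - m2) for x, y in zip(test_one, test_two))
        return total / (s1 * s2 * (len(test_one) - 1))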
Correlations may be positive or negative. We have already
considered an example of positive correlation. An instance of
negative correlation would result if we counted correct responses on
say, a cloze test, and errors, say, on a dictation. Thus, a high score on
the cloze test would (if the tests were correlated positively, as in the
previous example) correspond to a low score on the other. High
scorers on the cloze test would typically be low scorers on the
dictation (that is, they would make fewer errors), and low scorers on
the cloze would be high scorers on the dictation (that is, they would make many errors). However, if the score on the cloze test were
converted to an error count also, the correlation would become
positive instead of negative. Therefore, in empirical testing research,
it is most often the magnitude of correlation between two tests that is
of interest rather than the direction of the relationship. However, the
value of the correlation (plus or minus) becomes interesting whenever
it is surprising. We will consider several such cases in Chapter 5 when
we discuss empirical research with attitudes and motivations.
What about the magnitude of correlations? When should a
correlation be considered high or low? Answers to such questions can
be given only in relation to certain purposes, and then only in general
and somewhat imprecise terms. In the first place, the size of
correlations cannot be linearly interpreted. A correlation of .90 is not
three times larger than a correlation of .30 - rather it is nine times
larger. It is necessary to square the correlation in each case in order to
make a more meaningful comparison. Since .90 squared is .81 and .30
squared is .09, and since .81 is nine times larger than .09, a correlation
of .90 is actually nine times larger than a correlation of .30.
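A two-line check, using the values from the example just given, makes the comparison concrete:

    r_high, r_low = 0.90, 0.30
    print(r_high ** 2 / r_low ** 2)  # 9.0: the .90 correlation is nine times larger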
Computationally (or perhaps we should say mathematically), a
correlation is like a standard deviation, while the square of the
correlation (or the coefficient of determination as it is called) is on the
same order as the variance. Indeed, the square of the correlation of
two tests is an index of the amount of variance overlap between the
two tests - or put differently, it is an index of the amount of variance
that they have in common. (For more thorough discussion, see Tate,
1965, especially Chapter VII.)
With respect to reliability studies, correlations above .95 between,
say, two alternate forms of the same test are considered quite
adequate. Statistically, such a correlation means that the test forms
overlap in variance at about the .90 level. That is, ninety percent of
the total variance in both tests is present in either one by itself. One
could feel quite confident that the tests would tend to produce very
similar results if administered to the same population of subjects.
What can be known from the one is almost identical to what can be
known from the other, with a small margin of error.
On the other hand, a reliability index of .60 for alternate forms of
the same test would not be considered adequate for most purposes.
The two tests in this latter instance are scarcely interchangeable. It
would hardly be justifiable to say that they are very reliable measures
of whatever they are aimed at assessing. (However, one cannot say
that they are necessarily measuring different things on the basis of
such a correlation. See Chapter 7 on statistical traps.)
In general, whether the question concerns reliability or validity,
low correlations are less informative than high correlations. An
observed low correlation between two tests that are expected to
correlate highly is something like the failure of a prospector in search
of gold. It may be that there is no gold or it may be that the prospector
simply hasn't turned the right stones or panned the right spots in the
stream. A low correlation may result from the fact that one of the
tests is too easy or too hard for the population tested. It may mean that one of the tests is unreliable. Or that both of them are unreliable.
Or a low correlation may result from the fact that one or both tests do
not measure what they are supposed to measure (i.e., are not valid),
or merely that one of them (or both) has (or have) a low degree of
validity.
A very high correlation is less difficult to interpret. It is more like a
gold strike. The richer the strike, that is, the higher the correlation,
the more easily it can be interpreted. A correlation of .85 or .90
between two tests that are superficially very different would seem to
be evidence that they are tapping the same underlying skill or ability.
In any event, it means at face value that the two tests share .72 or .81
of the total variance in both tests. That is, between 72 and 81 percent
of what can be known from the one can be known equally well from
the other.
A further point regarding the interpretation of reliability estimates
should be made. Insofar as a reliability estimate is accurate, its square
may be interpreted as the amount of non-random variance in the test
in question. It follows that the validity of a test can never exceed its
reliability, and further that validity indices can equal reliability
indices only in very special circumstances - namely, when all the
reliable (non-random) variance in one test is also generated by the
other. We shall return to this very important fact about correlations
as reliability indices and correlations as validity indices below. In the meantime, we should keep in mind that a correlation between two tests should normally be read as a reliability index if the two tests are considered to be different forms of the same test or testing procedure. However, if the two tests are considered to be different tests or testing procedures, the correlation between them should normally be read as a validity index.
2. Correlations between different language tests. One of the first
studies that showed surprisingly high correlations between substantially different language tests was done by Rebecca Valette (1964)
in connection with the teaching of French as a foreign language at the
college level. She used a dictation as part of a final examination for a
course in French. The rest of the test included: (1) a listening
comprehension task in a multiple choice format that contained items
requiring (a) identification of a phrase heard on tape, (b) completion
of sentences heard on tape, and (c) answering of questions concerning paragraphs heard on tape; (2) a written sentence completion task of
the fill-in-the-blank variety; and (3) a sentence writing task where
students were asked to answer questions in the affirmative or negative
or to follow instructions entailed in an imperative sentence like, Tell
John to come here, where a correct written response might be, John,
come here.
For two groups of subjects, all first semester French students, one
of which had practiced taking dictation and the other of which had
not, the correlations between dictation scores and the other test
scores combined were .78 and .89, respectively. Valette considered
these correlations to be notably high and concluded that the 'dictée'
was measuring the same basic overall skills as the longer and more
difficult to prepare French examination.
Valette concluded that the difference in the two correlations could
be explained as a result of a practice effect that reduced the validity of
dictation as a test for students who had practiced taking dictation.
better with the Listening Comprehension subscore on the TOEFL than it did with any other part of that examination (which also includes a subtest aimed at reading comprehension and one aimed at vocabulary knowledge). The correlation between the cloze test and the Listening Comprehension subtest was .73 and the correlation with the total score on all subtests of the TOEFL combined was .82.

In another study of the UCLA ESLPE 2A Revised, correlations of .74, .84, and .85 were observed between dictations and cloze tests (Oller, 1972). Also, three different cloze tests used with different populations of subjects (above 130 in number in each case) correlated above .70 in six cases with grammar tasks and paraphrase recognition tasks. The cloze test was scored by the contextually-appropriate ('acceptable word') method; see Chapter 12.
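Since comparisons like these depend on how the blanks are scored, a minimal sketch of the two scoring conventions may help; the deleted words and the lists of acceptable alternatives below are invented for illustration (in practice, judges or native-speaker norms decide what counts as contextually appropriate; Chapter 12 treats the details):

    def score_cloze(responses, deleted_words, acceptable=None):
        # Exact-word method: credit only the word actually deleted.
        # Acceptable-word method: also credit contextually appropriate
        # alternatives, here supplied by hand for each blank.
        exact = appropriate = 0
        for i, (given, target) in enumerate(zip(responses, deleted_words)):
            if given == target:
                exact += 1
                appropriate += 1
            elif acceptable is not None and given in acceptable.get(i, set()):
                appropriate += 1
        return exact, appropriate

    deleted = ["went", "store", "bought"]   # words deleted from a passage
    alternates = {0: {"walked", "drove"}, 2: {"purchased", "got"}}
    print(score_cloze(["walked", "store", "purchased"], deleted, alternates))
    # prints (1, 3): one exact restoration, but three acceptable ones

The acceptable-word score credits precisely the kind of contextual appropriateness that the definition of pragmatic tests emphasizes.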
While Valette was looking at the performance of students in a formal language classroom context where the language being studied was not spoken in the surrounding community, the studies at UCLA and the one by Darnell (at the University of Colorado) examined populations of students in the United States who were in social contexts where English was in fact the language of the surrounding community. Yet the results were similar in spite of the contrasts in tested populations and regardless of the contrasts in the tests used (the various versions of the UCLA ESLPE, the TOEFL, and the foreign language French exams).

Similar results are available, however, from still more diverse settings. Johansson (1972) reported on the use of a combined cloze and dictation procedure which produced essentially the same results as the initial studies with dictation at UCLA. He found that his combined cloze and dictation procedure correlated better with scores on several language tests than any of the other tests correlated with each other. It is noteworthy that Johansson's subjects were Swedish college students who had learned English as a foreign language. The correlation between his cloze-dictation and a traditional test of listening comprehension was .83.

In yet another context, Stubbs and Tucker (1974) found that a cloze test was generally the best predictor of various sections on the American University at Beirut English Entrance Examination. Their subject population included mostly native speakers of Arabic learning English as a foreign or second language. The cloze test appeared to be superior to the more traditional parts of the EEE in spite of the greater ease of preparation of the cloze test. In particular, the cloze blanks seemed to discriminate better between high scorers and
low scorers than did the traditional discrete point types of items (see
Chapter 9 on item analysis and especially item discrimination).
A study by Pike (1973) with such diverse techniques as oral
interview (FSI type), essay ratings, cloze scores, the subscores on the
TOEFL and a variety of other tasks yielded notably strong
correlations between tasks that could be construed as pragmatic tests.
He tested native speakers of Japanese in Japan, and Spanish speakers
in Chile and Peru. There were some interesting surprises in the simple
correlations which he observed. For instance, the essay scores
correlated better with the subtest labeled Listening Comprehension
for all three populations tested than with any of the other tests, and
the cloze scores (by Darnell's scoring method) correlated about as
highly with interview ratings as did any other pairs of subtests in the
data.
The puzzle remains. Why should tests that look so different in
terms of what they require people to do correlate so highly? Or more
mysterious still, why should tests that purport to measure the same
skill or skills fail to correlate as highly with each other as they
correlate with other tests that purport to measure very different
skills? A number of explanations can be offered, and the data are by
no means all in at this point. It would appear that the position once
favored by discrete point theorists has been excluded by experimental
studies - that position was that different forms of discrete point tests
aimed at assessing the same skill, or aspect of a skill, or component of
a skill, ought to correlate better with each other than with, say,
integrative (especially, pragmatic) tests of a substantially different
sort. This position now appears to have been incorrect.
There is considerable evidence to show that in a wide range of
studies with a substantial variety of tests and a diverse selection of
subject populations, discrete point tests do not correlate as well with
each other as they do with integrative tests. Moreover, integrative
tests of very different types (e.g., cloze versus dictation) correlate even
more highly with each other than they do with language tests which
discrete point theory would identify as being more similar. The
correlations between diverse pragmatic tests, in other words,
generally exceed the correlations observed between quite similar
discrete point tests. This would seem to be a strong disproof of the
early claims of discrete point theories of testing, and one will search in
vain for an explanation in the strong versions of discrete point
approaches (see especially Chapter 8, and in fact all of Part Two).
Having discarded the strong version of what might be termed the
discrete point hypothesis - namely, that tests aimed at similar elements, components, aspects of skills, or skills ought to correlate more highly than tests that are apparently requiring a greater diversity of performances - we must look elsewhere for an explanation of the
pervasive results of correlation studies. Two explanations have been
offered. One is based on the pragmatic theory advocated in Chapter 2
of this book, and the other is a modified version of the discrete point
argument (Upshur, 1976, discusses this view though it is doubtful whether he thinks it is correct). From an experimental point of
view, it is obviously preferable to avoid advocacy and let the available
data or obtainable data speak for themselves (Platt, 1964).
One hypothesis is that pragmatic language tests must correlate
highly if they are valid language tests. Therefore, the results of
correlation studies can be easily understood or at least straightforwardly interpreted as evidence of the fundamental validity of the
variety of language tests that have been shown to correlate at such
remarkably high levels. The reason that a dictation and a cloze test
(which are apparently such different tasks) intercorrelate so strongly
is that both are effective devices for assessing the efficiency of the
learner's developing grammatical system, or language ability, or
pragmatic expectancy grammar, or cognitive network of the
language or whatever one chooses to call it. There is substantial
empirical evidence to suggest that there may be a single unitary factor
that accounts for practically all of the variance in language tests
(Oller, 1976a). Perhaps that factor can be equated with the learner's
developing grammatical system.
One rather simple but convincing source of data on this question is
the fact that the validity estimates on pragmatic tests of different sorts
(i.e. the correlations between different ones) are nearly equal to the
reliability estimates for the same tests. From this it follows that the
tests must be measuring the same thing to a substantial extent.
Indeed, if the validity estimates were consistently equal to, or nearly equal to, the reliability estimates, we would be forced to conclude that
the tests are essentially measures of the same factor. This is an
empirical question, however, and another plausible alternative
remains to be ruled out.
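One conventional way to make this reasoning explicit, offered here only as a formalization of the point, is Spearman's correction for attenuation,

\[ \hat{r}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}}, \]

where \( r_{xy} \) is the observed correlation between two different tests and \( r_{xx} \) and \( r_{yy} \) are their respective reliabilities. To the extent that the validity coefficient \( r_{xy} \) approaches the geometric mean of the two reliabilities, the corrected correlation approaches 1.00, which is just what would be expected if the two tests were measures of a single factor.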
Upshur (1976) suggests that perhaps the grammatical system of the learner will account for a large and substantial portion of the variance
in a wide variety of language tests. This central portion of variance
might explain the correlations mentioned above; but there could still
be meaningful portions of variance left which would be attributable
to components of grammar or aspects of language skill, or the
traditional skills themselves.
Lofgren (1972) concluded that 'there appear to be four main
factors which are significant for language proficiency. These have
been named knowledge of words and structures, intelligence,
pronunciation, and fluency' (p. 11). He used a factor analytic
approach (a sophisticated variation on correlation techniques) to test 'Lado's idea ... that language can be broken down into smaller components in order to find common elements' (p. 8). In particular, Lofgren wanted to test the view that language skills could be differentiated into listening, speaking, reading, writing, and possibly translating factors. His evidence would seem to support either the
unitary pragmatic factor hypothesis, or the central grammatical
factor with meaningful peripheral components as suggested by
Upshur. His data seem to exclude the possibility that meaningful
variances will emerge which are unique to the traditionally
recognized skills. Clearly, more research is needed in relation to this important topic (see the Appendix).
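The flavor of such a factor analytic check can be suggested with a small computation. The correlation matrix below is invented purely for illustration (an actual study would use the observed inter-test correlations); the computation asks what share of the total variance falls on the largest principal component:

    import numpy as np

    # Invented correlation matrix for four language tests
    # (cloze, dictation, listening, reading); values are illustrative only.
    R = np.array([
        [1.00, 0.82, 0.73, 0.80],
        [0.82, 1.00, 0.74, 0.76],
        [0.73, 0.74, 1.00, 0.71],
        [0.80, 0.76, 0.71, 1.00],
    ])

    eigenvalues = np.linalg.eigvalsh(R)      # returned in ascending order
    first_factor = eigenvalues[-1]
    print(first_factor / eigenvalues.sum())  # share of total variance, about .82

A single component carrying some four fifths of the total variance is the sort of outcome the unitary factor hypothesis predicts; sizable residual components would instead favor the modified discrete point alternative suggested by Upshur.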
Closely related to the questions about the composition of language skill (and these questions have only recently been posed with reference to native speakers of any given language) are questions about the important relation of language skill(s) to IQ and other psychological constructs. If pragmatic tests are actually more valid tests than other widely used measures of language skills, perhaps these new measurement techniques can be used to determine the relationship between ability to perform meaningful language proficiency tasks and ability to answer questions on so-called IQ tests, and educational tests in general. Preliminary results reported in Oller and Perkins (1978) seem to support a single factor solution where language proficiency accounts for nearly all of the reliable variance in IQ and achievement tests.
Most of the studies of language test validity have dealt with second or foreign language learners who are either adults or post-adolescents. Extensions to native speakers and to children who are
either native or non-native speakers of the language tested are more
recent. Many careful empirical investigations are now under way or
have fairly recently been reported with younger subjects. In a
pioneering doctoral study at the University of New Mexico, Craker
(1971) used an oral form of the cloze procedure to assess language
skills of children at the first grade level from four ethnic backgrounds.
She reported significant discrimination between the four groups
suggesting that the procedure may indeed be sensitive to variations in
levels of proficiency for children who are either preliterate or are just
beginning to learn to read.
Although data are lacking on many of the possible pragmatic testing procedures that might be applied with children, the cloze procedure has recently been used with literate children in the elementary grades in contexts ranging from the Alaskan bush country to the African primary school. Streiff (1977) investigated the effects of the availability of reading resources on reading proficiency among Eskimo children from the third to sixth grade using cloze tests as the criterion measure for reading proficiency. Hofman (1974) used
cloze tests as measures of reading proficiency in Uganda schools from
grades 2 through 9. Data were collected on children in 14 schools (12
African and 2 European). Since the tests were in a second language
for many of the African children, and in the native language for many
of the European children, some interesting comparisons are possible.
Concerning test reliabilities and internal consistencies of the various
cloze tests used, Hofman reports somewhat lower values for the 2nd
graders, but even including them, the average reliability estimate for
all nine test batteries is .91 - and none is below .85. These data were
based on a mean sample size of 264 subjects. The smallest number for
any test battery was 232.
In what may be the most interesting study of language proficiency
in young children to date, Swain, Lapkin, and Barik (1976) have
reported data closely paralleling results obtained with adult second
language learners. In their research, 4th grade bilinguals (or
English speaking children who are becoming bilingual in French)
were tested. Tests of proficiency in French were used and correlated
with a cloze test in French scored by the exact and acceptable word
methods (see Chapter 12 for elaboration on scoring methods).
Proficiency tests for English ability were also correlated with a cloze
test in English (also scored by both methods). In every case, the
correlations between cloze scores and the other measures of
proficiency used (in both languages) were higher than the correlations
between any of the other pairs of proficiency tests. This result would
seem to support the conclusion that the cloze tests were simply accounting for more of the available meaningful variance in both the
native language (English) and in the second language (French).
The authors conclude, 'this study has indicated that the cloze tests
can be used effectively with young children ... the cloze technique has
been shown to be a valid and reliable means of measuring second
or experimenter; neither was spontaneous). They reasoned that if the
sentences children were to repeat exceeded immediate memory span,
then elicited imitation ought to be a test both of comprehension and
production skills. Translation on the other hand could be done in two
directions, either from the native language of the children (in this case
English) to the target language (French), or from the target language
to the native language. The first sort of translation, they reasoned,
could be taken as a test of productive ability in the target language,
whereas the second could be taken as a measure of comprehension
ability in the target language (presumably, if the child could
understand something in French he would have no difficulty in
expressing it in his native language, English).
In order to rule out the possibility that examinees might be using
different strategies for merely repeating a sentence in French as
opposed to translating it into English, Swain, Dumas, and Naiman
devised a careful comparison of the two procedures. One group of
children was told before each sentence whether they were to repeat it
or to translate it into English. If subjects used different strategies for
the two tasks, this procedure would allow them to plan their strategy
before hearing the sentence. A second group was given each sentence
first and told afterward whether they were to translate it or repeat it.
Since the children in this group did not know what they were to do
with the sentence beforehand, it would be impossible for them
consistently to use different strategies for the imitation task and the
translation task. Differences in error types or in the relative difficulty
of different syntactic structures might have implied different
strategies of processing, but there were no differences. The imitation
task was somewhat more difficult, but the types of errors and the rank
ordering of syntactic structures were similar for both tasks and for
both groups. (There were no significant differences at all between the
two groups.)
There were some striking similarities in performance on the two
rather different pragmatic tasks. 'For example, 75 % of the children
who imitated the past tense by "a + the third person of the present
tense of the main verb" made the same error in the production task.
... Also, 69 % of the subjects who inverted pronoun objects in
imitation made similar inversions in production .... In sum, these
findings lead us to reject the view held by Fraser, Bellugi, and Brown
(1963) and others that imitation is only a perceptual-motor skill'
(Swain, Dumas, and Naiman, 1974, p. 72). Moreover, in a study by
Naiman (1974), children were observed to make many of the same
errors in spontaneous speech as they made in elicited translation from
the native language to the target language.
In yet another study, Dumas and Swain (1973) 'demonstrated that
when young second language learners similar to the ones in Naiman's
study (Naiman, 1974) were given English translations of their own
French spontaneous productions and asked to translate these
utterances into French, 75 % of their translations matched their
original spontaneous productions' (Swain, Dumas, and Naiman,
1974, p. 73).
Although much more study needs to be done, and with a greater variety of subject populations and techniques of testing, it seems reasonable to say that there is already substantial evidence from studies of learner outputs that quite different pragmatic tasks may be
tapping the same basic underlying competence. There would seem to
be two explanations for differences in performance on tasks that
require saying meaningful things in the target language versus merely
indicating comprehension by, say, translating the meaning of what
someone else has just said in the target language into one's own native
language. Typically, it is observed that tasks of the latter sort are
easier than the former. Learners can often understand things that
they cannot say; they can often repeat things that they could not have
put together without the support of a model; they may be able to read
a sentence which is written down that they could not have understood
at all if it were spoken; they may be able to comprehend written
material that they obviously could not have written. There seems to
be a hierarchy of difficulties associated with different tasks. The two
explanations that have been put forth parallel the two competing
explanations for the overlap in variance on pragmatic language tests.
Discrete point testers have long insisted on the separation of tests
of the traditionally recognized skills. The extreme version of this argument is to propose that learners possess different grammars for different skills, aspects of skills, components of aspects of skills, and
so on right down to the individual phonemes, morphemes, etc. The
disproof of this view seems to have already been provided now many
times over. Such a theoretical argument cannot embrace the data
from correlation studies or the data from error analyses. However, it
seems that a weaker version cannot yet be ruled out.
It is possible that there is a basic grammatical system underlying all uses of language, but that there remain certain components which
are not part of the central core that account for what are frequently
referred to as differences in productive and receptive repertoires (e.g.
phonology for speaking versus for listening), or differences in productive and receptive abilities (e.g. speaking and writing versus listening and reading), or differences in oral and visual skills (e.g. speaking and listening versus reading and writing), or components that are associated with special abilities such as the capacity to do simultaneous translation, or to imitate a wide variety of accents, and so on, and on. This sort of reasoning would harmonize with the hypothesis discussed by Upshur (1976) concerning unique variances associated with tests of particular skills or components of grammar.

Another plausible alternative exists, however, and was hinted at in the article by Swain, Dumas, and Naiman (1974). Consider the rather simple possibility that the differences in difficulties associated with different language tasks may be due to differences in the load they impose on the brain. It is possible that the grammatical system (call it an expectancy grammar, or call it something else) functions with different levels of efficiency in different language tasks - not because it is a different grammar - but because of differences in the load it must bear (or help consciousness to bear) in relation to different tasks. No one would propose that because a man can carry a one hundred pound weight up a hill faster than he can carry a one hundred and fifty pound weight he must therefore have different pairs of legs for carrying different amounts of weight. It wouldn't even seem reasonable to suggest that he has more weight-moving ability when carrying one hundred pounds than when carrying one hundred and fifty pounds. Would it make sense to suggest that there is an additional component of skill that makes the one hundred pound weight easier to carry?

The case of, say, speaking versus listening skill is obviously much more complex than the analogy, which is intentionally a reduction to absurdity. But the argument can apply. In speaking, the narrow corridor of activity known as attention or consciousness must integrate the motor coordination of signals to the articulators, telling them what moves to make and in what order, when to turn the voice on and off, and when to push air and when not to; syllables must be timed, with monitoring to make certain the right ones get articulated in the right sequence; facial expressions, tones, gestures, etc. must be synchronized with the stream of speech; and all of the foregoing must be coordinated with certain ill-defined intentions to communicate (or with pragmatic mappings of utterances onto extralinguistic contexts, if you like). In listening, somewhat less is required. While the speaker must both plan and monitor his articulatory output to make sure it
catches up with what he intended to say in form and meaning all the while continuing to plan and actively construct further forms and meanings, the listener needs only to monitor his inferences concerning the speaker's intended meanings, and to help him do this, the listener has the already constructed sensory signals (that he can hear and see) which the speaker is outputting. A similar explanation can be offered for the fact that reading is somewhat less taxing than writing, and that reading is somewhat easier than listening, and so forth for each possible contrast. Swain, Dumas, and Naiman (1974) anticipate this sort of explanation when they talk about 'the influence of memory capacity on some of the specific aspects of processing involved in tasks of imitation and translation' (p. 75).

Crucial experiments to force the choice between the two competing explanations for hierarchies of difficulties in different language tasks remain to be done. Perhaps a double-edged approach from both correlation studies and error analysis will disprove one or the other of the two competing alternatives, or possibly other alternatives remain to be put forward. In the meantime, it would seem that the evidence from error analysis supports the validity of pragmatic tasks. Certainly, it would appear that studies of errors on different elicitation procedures are capable of putting both of the interesting alternatives in the position of being vulnerable to test. This is all that science requires (Platt, 1964).

Teachers and educators, of course, require more. We cannot wait for all of the data to come in, or for the crucial experiments to be devised and executed. Decisions must be made and they will be made either wisely or unwisely, for better or for worse. Students in classrooms cannot be left to sit there without a curriculum, and the decisions concerning the curriculum must be made with or without the aid of valid language tests. The best available option seems to be to go ahead with the sorts of pragmatic language tests that have proved to yield high concurrent validity statistics and to provide a rich supply of information concerning the learner's language proficiency. In the next chapter we will consider aspects of language assessment in multilingual contexts. In Part Two, we will discuss reasons for rejecting certain versions of discrete point tests in favor of pragmatic testing, and in Part Three, specific pragmatic testing procedures are discussed in greater detail. The relevance of the procedures recommended there to educational testing in general is discussed at numerous points throughout Part Three.
KEY POINTS
1. Discrete point tests are aimed at specific elements of phonology, syntax,
or vocabulary within a presumed aspect (productive or receptive, oral or
visual) of one of the traditionally recognized language skills (listening,
speaking, reading, or writing).
2. The strong version of the discrete point approach argues that different
test items are needed to assess different elements of knowledge within
each component of grammar, and different subtests are needed for each
different component of each different aspect of each different skill.
Theoretically, many different tests are required.
3. Integrative tests are defined as antithetical to discrete point tests.
Integrative tests lump many elements and possibly several components,
aspects and skills together and test them all at the same time.
4. While it can be argued that discrete point tests and integrative tests are
merely two extremes on a continuum, pragmatic tests constitute a
special class of integrative tests. It is possible to conceive of a discrete
point test as being more or less integrative, and an integrative test as
being more or less discrete, but pragmatic tests are more precisely
defined.
5. Pragmatic language tests must meet two naturalness criteria: first, they
must require the learner to utilize normal contextual constraints on
sequences in the language; and, second, they must require comprehension (and possibly production also) of meaningful sequences of elements in the language in relation to extralinguistic contexts.
6. Discrete point tests cannot be pragmatic tests.
7. The question whether or not a task is pragmatic is an empirical one. It
can be decided by logic (that is by definition) and by experiment, but not
by opinion polls.
8. Dictation and cloze procedure are examples of pragmatic tests. First,
they meet the requirements of the definition, and second, they function in
experimental applications in the predicted ways.
9. Other examples include combinations of cloze and dictation, oral cloze
tasks, dictation with interfering noise, paraphrase recognition, question-answering, oral interview, essay writing, narration, and translation.
10. A test is valid to the extent that it measures what it is supposed to measure. Construct validity has to do with the theoretical justification of a testing procedure; content validity has to do with the faithfulness with which a test reflects the normal uses of language to which it is related as a measure of an examinee's skill; concurrent validity has to do with the strength of correlations between tests that purport to measure the same thing.
11. Correlation is a statistical index of the tendency of scores on two tests to
vary proportionately from their respective means. It is an index of the
square root of variance overlap, or variance common to two tests.
12. The square of the simple correlation between two tests is an unbiased
estimate of their variance overlap. The technical term for the square of
the correlation is the coefficient of determination. Correlations cannot be
compared linearly, but their squares can be.
13. A high correlation is more informative and easier to interpret than a low one. While a high correlation does mean that some sort of strong relationship exists, a low correlation does not unambiguously mean that a strong relationship does not exist between the tested variables. There are many more explanations for low correlations than for high ones.
14. Generally, pragmatic tests of apparently very different types correlate at higher levels with each other than they do with other tests. However, they also seem to correlate better with the more traditional discrete point tests than the latter do with each other. Hence, pragmatic tests seem to be generating more meaningful variance than discrete item tests.
15. Two possibilities seem to exist: there may be a large factor of grammatical knowledge of some sort in every language test, with certain residual variances attributable to specific components, aspects, skills, and the like; or, language skill may be a relatively unitary factor and there may be no unique meaningful variances that can be attributed to specific components, etc.
16. The relation of language proficiency to intelligence (or more specifically to scores on so-called IQ tests) remains to be studied more carefully. Scores on achievement tests and educational measures of all sorts should also be examined critically with careful experimental procedures.
17. Error analysis, or interlanguage analysis, can provide additional validity data on language tests. If language tests are viewed as elicitation procedures, and if errors are analyzed carefully, it is possible to make very specific observations about whether certain tests measure different things or the same thing.
18. Available data suggest that very different pragmatic tasks, such as spontaneous speech, or elicited imitation, or elicited translation, tend to produce the same kinds of learner errors.
19. Differences in difficulty across tasks may be explained by considering the relative load on mental mechanisms.
20. Teachers and educators can't wait for all of the research data to come in. At present, pragmatic testing seems to provide the most promise as a reliable, valid, and usable approach to the measurement of language ability.
DISCUSSION QUESTIONS
1. What sort of testing procedure is more common in your school or in educational systems with which you are familiar: discrete point testing,
or integrative testing? Are pragmatic tests used in classrooms that you
know of? Have you used dictation or cloze procedure in teaching a
foreign language? Some other application? For instance, assessing
reading comprehension, or the suitability of materials aimed at a certain
grade level?
2. Discuss ways that you evaluate the language proficiency of children in
your classroom, or students at your university, or perhaps how you
might estimate your own language proficiency in a second language you
have studied.
3. What are some of the drawbacks or advantages to a phonological
discrimination task where the examinee hears a sentence like, He thinks he will sail his boat at the lake, and must decide whether he heard sell or
sail. Try writing several items of this type and discuss the factors that
enter into determining the correct answer. What expectancy biases may
arise?
4. In what way is understanding the sentence, Shove off, Smith, I'm tired of
talking to you, dependent on knowing the meaning of off? Could you test
such knowledge with an appropriate discrete item?
5. Try doing a cloze test and a dictation. Reflect on the strategies you use in
performing one or the other. Or give one to a class and ask them to tell
you what they attended to and what they were doing mentally while they
filled in the blanks or wrote down what they had heard.
6. Propose other forms of tasks that you think would qualify as pragmatic
tasks. Consider whether they meet the two naturalness criteria. If so try
them out and see how they work.
7. Can you think of possible objections to a dictation with noise? What are
some arguments for and against such a procedure?
"8. Given a low correlation between two language tests, say, .30, what are
some of the possible conclusions? Suppose the correlation is .90. What
could you conclude for the latter?
9. Keep a diary on listening errors and speech errors that you or people
around you make. Do the processes appear to be distinct? Interrelated?
10. Discuss Alice's query about not knowing what she would say next till she
had already said it. How does this fit the strategies you follow when you speak? In what cases might the statement apply to your own speech?
11. Can a high correlation between two tests be taken as an indication of test
validity? When is it merely an indication of test reliability? What is the
difference? Can a test be valid without being reliable? How about the
reverse?
12. Try collecting samples of learner outputs in a variety of ways. Compare error types. Do dissimilar pragmatic tests elicit similar or dissimilar
errors?
v" 13. Consider the prbs and cons of the long despi~ed technique of translation
as a teaching and as a testing device. Are the potential uses similar? Can
v
you define clearly abuses to be avoided?
14. What sorts of tests do you think will yield superior diagnostic
information to help you to know what to do to help a learner by teaching
strategies? Consider what you do now with the test data that are
available. Are there reading tests? Vocabulary tests? Grammar tests?
What do you do differently because of the information that you get from
the tests you use?
15. Compute the means, variances, and standard deviations for the following sets of scores:

              Test A    Test B
    George       1         5
    Sarah        2        10
    Mary         3        15

(Note that these data are highly artificial. They are contrived purely to illustrate the meaning of correlation while keeping the computations extremely simple and manageable.)
"-16. What is the correlation between tests A and B?
/ 17. What would your interpretation of the correlation be if Tests A and B
were alternate forms of a placement test? What ifthey were respectively a
vi' reading comprehension test and an oral interview?
18. Repeat questions 15, 16, and 17 with the following data:

              Test C    Test D
    George       5         7
    Sarah       10         6
    Mary        15         5

(Bear in mind that correlations would almost never be done on such small numbers of subjects. Answers: correlation between A and B is +1.00, between C and D it is -1.00.)
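For readers who wish to check these answers mechanically, the computations reduce to a few lines; the function names are ours, the data are the toy scores above, and the population formulas (dividing by N) are used, though many texts divide by N - 1 for samples:

    import math

    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        # average squared deviation from the mean (population formula)
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def correlation(xs, ys):
        # covariance divided by the product of the standard deviations
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
        return cov / math.sqrt(variance(xs) * variance(ys))

    A, B = [1, 2, 3], [5, 10, 15]
    C, D = [5, 10, 15], [7, 6, 5]
    print(correlation(A, B))   # prints 1.0
    print(correlation(C, D))   # prints -1.0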
SUGGESTED READINGS
1. Anne Anastasi, Psychological Testing. New York: Macmillan, revised
edition, 1976. See especially Chapters 5 and 6 on test validity.
2. John B. Carroll, 'Fundamental Considerations in Testing for English
Language Proficiency of Foreign Students,' Testing, Washington, D.C.:
Center for Applied Linguistics, 1961, 31-40. Reprinted in H. B. Allen and R. N. Campbell (eds.) Teaching English as a Second Language: A Book of Readings. New York: McGraw Hill, 1972, 313-320.
3. L. J. Cronbach, Essentials of Psychological Testing. New York: Harper and Row, 1970. See especially the discussion of different types of validity
in Chapter 5.
4. Robert Lado, Language Testing. New York: McGraw Hill, 1961. See
especially his discussion of discrete point test rationale pp. 25-29 and 39-203.
5. John W. Oller, Jr., 'Language Testing,' in Ronald Wardhaugh and
H. Douglas Brown (eds.) Survey' of Applied Linguistics. Ann Arbor,
Michigan: University of Michigan, 1976, 275-300.
6. Robert L. Thorndike, 'Reliability,' Proceedings of the 1963 Invitational
Conference on Testing Problems. Princeton, N.J.: Educational Testing
Service, 1964. Reprinted in Glenn H. Bracht, Kenneth D. Hopkins, and
Julian C. Stanley (eds.) Perspectives in Educational and Psychological
Measurement. Englewood Cliffs, N.J.: Prentice-Hall, 1972, 66-73.
Multilingual Assessment
A. Need
B. Multilingualism versus
multidialectalism
C. Factive and emotive aspects of
multilingualism
D. On test biases
E. Translating tests or items
F. Dominance and proficiency
G. Tentative suggestions
Multilingualism is a pervasive modern reality. Ever since that cursed Tower was erected, the peoples of the world have had this problem. In the United States alone, there are millions of people in every major urban center whose home and neighborhood language is not one of the majority varieties of English. Spanish, Italian, German, Chinese and a host of other 'foreign' languages have actually become
American languages. Furthermore, Navajo, Eskimo, Zuni, Apache,
and many other native languages of this continent can hardly be
called 'foreign' languages. The implications for education are
manifold. How shall we deliver curricula to children whose language
is not English? How shall we determine what their language skills
are?
A. Need
Zirkel (1976) concludes an article entitled 'The why's and ways of
testing bilinguality before teaching bilingually,' with the following
paragraph:
The movement toward an effective and efficient means of
testing bilinguality before teaching bilingually is in progress. In
its wake is the hope that in the near future 'equality of
educational opportunity' will become more meaningful for
linguistically different pupils in our nation's elementary schools
(p. 328).
Earlier he observes, however, that 'a substantial number of bilingual
programs do not take systematic steps to determine the language
dominance of their pupils' (p. 324).
Since the 1974 Supreme Court ruling in the case of Lau versus
Nichols, the interest in multilingual testing in the schools of the
United States has taken a sudden upswing. The now famous court
case involved a contest between a Chinese family in San Francisco
and the San Francisco school system. The following quotation from
the Court's Syllabus explains the nature of the case:
The failure of the' San Francisco school system to provide
English language instruction to approximately 1,800 students
of Chinese ancestry who do not speak English denies them a
meaningful opportunity to participate in the public educational
program and thus violates §601 of the Civil Rights Act of 1964,
which bans discrimination based 'on the ground of race, color,
or national origin,' (Lau vs. Nichols, 1974, No. 72-6520).
On page 2 of an opinion by Mr. Justice Stewart, concurred in by The
Chief Justice and Mr. Justice Blackmun, it is suggested that 'no
specific remedy is urged upon us. Teaching English to the students of
Chinese ancestry who do not speak the language is one choice. Giving
instruction to this group in Chinese is another' (1974, No. 72-6520).
Further, the Court argued:
Basic English skills are at the very core of what these public schools teach. Imposition of a requirement that, before a child can effectively participate in the educational program, he must already have acquired those basic skills is to make a mockery of
public education. We know that those who do not understand
English are certain to find their classroom experiences wholly
incomprehensible and in no way meaningful (1974, No.
72-6520, p. 3).
As a result of the interpretation rendered by the Court, the U.S.
Office of Civil Rights convened a Task Force which recommended
certain so-called 'Lau Remedies'. Among other things, the main
document put together by the Task Force requires language
assessment procedures to determine certain facts about language use
and it requires the rating of bilingual proficiency on a rough five point
scale (1. monolingual in a language other than English; 2. more proficient in another language than in English; 3. balanced bilingual in English and another language; 4. more proficient in English than in another language; 5. monolingual in English).
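Purely as an illustration of how such a rating might be operationalized (the cutoff logic and the function are invented here; they are not part of the Lau Remedies), the five categories amount to a comparison of two proficiency scores:

    def lau_category(other_language_score, english_score, tie_margin=0.5):
        # Map two proficiency scores (on any common scale) onto the
        # rough five-point scale; degenerate cases are ignored.
        if english_score == 0:
            return 1   # monolingual in a language other than English
        if other_language_score == 0:
            return 5   # monolingual in English
        if abs(other_language_score - english_score) <= tie_margin:
            return 3   # balanced bilingual
        return 2 if other_language_score > english_score else 4

The point of the sketch is only that the scale presupposes comparable measures of proficiency in both languages, which is precisely the assessment problem taken up in this chapter.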
Multilingual testing seems to have come to stay for a while in U.S. schools, but as Zirkel and others have noted, it has come very late. It is late in the sense of antiquated and inhumane educational programs that placed children of language backgrounds other than English in classes for the 'mentally retarded' (Diana versus California State Education Department, 1970, No. C-7037), and it is late in terms of bilingual education programs that were started in the 1960s and even in the early 1970s on a hope and a promise but without adequate assessment of pupil needs and capabilities (cf. John and Horner, 1971, and Shore, 1974, cited by Zirkel, 1976). In fact, as recently as 1976, Zirkel observes, 'a fatal flaw in many bilingual programs lies in the linguistic identification of pupils at the critical point of the planning and placement process' (p. 324).

Moreover, as Teitelbaum (1976), Zirkel (1976), and others have often noted, typical methods of assessment such as surname surveys (to identify Spanish speakers, for instance) or merely asking about language preferences (e.g., teacher or student questionnaires) are largely inadequate. The one who is most likely to be victimized by such inadequate methods is the child in the school. One second grader indicated that he spoke 'English only' on a rating sheet, but when casually asked later whether his parents spoke Spanish, the child responded without hesitation: Sí, ellos hablan español - pero yo, no ('Yes, they speak Spanish - but not me'). Perhaps someone had convinced him that he was not supposed to speak Spanish!

Surely, for the sake of the child, it is necessary to obtain reliable and valid information about what language(s) he speaks and understands (and how well) before decisions are reached about curriculum delivery and the language policy of the classroom. 'But,' some sincere administrator may say, 'we simply can't afford to do systematic testing on a wide scale. We don't have the people, the budget, or the time.' The answer to such an objection must be indignant if it is equally sincere and genuine. Can the schools afford to do this year what they did last year? Can they afford to continue to deliver instruction in a language that a substantial number of the children cannot understand? Can they afford to do other wide scale standardized testing programs whose results may be less valid?

It may be true that the educators cannot wait until all the research
results are in, but it is equally true that we cannot afford to play political games of holding out for more budget to make changes in ways the present budget is being spent, especially when those changes were needed years ago. This year's budget and next year's (if there is a next year) will be spent. People in the schools will be busy at many tasks, and all of the available time will get used up. Doing all of it the way it was done last year is proof only of the disappointing fact that a system that purports to teach doesn't necessarily learn. Indeed, it is merely another comment on the equally discouraging fact that many students in the schools (and universities no doubt) must learn in spite of the system which becomes an adversary instead of a servant to the needs of learners.

The problems are not unique to the United States. They are worldwide problems. Hofman (1974) in reference to the schools of Rhodesia says, 'It is important to get some idea, one that should have been reached ten years ago, of the state of English in the primary school' (p. 10). In the case of Rhodesia, and the argument can easily be extended to many of the world's nations, Hofman questions the blind and uninformed language policy imposed on teachers and children in the schools. In the case of Rhodesia, at least until the time his report was written, English was the required school language from 1st grade onward. Such policies have recently been challenged in many parts of the world (not just in the case of a super-imposed English) and reports of serious studies examining important variables are beginning to appear (see for instance, Bezanson and Hawkes, 1976, and Streiff, 1977). This is not to say that there may not be much to be gained by thorough knowledge of one of the languages of the world currently enjoying much power and prestige (such as English is at the present moment), but there are many questions concerning the price that must be paid for such knowledge. Such questions can scarcely be posed without serious multilingual testing on a much wider scale than has been common up till now.
B. Multilingualism versus multidialectalism
The problems of language testing in bilingual or multilingual
contexts seem to be compounded to a new order of magnitude each
time a new language is added to the system. In fact, it would seem that
the problems are more than just doubled by the presence of more than
one language because there must be social and psychological
interactions between different language communities producing
complexities not present in monolingual communities. However,
there are some parallels between so-called monolingual communities
and multilingual societies. The former display a rich diversity of
language varieties in much the way that the latter exhibit a variety of
languages. To the extent that differences in language varieties parallel
differences in languages, there may be less contrast between the two
sorts of settings than is commonly thought. In both cases there is the
need to assess performance in relation to a plurality of group norms.
In both cases there is the difficulty of determining what group norms
are appropriate at different times and places within a given social
order.
It has sometimes been argued that children in the schools should be
compared only against themselves and never against group norms,
but this argument implicitly denies the nature of normal human
communication. Evaluating the language ability of an individual by
comparing him only against himself is a little like clapping with one hand. Something is missing. It only makes sense to say that a person
knows a language in relation to the way that other persons who also
know that language perform when they use it. Becoming a speaker of
a particular language is a distinctively socializing process. It is a
process of identifying with and to some degree functioning as a
member of a social group.
In multilingual societies, where many mutually unintelligible
languages are common fare in the market places and urban centers,
the need for language proficiency testing as a basis for informing educational policy is perhaps more obvious than in so-called monolingual societies. However, the case of monolingual societies, which are typically multidialectal, is deceptive. Although different
varieties of a language may be mutually intelligible in many
situations, in others they are not. At least since 1969, it has been
known that school children who speak different varieties of English
perform about equally badly in tests that require the repetition of
sentences in the other group's variety (Baratz, 1969). Unfortunately
for children who speak a non-majority variety of English, all of the
other testing in the schools is done in the majority variety (sometimes
referred to as 'standard English'). An important question currently
being researched is the extent to which educational tests in general
may contain a built-in language variety bias and related to it is the
more general question concerning how much of the variance in educational tests in general can be explained by variance in language proficiency tests (see Stump, 1978, Gunnarsson, 1978, and Pharis and Perkins, in press; also see the Appendix).
The parallel between multilingualism and multidialectalism is still
more fundamental. In fact, there is a serious question of principle
concerning whether it is possible to distinguish languages and
dialects. Part of the trouble lies in the fact that for any given language
(however we define it), there is no sense at all in trying to distinguish it
from its dialects or varieties. The language is its varieties. The only
sense in which a particular variety of a language may be elevated to a
status above other varieties is in the manner of Orwell's pigs - by
being a little more equal or in this case, a little more English or French
or Chinese or Navajo or Spanish or whatever. One of the important
rungs on the status ladder for a language variety (and for a language
in the general sense) is whether or not it is written and whether or not
it can lay claim to a long literary tradition. Other factors are who
happens to be holding the reins of political power (obviously the
language variety they speak is in a privileged position), and who has
the money and the goods that others must buy with their money. The
status of a particular variety of English is subject to many of the same
influences that the status of English (in the broader sense) is
controlled by.
The question, where does language X (or variety X) leave off and
language Y (or variety Y) begin is a little like the question, where does
the river stop and the lake begin. The precision of the answer, or lack
of it, is not so much a product of clarity or unclarity of thought as it is
a product of the nature of the objects spoken of. New terms will not
make the boundaries between languages, or dialects, or between
languages and language varieties any clearer. Indeed, it can be argued
that the distinction between languages as disparate as Mandarin and
English (or Navajo and Spanish) is merely a matter of degree. For
languages that are more closely related, such as German and
Swedish, or Portuguese and Spanish, or Navajo and Apache, it is
fairly obvious that their differences are a matter of degree. However,
in relation to abstract grammatical systems that may be shared by
all human beings as part of their genetic programming, it may be the
case that all languages share much of the same universal grammatical
system (Chomsky, 1965, 1972).
Typically, varieties of a language that are spoken by minorities are
termed 'dialects' in what sometimes becomes an unintentional (or
possibly intentional) pejorative sense. For example, Ferguson and
Gumperz (1971) suggest that a dialect is a 'potential language' (p. 34).
This remark represents the tendency to institutionalize what may be
appropriately termed the 'more equal syndrome'. No one would ever
suggest that the language that a middle class white speaks is a
potential language - of course not, it's a real language. But the
language spoken by inner city blacks - that's another matter. A
similar tendency is apparent in the remark by The Chief Justice in the
Lau case where he refers to the population of Chinese speaking
children in San Francisco (the 1,800 who were not being taught
English) as 'linguistically deprived' children (Lau versus Nichols,
1974, No. 72-6520, p. 3). Such remarks may reflect a modicum of
truth, but deep within they seem to arise from ethno-centric
prejudices that define the way I do it (or the way we do it) as
intrinsically better than the way anyone else does it. It is not difficult
to extend such intimations to 'deficit theories' of social difference like
those advocated by Bernstein (1960), Bereiter and Engleman (1967),
Herrnstein (1971), and others.
Among the important questions that remain unanswered and that
are crucial to the differentiation of multilingual and monolingual
societies are the following: to what extent does normal educational
testing contain a language variety bias? And further, to what extent is
that bias lesser or greater than the language bias in educational
testing for children who come from a non-English speaking
background? Are the two kinds of bias really different in type or
merely in degree?
C. Factive and emotive aspects of multilingualism
Difficulties in communication between social groups of different
language backgrounds (dialects or language varieties included) are
apt to arise in two ways: first, there may be a failure to communicate
on the factive level; or second, there may be a failure to communicate
on the emotive level as well as the factive level. If a child comes to
a school from a cultural and linguistic background that is substantially
different from the background of the majority of teachers and
students in the school, he brings to the communication contexts of the
school many sorts of expectations that will be inappropriate to many
aspects of the exchanges that he might be expected to initiate or
participate in. Similarly, the representatives of the majority culture
and possibly other minority backgrounds will bring other sets of
expectations to the communicative exchanges that must take place.
In such linguistically plural contexts, the likelihood of misinterpretations and breakdowns in communication is increased. On the
factive level, the actual forms of the language(s) may present some
difficulties. The surface forms of messages may sometimes be
uninterpretable, or they may sometimes be misinterpreted. Such
problems may make it difficult for the child, teacher, and others in the
system to communicate the factive-level information that is usually
the focus of classroom activities - namely, transmitting the subject
matter content of the curriculum. Therefore, such factive level
communication problems may account for some portion of the
variance in the school performance of children from ethnically and
culturally different backgrounds, i.e., their generally lower scores on
educational tests. As Baratz (1969) has shown, however, it is
important to keep in mind the fact that if the tables were turned, if the
majority were suddenly the minority, their scores on educational tests
might be expected to plummet to the same levels as are typical of
minorities in today's U.S. schools. Nevertheless, there is another
important cluster of factors that probably affect variance in learning
far more drastically than problems of factive level communication.
There is considerable evidence to suggest that the more elusive
emotive or attitudinal level of communication may be a more
important variable than the surface form of messages concerning
subject matter. This emotive aspect of communication in the schools
directly relates to the self-concept that a given child is developing, and
also it relates to group loyalties and ethnic identity. Though such
factors are difficult to measure (as we shall see in Chapter 5), it seems
reasonable to hypothesize that they may account for more of the
variance in learning in the schools than can be accounted for by the
selection of a particular teaching methodology for instilling certain
subject matter (factive level communication).
As the research in the Canadian experiments has shown, if the
socio-cultural (emotive level) factors are not in a turmoil and if the
child is receiving adequate encouragement and support at home, etc.,
the child can apparently learn a whole new way of coding information
factively (a new linguistic system) rather incidentally and can acquire
the subject matter and skills taught in the schools without great
difficulty (Lambert and Tucker, 1972, Tucker and Lambert, 1973,
Swain, 1976a, 1976b).
However, the very different experience of children in schools, say,
in the Southwestern United States where many ethnic minorities do
not experience such success requires a different interpretation.
Perhaps the emotive messages that the child is bombarded with in the
Southwest help explain the failure of the schools. Pertinent questions
r
82
are: how does the child see his culture portrayed in the curriculum?
How does the child see himself in relation to the other children who
may be defined as successful by the system? How does the child's
home experience match up with the experience in the school?
It is hypothesized that variance in rate of learning is probably more
sensitive to the emotive level messages communicated in facial
expressions, tones of voice, deferential treatment of some children in a
classroom, biased representation of experiences in the school
curriculum, and so on, than to differences in factive level methods of
presenting subject matter, This may be more true for minority
children than it is for children who are part of the majority. A similar
view has been suggested by Labov (1972) and by Goodman and Buck
(1973).
The hypothesis and its consequences can be visualized as shown in
Figure 2 where the area enclosed by the larger circle represents the
total amount of variance in learning to be accounted for (obviously
the Figure is a metaphor, not an explanation or model). The area
enclosed by the smaller concentric circle represents the hypothetical
amount of variance that might be explained by emotive message
factors. Among these are messages that the child perceives
concerning his own worth, the value of his people and culture, the
viability of his language as a means of communication, and the
validity of his life experience. The area in the outer ring represents the
hypothetical portion of variance in learning that may be accounted
for by appeal to factive level aspects of communication in the schools,
such as methods of teaching, subject matter taught, language of
presentation of the material, IQ, initial achievement levels, etc.
Of all the ways struggles for ethnic identity manifest themselves,
and of all the messages that can be communicated between different
social groups in their mutual struggles to identify and define
themselves, William James (as cited by Watzlawick, et al., 1967, p. 86)
suggested that the most cruel possible message one human being can
communicate to another (or one group to another) is simply to
pretend that the other individual (or group) does not exist. Examples
are too common for comfort in the history of education. Consider the
statement that Columbus discovered America in 1492 (Banks, 1972).
Then ask, who doesn't count? (Clue: Who was already here before
Columbus ever got the wind in his sails?) James said, 'No more
fiendish punishment could be devised ... than that one should be
turned loose in a society and remain absolutely unnoticed by all the members thereof' (as cited by Watzlawick, et al., 1967, p. 86).
[Figure 2: two concentric circles. The outer ring is labelled 'variance which can be attributed to METHODS'; the inner circle carries the emotive-level labels 'Representation of the child's own people', 'the validity of the child's own experience', 'Portrayal of the viability of the child's own language', and 'Portrayal of the value of the child's own culture'.]
Figure 2. A hypothetical view of the amount of
variance in learning to be accounted for by emotive
versus factive sorts of information (methods of
conveying subject matter are represented as
explaining variance in the outer ring, while the bulk
is explained by emotive factors).
The interpretation of low scores on tests, therefore, needs to take
account of possible emotive conflicts. While a high score on a
language test or any other educational test probably can be
confidently interpreted as indicating a high degree of skill in
communicating factive information as well as a good deal of
harmony between the child and the school situation on the emotive
level, a low score cannot be interpreted so easily. In this respect, low
scores on tests are somewhat like low correlations between tests (see
the discussion in Chapter 3, section E, and again in Chapter 7, section
B); they leave a greater number of options open. A low score may
occur because the test was not reliable or valid, or because it was not
suited to the child in difficulty level, or because it created emotive
reactions that interfered with the cognitive task, or possibly because
the child is really weak in the skill tested. The last interpretation,
however, should be used with caution and only after the other
reasonable alternatives have been ruled out by careful study. It is
important to keep in mind the fact that an emotive-level conflict is
more likely to call for changes in the educational system and the way
that it affects children than for changes in the children.
In some cases it may be that simply providing the child with ample
opportunity to learn the language or language variety of the
educational system is the best solution; in others, it may be necessary
to offer instruction in the child's native language, or in the majority
language and the minority language; and no doubt other untried
possibilities exist. If the emotive factors are in harmony between the
school and the child's experience, there is some reason to believe that
mere exposure to the unfamiliar language may generate the desired
progress (Swain, 1976a, 1976b).
In the Canadian experiments, English speaking children who are
taught in French for the first several years of their school experience,
learn the subject matter about as well as monolingual French
speaking children, and they also incidentally acquire French. The
term 'submersion' has recently been offered by Swain (1976b) to
characterize the experience of minority children in the Southwestern
United States who do not fare so well. The children are probably not
all that different, but the social contexts of the two situations are
replete with blatant contrasts (Fishman, 1976).
D. On test biases
A great deal has been written recently concerning cultural bias in tests
(Briere, 1972, Condon, 1973). No doubt much of what is being said is
true. However, some well-meaning groups have gone so far as to
suggest a 'moratorium on all testing of minority children.' Their
argument goes something like this. Suppose a child has learned a
language that is very different from the language used by the
representatives of the majority language and culture in the schools.
When the child goes to school, he is systematically discriminated
against (whether intentionally or not is irrelevant to the argument).
All of the achievement tests, all of the classes, all of the informal
teacher and peer evaluations that influence the degree of success or
failure of the child are in a language (or language variety) that he has
not yet learned. The entire situation is culturally biased against the
child. He is regularly treated in a prejudicial way by the school system
MULTILINGUAL ASSESSMENT
85
as a whole. So, some urge that we should aim to get the cultural bias
out of the school situation as much as possible and especially out of
the tests. A smaller group urges that all testing should be stopped
indefinitely pending investigation of other educational alternatives.
The arguments supporting such proposals are persuasive, but the
suggested solutions do not solve the problems they so graphically
point out. Consider the possibility of getting the cultural bias out of
language proficiency tests. Granted, language tests, though they
may vary in the pungency of their cultural flavor, all have cultural bias
built into them. They have cultural bias because they present
sequences of linguistic elements of a certain language (or language
variety) in specific contexts. In fact, it is the purpose of such tests to
discriminate between various levels of skill, often, to discriminate
between native and non-native performance. A language test is
therefore intentionally biased against those who do not speak the
language or who do so at different levels of skill.
Hence, getting the bias out of language tests, if pushed to the
logical limits, is to get language tests to stop functioning as language
tests. On the surface, preventing the cultural bias and the
discrimination between groups that such tests provide might seem
like a good idea, but in the long run it will create more problems than
it can solve. For one, it will do harm to the children in the schools who
most need help in coping with the majority language system by
pretending that crucial communication problems do not exist. At the
same time it would also preclude the majority culture representatives
in schools from being exposed to the challenging opportunities of
trying to cope in settings that use a minority language system. If this
latter event is to occur, it will be necessary to evaluate developing
learner proficiencies (of the majority types) in terms of the norms that
exist for the minority language(s) and culture(s).
The more extreme alternative of halting testing in the schools is no
real solution either. What is needed is more testing that is based on
carefully constructed tests and with particular questions in mind
followed by deliberate and careful analysis. Part of the difficulty is the
lack of adequate data - not an overabundance of it. For instance,
until recently (Oller and Perkins, 1978) there was no data on the
relative importance of language variety bias, or just plain language
bias in educational testing in general. There was always plenty of
evidence that such a factor must be important to a vast array of
educational tests, but how important? Opinions to the effect that it is
not very important, or that it is of great importance merely accent the
need for empirical research on the question. It is not a question that
can be decided by vote - not even at the time-honored 'grass roots
level' - but that is where the studies need to be done.
There is another important way that some of the facts concerning
test biases and their probable effects on the school performance of
certain groups of children may have been over-zealously interpreted.
These latter interpretations relate to an extension of a version of the
strong contrastive analysis hypothesis familiar to applied linguists.
The reasoning is not unappealing. Since children who do poorly on
tests in school, and on tasks such as learning to read, write, and do
arithmetic, are typically (or at least often) children who do not use the
majority variety of English at home, their failure may be attributed to
differences in the language (or variety of English) that they speak at
home and the language that is used in the schools. Goodman (1965)
offered such an explanation for the lower performance of inner city
black children on reading tests. Reed (1973) seemed to advocate the
same view. They suggested that structural contrasts in sequences of
linguistic elements common to the speech of such children accounted
for their lower reading scores. Similar arguments have been popular
for years as an explanation of the 'difficulties' of teaching or learning
a language other than the native language of the learner group. There
is much controverting evidence, however, for either application of the
contrastive analysis hypothesis (also see Chapter 6 below).
For instance, contrary to the prediction that would follow from the
contrastive analysis hypothesis, in two recent studies, black children
understood majority English about as well as white children, but the
white children had greater difficulty with minority black English
(Norton and Hodgson, 1973, and Stevens, Ruder, and Tew, 1973).
While Baratz (1969) showed that white children tend to transform
sentences presented in black English into their white English
counterparts, and similarly, black children render sentences in white
English into their own language variety, it would appear from her
remarks that at least the black children had little difficulty
understanding white English. This evidence does not conclusively
eliminate the position once advocated by Goodman and Reed, but it
does suggest the possibility of looking elsewhere for an explanation of
reading problems and other difficulties of minority children in the
schools. For example, is it not possible that sociocultural factors that
are of an essentially non-linguistic sort might play an equal if not
greater part in explaining school performance? One might ask
whether black children in the communities where differences have
been observed are subject to the kinds of reinforcement and
punishment contingencies that are present in the experience of
comparable groups of children in the majority culture. Do they see
their parents reading books at home? Are they encouraged to read by
parents and older siblings? These are tentative attempts at phrasing
the right questions, but they hint at certain lines of research.
As to the contrastive explanation of the difficulties of language
learners in acquiring a new linguistic system, a question should
suffice. Why should Canadian French be so much easier for some
middle class children in Montreal to acquire, than English is for many
minority children in the Southwest? The answer probably does not lie
in the contrasts between the language systems. Indeed, as the data
continues to accumulate, it would appear that many of the children
in bilingual programs in the United States (perhaps most of the
children in most of the programs) are dominant in English when they
come to school. The contrastive explanation is clearly inapplicable to
those cases. For a review of the literature on second language studies
and the contrastive analysis approaches, see Oller (1979). For a
systematic study based on a Spanish-English bilingual program in
Albuquerque, New Mexico, see Teitelbaum (1976).
If we reject the contrastive explanation, what then? Again it seems
we are led to emotive aspects of the school situation in relation to the
child's experience outside of the school. If the child's cultural
background is neglected by the curriculum, if his people are not
represented or are portrayed in an unfavorable light or are just
simply misrepresented (e.g., the Plains Indian pictured in a canoe in a
widely used elementary text, Joe Sando, personal communication), if
his language is treated as unsuitable for educational pursuits
(possibly referred to as the 'home language' but not the 'school
language'), probably just about any teaching method will run into
major difficulties.
It is in the area of cultural values and ways of expressing emotive
level information in general (e.g., ethnic identity, approval,
disapproval, etc.) where social groups may contrast more markedly
and in ways that are apt to create significant barriers to communication between groups. The barriers are not so much in the
structural systems of the languages (nor yet in the educational tests)
as they are in the belief systems and ways of living of different
cultures. Such differences may create difficulties for the acceptance of
culturally distinct groups and the people who represent them. The
failure of the minority child in school (or the failure of the school) is
more likely to be caused by a conflict between cultures and the
personalities they sustain rather than a lack of cognitive skills or
abilities (see Bloom, 1976).
In any event, none of the facts about test bias lessens the need for
sound language proficiency testing. Those facts merely accent the
demands on educators and others who are attempting to devise tests
and interpret test results. And, alas, as Upshur (1969b) noted, test is
still a four letter word.
E. Translating tests or items
Although it is possible to translate tests with little apparent loss of
information, and without drastically altering the task set the
examinees under some conditions, the translation of items for
standardized multiple choice tests is beset by fundamental problems
of principle. First we will consider the problems of translating tests
item by item from one multiple choice format into another, and then
we will return to consider the more general problem of translating
pragmatic tasks from one language to another. It may seem
surprising at the outset to note that the former translation procedure
is probably not feasible while the latter can be accomplished without
great difficulty.
A doctoral dissertation completed in 1974 at the University of New
Mexico investigated the feasibility of translating the Boehm Test of
Basic Concepts from English into Navajo (Scoon, 1974). The test
attempts to measure the ability of school children in the early grades
to handle such notions as sequence in time (before versus after) and
location in space (beside, in front of, behind, under, and so on). It was
reasoned by the original test writer that children need to be able to
understand such concepts in order to follow everyday classroom
instructions, and to carry out simple educational tasks. Scoon hoped
to be able to get data from the translated test which would help to
define instructional strategies to aid the Navajo child in the
acquisition and use of such concepts.
Even though skilled bilinguals in English and Navajo helped with
the translation task, and though allowances were made for
unsuccessful translations, dialect variations, and the like, the
tendency for the translated items to produce results similar to the
original items was surprisingly weak. It is questionable whether the
two tests can be said to be similar in what they require of examinees.
Some of the items that were among the easiest ones on the English test
turned out to be very difficult on the Navajo version, and vice versa.
The researcher began the project hoping to be able to diagnose
learning difficulties of Navajo children in their own language. The
study evolved to an investigation of the feasibility of translating a
standardized test in a multiple choice format from English into
Navajo. Scoon concluded that translating standardized tests is
probably not a feasible approach to the diagnosis of educational
aptitudes and skills.
All of this would lead us to wonder about the wisdom of translating
standardized tests of 'intelligence' or achievement. Nevertheless, such
translations exist. There are several reasons why translating a test,
item by item, is apt to produce a very different test than the one the
translators began with.
Translation of factive-level information is of course possible.
However, much more is required. Translation of a multiple choice
test item requires not only the maintenance of the factive information
in the stem (or lead-in part) of the item, but the maintenance of it in
roughly the same relation to the paradigm of linguistic and
extralinguistic contexts that it calls to mind. Moreover, the
relationships between the distractors and the correct answer must
remain approximately the same in terms of the linguistic and
extralinguistic contexts that they call to mind and in terms of the
relationships (similarities and differences) between all of those
contexts. While it may sometimes be difficult to maintain the factive
content of one linguistic form when translating it into another
language, this may be possible. However, to maintain the paradigm
of interrelationships between linguistic and extralinguistic contexts in
a set of distractors is probably not just difficult - it may well be
impossible.
Translating delicately composed test items (on some of the
delicacies, see Chapter 9) is something like trying to translate a joke, a
pun, a riddle, or a poem. As Robert Frost once remarked, 'when a
poem is translated, the poetry is often lost' (Kolers, 1968, p. 4). With
test items (a lesser art form) it is the meaning and the relationship
between alternative choices which is apt to be lost. A translation of a
joke or poem often has to undergo such changes that if it were literally
translated back into the source language it would not be recognizable. With test items it is the meaning of the items in terms of their
effects on examinees that is apt to be changed, possibly beyond
recognition.
A very common statement about a very ordinary fact in English
may be an extremely uncommon statement about a very extraordinary
fact in Navajo. A way of speaking in English may be
incomprehensible in Navajo; for instance, the fact that you cut down
a tree before you cut it up, which is very different from cutting up in a
class or cutting down your teacher. Conversely, a commonplace
saying in Navajo may be enigmatic if literally translated into English.
Successful translation of items requires maintaining roughly the
same style level, the same frequency of usage of vocabulary and
idiom, comparable phrasing and reference complexities, and the
same relationships among alternative choices. In some cases this
simply cannot be done. Just as a pun cannot be directly translated
from one language into another precisely because of the peculiarities
of the particular expectancy grammar that makes the pun possible, a
test item cannot always be translated so as to achieve equal effect in
the target language. This is due quite simply to the fact that the real
grammars of natural languages are devices that relate to paradigms
of extralinguistic contexts in necessarily unique ways.
The bare tip of the iceberg can be illustrated by data from word
association experiments conducted by Paul Kolers and reported in
Scientific American in March 1968. He was not concerned with the
word associations suggested by test items, but his data illustrate the
nature of the problem we are' considering. The method consists of
presenting a word to a subject such as mesa (or table) and asking him
to say whatever other word comes to mind, such as silla (or chair).
Kolers was interested in determining whether pairs of associated
words were similar in both of the languages of a bilingual subject. In
fact, he found that they were the same in only about one-fifth of the
cases. Actually he complicated the task by asking the subject to
respond in the same language as the stimulus word on two tests (one
in each of the subject's two languages), and in the opposite language
in two other tests (e.g., once in English to Spanish stimulus words,
and once in Spanish to English stimulus words). The first two tests
can be referred to as intralingual and the second pair as interlingual.
The chart given below illustrates typical responses in Spanish and
English under all four conditions. It shows that while the response in
English was apt to be girl to the stimulus boy, in Spanish the word
muchacho generated the response hombre. As is shown in the chart,
the interlingual associations tended to be the same in about one out of
five cases.
INTRALINGUAL                                 INTERLINGUAL

ENGLISH    ENGLISH     SPANISH    SPANISH    ENGLISH    SPANISH     SPANISH    ENGLISH
(stimulus) (response)  (stimulus) (response) (stimulus) (response)  (stimulus) (response)
table      dish        mesa       silla      table      silla       mesa       chair
boy        girl        muchacho   hombre     boy        niña        muchacho   trousers
king       queen       rey        reina      king       reina       rey        queen
house      window      casa       madre      house      blanco      casa       mother

TYPICAL RESPONSES in a word-association test were given by a subject
whose native language was Spanish. He was asked to respond in Spanish to
Spanish stimulus words, in English to the same words in English and in each
language to stimulus words in the other.

In view of such facts, it is apparent that it would be very difficult
indeed to obtain similar associations between sets of alternative
choices on multiple choice items. Scoon's results showed that the
attempt at translating the Boehm test into Navajo did not produce a
comparable test. This, of course, does not prove that a comparable
test could not be devised, but it does suggest strongly that other
methods for test development should be employed. For instance, it
would be possible to devise a concept test in Navajo by writing items
in Navajo right from the start instead of trying to translate items from
an existing English test.
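The kind of comparison Kolers made can be sketched computationally. The miniature word lists below are modelled on the chart above and are purely illustrative assumptions; they are not Kolers' actual materials or scoring procedure.

    # Hypothetical miniature data modelled on the chart above; Kolers'
    # actual word lists and scoring procedure are not reproduced here.

    # English-to-Spanish translation equivalents for stimuli and responses
    equivalents = {'table': 'mesa', 'boy': 'muchacho', 'king': 'rey',
                   'house': 'casa', 'dish': 'plato', 'girl': 'niña',
                   'queen': 'reina', 'window': 'ventana'}

    # intralingual associations in each language (stimulus -> response)
    english_assoc = {'table': 'dish', 'boy': 'girl',
                     'king': 'queen', 'house': 'window'}
    spanish_assoc = {'mesa': 'silla', 'muchacho': 'hombre',
                     'rey': 'reina', 'casa': 'madre'}

    # A pair 'matches' across languages when the Spanish response to the
    # translated stimulus is the translation of the English response
    # (here only king/rey -> queen/reina matches).
    matches = sum(1 for stim, resp in english_assoc.items()
                  if spanish_assoc[equivalents[stim]] == equivalents.get(resp))
    print(matches / len(english_assoc))  # 0.25; Kolers found about one-fifth

Any translated set of distractors would have to survive exactly this kind of mismatch, which is one way of seeing why item-by-item equivalence is so hard to achieve.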
Because of the diversity of grammatical systems that different
languages employ, it would be pure luck if a translated test item
should produce highly similar effects on a population of speakers
who have internalized a very different grammatical system for
relating language sequences to extralinguistic contexts. We should
predict in advance that a large number of such translation attempts
would produce markedly different effects in the target language.
Is translation therefore never feasible? Quite the contrary.
Although it is difficult to translate puns, jokes, or isolated test items in
a multiple choice format, it is not terribly difficult to translate a novel,
or even a relatively short portion of prose or discourse. Despite the
idiosyncrasies of language systems with respect to the subtle and
delicate interrelationships of their elements that make poetry and
multiple choice test items possible, there are certain robust features
that all languages seem to share. All of them have ways of coding
factive (or cognitive) information and all of them have ways of
expressing emotive (or affective) messages, and in doing these things,
all languages are highly organized. They are both redundant and
creative (Spolsky, 1968). According to recent research by John
McLeod (1975), many languages are probably about equally
redundant.
These latter facts about similarities of linguistic systems suggest the
possibility of applying roughly comparable pragmatic testing
procedures across languages with equivalent effects. For instance,
there is empirical evidence in five languages that translations and the
original texts from which they were translated can be converted into
cloze tests of roughly comparable difficulty by following the usual
procedures (McLeod, 1975; on procedures see also Chapter 12 of this
book). The languages that McLeod investigated included Czech,
Polish, French, German, and English. However, using different
methods, Oller, Bowen, Dien, and Mason (1972) showed that it is
probably possible to create cloze tests of roughly comparable
difficulty (assuming the educational status, age, and socioeconomic
factors are controlled) even when the languages are as different as
English, Thai, and Vietnamese.
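For concreteness, the usual construction can be sketched in a few lines of code. The deletion rate (every seventh word) and the two sample sentences are illustrative assumptions only; they are not McLeod's materials, and Chapter 12 should be consulted for the actual procedures.

    # A minimal sketch of the standard every-nth-word cloze construction.
    # The deletion rate and the sample passages are illustrative only.

    def make_cloze(passage, nth=7):
        """Replace every nth word with a blank; return the test and its key."""
        words = passage.split()
        key = []
        for i in range(nth - 1, len(words), nth):
            key.append(words[i])       # record the deleted word for scoring
            words[i] = '______'
        return ' '.join(words), key

    english = ('The boy walked to the village market every morning '
               'to buy fresh bread for his whole family.')
    spanish = ('El muchacho caminaba al mercado del pueblo cada mañana '
               'para comprar pan fresco para toda su familia.')

    # Roughly equivalent tests come from running the same procedure over
    # a passage and its translation; no blank-for-blank correspondence
    # between the two versions is attempted (or even desirable).
    for test, key in (make_cloze(english), make_cloze(spanish)):
        print(test)
        print('key:', key)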
In a different application of cloze procedure, Klare, Sinaiko, and
Stolurow (1972) recommend 'back translation' by a translator who
has not seen the original passage. McLeod (1975) used this procedure
to check the faithfulness of the translations used in his study and
referred to it as 'blind back translation'. Whereas in the item-by-item
translation that is necessary for multiple choice tests of the typical
standardized type there will be a systematic one-for-one correspondence between the original test items and the translated
version, this is not possible in the construction of cloze tests over
translation of equivalent passages. In fact it would violate the normal
use of the procedure to try to obtain equivalences between individual
blanks in the passages.
Other pragmatic testing procedures besides the cloze technique
could perhaps also be translated between two or more languages in
order to obtain roughly equivalent tests in different languages.
Multiple choice tests that qualify as pragmatic tasks should be
translatable in this way (for examples, see Chapter 9). Passages in
English and Fante were equated by a translation procedure to test
reading comprehension in a standard multiple choice format by
Bezanson and Hawkes (1976).
Kolers (1968) used a brief paragraph carefully translated into
French as a basis for investigating the nature of bilingualism. He also
constructed two test versions in which either English or French
phrases were interpolated into the text in the opposite language. The
task required of subjects was to read each passage aloud. Kolers was
interested in whether passages requiring switching between English
and French would require more time than the monolingual passages.
He determined that each switch on the average required an extra third
of a second. The task, however, and the procedure for setting up the
task could be adapted to the interests of bilingual or multilingual
testing in other settings.
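The switch-counting logic behind such an adaptation is simple enough to sketch; the mixed passage and the baseline reading rate below are invented for illustration, and only the extra third of a second per switch is taken from Kolers' finding.

    # Sketch of adapting Kolers' switching measure. The passage and the
    # baseline rate are invented; only the one-third-second cost per
    # switch comes from Kolers (1968).

    def count_switches(tagged_words):
        """tagged_words: list of (word, language) pairs in reading order."""
        return sum(1 for prev, cur in zip(tagged_words, tagged_words[1:])
                   if prev[1] != cur[1])

    passage = [('his', 'en'), ('horse', 'en'), ('followed', 'en'),
               ('deux', 'fr'), ('chiens', 'fr'), ('through', 'en'),
               ('la', 'fr'), ('forêt', 'fr')]

    switches = count_switches(passage)      # 3 switches in this passage
    base_rate = 0.4                         # assumed seconds per word
    predicted = len(passage) * base_rate + switches * (1 / 3)
    print(switches, round(predicted, 2))    # 3 4.2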
Ultimately, the arguments briefly presented here concerning the
feasibility of setting up equivalent tests by translation between two or
more languages are related to the controversy about discrete point
and integrative or pragmatic tests. Generally speaking it should be
difficult to translate discrete point items in a multiple choice (or other)
format while maintaining equivalence, though it is probably quite
feasible to translate pragmatic tests and thereby to obtain equivalent
tests in more than one language. For any given discrete item,
translation into another language will produce (in principle and of
necessity) a substantially different item. For any given pragmatic
testing procedure on the other hand, translation into another
language (if it is done carefully) can be expected to produce a
substantially similar test. In this respect, discrete items are delicate
whereas pragmatic procedures are robust. Of course, the possibility
of translating tests of either type merits a great deal more empirical
study.
F. Dominance and proficiency
We noted above that the 'Lau Remedies' require data concerning
language use and dominance. The questions of importance would
seem to be, what language does the child prefer to use (or feel most
comfortable using) with whom and in what contexts of experience?
And, in which of two or more languages is the child most competent?
The most common way of getting information concerning language
use is by either interviewing the children and eliciting information
from them, or by addressing a questionnaire to the teacher(s) or some
other person who has the opportunity to observe the child. Spolsky,
Murphy, Holm, and Ferrel (1972) offer a questionnaire in Spanish
and one in Navajo (either of which can also be used in English) to
'classify' students roughly according to the same basic categories that
are recommended in the 'Lau Remedies' (see Spolsky, et al., p. 79).
Since the 'Remedies' came more than two years later than the
Spolsky, et al. article, it may be safe to assume that the scale
recommended in the 'Remedies' derives from that source. The
teacher questionnaire involves a similar five point scale (see Spolsky,
et al., p. 81).
Two crucial questions arise. Can children's responses to questions
concerning their language-use patterns be relied upon for the
important educational decisions that must be made? And, second,
can teachers judge the language ability of children in their classrooms
(bear in mind that in many cases the teachers are not bilingual
themselves; in fact, Spolsky, 1974, estimates that only about 5% of
the teachers on the Navajo reservation and in BIA schools speak
Navajo)? Related to these crucial questions is the empirical problem
of devising careful testing procedures to assess the validity of self-reported data (by the child) and teacher-reported judgements.
Spolsky, et al made several assumptions concerning the interview
technique which they used for assessing dominance in Spanish and
English:
1. '... bilingual dominance varies from domain to domain.'
Subscores were therefore given for the domains of home, neighborhood,
and school.
2. A child's report of his own language use is likely to be quite
accurate.
3. Vocabulary fluency (word-naming) is a good measure of
knowledge of a language and it is a good method of comparing
knowledge of two languages.
4. The natural bias of the schools in Albuquerque as a testing
situation favors the use of English; this needs to be counteracted by
using a Spanish speaking interviewer (p. 78).
If such assumptions can be made safely, it ought to be possible to
make similar assumptions in other contexts and with little
modification extend the 'three functional tests of oral proficiency'
recommended by Spolsky, et al. Yet they themselves say, somewhat
ambiguously:
Each of the tests described was developed for a specific purpose
and it would be unwise to use it more widely without careful
consideration, but the general principles involved may prove
useful to others who need tests that can serve similar purposes
(p.77).
One is inclined to ask what kind of consideration? Obviously a local
meeting of the School Board or some other organization will not
suffice to justify the assumptions listed above, or to guarantee the
success of the testing procedures, with or without adaptations to fit a
particular local situation (Spolsky, 1974). What is required first is
some careful logical investigation of possible outcomes from the
procedures recommended by Spolsky, et al., and other procedures
which can be devised for the purpose of cross-validation. Second,
empirical study is required as illustrated, for example, in the
Appendix below.
Zirkel (1974) points out that it is not enough merely to place a child
on a dominance scale. Simple logic will explain why. It is possible for
two children to be balanced bilinguals in terms of such a scale but to
differ radically in terms of their developmental levels. An extreme
case would be children at different ages. A more pertinent case would
be two children of the same age and grade level who are balanced
bilinguals (thus both scoring at the mid point of the dominance scale,
see p. 76 above), but who are radically different in language skill in
both languages. One child might be performing at an advanced level
in both languages while the other child is performing at a much lower
level in both languages. Measuring for dominance only would not
reveal such a difference.
No experimentation is required to show the inadequacy of any
procedure that merely assesses dominance - even if it does the job
accurately, and it is doubtful whether some of the procedures being
recommended can do even the job of dominance assessment
accurately. Besides, there are important considerations in addition to
mere language dominance which can enter the picture only when
valid proficiency data are available for both languages (or each of the
several languages in a multilingual setting). Moreover, with care to
insure test equivalence across the languages assessed, dominance
scores can be derived directly from proficiency data - the reverse is
not necessarily possible.
Hence, the question concerning how to acquire reliable information concerning language proficiency in multilingual contexts,
including the important matter of determining language dominance,
is essentially the same question we have been dealing with throughout
this book. In order to determine language dominance accurately, it is
necessary to impose the additional requirement of equating tests
across languages. Preliminary results of McLeod (1975), Klare,
Sinaiko, and Stolurow (1972), Oller, Bowen, Dien, and Mason
(1972), and Bezanson and Hawkes (1976) suggest that careful
translation may offer a solution to the equivalence problem, and no
doubt there are other approaches that will prove equally effective.
There are pitfalls to be avoided, however. There is no doubt that it
is possible to devise tests that do not accomplish what they were
designed to accomplish - that are not valid. Assumptions of validity
are justifiable only to the extent that assumptions of lack of validity
have been disposed of by careful research. On that theme let us
reconsider the four assumptions quoted earlier in this section. What
is the evidence that bilingual dominance 'varies from domain to
domain'?
In 1969, Cooper reported that a word-naming task (the same sort
of task used in the Spanish-English test of Spolsky, et al., 1972) which
varied by domains such as 'home' versus 'school' or 'kitchen' versus
'classroom' produced different scores depending on the domain
referred to in a particular portion of the test. The task set the
examinee was to name all the things he could think of in the 'kitchen',
for example. Examinees completed the task for each domain (five in
all in Cooper's study) in both languages without appropriate
counterbalancing to avoid an order effect. Since there were significant
contrasts between relative abilities of subjects to do the task in
Spanish and English across domains, it was concluded that their
degree of dominance varied from one domain to another. This is a
fairly broad leap of inference, however.
Consider the following question: does the fact that I can name
more objects in Spanish that I see in my office than objects that I can
see under the hood of my car mean that I am relatively more
proficient in Spanish when sitting in my office than when looking
under the hood of my car? What Cooper's results seem to show (and
Teitelbaum, 1976, found similar results with a similar task) is that the
number of things a person can name in reference to one physical
setting may be smaller or greater than the number that the same
person can name in reference to another physical setting. This is not
evidence of a very direct sort about possible changes in language
dominance when sitting in your living room, or when sitting in a
classroom. Not even the contrast in 'word-naming' across languages
is necessarily an indication of any difference whatsoever in language
dominance in a broader sense. Suppose a person learned the names of
chess pieces in a language other than his native language, and suppose
further that he does not know the names of the pieces in his native
language. Would this make him dominant in the foreign language
when playing chess?
A more important question is not whether there are contrasts
across domains, but whether the 'word-naming' task is a valid
indication of language proficiency. Insufficient data are available. At
face value such a task appears to have little relation to the sorts of
things that people normally do with language, children especially.
Such a task does not qualify as a pragmatic testing procedure because
it does not require time-constrained sequential processing, and it is
doubtful whether it requires mapping of utterances onto extralinguistic
contexts in the normal ways that children might perform such
mappings - naming objects is relatively simpler than even the speech
of median-ranged three-and-a-half year old children (Hart, 1974, and
Hart, Walker, and Gray, 1977).
Teitelbaum (1976) correlated scores on word-naming tasks (in
Spanish) with teacher-ratings and self-ratings differentiated by four
domains ('kitchen, yard, block, school'). For a group of kindergarten
through 4th grade children in a bilingual program in Albuquerque
(nearly 100 in all), the correlations ranged from .15 to .45.
Correlations by domain with scores on an interview task, however,
ranged from .69 to .79. These figures hardly justify the differentiation
of language dominance by domain. The near equivalence of the
correlations across domains with a single interview task seems to
show that the domain differentiation is pointless. Cohen (1973) has
adapted the word-naming task slightly to convert it into a story-telling
procedure by domains. His scoring is based on the number of
different words used in each story-telling domain. Perhaps other
scoring techniques should also be considered.
The second assumption quoted above was that a child's own report
of his language use is apt to be 'quite accurate'. This may be more true
for some children than for others. For the children in Teitelbaum's
study neither the teacher's ratings nor the children's own ratings were
very accurate. In no case did they account for more than 20% of the
variance in more objective measures of language proficiency.
What about the child Zirkel (1976) referred to? What if some
children are systematically indoctrinated concerning what language
they are supposed to use at school and at home as some advocates of
the 'home language/school language' dichotomy advocate? Some
research with bilingual children seems to suggest that at an early age
they may be able to discriminate appropriately between occasions
when one language is called for and occasions when the other
language is required (e.g., answering in French when spoken to in
French, but in English when spoken to in English, Kinzel, 1964),
without being able to discuss this ability at a more abstract level (e.g.,
reporting when you are supposed to speak French rather than
English). Teitelbaum's data reveal little correlation between
questions about language use and scores on more objective language
proficiency measures. Is it possible that a bilingual child is smart
enough to be sensitive to what he thinks the interviewer expects him
to say? Upshur (1971a) observes, 'it isn't fair to ask a man to cut his
own throat, and even if we should ask, it isn't reasonable to expect
him to do it. We don't ask a man to rate his proficiency when an
honest answer might result in his failure to achieve some desired goal'
(p. 58). Is it fair to expect a child to respond independently of what he
may think the interviewer wants to hear?
We have dealt earlier with the third assumption quoted above, so
we come to the fourth. Suppose that we assume the interviewer
should be a speaker of the minority language (rather than English) in
order to counteract an English bias in the schools. There are several
possibilities. Such a provision may have no effect, the desired effect
(if indeed it is desired, as it may distort the picture of the child's true
capabilities along the lines of the preceding paragraph), or an effect
that is opposite to the desired one. The only way to determine which
result is the actual one is to devise some empirical measure of the
relative magnitude of a possible interviewer effect.
G. Tentative suggestions
What methods then can be recommended for multilingual testing?
There are many methods that can be expected to work well and
deserve to be tried - among them are suitable adaptations of the
methods discussed in this book in Part Three. Some of the ones that
have been used with encouraging results include oral interview
procedures of a wide range of types (but designed to elicit speech from
the child and data on comprehension, not necessarily the child's own
estimate of how well he speaks a certain language) - see Chapter 11.
Elicited imitation (a kind of oral dictation procedure) has been widely
used - see Chapter 10. Versions of the cloze procedure (particularly
ones that may be administered orally) are promising and have been
used with good results - see Chapter 12. Variations on composition
tasks and story telling or retelling have also been used - see Chapter
13. No doubt many other procedures can be devised - Chapter 14
offers some suggestions and guidelines.
In brief, what seems to be required is a class of testing procedures
providing a basis for equivalent tests in different languages that will
yield proficiency data in both languages and that will simultaneously
provide dominance scores of an accurate and sensitive sort. Figure 3
offers a rough conceptualization of the kinds of equivalent measures
needed. If Scale A in the diagram represents a range of possible scores
on a test in language A, and if Scale B represents a range of possible
scores on an equivalent test in language B, the relative difference in
scores on A and B can provide the basis for placement on the
dominance scale C (modelled after the 'Lau Remedies' or the
Spolsky, et al., 1972, scales).
[Figure 3: three horizontal scales. Scale A and Scale B each run from 0% to 100%; Scale C below them is marked off at five points labelled A, Ab, AB, Ba, and B.]
Figure 3. A dominance scale in relation to
proficiency scales. (Scales A and B represent
equivalent proficiency tests in languages A and B,
while scale C represents a dominance scale, as
required by the Lau Remedies. It is claimed that the
meaning of C can only be adequately defined in
relation to scores on A and B.)
It would be desirable to calibrate both of the proficiency scales with
reference to comparable groups of monolingual speakers of each
language involved (Cowan and Zarmed, 1976, followed such a
procedure) in order to be able to interpret scores in relation to clear
criteria of performance. The dominance scale can be calibrated by
defining distances on that scale in terms of units of difference in
proficiency on Scales A and B.
This can be done as follows: first, subtract each subject's score on
A from the score on B. (If the tests are calibrated properly, it is not
likely that anyone will get a perfect score on either test though there
may be some zeroes.) Then rank order the results. They should range
from a series of positive values to a series of negative values. If the
group tested consists only of children who are dominant in one
language, there will be only positive or only negative values, but not
both. The ends of the rank will define the ends of the dominance scale
(with reference to the population tested) - that is the A and B ,points
on Scale C in Figure 3. The center point, AB on Scale C, is simply the
zero position in the rank. That is the point at which a subje-<.:t's scores
in both languages are equal. The points between the ends and the
center, namely, Ab and Ba;can be defined by finding the mid point in
'v
the rank between that end (A or B) and the center (AB).
The degree of accuracy with which a particular subject can be
classed as A = monolingual in A, Ab = dominant in A, AB = equally
bilingual in A and B, Ba = dominant in B, B = monolingual in B, can
be judged quite accurately in terms of the standard error of
differences on Scales A and B. The standard error of the differences
can be computed by finding the standard deviation of the differences
(A minus B, for each subject) and then dividing it by the square root of the
number of subjects tested. If the distribution of differences is
approximately normal, chances are better than 99 in 100 that a
subject's true degree of bilinguality will fall within the range of plus or
minus three standard errors above or below his actual attained score
on Scale C. If measuring off ± 3 standard errors from a subject's
attained score still leaves him close to, say, Ab on the Scale, we can be
confident in classifying him as 'dominant in A'.
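The arithmetic of this calibration is easy to sketch in code. The scores below are invented for illustration, and the placement rule simplifies the rank-based definition of the Ab and Ba points given above (the monolingual endpoints A and B are not modelled).

    import statistics

    def dominance_scale(scores_a, scores_b, band=3):
        # Difference scores (A minus B) and their standard error:
        # sd of the differences divided by the square root of n.
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        se = statistics.stdev(diffs) / len(diffs) ** 0.5

        def classify(d):
            # Simplified rule: call a child balanced (AB) only when zero
            # lies inside the +/- band * se interval around his difference.
            if abs(d) <= band * se:
                return 'AB (approximately balanced)'
            return 'Ab (dominant in A)' if d > 0 else 'Ba (dominant in B)'

        return sorted(diffs, reverse=True), se, classify

    # Invented scores for six children on equivalent tests A and B:
    a_scores = [88, 75, 62, 55, 40, 30]
    b_scores = [35, 50, 60, 58, 70, 85]
    ranked, se, classify = dominance_scale(a_scores, b_scores)
    for d in ranked:
        print(f'{d:+4d}  {classify(d)}')
    print(f'standard error of differences: {se:.2f}')

With differences as spread out as these, the standard error is large and most children fall inside the ± 3 standard error band, which is precisely the accuracy problem taken up in the next paragraph.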
Thus, if the average standard error of differences in scores on tests
A and B is large, the accuracy of Scale C will be less than if the average
standard error is small. A general guideline might be to require at
least six standard errors between: each of the five points on the
domiriance scale. It remains to be seen, however, what degree, of
accuracy will be possible. For suggestions on equating scales across
languages, see Chapter 10, pp. 289-295.¹

¹ Also see discussion question number 10 at the end of Chapter 10. Much basic research
is needed on these issues.
KEY POINTS
1. There is a serious need for multilingual testing in the schools not just of
the United States, but in many nations.
2. In the Lau versus Nichols case in 1974, the Supreme Court ruled that the
San Francisco Schools were violating a section of the Civil Rights Code
which 'bans discrimination based on the ground of race, color, or
national origin' (1974, No. 72-6520). It was ruled that the schools should
either provide instruction in the native language of the 1,800 Chinese
speaking children in question, or provide special instruction in the
English language.
3. Even at the present, in academic year 1978-9, many bilingual programs
and many schools which have children of multilingual backgrounds are
not doing adequate language assessment.
4. There are important parallels between multilingual and multidialectal
societies. In both cases there is a need for language assessment
procedures referenced against group norms (a plurality of them).
5. In a strong logical sense, a language is its varieties or dialects, and the
dialects or varieties are languages. A particular variety may be elevated
to a higher status by virtue of the 'more equal syndrome', but this does
not necessitate that other varieties must therefore be regarded as less
than languages - mere 'potential languages'.
6. Prejudices can be institutionalized in theories of 'deficits' or 'deprivations'. The pertinent question is, from whose point of view? The
institutionalization of such theories into discriminatory educational
practices may well create real deficits.
7. It is hypothesized that, at least for the minority child, and perhaps for the
majority child as well, variance in learning in the schools may be much
more a function of the emotive aspects of interactions within and outside
of schools than it is a function of methods of teaching and presentation
of subject matter per se.
8. When emotive level struggles arise, factive level communication usually
stops altogether.
9. Ignoring the existence of a child or social group is a cruel punishment.
Who discovered America?
10. Getting the cultural bias out of language tests would mean making them
into something besides language tests. However, adapting them to
particular cultural needs is another matter.
11. Contrastive analysis based explanations for the generally lower scores of
minority children on educational tests run into major empirical
difficulties. Other factors appear to be much more important than the
surface forms of different languages or language varieties.
12. Translating discrete point test items is roughly comparable to translating
jokes, or puns, or poems. It is difficult if not impossible.
13. Translating pragmatic tests or testing procedures on the other hand is
more like translating prose or translating a novel. It can be done, not
easily perhaps, but at least it is feasible.
14. 'Blind back translation' is one procedure for checking the accuracy of
translation attempts.
15. Measures of multilingual proficiencies require valid proficiency tests.
The validity of proposed procedures is an empirical question.
Assumptions must be tested, or they remain a threat to every educational
decision based on them.
16. Measuring dominance is not enough. To interpret the meaning of a score
on a dominance scale, it is useful to know the proficiency scores which it
derives from.
17. It is suggested that a dominance scale of the sort recommended in the Lau
Remedies can be calibrated in terms of the average standard error of the
differences in test scores against which the scale is referenced.
18. Similarly, it is recommended that scores on the separate proficiency tests
be referenced against (and calibrated in terms of) the scores of
monolingual children who speak the language of the test. (It is realized
that this may not be possible to attain in reference to some very small
populations where the minority language is on the wane.)
DISCUSSION QUESTIONS
1. How are curricular decisions regarding the delivery of instructional
materials made in your school(s)? Schools that you know of?
2. If you work in or know of a bilingual program, what steps are taken in
that program to assess language proficiency? How are the scores
interpreted in relation to the curriculum? If you knew that a substantial
number of the children in your school were approximately equally
proficient in two languages, what curricular decisions would you
recommend? What else would you need to know in order to recommend
policies?
3. If you were asked to rank priorities for budgetary expenditures, where
would language testing come on the list for tests and measurements? Or,
would there be any such list?
4. What is wrong with 'surname surveys' as a basis for determining
language dominance? What can be said about a child whose name is
Ortiz? Smith? Reitzel? What about asking the child concerning his
language preferences? What are some of the factors that might influence
the child's response? Why not just have the teachers in the schools judge
the proficiency of the children?
5. What price would you say would be a fair one for being able to
communicate in one of the world's power languages (especially English)?
Consider the case for the child in the African primary schools as
suggested by Hofman. Is it worth the cost?
6. Try to conceive of a language test that need not make reference to group
norms. How would such a test relate to educational policy?
7. Consider doing a study of possible language variety bias in the tests used
in your school. Or perhaps a language bias study, or a combination of the
two, would be more appropriate. What kinds of scores would be available
for the study? IQ? Aptitude? Achievement? Classroom observations?
What sorts of language proficiency measures might you use? (Anyone
seriously considering such a study is urged to consult Part Three, and to
ask the advice of someone who knows something about research design
before actually undertaking the project. It could be done, however, by
any teacher or administrator capable of persuading others that it is
worth doing.) Are language variety biases in educational testing different
from language biases?
8. Begin to construct a list of the sociocultural factors that might be
partially accountable for the widely discussed view that has been put
forth by Jensen and Herrnstein to the effect that certain races are
superior in intelligence. What is intelligence? How is it measured with
reference to your school or school populations that you know of?
9. Spend a few days as an observer of children in a classroom setting. Note
ways in which the teacher and significant others in the school
communicate emotive information to the children. Look especially for
"10.
11.
12.
13.
--"i4.
15.
16.
103
contrasts in what is said and what is meant from the child's perspective.
Consider the school curriculum with a view to its representation of
different cultures and language varieties. Observe the behaviour of the
children. What kinds of messages do they pick up and pass on? What
kinds of beliefs and attitudes are they being encouraged to accept or
reject?
10. Discuss the differences between 'submersion' and 'immersion' as
educational approaches to the problem of teaching a child a new
language and cultural system. (See Barik and Swain, 1975, in Suggested
Readings at the end of this chapter.)
11. Consider an example, or several, of children who are especially
low achievers in your school or in a school that you know of. Look for
sociocultural factors that may be related to the low achievement. Then,
consider some of the high achievers in the same context. Consider the
probable effects of sociocultural contexts. Do you see a pattern
emerging?
12. To what extent is 'cultural bias' in tests identical to 'experiential bias',
i.e., simply not having been exposed to some possible set of experiences?
Can you find genuine cultural factors that are distinct from experiential
biases? Examine a widely used standardized test. If possible discuss it
with someone whose cultural experience is very different from your own.
Are there items in it that are clearly biased? Culturally? Or merely
experientially?
13. Discuss factors influencing children who learn to read before they come
to school. Consider those factors in relation to what you know of
children who fail to learn to read after they come to school. Where does
language development fit into the picture? Books in the home? Models
who are readers that the child might emulate? Do children of widely
different dialect origins in the United States (or elsewhere) learn to read
much the same material? Consider Great Britain, Ireland, Australia,
Canada, and other nations.
14. Consider the fact that test is a four letter word (Upshur, 1969b). Why?
How have tests been misused or simply used to make them seem so
ominous, portentous, even wicked? Reconsider the definition of
language tests offered in Chapter 1. Can you think of any innocuous and
benign procedures that qualify as tests? How could such procedures be
used to reduce the threat of tests?
15. Try translating a few discrete items on several types of tests. Have
someone else who has not seen the original items do a 'blind back
translation'. Check the comparability of the results. Try the same with a
passage of prose. Can corrections be made to clear up the difficulties that
arise? Are there substantial contrasts between the two tasks?
16. Try a language interview procedure that asks children how well they
understand one or more languages and when they use them. A procedure
such as the one suggested for Spanish-English bilinguals by Spolsky, et
al (1972, see Suggested Readings for Chapter 4) should suffice. Then test
the children on a battery of other measures - teacher evaluations of the
type suggested by Spolsky, et al (1972) for Navajo-English bilinguals
might also be used. Intercorrelate the scores and determine empirically
their degree of variance overlap. To what extent can the various
procedures be said to yield the same information? (A computational
sketch of this last step follows question 18 below.)
17. What kinds of tests can you conceive of to assess the popular opinion
that language dominance varies from domain to domain? Be careful to
define 'domain' in a sociolinguistically meaningful and pragmatically
useful way.
18. Devise proficiency tests in two or more languages and attempt to
calibrate them in the recommended ways - both against the scores of
monolingual reference groups (if these are accessible) and in terms of the
average standard error of the differences on the two tests. Relate them to
a five point dominance scale such as the one shown in Figure 3 above.
Correlate scores on the proficiency tests that you have devised with other
standardized tests used in the school from which your sample population
was drawn. To what extent are variances on other tests accounted for by
variances in language proficiencies (especially in English, assuming that
it is the language of practically all standardized testing in the schools)?
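The empirical step called for in questions 16 and 18 - intercorrelating scores and expressing their variance overlap - can be sketched in a few lines of Python. The score lists below are invented for illustration, and variance overlap is taken, as elsewhere in this book, to be the square of the correlation:

    import math

    def pearson(x, y):
        # product-moment correlation between two paired lists of scores
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    # Hypothetical paired measures for ten children: self-reported language
    # use and a teacher evaluation, each on a five point scale
    self_report = [3, 4, 2, 5, 4, 1, 3, 5, 2, 4]
    teacher_eval = [2, 4, 3, 5, 3, 2, 3, 4, 2, 5]

    r = pearson(self_report, teacher_eval)
    print('r = %.2f, variance overlap = %.0f%%' % (r, 100 * r * r))

The higher the variance overlap, the more nearly the two procedures can be said to yield the same information.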
SUGGESTED READINGS
1. H. C. Barik and Merrill Swain, 'Three-year Evaluation of a Large Scale
Early Grade French Immersion Program: the Ottawa study,' Language
Learning 25, 1975, 1-30.
2. Andrew Cohen, 'The Sociolinguistic Assessment of Speaking Skills in a
Bilingual Education Program,' in L. Palmer and B. Spolsky (eds.) Papers
on Language Testing. Washington, D.C.: TESOL, 1975, 173-186.
3. Paul A. Kolers, 'Bilingualism and Information Processing,' Scientific
American 218, 1968, 78-84.
4. 'OCR Sets Guidelines for Fulfilling Lau Decision,' The Linguistic
Reporter 18, 1975, 1, 5-7. (Gives addresses of Lau Centers and quotes the
text of the 'Lau Remedies' recommended by the Task Force appointed
by the Office of Civil Rights.)
5. Patricia J. Nakano, 'Educational Implications of the Lau v. Nichols
Decision,' in M. Burt, H. Dulay, and M. Finocchiaro (eds.) Viewpoints
on English as a Second Language. New York: Regents, 1977, 219-234.
6. Bernard Spolsky, 'Speech Communities and Schools,' TESOL Quarterly
8, 1974, 17-26.
7. Bernard Spolsky, Penny Murphy, Wayne Holm, and Allen Ferrel,
'Three Functional Tests of Oral Proficiency,' TESOL Quarterly 6, 1972,
221-235. (Also in Palmer and Spolsky, 75-90, see reference 2 above.
Page references in this text are to the latter source.)
8. Perry A. Zirkel, 'A Method for Determining and Depicting Language
Dominance,' TESOL Quarterly 8, 1974, 7-16.
9. Perry A. Zirkel, 'The Why's and Ways of Testing Bilinguality Before
Teaching Bilingually,' The Elementary School Journal, March 1976,
323-330.
5
Measuring Attitudes and Motivations
A. The need for validating affective
measures
B. Hypothesized relationships between
affective variables and the use and
learning of language
C. Direct and indirect measures of
affect
D. Observed relationships to
achievement and remaining puzzles
A great deal of research has been done on the topic of measuring the
affective side of human experience. Personality, attitudes, emotions,
feelings, and motivations, however, are subjective things and even
our own experience tells us that they are as changeable as the wind.
The question here is whether or not they can be measured. Further,
what is the relationship between existing measures aimed at affective
variables and measures aimed at language skill or other educational
constructs?
A. The need for validating affective measures
No one seems to doubt that attitudinal factors are related to human
performances.¹ In the preceding chapter we considered the
hypothesis that emotive or affective factors play a greater part in
determining success or failure in schools than do factive or cognitive
factors (particularly teaching methodologies). The problem is how to
determine what the affective factors might be. It is widely believed
¹ Throughout this chapter and elsewhere in the book, we often use the term 'attitudes'
as a cover term for all sorts of affective variables. While there are many theories that
distinguish between many sorts of attitudinal, motivational and personality variables,
all of them are in the same boat when it comes to validation.
that a child's self-concept (confidence or lack of it, willingness to take
social risks, and all around sense of well-being) must contribute to
virtually every sort of school performance - or performance outside
of the school for that matter. Similarly, it is believed that the child's
view of others (whether of the child's own ethnic and social
background, or of a different cultural background, whether peers or
non-peers) will influence virtually every aspect of his interpersonal
interactions in positive and negative ways that contribute to success
or failure (or perhaps just to happiness in life, which though a vaguer
concept, may be a better one).
It is not difficult to believe that attitude variables are important to a
wide range of cognitive phenomena - perhaps the whole range - but it
is difficult to say just exactly what attitude variables are. Therefore, it
is difficult to prove by the usual empirical methods of science that
attitudes actually have the importance usually attributed to them. In
his widely read and often cited book, Beyond Freedom and Dignity,
B. F. Skinner (1971) offers the undisguised thesis that such concepts
as 'freedom' and 'dignity' not to mention 'anxiety', 'ego', and the
kinds of terms popular in the attitude literature are merely loose and
misleading ways of speaking about the kinds of events that control
behavior. He advocates sharpening up our thinking and our ways of
controlling behavior in order to save the world - 'not ... to destroy
[the] environment or escape from it ... [but] to redesign it' (p. 39). To
Skinner, attitudes are dispensable intervening variables between
behavior and the consequences of behavior. They can thus be done
away with. If he were quite correct, it ought to be possible to observe
behaviors and their consequences astutely enough to explain all there
is to explain about human beings - however, there is a problem for
Skinner's approach. Even simpler systems than human beings are not
fully describable in that way - e.g., try observing the input and output
of a simple desk calculator and see if it is possible to determine how it
works inside - then recall that human beings are much more complex
than desk calculators.
Only very radical and very narrow (and therefore largely
uninteresting and very weak) theories are able to dispense completely
with attitudes, feelings, personalities, and other difficult-to-measure
internal states and motives of human beings. It seems necessary,
therefore, to take attitudes into account. The problem begins,
however, as soon as we try to be very explicit about what an attitude
is. Shaw and Wright (1967) say that 'an attitude is a hypothetical, or
latent, variable' (p. 15). That is to say, an attitude is not the sort of
variable that can be observed directly. If someone reports that he is
angry, we must either take his word for it, or test his statement on the
basis of what we see him doing to determine whether or not he is
really angry. In attitude research, the chain of inference is often much
longer than just the inference from a report to an attitude or from a
behavioral pattern to an attitude. The quality or quantity of the
attitude can only be inferred from some other variable that can be
measured.
For instance, a respondent is frequently asked to evaluate a
statement about some situation (or possibly a mere proposition
about a very general state of affairs). Sometimes he is asked to say
how he would act or feel in a given described situation. In so-called
'projective' techniques, it is further necessary for some judge or
evaluator to rate the response of the subject for the degree to which it
displays some attitude. In some of these techniques there are so many
steps of inference where error might arise that it is amazing that such
techniques ever produce usable data, but apparently they sometimes
do. Occasionally, we may be wrong in judging whether someone is
happy or sad, angry or glad, anxious or calm, but often we are correct
in our judgements, and a trained observer may (not necessarily will)
become very skilled in making such judgements.
Shaw and Wright (1967) suggest that 'attitude measurement
consists of the assessment of an individual's responses to a set of
situations. The set of situations is usually a set of statements (items)
about the attitude object, to which the individual responds with a set
of specific categories, e.g., agree and disagree .... The ... number
derived from his scores represents his position on the latent attitude
variable' (p. 15). Or, at least, that is what the researcher hopes and
often it is what the researcher merely asserts. For example, Gardner
and Lambert (1972) assert that the degree of a person's agreement
with the statement that 'Nowadays more and more people are prying
into matters that should remain personal and private' is a reflection of
their degree of 'anti-democratic ideology' (p. 150). The statement
appears in a scale supposed by its authors (Adorno, Frenkel-Brunswik,
Levinson, and Sanford, 1950) to measure 'prejudice and
authoritarianism'. The scale was used by Gardner and Lambert in the
1960s in the states of Maine, Louisiana, and Connecticut. Who can
deny that the statement was substantially true and was becoming
more true as the Nixon regime grew and flourished?
The trouble with most attitude measures is that they have never
been subjected to the kind of critical scrutiny that should be applied
to any test that is used to make judgements (or to refute judgements)
about human beings. In spite of the vast amounts of research
completed in the last three or four decades on the topic of attitudes,
personality, and measures of related variables, precious little has
been learned that is not subject to severe logical and empirical doubts.
Shaw and Wright (1967), who report hundreds of attitude measures
along with reliability statistics and validity results, where available,
lament their 'impression that much effort has been wasted ....' and
that 'nowhere is this more evident than in relation to the instruments
used in the measurement of attitudes' (p. ix).
They are not alone in their disparaging assessment of attitude
measures. In his preface to a tome of over a thousand pages on
Personality: Tests and Reviews, Oscar K. Buros (1970) says:
Paradoxically, the area of testing which has outstripped all
others in the quantity of research over the past thirty years is
also the area in which our testing procedures have the generally
least accepted validity .... In my own case, the preparation of
this volume has caused me to become increasingly discouraged
at the snail's pace at which we are advancing compared to the
tremendous progress being made in the areas of medicine,
science, and technology. As a profession, we are prolific
researchers; but somehow or other there is very little agreement
about what is the resulting verifiable knowledge (p. xix).
In an article on a different topic and addressed to an entirely
different audience, John R. Platt offered some comments that may
help to explain the relative lack of progress in social psychology and
in certain aspects of the measurement of sociolinguistic variables. He
argued that a thirty year failure to agree is proof of a thirty year
failure to do the kind of research that produces the astounding and
remarkable advances of fields like 'molecular biology and high-energy
physics' (1964, p. 350). The fact is that the social sciences
generally are among the 'areas of science that are sick by comparison
because they have forgotten the necessity for alternative hypotheses
and disproof' (p. 350). He was speaking of his own field, chemistry,
when he coined the terms 'The Frozen Method, The Eternal
Surveyor, The Never Finished, The Great Man with a Single
Hypothesis, The Little Club of Dependents, The Vendetta, The All
Encompassing Theory which Can Never Be Falsified' (p. 350), but do
these terms not have a certain ring of familiarity with reference to the
social sciences? What is the solution? He proposes a return to the
old-fashioned method of inductive inference - with a couple of
embellishments.
It is necessary to form multiple working hypotheses instead of
merely popularizing the ones that we happen to favor, and instead of
trying merely to support or, worse yet, to prove (which strictly
speaking is a logical impossibility for interesting empirical
hypotheses), we should be busy eliminating the plausible alternatives -
alternatives which are rarely addressed in the social sciences. As
Francis Bacon stressed so many years ago, and Platt re-emphasizes,
science advances only by disproofs.
Is it the failure to disprove that explains the lack of agreement in
the social sciences? Is there a test that purports to be a measure of a
certain construct? What else might it be a measure of? What other
alternatives are there that need to be ruled out? Such questions have
usually been neglected. Has a researcher found a significant
difference between two groups of subjects? What plausible
explanations for the difference have not been ruled out? In addition
to forming multiple working hypotheses (which will help to keep our
thinking impartial as well as clear), it is necessary always to recycle
the familiar steps of the Baconian method: (1) form clear hypotheses;
(2) design crucial experiments to eliminate as many as possible; (3)
carry out the experiments; and (4) 'refine the possibilities that remain'
(Platt, 1964, p. 347) and do it all over again, and again, and again,
always eliminating some of the plausible alternatives and refining the
remaining ones. By such a method, the researcher enhances the
chances of formulating a more powerful explanatory theory on each
cycle from data-to-theory to data-to-theory with maximum
efficiency. Platt asks if there is any virtue in plodding almost aimlessly
through thirty years of work that might be accomplished in two or
three months with a little forethought and planning.
In the sequel to the first volume on personality tests, Buros offers
the following stronger statement in 1974 (Personality Tests and
Reviews II) :
It is my considered belief that most standardized tests are
poorly constructed, of questionable or unknown validity,
pretentious in their claims, and likely to be misused more often
than not (p. xvii).
In another compendium, one that reviews some 3,000 sources of
psychological tests, E. Lowell Kelly writes in the Foreword (see
Chun, Cobb, and French, 1975):
At first blush, the development of such a large number of
assessment devices in the past 50 years would appear to reflect
remarkable progress in the development of the social sciences.
Unfortunately, this conclusion is not justified, ... nearly three
out of four of the instruments are used but once and often only
by the developer of the instrument. Worse still, ... the more
popular instruments tend to be selected more frequently not
because they are better measuring instruments, but primarily
because of their convenience (p. v).
It is one thing to say that a particular attitude questionnaire, or
rating procedure of whatever sort, measures a specific attitude or
personality characteristic (e.g., authoritarianism, prejudice, anxiety,
ego strength/weakness, or the like) but it is an entirely different
matter to prove that this is so. Indeed, it is often difficult to conceive of
any test whatsoever that would constitute an adequate criterion for
'degree of authoritarianism' or 'degree of empathy', etc. Often the
only validity information offered by the designer or user of a
particular procedure for assessing some hypothetical (at best latent)
attitudinal variable is the label that he associates with the procedure.
Kelly (1975) classes this kind of validity, with several other types, as
what he calls 'pseudo-validity'. He refers to this kind of validity as
'nominal validity' - it consists in 'the assumption that the instrument
measures what its name implies'. It is related closely to 'validity by
fiat' - 'assertion by the author (no matter how distinguished!) that the
instrument measures variable X, Y, or Z' (p. vii, from the Foreword
to Chun, et al, 1975). One has visions of a king laying the sword blade
on either shoulder of the knight-to-be and declaring to all ears, 'I dub
thee Knight.' Hence the test author gives 'double' (or 'dubious', if you
prefer) validity to his testing instrument - first by authorship and then
by name.
Is there no way out of this difficulty? Is it not possible to require
more of attitude measures than the pseudo-validities which
characterize so many of them? In a classic paper on 'construct
validity' Cronbach and Meehl (1955) addressed this important
question. If we treat attitudes as theoretical constructs and measures
of attitudes (or measures that purport to be measures of attitudes) as
tests, then the tests are subject to the same sorts of empirical and
theoretical justification that are applied to any construct in scientific
study. Cronbach and Meehl say, 'a construct is some postulated
attribute of people assumed to be reflected in performance' (p. 283).
They continue, 'persons who possess this attribute will, in situation
X, act in manner Y (with a stated probability)' (p. 284).
Among the techniques for assessing the validity of a test of a
postulated construct are the following: select a group whose behavior
is well known (or can be determined) and test the hypothesis that the
behavior is related to an attitude (e.g., the classic example used by
Thurstone and others was church-goers versus non-church-goers).
Or, if the attitude or belief of a particular group is known, some
behavioral criterion can be predicted and the hypothesized outcome
can be tested by the usual method.
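A minimal sketch of the first of these techniques, the 'known groups' check, may help. The scores below are invented, and the use of a simple two-sample t statistic is an assumption of this illustration rather than a reconstruction of Thurstone's actual procedure:

    import math

    def two_sample_t(g1, g2):
        # Welch-type t statistic: is the mean of g1 reliably above that of g2?
        n1, n2 = len(g1), len(g2)
        m1, m2 = sum(g1) / n1, sum(g2) / n2
        v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)
        v2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)
        return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

    # Hypothetical attitude-toward-church scores for the two known groups
    church_goers = [8, 9, 7, 8, 9, 6, 8]
    non_church_goers = [4, 5, 6, 3, 5, 4, 6]

    t = two_sample_t(church_goers, non_church_goers)
    print('t = %.2f' % t)  # a large positive t is consistent with construct validity

Failure to find the predicted difference would count against either the measure or the construct - and, as the next paragraphs make clear, the data alone cannot say which.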
These techniques are rather rough and ready, and can be improved
upon generally (in fact they must be improved upon if attitude
measures are to be more finely calibrated) by devising alternative
measures of the attitude or construct and assessing the degree (and
pattern) of variance overlap on the various measures by correlation
and related procedures (e.g., factor analysis and multiple regression
techniques). For instance, as Cronbach and Meehl (1955) observe, 'if
two tests are presumed to measure the same construct, a correlation
between them is predicted ... If the obtained correlation departs from
the expectation, however, there is no way to know whether the fault
lies in test A, test B, or the formulation of the construct. A matrix of
intercorrelations often points out profitable ways of dividing the
construct into more meaningful parts, factor analysis being a useful
computational method in such studies' (p. 287).
Following this latter procedure, scales which are supposed to
assess the same construct can be correlated to determine whether in
fact they share some variance. The extent of the correlation, or of
their tendency to correlate with a mathematically defined factor (or to
'load' on such a factor), may be taken as a kind of index of the validity
of the measure. Actually, the latter techniques are related to the
general internal consistency of measures that purport to measure the
same thing. To prove a high degree of internal consistency between a
variety of measures of, say, 'prejudice' is not to prove that the
measures are indeed assessing degree of prejudice, but if anyone of
them on independent grounds can be shown to be a measure of
prejudice then confidence in all of the measures is thereby
strengthened. (For several applications of rudimentary factor
analysis, see the Appendix.)
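A rudimentary numerical illustration of this intercorrelation-and-factoring strategy follows. The three 'prejudice' scales and the scores are invented, and the single-factor extraction below is only a principal-component approximation of the factor analytic methods the text alludes to:

    import numpy as np

    # Hypothetical scores of eight subjects on three scales all purporting
    # to measure 'prejudice'
    scores = np.array([
        [12, 14, 11],
        [18, 17, 16],
        [ 9, 10,  8],
        [15, 13, 14],
        [20, 19, 18],
        [11, 12, 10],
        [16, 15, 17],
        [10,  9, 11],
    ], dtype=float)

    R = np.corrcoef(scores, rowvar=False)      # 3 x 3 intercorrelation matrix
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    # Loadings on the first 'factor' (the overall sign is arbitrary)
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])

    print(np.round(R, 2))
    print(np.round(loadings, 2))

High and roughly equal loadings would show only that the three scales share variance; whether that shared variance is prejudice still has to be argued on independent grounds, exactly as the text insists.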
Ultimately, there may be no behavioral criterion which can be
proposed as a basis for validating a given attitude measure. At best
there may be only a range of criteria, which taken as a whole, cause us
to have confidence in the measurement technique (for most presently
used measures, however, there is no reason at all to have confidence in
them). As Cronbach and Meehl (1955) put it, 'personality tests, and
some tests of ability, are interpreted in terms of attributes for which
there is no criterion' (p. 299). In such cases, it is principally through
statistical and empirical methods of assessing the internal consistency
of such devices that their construct validity is to be judged. They give
the example of the measurement of temperature. One criterion we
might impose on any measuring device is that it would show higher
temperature for water when it is boiling than when it is frozen solid,
or when it feels hot to the touch rather than cold to the touch - but the
criterion can be proved to be more crude than the degrees of
temperature a variety of thermometers are capable of displaying.
Moreover, the subjective judgement of temperature is more subject to
error than are measurements derived directly from the measuring
techniques.
The trouble with attitudes is that they are so out of reach, and at
the same time they are apparently subject to a kind of fluidity that
allows them to change (or perhaps be created on the spot) in response
to different social situations. Typically, it is the effects of attitudes
that we are interested in rather than the attitudes per se; or
alternatively, it is the social situations that give rise to both the
attitudes and their effects which are the objects of interest. Recently,
there has been a surge of interest in the topic of how attitudes affect
language use and language behavior. We turn now to that topic and
return below to the kinds of measures of attitudes that have been
widely used in the testing of certain research hypotheses related to it.
B. Hypothesized relationships between affective variables and the use
and learning of language
At first look, the relationship of attitudes to the use and learning of
language may appear to be a small part of the problem of attitudes in
general - what about attitude toward self, significant others,
situations, groups, etc.? But a little reflection will show that a very
wide range of variables is encompassed by the topic of this subsection.
Moreover, they are variables that have received much special
attention in recent years. In addition to concern for attitudes and the
way they influence a child's learning and use of his own native
language (or possibly his native languages in the event that he grows
up with more than one at his command), there has been much
concern for the possible effects that attitudes may have on the
learning of a second or third language and the way such attitudes
affect the choice of a particular linguistic code (language or style) in a
particular context. Some of the concern is generated by actual
research results - for instance, results showing a significant
relationship between certain attitude measures or motivation indices
and attainment in a second language. Probably most of it, however, is
generated by certain appealing arguments that often accompany
meager or even contradictory research findings - or no findings at all.
A widely cited and popular position is that of Guiora and
his collaborators (see Guiora, Paluszny, Beit-Hallahmi, Catford,
Cooley, and Dull, 1975). The argument is an elaborate one and has
many interesting ramifications and implications, but it can be
capsulized by a few selected quotations from the cited article. Crucial
to the argument are the notions of empathy (being able to put yourself
in someone else's shoes - apropos to Shakespeare's claim that 'a
friend is another self'), and language ego (that aspect of self awareness
related to the fact that 'I' sound a certain way when 'I' speak and that
this is part of 'my' identity):
We hypothesized that this ability to shed our native pronunciation
habits and temporarily adopt a different pronunciation is closely
related to empathic capacity (p. 49).
One wonders whether having an empathic spirit is a necessary
criterion for acquiring a native-like pronunciation in another
language? A sufficient one? But the hypothesis has a certain appeal as
we read more.
With pronunciation viewed as the core of language ego, and as
the most critical contribution of language ego to self-representation,
we see that the early flexibility of ego
boundaries is reflected in the ease of assimilating native-like
pronunciation by young children; the later reduced flexibility is
reflected in the reduction of this ability in adults (p. 46).
Apart from some superfluity and circularity (or perhaps because of it),
so far so good. They continue,
At this point we can link empathy and pronunciation of a
second language. As conceived here, both require a temporary
relaxation of ego boundaries and thus a temporary modification
of self-representation. Although psychology traditionally regards
language performance as a cognitive-intellectual skill, we are
concerned here with that particular aspect of language behavior
that is most critically tied to self-representation (p. 46).
But the most persuasive part of the argument comes two pages later:
. .. superimposed upon the speech sounds of the words one
chooses to utter are sounds which give the listener information
about the speaker's identity. The listener can decide whether
one is sincere or insincere. Ridicule the way I sound, my dialect,
or my attempts at pronouncing French and you will have
ridiculed me.
(One might, however, be regarded as oversensitive if one spoke very
bad French, mightn't one?)
Ask me to change the way I sound and you ask me to change
myself. To speak a second language authentically is to take on a
new identity. As with empathy, it is to step into a new and
perhaps unfamiliar pair of shoes (p. 48).
What about the empirical studies designed to test the relationship
between degree of empathy and acquisition of unfamiliar phonological
systems? What does the evidence show? The empirical data are
reviewed by Schumann (1975), who, though he is clearly sympathetic
with the general thesis, finds the empirical evidence for it excessively
weak. The first problem is in the method for measuring 'empathy' by
the Micro Momentary Expression test - a procedure that consists
of watching silent psychiatric interviews and pushing a button
every time a change is noticed in the expression on the face of the
person being interviewed by the psychiatrist (presumably the film
does not focus on the psychiatrist, but Guiora, et al, are not clear on
this in their description of the so-called measure of empathy).
Reasonable questions might include, is this purported measure of
empathy correlated with other purported measures of the same
construct? Is the technique reliable? Are the results similar on
different occasions under similar circumstances? Can trained
'empathizers' do the task better than untrained persons? Is there any
meaningful discrimination on the MME between persons who are
judged on other grounds to be highly empathetic and persons who are
judged to be less so? Apparently, the test, as a measure of empathy,
must appeal to 'nominal validity'.
A second problem with the empirical research on the Guiora, et al,
hypothesis is the measure of pronunciation accuracy, the Standard
Thai Procedure:
The STP consists of a master tape recording of 34 test items
(ranging in length from 1 to 4 syllables) separated by a 4 second
pause. The voicer is a female Thai native speaker. ... Total
testing time is 4½ minutes ... (p. 49).
The scoring procedure is currently under revision. The basic
evaluation method involves rating tone, vowel and consonant
quality for selected phonetic units on a scale of 1 (poor), 2 (fair),
or 3 (native-like). Data tapes are rated independently by three
native Thai speakers, trained in pronunciation evaluation. A
distinct advantage of the STP is that it can be used with naive
subjects. It bypasses the necessity of first teaching subjects a
second language (p. 50).
A distinct advantage to test people who have not learned the
language? Presumably the test ought to discriminate between more
and less native-like pronouncers of a language that the subjects have
already learned. Does it? No data are offered. Obviously, the STP
would not be a very direct test of, say, the ability to pronounce
Spanish utterances with a native-like accent - or would it? No data
are given. Does the test discriminate between persons who are judged
to speak 'seven languages in Russian' (said with a thick Russian
accent) and persons who are judged by respective native speaker
groups to speak two or more languages so as to pass themselves for a
native speaker of each of the several languages? Reliability and
validity studies of the test are conspicuously absent.
The third problem with the Guiora, et al, hypothesis is that the
empirical results that have been reported are either only weakly
interpretable as supporting their hypothesis, or they are not
interpretable at all. (Indeed, it makes little sense to try to interpret the
meaning of a correlation between two measures about which it
cannot be said with any confidence what they are measures of.) Their
prediction is that empathetic persons will do better on the STP than
less empathetic persons.
Two empirical studies attempted to establish the connection
between empathy and attained skill in pronunciation - both are
discussed more fully by Schumann (1975) than by Guiora, et al
(1975). The first study with Japanese learners failed to show any
relationship between attained pronunciation skill and the scores on
the MME (the STP was not used). The second study, a more extensive
one, found significant correlations between rated pronunciation in
Spanish (for 109 subjects at the Defense Language Institute in a three
month intensive language course), Russian (for 201 subjects at DLI),
Japanese (13 subjects at DLI), Thai (40 subjects), and Mandarin (38).
There was a difficulty, however. The correlations were positive for
the first three languages and negative for the last two. This would
seem to indicate that if the MME measures empathy, for some groups
it is positively associated with the acquisition of pronunciation skills
in another language, and for other groups, it is negatively associated
with the same process. What can be concluded from such data? Very
want to acquire another (see Lambert, 1955, and Gardner and
Lambert, 1959). A similar claim was offered for ethnocentrism
although it was expected to be negatively correlated with attainment
in the target language. Neither of these hypotheses (if they can be
termed hypotheses) has proved to be any more susceptible to
empirical test than the one about types of motivation. The arguments
in each case have a lot of appeal, but the hypotheses themselves do
not seem to be borne out by the data - the results are either unclear or
contradictory. We will return to consider some of the specific
measuring instruments used by Gardner and Lambert in section C.
Yet another research tradition in attitude measurement is that of
Joshua Fishman and his co-workers. Perhaps his most important
argument is that there must be a thoroughgoing investigation of the
factors that determine who speaks what language (or variety) to
whom and under what conditions.
In a recent article (Cooper and Fishman, 1975), twelve issues of
interest to researchers in the area of 'language attitude' are discussed.
The authors define language attitudes either narrowly, as related to
how people think people ought to talk (following Ferguson, 1972), or
broadly, as 'those attitudes which influence language behavior and
behavior toward language' (pp. 188-9). The first definition they
consider to be too narrow and the second 'too broad' (p. 189).
Among the twelve questions are: (1) Does attitude have a
characteristic structure? ... ; (2) To what extent can language
attitudes be componentialized? ... ; (3) Do people have characteristic
beliefs about their language? (e.g., that it is well suited for logic, or
poetry, or some special use); (4) Are language attitudes really
attitudes toward people who speak the language in question, or
toward the language itself? Or some of each? (5) What is the effect of
context on expressed stereotypes? Is one language considered
inappropriate in some social settings but appropriate in others? (6)
Where do language attitudes come from? Stereotypes that are
irrational? Actual experiences? (7) What effects do they have?
('Typically, only modest relationships are found between attitude
measures and the overt behaviors which such scores are designed to
predict' p. 191); (8) Is one language apt to be more effective than
another for persuading bilinguals under different circumstances? (9)
'What relationships exist among attitude scores obtained from
different classes of attitudinal measurements (physiological or
psychological reaction, situational behavior, verbal report)?' (p.
191); (10) How about breaking down the measures into ones that
assess reactions 'toward an object, toward a situation, and toward the
object in that situation' (p. 191)? What would the component
structure be like then? (11) 'Do indirect measures of attitude (i.e.,
measures whose purpose is not apparent to the respondent) have
higher validities than direct measures'? ... (p. 192); (12) 'How
reliable are indirect measures ... ?' (p. 192).
It would seem that attitude studies generally share certain strengths
and weaknesses. Among the strengths is the intuitive appeal of the
argument that people's feelings about themselves (including the way
they talk) and about others (and the way they talk) ought to be related
to their use and learning of any language(s). Among the weaknesses is
the general lack of empirical vulnerability of most of the theoretical
claims that are made. They stand or fall on the merits of their intuitive
appeal and quite independently of any experimental data that may be
accumulated. They are subject to wide differences of interpretation,
and there is a general (nearly complete) lack of evidence on the
validity of purported measures of attitudinal constructs. What data
are available are often contradictory or uninterpretable - yet the
favored hypothesis still survives. In brief, most attitude studies are not
empirical studies at all. They are mere attempts to support favored
'working' hypotheses - the hypotheses will go right on working even
if they turn out to predict the wrong (or worse yet, all possible)
experimental outcomes.
Part of the problem with measures of attitudes is that they require
subjects to be honest in the 'throat-cutting' way - to give information
about themselves which is potentially damaging (consider Upshur's
biting criticism of such a procedure, cited above on p. 98).
Furthermore, they require subjects to answer questions or react to
statements that sometimes place them in the double-bind of being
damned any way they turn - the condemnation which intelligent
people can easily anticipate may be light or heavy, but why should
subjects be expected to give meaningful (indeed, non-pathological)
responses to such items? We consider some of the measurement
techniques that are of the double-bind type in the next section. As
Watzlawick, Beavin, and Jackson (1967) reason, such double-bind
situations are apt to generate pathological responses ('crazy' and
irrational behaviors) in people who by all possible non-clinical
standards are 'normal'.
And there is a further problem which even if the others could be
resolved remains to be dealt with - it promises to be knottier than all
of the rest. Whose value system shall be selected as the proper
criterion for the valuation of scales or the interpretation of
responses? By whose standards will the questions be interpreted?
This is the essential validity problem of attitude inventories,
personality measures, and affective valuations of all sorts. It knocks
at the very doors of the unsolved riddles of human existence. Is there
meaning in life? By whose vision shall it be declared? Is there an
absolute truth? Is there a God? Will I be called to give an accounting
for what I do? Is life on this side of the grave all there is? What shall I
do with this man Jesus? What shall I do with the moral precepts of my
own culture? Yours? Someone else's? Are all solutions to the riddles
of equivalent value? Is none of any value? Who shall we have as our
arbiter? Skinner? Hitler? Kissinger?
Shaw and Wright (1967) suggest that 'the only inferential step' in
the usual techniques for the measurement of attitudes 'is the
assumption that the evaluations of the persons involved in scale
construction correspond to those of the individuals whose attitudes
are being measured' (p. 13). One does not have to be a logician to
know that the inferential step Shaw and Wright are describing is
clearly not the only one involved in the measurement of attitudes. In
fact, if that were the only step involved it would be entirely pointless
to waste time trying to devise measurement instruments - just ask the
test constructors to divulge their own attitudes at the outset. Why
bother with the step of inference?
There are other steps that involve inferential leaps of substantial
magnitude, but the one they describe as the 'only' one is no doubt the
crucial one. There is a pretense here that the value judgements
concerning what is a prejudice or what is not a prejudice, or what is
anxiety or what is not anxiety, what is aggressiveness or acquiescence,
what is strength and what is weakness, etc. ad infinitum, can be
acceptably and impartially determined by some group consensus.
What group will the almighty academic community pick? Or will the
choice of a value system for affective measurement in the schools be
made by political leaders? By Marxists? Christians? Jews? Blacks?
Whites? Chicanos? Navajos? Theists? Atheists? Upper class?
Middle class? Humanists? Bigots? Existentialists? Anthropologists?
Sexists? Racists? Militarists? Intellectuals?
The trouble is the same one that Plato discussed in his Republic - it
is not a question of how to interpret a single statement on a
questionnaire (though this is a question of importance for each such
statement), it is a question of how to decide cases of disagreement.
Voting is one proposal, but if the minority (or a great plurality of
minorities, down to the level of individuals) get voted down, shall
their values be repressed in the institutionalization of attitude
measures?
C. Direct and indirect measures of affect
So many different kinds of techniques have been developed for the
purpose of trying to get people to reveal their beliefs and feelings that
it would be impossible to be sure that all of the types had been covered
in a single review. Therefore, the intent of this section is to discuss the
most widely used types of attitude scales and other measurement
techniques and to try to draw some conclusions concerning their
empirical validities - particularly the measures that have been used in
conjunction with language proficiency and the sorts of hypotheses
considered in section B above.
Traditionally, a distinction is made between 'direct' and 'indirect'
attitude measures. Actually we have already seen that there is no
direct way of measuring attitudes, nor can there ever be. This is not so
much a problem with measuring techniques per se as it is a problem of
the nature of attitudes themselves. There can be no direct measure of
a construct that is evidenced only indirectly, and subjectively even
then.
As qualities of human experience, emotions, attitudes, and values
are notoriously ambiguous in their very expression. Joy or sadness
may be evident by tears. A betrayal or an unfeigned love may be
demonstrated with a kiss. Disgust or rejoicing may be revealed by
laughter. Approval or disapproval by a smile. Physiological measures
might offer a way out, but they would have to be checked against
subjective judgements. An increase in heart rate might be caused by
fear, anger, surprise, etc. But even if a particular constellation of
glandular secretions, palmar sweating, galvanic skin response, and
other physiological responses were thought to indicate some
emotional state, presumably the test would have to be validated
against subjective judgement by asking the patient displaying the
pertinent constellation of factors, 'Do you feel angry now?' As D. M.
MacKay (1951) noted in his insightful article 'Mindlike Behavior in
Artefacts,' we could know all about the inner workings of a neon sign
without knowing the meaning of the words that it displays. The
problem of attitudes is like that; it is distinctly a matter of
interpretation.
What measures of attitudinal variables have been used or
recommended in studies of language use and language learning? We
have mentioned a few above; we will look more closely at a number of
them below. They include the notoriously unreliable 'projective'
techniques such as the Thematic Apperception Test (see Anastasi,
1976, pp. 565-7, and 579-85) which is a Rorschach-like test
consisting of 19 deliberately vague pictures concerning which
subjects are asked to tell stories. They also use the ranking of
statements concerning how well they describe the respondent. More
often attitude scales are of the more traditional agree-disagree type
originated by Likert (1932). Recently, a variant on the Likert scale
technique has been developed by Zirkel and Jackson (1974) for use
with children from four different ethnic backgrounds. It employs line
drawings of faces supposedly ranging from very happy to very sad.
First, we will consider three of the attitude scales of the Likert-type
used by Gardner and Lambert (1972) - the F Scale, the E Scale, and
the Anomie Scale. The instructions used for each scale were similar:
The following statements are opinions which have often been
expressed by students about your age. They cover a wide range
of topics and it has been found that many people agree with
each statement and many disagree. There are no right or wrong
answers. You are asked to mark each statement in the left-hand
margin according to your agreement or disagreement as
follows:
+1: slight support, agreement      -1: slight opposition, disagreement
+2: moderate support, agreement    -2: moderate opposition, disagreement
+3: strong support, agreement      -3: strong opposition, disagreement
MODIFIED F SCALE
1. Obedience and respect for authority are the most
important virtues children should learn.
2. What youth needs most is strict discipline, rugged
determination, and the will to work and fight for family
and country.
3. Nowadays when so many different kinds of people move
around and mix together so much, a person has to protect
himself especially carefully against catching an infection or
disease from them.
4. What this country needs most, more than laws and political
programs, is a few courageous, tireless, devoted leaders in
whom the people can put their faith.
5. No weakness or difficulty can hold us back if we have
enough will power.
6. Human nature being what it is, there will always be war and
conflict.
7. A person who has bad manners, habits, and breeding can
hardly expect to get along with decent people.
8. People can be divided into two distinct classes: the weak
and the strong.
9. There is hardly anything lower than a person who does not
feel a great love, gratitude, and respect for his parents.
10. The true American way of life is disappearing so fast that
force may be necessary to preserve it.
11. Nowadays more and more people are prying into matters
that should remain personal and private.
12. If people would talk less and work more, everybody would
be better off.
13. Most people don't realize how much our lives are
controlled by plots hatched in secret places.
According to Gardner and Lambert (1972, p. 150), all of the
foregoing statements 'reflect antidemocratic ideology'. In fact, the
original authors of the scale developed items from research on 'the
potentially fascistic individual' (Adorno, et al, 1950, p. 1) which
'began with anti-Semitism in the focus of attention' (p. 2). Sources for
the items were subject protocols from 'factual short essay questions
pertaining to such topics as religion, war, ideal society, and so forth;
early results from projective questions; finally, and by far the most
important, material from the interviews and the Thematic
Apperception Tests' (p. 225). The thirteen statements given above were
selected from Forms 45 and 40 of the Adorno, et al (1950) F Scale,
consisting of some 46 items, according to Gardner and Lambert
(1972). Actually, item 10 given above was from Form 60 (an earlier
version of the F Scale).
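Scoring a Likert-type scale of this kind is mechanically simple, which is part of its appeal. Here is a hedged sketch in Python: the response vector is invented, and summing the signed ratings so that a more positive total indicates more agreement follows the usual Likert convention, though Gardner and Lambert's exact scoring procedure is not reproduced here.

    # One respondent's ratings of the thirteen F Scale statements above,
    # each on the +3 ... -3 agree-disagree key (invented data)
    responses = [+2, +1, -3, +3, +1, +2, -1, -2, +3, -3, +1, +2, -2]

    total = sum(responses)
    mean = total / len(responses)
    print('total = %+d, mean = %+.2f' % (total, mean))
    # By the scale's own logic (and only by that logic), the more positive
    # the total, the more 'antidemocratic ideology' is imputed to the respondent.

Everything that follows about validity is a reminder of how little such a total, by itself, can be trusted to mean.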
There are two major sources of validity on the Fascism Scale (that
is, the F Scale) that are easily accessible. First, there are the
intercorrelations between the early versions of the F Scale with
measures that were supposed (by the original authors of the F Scale)
to be similar in content, and second, there are the same data in the
several correlation tables offered by Gardner and Lambert which can
be examined. According to Adorno, et al (1950, pp. 222-24) the
original purpose in. developing the F Scale was to try to obtain a less
obvious measure of 'antidemocratic potentiaf (p. 223, their italics)
than was available in the E Scale (or Ethnocentrism Scale) which they
had already developed.
Immediately following is the Gardner and Lambert adaptation of
a selected set of the questions on the E Scale which was used in much
of their attitude research related to language use and language
learning. Below, we return to the question of the validity of the F
Scale in relation to the E Scale:
footnote, p. 264). From these data the conclusion can be drawn that if
either scale is tapping an 'authoritarian' outlook, both must be.
However, the picture changes .radically when we examine the data
from Gardner and Lambert (1972).
In studies with their modified (in fact, shortened) versions of the F
and E Scales, the correlations were .33 (for 96 English speaking high
school students in Louisiana), .39 (for 145 English speaking high
school students in Maine), .33 (for 142 English speaking high school
students in Connecticut), .33 (for 80 French-American high school
students in Louisiana), and .46 (for 98 French-American high school
students in Maine). In none of these studies does the overlap in
variance on the two tests exceed 22 %and the pattern is quite different
from the true relationship posited by Adorno, et al between F and E
(the variance overlap should be nearly perfect).
What is the explanation? One idea that has been offered previously
(Liebert and Spiegler, 1974, and their references) and which seems to
fit the data from Gardner and Lambert relates to a subject's tendency
merely to supply what are presumed to be the most socially
acceptable responses. If the subject were able to guess that the
experimenter does in fact consider some responses more appropriate
than others, this would create some pressure on sensitive subjects to
give the socially acceptable responses. Such pressure would tend to
result in positive correlations across the scales.

Another possibility is that subjects merely seek to appear
consistent from one answer to the next. The fact that one has agreed
or disagreed with a certain item on either the E or F Scale may set up a
strong tendency to respond as one has responded on previous items -
a so-called 'response set'. If the response set factor were accounting
for a large portion of the variance in measures like E and F, then this
would also account for the high correlations observed between them.
In either event, shortening the tests as Gardner and Lambert did
would tend to reduce the amount of variance overlap between them
because it would necessarily reduce the tendency of the scales to
establish a response set, or it would reduce the saliency of socially
desirable responses. All of this could happen even if neither test had
anything to do with the personality traits they are trying to measure.
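Beyond the response set and social desirability explanations offered here, there is a purely statistical reason why shortened scales should overlap less: shortening lowers reliability, and lower reliability attenuates observed correlations. A minimal sketch using the standard Spearman-Brown formula (the reliability values are invented for illustration; the text supplies none):

# Spearman-Brown: reliability of a test changed to k times its
# original length. Observed correlations between two tests are
# attenuated in proportion to the square root of the product of
# their reliabilities (here taken to be equal for both scales).
# The .80 and .90 figures are hypothetical.
def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

rel_full = .80                           # hypothetical full-length reliability
rel_half = spearman_brown(rel_full, .5)  # about .67 at half length
true_r = .90                             # posited correlation of the constructs
print(true_r * rel_full)                 # observed r at full length: .72
print(true_r * rel_half)                 # observed r when both are halved: .60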
Along this line, Crowne and Marlowe (1964) report:
Acquiescence has been established as a major response
determinant in the measurement of such personality variables
as authoritarianism (Bass, 1955, Chapman and Campbell,
1957, Jackson and Messick, 1958). The basic method has been
to show, first of all, that a given questionnaire - say the
California F Scale (Adorno, et al, 1950) - has a large proportion
of items keyed agree (or true or yes). Second, half the items are
reversed, now being scored for disagreement. Correlations are
then computed between the original and the reversed items.
Failure to find high negative correlations is, then, an indication
of the operation of response acquiescence. In one study of the F
Scale, in fact, significant positive correlations - strongly
indicative of an acquiescent tendency - were found (Jackson,
Messick, and Solley, 1957), (p. 7).
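The check described in this quotation is easy to simulate. A minimal sketch with wholly invented data (the sample size, item counts, and probability ranges are arbitrary assumptions):

import random

# Reverse-score half the items and correlate the halves. Agreement
# with an original item contradicts agreement with its reversal, so
# content-driven responding should yield a strong negative correlation
# between agreement counts on the two halves; pure acquiescence (a
# tendency to agree with anything) yields a positive one instead.
random.seed(0)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def simulate(acquiescent, subjects=200, items=10):
    orig_agree, rev_agree = [], []
    for _ in range(subjects):
        if acquiescent:
            p_orig = p_rev = random.uniform(.2, .9)  # agrees regardless
        else:
            trait = random.uniform(.1, .9)           # content-driven
            p_orig, p_rev = trait, 1 - trait
        orig_agree.append(sum(random.random() < p_orig for _ in range(items)))
        rev_agree.append(sum(random.random() < p_rev for _ in range(items)))
    return pearson(orig_agree, rev_agree)

print(f'content-driven: r = {simulate(False):+.2f}')  # strongly negative
print(f'acquiescent:    r = {simulate(True):+.2f}')   # positive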
Actually, the failure to find high negative correlations is not
necessarily indicative only of a response acquiescence tendency; there
are a number of other possibilities, but all of them are fatal to the
claims of validity for the scale in question.
Another problem with scales like E and F involves the tendency for
respondents to differentiate factive and emotive aspects of statements
with which they are asked to agree or disagree. One may agree with
the factive content of a statement and disagree with the emotive tone
(both of which in the case of written questionnaires are coded
principally in choice of words). Consider, for instance, the factive
content and emotive tone of the following statements (the first
version is from the Anomie Scale which is discussed below):
A. The big trouble with our country is that it relies, for the
most part, on the law of the jungle: 'Get him before he gets
you.'
B. The most serious problem of our people is that too few of
them practice the Golden Rule: 'Do unto others as you
would have them do unto you.'
C. The greatest evil of our country is that we exist, for the most
part, by the law of survival: 'Speak softly and carry a big
stick.'
Whereas the factive content of the preceding statements is similar in
all cases, and though each might be considered a rough paraphrase of
the others, they differ greatly in emotive tone. Concerning such
differences (which they term 'content' and 'style' respectively),
Crowne and Marlowe (1964) report:

Applying this differentiation to the assessment of personality
characteristics or attitudes, Jackson and Messick (1958)
contended that both stylistic properties of test items and
habitual expressive or response styles of individuals may
outweigh the importance of item content. The way an item is
worded - its style of expression - may tend to increase its
frequency of endorsement (p. 8).
Their observation is merely a special case of the hypothesis which we
discussed in Chapter 4 (p. 82f) on the relative importance of factive
and emotive aspects of communication.
Taking all of the foregoing into account, the validity of the E and F
Scales is in grave doubt. The hypothesis that they are measures of the
same basic configuration of personality traits (or at least of similar
configurations associated with 'authoritarianism') is not the only
hypothesis that will explain the available data - nor does it seem to be
the most plausible of the available alternatives. Furthermore, if the
validity of the E and F Scales is in doubt, their pattern of
interrelationship with other variables - such as attained proficiency in
a second language - is essentially uninterpretable.
A third scale used by Gardner and Lambert is the Anomie Scale
adapted partly from Srole (1951, 1956):
ANOMIE SCALE
1. In the U.S. today, public officials aren't really very interested in the problems of the average man.
2. Our country is by far the best country in which to live. (The scale is reversed on this item and on number 8.)
3. The state of the world being what it is, it is very difficult for the student to plan for his career.
4. In spite of what some people say, the lot of the average man is getting worse, not better.
5. These days a person doesn't really know whom he can count on.
6. It is hardly fair to bring children into the world with the way things look for the future.
7. No matter how hard I try, I seem to get a 'raw deal' in school.
8. The opportunities offered young people in the United States are far greater than in any other country.
9. Having lived this long in this culture, I'd be happier moving to some other country now.
10. In this country, it's whom you know, not what you know, that makes for success.
11. The big trouble with our country is that it relies, for the most part, on the law of the jungle: 'Get him before he gets you.'
12. Sometimes I can't see much sense in putting so much time into education and learning.
This test is intended to measure 'personal dissatisfaction or
discouragement with one's place in society' (Gardner and Lambert,
1972, p. 21).
Oddly perhaps, Gardner and Lambert (1972, and their other works
reprinted there) have consistently predicted that higher scores on the
preceding scale should correspond to higher performance in learning
a second language - i.e., that degree of anomie and attainment of
proficiency in a second language should be positively correlated -
however, they have predicted that the correlations for the E and F
Scales with attainment in a second language should be negative. The
difficulty is that other authors have argued that scores on the Anomie
Scale and the E and F Scales should be positively intercorrelated with
each other - that, in fact, 'anomia is a factor related to the formation
of negative rejective attitudes toward minority groups' (Srole, 1956,
p. 712).
Srole cites a correlation of .43 between a scale designed to measure
prejudice toward minorities and his 'Anomia Scale' (both 'anomie'
and 'anomia' are used in designating the scale in the literature), as
evidence that 'lostness is one of the basic conditions out of which
some types of political authoritarianism emerge' (p. 714, footnote
20). Yet other authors have predicted no relationship at all between
Anomie scores and F scores (Christie and Geis, 1970, p. 360).
Again, we seem to be wrestling with hypotheses that are flavored
more by the preferences of a particular research technique than they
are by substantial research data. Even 'nominal' validity cannot be
invoked when the researchers fail to agree on the meaning of the
name associated with a particular questionnaire. Gardner and
Lambert (1972) report generally positive correlations between
Anomie scores and E and F scores. This, rather than contributing to a
sense of confidence in the Anomie Scale, merely makes it, too, suspect
of a possible response set factor - or a tendency to give socially
acceptable responses, or possibly to give consistent responses to
similarly scaled items (negatively toned or positively toned). In brief,
there is little or no evidence to show that the scale in fact measures
what it purports to measure.
Christie and Geis (1970) suggest that the F Scale was possibly the
most studied measure of attitudes for the preceding twenty year
period (p. 38). One wonders how such a measure survives in the face
of data which indicate that it has no substantial claims to validity.
Further, one wonders why, if such a studied test has produced such a
conglomeration of contradictory findings, anyone should expect to
be able to whip together an attitude measure (with much less study)
that will do any better. The problems are not merely technical ones,
associated with test reliability and validity; they are also moral ones
having to do with the uses to which such tests are intended to be put.
The difficulties are considered severe enough for Shaw and Wright
(1967) to put the following statement in a conspicuous location at the
end of the Preface to their book on Scales for the Measurement of
Attitudes:
The attitude scales in this book are recommended for research
purposes and for group testing. We believe that the available
information and supporting research does not warrant the
application of many of these scales as measures of individual
attitude for the purpose of diagnosis or personnel selection or
for any other individual assessment process (p. xi).
In spite of such disclaimers, application of such measurement
techniques to the diagnosis of individual performances - e.g.,
prognosticating the likelihood of individual success or failure in a
course of study with a view to selecting students who are more likely
to succeed in 'an overcrowded program' (Brodkey and Shore, 1976) -
is sometimes suggested:
A problem which has arisen at the University of New Mexico is
one of predicting the language-learning behavior of students in
an overcrowded program which may in the near future become
highly selective. This paper is a progress report on the design of
an instrument to predict good and poor language learning
behavior on the basis of personality. Subjects are students in the
English Tutorial Program, which provides small sized classes
for foreign, Anglo-American, and minority students with poor
college English skills. Students demonstrate a great range of
linguistic styles, including English as a second language,
English as a second dialect, and idiosyncratic problems, but all
can be characterized as lacking skill in the literate English of
college usage - a difficult 'target' language (p. 153).
In brief, Brodkey and Shore set out to predict teacher ratings of
students (on 15 positively scaled statements with which the teacher
must agree or disagree) on the basis of the student's own preferential
ordering of 40 statements about himself - some of the latter are listed
below. The problem was to predict which students were likely to
succeed. Presumably, students judged likely to succeed would be
given preference at time of admission.
Actually, the student was asked to sort the 40 statements twice -
first, in response to how he would like to be and second, how he was at
the time of performing the task. A third score was derived by
computing the difference between the first two scores. (There is no
way to determine on the basis of information given by Brodkey and
Shore how the items were scaled - that is, it cannot be determined
whether agreeing with a particular statement contributed positively
or negatively to the student's score.)
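The derived score can nonetheless be made concrete with a toy computation. A minimal sketch, assuming (hypothetically) that each sort yields an integer position per statement and that the third score is a sum of absolute discrepancies; Brodkey and Shore do not specify the computation:

# Each statement is sorted twice: once for the ideal self and once for
# the actual self. The discrepancy measure below (sum of absolute
# differences in sort position) is only one plausible reading of the
# 'difference' score; the positions here are invented.
ideal =  {1: 3, 2: 5, 3: 1, 4: 7}   # statement number -> sort position
actual = {1: 6, 2: 5, 3: 4, 4: 2}
discrepancy = sum(abs(ideal[s] - actual[s]) for s in ideal)
print(discrepancy)  # 11: larger values mean a wider ideal/actual gap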
THE Q-SORT STATEMENTS
(from Appendix B of Brodkey and Shore, 1976, pp. 161-2)
1. My teacher can probably see that I am an interesting person from reading my essays.
2. My teachers usually enjoy reading my essays.
3. My essays often make me feel good.
4. My next essay will be written mainly to please myself.
5. My essays often leave me feeling confused about my own ideas.
6. My writing will always be poor.
7. No matter how hard I try, my grades don't really improve much.
8. I usually receive fair grades on my assignments.
10. My next essay will be written mainly to please my teacher.
11. I dislike doing the same thing over and over again.
18. I often get my facts confused.
19. When I feel like doing something I go and do it now.
22. I have trouble remembering names and faces.
28. I am more interested in the details of a job than just getting it done.
29. I sometimes have trouble communicating with others.
30. I sometimes make decisions too quickly.
31. I like to get one project finished before starting another.
32. I do my best work when I plan ahead and follow the plan.
34. I try to get unpleasant tasks out of the way before I begin working on more pleasant tasks.
36. I always try to do my best, even if it hurts other people's feelings.
37. I sometimes hurt other people's feelings without knowing it.
38. I often let other people's feelings influence my decisions.
39. I am not very good at adapting to changes.
40. I am usually very aware of other people's feelings.

On the basis of what possible theory of personality can the
foregoing statements be associated with a definition of the
successful student? Suppose that some theory is proposed which
offers an unambiguous basis for scaling the items as positive or
negative. What is the relationship of an item like 37 to such a theory?
On the basis of careful thought one might conclude that statement 37
is not a valid description of any possible person since if such a person
hurt other people's feelings without knowing it, how would he know
it? In such a case the score might be either positively or negatively
related to logical reasoning ability - depending on whether the item is
positively or negatively scaled. Note also the tendency throughout
to place the student in the position of potential double-binds.
Consider item 28 about the details of a job versus getting it done.
Agreeing or disagreeing may be true and false at the same time.
Further, consider the fact that if the items related to the teacher's
attitudes are scaled appropriately (that is, in accord with the teacher's
attitudes about what a successful learner is), the test may be a
measure of the subject's ability to perceive the teacher's attitudes -
i.e., to predict the teacher's evaluation of the subject himself. This
would introduce a high degree of correlation between the personality
measure (the Q-Sort) and the teacher's judgements (the criterion of
whether or not the Q-Sort is a valid measure of the successful student)
- but the correlation would be an artefact (an artificially contrived
result) of the experimental procedure. Or consider yet another
possibility. Suppose the teacher's judgements are actually related to
how well the student understands English - is it not possible that the
Q-Sort task might in fact discriminate among more and less proficient
users of the language? These possibilities might combine to produce
an apparent correlation between the 'personality' measure and the
definition of 'success'.
No statistical correlation (in the sense of Chapter 3 above) is
reported by Brodkey and Shore (1976). They do, however, report a
table of correspondences between grades assigned in the course of
study (which themselves are related to the subjective evaluations of
teachers stressing 'reward for effort, rather than achievement alone',
p. 154). Then they proceed to an individual analysis of exceptional
cases: the Q-Sort task is judged as not being reliable for '5 Orientals, 1
reservation Indian, and 3 students previously noted as having serious
emotional problems' (p. 157). The authors suggest, 'a general
conclusion might be that the Q-sort is not reliable for Oriental
students, who may have low scores but high grades, and is slightly less
reliable for women than men .... [for] students 30 or older, ... Q-sort
scores seemed independent of grades .. .' (p. 157). No explanations
are offered for these apparently deviant cases, but the authors
conclude nonetheless that 'the Q-sort is on the way to providing us
with a useful predictive tool for screening Tutorial Program
applicants' (p. 158).
In another study, reported in an earlier issue of the same journal,
Chastain (1975) correlated scores on several personality measures
with grades in French, German, and Spanish for students numbering
80, 72, and 77 respectively. In addition to the personality measures
(which included scales purporting to assess anxiety, outgoingness,
and creativity), predictor variables included the verbal and
quantitative sub-scores on the Scholastic Aptitude Test, high school
rank, and prior language experience. Chastain observes, 'surprising
as it may seem, the direction of correlation was not consistent [for test
anxiety]' (p. 160). In one case it was negative (for 15 subjects enrolled
in an audio-lingual French course, -.48), and in two others it was
positive (for the 77 Spanish students, .37; and for the 72 German
students, .21). Chastain suggests that 'perhaps some concern about a
test is a plus while too much anxiety can produce negative results' (p.
160). Is his measure valid? Chastain's measure of Test Anxiety came
from Sarason (1958, 1961). An example item given by Sarason (1958)
is 'While taking an important examination, I perspire a great deal' (p.
340). In his 1961 study, Sarason reports correlations with 13
measures of 'intellectual ability' and the Test Anxiety scale along with
five other measures of personality (all of them subscales on the
Autobiographical Survey). For two separate studies with 326 males
and 412 females (all freshman or sophomore students at the
University of Washington, Seattle), no correlations above .30 were
reported. In fact, Test Anxiety produced the strongest correlations
with high school grade averages (divided into six categories) and with
scores on Cooperative English subtests. The highest correlation was
-.30 between Test Anxiety and the ACE Q (1948, presumably a
widely used test since the author gives only the abbreviation in the
text of the article). These are hardly encouraging validity statistics.

A serious problem is that correlations of above .4 between the
various subscores on the Autobiographical Survey may possibly be
explained in terms of response set. There is no reason for concluding
that Test Anxiety (as measured by the scale by the same name) is a
substantial factor in variance obtained in the various 'intellectual'
variables. Since in no case did Chastain's other personality variables
account for as much as 10% of the variance in grades, they are not
discussed here. We only note in passing that he is probably correct in
saying that 'course grade may not be synonymous with achievement'
(p. 159) - in fact it may be sensitive to affective variables precisely
because it may involve some affectively based judgement (see
especially the basis for course grades recommended by Brodkey and
Shore, 1976, p. 154).
We come now to the empathy measure used by Guiora, Paluszny,
Beit-Hallahmi, Catford, Cooley, and Dull (1975) and by Guiora and
others. In studying the article by Haggard and Isaacs (1966), where
the original MME (Micro-Momentary Expression) test had its
beginnings, it is interesting to note that for highly skilled judges the
technique adapted by Guiora, et al, had average reliabilities of only
.50 and .55. The original authors (Haggard and Isaacs) suggest that 'it
would be useful to determine the extent to which observers differ in
their ability to perceive accurately rapid changes of facial expressions
and the major correlates of this ability' (p. 164). Apparently, Guiora
and associates simply adapted the test to their own purpose with little
change in its form and without attempting (or at least without
reporting attempts) to determine whether or not it constituted a
measure of empathy.
From their own description of the MME, several problems become
immediately apparent. The subject is instructed to push a button,
which is attached to a recording device, whenever he sees a change in
facial expression on the part of a person depicted on film. The first
obvious trouble is that there is no apparent way to differentiate
between hits and misses - that is, there is no way to tell for sure
whether the subject pushed the button when an actual change was
taking place or merely when the subject thought a change was taking
place. In fact, it is apparently the case that the higher the number of
button presses, the higher the judged empathy of the subject. Isn't it
just as reasonable to assume that an inordinately high rate of button
presses might correspond to a high rate of false alarms? In the data
reported by Haggard and Isaacs, even highly skilled judges were not
able to agree in many cases on when changes were occurring, much
less on the meaning of the changes (the latter would seem to be the
more important indicator of empathy). They observe, 'it is more
difficult to obtain satisfactory agreement when the task is to identify
and designate the impulse or affect which presumably underlies any
particular expression or expression change' (p. 158). They expected to
be able to tell more about the nature and meaning of changes when
they slowed down the rate of the film. However, in that condition (a
condition also used by Guiora and associates, see p. 51) the reliability
was even lower on the average than it was in the normal speed
condition (.50 versus .55, respectively).
Since it is axiomatic (though perhaps not exactly true for all
empirical cases) that the validity of a test cannot exceed the square of
its reliability, the validity estimates for the MME would have to be in
the range of .25 to .30 - and this would be only for the case when the
test is taken as a measure of someone's ability to notice changes in facial
expressions, or better, as a measure of interjudge agreement on the
task of noticing changes in facial expressions. The extrapolation from
such judgements to 'empathy' as the construct to be measured by the
MME is a wild leap indeed. No validity estimates are possible on the
basis of available data for an inferential jump of the latter sort.
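Applying the chapter's stated rule of thumb to the reliabilities Haggard and Isaacs report gives the figures just cited. A minimal sketch:

# The chapter's rule of thumb: validity cannot exceed the square of
# reliability. Applied to the average interjudge reliabilities
# reported by Haggard and Isaacs (1966) for the two film speeds.
for condition, reliability in [('slowed film', .50), ('normal speed', .55)]:
    print(f'{condition}: reliability = {reliability:.2f}, '
          f'validity ceiling = {reliability ** 2:.2f}')
# Ceilings of .25 and about .30 - and even these apply only to the
# task of noticing changes in facial expressions, not to 'empathy'.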
Another widely used measure of attitudes - one that is somewhat
less direct than questions or statements concerning the attitudes of
the subject toward the object or situation of interest - is the semantic
differential technique which was introduced by Osgood, Suci, and
Tannenbaum (1957) for a wider range of purposes. In fact, they were
interested in the measurement of meaning in a broader sense. Their
method, however, was adapted to attitude studies by Lambert and
Gardner, and by Spolsky (1969a). Several follow up studies on the
Spolsky research are discussed in Oller, Baca, and Vigil (1977).
Gardner and Lambert (1972) reported the use of seven point scales
of the following type (subjects were asked to rate themselves,
Americans, how they themselves would like to be, French-Americans,
and their French teacher):
SEMANTIC DIFFERENTIAL SCALES, BIPOLAR VARIETY
1. Interesting _:_:_:_:_:_:_ Boring
2. Prejudiced _:_:_:_:_:_:_ Unprejudiced
3. Brave _:_:_:_:_:_:_ Cowardly
4. Handsome _:_:_:_:_:_:_ Ugly
5. Colorful _:_:_:_:_:_:_ Colorless
6. Friendly _:_:_:_:_:_:_ Unfriendly
7. Honest _:_:_:_:_:_:_ Dishonest
8. Stupid _:_:_:_:_:_:_ Smart
9. Kind _:_:_:_:_:_:_ Cruel
10. Pleasant _:_:_:_:_:_:_ Unpleasant
11. Polite _:_:_:_:_:_:_ Impolite
12. Sincere _:_:_:_:_:_:_ Insincere
13. Successful _:_:_:_:_:_:_ Unsuccessful
14. Secure _:_:_:_:_:_:_ Insecure
15. Dependable _:_:_:_:_:_:_ Undependable
16. Permissive _:_:_:_:_:_:_ Strict
17. Leader _:_:_:_:_:_:_ Follower
18. Mature _:_:_:_:_:_:_ Immature
19. Stable _:_:_:_:_:_:_ Unstable
20. Happy _:_:_:_:_:_:_ Sad
21. Popular _:_:_:_:_:_:_ Unpopular
22. Hardworking _:_:_:_:_:_:_ Lazy
23. Ambitious _:_:_:_:_:_:_ Not Ambitious
Semantic differential scales of a unipolar variety were used by
Gardner and Lambert (1972) and by Spolsky (1969a) and others (see
Oller, Baca, and Vigil, 1977). In form they are very similar to the
bipolar scales except that the points of the scales have to be marked
with some value such as 'very characteristic' or 'not at all
characteristic' or possibly 'very much like me' or 'not at all like me'.
Seven point and five point scales have been used.
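To make the scoring of such scales concrete, here is a minimal sketch; the reversal set and the responses are hypothetical, since the text does not prescribe a scoring procedure:

# Convert a mark on a seven-point bipolar scale to a score in which
# higher always means more favorable. Position 1 is the leftmost
# point. Most scales above list the favorable pole on the left; a few
# (e.g., Stupid-Smart) list it on the right. Nothing here is
# prescribed by the sources discussed in the text.
FAVORABLE_ON_RIGHT = {'Stupid-Smart'}

def favorability(scale, position, points=7):
    if scale in FAVORABLE_ON_RIGHT:
        return position
    return points + 1 - position

print(favorability('Interesting-Boring', 2))  # 6: marked near 'Interesting'
print(favorability('Stupid-Smart', 6))        # 6: marked near 'Smart'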
In an evaluation of attitudes toward the use of a particular
language Lambert, Hodgson, Gardner, and Fillenbaum (1960) used
a 'matched guise' technique. Fluent French-English bilinguals
recorded material in both languages. The recordings from several
speakers were then presented in random order and subjects were
asked to rate the speakers. (Subjects were, of course, unaware that
each speaker was heard twice, once in English and once in French.)
SEMANTIC DIFFERENTIAL SCALES, UNIPOLAR VARIETY
1. Height   very little _:_:_:_:_:_:_ very much
and so on for the attributes: good looks, leadership, thoughtfulness,
sense of humor, intelligence, honesty, self-confidence, friendliness,
dependability, generosity, entertainingness, nervousness, kindness,
reliability, ambition, sociability, character, and general likability.
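The logic of the matched guise comparison can be put in a few lines. A minimal sketch with invented speakers and ratings:

# Matched guise logic: the same bilingual speaker is rated twice, once
# per language, without judges knowing the voices match, so systematic
# rating differences are attributable to the language guise alone.
# Speakers and ratings below are invented.
ratings = {  # speaker -> trait rating per guise
    'speaker 1': {'English': 5.2, 'French': 4.1},
    'speaker 2': {'English': 4.8, 'French': 4.0},
    'speaker 3': {'English': 5.5, 'French': 4.6},
}
diffs = [g['English'] - g['French'] for g in ratings.values()]
print(sum(diffs) / len(diffs))  # mean guise effect on this trait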
Spolsky (1969a) and others have used similar lists of terms
presumably defining personal attributes: helpful, humble, stubborn,
businesslike, shy, nervous, kind, friendly, dependable, and so forth.
The latter scales in Spolsky's studies, and several modeled after his,
were referenced against how subjects saw themselves to be, how they
would like to be, and how they saw speakers of their native language,
and speakers of a language they were in the process of acquiring.
How reliable and valid are the foregoing types of scales? Little
information is available. Spolsky (1969a) reasoned that scales such as
the foregoing should provide more reliable data than those which
were based on responses to direct questions concerning a subject's
agreement or disagreement with a statement rather bald-facedly
presenting a particular attitude bias, or than straightforward
questions about why subjects were studying the foreign language and
the like. The semantic differential type scales were believed to be more
indirect measures of subject attitudes and therefore more valid than
more direct questions about attitudes. The former, it was reasoned,
should be less susceptible to distortion by sensitive respondents.
Data concerning the tendency of scales to correlate in meaningful
ways are about the only evidence concerning the validity of such
scales. For instance, negatively valued scales such as 'stubborn',
'nervous', and 'shy' tend to cluster together (by correlation and factor
analysis techniques), indicating at least that subjects are differentiating
the semantic values of scales in meaningful ways. Similarly, scales
concerning highly valued positive traits such as 'kind', 'friendly',
'dependable', and the like also tend to be more highly correlated with
each other than with very dissimilar traits.
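The clustering evidence referred to here can be illustrated with a toy computation. A minimal sketch with invented seven-point ratings (only the scale names come from the text):

import numpy as np

# Hypothetical ratings from five subjects on four scales. 'stubborn'
# and 'nervous' are negatively valued; 'kind' and 'friendly' are
# positively valued. The numbers are invented purely to show the
# clustering pattern the text describes.
ratings = {
    'stubborn': [6, 5, 6, 2, 3],
    'nervous':  [5, 6, 6, 1, 2],
    'kind':     [2, 3, 2, 6, 7],
    'friendly': [1, 2, 3, 7, 6],
}
names = list(ratings)
r = np.corrcoef([ratings[n] for n in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f'{names[i]} x {names[j]}: r = {r[i, j]:+.2f}')
# Similarly valued scales correlate positively with each other and
# negatively with dissimilar ones - the pattern taken as evidence that
# subjects differentiate the scales meaningfully.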
There is also evidence that views of persons of different national,
ethnic, or linguistic backgrounds differ substantially in ways that are
characteristic of known attitudes of certain groups. For instance,
Oller, Baca, and Vigil (1977) report data showing that a group of
Mexican American women in a Job Corps program in Albuquerque
generally rate Mexicans substantially higher than they rate
Americanos (Anglo-Americans) on the same traits. It is conceivable
that such scales could be used to judge the strength of self-concept,
attitude toward other groups, and similar constructs. However, much
more research is needed before such measures are put forth as
measures of particular constructs. Furthermore, they are subject to
all of the usual objections concerning self-reported data.
Little research has been done with the measurement of attitudes in
children (at least this is true in relation to the questions and research
interests discussed in section B above). Recently, however, Zirkel and
Jackson (1974) have offered scales intended for use with children of
Anglo, Black, Native American, and Chicano heritages. These scales
are of the Likert-type (agree versus disagree on a five point scale with
one position for 'don't know'). The innovation in their technique
involves the use of line drawings of happy versus sad faces as shown
in Figure 4. Strickland (1970) may have been the first to use such a
method with children. It is apparently a device for obtaining scaled
responses to attitude objects (such as food, dress, games, well known
personalities who are models of a particular cultural group, and
symbols believed important in the definition of a culture). The scales
are used for non-readers and preliterate children. The Cultural
Attitude Scales exist in four forms (one for each of the above
designated ethnic groups).
Figure 4. Example of a Likert-type attitude scale intended for children.
(From Zirkel (1973), Black American Cultural Attitude Scale. The scale has
five points with a possible 'don't know' answer as represented at the extreme
left - scales are referenced against objects presumed important to defining
cultural attitudes and of a sort that can be pictured easily.)

The Technical Report indicates test-retest reliabilities ranging from
.52 to .61, and validity coefficients ranging from .15 to .46. These are
not impressive if one considers that only about 4% to 25% of the
variance in the scales is apt to be related to the attitudes they purport to
measure.
No figures on reliability or validity .are given in the Test Manual.
The authors caution, 'the use of the Cultural Attitude Scales to
diagnose the acculturation of individual children in the classroom is
at this time very precarious' (p. 27). It seems to be implied that
'acculturation' is a widely accepted goal of educational programs,
and this is questionable. Further, it seems to be suggested that the
scales might someday be used to determine the level of acculturation
of individual children - this implication seems unwarranted. There is
not any reason to expect reliable measures of such matters ever to be
forthcoming. Nonetheless, the caution is commendable.
The original studies with the Social Distance Scale (Bogardus,
1925, 1933), from which the Cultural Attitude Scales very indirectly
derive, suggest that stereotypes of outgroups are among the most
stable attitudes and that the original scale was sufficiently reliable and
valid to use with some confidence (see Shaw and Wright, 1967). With
increasing awareness of crosscultural sensitivities, it may be that
measures of social distance would have to be recalibrated for today's
societal norms (if indeed such norms exist and can be defined) but the
original studies and several follow ups have indicated reliabilities in
the range of .90 and 'satisfactory' validity according to Newcomb
(1950, as cited by Shaw and Wright, 1967, p. 408). The original scales
required the respondent to indicate whether he would marry into,
have as close friends, as fellow workers, as speaking acquaintances, as
visitors to his country, or would debar from visiting his country
members of a specific minority or designated outgroup. The
Bogardus definition of 'social distance', however, is considerably
narrower than that proposed more recently by Schumann (1976). The
latter is, at this point, a theoretical construct in the making and is
therefore not reviewed here.

D. Observed relationships to achievement and remaining puzzles
The question addressed in this section is, how are affective variables
related to educational attainment in general, and in particular to the
acquisition of a second language? Put differently, what is the nature
and the strength of observed relationships? Gardner (1975) and
Gardner, Smythe, Clement, and Gliksman (1976) have argued that
attitudes are somewhat indirectly related to attainment of proficiency
in a second language. Attitudes, they reason, are merely one of the
types of factors that give rise to motivations, which are merely one of
the types of factors which eventually result in attainment of
proficiency in a second language. By this line of reasoning, attitudes
are considered to be causally related to achievement of proficiency in
a second language even though the relationship is not apt to be a
very strong one.

In a review of some 33 surveys of 'six different grade levels (grades 7
to 12) from seven different regions across Canada' (p. 32) involving
no less than about 2,000 subjects, the highest average correlation
between no less than 12 different attitude scales with two measures of
French achievement in no case exceeded .29 (Gardner, 1975). Thus,
the largest amount of variance in language proficiency scores that
was predicted on the average by the attitude measures was never
greater than 8.4%. This result is not inconsistent with the claim that
the relationship between attitude measures and attainment in a
second language is quite indirect.

However, such a result also leaves open a number of alternative
explanations. It is possible that the weakness of the observed
relationships is due to the unreliability or lack of validity of the
attitude measures. If this explanation were correct, there might be a
much stronger relationship between attitudes and attained proficiency
than has or ever would become apparent using those attitude
measures. Another possibility is that the measures of language
proficiency used are themselves low in reliability or validity. Yet
another possibility is that attitudes do not cause attainment of
proficiency but rather are caused by the degree of proficiency attained
- though weakly perhaps. And, many other possibilities exist.
Backman (1976) in a refreshingly different approach to the
assessment of attitudes offers what she calls the 'chicken or egg'
puzzle. Do attitudes in fact cause behaviors in some way, or are
attitudes rather the result of behaviors? Savignon (1972) showed that
in the foreign language classroom positive attitudes may well be a
result of success rather than a cause.
It seems quite plausible that success in learning a second language
might itself give rise to positive feelings toward the learning situation
and everything (or everyone) associated with it. Similarly, failure
might engender less positive feelings. Yet another plausible
alternative is that attitudes and behaviors may be complexly
interrelated such that both of them influence each other. Bloom
(1976) prefers this latter alternative. Another possibility is that
attitudes are associated with the planning of actions and the
perception of events in some way that influences the availability of
sensory data and thus the options that are perceivable or conceivable
to the doer or learner.
The research of Manis and Dawes (1961) showing that cloze scores
were higher for subjects who agreed with the content of a passage
than for those who disagreed would seem to lend credence to this last
suggestion. Manis and Dawes concluded that it wasn't just that
subjects didn't want to give right answers to items on passages with
which they disagreed, but that they were in fact less able to give right
answers. Such an interpretation would also fit the data from a wide
variety of studies revealing expectancy biases of many sorts.
However, Doerr (in press) raises some experimental questions about
the Manis and Dawes design.
Regardless of the solution to the (possibly unsolvable) puzzle
about what causes what, it is still possible to investigate the strength
of the relationship of attitudes as expressed in response to
questionnaires and scores on proficiency measures of second
language ability for learners in different contexts. It has been
observed that the relationship is apparently stronger in contexts
where learners can avail themselves of opportunities to talk with
representatives of the target language group(s), than it is in contexts
where the learners do not have an abundance of such opportunities.
For instance, a group of Chinese adults in the Southwestern United
States (mostly graduate students on temporary visas) performed
somewhat better on a cloze test in English if they rated Americans as
high on a factor defined by such positive traits as helpfulness,
sincerity, kindness, reasonableness, and friendliness (Oller, Hudson,
and Liu, 1977). Similarly, a group of Mexican-American women
enrolled in a Job Corps program in Albuquerque, New Mexico
scored somewhat higher on a cloze test if they rated themselves higher
on a factor defined by traits such as calmness, conservativeness,
religiosity, shyness, humility, and sincerity (Oller, Baca, and Vigil,
1977). The respective correlations between the proficiency measures
and the attitude factors were .52 and .49.
In the cases of two populations of Japanese subjects learning
English as a foreign language in Japan, weak or insignificant
relationships between similar attitude measures and similar
proficiency measures were observed (Asakawa and Oller, 1978, and
Chihara and Oller, 1978). In the first mentioned studies, where
learners were in a societal context rich in occasions where English
might be used, attitudinal variables seemed somewhat more closely
related to attained proficiency than in the latter studies, where
learners were presumably exposed to fewer opportunities to
communicate in English with representatives of the target language
culture(s).
These observed contrasts between second language contexts (such
as the ones the Chinese and Mexican-American subjects were
exposed to) and foreign language contexts (such as the ones the
Japanese subjects were exposed to) are by no means empirically
secure. They seem to support the hunch that attitudes may have a
greater importance to learning in some contexts than in others, and
the direction of the contrasts is consistently in favor of the second
language contexts. However, the pattern of sociocultural variables in
the situations referred to is sufficiently diverse to give rise to doubts
about their comparability. Further, the learners in the second
language contexts generally achieved higher scores in English.
Probably the stickiest and most persistent difficulty in obtaining
reliable data on attitudes is the necessity to rely on self-reports, or
worse yet, someone else's evaluative and second-hand judgement.
The problem is not just one of honesty. There is a serious question
whether it is reasonable to expect someone to give an unbiased report
of how they feel or think or behave when they are smart enough to
know that such information may in some way be used against them,
but this is not the only problem. There is the question of how reliable
and valid are a person's judgements even when there is no aversive
stimulus or threat to his security. How well do people know how
they would behave in such and such a hypothetical situation? Or,
how does someone know how they feel about a statement that may
not have any relevance to their present experience? Are average
scores on such tasks truly indicative of group tendencies in terms of
attitudes and their supposed correlative behaviors, or are they merely
indicative of group tendencies in responding to what may be
relatively meaningless tasks?
The foregoing questions may be unanswerable for precisely the
same reasons that they are interesting questions. However, there are
other questions that can be posed concerning subjective self-ratings
that are more tractable. For instance, how reliable are the self-ratings
of subjects of their own language skills in a second language, say?
Can subjects reliably judge their own ability in reading, writing, or
speaking and listening tasks? Frequently in studies of attitude, the
measures of attitude are correlated only with subjects' own reports of
how well they speak a certain language in a certain context with no
objective measures of language skill whatsoever. Or, alternatively,
subjects are asked to indicate when they speak language X and to
whom and in what contexts. How reliable are such judgements?
In the cases where they can be compared to actual tests of language
proficiency, the results are not too encouraging. In a study with four
different proficiency measures (grammar, vocabulary, listening
comprehension, and cloze) and four self-rating scales (listening,
speaking, reading, and writing), in no case did a correlation between a
self-rating scale and a proficiency test reach .60 (there were 123
subjects) - this indicates less than 36% overlap in variance on any of
the self-ratings with any of the proficiency tests (Chihara and Oller,
1978). In another study (Oller, Baca, and Vigil, 1977) correlations
between a single self-rating scale and a cloze test in English scored by
exact and acceptable methods (see Chapter 12) were .33 and .37
respectively (subjects numbered 60).
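For readers unfamiliar with the two cloze scoring methods mentioned (they are treated fully in Chapter 12), a minimal sketch may help; the deleted words and the acceptable alternatives are hypothetical:

# Exact scoring credits only the deleted word itself; acceptable
# scoring also credits contextually acceptable alternatives.
def score_cloze(responses, keys, acceptable=None):
    """Return the proportion of blanks counted correct."""
    correct = 0
    for response, key in zip(responses, keys):
        if response == key:
            correct += 1
        elif acceptable and response in acceptable.get(key, set()):
            correct += 1
    return correct / len(keys)

keys = ['language', 'speak', 'home']          # hypothetical deleted words
responses = ['language', 'use', 'home']       # one learner's answers
alternatives = {'speak': {'use', 'learn'}}    # hypothetical acceptable set

print(score_cloze(responses, keys))                # exact method: about .67
print(score_cloze(responses, keys, alternatives))  # acceptable method: 1.0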
Techniques which require others to make judgements about the
attitudes of a person or group seem to be even less reliable than
self-ratings. For instance, Jensen (1965) says of the most widely used
'projective' technique for making judgements about personality
variables, 'put frankly, the consensus of qualified judgement is that
the Rorschach is a very poor test and has no practical worth for any
of the purposes for which it is recommended by its devotees' (p. 293).
Careful research has shown that the Rorschach (administered to
about a million persons a year in the 1960s in the U.S. alone,
according to Jensen, at a total cost of 'approximately 25 million
dollars', p. 292) and other projective techniques like it (such as the
Thematic Apperception Test, mentioned at the outset of this chapter)
generate about as much variability across trained judges as they do
across subjects. In other words, when trained psychologists or
psychiatrists use projective interview techniques such as the TAT or
Rorschach to make judgements about the personality of patients or
clients, the judges differ in their judgements about as much as the
patients differ in their judged traits. On the basis of such tests it would
be impossible to tell the difference (even if there was one) between the
level of, say, anxiety, in Mr Jones and Mr Smith. In the different
ratings of Mr Jones, he would appear about as different from himself
as he would from Mr Smith. 2
In conclusion, the measurement of attitudes does not seem to be a
promising field - though it offers many challenges. Chastain urges in
the conclusion to his 1975 article that 'each teacher should do what he
or she can to encourage the timid, support the anxious, and loose the
creative' (p. 160). One might add that the teacher will probably be far
more capable of determining who is anxious, timid, creative, and we
may add empathetic, aggressive, outgoing, introverted, eager,
enthusiastic, shy, stubborn, inventive, egocentric, fascistic, ethnocentric,
kind, tender, loving, and so on, without the help of the existing
measures that purport to discriminate between such types of
personalities. Teachers will be better off relying on their own
intuitions based on a compassionate and kind-hearted interest in
their students.
2 In spite of the now well-known weaknesses of the Rorschach (Jensen, 1965) and the
TAT (Anastasi, 1976), Ervin (1955) used the TAT to draw conclusions about contrasts
between bilingual subjects' performances on the test in each of their two languages.
The variables on which she claimed they differed were such things as 'physical
aggression', 'escaping blame', 'withdrawal', and 'assertions of independence' (p. 391).
She says, 'it was concluded that there are systematic differences in the content of speech
of bilinguals according to the language being spoken, and that the differences are
probably related to differences in social roles and standards of conduct associated with
the respective language communities' (p. 391). In view of recent studies on the validity
of the TAT (for instance, see the remarks by Anastasi, 1976, pp. 565-587), it is doubtful
that Ervin's results could be replicated. Perhaps someone should attempt a similar
study to see if the same pattern of results will emerge. However, as long as the validity
of the TAT and other projective techniques like it is in serious doubt, the results
obtained are necessarily insecure. Moreover, it is not only the validity of such
techniques as measures of something in particular that is in question - but their
reliability as measures of anything at all is in question.
KEY POINTS
1. It is widely believed that attitudes are factors involved in the causation of
behavior and that they are therefore important to success or failure in
schools.
2. Attitudes toward self and others are probably among the most
important.
3. Attitudes cannot be directly observed, but must be inferred from
behavior.
4. Usually, attitude tests (or attempts to measure attitudes) consist of
asking a subject to say how he feels about or would respond to some
hypothetical situation (or possibly merely a statement that is believed to
characterize the attitude in question in some way).
5. Although attitude and personality research (according to Buros, 1970)
received more attention from 1940-1970 than any other area of testing,
attitude and personality measures are generally the least valid sort of
tests.
6. One of the essential ingredients of successful research that seems to have
been lacking in many of the attitude studies is the readiness to entertain
multiple hypotheses instead of a particular favored viewpoint.
7. Appeal to the label on a particular 'attitude measure' is not satisfactory
evidence of validity.
8. There are many ways of assessing the validity of a proposed measure of a
particular hypothetical construct such as an attitude or motivational
orientation. Among them are the development of multiple methods of
assessing the same construct and checking the pattern of correlation
between them; checking the attitudes of groups known to behave
differently toward the attitude object (e.g., the institutionalized church,
or possibly the schools, or charitable organizations); and repeated
combinations of the foregoing.
9. For some attitudes or personality traits there may be no adequate
behavioral criterion. For instance, if a person says he is angry, what
behavioral criterion will unfailingly prove that he is not in fact scared
instead of angry?
10. Affective variables that relate to the ways people interact through
language encompass a very wide range of conceivable affective variables.
11. It has been widely claimed that affective variables play an important part
in the learning of second languages. The research evidence, however, is
often contradictory.
12. Guiora, et al tried to predict the ease with which a new system of
pronunciation would be acquired on the basis of a purported measure of
empathy. Their arguments, however, probably have more intuitive
appeal than the research justifies.
13. In two cases, groups scoring lower on Guiora's test of empathy did better
in acquiring a new system of pronunciation than did groups scoring
higher on the empathy test. This directly contradicts Guiora's
hypothesis.
14. Brown (1973) and others have reported that self-concept may be an
important factor in predicting success in learning a foreign or second
language.
15. Gardner and Lambert have argued that integratively motivated learners
should perform better in learning a second language than instrumentally
motivated learners. The argument has much appeal, but the data confirm
every possible outcome - sometimes integratively motivated learners do
excel, sometimes they do not, and sometimes they lag behind
instrumentally motivated learners. Could the measures of motivation be
invalid? There are other possibilities.
16. The fact that attitude scales require subjects to be honest about
potentially damaging information makes those scales suspect of a
built-in distortion factor (even if the subjects try to be honest).
17. A serious moral problem arises in the valuation of the scales. Who is to
judge what constitutes a prejudicial view? An ethnocentric view? A bias?
Voting will not necessarily help. There is still the problem of determining
who will be included in the vote, or who is a member of which ethnic,
racial, national, religious, or non-religious group. It is known that the
meaning of responses to a particular scale is essentially uninterpretable
unless the value of the scale can be determined in advance.
18. Attitude measures are all necessarily indirect measures (if indeed they
can be construed as measures at all). They may, however, be
straightforward questions such as 'Are you prejudiced?' or they may be
cleverly designed scales that conceal their true purpose - for example,
the F, or Fascism Scale by Adorno, et al.
19. A possible interpretation of the pattern of correlations for items on the F
Scale and the E Scale (Adorno, et al, and Gardner and Lambert) is
response set. Since all of the statements are keyed in the same direction, a
tendency to respond consistently though independently of item content
would produce some correlation among the items and overlap between
the two scales. However, the correlations might have nothing to do with
fascism or ethnocentrism, which is what the two scales purport to
measure.
20. It has been suggested that the emotive tone of statements included in
attitude questionnaires may be as important to the responses of subjects
as is the factive content of those same statements.
21. As long as the validity of purported attitude measures is in question, their
pattern of interrelationship with any other criteria (say, proficiency
attained in a second language) remains essentially uninterpretable.
22. Concerning the sense of lostness presumably measured by the Anomie
Scale, all possible predictions have been made in relation to attitudes
toward minorities and outgroups. The scale is moderately correlated
with the E and F Scales in the Gardner and Lambert data, which may
merely indicate that the Anomie Scale too is subject to a response set
factor.
23. Experts rarely recommend the use of personality inventories as a basis
for decisions that will affect individuals.
24. A study of anxiety by Chastain (1975) revealed conflicting results: for
one group higher anxiety was correlated with better instead of worse
grades.
25. Patterns of correlation between scales of the semantic differential type as
applied to attitude measurement indicate some possible validity.
Clusters of variables that are distilled by factor analytic techniques show
semantically similar scales to be more closely related to each other than
to semantically dissimilar scales.
26. Attitude scales for children of different ethnic backgrounds have recently
been developed by Zirkel and Jackson (1974). No validity or reliability
information is given in the Test Manual. Researchers and others are
cautioned to use the scales only for group assessment - not for decisions
affecting individual children.
27. There may be no way to determine the extent to which attitudes cause
behavior or are caused by it, or both. The nature of the relationship may
differ according to context.
28. There is some evidence that attitudes may be more closely associated
with second language attainments in contexts that are richer in
opportunities for communication in the target language than in contexts
that afford fewer opportunities for such interaction.
29. In view of all of the research, teachers are probably better off relying on
their own compassionate judgements than on even the most highly
researched attitude measures.
DISCUSSION QUESTIONS
1. Reflect back over your educational experience. What factors would you
identify as being most important in your own successes and failures in
school settings? Consider the teachers you have studied under. How
many of them really influenced you for better or for worse? What specific
events can you point to that were particularly important to the
inspirations, discouragements, and day-to-day experiences that have
characterized your own education? In short, how important have
attitudes been in your educational experience and what were the causes
of those attitudes in your judgement?
2. Read a chapter or two from B. F. Skinner's Verbal Behavior (1957) or
Beyond Freedom and Dignity (1971) or, better yet, read all of his writings
on one or the other topic. Discuss his argument about the dispensability
of intervening variables such as ideas, meaning, attitudes, feelings, and
the like. How does his argument apply to what is said in Chapter 5 of this
book?
3. What evidences would you accept as indicating a belligerent attitude? A
cheerful outgoing personality? Can such evidences be translated into
more objective or operational testing procedures?
4. Discuss John Platt's claims about the need for disproof in science. Can
you make a list of potentially falsifiable hypotheses concerning the
nature of the relationship between attitudes and learning? What would
you take as evidence that a particular view had indeed been disproved?
Why do you suppose that disproofs are so often disregarded in the
literature on attitudes? Can you offer an explanation for this? Are
intuitions concerning the nature and effects of attitudes apt to be less
reliable than tests which purport to measure attitudes?
5. Pick a test that you know of that is used in your school and look it up in
the compendia by Buros (see the bibliography at the end of this book).
\
146
,
MEASURING ATTITUDES AND MOTIVATIONS
LANGUAGE TESTS AT SCHOOL
-
6.
7.
8.
9.
10.
11.
12.
13.
./
What do the reviewers say concerning the test? Is it relial;lle? Does it have
a substantial degree ofvalidity? How is it used in your school? How does
the use of the test affect decisions concerning children in the school?
Check the school files to see what sorts of data are recorded there on the
results of personality inventories (The Rorschach? The Thematic
Apperception Test ?). Do a correlation study to determine what is the
degree of relationship between scores on available personality scales and
other educationall!leasures.
.
7. Discuss the questions posed by Cooper and Fishman (p. 118f, this
chapter). What would you say they reveal about the state of knowledge
concerning the nature and effects of language attitudes and their
measurement?
8. Consider the moral problem associated with the valuation of attitude
scales: Brodkey and Shore (1976) say that 'A student seems to exhibit an
enjoyment of writing for its own sake, enjoyment of solitary work, a
rejection of Puritanical constraints, a good deal of self-confidence, and
sensitivity in personal relationships' (p. 158). How would you value the
statements on the Q-sort (p. 130 above) with respect to each of the
attitudinal or personality constructs offered by Brodkey and Shore?
What would you recommend the English Tutorial Program do with
respect to subjects who perform 'poorly' by someone's definition on the
Q-sort? The Native American? The subjects over 30? Orientals?
9. Discuss with a group the valuation of the scale on the statement: 'Human
nature being what it is, there will always be war and conflict.' Do you
believe it is moral and just to say that a person who agrees strongly with
this statement is to that extent fascistic or authoritarian or prejudiced?
(See the F Scale, p. 122f.)
10. Consider the meaning of disagreement with the statement: 'In this
country, it's whom you know, not what you know, that makes for
success.' What are some of the bases of disagreement? Suppose subjects
think success is not possible? Will they be apt to agree or disagree with
the statement? Is their agreement or disagreement characteristic of
Anomie in your view? Suppose a person feels that other factors besides
knowing the right people are more important to success. What response
would you expect him to give to the above statement on the Anomie
scale? Suppose a person felt that success was inevitable for certain types
of people (such a person might be regarded as a fatalist or an unrealistic
optimist). What response would you predict? What would it mean?
11. Pick any statement from the list given from the Q-sort on p. 130.
Consider whether in your view it is indicative of a person who is likely to
be a good student or not a good student. Better yet, take all of the
statements given and rank them from most characteristic of good
students to least characteristic. Ask a group of teachers to do the same.
Compare the rank orderings for degree of agreement. (The sketch
following these questions illustrates one such comparison.)
12. Repeat the procedure of question 11 with the extremes of the scale
defined in terms of Puritanism (or, say, positive self concept) versus
non-Puritanism (or, say, negative self concept) and again compare the
rank orders achieved.
13. Consider the response you might give to a statement like: 'While taking
an important examination, I perspire a great deal.' What factors might
influence your degree of agreement or disagreement independently of the
degree of anxiety you may or may not feel when taking an important
examination? How about room temperature? Body chemistry (some
people normally perspire a great deal)? Your feelings about such a
subject? Your feelings about someone who is brazen enough to ask such
a socially obtuse question? The humor of mopping your brow as you
mark the spot labeled 'never' on the questionnaire?
14. What factors do you feel enter into the definition of empathy? Which of
those are potentially measurable by the MME? (See the discussion in the
text on pp. 133-4.)
15. How well do you think semantic differential scales, or Likert-type
(agree-disagree) scales in general, can be understood by subjects who
might be tested with such techniques? Consider the happy-sad faces and
the neutral response of 'I don't know' indicated in Figure 4. Will children
who are non-literate (pre-readers) be able to perform the task
meaningfully in your judgement? Try it out on some relatively
non-threatening subject such as recess (play-time, or whatever it is called
at your school) compared against some attitude object that you know the
children are generally less keen about (e.g., arithmetic? reading? sitting
quietly?). The question is whether the children can do the task in a way
that reflects their known preferences (if indeed you do know their
preferences).
16. How reliable and consistent are self-reports about skills, feelings,
attitudes, and the like in general? Consider yourself or persons whom
you know well, e.g., a spouse, sibling, room-mate, or close friend. Do you
usually agree with a person's own assessment of how he feels about such
and such? Are there points of disagreement? Do you ever feel you have
been right in assessing the feelings of others who claimed to feel
differently than you believed they were actually feeling? Has someone
else ever enlightened you on how you were 'really' feeling? Was he
correct? Ever? Do you ever say everything is fine when in fact it is lousy?
What kinds of social factors might influence such statements? Is it a
matter of honesty or kindness or both or neither in your view?
17. Suppose you have the opportunity to influence a school board or other
decision-making body concerning the use or non-use of personality tests
in schools. What kinds of decisions would you recommend?
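For readers who wish to try the small studies suggested in questions 6 and 11 above, the following sketch shows one way the computations might be set up. It is only an illustration, not part of the original text: the scores and rankings are invented, and the choice of the Pearson coefficient (for question 6) and the Spearman rank-order coefficient (for question 11) is an assumption about what 'degree of relationship' and 'degree of agreement' would mean in practice.

    import math

    def pearson(xs, ys):
        # Pearson product-moment correlation between two score lists.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    def spearman(ranks_a, ranks_b):
        # Spearman rank-order correlation between two rankings of the
        # same statements (assumes no tied ranks).
        n = len(ranks_a)
        d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
        return 1 - (6 * d2) / (n * (n ** 2 - 1))

    # Invented scores: a personality scale against some other educational measure.
    personality = [52, 47, 63, 58, 49, 71, 55]
    achievement = [61, 50, 70, 64, 48, 75, 59]
    print(pearson(personality, achievement))

    # Invented rankings of ten Q-sort statements by two teachers.
    teacher_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    teacher_b = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
    print(spearman(teacher_a, teacher_b))

A high positive coefficient in either case would indicate substantial agreement; coefficients near zero would suggest that the two measures, or the two judges, have little in common.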
SUGGESTED READINGS
1. Anne Anastasi, 'Personality Tests,' Part 5 of the book by the same
author entitled Psychological Testing (4th ed). New York: Macmillan,
1976, 493-621. (Much of the information contained in this very
thorough book is accessible to the person not trained in statistics and
research design. Some of it is technical, however.)
2. H. Douglas Brown, 'Affective Variables in Second Language
Acquisition,' Language Learning 23, 1973, 231-44. (A thoughtful
discussion of affective variables that need to be more thoroughly
studied.)
3. Robert L. Cooper and Joshua A. Fishman, 'Some Issues in the Theory
and Measurement of Language Attitudes,' in L. Palmer and B. Spolsky
(eds.) Papers on Language Testing: 1967-1974. Washington, D.C.:
Teachers of English to Speakers of Other Languages, 187-98.
4. John Lett, 'Assessing Attitudinal Outcomes,' in June K. Phillips (ed.)
The Language Connection: From the Classroom to the World: ACTFL
Foreign Language Education Series. New York: National Textbook, in
press.
5. Paul Pimsleur, 'Student Factors in Foreign Language Learning: A
Review of the Literature,' Modern Language Journal 46, 1962, 160-70.
6. Sandra J. Savignon, Communicative Competence. Montreal: Marcel
Didier, 1972.
7. John Schumann, 'Affective Factors and the Problem of Age in Second
Language Acquisition,' Language Learning 25, 1975, 209-35. (Follows
up on Brown's discussion and reviews much of the recent second
language literature on the topic of affective variables - especially the
work of Guiora, et al.)
PART TWO
Theories and Methods
of Discrete Point Testing

Syntactic Linguistics
as a Source for
Discrete Point Methods
A. From theory to practice, exclusively?
B. Meaning-less structural analysis
C. Pattern drills without meaning
D. From discrete point teaching to discrete point testing
E. Contrastive linguistics
F. Discrete elements of discrete aspects of discrete components of discrete skills - a problem of numbers
This chapter explores the effects of the linguistic theory that
contended language was primarily syntax-based - that meaning
could be dispensed with. That theoretical view led to methods of
language teaching and testing that broke language down into ever so
many little pieces. The pieces and their patterns were supposed to be
taught in language classrooms and tested in the discrete items of
discrete sections of language tests. The question is whether language
can be treated in this way without destroying its essence. Humpty
Dumpty illustrated that some things, once they are broken apart, are
exceedingly difficult to put together again. Here we examine the
theoretical basis of taking language apart to teach and test it piece by
piece. In Chapter 8, we will return to the question of just how feasible
this procedure is.
A. From theory to practice, exclusively?
Prevailing theories about the nature of language influence theories
about language learning which in their turn influence ways of
teaching and testing language. As Upshur observed (1969a), the
direction of the influence is usually from linguistic theory to learning
theory to teaching methods and eventually to testing. As a result of
the direction of influence, there have been important time lags -
changes in theory at one end take a long time to be realized in changes
at the other end. Moreover, just as changes in blueprints are easier
than changes in buildings, changes in theories have often been made
without any appreciable change in tests.
Although the chain of influence is sometimes a long and indirect
one, with many intervening variables, it is possible to see the
unmistakable marks of certain techniques of linguistic analysis not
only on the pattern drill techniques of teaching that derive from those
methods of analysis, but also on a wide range of discrete point testing
techniques.
The unidirectional influence from theory to practice is not healthy.
As John Dewey put it many years ago:
That individuals in every branch of human endeavor be
experimentalists engaged in testing the findings of theorists is
the sole guarantee for the sanity of the theorist (1916, p. 442).
Language theorists are not immune to the bite of the foregoing
maxim. In fact, because it is so easy to speculate about the nature of
language, and because it has been such an immensely popular
pastime with philosophers, psychologists, logicians, linguists,
educators and others, theories of language - perhaps more than
other theories - need to be constantly challenged and put to the test in
every conceivable laboratory.
Surely the findings of classroom teachers (especially language
teachers or teachers who have learned the importance of language to
all aspects of an educational curriculum) are as important to the
theories of language as the theories themselves are to what happens in
the classroom. Unfortunately, the direction of influence has been
much too one-sided. Too often the teacher is merely handed untried
and untested materials that some theory says ought to work - too
often the materials don't work at all and teachers are left to invent
their own curricula while at the same time they are expected to
perform the absorbing task of delivering them. It's something like trying
to write the script, perform it, direct the production, and operate the
theater all at the same time. The incredible thing is that some teachers
manage surprisingly well.
If the direction of influence between theory and practice were
mutual, the interaction would be fatal to many of the existing
theories. This would wound the pride of many a theorist, but it would
generally be a healthy and happy state of affairs. As we have seen,
empirical advances are made by disproofs (Platt, 1964, citing Bacon).
They are not made by supporting a favored position - perhaps by
refining a favored position some progress is made, but what progress
is there in empirical research that merely supports a favored view
while pretending that there are no plausible competing alternatives?
The latter is not empirical research at all. It is a form of idol worship
where the theory is enshrined and the pretence of research is merely a
form - a ritual. Platt argues that a theory which cannot be 'mortally
endangered' is not alive. We may add that empirical research that
does not mortally endanger the hypotheses (or theories) to which it is
addressed is not empirical research at all.
How have linguistic theories influenced theories about language
learning and subsequently (or simultaneously in some cases)
methods of language teaching and testing? Put differently, what are
some of the salient characteristics of theoretical views that have
influenced practices in language teaching and testing? How have
discrete point testing methods, in particular, found justification in
language teaching methods and linguistic theories? What are the
most important differences between pragmatic testing methods and
discrete point testing methods?
The crux of the issue has to do with meaning. People use language
to inform others, to get information from others, to express their
feelings and emotions, to analyze and characterize their own
thoughts in words, to explain, cajole, reply, explore, incite, disturb,
encourage, plan, describe, promise, play, and much, much more. The
crucial question, therefore, for any theory that claims to be a theory
of natural language (and as we have argued in Part One, for any test
that purports to assess a person's ability to use language) is how it
addresses this characteristic feature of language - the fact that
language is used in meaningful ways. Put somewhat differently,
language is used in ways that put utterances in pragmatic
correspondences with extra-linguistic contexts. Learning a language
involves discovering how to create utterances in accord with such
pragmatic correspondences.
B. Meaning-less structural analysis
We will briefly consider how the structural linguistics of the
Bloomfieldian era dealt with the question of meaning and then we will
consider how language teaching and eventually language testing
methodologies were subsequently affected. Bloomfield (1933, p. 139)
defined the meaning of a linguistic form as the situation in which the
speaker utters it, and the response it calls forth in the hearer. The
behavioristic motivation for such strict attention to observables will
be obvious to anyone familiar with the basic tenets of behaviorism
(see Skinner, 1953, and 1957). There are two major problems,
however, with such a definition of meaning. For one it ignores the
inferential processes that are always involved in the association of
meanings with linguistic utterances, and for another it fails to take
account of the importance of situations and contexts that are part of
the history of experience that influence the inferential connection of
utterances to meanings.
The greatest difficulties of the Bloomfieldian structuralism,
however, arise not directly from the definition of meaning that
Bloomfield proposed, but from the fact that he proposed to disregard
meaning in his linguistic theory altogether. He argued that 'in order
to give a scientifically accurate definition of meaning we should have
to have a scientifically accurate knowledge of everything in the
speaker's world' (p. 139). Therefore, he contended that meaning
should not constitute any part of a scientific linguistic analysis. The
implication of his definition was that since the situations which
prompt speech are so numerous, the number of meanings of the
linguistic units which occur in them must consequently be so large as
to render their description infeasible. Hence, Bloomfieldian
linguistics tried to set up inventories of phonemes, morphemes, and
certain syntactic patterns without reference to the ways in which
those units were used in normal communication.
What Bloomfield appeared to overlook was the fact that the
communicative use of language is systematic. If it were not, people
could not communicate as they do. While it may be impossible to
describe each of an infinitude of situations, just as it is impossible to
count up to even the lowest order of infinities, it is not impossible in
principle to characterize a generative system that will succeed where
simple enumeration fails. The problem of language description (or
better the characterization of language) is not a problem of merely
enumerating the elements of a particular analytical paradigm (e.g.,
the phonemes or distinctive sounds of a given language). The
problem of characterizing language is precisely the one that
Bloomfield ruled out of bounds - namely, how people learn and use
language meaningfully.
What effect would such thinking have on language teaching and
eventually on language testing? A natural prediction would be that it
ought to lead to a devastating disregard for meaning in the pragmatic
sense of the word. Indeed it did. But, before we examine critically
some of the debilitating effects on language teaching it is necessary to
recognize that Bloomfield's deliberate excision of meaning from
linguistic analyses was neither a short-lived nor a narrowly parochial
suggestion - it was widely accepted and persisted well into the 1970s
as a definitive characteristic of American linguistics. Though
Bloomfield's limiting assumption was certainly not accepted by all
American linguists and was severely criticized or ignored in certain
European traditions of considerable significance,1 his particular
variety of structural linguistics was the one that unfortunately was to
pervade the theories and methods of language teaching in the United
States for the next forty or so years (with few exceptions, in fact, until
the present).

1 For instance, Edward Sapir (1921) was one of the Americans who was not willing to
accept Bloomfield's limiting assumption about meaning. The Prague School of
linguistics in Czechoslovakia was a notable European stronghold which was little
enamored with Bloomfield's formalism (see Vachek, 1966 on the Prague group).
The commitment to a meaning-less linguistic analysis was
strengthened by Zellig Harris (1951) whose own thinking was
apparently very influential in certain similar assumptions of Noam
Chomsky, a student of Harris. Harris believed that it would be
possible to do linguistic analyses on the basis of purely formal criteria
having to do with nothing except the observable relationships
between linguistic elements and other linguistic elements. He said:
The whole schedule of procedures ... which is designed to begin
with the raw data and end with a statement of grammatical
structure, is essentially a twice made application of two major
steps: the setting up of elements and the statement of the
distribution of these elements relative to each other ... The
elements are determined relatively to each other, and on the
basis of the distributional relations among them (Harris, 1951,
p. 61).

There is a problem, however, with Harris's method. How will the
first element be identified? It is not possible to identify an unidentified
element on the basis of its yet to be discovered relationships to other
yet to be discovered elements. Neither is it conceivable, as Harris
recommends (1951, p. 7), that 'this operation can be carried out ...
only if it is carried out for all elements simultaneously'. If it cannot
work for one unidentified element, how can it possibly work for all of
them? The fact is that it cannot work at all. Nor has anyone ever
successfully applied the methods Harris recommended. It is not a mere
procedural difficulty that Harris's proposals run aground on; it is the
procedure itself that creates the difficulty. It is intrinsically
unworkable and viciously circular (Oller, Sales, and Harrington,
1969).

Further, how can unidentified elements be defined in terms of
themselves or in terms of their relationships to other similarly
undefined elements? We will see below that Harris's recommendations
for a procedure to be used in the discovery of the
grammatical elements of a language, however indirectly and through
whatever chain of inferential steps, have been nearly perfectly
translated into procedures for teaching languages in classroom
situations - procedures that work about as well as Harris's methods
worked in linguistics.

Unfortunately, Bloomfield's limiting assumption about meaning
did not end its influence in the writings of Zellig Harris, but persisted
right on through the Chomskyan revolution and into the early 1970s.
In fact, it persists even today in teaching methods and standardized
instruments for assessing language skills of a wide variety of sorts, as
we will see below.

Chomsky (1957) found what he believed were compelling reasons
for treating grammar as an entity apart from meaning. He said:

I think that we are forced to conclude that grammar is
autonomous and independent of meaning (p. 17).

and again at the conclusion to his book Syntactic Structures:

Grammar is best formulated as a self-contained study
independent of semantics (p. 106).

He was interested in

attempting to describe the structure of language with no explicit
reference to the way in which this instrument is put to use (p.
103).
Although Chomsky stated his hope that the syntactic theory he
was elaborating might eventually have 'significant interconnections
with a parallel semantic theory' (p. 103), his early theorizing made no
provision for the fact that words and sentences are used for
meaningful purposes - indeed that fact was considered, only to be
summarily disregarded. Furthermore, he later contended that the
communicative function of language was subsidiary and derivative -
that language as a syntactically governed system had its real essence
in some kind of 'inner totality' (1964, p. 58) - that native speakers of a
language are capable of producing 'new sentences ... that are
immediately understood by other speakers although they bear no
physical resemblance to sentences which are "familiar"' (Chomsky,
1966, p. 4).
The hoped for 'semantic theory' which Chomsky alluded to in
several places seemed to have emerged in 1963 when Katz and Fodor
published 'The Structure of a Semantic Theory'. However, they too
contended that a speaker was capable of producing and understanding indefinitely many sentences that were 'wholly novel to him' (their
italics, p. 481). This idea, inspired by Chomsky's thinking, is an
exaggeration of the creativity of language - or an understatement
depending on how the coin is turned.
If everything about a particular utterance is completely new, it is
not an utterance in any natural language, for one of the most
characteristic facts about utterances in natural languages is that they
conform to certain systematic principles. By this interpretation, Katz
and Fodor have overstated the case for creativity. On the other hand,
for everything about a particular utterance to be completely novel,
that utterance would conform to none of the pragmatic constraints or
lower order phonological rules, syntactic patterns, semantic values
and the like. By this rendering, Katz and Fodor have underestimated
the ability of the language user to be creative within the limits set by
his language.
The fact is that precisely because utterances are used in
communicative contexts in particular correspondences to those
contexts, practically everything about them is familiar - their
newness consists in the fact that they constitute new combinations of
known lexical elements and known sequences of grammatical
categories. It is in this sense that Katz and Fodor's remark can be
read as an understatement. The meaningful use of utterances in
discourse is always new and is constantly a source of information and
meaning that would otherwise remain undisclosed.
The continuation of Bloomfield's limiting assumption about
meaning was made quite clear only in a footnote to An Integrated
Theory of Linguistic Descriptions by Katz and Postal (1964). In spite
of the fact that they claimed to be integrating Chomsky's syntactic
theory with a semantic one, they mentioned in a footnote that 'we
exclude aspects of sentence use and comprehension that are not
explicable through the postulation of a generative mechanism as the
reconstruction of the speaker's ability to produce and understand
sentences. In other words, we exclude conceptual features such as the
physical and sociological setting of utterances, attitudes, and beliefs
of the speaker and hearer, perceptual and memory limitations, noise
level of the settings, etc.' (p. 4). It would seem that in fact they
excluded just about everything of interest to an adequate theory of
language use and learning, and to methods of language teaching and
testing.
It is interesting to note that by 1965, Chomsky had wavered from
his originally strong stand on the separation of grammar and
meaning to the position that 'the syntactic and semantic structure of
natural languages evidently offers many mysteries, both of fact and of
principle, and any attempt to delimit these domains must certainly be
quite tentative' (p. 163). In 1972, he weakened his position still further
(or made it stronger from a pragmatic perspective) by saying, 'it is not
clear at all that it is possible to distinguish sharply between the
contribution of grammar to the determination of meaning, and the
contribution of so-called "pragmatic considerations", questions of
fact and belief and context of utterance' (p. 111).
From the argument that grammar and meaning were clearly
autonomous and independent, Chomsky had come a long way
indeed. He did not correct the earlier errors concerning Bloomfield's
assumption about meaning, but at least he came to the position of
questioning the correctness of such an assumption. It remains to be
seen how long it will take for his relatively recently acquired
skepticism concerning some of his own widely accepted views to filter
back to the methods of teaching and testing for which his earlier views
served as after-the-fact supports if not indeed foundational pillars. It
may not have been Chomsky's desire to have his theoretical thinking
applied as it has been (see his remarks at the Northeast Conference,
1966, reprinted in 1973), but can anyone deny that it has been applied
in such ways? Moreover, though some of his arguments may have
been very badly misunderstood by some applied linguists, his
argument about the autonomy of grammar is simple enough not to be
misunderstood by anyone.
C. Pattern drills without meaning
What effects, then, have the linguistic theories briefly discussed above
had on methods of language teaching and subsequently on methods
of language testing? The effects are direct, obvious, and unmistakable. From meaning-less linguistic analysis comes meaning-less
pattern drill to instill the structural patterns or the distributional
'meanings' of linguistic forms as they are strung together in
utterances. In the Preface (by Charles C. Fries) to the first edition of
English Sentence Patterns (see Lado and Fries, 1957), we read:
The 'grammar' lessons here set forth ... consist basically of
exercises to develop habits ...
What kinds of habits? Exactly the sort of habits that Harris believed
were the essence of language structure and that his 'distributional
analysis' aimed to characterize. The only substantial difference was
that in the famed Michigan approach to teaching English as a second
or foreign language, it was the learner in the classroom who was to
apply the distributional discovery procedure (that is the procedure
for putting the elements of language in proper perspective in relation
to each other). Fries continues:
The habits to be learned consist of patterns or molds in which
the 'words' must be grasped. 'Grammar' from the point of view
of these materials is the particular system of devices which a
language uses to signal one of its various layers of meaning -
structural meaning (...). 'Knowing' this grammar for practical
use means being able to produce and to respond to these signals
of structural meaning. To develop such habits efficiently
demands practice and more practice, especially oral practice.
These lessons provide the exercises for a sound sequence of such
practice to cover a basic minimum of production patterns in
English (p. v).
In his Foreword to the later edition, Lado suggests,

The lessons are most effective when used simultaneously with
English Pattern Practices, which provides additional drill for
the patterns introduced here (Lado and Fries, 1957, p. iii).

In his introduction to the latter mentioned volume, Lado continues,
concerning the Pattern Practice Materials, that

they represent a new theory of language learning, the idea that
to learn a new language one must establish orally the patterns of
the language as subconscious habits. These oral practices are
directed specifically to that end. (His emphasis, Lado and Fries,
1958, p. xv)

... in these lessons, the student is led to practise a pattern,
changing some element of that pattern each time, so that
normally he never repeats the same sentence twice.
Furthermore, his attention is drawn to the changes, which are
stimulated by pictures, oral substitutions, etc., and thus the
pattern itself, the significant framework of the sentence, rather
than the particular sentence, is driven intensively into his habit
reflexes.

It would be false to assume that Pattern Practice, because it
aims at habit formation, is unworthy of the educated mind,
which, it might be argued, seeks to control language through
conscious understanding. There is no disagreement on the value
of having the human mind understand in order to be at its
learning best. But nothing could be more enslaving and
therefore less worthy of the human mind than to have it chained
to the mechanics of the patterns of the language rather than free
to dwell on the message conveyed through language. It is
precisely because of this view that we discover the highest
purpose of pattern practice: to reduce to habit what rightfully
belongs to habit in the new language, so that the mind and the
personality may be freed to dwell in their proper realm, that is,
on the meaning of the communication rather than the
mechanics of the grammar (pp. xv-xvi).

And just how do these pattern practices work? An example or two
will display the principle adequately. For instance, here is one from
Lado and Fries (1957):

Exercise 1c.1. (To produce affirmative short answers.) Answer
the questions with YES, HE IS; YES, SHE IS; YES, IT IS; ... For
example:

Is John busy? YES, HE IS.
Is the secretary busy? YES, SHE IS.
Is the telephone busy? YES, IT IS.
Am I right? YES, YOU ARE.
Are you and John busy? YES, WE ARE.
Are the students homesick? YES, THEY ARE.
Are you busy? YES, I AM.

(Continue:)

1. Is John busy?
2. Is the secretary busy?
3. Is the telephone busy?
4. Are you and John busy?
5. Are the students homesick?
6. Are you busy?
7. Is the alphabet important?
8. Is Mary tired?
9. Is she hungry?
10. Are you tired?
11. Is the teacher right?
12. Are the students busy?
13. Is the answer correct?
14. Am I right?
15. Is Mr. Brown a doctor?

Suppose that the well-meaning student wants to discover the
meaning of the sentences that are presented as Lado and Fries
suggest. How could it be done? How, for instance, will it be possible
for the learner to discover the differences between a phone being busy
and a person being busy? Or between being a doctor and being
correct? Or between being right and being tired? Or between the
appropriateness of asking a question like, 'Are you hungry?' on
certain occasions, but not, 'Is the secretary busy?'

While considering these questions, consider a series of pattern
drills selected more or less at random from a 1975 text entitled From
Substitution to Substance. The authors purport to take the learner
from 'manipulation to communication' (Paulston and Bruder, 1975,
p. 5). This particular series of drills (supposedly progressing from
more manipulative and mechanical types of drills to more meaningful
and communicative types) is designed to teach adverbs that involve
the manner in which something is done as specified by with plus a
noun phrase:

He opened the door with a key.

Model: C [Cue] can/church key
R [Response] He opened the can with a church key.
T [Teacher says] bottle/opener
S [Student responds] He opened the bottle with an opener.
T box/his teeth
S He opened the box with his teeth.
T letter/knife
S He opened the letter with a knife.

M1 [Mechanical type of practice]
Teaching Point: Contrast WITH + N/BY + N

Model: C [Cue] He used a plane to go there.
R [Response] He went there by plane.
C He used his teeth to open it.
R He opened it with his teeth.
T [Teacher] He used a telegram to answer it.
S [Student] He answered it by telegram.
T He used a key to unlock it.
S He unlocked it with a key.
T He used a phone to contact her.
S He contacted her by phone.
T He used a smile to calm them.
S He calmed them with a smile.
T He used a radio to talk to them.
S He talked to them by radio.

M2 [Meaningful drill according to the authors]
Teaching Point: Use of HOW and Manner Adverbials

Model: C [Cue] open a bottle
R1 [one possible response] How do you open a bottle?
R2 [another] (With an opener.)
C finance a car
R1 How do you finance a car?
R2 (With a loan from the bank.) (By getting a loan.)
T [Teacher] light a fire
sharpen a pencil
make a sandwich
answer a question

C [Communicative drill according to the authors]
Teaching Point: Communicative Use

Model: C [Cue] How do you usually send letters to your country?
R [Response] (By airmail.) (By surface mail.)
C How does your friend listen to your problems?
R (Patiently.) (With a smile.)
T [Teacher] How do you pay your bills here?
How do you find your apartment here?
How will you go on your next vacation?
How can I find a good restaurant?
In the immediately foregoing pattern drills the point of the exercise
is specified in each case. The pattern that is being drilled is the only
motivation for the collection of sentences that appears in a particular
drill. That is why at the outset of this section we used the term
'syntactically-based pattern drills'. The drills that are selected are not
exceptional in any case. They are in fact characteristic of the major
texts in English as a second language and practically all of the texts
produced in recent years for the teaching of foreign languages (in
ESL/EFL see National Council of Teachers of English, 1973, Bird
and Woolf, 1968, Nadler, Marelli, and Haynes, 1971, Rutherford,
1968, Wright and McGillivray, 1971, Rand, 1969a, 1969b, and many
others). They all present learners with lists of sentences that are
similar in structure (though not always identical, as we will see
shortly) but which are markedly different in meaning.
How will the learner be able to discover the differences in meaning
between such similar forms? Of course, we must assume that the
learner is not already a native speaker - otherwise there would be no
point in studying ESL/EFL or a foreign language. The native speaker
knows the fact that saying, 'He used a plane to go there,' is a little less
natural than saying, 'He went there in an airplane,' or 'He flew,' but
how will the non-native speaker discover such things on the basis of
the information that can be made available in the drill? Perhaps the
authors of the drill are expecting the teacher to act out each sentence
in some way, or creatively to invent a meaningful context for each
sentence as it comes up. Anyone who has tried it knows that the hour
gets by before you can make it through just a few sentences. Inventing
contexts for sentences like, 'Is the alphabet important?' and 'Is the
telephone busy?' is like trying to write a story where the characters,
the plot, and all of the backdrops change from one second to the next.
It is not just difficult to conceive of a context in which students can be
homesick, alphabets important, the telephone, the secretary, and
John busy, Mary tired, me right, Brown a doctor and so on, but
before you get to page two the task is impossible.
If it is difficult for a native speaker to perform the task of inventing
contexts for such bizarre collections of utterances, why should
anyone expect a non-native speaker who doesn't know the language
to be able to do it? The simple truth is that they cannot do it. It is no
more possible to learn a language by such a method than it is to
analyze a language by the form of distributional analysis proposed by
Harris. It is necessary to get some data in the form of pragmatic
mappings of utterances onto meaningful contexts - failing that it is
not possible either to analyze a language adequately or to learn one at
all.
Worse yet, the typical (not the exceptional, but the ordinary
everyday garden variety) pattern drill is bristling with false leads
about similarities that are only superficial and will lead almost
immediately to unacceptable forms - the learner of course won't
know that they are unacceptable because he is not a native speaker of
the language and has little or no chance of ever discovering where he
went wrong. The bewildered learner will have no way of knowing that
for a person to be busy is not like a telephone being busy. What
information is there to prevent the learner from drawing the
reasonable conclusion that if telephones can be busy in the way that
people can, that televisions, vacuum cleaners, telegraphs, and
typewriters can be busy in the same sense? What will keep the learner
from having difficulty distinguishing the meanings of alphabet,
telephone, secretary, and doctor if he doesn't already know the
meanings of those words? If the learner doesn't already know that a
doctor, a secretary, and a teacher are phrases that refer to people with
different occupational statuses, how will the drill help him to discover
this information? What, from the learner's point of view, is different
about homesick and a doctor? What will prevent the learner from
saying Are the students doctor and Is the alphabet busy?
Meaning would prevent such absurdities, but the nature of the
pattern drill encourages them. It is an invitation to confusion.
Without meaning to help keep similar forms apart, the result is
simple. They cannot be kept apart - they become mixed together
indiscriminately. This is not because learners are lacking in
intelligence, rather it is because they are in fact quite intelligent and
they use their intelligence to classify similar things together and to
keep different things apart. But how can they keep different things
apart when it is only the superficial similarities that have been
constantly called to their attention in a pattern drill?
The drills proposed by Paulston and Bruder are more remarkable
than the ones by Lado and Fries, because the Paulston-Bruder drills
are supposed to become progressively more meaningful - but they do
not. They merely become less obviously structured. The responses
and the stimuli to elicit them don't become any more meaningful. The
responses merely become less predictable as one progresses through
the series of drills concerning each separate point of grammatical
structure.

Who but a native speaker of English will know that opening a door
with a key is not very much like opening a can with a church key? The
two sentences are alike in fact primarily in terms of the way they
sound. If the non-native knew the meanings before doing the drill
there would be no need for doing the drill - but if he does need the
drill it will do him absolutely no good and probably some harm.
What will keep him from saying, He opened the can with a hand? Or,
He opened the bottle with a church key? Or, He opened the car with a
door? Or, He opened the faucet with a wrench? If he can say, He opened
the letter with a letter opener, why not, He opened the box with a box
opener? Or, He opened the plane with a plane opener? If the learner is
encouraged to say, He used a key to unlock it, why not, He used a letter
to write her, or He used a call to phone her?
In the drill above labelled M1, which the authors describe as a
mechanical drill, the object is to contrast phrases like with a knife and
by telegram. Read over the drill and then try to say what will prevent
the learner from coming up with forms like, He went there with a
plane, He contacted her with a phone, He unlocked it by key, He calmed
them by smile, He used radio to talk to them, He used phone to contact
her, He contacted her with telegram, etc.
In the drill labelled M2, which is supposed to be somewhat more
meaningful, additional traps are neatly laid for the non-native
speaker who is helplessly at the mercy of the patterns laid out in the
drill. When asked how to light a fire, sharpen a pencil, make a
sandwich, or answer a question, he has still fresh in his memory the
keyed answers to the question about how to open a bottle or finance a
car. He has heard that you can finance a car by getting a loan or that
you can open a bottle with an opener. What is to prevent the
unsuspecting and naive learner from saying that you can open a
bottle by getting an opener - structurally the answer is flawless and in
line with what he has just been taught. The answer is even creative.
But pragmatically it is not quite right. Because of a quirk of the
language that native speakers have learned, getting an opener does
not imply using it, though with an opener in response to the question
How do you open a bottle? does imply the required use of the opener.
What will keep the learner from creatively inventing forms like, by
a match in response to the question about starting a fire? Or by a
pencil sharpener in answer to how you sharpen a pencil? Would it not
be perfectly reasonable if when asked How do you answer a question?
the learner replied with an answer? or by answering?
The so-called 'Communicative Use' drill offers even more
interesting traps. How do you send your letters? By an airplane of
course. Or sometimes I send them with a ship. When I'm in a hurry
though I always send them with a plane. How does your friend listen
to your problems? By patience mostly, but sometimes he does so by
smiling. Bills? Well, I almost always bill them by mail or with a car -
sometimes in an airmail. My apartment? Easy. I found it by a friend.
We went there with a car. My next vacation? With an airplane. My
girl friend is wanting to go too, but she goes by getting a loan with a
bank. A good restaurant? You can go with a taxi.
Is there any need to say more? Is there an English teacher alive
anywhere who cannot write reams on the topic? What then, Oh
Watchman of the pattern drill? The pattern drill without meaning, my
Son, is as a door opening into darkness and leading nowhere but to
confusion.
If the preceding examples were exceptional, there might be reason
to hope that pattern drills of the sort illustrated above might be
transformed into more meaningful exercises. There is no reason for
such a hope, however. Pattern drills which are unrelated and
intrinsically unrelatable to meaningful extralinguistic contexts are
confusing precisely because they are well written - that is, in the sense
that they conform to the principles of the meaning-less theory of
linguistic analysis on which they were based. They are unworkable as
teaching methods for the same reason that the analytical principles
on which they are based are unworkable as techniques of linguistic
analysis. The analytical principles that disregard meaning are not just
difficult to apply, but they are fundamentally inapplicable to the
objects of interest - namely, natural languages.
D. From discrete point teaching (meaning-less pattern drills) to
discrete point testing
One might have expected that the hyperbole of meaning-less
language was fully expressed in the typical pattern drills that
characterized the language teaching of the 1950s and to a lesser extent
is still characteristic of most published materials today. However, a
further step toward complete meaninglessness was possible and was
advocated by two leading authorities of the 1960s. Brooks (1964) and
Morton (1960, 1966) urged that the minds of the learners who were
manipulating the pattern drills should be kept free and unencumbered by the meanings of the forms they were practicing. Even
Lado and Fries (1957, 1958) at least argued that the main purpose of
pattern drills was not only to instill 'habits' but was to enable learners
to say meaningful things in the language. But, Brooks and Morton
developed the argument that skill in the purely manipulative use of
the language, as taught in pattern drills, would have to be fully
mastered before proceeding to the capacity to use the language for
communicative purposes. The analogy offered was the practicing of
scales and arpeggios by a novice pianist, before the novice could hope
to join in a concerto or to use the newly acquired habits expressively.
Clark (1972) apparently accepted this two-stage model in relation
to the acquisition of listening comprehension in a foreign language.
Furthermore, he extended the model as a justification for discrete
point and integrative tests:
Second-level ability cannot be effectively acquired unless
first-level perception of grammatical cues and other formal
interrelationships among spoken utterances has become so
thoroughly learned and so automatic that the student is able to
turn most of his listening attention to 'those elements which
seem to him to contain the gist of the message' (Rivers, 1967, p.
193, as quoted by Clark, 1972, p. 43).
Clark continues:
Testing information of a highly diagnostic type would be useful
during the 'first stage' of instruction, in which sound
discriminations, basic patterns of spoken grammar, items of
functional vocabulary, and so forth were being formally taught
and practised.... As the instructional emphasis changes from
formal work in discrete aspects to more extensive and less
controlled listening practice, the utility (and also the possibility)
of diagnostic testing is reduced in favor of evaluative
procedures which test primarily the students' comprehension of
the 'general message' rather than the apprehension of certain
specific sounds or sound patterns (p. 43).
How successful has this two-stage dichotomy proved to be in
language teaching? Stockwell and Bowen hinted at the core of the
difficulty in their introduction to Rutherford (1968):

The most difficult transition in learning a language is going
from mechanical skill in reproducing patterns acquired by
repetition to the construction of novel but appropriate
sentences in natural social contexts. Language teachers ... not
infrequently ... fumble and despair, when confronted with the
challenge of leading students comfortably over this hurdle (p.
x).

What if the hurdle were an unnecessary one? What if it were a mere
artefact of the attempt to separate the learning of the grammatical
patterns of the language from the communicative use of the
language? If we asked how often children are exposed to meaningless
non-contextualized language of the sort that second language
learners are so frequently expected to master in foreign language
classrooms, the answer would be, never. Are pattern drills, therefore,
necessary to language learning? The answer must be that they are
not. Further, pattern drills of the non-contextualized and
non-contextualizable variety are probably about as confusing as they
are informative.

If, as we have already seen above, pattern drills are associated with
the 'first stage' of a two-stage process of teaching a foreign language
and if the so-called 'diagnostic tests' (or discrete point tests) are also
associated with that first stage, it only remains to show the connection
between the pattern drills and the discrete point items themselves.
Once this is accomplished, we will have illustrated each link in the
chain from certain linguistic theories to discrete point methods of
language testing.

Perhaps the area of linguistic analysis which developed the most
rapidly was the level of phonemics. Accordingly, a whole tradition of
pattern drills was created. It was oriented toward the teaching of
'pronunciation', especially the minimal phonemic contrasts of
various target languages. For instance, Lado and Fries (1954)
suggested:

A very simple drill for practicing the recognition of ...
distinctive differences can be made by arranging minimal pairs
of words on the blackboard in columns thus:

(The words they used were offered in phonetic script but are
presented here in their normal English spellings.)

man  men
lass  less
lad  led
pan  pen
bat  bet
sat  set
The authors continue:
The teacher pronounces pairs of words in order to make the
student aware of the contrast. When the teacher is certain that
the students are beginning to hear these distinctions he can then
have them actively participate in the exercise (p. iv).
In a footnote the reader is reminded:
Care must be taken to pronounce such contrasts with the same
intonation on both words so that the sole difference between the
words will be the sound under study (op. cit.).
It is but a short step to test items addressed to minimal
phonological contrasts. Lado and Fries point out in fact that a
possible test item is a picture of a woman watching a baby versus a
woman washing a baby. In such a case, the examinee might hear the
statement, The woman is washing the baby, and point to or otherwise
indicate the picture to which the utterance is appropriate.
Harris (1969) observes that the minimal pair type of exercise, of the
sort illustrated above, 'is, in reality, a two-choice "objective test", and
most sound discrimination tests are simply variations and expansions
of this common classroom technique' (pp. 32-3). Other variations
for which Harris offers examples include heard pairs of words where
the learner (or in this case, the examinee) must indicate whether the
two words are the same or different; or a heard triplet where the
learner must indicate which of three words (e.g., jump, chump, jump)
was different from the other two; or a heard sentence in which either
member of a minimal pair might occur (e.g., It was a large ship,
versus, It was a large sheep) where the examinee must indicate either a
picture of a large ship or a large sheep depending on what was heard.
Harris refers to the last case as an example of testing minimal pairs 'in
context' (pp. 33-4). It is not difficult to see, however, that the types of
contexts in which one might expect to find both ships and sheep are
relatively few in number - certainly a very small minority of possible
contexts in which one might expect to find ships without sheep or
sheep without ships.
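The mechanical character of such two-choice 'objective tests' can be made vivid by writing the item types just described as data plus a trivial scoring rule. The sketch below is ours, not Harris's: the item contents follow the examples in the text, but the representation and the scoring convention are invented for illustration.

    # Discrimination item types of the sort Harris (1969) describes,
    # reduced to data and a one-line scorer (illustrative only).

    same_different_item = {
        "heard": ("ship", "sheep"),          # examinee hears a pair of words
        "key": "different",                  # and marks 'same' or 'different'
    }

    odd_one_out_item = {
        "heard": ("jump", "chump", "jump"),  # which of the three differs?
        "key": 2,                            # the second word
    }

    def score(item, answer):
        # Discrete point scoring: a response is simply right or wrong.
        return 1 if answer == item["key"] else 0

    print(score(same_different_item, "different"))  # 1
    print(score(odd_one_out_item, 2))               # 1

Nothing in such a scorer knows or cares whether ships or sheep are at issue; the item taps only the ability to hear one sound contrast in isolation, which is precisely the limitation discussed above.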
Vocabulary teaching by discrete point methods also leads rather
directly to discrete point vocabulary tests. For instance, Bird and
Woolf (1968) include a substitution drill set in a sentence frame of
That's a ___, or This is a ___ with such items as chair, pencil,
table, book, and door. It is a short step from such a drill to a series of
corresponding test items. For example, Clark (1972) suggests a test
item where a door, chair, table, and bed are pictured. Associated with
each picture is a letter which the student may mark on an answer sheet
for easy scoring. The learner hears in French, Voici une chaise, and
should correspondingly mark the letter of the picture of the chair on
the answer sheet.
Other item types, suggested by Harris (1969) include: a word
followed by several brief definitions from which the examinee must
select the one that corresponds to the meaning of the given word; a
definition followed by several words from which the examinee must
select the one closest in meaning to the given definition; a sentence
frame with an underlined word and several possible synonyms from
which the examinee must choose the best alternative; and a sentence
frame with a blank to be filled by one of several choices and where all
but one of the choices fail to agree with the meaning requirements of
the sentence frame.
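All four of the vocabulary formats just listed share a single shape: a stem, a set of choices, and one keyed answer. A sketch of that shape (with invented content, since Harris's own items are not reproduced here) makes the family resemblance plain.

    # One data shape covers all four of the vocabulary formats attributed
    # to Harris (1969): stem plus choices plus a single keyed answer.
    # The item content below is invented for illustration.

    vocabulary_item = {
        "stem": "The lecture was tedious.",   # frame with the tested word
        "choices": ["boring", "brief", "loud", "useful"],
        "key": 0,                             # index of the keyed synonym
    }

    def score(item, choice_index):
        return 1 if choice_index == item["key"] else 0

    print(score(vocabulary_item, 0))  # 1

Whether the stem is a word, a definition, or a sentence frame, the scoring logic is identical, which is one reason such items are so easy to mass produce.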
Test items of the discrete point type aimed at assessing particular
grammatical rules have often been derived directly from pattern drill
formats. For example, in their text for teaching English as a foreign
language in Mali, Bird and Woolf (1968) recommend typical
transformation drills from singular statements to plural ones (e.g., Is
this a book? to Are these books? and reverse, see p. 14a); from
negative to negative interrogative (e.g., John isn't here, to Isn't John
here? see p. 83); from interrogative to negative interrogative (e.g., Are
we going? to Aren't we going? see p. 83); statement to question (e.g.,
He hears about Takamba, to What does he hear about? see p. 131); and
so forth.
There are many other types of possible drills in relation to syntax,
but the fact that drills of this type can and have been translated more
or less directly into test items is sufficient perhaps to illustrate the
trend. Spolsky, Murphy, Holm, and Ferrel (1972, 1975) give
examples of test items requiring transformations from affirmative
form to negative, or to question form, from present to past, or from
present to future as part of a 'functional test of oral proficiency' for
adult learners of English as a second language.
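The directness of the drill-to-item connection is easy to see if the transformation formats just cited are written out as stimulus, instruction, and keyed answer. The triples below use the text's own examples from Bird and Woolf (1968); the representation and the exact-match scoring are, again, invented for illustration.

    # Transformation items derived from pattern drill formats: each item
    # pairs a stimulus and an instruction with a single keyed answer.
    transformation_items = [
        ("Is this a book?",  "plural",                 "Are these books?"),
        ("John isn't here.", "negative interrogative", "Isn't John here?"),
        ("Are we going?",    "negative interrogative", "Aren't we going?"),
    ]

    def score(response, key):
        # Discrete point scoring: only the exact keyed form counts.
        return 1 if response.strip() == key else 0

    print(score("Are these books?", transformation_items[0][2]))  # 1

Note that such scoring recognizes only the keyed surface form; a meaningful but differently worded response would be counted wrong, which is just the discrete point difficulty this chapter has been describing.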
Many other examples could be given illustrating the connection
between discrete point teaching and discrete point testing, but the
foregoing examples should be enough to indicate the relationship,
which is simple and fairly direct. Discrete point testing derives from
the pattern drill methods of discrete point teaching and is therefore
subject to many of the same difficulties.
E. Contrastive linguistics
One of the strongholds of the structural linguistics of the 1950s and
perhaps to a lesser extent the 1960s was contrastive analysis. It has
had less influence on work in the teaching of English as a second
language in the United States than it has had on the teaching of
English as a foreign language and the teaching of other foreign
languages. There is no way to apply contrastive analysis to the
preparation of materials for teaching English as a second language
when the language backgrounds of the students range from
Mandarin Chinese, to Spanish, to Vietnamese, to Igbo, to German,
etc. It would be impossible for any set of materials to take into
account all of the contrasts between all of the languages that are
represented in many typical college level classes for ESL in the U.S.
However, the claims of contrastive analysis are still relatively strong
in the teaching of foreign languages and in recent years have been
reasserted in relation to the teaching of the majority variety of
English as a second dialect to children who come to school speaking
some other variety.
The basic idea of contrastive analysis was stated by Lado (1957). It
is

the assumption that we can predict and describe the patterns
that will cause difficulty in learning, and those that will not
cause difficulty, by comparing systematically the language and
culture to be learned with the native language and culture of the
student. In our view, the preparation of up-to-date pedagogical
and experimental materials must be based on this kind of
comparison (p. vii).
Similar claims were offered by Politzer and Staubach (1961), Lado
(1964), Strevens (1965), Rivers (1964, 1968), Barrutia (1969), and
Bung (1973). All of these authors were concerned with the teaching of
foreign languages.
More recently, the claims of contrastive analysis have been
extended to the teaching of reading in the schools. Reed (1973) says,
the more 'radically divergent' the non-standard dialect (i.e., the
greater the structural contrast and historical autonomy vis-a-vis
standard English), the greater the need for a second language
strategy in teaching Standard English (p. 294).
Farther on she reasons that unless the learner is
enabled to bring to the level of consciousness, i.e., to formalize
his intuitions about his dialect, it is not likely that he will come
to understand and recognize the systematic points of contrast
and interference between his dialect and the Standard English
he must learn to control (p. 294).
Earlier, Goodman (1965) offered a similar argument based on
contrastive analysis. He said,
the more divergence there is between the dialect of the learner
and the dialect of learning, the more difficult will be the task of
learning to read (as cited by Richards, 1972, p. 250). (Goodman
has since changed his mind; see Goodman and Buck, 1973.)
If these remarks were expected to have the same sort of effects on
language testing that other notions concerning language teaching
have had, we should expect other suggestions to be forthcoming
about related (in fact derived) methods of language testing. Actually,
the extension to language testing was suggested by Lado (1961) in
what was perhaps the first major book on the topic. He reasoned that
language "tests should focus on those points of difference between the
language of the learners and the target language. First, the :linguistic
problems' were to be determined by a 'contrastive analysis' of the
structures of the native and target languages. Then,
the test ... will have to choose a few sounds and a few structures
at random hoping to give a fair indication of the general
achievement of the student (p. 28).
More recently, testing techniques that focus on discrete points of
difference between two languages or two dialects have generally
fallen into disfavor. Exceptions however can be found. One example
is the test used by Politzer, Hoover, and Brown (1974) to assess degree
of control of two important dialects of English. Such items of
difference between the majority variety of English and the minority
variety at issue included the marking of possessives (e.g., John's house
versus John house), and the presence or absence of the copula in the
surface form (e.g., He goin' to town versus He is going to town).
Interestingly, except for the manifest influence of contrastive analysis
and discrete point theory on the scoring of the test used by Politzer et al., it could be construed as a pragmatic test (i.e., it consisted of a
sequential text where the task set the learner was to repeat sequences
of material presented at a conversational rate).
Among the most serious difficulties for tests based on contrastive
linguistics is that they should be suited (in theory at least) for only one
language background - namely, the language on which the contrastive analysis was performed. Upshur (1962) argues that this
very fact results in a peculiar dilemma for contrastively based tests:
either the tests will not differentiate ability levels among students with
the same native language background and experience, 'or the
contrastive analysis hypothesis is invalid' (p. 127). Since the purpose
of all tests is to differentiate success and failure or degrees of one or
the other, any contrastively based test is therefore either not a test or
not contrastively based. A more practical problem for contrastively
based tests is that learners from different source languages (or
dialects) would require different tests. If there are many source
languages contrastively based tests become infeasible.
Further, from the point of view of pragmatic language testing as
discussed in Part One above, the contrastive analysis approach is
irrelevant to the determination of what constitutes an adequate
language test. It is an empirical question as to whether tests can be
devised which are more difficult for learners from one source
language than for learners from other source languages (where
proficiency level is a controlled variable). Wilson (in press) discusses
this problem and presents some evidence suggesting that contrastive
analysis is not helpful in determining which test items will be difficult
for learners of a certain language background.
In view of the fact that contrastive analysis has proved to be a poor
basis for predicting errors that language learners will make, or for
hierarchically ranking points of structure according to their degree of
difficulty, it seems highly unlikely that it will ever provide a
substantial basis for the construction of language tests. At best,
contrastive analysis provides heuristics only for certain discrete point
test items. Any such items must then be subjected to the same sorts of
validity criteria as are any other test items.
F. Discrete elements of discrete aspects of discrete components of discrete skills - a problem of numbers
One of the serious difficulties of a thoroughgoing analytical model of
discrete point testing is that it generates an inordinately large number
of tests. If as Lado (1961) claimed, we 'need to test the elements and the skills separately' (p. 28), and if as he further argued we need separate tests for supposedly separate components of language ability, and for both productive and receptive aspects of those
components, we wind up needing a very large number of tests indeed.
It might seem odd to insist on pronunciation tests for speaking and
separate pronunciation tests for listening, but Lado did argue that the
'linguistic problems' to be tested 'will differ somewhat for production
and for recognition' and that therefore 'different lists are necessary to
test the student's pronunciation in speaking and in listening' (p. 45).
Such distinctions, of course, are also common to tests in speech pathology.
Hence, what is required by discrete point testing theory is a set of items for testing the elements of phonology, lexicon (or vocabulary), syntax, and possibly an additional component of semantics
(depending on the theory one selects) times the number of aspects one
recognizes (e.g., productive versus receptive) times the number of
separate skills one recognizes (e.g., listening, speaking, reading,
writing, and possibly others).
In fact, several different models have been proposed. Figure 5
below illustrates the componential analysis suggested by Harris
(1969); Figure 6 shows the analysis suggested by Cooper (1972);
Figure 7 shows a breakdown offered by Silverman, Noa, Russell, and Molina (1976); and finally, Figure 8 shows a slightly different model
of discrete point categories proposed by Oller (1976c) in a discussion
of possible research recommended to test certain crucial hypotheses
generated by discrete point and pragmatic theories of language
testing.
Harris's model would require (in principle) sixteen separate tests
or subtests; Cooper's would require twenty-four separate tests or
subtests; the model of Silverman et al would require sixteen; and the model of Oller would require twelve.
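The arithmetic behind these totals is a simple cross-product, and it is worth making explicit. The sketch below (in Python) shows the multiplication; the particular factorizations are illustrative readings chosen to reproduce the totals just cited, not breakdowns given by the model builders themselves:

    def tests_needed(components: int, aspects: int, skills: int) -> int:
        # Discrete point logic: one separate test (or subtest) per cell
        # of the components x aspects x skills cross-product.
        return components * aspects * skills

    print(tests_needed(4, 1, 4))   # 16, e.g., four components x four skills (Harris)
    print(tests_needed(3, 2, 4))   # 24, e.g., three components x two aspects x four skills (Cooper)
    print(tests_needed(3, 1, 4))   # 12, e.g., three components x four skills (Oller)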
What classroom teacher has time to develop so many different
tests?

[Figure 5. Componential breakdown of language proficiency proposed by Harris (1969, p. 11): the language skills of listening, speaking, reading, and writing crossed with the components of phonology/orthography, structure, vocabulary, and rate and general fluency.]

[Figure 6. Componential analysis of language skills as a framework for test construction, from Cooper (1968, 1972, p. 337): knowledge of phonology, syntax, and semantics crossed with listening, speaking, reading, and writing.]

There are other grave difficulties in the selection of items for
each test and in determining how many items should appear in each
test, but these are discussed in Chapter 7 section A. The question is,
are all the various tests or subtests necessary? We don't normally use
only our phonology, or vocabulary, or grammar; why must they be taught and tested separately?

[Figure 7. 'Language assessment domains' as defined by Silverman et al (1976, p. 21): receptive and productive modalities (listening, speaking, reading, writing) in auditory/articulatory and visual/manual modes, each crossed with phonology, structure, and vocabulary.]
An empirical question which must be answered in order to justify
such componentialization of language skill is whether tests that purport to measure the same component of language skill (or the
same aspect, modality, or whatever) are in fact more highly
correlated with each other than with tests that purport to measure
different components. Presently available research results show many
cases where tests that purport to measure different components or
aspects (etc.) correlate as strongly or even more strongly than do tests
that purport to measure the same components. These results are
devastating to the claims of construct validity put forth by advocates
of discrete point testing.
For example, Pike (1973) found that scores on an essay task
correlated more strongly with the Listening Comprehension subscore
on the TOEFL than with the Reading Comprehension subscore for
three different groups of subjects. This sort of result controverts the
prediction that tasks in the reading-writing modality ought to be more highly intercorrelated with each other than with tasks in the listening-speaking modality (of course, the TOEFL Listening Comprehension subtest does require reading).

[Figure 8. Schematic representation of constructs posited by a componential analysis of language skills based on discrete point test theory, from Oller (1976c, p. 150): graphology/phonology, structure, and vocabulary crossed with receptive and productive modalities.]

Perhaps more surprisingly, Darnell
(1968) found that a cloze task was more highly correlated with the
Listening Comprehension section on the TOEFL than with any of
the other subscores. Oller and Conrad (1971) got a similar result with
the UCLA ESL Placement Examination Form 2C. Oller and Streiff
(1975) found that a dictation task was more strongly correlated with
each other part of the UCLA ESLPE Form 1 than any of the other parts with each other. This was particularly surprising to discrete
point theorizing in view of the fact that the dictation was the only
section of the test that required substantial listening comprehension.
Except for the phonological discrimination task, which required
distinguishing minimal pairs in sentence sized contexts, no other
subsection required listening comprehension at all.
In conclusion, if tasks that bear a certain label (e.g., reading
comprehension) correlate as well with tasks that bear different labels (e.g., listening comprehension, or vocabulary, or oral interview, etc.) as they do with each other, what independent justification can be offered for their distinct labels or for the positing of separate skills, aspects, and components of language? The only justification that comes to mind is the questionable theoretical bias of discrete point theory.
Such a justification is a variety, a not too subtle variety in fact, of
validity by fiat, or nominal validity - for instance, the statement that a
'listening comprehension test' is a test of 'listening comprehension'
because that is what it was named by its author(s); or that a 'reading'
test is distinct from a 'grammar' test because they were assigned
different labels by their creators. A 'vocabulary' test is different from
a 'grammar' or 'phonology' test because they were dubbed differently
by the theorists who rationalized the distinction in the first place.
There is a better basis for labeling tests. Tests may be referred to,
not in terms of the hypothetical constructs that they are supposed to
measure, but rather in terms of the sorts of operations they actually
require of learners or examinees. A cloze test requires learners to fill in blanks. A dictation requires them to write down phrases, clauses or other meaningful segments of discourse. An imitation task requires
them to repeat or possibly rephrase material that is heard. A reading
aloud task requires reading aloud. And so forth. A synonym
matching task is the most popular form of task usually called a
'Vocabulary Test'. If tests were consistently labeled according to
what they require learners to do, a great deal of argumentation
concerning relatively meaningless and misleading test labels could be
avoided. Questions about test validity are difficult empirical questions which are only obscured by the careless assignment of test labels on the basis of untested theoretical (i.e., theory based) biases.
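The operational labeling convention amounts to nothing more than a mapping from test to required task. A toy illustration in Python (the wording of the operations is ours, paraphrasing the descriptions above):

    # Labeling tests by the operations they require of examinees, rather
    # than by the constructs they are presumed to measure.
    operational_labels = {
        'cloze test': 'fill in blanks in a text',
        'dictation': 'write down heard phrases, clauses, or other segments of discourse',
        'imitation task': 'repeat or rephrase material that is heard',
        'reading aloud task': 'read a text aloud',
        'synonym matching task': 'match words with synonymous expressions',
    }
    for test, operation in operational_labels.items():
        print(f'{test}: {operation}')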
In the final analysis, the question of whether or not language skill
can be split up into so many discrete elements, components, and so
forth, is an empirical question. It cannot be decided a priori by fiat. It
is true that discrete point tests can be made from bits and pieces of
language, just as pumpkin pie can be made from pumpkins. The
question is whether the bits and pieces can be put together again, and
the extent to which they are characteristic of the whole. We explore
this question more thoroughly in Chapter 8 below. Here, the intent
has been merely to show that discrete point theories of testing are
derived from certain methods of teaching which in their turn derive
from certain methodsoflinguistic analysis.
KEY POINTS
1. Theories of language influence theories of language learning which in their turn influence theories of language teaching which in their turn influence theories and methods of language testing.
2. A unidirectional influence from theory to practice (and not from practical findings to theories) is unhealthy.
3. A theory that cannot be mortally endangered is not alive.
4. Language is typically used for meaningful purposes - therefore, any theory of language that hopes to attain a degree of adequacy must countenance this fact.
5. Structural analysis less meaning led to pattern drills without meaning.
6. Bloomfield's exclusion of meaning from the domain of interest to
linguistics was reified in the distributional discovery procedure
recommended by Zellig Harris.
7. The insistence on grammar without meaning was perpetuated by
Chomsky in 1957 when he insisted on a grammatical analysis of natural
languages with no reference to how those languages were used in normal
communication.
8. Even when semantic notions were incorporated into the 'integrated
theory' of the mid sixties, Katz and Fodor, and Katz and Postal insisted
that the speaker's knowledge of how language is used in relation to
extralinguistic contexts should remain outside the pale of interest.
9. Analysis of language without reference to meaning or context led to
theories of language learning which similarly tried to get learners to
internalize grammatical rules with little or no chance of ever discovering the meanings of the utterances they were encouraged to habitualize
through manipulative pattern drill.
10. Pattern drills, like the linguistic analyses from which they derive,
focussed on 'structural meaning' - the superficial meaning that was
associated with the distributional patterns of linguistic elements relative
to each other and largely independent of any pragmatic motivation for
uttering them.
11. Typical pattern drills based on the syntactic theories referred to above
are essentially noncontextualizable - that is, there is no possible context
in which all of the diverse things that are included in a pattern drill could
actually occur. (See the drills in most any ESL/EFL text - refer to the list of references given on p. 161 above.)
12. It is quite impossible, not just difficult, for a non-native speaker to infer pragmatic contexts for the sentences correctly in a typical syntax based pattern drill unless he happens to know already what the drill purports to be teaching.
13. Syntactically motivated pattern drills are intrinsically structured in ways
that will necessarily confuse learners concerning similar forms with
different meanings - there is no way for the learner to discover the
pragmatic motivations for the differences in meaning.
14. The hurdle between manipulative drills and communicative use of the utterances in them (or the rules they are supposed to instill in the learner) is an artefact of meaning-less pattern drills in the first place.
15. Discrete point teaching, particularly the syntax based pattern drill
approach, has been more or less directly translated into discrete point
testing.
16. Contrastive linguistics contended that the difficult patterns of a
particular target language could be determined in advance by a diligent
and careful comparison of the native language of the learner with the
target language.
"
17. The notion of contrastive linguistics was extended to the teaching of
reading and to language tests in the claim that the points of difficulty (predicted by contrastive analysis) should be the main targets for teaching and for test items.
18. A major difficulty for contrastive linguistics is that it has never provided
a very good basis for prediction. In many cases where the predictions are
clear they are wrong, and in others, they are too vague to be of any
empirical value.
19. Another serious difficulty is that every different native language background theoretically requires a different language test to assess knowledge of the same target language (e.g., English). Further, there
seems to be no reason to expect a comparison of English and Spanish to
provide a good test of either English or Spanish; rather, what appears to
be required is something that can be arrived at independently of any
comparison of the two languages - a test of one or the other language.
20. Many problems of test validity can be avoided if tests are labeled
according to what they require examinees to do instead of according to
what the test author thinks the test measures.
21. Finally, discrete point test theories require many subtests which are of
questionable validity. Whether there exists a separable (and separately
testable) component of vocabulary, another of grammar, and separable
skills of listening, reading, writing, and so forth must be determined on
an empirical basis.
DISCUSSION QUESTIONS
1. Take a pattern drill from any text. Ask what the motivation for the drill
was. Consider the possibility (or impossibility) of contextualizing the
drill by making obvious to the learner how the sentences of the drill
might relate to realistic contexts of communication. Can it be done? If
so, demonstrate how. If not, explain why not. Can contextualized or contextualizable drills be written? If so, how would the motivation for
such drills differ from the motivation for noncontextualizable drills?
2. Examine pragmatically motivated drills such as those included in El
espanol por el mundo (Oller, 1963-65). Study the way in which each
sentence in the drill is relatable (in ways that can be and in fact are made
obvious to the learner) to pragmatic contexts that are already established
in the mind of the learner. How do such drills differ in focus from the
drills recommended by other authors who take syntax as the starting
point rather than meaning (or pragmatic mapping of utterance onto
context)?
3. Work through a language test that you know to be widely used. Consider
which of the subparts of the test rely on discrete point theory for their
justification. What sorts of empirical studies could you propose to see if
the tests in question really measure what they purport to measure? What
specific predictions would you make concerning the intercorrelations of
tests with the same label as opposed to tests with different labels? What if
the labels are not more than mere labels?
4. Consider the sentences of a particular pattern drill in a language that you
know well. Ask whether the sentences of the drill would be apt to occur in
real life. For instance, as a starter, consider the likelihood of ever having
to distinguish between She's watching the baby versus She's washing the baby. Can you conceive of more likely contrasts? What contextual factors would cause you to prefer She's giving the baby a bath, or She's bathing the infant, etc.?
5. Following up on question 4, take a particular sentence from a pattern
drill and try to say all that you know about its form and meaning. For
instance, consider what other forms it calls to mind and what other
meanings it excludes or calls to mind. Compare the associations thus
developed for one sentence with the set of similar associations that come
to mind in the next sentence in the drill, and the next and the next. Now,
analyze some of the false associations that the drill encourages. Try to
predict some of the errors that the drill will encourage learners to make.
Do an observational study where the pattern drill is used in a classroom
situation and see if the false associations (errors) in fact arise. Or
alternatively, record the errors that are committed in response to the
sentences of a particular drill and see if you can explain them after the
fact in terms of the sorts of associations that are encouraged by the drill.
6. As an exercise in distributional analysis, have someone give you a simple
list of similarly structured sentences in a language that you do not know.
Try to segment those sentences without reference to meaning. See if you
can tell where the word boundaries are, and see if it is possible to
determine what the relationships between words are - i.e., try to discover
the structural meanings of the utterances in the foreign language without
referring to the way those utterances are pragmatically mapped onto
extralinguistic contexts. Then, test the success of your attempt by asking
an informant to give you a literal word for word translation (or possibly
a word for morpheme or phrase translation) along with any less literal
translation into English. See how close your understanding of the units
of the language was to the actual structuring that can be determined on
the basis of a more pragmatic analysis.
7. Take any utterance in any meaningful linguistic context and assign to it
the sort of tree structure suggested by a phrase structure grammar (or a
more sophisticated grammatical system if you like). Now consider the
question, what additional sorts of knowledge do I normally bring to bear
on the interpretation of such sentences that is not captured by the
syntactic analysis just completed? Extend the question. What techniques
might be used to make a learner aware of the additional sorts of clues and
information that native speakers make use of in coding and decoding
such utterances? The extension can be carried one step further. What
techniques could be used to test the ability of learners to utilize the kinds
of clues and information available to native speakers in the coding and
decoding of such utterances?
"8. Is there a need for pattern drills without meaning in language teaching?
If one chose to dispense with them completely, how could pattern drills
be constructed so as to maximize awareness of the pragmatic
consequences of each formal change in the utterances in the pattern drill?
Consider the naturalness constraints proposed for pragmatic language
tests. Could not similar naturalness constraints be imposed on pattern
drills? What kinds of artificiality might be tolerable in such a system and
what kinds would be intolerable?
9. Take any discrete test item from any discrete test. Embed it in a context
(if that is possible, for some items it is very difficult to conceive of a
realistic context). Consider then the degree of variability in possible
choices for the isolated discrete item and for the same item in context.
Which do you think would produce results most commensurate with
genuine ability to communicate in the language in question? Why?
Alternatively, take an item from a cloze test and isolate it from its context (this is always possible). Ask the same questions.
SUGGESTED READINGS
1. J. Stanley Ahman and Marvin D. Glock, Measuring and Evaluating Educational Achievement. 2nd ed. Boston: Allyn and Bacon, 1975. See especially Chapters 2, 3, and 4.
2. John L. D. Clark, Foreign Language Testing: Theory and Practice.
Philadelphia: Center for Curriculum Development, 1972, 25-113.
3. David P. Harris, Testing English as a Second Language. New York:
McGraw Hill, 1969, 1-11, 24-54.
4. J. B. Heaton, Writing English Language Tests. London: Longman, 1975.
5. Robert J. Silverman, Joslyn K. Noa, Randall H. Russell, and John
Molina, Oral Language Tests for Bilingual Students: An Evaluation of
Language Dominance and Proficiency Instruments. Portland, Oregon:
Center for Bilingual Education (USOE, Department of HEW), 18-28.
7
Statistical Traps
A. Sampling theory and test
construction
B. Two common misinterpretations of
correlations
C. Statistical procedures as the final
criterion for item selection
D. Referencing tests against nonnative performance
There is an old adage that 'Figures don't lie', to which some sage
added, 'But liars do figure'. While in the right contexts, both of these
statements are true, both oversimplify the problems faced by anyone
who must deal with the sorts of figures known as statistics produced
by the figurers known as statisticians. Although most statistics can
easily be misleading, most statisticians probably do not intend to be
misleading even when they produce statistics that are likely to be
misinterpreted.
The book by Huff (1954), How to Lie with Statistics, would
probably be more widely read if it were on the topic, How to Lie
without Statistics. Statistics per se is a dry and forbidding subject. The
difficulty is probably not related to deliberate deception but to the
difficulty of the objects of interest - namely, certain classes of
numbers known as statistics. This chapter examines several common
but misleading applications of statistics and statistical concepts. We
are thinking here of language testing, but the problems are in fact very
general in educational testing.
A. Sampling theory and test construction
One of the misapplications of statistical notions to language testing
relates to the construction of tests. Discrete point theorists have
suggested that
the various 'parts' of the domain of 'language proficiency' must be defined and represented in appropriate proportions on the test ...
Item analysis statistics take on a different meaning ... The concern is ... with how well the items on the test represent the content domain (Petersen and Cartier, 1975, p. 108f) ...
At first glance, it might appear that, in principle, a test of general proficiency in a foreign language should be a sample of the entire language at large. In practice, obviously, this is neither necessary nor desirable. The average native speaker gets along quite well knowing only a limited sample of the language at large, so our course and test really only need to sample that sample. (Further discussion on what to sample will follow after touching on the problem of how to sample [p. 111].)
The reader will begin to appreciate the difficulty of determining just what domains to sample (within the sample of what the native speaker knows) a couple of paragraphs farther on in the Petersen-Cartier argument.
Most language tests, including DLI's tests [i.e., the tests of the Defense Language Institute], therefore make a kind of stratified random sample, assuring by plan that some items test
grammatical features, some test phonological features, some
test vocabulary, and so on. Thus, for example, the DLI's
English Comprehension Level tests are constructed according
to a fairly complex sampling matrix which requires that specific
percentages of the total number of 120 items be devoted to
vocabulary, sound discrimination, grammar, idioms, listening
comprehension, reading comprehension, and so on.
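A brief sketch may make the kind of allocation such a matrix implies concrete. The category shares below are invented purely for illustration; the DLI's actual percentages are not given in the text:

    # Illustrative stratified allocation over a 120-item test.
    total_items = 120
    matrix = {
        'vocabulary': 0.25,
        'sound discrimination': 0.10,
        'grammar': 0.25,
        'idioms': 0.10,
        'listening comprehension': 0.15,
        'reading comprehension': 0.15,
    }
    allocation = {part: round(total_items * share) for part, share in matrix.items()}
    print(allocation)                  # e.g., 30 vocabulary items, 12 sound items, ...
    print(sum(allocation.values()))    # 120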
The authors continue to point out that there was once an attempt to
determine the feasibility of establishing a universal item-selection matrix of this sort for all languages, or perhaps for all
languages of a family, so that the problem of making a stratified
sample for test construction purposes could be reduced to a
somewhat standard procedure. However, such a procedure has
not, as yet, been found, and until it is, we must use some method
for establishing a rational sample of a language in our tests (p.
112).
The solution arrived at is to 'consider the present DLI courses as rational samples of the language and to sample them ... for the item objectives in our tests of general ability' (p. 113).1
1 The authors indicate in personal communication that they reached this decision only 'with great reluctance'. They did not in their words 'arrive at these conclusions naively but, in fact, only with some considerable pain'. Therefore, the foregoing and following remarks should be interpreted accordingly.
Several crucial questions arise: What would a representative
proportion of the phonology of the language be? How many items
would be required to sample the component of phonology if such a
component is in fact considered to be part of the larger domain of
language proficiency? What percentage of the items on a language
test should address the component of vocabulary (in order to be
representative in the way that this component is sampled)? How
many items should address the grammar component? What is a
representative sampling of the grammatical structures of a language?
How many noun phrases should it contain? Verb phrases? Left-embedded clauses? Right-branching relative clauses?
One is struck by the possibility that in spite of what anyone may
say, there may be no answers to these questions. The suspicion that
\ no answers exist is heightened by ~he admissio,n that no procedure has
yet been found which will provide a rational basis for 'making a
stratified sample'. Indeed, if this is so, on what possible rational basis
could specific percentages of a certain total number of items be
determined?
Clearly, the motivation for separate items aimed at 'phonological
features, vocabulary, grammar, idioms, listening comprehension,
reading comprehension' and so forth is some sort of discrete point
analysis that is presumed to have validity independent of any
particular language test or portion thereof. However, the sort of
questions that must be answered concerning the possible existence of
such components of language proficiency cannot be answered by
merely invoking the theoretical biases that led to the hypothesized
components oflanguage proficiency in the first place. The theoretical
biases themselves must be tested - not merely assumed.
The problem for anyone proposing a test construction method that
relies on an attempt to produce a representative (or even a 'rational')
'sampling of the language' is not merely a matter of how to parse up
the universe of 'language proficiency' into components, aspects,
skills, and so forth, but once having arrived at some division by
whatever methods, the most difficult problem still remains - how to
recognize a representative or rational sample in any particular
portion of the defined universe.
It can be shown that any procedure that might be proposed to
assess the representativeness of a particular sample of speech is
doomed to failure because by any possible methods, all samples of
real speech are either equally representative or equally unrepresentative of the universe of possible utterances. This is a natural
and inevitable consequence of the fact that speech is intrinsically nonrepetitive. Even an attempt to repeat a communicative exchange by
using the same utterances with the same meanings is bound to fail
because the context in which utterances are created is constantly
changing. Utterances cannot be perfectly repeated. Natural discourse of the sort that is characteristic of the normal uses of human languages is even less repeatable - by the same logic. The universe to be sampled from is not just very large, it is infinitely large and nonrepetitive.2 To speak of a rational sampling of possible utterances is like speaking of a rational sampling of future events or even of historical happenings. The problem is not just where to dip into the possible sea of experience, but also how to know when to stop dipping - i.e., when a rational sample has been achieved.
To make the problem more meaningful, consider sampling the
possible sentences of English. It is known that the number of
sentences must be infinite unless some upper bound can be placed on
the length of an allowable sentence. Suppose we exclude all sentences
greater than twenty words in length. (To include them only
strengthens the argument, but since we are using the method of
reduction to absurdity, tying our hands in this way is a reasonable way to start.) Miller (1964) conservatively estimated that the number of grammatical twenty-word sentences in English is roughly 10²⁰. This figure derives directly from the fact that if we interrupt someone who is speaking, on the average there are about 10 words that can be used to form an appropriate grammatical continuation. The number 10²⁰ exceeds the number of seconds in 100,000,000 centuries.
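Miller's figures are easily checked. A back-of-the-envelope computation in Python (the ten-continuations-per-word average is Miller's, as cited above):

    # About 10 grammatical continuations per word, 20 words per sentence.
    n_sentences = 10 ** 20

    # Seconds in 100,000,000 centuries: 10**8 centuries x 100 years each.
    seconds = 10**8 * 100 * 365.25 * 24 * 3600   # roughly 3.16e17

    print(f'{n_sentences:.0e} twenty-word sentences')   # 1e+20
    print(f'{seconds:.2e} seconds')                     # 3.16e+17
    print(n_sentences > seconds)                        # True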
Petersen and Cartier (1975) suggested that one way around the
sampling problem was to 'sample the sample' of language that
happened to appear in the particular course of study at the Defense
Language Institute. Rather than trying to determine what would be a
representative sample of the course, suppose we just took the whole
course of an estimated '30 hours a week for 47 weeks' (p. 112).
Suppose further that we make the contrary to fact assumption that in the course students are exposed to a minimum of one twenty-word sentence each second. If we then considered the entire course of study
2 This is not the same as saying that every sentence is 'wholly novel' or 'totally unfamiliar' - a point of view that was argued against in Chapter 6 above. To say that normal communicative events cannot be repeated is like saying that you cannot relive yesterday or even the preceding moment of time. This does not mean that there is nothing familiar about today or that there will be nothing familiar about tomorrow. It is like saying, however, that tomorrow will not be identical to today, nor is today quite like yesterday, etc.
as a kind of language test, it would only be possible for such a test to cover about five million sentences - about .000000000005 percent (five trillionths of one percent) of the possible twenty-word sentences in English. In what realistic sense could this 'sample' be considered representative of the whole? We are confronted here not just with a difficulty of how to select a 'sample', because it doesn't make any difference how we select the sample. It can never be argued that any
possible sample of sentences is representative of the whole.
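The fraction itself is easy to compute, assuming (as the text does) 30 hours a week for 47 weeks and one twenty-word sentence per second:

    seconds_in_course = 30 * 47 * 3600     # 1,410 hours expressed in seconds
    sentences_covered = seconds_in_course  # one twenty-word sentence per second
    share = sentences_covered / 10**20     # share of the possible sentences

    print(f'{sentences_covered:,}')        # 5,076,000 - about five million
    print(f'{share * 100:.0e} percent')    # 5e-12 percent: trillionths of one percent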
If we take a larger unit of discourse as our basic working unit, the
difficulty of obtaining a representative sample becomes far greater
since the number of possible discourses is many orders of magnitude
larger than the number of possible twenty-word sentences. We are
thus forced to the conclusion that the discrete point sampling notion
is about as applicable to the problem of constructing a good language
test as the method of listing sentences is to the characterization of the
grammar of a particular natural language. No list will ever cover
enough of the language to be interesting as a theory of grammar. It is
fundamentally inadequate in principle - not just incomplete. Such a
list could never be completed, and a test modeled after the same
fashion could never be long enough even if we extended it for the
duration of a person's life expectancy. In this vein, consider the fact
that the number of twenty-word sentences by itself is about a thousand times larger than the estimated age of the earth in seconds (Miller, 1964).
The escape from the problem of sampling theory is suggested in the
statement by Petersen and Cartier (1975) that 'the average native
speaker gets along quite well knowing only a limited sample of the
language at large, so our course and test really only need to sample that
sample' (p. 111). Actually, the native speaker knows far more than he
is ever observed to perform with his language. The conversations that
a person has with others are but a small fraction of conversations that
one could have if one chose to do so. Similarly, .the utterances that
one comprehends are but an infinitesimal fraction of the utterances
that one could understand if they were ever presented. Indeed, the
native speaker knows far more of his language than he will ever be
observed to use - even if he talks a lot and if we observe him all the
time. This was the original motivation for the distinction between
competence and performance in the Chomskyan linguistics of the
1950s and 1960s.
The solution to the difficulty is not to shift our attention from many
utterances (say, all of the ones uttered or understood by a given native
speaker) to fewer utterances (say, all of those utterances presented to a
language learner in a particular course of study at the DLI), but to remove our attention to an entirely different sort of object - namely, the grammatical system (call it a cognitive network, generative grammar, expectancy grammar, or interlanguage system) that the learner is in the process of internalizing and which when it is mature (i.e., like that of the native speaker) will generate not only the few meaningful
utterances that happen to occur in a particular segment of time, but
the many that the native speaker can say and understand.
Instead of trying to construct a language test that will 'representatively' or 'rationally sample' the universe of 'language', we
should simply construct a test that requires the language learner to do
what native speakers do with discourse (perhaps any discourse will
do). Then the interpretation of the test is related not to the particular
discourse that we happened to select, nor even to the universe of
possible discourses in the sense of sampling theory. Rather, it is related to the efficiency of the learner's internalized grammatical system in processing discourse. The validity of the test is related to how well it enables us to predict the learner's performance in other
discourse processing tasks.
We can differentiate between segments of discourse that are easy to process and segments that are more difficult to process. Further, we can differentiate between segments of discourse that would be appropriate to the subject matter of mathematics, as opposed to say, geography, or gardening as opposed to architecture. The selection of
segments of discourse appropriate to a particular learner or group of
learners would depend largely on the kinds of things that would be
expected later on of the same learner or group. Perhaps in this sense it
would be possible to 'sample' types of discourse - but this is a very
different sort of sampling than 'sampling' the phonological contrasts
of a language, say, in appropriate proportion to the vocabulary
items, grammatical rules, and the like. A language user can select
portions of college level texts for cloze tests, but this is not the sort of
'sampling' that we have been arguing against. To make the selection
analogous to the sort of sampling we have been arguing against is in
principle impossible not just indefensible; one would have to search
for prose with a specific proportion of certain phonological contrasts,
vocabulary items, grammatical rules and the like in relation to a
course syllabus for teaching the language of the test.
In short, sampling theory is either inapplicable or not needed.
Where it might seem to apply, e.g., in the case of phonological
contrasts, or vocabulary, it is not at all clear how to justify the
weighting of subtests in relation to each other. Where sampling
theory is clearly inapplicable, e.g., at the sentence or discourse levels,
it is also obviously not needed. No elaborate sampling technique is
needed to determine whether a learner can read college texts at a
defined level of comprehension - nor is sampling theory necessary to
the definition of the degree of comprehension that a given learner
may exhibit.
B. Two common misinterpretations of correlations
Correlations between language tests have sometimes been misinterpreted in two ways: first, low correlations have sometimes been
taken to mean that two tests with different labels are in fact measuring
different skills, or aspects or components of skills; and, second, high
correlations between dissimilar language processing tasks have
sometimes been interpreted as indicating mere reliability or even a
lack of validity. Depending upon the starting assumptions of a
particular theoretical viewpoint, the same statistics may yield
different interpretations - or may seem to support different, even
mutually contradictory conclusions. When such contradictions arise,
either one or both of the contradictory viewpoints must be wrong or
poorly articulated.
For instance, it is not reasonable to interpret a low correlation
between two tests as an indication that the two tests are measuring
different skills and also that neither of them is reliable (or that one of
them is not reliable). Since reliability is a prerequisite to validity, a
given statistic cannot be taken as an indication of low reliability and
high validity (yet this is sometimes suggested in the literature as we
will see below). Similarly, a high correlation between two dissimilar
language tests cannot be dismissed as a case of high reliability but low
validity all in the same breath. This latter argument, however, is more
complex, so we will take the simpler case first.
Suppose that the correlation between a particular test that bears
the label 'Grammar' (or 'English Structure') and another language
test that bears the label 'Vocabulary' (or 'Lexical Knowledge') is
observed to be comparatively low, say, .40. Can it be concluded from
such a low correlation that the two tests are therefore measuring
different skills? Can we say on the basis of the low correlation
between them that the so-called 'Grammar' test is a test of a
'grammar' component of language proficiency while the so-called
'Vocabulary' test is a test of a 'lexical' component? The answer to
both questions is a simple, no.
The observed low correlation could result if both tests were in fact
measures of the same basic factor but were both relatively unreliable
measures of that factor. It could also result if one of the tests were
unreliable; or if one of them were poorly calibrated with respect to
the tested subjects (i.e., too easy or too difficult); or if one of the tests
were in fact a poor measure of what it purported to measure even
though it might be reliable; and so forth. In any case, a low
correlation between two tests (even if it is expected on the basis of
some theoretical reasoning) is relatively uninformative. It certainly
cannot be taken as an indication of the validity of the correlated tests.
Consider using recitation of poetry as a measure of empathy and
quality of artistic taste as a measure of independence - would a low
correlation between the two measures be a basis for claiming that one
or both must be valid? Would a low correlation justify the assertion
that the two tests are measures of different factors? The point is that
low correlations would be expected if neither test were a measure of
anything at all.
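The first of these possibilities follows directly from the classical attenuation relation of true-score theory. A minimal sketch in Python (the reliabilities of .40 are hypothetical, chosen only to show that two tests of the very same factor can correlate at a mere .40):

    import math

    # Classical true-score theory: the observed correlation between two
    # tests is the correlation of their true scores, attenuated by
    # unreliability: r_observed = r_true * sqrt(rel_x * rel_y).
    r_true = 1.0   # suppose both tests measure exactly the same factor
    rel_x = 0.40   # hypothetical reliability of the 'Grammar' test
    rel_y = 0.40   # hypothetical reliability of the 'Vocabulary' test

    r_observed = r_true * math.sqrt(rel_x * rel_y)
    print(round(r_observed, 2))   # 0.4 - 'low' despite a common factor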
A low correlation between a 'Grammar' test and a 'Vocabulary'
test might well be the product of poor tests rather than an
independence of hypothesized components of proficiency. In fact,
many failures to achieve very high correlations would not prove by
any means that very high correlations do not in fact exist between
something we might call 'knowledge of vocabulary' and something
else that we might call 'knowledge of grammar'. Indeed the two kinds
of knowledge might be one kind in reality and no number of low
correlations between language tests labeled 'Grammar' tests on the
one hand and 'Vocabulary' tests on the other would suffice to exclude
such a possibility - and so on for all possible dichotomies. In fact the
observation of low correlations between language tests where high
correlations might be expected is a little like fishing trips which
produce small catches where large catches were expected. Small
catches do not prove that big catches do not exist. The larger catches
may simply be beyond the depths of previously used nets.
To carry the example a little farther, consider the following remark
by Bolinger (1975) in his introductory text on linguistics: 'There is a
vast amount of grammatical detail still to be dug out of the lexicon - so much that by the time we are through there may be little point in talking about grammar and lexicon as if they were two different things' (p. 299). Take a relatively common word in English like continue. In a test such as the TOEFL, this word could appear as an item in the Vocabulary subsection. It might appear in a sentence frame something like, The people want to 'continue'. The examinees might be required to identify a synonymous expression such as keep on from a field of distractors. On the other hand, the word continue might just as easily appear in the English Structure section as part of a verb phrase, or on some other part of the test in a different grammatical form - e.g., continuation, continual, continuity, discontinuity, continuous, or the like. If the word appeared in the English Structure section it might be part of an item such as the following:
SPEAKER A: 'But do you think they'll go on building?'
SPEAKER B: 'Yes, I do because the contractor has to meet his deadline. I think __________.'
(A) the people continue to will
(B) will they want to continue
(C) to continue the people will
(D) they will want to continue
In an important sense, knowing a word is knowing how to use it in
a meaningful context, a context that is subject to the normal syntactic
(and other) constraints of a particular language. Does it make sense
then to insist on testing word-knowledge independent of the
constraints that govern the relationships between words in discourse?
Is it possible? Even if it does turn out to be possible, the proof that it
has been accomplished will have to come from sources of evidence
other than mere low correlations between tests labeled 'Grammar'
tests and tests called 'Vocabulary' tests. In particular, it will have to
be shown that there exists some unique and meaningful variance
associated with tests of the one type that is not also associated with
tests of the other type - this has not yet been done. Indeed, many
attempts to find such unique variances have failed (see the Appendix
and references cited there).
In spite of the foregoing considerations, some researchers have
contended that relatively low correlations between different language
tests have more substantial meaning. For instance, the College
Entrance Examination Board and Educational Testing Service
recommended in their 1969 Manual for Studies in Support of Score
Interpretation (for the TOEFL) that it may be desirable to 'study the
intercorrelations among the parts [of the TOEFL] to determine the
extent to which they are in fact measuring different abilities for the
group tested' (p. 6).
The hint that low correlations might be taken as evidence that
different subtests are in fact measuring different components of
language proficiency or different skills is confirmed in two other
reports. For instance, on the basis of the data in Table 1, the authors
of the TOEFL Interpretive Information (Revised 1968) conclude: 'it
appears that Listening Comprehension is measuring some aspect of
English proficiency different from that measured by the other four
parts, since the correlations of Listening Comprehension with each of
the others are the four lowest in the table' (p. 14).
TABLE 1
Intercorrelations of the Part Scores on the Test of English as a Foreign Language, Averaged over Forms Administered through April 1967. (From Manual for TOEFL Score Recipients. Copyright © 1973 by Educational Testing Service. All rights reserved. Reprinted by permission.)

Subscores                       (1)   (2)   (3)   (4)   (5)
(1) Listening Comprehension      -    .62   .53   .63   .55
(2) English Structure           .62    -    .73   .66   .79
(3) Vocabulary                  .53   .73    -    .70   .77
(4) Reading Comprehension       .63   .66   .70    -    .72
(5) Writing Ability             .55   .79   .77   .72    -

Later, in an update of the same interpretive publication, Manual for TOEFL Score Recipients (1973), on the basis of a similar table (see Table 2 below), it is suggested that 'the correlations between Listening Comprehension and the other parts of the test are the lowest. This is probably because skill in listening comprehension may be quite independent of skills in reading and writing; also it is not possible to standardize the administration of the Listening Comprehension section to the same degree as the other parts of the test' (p. 15).

TABLE 2
Intercorrelations of the Part Scores on the Test of English as a Foreign Language, Averaged over Administrations from October 1966 through June 1971. (From Manual for TOEFL Score Recipients. Copyright © 1973 by Educational Testing Service. All rights reserved. Reprinted by permission.)

Subscores                       (1)   (2)   (3)   (4)   (5)
(1) Listening Comprehension      -    .64   .56   .65   .60
(2) English Structure           .64    -    .72   .67   .78
(3) Vocabulary                  .56   .72    -    .69   .74
(4) Reading Comprehension       .65   .67   .69    -    .72
(5) Writing Ability             .60   .78   .74   .72    -

Here the authors offer what amount to mutually exclusive explanations. Both cannot be correct. If the low correlations between Listening Comprehension and the other subtests are the product of unreliability in the techniques for giving the Listening test (i.e., poor sound reproduction in some testing centers, or merely inconsistent procedures) then it is not reasonable to use the same low correlations as evidence that the Listening test is validly measuring something that the other subtests are not measuring. What might that something be? Clearly, if it is lack of consistency in the procedure that produces the low correlations it is not listening ability as distinct from reading, or vocabulary knowledge, or grammar knowledge, etc. On the other hand if the low correlations are produced by real differences in the underlying skills presumed to be tested, the administrative procedures for the Listening test must have substantial reliability. It just can't go both ways.
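Incidentally, the quoted observation about Table 1 - that the four lowest entries all involve Listening Comprehension - is easy to verify mechanically, using the values as reconstructed above:

    # Lower triangle of Table 1, keyed by pairs of subtests.
    table1 = {
        ('LC', 'ES'): .62, ('LC', 'V'): .53, ('LC', 'RC'): .63, ('LC', 'WA'): .55,
        ('ES', 'V'): .73, ('ES', 'RC'): .66, ('ES', 'WA'): .79,
        ('V', 'RC'): .70, ('V', 'WA'): .77, ('RC', 'WA'): .72,
    }
    lowest_four = sorted(table1, key=table1.get)[:4]
    print(lowest_four)                                 # every pair involves LC
    print(all('LC' in pair for pair in lowest_four))   # True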
In the 1973 manual, the authors continue the argument that the
tests are in fact measuring different skills by noting that 'none of the
correlations ... [in our Table 2] are as high as the reliabilities of the
part scores' from which they conclude that 'each part is contributing
something unique to the total score' (p. 15). The question that is still
unanswered is what that 'something unique' is, and whether in the
case of each subtest it is in any way related to the label on that subtest.
Is the Reading Comprehension subtest more of a measure of reading
ability than it is of writing ability or grammatical knowledge or
vocabulary knowledge or mere test-taking ability or a general
proficiency factor or intelligence? The fact that the reliability
coefficients are higher than the correlations between different part
scores is no proof that the tests are measuring different kinds of
knowledge. In fact, they may be measuring the same kinds of
knowledge and their low intercorrelations may indicate merely that
they are not doing it as well as they could. In any event, it is axiomatic
that validity cannot exceed reliability - indeed the general rule of
thumb is that validity coefficients are not expected to exceed the square root of the product of the reliabilities of the intercorrelated tests (Tate, 1965). If a
certain test has some error variance in it and a certain other test also
has some error variance in it, the error is apt to be compounded in
their intercorrelation. Therefore, the correlation between two tests
can hardly be expected to exceed their separate reliabilities. It can
equal them only in the very special case that the tests are measuring
exactly the same thing.
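The bound can be stated compactly. A minimal sketch, using the standard true-score attenuation relation (not any formula peculiar to Tate, 1965):

    import math

    def max_observed_correlation(rel_x: float, rel_y: float) -> float:
        # An observed correlation cannot exceed the geometric mean of the
        # two reliabilities; the bound is reached only when the true
        # scores of the two tests correlate perfectly.
        return math.sqrt(rel_x * rel_y)

    print(round(max_observed_correlation(0.90, 0.90), 2))   # 0.9
    print(round(max_observed_correlation(0.90, 0.60), 2))   # 0.73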
From all of the foregoing, it is possible to see that low (or at least
relatively low) correlations between different language tests can be
interpreted as indications of low reliability or validity but hardly as
proof that the tests are measuring different things.
If one makes the mistake of interpreting low correlations as evidence that the tests in question are measuring different things, how
will one interpret higher correlations when they are in fact observed
between equally diverse tests? For instance, if the somewhat lower
correlations between TOEFL Listening Comprehension subtest and
the Reading Comprehension subtest (as compared against the
intercorrelations between the Reading Comprehension subtest and
the other subtests) represented in Tables 1 and 2 above, are taken as
evidence that the Listening test measures some skill that the Reading
test does not measure, and vice versa, then how can we explain the
fact that the Listening Comprehension subtest correlates more
strongly with a cloze test (usually regarded as a reading comprehension measure) than the latter does with the Reading
Comprehension subtest (see Darnell, 1968)?
Once high correlations between apparently diverse tests are
discovered, the previous interpretations of low correlations as
indicators of a lack of relationship between whatever skills the tests
are presumed to measure are about as convincing as the argumen.ts of,
the unsuccessful fisherman who said there were no fish to be caught.
The fact is that the fish are in hand. Surprisingly high correlations
have been observed between a wide variety of testing techniques with a
wide range of tested populations. The techniques range from a whole
family of procedures under the general rubric of cloze testing, dictation,
elicited imitation, essay writing, and oral interview. Populations have
ranged from children and adult second language learners, to children
and adults tested in their native language (see the Appendix).
What then can be made of such high correlations? Two
interpretations have been offered. One of them argues that the strong
correlations previously observed between cloze and dictation, for
instance, are merely indications of the reliability of both procedures
and proof in fact that they are both measuring basically the same
thing - further, that one of them is therefore not needed since they
both give the same information. A second interpretation is that the
high correlations between diverse tests must be taken as evidence not
only of reliability but also of substantial test validity. In the first case,
it is argued that part scores on a language proficiency test should
produce low intercorrelations in order to attain validity, and in the
second that just the reverse is true. It would seem that both positions
cannot be correct.
It is easy to see that the expectation of low correlations between
tests that purport to measure different skills, components, or aspects
of language proficiency in accord with discrete point test philosophy
necessitates some method of explaining away high correlations when
they occur. The solution of treating the correlations merely as
indications of test reliability, as Rand (1972) has done, runs into very
serious logical trouble. Why should we expect a dictation which
requires auditory processing of sequences of spoken material to
measure the same thing as a cloze test which requires the learner to
read prose and replace missing words? To say that these tests are tests
of the same thing, or to interpret a high correlation between them as
an indication of reliability (alone, and not something more) is to saw
off the limb on which the whole of discrete point theory is perched. If
cloze procedure is not different from dictation, then what is the
difference between speaking and listening skills? What basis could be
offered for distinguishing a reading test from a grammar test? Are
such tasks more dissimilar than cloze and dictation? If we were to
follow this line of reasoning just a short step further, we would be
forced to conclude that low correlations between language tests of
any sort are indicators of low reliabilities perforce. This is a
conclusion that no discrete point theorist, however, could entertain
as it obliterates all of the distinctions that are crucial to discrete point
testing.
What has been proposed by discrete point theorists, however, is
that tests should be constructed so as to minimize, deliberately, the
correlations between parts. If discrete point theorizing has substance to it, such a recommendation is not entirely without support.
However, if the competing viewpoint of pragmatic or integrative test
philosophy turns out to be more correct, test constructors should
interpret low correlations as probable indicators of low validities and
should seek to construct language tests that maximize the
intercorrelation of part scores.
This does not mean that what is required is a test that consists of
only one sort of task (e.g., reading without speaking, listening, or
writing). On the contrary, unless one is interested only in ability to
read with comprehension or something of the sort, to learn how well
an individual understands, speaks, reads, and writes a language, it
may well be necessary (or at least highly desirable) to require him to
do all four sorts of performances. The question here is not which or
how many tasks should be included on a comprehensive language
test, but what sort of interrelationship between performances on
language tasks in general should be expected.
If an individual happens to be much better at listening tasks than at
speaking tasks, or at reading and writing tasks than at speaking and
listening tasks, we would be much more apt to discover this fact with
valid language tests than with non-valid ones. However, the case of a
particular individual, who may show marked differences in ability to
perform different language tasks, is not an argument against the
possibility of a very high correlation between those same tasks for an
entire population of subjects, or for subjects in general.
What if we go down the other road? What if we assume that part
scores on language tests that intercorrelate highly are therefore
redundant and that one of the highly correlated test scores should be
eliminated from the test? The next step would be to look for some
new subtest (or to construct one or more than one) which would
assess some different component of language skill not included in the
redundant tests. In addition to sawing the limb out from under the
discrete point test philosophy, we would be making a fundamental
error in the definition of reliability versus validity. Furthermore, we
would be faced with the difficult (and probably intrinsically
insoluble) problem of trying to decide how much weight to assign to
which subskill, component, or aspect, etc. We have discussed this
latter difficulty in section A above.
The matter of confusing reliability and validity is the second point
to be attended to in this section. Among the methods for assessing test
reliability are: the test-retest method; the technique of correlating
one half of a test with the other half of the same test (e.g., correlating
the average score on even numbered items with the average on odd
numbered items for each presumed homogeneous portion of a test);
and the alternate forms method (e.g., correlating different forms of
the same test). In all of these cases, and in fact in all other measures of
reliability (Kuder-Richardson formulas and other internal consistency
measures included), reliability by definition has to do with
tests or portions of tests that are in some sense the same or
fundamentally similar.
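The odd-even split-half technique just mentioned is easy to illustrate. In the sketch below (invented right/wrong data), the half-test correlation is adjusted by the Spearman-Brown formula, the standard correction for projecting a half-length correlation to full test length; the correction step is an addition to the text's description:

    # Sketch of the odd-even split-half method, with the Spearman-Brown
    # correction to estimate full-length reliability. Requires Python 3.10+.
    # rights[i][j] = 1 if examinee i answered item j correctly (invented data).
    from statistics import correlation

    rights = [
        [1, 1, 0, 1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 0],
        [0, 1, 0, 0, 1, 0, 0, 1],
        [1, 1, 0, 1, 1, 1, 1, 1],
    ]
    odd = [sum(row[0::2]) for row in rights]   # each examinee's score on items 1, 3, 5, 7
    even = [sum(row[1::2]) for row in rights]  # each examinee's score on items 2, 4, 6, 8
    r_half = correlation(odd, even)
    r_full = (2 * r_half) / (1 + r_half)       # Spearman-Brown prophecy formula
    print(f'split-half r = {r_half:.2f}, corrected reliability = {r_full:.2f}')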
To interpret high correlations between substantially different tests,
or tests that require the performance of substantially different tasks,
as mere indicators of reliability is to redefine reliability in an
unrecognizable way. If one accepts such a definition, then how will
one ever distinguish between measures of reliability and measures of
validity? The distinction, which is a necessary one, evaporates.
In the case of language tests that require the performance of
substantially different tasks, to interpret a high correlation between
them as an indication of reliability alone is to treat the tasks as the
same when they are not, and to ignore the possibility that even more
diverse pragmatic language tasks may be equally closely related. In
the case of language tests, high correlations are probably the result of
an underlying proficiency factor that relates to a psychologically real
grammatical system. If such a factor exists, the ultimate test of
validity of any language test is whether or not it taps that factor, and
how well it does so. The higher the correlations obtained between
diverse tasks, the stronger the confidence that they are in fact tapping
such a factor. The reasoning may seem circular, but the circularity is
only apparent. There are independent reasons for postulating the
underlying grammatical system (or expectancy grammar) and there
are still other bases for determining what a particular language test
measures (e.g., error analysis). The crucial empirical test for the
existence of a psychologically real grammar is in fact performance on
language tests (or call them tasks) of different sorts. Similarly, the
chief criterion of validity for any proposed language test is how well it
assesses the efficiency of the learner's internalized grammatical
system (or in the terms of Part One of this book, the learner's
expectancy system).
On the basis of present research (see Oller and Perkins, 1978) it
seems likely that Chomsky (1972) was correct in arguing that
language abilities are central to human intelligence. Further, as is
discussed in greater detail in the Appendix, it is apparently the case
that language ability, school achievement, and IQ all constitute a
relatively unitary factor. However, even if this latter conclusion were
not sustained by further research, the discrete point interpretations of
correlations as discussed above would still have to be radically revised.
The problems there are logical and analytical whereas the unitary
factor hypothesis is an empirical issue that requires experimental
study.
C. Statistical procedures as the final criterion for item selection
Perhaps because of its distinctive jargon, perhaps because of its
bristling mathematical formulas, or perhaps out of blind respect for
things that one does not understand fully, statistical procedures (like
budgets and curricula as we noted in Chapter 4) are sometimes
elevated from the status of slaves to educational purposes to the
status of masters which define the purposes instead of serving them.
In Chapter 9 below we return to the topic of item analysis in greater
detail. Here it is necessary to define briefly the two item statistics on
which the fate of most test items is usually decided - i.e., whether to
include a particular item, exclude it, or possibly rewrite it and pretest
it again. The first item statistic is the simple percentage of students
answering correctly (item facility) or the percentage answering
incorrectly (item difficulty). For the sake of parsimony we will speak
only of item facility (IF) - in fact, a little reflection will show that item
difficulty is merely another way of expressing the same numerical
value that is expressed as IF.
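In computational terms IF is nothing more than a proportion, as the following sketch shows (the right/wrong data are invented for illustration):

    # Sketch: item facility (IF) as the proportion of examinees answering an
    # item correctly; item difficulty is simply the complement, 1 - IF.
    # rights[i][j] = 1 if examinee i answered item j correctly (invented data).
    rights = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 0, 0],
        [1, 1, 1, 1],
        [1, 0, 0, 1],
    ]
    for j in range(len(rights[0])):
        IF = sum(row[j] for row in rights) / len(rights)
        print(f'item {j + 1}: IF = {IF:.2f}, difficulty = {1 - IF:.2f}')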
The second item statistic that is usually used by professional testers
in evaluating the efficiency of an item is item discrimination (ID). This
latter statistic has to do with how well the item tends to separate the
examinees who are more proficient at the task in question from those
examinees who are less proficient. There are numerous formulas for
computing IDs, but all of them are in fact methods of measuring the
degree of correlation between the tendency to get high and low scores
on the total test (or subtest) and the tendency to answer correctly or
incorrectly on a particular test item. The necessary assumption is that
the whole test (or subtest) is apt to be a better measure of whatever the
item attempts to measure (and the item can be considered a kind of
miniature test) than any single item. If a given item is valid (by this
criterion) it must correlate positively and significantly with the total
test. If it does so, it is possible to conclude that high scorers on the test
will tend to answer the item in question correctly more frequently
than do low scorers; or, reversing the relationship between the item
and the test, examinees who earn high scores on the test tend to
answer the item correctly more frequently than examinees who get
low scores on the test.
For reasons that are discussed in more detail in Chapter 9, items
with very low or very high IFs and/or items with very low or negative
IDs are usually discarded from tests. Very simply, items that are too
easy or too hard provide little or no information about the range of
proficiencies in a particular group of examinees, and items that have
nil or negative IDs either contribute nothing to the total amount of
meaningful variance in the test (i.e., the tendency of the test to spread
the examinees over a scale ranging from less proficient to more
proficient) or they in fact tend to depress the meaningful variance by
cancelling out some of it (in the case of negative IDs).
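One common way of computing ID is as a point-biserial coefficient, i.e., the Pearson correlation between the 0/1 responses to an item and the total scores. The sketch below applies that method together with illustrative screening thresholds; the data and cut-off values are assumptions for the example, not prescriptions from the text, and the total here includes the item itself (refinements subtract it out):

    # Sketch: ID as the correlation between each item (scored 0/1) and the
    # total test score, with simple screening rules. Requires Python 3.10+.
    from statistics import correlation

    rights = [  # rights[i][j] = 1 if examinee i answered item j correctly
        [1, 1, 0, 1, 0],
        [1, 0, 0, 1, 1],
        [1, 1, 1, 1, 0],
        [0, 0, 0, 1, 1],
        [1, 1, 1, 1, 0],
        [0, 1, 0, 0, 1],
    ]
    totals = [sum(row) for row in rights]
    for j in range(len(rights[0])):
        item = [row[j] for row in rights]
        IF = sum(item) / len(item)
        # correlation() is undefined when an item has zero variance
        ID = correlation(item, totals) if 0 < IF < 1 else 0.0
        verdict = 'discard' if (IF < 0.2 or IF > 0.9 or ID <= 0) else 'keep'
        print(f'item {j + 1}: IF = {IF:.2f}, ID = {ID:+.2f} -> {verdict}')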
Probably any multiple choice test, or other test that is susceptible
to item analysis, can be significantly improved by the application of
the above criteria for eliminating weak items. In fact, as is argued in
Chapter 9, multiple choice tests which have not been subjected to the
requirements of such item analyses should probably not be used for
the purposes of making educational decisions - unfortunately, they
are used for such purposes in many educational contexts.
The appropriate use of item analysis then is to eliminate (or at least
to flag for revision) items that are for whatever reason inconsistent
with the test as a whole, or items that are not calibrated appropriately
to the level of proficiency of the population to be tested. But what
about the items that are left unscathed by such analyses? What about
the items that seem to be appropriate in IF and ID? Are they
necessarily, therefore, valid items? If such statistics can be used as
methods for eliminating weak items, why not use them as the final
criteria for judging the items which are not eliminated as valid - once
and for all? There are several reasons why acceptable item statistics
cannot be used as the final basis for judging the validity of items to be
included in tests. It is necessary that test items conform to minimal
requirements of IF and ID, but even if they do, this is not a sufficient
basis for judging the items to be 'valid' in any fundamental sense.
One of the reasons that item statistics cannot be used as final
criteria for item selection - and perhaps the most fundamental reason
- relates to the assumption on which ID is based. Suppose that a
certain Reading Comprehension test (or one that bears the label) is a
fairly poor test of what it purports to measure. It follows that even
items that correlate perfectly with the total score on such a test must
also be poor items. For instance, if the test were really a measure of
the learner's ability to recall dates or numbers mentioned in the
reading selection and to do simple arithmetic operations to derive
new dates, the items with the highest IDs might in actuality be the
ones with the lowest validities (as tests of reading comprehension).
Another reason that item statistics cannot be relied on for the final
selection of test items is that they may in fact push the test in a
direction that it should not go. For example, suppose that one wants
to test knowledge of words needed for college-level reading of texts in
mathematics. (We ignore for the sake of the argument at this point
the question of whether a 'vocabulary' test as distinct from other
types of tests is really more of a measure of vocabulary knowledge
than of other things. Indeed, for the sake of the argument at this
point, let us assume that a valid test of 'vocabulary knowledge' can be
constructed.) By selecting words from mathematics texts, we might
construct a satisfactory test. But suppose that for whatever reason
certain words like ogle, rustle, shimmer, sheen, chiffonier, spouse,
fettered, prune, and pester creep into the examination, and suppose
further that they all produce acceptable item statistics. Are they
therefore acceptable items for a test that purports to measure the
vocabulary necessary to the reading of mathematics texts?
Or, to change the example radically, do words like ogle and chiffonier
belong in the TOEFL Vocabulary subtest? With the right field of
distractors, either of these might produce quite acceptable item
statistics - indeed the first member of the pair did appear in the
TOEFL. Is it a word that foreign students applying to American
universities need to know? As a certain Mr Jones pointed out, it did
produce very acceptable item statistics. Should it therefore be
included in the test? If such items are allowed to stand, then what is to
prevent a test from gravitating further and further away from
common forms and usages to more and more esoteric terms that
produce acceptable item statistics?
To return to the example concerning the comprehension of words
in mathematics texts, it is conceivable that a certain population of
very capable students will know all the words included in the
vocabulary test. Therefore, the items would turn out to be too easy by
item statistics criteria. Does this necessarily mean that the items are
not sufficiently difficult for the examinees? To claim this is like
arguing that a ten foot ceiling cannot be measured with a ten foot
tape. It is like arguing that a ten foot tape is the wrong instrument to
use for measuring a ten foot ceiling because the ceiling is not high
enough (or alternatively because the tape is too short).
Or, to look at a different case, suppose that all of the subjects in a
certain population perform very badly on our vocabulary test. The
item statistics may be unacceptable by the usual standards. Does it
necessarily follow that the test must be made less difficult? Not
necessarily, because it is possible that the subjects to be tested do not
know any of the words in the mathematics texts - e.g., they may not
know the language of the texts. To claim therefore that the test is not
valid and/or that it is too difficult is like claiming that a tape measure
is not a good measure of length, and/or that it is not short enough,
because it cannot be used to measure the distance between adjacent
points on a line.
For all of the foregoing reasons, test item statistics alone cannot be
used to select or (in the limiting cases just discussed) to reject items.
The interpretation of test item statistics must be subordinate to
higher considerations. Once a test is available which has certain
independent claims to validity, item analysis may be a useful tool for
refining that test and for attaining slightly (perhaps significantly)
higher levels of validity. Such statistics are by no means, however,
final criteria for the selection of items.
D. Referencing tests against non-native performance
The evolution of a test or testing procedure is largely determined by
the assumptions on which that test or procedure is based. This is
particularly true of institutional or standardized tests because of their
longer survival expectancy as compared against classroom tests that
are usually used only once and in only one form. Until now, discrete
point test philosophy has been the principal basis underlying
standardized tests of all sorts.
Since discrete point theories of language testing were generally
articulated in relation to the performance of non-native speakers of a
given target language, most of the language tests based on such
theorizing have been developed in reference to the performance of
non-native speakers of the language in question. Generally, this has
been justified on the basis of the assumption that native speakers
should perform flawlessly, or nearly so, on language tasks that are
normally included in such tests. However, native speakers of a
language do vary in their ability to perform language related tasks. A
child of six years may be just as much a native speaker of English as
an adult of age twenty-five, but we do not expect the child to be able
to do all of the things with English that we may expect of the adult -
hence, there are differences in skill attributable to age or maturation.
Neither do we expect an illiterate farmer who has not had the
educational opportunities of an urbanite of comparable abilities to be
able to read at the same grade level and with equal comprehension -
hence, there are differences due to education and experience.
Furthermore, recent empirical research, especially Stump (1978),
has shown that normal native speakers do vary greatly in proficiency
and that this variance may be identical with what has formerly been
called IQ and/or achievement.
Thus, we must conclude that there is a real choice: language tests
can either be referenced against the performance of native speakers or
they may be referenced against the performance of non-native
speakers. Put more concretely, the effectiveness of a test item (or a
subtest, or an entire battery of tests) may be judged in terms of how it
functions with natives or non-natives in producing a range of scores -
or in producing meaningful variance between better performers and
worse performers.
If a non-native reference population is used, test writers will tend to
prepare items that maximize the variability within that population. If
native speakers are selected as a reference population, test writers will
tend to arrange items so as to maximize the variability within that
population. Or more accurately, the test writers will attempt to make
the test(s) in either case as sensitive as possible to the variance in
language proficiency that is actually characteristic of the population
against which the test is referenced.
In general, the attempt to maximize the sensitivity of a test to true
variabilities in tested populations is desirable. This is what test
validity is about. The rub comes from the fact that in the case of
language tests, the ability of non-native speakers to answer certain
discrete test items correctly may be unrelated to the kinds of ability
that native speakers display when they use language in normal
contexts of communication.
There are a number of salient differences between the performance
of native speakers of a given language and the performance of
non-natives who are at various stages of development in acquiring the
same language as a second or foreign language. Among the
differences is the fact that native speakers generally make fewer
errors, less severe errors, and errors which have no relationship to
another language system (i.e., native speakers do not have foreign
accents, nor do they tend to make errors that originate in the
syntactic, semantic, or pragmatic system of a competing language).
Native speakers are typically able to process material in their native
language that is richer in organizational complexities than the typical
non-native can handle (other things such as age, educational
experience, and the like being equal). Non-natives have difficulty in
achieving the same level of skill that native speakers typically exhibit
in handling jokes, puns, riddles, irony, sarcasm, facetious humor,
hyperbole, double entendre, subtle innuendo, and so forth. Highly
skilled native speakers are less susceptible to false analogies (e.g.,
pronounciation for pronunciation, ask it to him on analogy with forms
like tell it to him) and are more capable of making the appropriate
analogies afforded by the richness of their own linguistic system (e.g.,
the use of parallel phrasing across sentences, the use of metaphors,
similes, contrasts, comparisons, and so on).
Because of the contrasts in native and non-native performance
which we have noted above, and many others, the effect of
referencing a test against one population rather than the other may be
quite significant. Suppose the decision is made to use non-natives as a
reference population - as, for instance, the TOEFL test writers
decided to do in the early 1960s. What will be the effect on the
eventual form of the test items? How will they be apt to differ from
test items that are referenced against the performance of native
speakers?
If the variance in the performance of natives is not completely
similar to the variance in the performance of non-natives, it follows
that items which work well in relation to the variance in one will not
necessarily work well in relation to the variance in the other. In fact,
we should predict that some of the items that are easy for native
speakers should be difficult for non-natives and vice versa - some of
the items that are easy for non-natives should be more difficult for
native speakers. This last prediction seems anomalous. Why should
non-native speakers be able to perform better on any language test
item than native speakers? From the point of view of a sound theory
of psycholinguistics, the fact is that native speakers should always
outperform non-natives, other things being equal. However, if a
given test of language proficiency is referenced against the
performance of non-native speakers, and if the variance in their
performance is different from the variance in the performance of
natives, it follows that some of the items in the test will tend to
gravitate toward portions of variance in the reference population that
are not characteristic of normal language use by native speakers.
Hence, some of the items on a test referenced against non-native
performance will be more difficult for natives than for non-natives,
and many of the items on such tests may have little or nothing to do
with actual ability to communicate in the tested language.
Why is there reason to expect variance in the language skills of non-native speakers to be somewhat skewed as compared against the
variance in native performance (due to age, education, and the like)?
For one thing, many non-native speakers - perhaps most non-natives
who were the reference populations for tests like the TOEFL, the
Modern Language Association Tests, and many other foreign
language tests - are exposed to the target language primarily in
somewhat artificial classroom contexts. Further, they are exposed
principally to materials (syntax based pattern drills, for instance) that
are founded on discrete point theories of teaching and analyzing
languages. They are encouraged to form generalizations about the
nature of the target language that would be very uncharacteristic of
native speaker intuitions about how to say and mean things in that
same language. No native speaker, for example, would be apt to
confuse going to a dance in a car with going to a dance with a girl, but
non-natives may be led into such confusions. Forms like going to a
foreign country with an airplane and going to a foreign country in an
airplane are often confused due to classroom experience - see Chapter
6, section C.
The kinds of false analogies, or incorrect generalizations, that non-natives are lured into by poorly conceived materials combined with
good teaching might be construed as the basis for what could be
called a kind of freak grammar - that is, a grammar that is suited only
for the rather odd contexts of certain teaching materials and that is
quite ill-suited for the natural contexts of communication. If a test
then is aimed at the variance in performance that is generated by
more or less effective internalization of such a freak grammar, it
should not be surprising that some of the items which are sensitive to
the knowledge that such a grammar expresses would be impervious to
the knowledge that a more normal (i.e., native speaker) grammar
specifies. Similarly, tests that are sensitive to the variance in natural
grammars might well be insensitive to some of the kinds of discrete
point knowledge characteristically taught in language materials.
If the foregoing predictions were correct, interesting contrasts in
native and non-native performance on tests such as the TOEFL
should be experimentally demonstrable. In a study by Angoff and
Sharon (1971), a group of native speaking college students at the
University of New Mexico performed less well than non-natives on
21% of the items in the Writing Ability section of that examination.
The Writing Ability subtest of the TOEFL consists primarily of items
aimed at assessing knowledge of discrete aspects of grammar, style,
and usage. The fact that some items are harder for natives than for
non-natives draws into question the validity of those items as
measures of knowledge that native speakers possess. Apparently
some of the items are in fact sensitive to things that non-natives are
taught but that native speakers do not normally learn. If the test were
normed against the performance of native speakers in the first place,
this sort of anomaly could not arise. By this logic, native performance
is a more valid criterion against which to judge the effectiveness of test
items than non-native performance is.
Another sense in which the performance of non-native speakers
may be skewed (i.e., characteristic of unusual or freak grammars) is in
the relationship between skills, aspects of skills, and components of
skills. For instance, the fact that scores on a test of listening
comprehension correlate less strongly with written tests of reading
comprehension, grammatical knowledge (of the discrete point sort),
and so-called writing ability (as assessed by the TOEFL, for
instance), than the latter tests correlate with each other (see Tables 1
and 2 in section B above) may be a function of experiential bias rather
than a consequence of the factorial structure of language proficiency.
Many non-native speakers who are tested on the TOEFL, for
example, may not have been exposed to models who speak fluent
American English. Furthermore, it may well be that experience with
such fluent models is essential to the development of listening skill
hand in hand with speaking, reading, and writing abilities. Hence, it is
possible that the true correlation between skills, aspects, and
components of skills is much higher under normal circumstances
than has often been assumed. Further, the best approach if this is true
would be to make the items and tests maximally sensitive to the
meaningful variance present in native speaker performance (e.g., that
sort of variance that is due to normal maturation and experience).
In short, referencing tests against the performance of non-native
speakers, though statistically an impeccable decision, is hardly
defensible from the vantage point of deeper principles of validity and
practicality. In a fundamental and indisputable sense, native speaker
performance is the criterion against which all language tests must be
validated because it is the only observable criterion in terms of which
language proficiency can be defined. To choose non-native
performance as a criterion whenever native performance can be
obtained is like using an imitation (even if it is a good one) when the
genuine article is ready to hand. The choice of native speaker
performance as the criterion against which to judge the validity of
language proficiency tests, and as a basis for refining and developing
them, guarantees greater facility in the interpretation of test scores,
and more meaningful test sensitivities (i.e., variance).
Another incidental benefit of referencing tests against native
performance is the exertion of a healthy pressure on materials writers,
teachers, and administrators to teach non-native speakers to do what
natives do - i.e., to communicate effectively - instead of teaching
them to perform discrete point drills that have little or no relation to
real communication. Of course, there is nothing surprising to the
successful classroom teacher in any of these observations. Many
language teachers have been devoting much effort to making all of
their classroom activities as meaningful, natural, and relevant to the
normal communicative uses of language as possible, and they have
been doing so for many years.
KEY POINTS
1. Statistical reasoning is sometimes difficult and can easily be misleading.
2. There is no known rational way of deciding what percentage of items on
a discrete point test should be devoted to the assessment of a particular
skill, aspect, or component of a skill. Indeed, there cannot be any basis
for componential analysis of language tests into phonology, syntax, and
vocabulary subtests, because in normal uses of language all components
work hand in hand and simultaneously.
3. The difficulty of representatively sampling the universe of possible
sentences in a language or discourses in a language is insurmountable.
4. The sentences or discourses in a language which actually occur are but a
small portion (an infinitesimally small portion) of the ones which could
occur, and they are non-repetitive due to the very nature of human
experience.
5. The discrete point method of sampling the universe of possible phrases,
or sentences, or discourses, is about as applicable to the fundamental
problem of language testing as the method of listing examples of phrases,
sentences, or discourses is to the basic problem of characterizing
language proficiency - or psychologically real grammatical systems.
6. The solution to the grammarian's problem is to focus attention at a
deeper level - not on the surface forms of utterances, but on the
underlying capacity which generates not only a particular utterance, but
utterances in general. The solution to the tester's problem is similar - namely, to focus attention not on the sampling of phrases, sentences, or
discourses per se, but rather on the assessment of the efficiency of the
developing learner capacity which generates sequences of linguistic
elements in the target language (i.e., the efficiency of the learner's
psychologically real grammar that interprets and produces sequences of
elements in the target language in particular correspondences to
extralinguistic contexts).
7. Low correlations have sometimes been interpreted incorrectly as
showing that tests with different labels are necessarily measures of what
the labels name. There are, however, many other sources of low
correlations. In fact, tests that are measures of exactly the same thing
may correlate at low levels if one or both are unreliable, too hard or too
easy, or simply not valid (e.g., if both are measures of nothing).
8. It cannot reasonably be argued that a low correlation between tests with
different labels is due to a lack of validity for one of the tests and is also
evidence that the tests are measuring different skills.
9. Observed high correlations between diverse language tests cannot be
dismissed as mere indications of reliability - they must indicate that the
proficiency factor underlying the diverse performances is validly tapped
by both tests. Furthermore, such high correlations are not ambiguous in
the way that low correlations are.
10. The expectation of low correlations between tests that require diverse
language performances (e.g., listening as opposed to reading) is drawn
from discrete point theorizing (especially the componentializing of
language skills), but is strongly refuted when diverse language tests (e.g.,
cloze and dictation, sentence paraphrasing and essay writing, etc.) are
observed to correlate at high levels.
11. To assume that high correlations between diverse tests are merely an
indication that the tests are reliable is to treat different tests as if they
were the same. If they are not in fact the same, and if they are treated as
the same, what justification remains for treating any two tests as different
tests (e.g., a phonology test compared against a vocabulary test)? To
follow such reasoning to its logical conclusion is to obliterate the
possibility of recognizing different skills, aspects, or components of
skills.
12. Statistical procedures merit the position of slaves to educational
purposes much the way hammers and nails merit the position of tools in
relation to building shelves. If the tools are elevated to the status of
procedures for defining the shape of the shelves or what sort of books
they can hold, they are being misused.
13. Acceptable item statistics do not guarantee valid test items, nor do
unacceptable item statistics prove that a given item is not valid.
14. Tests must have independent and higher claims to validity before item
statistics per se can be meaningfully interpreted.
15. Language tests may be referenced against the performance of native or
non-native speakers.
16. Native and non-native performance on language tasks contrast in a
number of ways including the frequency and severity of errors, type of
errors (interference errors being characteristic only of non-native
speech), and the subtlety and complexity of the organizational
constraints that can be handled (humor, sarcasm, etc.).
17. Non-native performance is also apt to be skewed by the artificial contexts
of much classroom experience.
18. If the foregoing generalizations were correct, we should expect natives to
perform more poorly on some of the items aimed at assessing the
knowledge that non-natives acquire in artificial settings.
19. Angoff and Sharon (1971) found that natives did more poorly than non-natives on 21% of the items on the Writing Ability section of the TOEFL
(a test referenced against the performance of non-natives to start with).
20. If, on the other hand, native performance is set as the main criterion,
language tests can easily be made more interpretable and necessarily
would achieve higher validity (other things being equal).
21. Another advantage of referencing language tests against the performance of native speakers is to place a healthy pressure on what happens
in classrooms - a pressure toward more realistic uses of language for
communicative purposes.
DISCUSSION QUESTIONS
1. Estimate the number of distinctive sounds there are in English (i.e.,
phonemes). Then, estimate the number of syllables that can be
constructed from those sounds. Which number is greater? Suppose that
there were no limit on the length or structure of a syllable (or say, a
sequence of sounds). How many syllables would there be in English?
Expressions like John and Mary and Bill and ... can be extended
indefinitely. Therefore, how many possible sequences are there of the
type? What other examples of recursive strings can you exemplify (i.e.,
strings whose length can always be increased by reapplying a principle
already used in the construction of the string)? How many such strings
are there in English? In any other language? Discuss the applicability of
sampling theory to the problem of finding a representative set of such
structures for a language test or subtest.
2. How could you demonstrate by logic that the number of possible
sentences in a language must be smaller than the number of possible
conversations? Or that the number of possible words must be smaller
than the number of possible sentences? Or that in general (i.e., in all
cases) the number of possible units of a lower order of structure must be
smaller than the number of possible units of a higher order? Further, if it
is impossible to sample representatively the phrases of a language, what
does this imply with respect to the sentences? The discourses?
3. What unit of discourse is more highly constrained by rules of an
expectancy system, a word or a phrase? A phrase or a sentence? A
sentence or a dialogue? A chapter or an entire novel? Can you prove by
logic that larger units of language are necessarily more highly
constrained by cognitive factors than lower order units? Can you show
by logic that some of the constraints on phrases simply do not exist for
words and that some of the constraints on sentences do not exist for
phrases and so forth? If so, what implications does your reasoning hold
for language tests? What, for instance, is the difference between a
vocabulary item without any additional context as opposed to, say, a
vocabulary item in the context of a sentence, or in the context of a
paragraph, or in the context of an entire essay or speech? Which sort of
test is more apt to mirror faithfully what native speakers do when they
use words - perhaps even when they use the word in question in the item?
4. Consider the problem of trying to decide what percentage of items on a
language test should be devoted to phonology as opposed to vocabulary
or syntax. What proportion of normal speech is represented by a strict
attention to phonology as opposed to vocabulary?
5. Consider the meaning of a word in a particular context. For instance,
suppose someone runs up to you and tells you that your closest friend has
just been in an automobile accident. Is the meaning associated with the
friend's name in that context the same as the meaning in a context where
you are told that this friend will be leaving for Australia within a
month? In what ways are the two uses of the friend's name (a proper
noun in these cases) similar? Different? What if the same person is
standing nearby and you call him by name? Or suppose that he is not
nearby and you mention his name to some other person. Or perhaps you
refer to your friend by name in a letter addressed to himself, or to
someone else. In what sense are all of these uses of the name the same,
and in what sense are they different? If we extend the discussion then to
other classes of words which are pragmatically more complex than
proper nouns, or to grammatical forms that are more complex still, in
what sense can any utterance be said to be a repetition of any other
utterance? Discuss the implications for language teaching and language
testing. If utterances and communicative events in general are
non-repetitive, what does this imply for language tests? Be careful not to
overlook the similarities between utterances on different occasions.
6. What crucial datum must be supplied to prove (or at least support) the
notion that a test labeled a 'Vocabulary' test is in fact more a measure of
vocabulary knowledge than of grammatical knowledge?
7. Why not interpret low correlations as proof that the intercorrelated tests
are valid measures of what their labels imply?
8. Why not generally assume that high correlations are mere indications of
test reliability? In what cases is such a claim justified? When is it not
justified? What is the crucial factor that must be considered?
9. What evidence is there that syntactic and lexical knowledge may be more
closely interrelated than was once thought? What does a learner have to
know in order to select the best synonym for a given word when the
choices offered and the given word are presented in isolation from any
particular context?
10. Suppose that all of the items in a given test or subtest produce acceptable
statistics. What additional criteria need to be met in order to prove that
the test is a good test of whatever it purports to measure? Or, considered
from a different angle, what criteria might obtain which would make the
/ test unacceptable in spite of the statistics?
11. Suppose the statistics indicate that some of the items on a test are not
valid, or possibly that none are acceptable. What additional criteria
should be considered before the test is radically revised?
12. Give a cloze test to a group of non-native speakers and to a group of
native speakers. Compare the diversity of errors and the general degree
of agreement on response choices. For instance, do natives tend to show
greater or lesser diversity of responses on items that require words like to,
for, if, and, but, however, not, and so forth than non-natives? What about
blanks that require content words like nouns, verbs, and adjectives?
What are the most obvious contrasts between native and non-native
responses?
13. Analyze the errors of native speakers on an essay task (or any other
pragmatic task) and compare them to those of a group of non-natives.
14. In what ways are tests influenced by classroom procedures and
conversely how do tests influence what happens in classrooms? Has the
TOEFL, for instance, had a substantial influence on the teaching of EFL
abroad? Or consider the influence of tests like the SAT, or the American
College Tests, or the Comprehensive Tests of Basic Skills, or IQ tests
in general.
SUGGESTED READINGS
1. Anne Anastasi, 'Reliability,' Chapter 5 in Psychological Testing, 4th ed.,
New York: Macmillan, 1976, 103-133.
2. Anne Anastasi, 'Validity: Basic Concepts,' Chapter 6 in Psychological
Testing, 4th ed., New York: Macmillan, 1976, 134-161.
3. Lee J. Cronbach and P. E. Meehl, 'Construct Validity in Psychological
Tests,' Psychological Bulletin 1955, 52, 281-302.
4. Robert L. Ebel, 'Must All Tests Be Valid?' in G. H. Bracht, Kenneth D.
Hopkins, and Julian C. Stanley (eds.) Perspectives in Educational and
Psychological Measurement. Englewood Cliffs, New Jersey: Prentice-Hall,
1972, 74-87.
5. Calvin R. Petersen and Francis A. Cartier, 'Some Theoretical Problems
and Practical Solutions in Proficiency Test Validity,' in R. L. Jones and
B. Spolsky (eds.) Testing Language Proficiency. Arlington, Va.: Center
for Applied Linguistics, 1975, 105-118.
Discrete Point Tests
A. What they attempt to do
B. Theoretical problems in isolating
pieces of a system
C. Examples of discrete point items
D. A proposed reconciliation with
pragmatic testing theory
Here several of the goals of discrete point theory are considered. The
theoretical difficulty of isolating the pieces of a system is considered
along with the diagnostic aim of specific discrete point test items. It is
concluded that the virtues of specific diagnosis are preserved in
pragmatic tests without the theoretical drawbacks and artificiality of
discrete item tests. After all, the elements of language only express
their separate identities normally in full-fledged natural discourse.
A. What they attempt to do
Discrete point tests attempt to achieve a number of desirable goals.
Perhaps the foremost among them is the diagnosis of learner
difficulties and weaknesses. The idea is often put forth that if the
teacher or other test interpreter is able to learn from the test results
exactly what the learner's strengths and weaknesses are, he will be
better able to prescribe remedies for problems and will avoid wasting
time teaching the learner what is already known.
Discrete point tests attempt to assess the learner's capabilities to
handle particular phonological contrasts from the point of view of
perception and production. They attempt to assess the learner's
capabilities to produce and interpret stress patterns and intonations
on longer segments of speech. Special subtests are aimed at
knowledge of vocabulary and syntax. Separate tests for speaking,
listening, reading, and writing may be devised. Always it is correctly
assumed that individuals will differ, some being better in certain skills
and components of knowledge while others are better in other skills
and components. Moreover, it is assumed that a given individual (or
group) may show marked differences in, say, pronunciation skills as
opposed to listening comprehension, or in reading and writing skills
as opposed to listening and speaking skills.
A second goal implicit in the first is the prescription of teaching
remedies for the weaknesses in learner skills as revealed by discrete
point tests. If it is possible to determine precisely what is the profile of
a given learner with respect to the inventory of phonological
contrasts that are possible in a given language, and with respect to
each other skill, aspect or component of a skill as measured by some
subtest which is part of a battery of discrete point tests, then it should
be possible to improve course assignments, specific teaching
objectives, and the like. For instance, if tests reveal substantial
differences in speaking and listening skills as opposed to reading and
writing skills, it might make sense to split students into two streams
where in one stream learners are taught reading and writing skills
while in the other they are taught listening and speaking skills.
Other breakdowns might involve splitting instructional groups
according to productive versus receptive skills, that is, by putting
speaking and writing skills into one course curriculum (or a series of
course curricula advancing from the beginning level upward), or
according to whatever presumed components all learners must
acquire. For instance, phonology might be taught in one class where
the curriculum would concentrate on the teaching of pronunciation
or listening discrimination, or both. Another class (or period of time)
might be devoted to enhancing vocabulary knowledge. Another
could be devoted to the teaching of grammatical skills (pattern drills
and syntax). Or, if one were to be quite consistent with the spirit of
discrete point theorizing, there should be separate classes for the
teaching of vocabulary (and each of the other presumed components
of language proficiency) for reading, writing, speaking, and listening.
Correspondingly, there would be phonology for speaking, and
phonology for listening, sound-symbol instruction for reading, and
sound-symbol instruction for writing, and so on.
A third related goal for discrete point diagnostic testing would be
to put discrete point teaching on an even firmer theoretical footing.
Special materials might be devised to deal with precisely the points of
difficulty encountered by learners in just the areas of skill that need
attention. There could be pronunciation lessons focussing on specific
phonological contrasts; vocabulary exercises focussing on the
expansion of receptive or productive repertoires (or speaking or
listening repertoires); syntax drills designed to teach certain patterns
of structure for speaking, and others designed to teach certain
patterns for listening, and others for reading and yet others for
writing; and so on until all components and skills were exhausted.
These three goals, that is, diagnosing learner strengths and
weaknesses, prescribing curricula aimed at particular skills, .and
developing specific teaching strategies to help learners overcome
particular weaknesses, are among the laudable aims of discrete point
testing. It should be noted, however, that the theoretical basis of
discrete point teaching is no better than the empirical results of
discrete point testing. The presumed components of grammar are no
more real for practical purposes than they can be demonstrated to be
by the meaningful and systematic results of discrete point tests aimed
at differentiating those presumed components of grammar. Further,
the ultimate effectiveness of the whole philosophy of discrete point
linguistic analysis, teaching, and testing (not necessarily in any
particular order) is to be judged in terms of how well the learners who
are subjected to it are thereby enabled to communicate information
effectively in the target language. In brief, the whole of discrete point
methodology stands or falls on the basis of its practical results. The
question is whether learners who are exposed to such a method (or
family of methods) actually acquire the target language.
The general impotence of such methods can be attested to by
almost any student who has studied a foreign language in a classroom
situation. Discrete point methods are notoriously ineffective. Their
nearly complete failure is demonstrated by the paucity of fluent
speakers of any target language who have acquired their fluency
exclusively in a classroom situation. Unfortunately, since classroom
situations are predominantly characterized by materials and methods·
that derive more or less directly from discrete point linguistic
analysis, the verdict seems inescapable: discrete point methods don't
work.
The next obvious question is why. How is it that methods which
have so much authority, and just downright rational analytic appeal,
fail so widely? Surely it is not for lack of dedication in the profession.
It cannot be due to a lack of talented teachers and bright students, nor
that the methods have not been given a fair try. Then, why?
B. Theoretical problems in isolating pieces of a system
Discrete point theories are predicated on the notion that it is possible
to separate analytically the bits and pieces of language and then to
teach and/or test those elements one at a time without reference to the
contexts of usage from which those elements were excised. It is an
undeniable fact, however, that phonemes do not exist in isolation. A
child has to go to school to learn that he knows how to handle
phonemic contrasts - to learn that his language has phonemic
contrasts. It may be true that he unconsciously makes use of the
phonemic contrast between see and say, for instance, but he must go
to school to find out that he has such skills or that his language
requires them. Normally, the phonemic contrasts of a language are
no more consciously available to the language user than harmonic
intervals are to a music lover, or than the peculiar properties of
chemical elements are to a gourmet cook. Just as the relations
between harmonics are important to a music lover only in the context
of a musical piece (and probably not at all in any other context), and
just as the properties of chemical elements are of interest to the cook
only in terms of the gustatory effects they produce in a roast or dish of
stew, phonemic contrasts are principally of interest to the language
user only in terms of their effects in communicative exchanges - in
discourse.
Discrete point analysis necessarily breaks the elements of language
apart and tries to teach them (or test them) separately with little or no
attention to the way those elements interact in a larger context of
communication. What makes it ineffective as a basis for teaching or
testing languages is that crucial properties of language are lost when
its elements are separated. The fact is that in any system where the
parts interact to produce properties and qualities that do not exist in
the parts separately, the whole is greater than the sum of its parts. If the
parts cannot just be shuffled together in any old order - if they must
rather be put together according to certain organizational constraints
- those organizational constraints themselves become crucial
properties of the system which simply cannot be found in the parts
separately.
An example of a discrete point approach to the construction of a
test of 'listening grammar' is offered by Clark (1972):
Basic to the growth of student facility in listening comprehension
is the development of the ability to isolate and
appropriately interpret important syntactical and morphological
aspects of the spoken utterance such as tense, number,
person, subject-object distinctions, declarative and imperative
structures, attributions, and so forth. The student's knowledge
of lexicon is not at issue here; and for that matter, a direct way
of testing the aural identification of grammatical functions
would be to use nonsense words incorporating the desired
morphological elements or syntactic patterns. Given a sentence
such as 'Le muglet a été candré par la friblonne,' [roughly
translated from French, The muglet has been candered by the
friblun, where muglet, cander, and friblun are nonsense words]
the student might be tested on his ability to determine: 1) the
time aspect of the utterance (past time), 2) the 'actor' and 'acted
upon' ('friblonne' and 'muglet', respectively), and the action
involved ('candre') [p. 53f].
First, it is assumed that listening skill is different from speaking skill, or reading skill, or writing skill. Further, that lexical knowledge as related to the listening skill is one thing while lexical knowledge as related to the reading skill is another, and further still that lexical knowledge is different from syntactic (or morphological) knowledge as each pertains to listening skill (or 'listening grammar'). On the basis of such assumptions, Clark proposes a very logical extension: in order to separate the supposedly separate skills for testing it is necessary to eliminate lexical knowledge from consideration by the use of nonsense words like muglet, cander, and friblun. He continues,
If such elements were being tested in the area of reading comprehension, it would be technically feasible to present printed nonsense sentences of this sort upon which the student would operate. In a listening comprehension situation, however, the difficulty of retaining in memory the various strange words involved in the stimulus sentence would pose a listening comprehension problem independent of the student's ability to interpret the grammatical cues themselves. Instead of nonsense words (which would in any event be avoided by some teachers on pedagogical grounds), genuine foreign-language vocabulary is more suitably employed to convey the grammatical elements intended for aural testing [p. 54].
Thus, a logical extension of discrete point testing is offered for
reading comprehension tests, but is considered inappropriate for
listening tests. Let us suppose that such items as Clark is suggesting
were used in reading comprehension tests to separate syntactic
knowledge from lexical knowledge. In what ways would they differ
from similar sentences that might occur in normal conversation,
prose, or discourse? Compare The muglet has been candered by the
friblun with The money has been squandered by the freeloader. Then
consider the question whether the relationships that hold between the
grammatical subject and its respective predicate in each case is the same. Add a third example: The pony has been tethered by the barnyard. It is entirely unclear in the nonsense example whether the relationship between the muglet and the friblun is similar to the relationship between the money and the freeloader or whether it is similar to the relationship between the pony and the barnyard. How could such syntactic relationships and the knowledge of them be tested by such items? One might insist that the difference between squandering something and tethering something is strictly a matter of lexical knowledge, but can one reasonably claim that the relationship between a subject and its predicate is strictly a lexical relationship? To do so would be to erase any vestige of the original distinction between syntax and vocabulary. The fact that the money is in some sense acted upon by the freeloader who does something with it, namely, squanders it, and that the pony is not similarly acted upon by the barnyard is all bound up in the syntax and in the lexical items of the respective sentences, not to mention their potential pragmatic relations to extralinguistic contexts and their semantic relations to other similar sentences.

It is not even remotely possible to represent such intrinsically rich complexities with nonsense items of the sort Clark is proposing. What is the answer to questions like: Can fribluns be candered by muglets? Is candering something that can be done to fribluns? Can it be done to muglets? There are no answers to such questions, but there are clear and obvious answers to questions like: Can barnyards be tethered by ponies? Is tethering something that can be done to barnyards? Can it be done to ponies? Can freeloaders be squandered by money? Is squandering something that can be done to freeloaders? Can it be done to money? The fact that the latter questions have answers and that the former have none is proof that normal sentences have properties that are not present in the bones of those same sentences stripped of meaning. In fact, they have syntactic properties that are not present if the lexical items are not there to enrich the syntactic organization of the sentences.

The syntax of utterances seems to be just as intricately involved in the expression of meaning as the lexicon is, and to propose testing syntactic and lexical knowledge separately is like proposing to test the speed of an automobile with the wheels first and the engine later.

It makes little difference to the difficulties that discrete point testing creates if we change the focal point of the argument from the sentence level to the syllable level or to the level of full-fledged discourse.
Syllables have properties in discourse that they do not have in
isolation and sentences have properties in discourse that they do not
have in isolation and discourse has properties in relation to everyday
experience that it does not have when it is isolated from such
experience. In fact, discourse cannot really be considered discourse at
all if it is not systematically related to experience in ways that can be
inferred by speakers of the language. With respect to syllables,
consider the stress and length of a given syllable such as /rɛd/ as in He read the entire essay in one sitting and in the sentence He read it is what he did with it (as in response to What on earth did you say he did with it?).
Can a learner be said to know a syllable on the basis of a discrete
test item that requires him to distinguish it from other similar
syllables? If the learner knew all the syllables of the language in that
sense would this be the same as knowing the language?
For the word syllable in the preceding questions, substitute the
words sound, word, syntactic pattern, but one must not substitute the
words phrase, sentence, or conversation, because they certainly cannot
be adequately tested by discrete item tests. In fact it is extremely
doubtful that anything much above the level of the distinctive sounds
(or phonemes) of a language can be tested one at a time as discrete
point theorizing requires. Furthermore, it is entirely unclear what
should be considered an adequate discrete point test of knowledge of
the sounds or sound system of a language. Should it include all
possible pairs of sounds with similar distributions? Just such a
pairing would create a very long test if it only required discrimination
decisions about whether a heard pair of sounds was the same or
different. Suppose a person could handle all of the items on the test.
In what sense could it be said that he therefore knows the sound
system of the tested language?
The fact is that the sounds of a language are structured into
sequences that make up syllables which are structured in complex
ways into words and phrases which are themselves structured into
sentences and paragraphs or higher level units of discourse, and the
highest level of organization is rather obviously involved in the lowest
level of linguistic unit production and interpretation. The very same
sequence of sounds in one context will be taken for one syllable and in
another context will be taken for another. The very same sound in one
context may be interpreted as one word and in another as a
completely different word (e.g., 'n, as in He's 'n ape, and in This 'n that
'n the other). A given sequence of words in one context may be taken
to mean exactly the opposite of what they mean in another context (e.g., Sure you will, meaning either No, you won't or Yes, you will).
All of the foregoing facts and many others that are not mentioned here make the problems of the discrete item writer not just difficult but insurmountable in principle. There is no way that the normal facts of language can adequately be taught or tested by using test items or teaching materials that start out by destroying the very properties of language that most need to be grasped by learners. How can a person learn to map utterances pragmatically onto extralinguistic contexts in a language that he does not yet know (that is, to express and interpret information in words about experience) if he is forced to deal with words and utterances that are never related to extralinguistic experience in the required ways? The answer is that no one can learn a language on the basis of the principles advocated by discrete point theorists. This is not because it is very difficult to learn a language by experiencing bits and pieces of it in isolation from pragmatic contexts; it is because it is impossible to learn a language by experiencing bits and pieces of it in that way.
For the same reason, discrete test items that aim to test the
knowledge of language independent of the use of that knowledge in
normal contexts of communication must also fail. No one has ever
proposed that instead of running races at the Olympics contestants
should be subjected to a battery of tests including the analysis of
individual muscle potentials, general quickness, speed of bending the
leg at the knee joint, and the like - rather the speed of runners is tested
by having them run. Why should the case be so different for language
testing? Instead of asking, how well can a particular language learner
handle the bits and pieces of presumed analytical components of
grammar, why not ask how well the learner can use all of the
components (whatever they are) in dealing with discourse?
In addition to strong logic, there is much empirical evidence to
show that discrete point methods of teaching fail and that discrete
point methods of testing are inefficient. On the other hand, there are
methods of teaching (and learning languages) that work - for
instance, methods of teaching where the pragmatic mapping of
utterances onto extralinguistic contexts is made obvious to the
learner. Similarly, methods of testing that require the examinee to
perform such mapping of sequences of elements in the target
language are quite efficient.
C. Examples of discrete point items
The purpose of this section is to examine some examples of discrete
point items and to consider the degree to which they produce the
kinds of information they are supposed to produce - namely,
diagnostic information concerning the mastery of specific points of
linguistic structure in a particular target language (and for some
testers, learners from a particular background language). In spite of
the fact that vast numbers of discrete point test categories are possible
in theory, they always get pared down to manageable proportions
even by the theorists who advocated the more proliferate test designs
in the first place.
For example, under the general heading of tests of phonology a
goodly number of subheadings have been proposed including:
subtests of phonemic contrasts, stress and intonation, subclassed still
further into subsubtests of recognition and production not to
mention the distinctions between word stress versus sentence stress
and so on. In actuality, no one has ever devised a test that makes use
of all of the possible distinctions, nor is it likely that anyone ever will
since the possible distinctions can be multiplied ad infinitum by
the same methods that produced the commonly employed distinctions. This last fact, however, has empirical consequences in the demonstrable fact that no two discrete point testers (unless they have imitated each other) are apt to come up with tests that represent precisely the same categories (i.e., subtests, subsubtests, and the like).
Therefore, the items used as examples here cannot represent all of the
types of items that have been proposed. They do, however, represent
commonly used types.
First, we will consider tests of phonological skills, then vocabulary, then grammar (usually limited to a narrow definition of syntax - that is, having to do with sequential relations between words, phrases, or clauses).
1. Phonological items. Perhaps the most often recommended and most widely used technique for assessing 'recognition' or 'auditory discrimination' is the minimal pair technique or some variation of it. Lado (1961), Harris (1969), Clark (1972), Heaton (1975), Allen and Davies (1977), and Valette (1977) all recommend some variant of the technique. For instance, Lado suggests reading pairs of words with minimal sound contrasts while the students write down 'same' or 'different' (abbreviated to 'S' or 'D') for each numbered pair. To test 'speakers of Spanish, Portuguese, Japanese, Finnish' and other language backgrounds who are learning English as a foreign or second language, Lado proposes items like the following:
1. sleep; slip
2. fist; fist
3. ship; sheep
4. heat; heat
5. jeep; gyp
6. leap; leap
7. rid; read
8. mill; mill
9. neat; knit
10. beat; bit (Lado, 1961, p. 53).
Another item type which is quite similar is offered by both Lado
and Harris. The specific examples here are from Lado. The teacher
(or examiner) reads the words (with identical stress and intonation)
and asks the learner (or examinee) to indicate which words are the
same. If all are the same, the examinee is to check A, B, and C, on the
answer sheet. If none is the same he is to check O.
1. cat; cat; cot
2. run; sun; run
3. last; last; last
4. beast; best; best
5. pair; fair; chair (Lado, 1961, p. 74).
Now, let us consider briefly the question of what diagnostic information such test items provide. Suppose a certain examinee misses
item 2 in the second set of items given above. What can we deduce
from this fact? Is it safe to say that he doesn't know /s/ or is it /r/? Or
could it be he had a lapse of attention? Could he have misunderstood
the item instructions or marked the wrong spot on the answer sheet?
What teaching strategies could be recommended to remedy the
problem? What does missing item 2 mean with respect to overall
comprehension?
Or suppose the learner misses item 4 in the second set given above.
Ask the same questions. What about item 5 where three initial
consonants are contrasted? The implication of the theory that highly
focussed discrete point items are diagnostic by virtue of their being
aimed at specific contrasts is not entirely transparent.
What about the adequacy of coverage of possible contrasts? Since
it can be shown that the phonetic form of a particular phoneme is
quite different when the phoneme occurs initially (after a pause or
silence) rather than medially (between other sounds) or finally (before
a pause or silence), an adequate recognition test for the sounds of
English should presumably assess contrasts in all three positions. If
the test were to assess only minimal contrasts, it should presumably
test separately each vowel contrast and each consonantal contrast
(ignoring the chameleonic phonemes such as /r/ and /l/ which have properties of vowels and consonants simultaneously, not to mention /w/ and /y/ which do not perfectly fit either category). It would have
to be a very long test indeed. If there were only eleven vowels in English, the matrix of possible contrasts would be eleven times eleven, or 121, minus eleven (the diagonal pairs of the matrix which involve contrasts between each element and itself, or the null contrasts), or 110, divided by two (to compensate for the fact that the top half of the matrix is identical to the bottom half). Hence, the number of non-redundant pairs of vowels to be contrasted would be at least 55. If we add in the number of consonants that can occur in initial, medial, and final position, say, about twenty (to be on the conservative side), we must add another 190 pairs times the three positions, or 570, which plus 55 equals 625 items. Furthermore, this estimate is still low because it does not account for consonant clusters, diphthongs, or vocalic elements that can occur in initial or final positions.
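The counting just sketched is easy to verify mechanically. The following minimal sketch (in Python; the eleven-vowel, twenty-consonant, three-position figures are the ones assumed above, and the snippet is an illustration of the arithmetic, not a test plan) reproduces the totals:

    # Counting the non-redundant minimal-pair items implied in the text.
    # Assumed figures: 11 vowels, 20 consonants, 3 consonant positions
    # (initial, medial, final).
    from math import comb

    vowels, consonants, positions = 11, 20, 3

    vowel_pairs = comb(vowels, 2)          # (11*11 - 11) / 2 = 55
    consonant_pairs = comb(consonants, 2)  # (20*20 - 20) / 2 = 190
    total = vowel_pairs + consonant_pairs * positions

    print(vowel_pairs, consonant_pairs * positions, total)  # 55 570 625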
Suppose the teacher decides to test only a sampling of the possible
contrasts and develops a 100 item test. How will the data be used?
Suppose there are twenty students in the class where the test is to be
used. There would be 2,000 separate pieces of data to be used by the
teacher. Suppose that each student missed a slightly different set of
items on the test. How would this diagnostic information be used to
develop different teaching strategies for each separate learner?
Suppose that the teacher actually had the time and energy to sit down
and go through the tests one at a time looking at each separate item
for each separate learner. How would the item score for each separate
learner be translated into an appropriate teaching strategy in each
case? The problem we come back to is how to interpret a particular
performance on a particular item on a highly focussed discrete point
test. It is something like the problem of trying to determine the
composition of sand in a particular sand box in relation to a certain
beach by comparing the grains of sand in the box with the grains of
sand on the beach - one at a time. Even if one were to set out to
improve the degree of correspondence how would one go about it
and what criterion of success could be conceived?
Other types of items proposed to test phonological discrimination are minimal sentence pairs such as:
1. Will he sleep? Will he slip?
2. They beat him. They bit him.
3. Let me see the sheep. Let me see the sheep. (Lado, 1961, p. 53).
Lado suggests that these items are more valid than words in isolation
because they are more difficult: 'The student does not know where the
difference will occur if it does occur' (p. 53). He argues that such a
focussed sort of test item is to be preferred over more fully
contextualized discourse samples because the former insures that the
student has actually perceived the sound contrast rather than merely
guessing the meaning or understanding the context instead of
perceiving the 'words containing the difficult sounds' (p. 54). Lado
refers to such guessing factors and context clues as 'non-language
factors'.
But let's consider the matter a bit further. In what possible context would comprehension of a sentence like, Will he sleep? depend on someone's knowledge of the difference between sleep and slip? Is the knowledge associated with the meaning of the word sleep and the sort of states and the extralinguistic situations that the word is likely to be associated with less a matter of language proficiency than knowledge of the contrast between /iy/ and /I/? Is it possible to conceive of a context in which the sentence, Will he sleep? would be likely to be taken for the sentence, Will he slip? How often do you suppose slipping and sleeping would be expected to occur in the same contexts? Ask the same questions for each of the other example sentences.
Further, consider the difficulty of sentences such as the ones used in
2, where the first sentence sets up a frame that will not fit the second. If
the learner assumes that the they in They beat him has the same
referential meaning as the subsequent they in They bit him the verb bit
is unlikely. People may beat things or people or animals, but him
seems likely to refer to a person or an animal. Take either option.
Then when you hear, They bit him close on the heels of They beat him,
what will you do with it? Does they refer to the same people and does
him refer to the same person or animal? If so, how odd. People might
beat a man or a dog, but would they then be likely to bite him? As a result of the usual expectancies that normal language users will generate in perceiving meaningful sequences of elements in their language, the second sentence is more difficult with the first as its antecedent. Hence, the kind of contextualization proposed by Lado to increase item validity may well decrease item validity. The function of the sort of context that is suggested for discrete point items of phonological contrasts is to mislead the better language learners into false expectancies instead of helping them (on the basis of normally correct expectancies set up by discourse constraints) to make subtle sound distinctions.
The items pictured in Figures 9-13 represent a different sort of attempt to contextualize discrete point contrasts in a listening mode.
In Figure 9, for instance, both Lado (1961) and Harris (1969) have in
mind testing the contrast between the words ship and sheep. Figure
10, from Lado (1961), is proposed as a basis for testing the distinction between watching and washing. Figure 11, also from Lado, is proposed as a basis for testing the contrasts between pin, pen, and pan - of course, we should note that the distinction between the first two (pin and pen) no longer exists in the most widely used varieties of American
English. Figure 12 aims to test the initial consonant distinctions
between sheep and jeep and the vowel contrast between sheep and
ship. Figure 13 offers the possibility of testing several contrasts by
asking the examinee to point to the pen, pin, pan, picture, pitcher; the
person who is watching the dishes (according to Lado, 1961, p. 59) and
the person who is washing the dishes.
Figure 9. The ship/sheep contrast, Lado (1961, p. 57) and Harris (1969, p. 34).
Figure 10. The watching/washing contrast, Lado (1961, p. 57).
Figure 11. The pin/pen/pan contrast, Lado (1961, p. 58).
Figure 12. The ship/jeep/sheep contrast, Lado (1961, p. 58).
Figure 13. 'Who is watching the dishes?' (Lado, 1961, p. 59).
Pertinent questions for the interpretation of errors on items of the
type related to the pictures in Figures 9-11 are similar to the questions
posed above in relation to similar items without pictures. If it were
difficult to prepare an adequate test to cover the phonemic contrasts
of English without pictures it would surely be more difficult to try to
do it with pictures. Presumably the motivation for using pictures is to
increase the meaningfulness of the test items - to contextualize them,
just as in the case of the sentence frames discussed two paragraphs
earlier. We saw that the sentence contexts actually are apt to create
false expectancies which would distract from the purpose of the
items. What about the pictures?
Is it natural to say that the man in Figure 13 is watching the dishes?
It would seem more likely that he might watch the woman who is
washing the dishes. Or consider the man watching the window in
Figure 10. Why is he doing that? Does he expect it to escape? To
leave? To hatch? To move? To serve as an exit for a criminal who is
about to try to get away from the law? If not for some such reason, it
would seem more reasonable to say that the man in the one picture is
staring at a window and the one in the other picture is washing a
different window. If the man were watching the same window, how is it
that he cannot see the man who is washing it? The context, which is
proposed by Lado to make the contrasts meaningful, not only fails to
represent normal uses of language accurately, but also fails to help
the learner to make the distinctions in question. If the learner does
not already know the difference between watching and washing, and
if he was not confused before experiencing the test item he may well
be afterward. If the learner does not already know the meaning of the
words ship, sheep, pin, pan, and pen, and if the sound contrasts are
difficult for the learner to perceive, the pictures in conjunction with
meaningless similar sounding forms merely serve as a slightly richer
basis for confusion. Why should the learner who is already having
difficulty with the distinction say between pin and pen have any less
difficulty after being exposed to the pictures associated with words
which he cannot distinguish? If he should become able to perceive the
distinction on the basis of some teaching exercise related to the test
item types, on what possible basis is it reasonable to expect the learner
to associate correctly the (previously unfamiliar) word sheep, for
instance, with the picture of the sheep and the word ship with the
picture of the ship?
The very form of the exercise (or test item) has placed the
contrasting words in a context where all of the normal bases for the
distinction in meaning have been deliberately obliterated. The learner is very much in the position of the child learning to spell to whom it is pointed out that the pairs of spellings their and there, pare and pair, son and sun represent different meanings and not to get confused about which is which. Such a method of presentation is almost certain to confuse the learner concerning which meaning goes with which spelling.
2. Vocabulary items. It is usually suggested that knowledge of words should be referenced against the modality of processing - that is, the vocabulary one can comprehend when reading. Hence, it is often claimed that there must be separate vocabulary tests for each of the traditionally recognized four skills, at least for receptive and productive repertoires. Above, especially in Chapter 3, we considered
an alternative explanation for the relative availability of words in
listening, speaking, reading, and writing. It was in fact suggested that
it is probably the difficulty of the task and the load it places on
memory and attention that creates the apparent differences in
vocabulary across different processing tasks. Ask yourself the
question whether you know or do not know a word you may have
difficulty in thinking of at a particular juncture. Would you know it
better if it were written? Less well if you heard it spoken? If you could
understand its use by someone else how does this relate to your ability
or inability to use the same word appropriately? It would certainly
appear that there is room for the view that a single lexicon may
account for word knowledge (whatever it may be) across all four
skills. It may be merely the accessibility of words that changes with the nature of the processing task rather than the words actually in the
lexicon.
In any event, discrete point theory requires tests of vocabulary and
often insists that there must be separate tests for what are presumed
to be different skill areas. A frequently-used type of vocabulary test is
one aimed specifically at the so-called reading skill. For instance,
Davies (1977) suggests the following vocabulary item:
Our tom cat has been missing ever since that day I upset his milk.
A. wild
B. drum
C. name
D. male
One might want to ask how useful the word tom in the sense given is
for the students in the test population. Further, is it not possible that a
wild tom cat became the pet in question and was then frightened off
by the incident? What diagnostic information can one infer (since this
is supposed to be a diagnostic type of item) from the fact that a
particular student misses the item selecting, say, choice C, name?
Does it mean that he does not know the meaning of tom in the sense
given? That he doesn't know the meaning of name? That he doesn't
know any of the words used? That he doesn't understand the
sentence? Or are not all of these possibilities viable as well as many
other combinations of them? What is specific about the diagnostic
information supposedly provided by such an item?
Another item suggested by Davies (1977) is of a slightly different
type:
Cherry:
Red Fruit Vegetable Blue Cabbage
Sweet Stalk Tree Garden
The task of the examinee is to order the words offered in relation to
their closeness in meaning to the given word cherry. Davies avows, 'it
may be argued that these tests are testing intelligence, particularly
example 2 [of the two examples given immediately above] which
demands a very high degree of literacy, so high that it may be entirely
intelligence that is being tested here' (p. 81). There are several
untested presuppositions in Davies' remark. One of them is that we
know better what we are talking about when we speak of 'intelligence'
than when we speak of language skill. (On this topic see the
Appendix, especially part D.) Another is that the words in the
proffered set of terms printed next to the word cherry have some
intrinsic order in relation to cherries.
The difficulty with this item, as with all of the items of its type, is that
the relationship between cherries and cabbage, or gardens, etc., has a
great deal more to do with where one finds the cherries at the moment
than with something intrinsic to the nature of the word cherry. At one
moment the fact that a cherry is apt to be found on a cherry tree may
be the most important defining property. In a different context the
fact that some of the cherries are red and therefore edible may carry
more pragmatic cash value than the fact that it is a fruit. In yet
another context sweetness may be the property of greatest interest. It
is an open empirical question whether items of the sort in question
can be scored in a sensible way and whether or not they will produce a
high correlation with tests of reading ability.
Lado (1961) was among the first language testers to suggest vocabulary items like the first of Davies' examples given above. For instance, Lado suggested items in the following form:
Integrity
A. intelligence
B. uprightness
C. intrigue
D. weakness
Another alternative suggested by Lado (1961, p. 189) was:
The opposite of strong is
A. short
B. poor
C. weak
D. good
Similar items in fact can be found in books on language testing by
Harris (1969), Clark (1972), Valette (1967, 1977), Heaton (1975) and
in many other sources. In fact, they date back to the earliest forms of
so-called 'intelligence' and also 'reading' tests (see Gunnarsson,
1978, and his references).
Two nagging questions continue to plague the user of discrete
point vocabulary tests. The first is whether such tests really measure
(reliably and validly) something other than what is measured by tests
that go by different names (e.g., grammar, or pronunciation, not to
mention reading comprehension or IQ). The second is whether the
kind of knowledge displayed in such tests could not better be
demonstrated in tasks that more closely resemble normal uses of
language.
3. Grammar items. Again, there is the problem of deciding what
modality is the appropriate one, or how many different modalities
must be used in order to test adequately grammatical knowledge
(whatever the latter may be construed to be). Sample items follow
with sources indicated (all of them were apparently intended for a
written mode of presentation):
i. 'Does John speak French?'
   'I don't know what ...'
   A. does
   B. speaks
   C. he (Lado, 1961, p. 180).
ii. When ________?
   A. plan
   B. do
   C. to go
   D. you (Harris, 1969, p. 28).
iii. I want to ____ home now.
   A. gone
   B. went
   C. go
   D. going (Davies, 1977, p. 77).
Similar items can be found in Clark (1972), Heaton (1975), and Valette (1977). Essentially they concentrate on the ordering of words or phrases in a minimal context, or they require selection of the appropriate continuation at some point in the sentence. Usually no larger context is implied or otherwise indicated.
In the Appendix we examine the correlation between a set of tests
focussing on the formation of appropriate continuations in a given
text, another set requiring the ordering of words, phrases, or clauses
in similar texts, and a large battery of other tests. The results suggest
that there is no reasonable basis for claiming that the so-called
vocabulary (synonym matching) type of test items are measuring
anything other than what the so-called grammar items (selecting the
appropriate continuation, and ordering elements appropriately) are
measuring. Further, these tests do not seem to be doing anything
different from what standard dictation and cloze procedure can
accomplish. Unless counter evidence can be produced to support the
super-structure of discrete point test theory, it would appear to be in
grave empirical difficulty.
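The kind of evidence at issue here is simple to compute. For readers who want to examine the same question with their own class data, the following is a minimal sketch of the ordinary Pearson product-moment coefficient between two sets of scores; the eight pairs of scores are invented purely for illustration and stand for, say, a 'vocabulary' subtest and a 'grammar' subtest taken by the same examinees:

    # Pearson product-moment correlation between two score columns.
    # The data are invented; substitute real subtest scores.
    from math import sqrt

    def pearson_r(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    vocabulary = [12, 15, 9, 18, 14, 11, 16, 13]
    grammar = [14, 16, 10, 19, 13, 12, 17, 14]
    print(round(pearson_r(vocabulary, grammar), 2))  # 0.96 for these invented data

A coefficient near 1.0 for two subtests that go by different names is exactly the sort of result alluded to above.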
D. A proposed reconciliation with pragmatic testing theory
From the arguments presented in this chapter and throughout this
entire book - especially all of Part Two - one might be inclined to
think that the 'elements' of language, whatever they may be, should
never be considered at all. Or at least, one might be encouraged to
read this recommendation between the lines. However, this would be
a mistake. What, after all, does a pragmatic test measure? Does it not
in fact measure the examinee's ability to make use of the sounds,
syllables, words, phrases, intonations, clauses, etc. in the contexts of
normal communication? Or at least in contexts that faithfully mirror
normal uses oflanguage? If the latter is so, then pragmatic tests are in
fact doing what discrete point testers wanted done all along. Indeed,
pragmatic tests are the only reasonable approach to testing language
skills if we want to know how well the examinee can use the elements
of the language in real-life communication contexts.
What pragmatic language .tests accomplish is precisely what
discrete point testers were hoping to do. The advantage that pragmatic
tests offer to the classroom teacher and to the educator in general is
that they are far easier to prepare than are tests of the discrete point
type, and they are nearly certain to produce more meaningful and
more readily interpretable results. We will see in Chapter 9 that the
preparation and production of multiple choice tests is no simple
task. We have already seen that the determination of how many items
of certain types to include in discrete point tests poses intrinsically
insoluble and pointless theoretical and practical mind-bogglers. For
instance, how many vocabulary items should be included? Is tom as
in tom cat worth including? What is the relative importance of vowel
contrasts as compared against morphological contrasts (e.g., plural,
possessive, tense marking, and the like)? Which grammatical points
found in linguistic analyses should be found in language tests
focussed on 'grammar'? What relative weights should be assigned to
the various categories so determined? How much is enough to
represent adequately the importance of determiners? Subject raising?
Relativization? The list goes on and on and is most certainly not even
close to being complete in the best analyses currently available.
The great virtue, the important insight of linguistic analysis, is in
demonstrating that language consists of complicated sequences of
elements, subsequences of sequences, and so forth. Further, linguistic
research has helped us to see that the elements of language are to a
degree analyzable. Discrete point theory tried to capitalize on this
insight and pushed it to the proverbial wall. It is time now to reevaluate the results of the application. Recent research with
pragmatic language tests suggests that the essential insights of
discrete point theories can be more adequately expressed in
pragmatic tests than in overly simplistic discrete point approaches
which obliterate crucial properties of language in the process of
taking it to pieces. The pieces should be observed, studied, taught,
and tested (it would seem) in the natural habitat of discourse rather
than in isolated sentences pulled out of the clear blue sky.
In Part Three we will consider ways in which the diagnostic
information sought by discrete point theory in isolated items aimed at
particular rules, words, sound contrasts and the like can much more
sensibly be found in learner protocols related to the performance of
pragmatic discourse processing tasks - where the focus is on
communicating something to somebody rather than merely filling in
some blank in some senseless (or nearly senseless) discrete item pulled
from some strained test writer's brain. The reconciliation of discrete
point theory with pragmatic testing is accomplished quite simply. All
we have to do is acknowledge the fact that the elements of language are normally used in discourse for the purposes of communication - by the latter term we include all of the abstract, expressive, and poetic uses of language as well as the wonderful mundane uses so familiar to
all normal human beings.
KEY POINTS
1. Discrete point approaches to testing derive from discrete point
approaches to teaching. They are mutually supportive.
2. Discrete point tests are supposed to provide diagnostic input to specific
teaching remedies for specific weaknesses.
3. Both approaches stand or fall together. If discrete point tests cannot be
shown to have substantial validity, discrete point teaching will be
necessarily drawn into question.
4. Similarly, the validity of discrete point testing and all of its instructional
applications would be drawn into question if it were shown that discrete
point teaching does not work.
5. Discrete point teaching is a notorious failure. There is an almost complete absence of persons who have actually learned a foreign language on the basis of discrete point methods of teaching.
6. The premise of discrete point theories, that language can be taken to
pieces and put back together in the curriculum, is apparently false.
7. Any discourse in any natural language is more than the mere sum of its
analyzable parts. Crucial properties of language are lost when it is
broken down into discrete phonemic contrasts, words, structures and the
like.
8. Nonsense, of the sort recommended by some experts as a basis for
discrete point test items, does not exhibit many of the pragmatic
properties of normal sensible utterances in discourse contexts.
9. The trouble is that the lowest level units of discourse are involved in the
production and interpretation of the highest level units. They cannot,
therefore, be separated without obliterating the characteristic relationships between them.
10. No one can learn a language (or teach one) on the basis of the principles
advocated by discrete point theorists.
11. Discrete point tests of posited components are often separated into the
categories of phonological, lexical, and syntactic tests.
12. It can easily be shown that such tests, even though they are advocated as
diagnostic tests, do not provide very specific diagnostic information at
all.
13. Typically, discrete items in multiple choice format require highly
artificial and unnatural distinctions among linguistic forms.
14. Further, when an attempt to contextualize the items is made, it usually
falls flat because the contrast itself is an unlikely one in normal discourse
(e.g., watching the baby versus washing the baby).
15. Discrete items offer a rich basis for confusion to any student who may
already be having trouble with whatever distinction is required.
16. Pragmatic tests can be shown to do a better job of what discrete point
testers were interested in accomplishing all along.
17. Pragmatic tests assess the learner's ability to use the 'elements' of
language (whatever they may be) in the normal contexts of human
discourse.
18. Moreover, pragmatic tests are superior diagnostic devices.
DISCUSSION QUESTIONS
1. Obtain a protocol (answer sheet and test booklet) from a discrete point test (sound contrasts, vocabulary, or structure). Analyze each item,
trying to determine exactly what it is that the student does not know on
each item answered incorrectly.
2. Repeat the procedure suggested in question 1, this time with any
protocol from a pragmatic task for the same student. Which procedure
yields more finely grained and more informative data concerning what
the learner does and does not know? (For recommendations on particular pragmatic tests, see Part Three.)
3. Interpret the errors found in questions 1 and 2 with respect to specific
teaching remedies. Which of the two procedures (or possibly, the several
techniques) yields the most obvious or most useful extensions to
therapeutic interventions? In other words, which test is most easily
interpreted with respect to instructional procedures?
4. Is there any information yielded by discrete point testing procedures that
is not also available in pragmatic testing procedures? Conversely, is there
anything in the pragmatic procedures that is not available in the discrete
point approaches? Consider the question of sound contrasts, word
usages, structural manipulations, and communicative activities.
5. What is the necessary relationship between being able to make a
particular sound contrast in a discrete item test and being able to make
use of it in communication? How could we determine if a learner were
not making use of a particular sound contrast in conversation? Would the discrete point item help us to make this determination? How? What about word usage? Structural manipulation? Rhythm? Intonation?
6. Take any discrete point test and analyze it for content coverage. How
many of the possible sound contrasts does it test? Words? Structural
manipulations? Repeat the procedure for a pragmatic task. Which
procedure is more comprehensive? Which is apt to be more representative? Reliable? Valid? (See the Appendix on the latter two issues,
also Chapter 3 above.)
7. Analyze the two tests from the point of view of naturalness of what they require the learner to do with the language. Consider the implications,
presuppositions, entailments, antecedents, and consequences of the
statements or utterances used in the pragmatic context. For instance, ask
what is implied by a certain form used and what it suggests which may
not be stated overtly in the text. Do the same for the discrete point items.
8. Can the content of a pragmatic test be summarized? Developed? Expanded? Interpolated? Extrapolated? What about the content of items in a discrete point test? Which test has the richer forms, meaning-wise? Which forms are more explicit in meaning, more determinate?
Which are more complex?
SUGGESTED READINGS
1. John L. D. Clark, Foreign Language Testing: Theory and Practice.
Philadelphia: Center for Curriculum Development, 1972.
2. Robert Lado, Language Testing. London: Longman, 1961.
3. Rebecca Valette, Modern Language Testing. New York: Harcourt, 1977.
9
Multiple Choice Tests
A. Is there any other way to ask a
question?
B. Discrete point and integrative
multiple choice tests
C. About writing items
D. Item analysis and its interpretation
E. Minimal recommended steps for
multiple choice test preparation
F. On the instructional value of
multiple choice tests
The main purpose of this chapter is to clarify the nature of multiple choice tests - how they are constructed, the subjective decisions that go into their preparation, the minimal number of steps necessary before they can be reasonably used in classroom contexts, the incredible range and variety of tasks that they may embody, and finally, their general impracticality for day to day classroom application. It will be shown that multiple choice tests can be of the discrete point or integrative type or anywhere on the continuum in between the two extremes. Some of them may further meet the naturalness requirements for pragmatic language tests. Thus, this chapter provides a natural bridge between Part Two (contra discrete point testing) and Part Three (an exposition of pragmatic testing techniques).
A. Is there any other way to ask a question?
At a testing conference some years ago, it was reported that the
following exchange took place between two of the participants. The
speaker (probably John Upshur) was asked by a would-be discussant
if multiple choice tests were really all that necessary. To which
Upshur (according to Eugene Briere) quipped, 'Is there any other way
to ask a question?' End of discussion. The would-be contender
withdrew to the comfort and anonymity of his former sitting
position.
When you think about it, conversations are laced with decision
points where implicit choices are being constantly made. Questions
imply a range of alternatives. Do you want to go get something to
eat? Yes or no. How about a hamburger place, or would you rather
have something a little more elegant? Which place did you have in
mind? Are you speaking to me (or to someone else)? Questions just
naturally seem to imply alternatives. Perhaps the alternatives are not
usually so well defined as they are in multiple choice tests, and
perhaps the implicit alternatives are not usually offered to confuse or
trap the person in normal communication though they are explicitly
intended for that purpose in multiple choice tests, but in both cases
there is the fundamental similarity that alternatives (explicit or
implicit) are offered. Pilate asked Jesus, 'What is truth?' Perhaps he
meant, 'There is no answer to this question,' but at the same time he
appeared to be interested in the possibility of a different view. Even
abstract rhetorical questions may implicitly afford alternatives.
It would seem that multiple choice tests have a certain naturalness,
albeit a strained one. They do in fact require people to make decisions
that are at least similar in the sense defined above to decisions that
people are often required to make in normal communication. But
this, of course, is not the main argument in favor of their use. Indeed, the strain that multiple choice tests put on the flow of normal
communicative interactions is often used as. an argument against
them.
The favor that multiple choice tests enjoy among professional
testers is due to their presumed 'objectivity', and concomitant
reliability of scoring. Further, when large numbers of people are to be
tested in short periods of time with few proctors and scorers, multiple
choice tests are very economical in terms of the effort and expense
they require. The questions of validity posed in relation to language
tests (or other types of tests) in general are still the same questions,
and the validity requirements to be imposed on such tests should be
no less stringent for multiple choice versions than for other test
formats. It is an empirical question whether in fact multiple choice
tests afford any advantage whatsoever over other types of tests. It is
not the sort of question that can be decided by a vote of the American
(or any other) Psychometric Association. It can only be decided by
appropriate research (see the Appendix, also see Oller and Perkins,
1978).
The preparation and evaluation of specific multiple choice tests
hinges on two things: the nature of the decision required by test items,
and the nature of the alternatives offered to the examinee on each
item. It is a certainty that no multiple choice test can be any better
than the items that constitute it, nor can it be any more valid than the
choices it offers examinees at requisite decision points. From this it
follows that the multiple choice format can only be advantageous in
terms of scoring and administrative convenience if we have a good
multiple choice test in the first place.
It will be demonstrated here that the preparation of sound multiple
choice tests is sufficiently challenging and technically difficult to make
them impracticable for most classroom needs. This will be
accomplished by showing some of the pitfalls that commonly trap
even the experts. The formidable technical problem of item analysis
done by hand will be shown to all but completely eliminate multiple
choice formats from consideration. Further, it will be argued that the
multiple choice format is intrinsically inimical to the interests of
instruction. What multiple choice formats gain in reliability and ease
of administration, in other words, is more than used up in detrimental
instructional effects and difficulty of preparation.
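To appreciate what item analysis 'done by hand' involves, consider that even the simplest classical analysis requires, for every single item, a facility (difficulty) index and a discrimination index contrasting high and low scorers. The sketch below is a minimal illustration in Python of the common upper-lower group method; the response matrix is invented, and real analyses (taken up in section D) involve considerably more:

    # Classical item analysis: facility and upper-lower discrimination.
    # The response matrix is invented; rows are examinees, columns are
    # items (1 = correct, 0 = incorrect).
    responses = [
        [1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1],
        [0, 0, 0, 1], [1, 1, 0, 1], [0, 0, 1, 0], [1, 1, 1, 1],
    ]

    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    k = len(responses) // 3              # size of the upper and lower groups
    low, high = order[:k], order[-k:]

    for item in range(len(responses[0])):
        facility = sum(row[item] for row in responses) / len(responses)
        p_high = sum(responses[i][item] for i in high) / k
        p_low = sum(responses[i][item] for i in low) / k
        print(f"item {item + 1}: facility={facility:.2f}, "
              f"discrimination={p_high - p_low:.2f}")

Doing this bookkeeping by hand for every item on every administration is exactly the labor referred to above.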
B. Discrete point and integrative multiple choice tests
In Chapter 8 above, we already examined a number of multiple
choice items of a discrete point type. There were items aimed at
phonological contrasts, vocabulary, and 'grammar' (in the rather
narrow sense of surface morphology and syntax). There are,
however, many item types that can easily be put into a multiple choice
format, or which are usually found in such a format but which are not
discrete point items. For instance, what discrete elements of language
are tested in a paraphrase recognition task such as the following?
Match the given sentence with the alternative that most nearly
says the same thing:
Before the turn of the century, the tallest buildings were
rarely more than three storeys above ground (adapted from
Heaton, 1975, p. 186).
A. After the turn of the century, buildings had more
storeys above ground.
B. Buildings rarely had as many as four storeys above
ground up until the turn of the century.
C. At about the turn of the century, buildings became
more numerous and considerably taller than ever
before.
D. Buildings used to have more storeys above ground
than they did at about the turn of the century.
It would be hard to say precisely what point of grammar, vocabulary,
etc. is being tested in the item just exemplified. Could a test composed
of items of this type be called a test of reading comprehension? How
about paraphrase recognition? Language proficiency in general?
What if it were presented in a spoken format?
As we have noted before, the problem of what to say a test is a test
of is principally an issue of test validity. It is an empirical question.
What we can safely say on the basis of the item format alone is what
the test requires the learner to do - or at least what it appears to
require. Perhaps, therefore, it is best to call the item type a 'sentence paraphrase recognition' task. Thus, by naming the task rather than positing some abstract construct we avoid a priori validity commitments - that is, we suspend judgement on the validity questions pending empirical investigation. Nevertheless, whatever we choose to call the specific item type, it is clearly more at the integrative side of the continuum than at the discrete point end.
There are many other types of multiple choice items that are
integrative in nature. Consider the problem of selecting answers to
questions based on a text. Such questions may focus on some detail of
information given in the text, the general topic of the text, something
implied by the text but not stated, the meaning of a particular word,
phrase, or clause in the text, and so forth. For example, read the
following text and then select the best answers to the questions that
follow:
Black Students in Urban Canada is an attempt to provide
information to urban Canadians who engage in educational
transactions with members of this ethnicity. Although the
OISE conference did not attract educators from either west of
Manitoba or from Newfoundland, it is felt that there is an
adequate minimum of relevance such that concerned urban
teachers from all parts of this nation may uncover something of
profit (D'Oyley and Silverman, 1976, p. vi).
(1) This paragraph is probably
A. an introduction to a Canadian novel.
B. a recipe for transactional analysis.
C. a preface to a conference report.
D. an epilog to ethnic studies.
(2) The word ethnicity as used in the paragraph has to do with
A. sex.
B. skin color.
C. birthplace.
D. all of the above.
(3) The message of the paragraph is addressed to
A. all educators.
B. urban educators.
C. urban Canadians involved in education.
D. members of the ethnic group referred to.
(4) The abbreviation OISE probably refers to the
A. city in question.
B. relevant province or state.
C. journal that was published.
D. sponsoring organization.
(5) It is implied that the ethnic group in question lives in
predominantly
A. rural settings.
B. suburban areas.
C. urban settings.
D. ghetto slums.
(6) Persons attending the meetings referred to were
apparently
A. law enforcement officers.
B. black students.
C. educators.
D. all of the above.
The preceding item type is usually found in what is called a 'reading
comprehension' test. Another way of referring to it is to say that it is a
task that requires reading and answering questions - leaving open the
question of what the test is a test of. It may, for instance, be a fairly
good test of overall language proficiency. Or, it may be about as good
a test of listening comprehension as of reading comprehension. These possibilities cannot be ruled out in advance on the basis of the superficial appearance of the test. Furthermore, it is certainly possible to change the nature of the task and make it into a listening and question answering problem. In fact, the only logical limits on the types of tests that might be constructed in similar formats are
whatever limitations exist on the creative imaginations of the test
writer. They could be converted, for instance, to an open-ended
format requiring spoken responses to spoken questions over a heard
text.
Not only is it possible to find many alternate varieties of multiple
choice tests that are clearly integrative in nature, but it is quite
possible to take just about any pragmatic testing technique and
convert it to some kind of multiple choice format more or less
resembling the original pragmatic technique. For example, consider a
cloze test over the preceding text - or, say, the continuation of it. We
might delete every fifth word and replace it with a field of alternatives
as follows:
Black Students in Urban (1) _____ (A) Europe (B) America (C) New Guinea (D) Canada
is an attempt to (2) _____ (A) find (B) provide (C) include (D) take
information to urban Canadians (3) _____ (A) which (B) while (C) who (D) to
engage in educational transactions (4) _____ (A) with (B) on (C) to (D) by
members of this ethnicity ....
Bear in mind the fact that either a printed format (see Jonz, 1974,
Porter, 1976, Hinofotis and Snow, 1977) or a spoken format would be
possible (Scholz, Hendricks, Spurling, Johnson, and Vandenburg, in
press). For instance, in a spoken format the text might be recorded as
'Black students in Urban blank one is an attempt to blank two
information to urban Canadians blank three engage in educational
transactions blank four members of this ethnicity .... ' The examinee
might see only the alternatives for filling in the numbered blanks, e.g.,
(1) __ (A) Europe (B) America (C) New Guinea (D) Canada
(2) __ (A) find (B) provide (C) include (D) take
etcetera. To make the task a bit easier in the auditory mode, the recorded text might be repeated one or more times. For some
exploratory work with such a procedure in an auditory mode see the
Appendix, also see Scholz, et al (in press). For other suggestions for
making the task simpler, see Chapters 10, 11, and 12 on factors that
affect the difficulty of discourse processing tasks.
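Incidentally, the mechanical part of constructing such a task - deleting every nth word and numbering the blanks - is trivial to automate; it is the choice of plausible distractors that takes the premeditation. A minimal sketch (Python, every fifth word, using the passage quoted above):

    # Build the skeleton of an every-nth-word cloze test.
    # Distractors still have to be supplied by the item writer.
    def make_cloze(text, n=5):
        words = text.split()
        answers = []
        for i in range(n - 1, len(words), n):  # every nth word
            answers.append(words[i])
            words[i] = f"({len(answers)}) _____"
        return " ".join(words), answers

    passage = ("Black Students in Urban Canada is an attempt to provide "
               "information to urban Canadians who engage in educational "
               "transactions with members of this ethnicity.")
    cloze_text, answers = make_cloze(passage)
    print(cloze_text)
    print(answers)  # ['Canada', 'provide', 'who', 'with'] - the same
                    # blanks as in the example above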
Once we have broached the possibility of using discourse as a basis for constructing multiple choice tasks, many variations on test item types can easily be conceived. For instance, instead of asking the
examinee to select the appropriate continuation at a particular point
in a text, he may be asked to select the best synonym or paraphrase
for an indicated portion of text from a field of alternatives. Instead of
focussing exclusively on words, it would be possible to use phrases or
clauses or larger units of discourse as the basis for items. Instead of a
synonym matching or paraphrase matching task, the examinee might
be required to put words, phrases, or clauses in an appropriate order
within a given context of discourse. Results from tasks of all these
types are discussed in greater detail in the Appendix (also see the
references given there). The important point is that any such tests are
merely illustrative of a bare smattering of the possible types.
The question concerning which of the possible procedures are best
is a matter for empirical consideration. Present findings seem to
indicate that the most promising multiple choice tasks are those that
require the processing of fairly extended segments of discourse - say,
150 words of text or more. However, a note of caution should be
sounded. The construction of multiple choice tests is generally a
considerably more complicated matter than the mere selection of an
appropriate segment of discourse. Although pragmatic tasks can
with some premeditation and creativity be converted into a variety of
multiple choice tests, the latter are scarcely as easy to use as the
original pragmatic tasks themselves (see Part Three). In the next
section we will consider some of the technical problems in writing
items (especially alternatives).
C. About writing items
There are only a handful of principles that need to be grasped in
writing good items, but there are a great many ways to violate any or
all of them. The first problem in writing items is to decide what sort of
items to write. The second problem is to write the items with suitable
distractors in each set of alternatives. In both steps there are many
pitfalls. Professionally prepared tests are usually based on explicit
instructions concerning the format for items in each section of the
test. Not only will the superficial lay-out of the items be described and
exemplified, but usually the kinds of fair content for questions will
also be more or less circumscribed, and the intended test population
will be described so as to inform item writers concerning the
appropriate range of difficulty of items to be included in each part of
the test.¹
¹ John Bormuth (1970) has developed an extensive argument for deriving multiple choice items from curricula via explicit and rigorous linguistic transformations. The items in his methodology are directly tied to sentences uttered or written in the curriculum. The argument is provocative. However, it presupposes that the surface forms of the sentences in the curriculum are all that could be tested. Normal discourse processing, on the other hand, goes far beyond what is stated overtly in surface forms (see Frederiksen's recent articles and his references). Therefore, Bormuth's interesting proposal will not be considered further here. I am indebted to Ron Mackay (of Concordia University in Montreal) for calling Bormuth's argument to my attention.
Unfortunately, the questions of test validity are generally
consigned to the statistician's department. They are rarely raised at
the point of item writing. However, as a rule of thumb, all of the
normal criteria for evaluating the validity of test content should be
applied from the earliest stages of test construction. The first
principle, therefore, would be to ask if the material to be included in
items in the test is somehow related to the skill, construct, or
curriculum that the test is supposed to assess or measure. Sad to say,
many of the items actually included in locally prepared, teacher-made,
or standardized multiple choice tests are often not subjected to this
primary evaluation. If a test fails this first evaluation, no matter how
elegantly its items are constructed, it cannot be any better than any
other ill-conceived test of whatever it is supposed to measure.
Assuming that the primary validity question has been properly
considered, the next problem is to write the best possible items of the
defined type(s). In some cases, it will not be necessary to write items
from scratch, but rather to select appropriate materials and merely
edit them or record them in some appropriate fashion. Let us assume
that all test items are to be based on samples of realistic discourse.
Arguments for this choice are given throughout the various chapters
of this book. Other choices could be made - for instance, sentences in
isolation could be used - but this would not change the principles
directly related to the construction of items. It would merely change
their fleshed-out realization in particular instances.
During the writing stage, each item must be evaluated for
appropriateness of content. Does it ask for information that people
would normally be expected to pay attention to in the discourse
context in question? Is the decision that is required one that really
seems to exercise the skill that the test as a whole is aimed at
measuring? Is the correct choice really the best choice for someone
who is good at the skill being measured (in this case, a good language
user)? Are the distractors actually attractive traps for someone who is
not so good at the skill in question? Are they well balanced in the
sense of going together as a set? Do they avoid the inclusion of
blatant (but extraneous) clues to the correct choice? In sum, is the
item a well-conceived basis for a choice between clear alternatives?
In addition to insisting that the items of interest be anchored to
realistic discourse contexts, on the basis of research findings
presented elsewhere in this text (especially the Appendix), we will
disregard many of the discrete point arguments of purity in item
types. In other words, we will abandon the notion that vocabulary
knowledge must be assessed as if it were independent of grammatical
skill, or that reading items should not include a writing aspect, etc. All
of the available empirical research seems to indicate that such
distinctions are analytical niceties that have no fundamental factual
counterparts in the variance actually produced by tests that are
constructed on the basis of such distinctions. Therefore, the
distinctions that we will make are principally in the types of tasks
required of the learner - not in the hypothetical skills or constructs to
be tested. For all of these reasons, we should also be clear about the
fact that the construction of a multiple choice test is not apt to
produce a test that is more valid than a test of a similar sort in some
other format. The point in building a multiple choice test is to attain
greater economy of administration and scoring. It is purely a
question of practicality and has little or nothing to do with reliability
and validity in the broader sense of these terms.
Most any discourse context can be dealt with in just about any
processing mode. For instance, consider a breakfast conversation.
Suppose that it involves what the various members of a family are
going to do that day, in addition to the normal 'pass the salt and
pepper' kind of conversation at breakfast. Nine-year-old Sarah spills
the orange juice while Mr Kowolsky is scalding his mouth on boiling
coffee and remonstrating that Mrs Kowolsky can't seem to cook a
thing without getting it too hot to eat. Thirteen-year-old Samuel
wants to know if he can have a dollar (make that five dollars) so he
can see that latest James Bond movie, and his mother insists that he
not forget the piano lesson at four, and to feed the cat ... It is
possible to talk about such a context; to listen to talk about it; to read
about it; to write about it. The same is true for almost any context
conceivable where normal people interact through the medium of
language.
It might be reasonable, of course, to start with a written text.
Stories, narratives, expository samples of writing, in fact, just about
any text may provide a suitable basis for language test material. It is
possible to talk about a story, listen to a story, answer questions
about a story, read a story, retell a story, write a story, and so forth.
What kinds of limits should be set on the selection of materials?
Obviously, one would not want to select test material that would
distract the test taker from the main job of selecting the appropriate
choices of the multiple choice items presented. Therefore, super-charged
topics about such things as rape, suicide, murder, and
heinous crimes should probably be avoided along with esoteric topics
of limited interest such as highly technical crafts, hobbies, games, and
the like (except, of course, in the very special case where the esoteric
or super-charged topic is somehow central to the instructional goals
to be assessed). Materials that state or imply moral, cultural, or racial
judgements likely to offend test takers should also probably be
avoided unless there is some specific reason for including them.
Let us suppose that the task decided on is a reading and question
answering type. Further, for whatever reasons, let us suppose that the
following text is selected:
Oliver Twist was born in a workhouse, and for a long time
after his birth there was considerable doubt whether the child
would live. He lay breathless for some time, rather unequally
balanced between this world and the next. After a few struggles,
however, he breathed, sneezed and uttered a loud cry.
The pale face of a young woman lying on the bed was raised
weakly from the pillow and in a faint voice she said, 'Let me see
the child and die.'
'Oh, you must not talk about dying yet,' said the doctor, as he
rose from where he was sitting near the fire and advanced
towards the bed.
'God bless her, no!' added the poor old pauper who was
acting as nurse.
The doctor placed the child in its mother's arms; she pressed
her cold white lips on its forehead; passed her hands over her
face; gazed wildly around, fell back - and died.
'It's all over,' said the doctor at last.
'Ah, poor dear, so it is!' said the old nurse.
'She was a good-looking girl, too,' added the doctor: 'where
did she come from?'
'She was brought here last night,' replied the old woman. 'She
was found lying in the street. She had walked some distance, for
her shoes were worn to pieces; but nobody knows where she
came from, or where she was going, nobody knows.'
'The old story,' said the doctor, shaking his head, as he
leaned over the body, and raised the left hand; 'no wedding-ring,
I see. Ah! Good night!' (Dickens, 1962, p. 1)
In framing questions concerning such a text (or any other) the first
thing to be considered is what the text says. What is it about? If it is a
story, like this one, who is referred to in it? What are the important
events? What is the connection between them? What is the
relationship between the people, events, and states of affairs referred
to? In other words, how do the surface forms in the text
pragmatically map onto states of affairs (or facts, imaginings, etc.)
which the text is about? The author of the text had to consider these
questions (at least implicitly) the same as the reader, or anyone who
would retell the story or discuss it. Linguistically speaking, this is the
problem of pragmatic mapping.
Thus, a possible place to begin in constructing test items would be
with the topic. What is the text about? There are many ways of posing
the question clearly, but putting it into a multiple choice format is a
bit more complicated than merely asking the question. Here we are
concerned with better and worse ways of forming such multiple
choice questions. How should the question be put, and what
alternatives should be offered as possible answers?
Consider some of the ways that the question can be badly put:
(1) The passage is about _____
A. a doctor
B. a nurse
C. an unmarried woman
D. a child
The trouble here is that the passage is in fact about all of the
foregoing, but is centered on none of them. If any were to be selected
it would probably have to be the child, because we understand from
the first paragraph of the text that Oliver Twist is the child who is
being born. Further, the attention of every person in the story is
primarily directed to the birth of this child. Even the mother is
concerned merely to look at him before she dies.
(2) A good title for this passage might be _ _ _ __
A. 'Too young to die.'
B. 'A cross too heavy.'
C. 'A child is born.'
D. 'God bless her, no!'
Perhaps the author has C in mind, but the basis for that choice is a bit
obscure. Mter all, it isn't just any child; and neither is it some child of
great enough moment to justify the generic sense of 'a child'.
Now, consider a question that gets to the point:
(3) The central fact talked about in the story is _____
A. the birth of Oliver Twist
B. the death of an unwed mother
C. an experience of an old doctor
D. an old and common story
Hence, the best choice must fit the facts well. It is essential that the
correct answer be correct, and further that it be better than the other
alternatives offered.
Another common problem in writing items arises when the writer
selects facts that are in doubt on the basis of the given information
and forces a choice between two or more possible alternatives.
(4) When the author says that 'for a long time after his birth there
was considerable doubt whether the child would live' he probably
means that _____
A. the child was sickly for months or possibly years
B. Oliver Twist did not immediately start breathing at birth
C. the infant was born with a respiratory disease
D. the child lay still without breathing for minutes after birth
The trouble here is that the text does not give a sufficient basis for
selecting between the alternatives. While it is possible that only B is
correct, it is not impossible (on the basis of given information) that
one of the other three choices is also correct. Therefore, the choice
that is intended by the author to be the correct one (say, B) is not a
very reasonable alternative. In fact, none of the alternatives is really a
good choice in view of the indeterminacy of the facts. Hence, the facts
ought to be clear on the basis of the text, or they should not be used as
content for test items.
Finally, once the factual content of the item is clear and after the
correct alternative has been decided on, there is the matter of
constructing suitable distractors, or incorrect alternatives. The
distractors should not give away the correct choice or call undue
attention to it. They should be similar in form and content to the
correct choice and they should have a certain attractiveness about
them.
For instance, consider the following rewrite of the alternatives
offered for (3):
A. the birth of Oliver Twist
B. the death of the young unwed mother of Oliver Twist
C. the experience of the old doctor who delivered Oliver Twist
D. a common story about birth and death among unwed mothers
There are several problems here. First, Oliver Twist is mentioned in
all but one of the alternatives, thus drawing attention to him and
giving a clue as to the correct choice. Second, the birth of Twist is
mentioned or implied in all four alternatives giving a second
unmistakable clue as to the correct choice. Third, the choices are not
well balanced - they become increasingly specific (pragmatically) in
choices B and C, and then jump to a vague generality in choice D.
Fourth, the correct choice, A, is the most plausible of the four even if
one has not read the text.
There are several common ways item writers often draw attention to
the correct choice in a field of alternatives without, of course,
intending to. For one, as we have already seen, the item writer may be
tempted to include the same information in several forms among the
various alternatives. This highlights that information. Another way
of highlighting is to include the opposite of the correct response. For
instance, as alternatives to the question about a possible title for the
text, consider the following:
A. 'The death of Oliver Twist.'
B. 'The birth of Oliver Twist.'
C. 'The same old story.'
D. 'God bless her, no!'
The inclusion of choice A calls attention to choice B and tends to
eliminate the other alternatives immediately.
The tendency to include the opposite of the correct alternative is
very common, especially when the focus is on a word or phrase
meaning:
(5) In the opening paragraph, the phrase 'unequally balanced
between this world and the next' refers to the fact that Oliver appears
to be _____
A. more alive than dead
B. more dead than alive
C. about to lose his balance
D. in an unpleasant mental state
To the test-wise examinee (or any moderately clever person), A and B
are apt to seem more attractive than C or D even if the examinee has
not read the original text about Twist.
Yet another way of cluing the test taker as to the appropriate
choice is to make it the longest and most complexly worded
alternative or the shortest and most succinct. We saw an example of
the latter above with reference to item (3) where the correct choice
was obviously the shortest and the clearest one of the bunch. Here is
another case. The only difference is that now the longest alternative is
the correct choice:
(6) The doctor probably tells the young mother not to talk about
dying because _ _ _ __
A. he doesn't think she will die
B. she is not at all ill
C. he wants to encourage her and hopes that she will not die
D. she is delirious
The tendency is to include more information in the correct alternative
in order to make absolutely certain that it is in fact the best choice.
Another motive is to make the distractors short to save time in
writing the items.
Another common problem in writing distractors is to include
alternatives that are ridiculous and often (perhaps only to the test
writer) hilarious. After writing forty or fifty alternatives there is a
certain tendency for the test writer to become a little giddy. It is
difficult to think of distractors without occasionally coming up with a
real humdinger. After one or two, the stage is set for a hilarious test,
but hilarity is not the point of the testing and it may be deleterious to
the validity of the test qua test. For instance, consider the following
item where the task is to select the best paraphrase for the given
sentence:
(7) The pale face of a young woman lying on the bed was raised
weakly from the pillow and she spoke in a faint voice,
A. A fainting face on a pillow rose up from the bed and spoke
softly to the young woman.
B. The pale face and the woman were lying faintly on the bed
when she spoke.
C. Weakly from the pillow the pale face rose up and faintly spoke
to the woman.
D. The woman who was very pale and weak lifted herself from the
pillow and spoke.
Alternative B is distracting in more ways than one. Choice C
continues the metaphor created, and neither is apt to be a very good
distractor except in a hilarious diversionary sense. Without reading
the given sentence or the story, choice D is the only sane alternative.
In sum, the common foul-ups in multiple choice item writing
include the following:
(1) Selecting inappropriate content for the item.
(2) Failure to include the correct answer in the field of alternatives.
(3) Including two or more plausible choices among the
alternatives.
(4) Asking the test taker to guess facts that are not stated or
implied.
(5) Leaving unintentional clues about the correct choice by
making it either the longest or shortest, or by including its
opposite, or by repeatedly referring to the information in the
correct choice in other choices, or by including ridiculous
alternatives.
(6) Writing distractors that don't fit together with the correct
choice - i.e., that are too general or too specific, too abstract
or too concrete, too simple or too complex.
These are only a few of the more common problems. Without doubt
there are many other pitfalls to be avoided.
D. Item analysis and its interpretation
Sensible item analysis involves the careful subjective interpretation of
some objective facts about the way examinees perform on multiple
choice items. Insofar as all tests involve determinate and quantifiable
choices (i.e., correct and incorrect responses, or subjectively
determined better and worse responses), item analysis at base is a very
general procedure. However, we will consider it here specifically with
reference to multiple choice items and the very conveniently
quantifiable data that they yield. In particular, we will discuss the
statistics that usually go by the names of item facility and item
discrimination. Finally, we will discuss the interpretation of response
frequency distributions.
We will be concerned with the meaning of the statistics, the
assumptions on which they depend in order to be useful, and their
computation. It will be shown that item analysis is generally so
tedious to perform by hand as to render it largely impracticable for
classroom use. Nevertheless, it will be argued that item analysis is an
important and necessary step in the preparation of good multiple
choice tests. Because of this latter fact, it is suggested that every
classroom teacher and educator who uses multiple choice test data
should know something of item analysis - how it is done, and what it
means.
(i) Item facility. One of the basic item statistics is item facility (IF).
It has to do with how easy (or difficult) an item is from the viewpoint
of the group of students or examinees taking the test of which that
item is a part. The reason for concern with IF is very simple - a test
item that is too easy (say, an item that every student answers
correctly) or a test item that is too difficult (one, say, that every
student answers incorrectly) can tell us nothing about the differences
in ability within the test population. There may be occasions when a
teacher in a classroom situation wants all of the students to answer an
item (or all the items on a test) perfectly. Indeed, such a goal seems
tantamount to the very idea of what teaching is about. Nevertheless,
in school-wide exams, or in tests that are intended to reveal
differences among the students who are better and worse performers
on whatever is being tested, there is nothing gained by including test
items that every student answers correctly or that every student
"
answers incorrectly.
The computation of IF is like the computation of a mean score for
a test, only the test in this case is a single item. Thus, an IF value can be
computed for each item on any test. It is in each case like a miniature
test score. The only difference between an IF value and a part score or
total score on a test is that the IF value is based on exactly one item. It
is the mean score of all the examinees tested on that particular item.
Usually it is expressed as a percentage or as a decimal indicating the
proportion of students who answered the item correctly:
IF = the number of students who answered the item correctly,
divided by the total number of students
This formula will produce a decimal value for IF. To convert it to a
percentage, we multiply the result by 100. Thus, IF is the proportion
of students who" answer the item in question correctly.
Some authors use the term 'item difficulty', but this is not what the
proportion of students answering correctly really expresses. The IF
increases as the item gets easier and decreases as it gets more difficult.
Hence, it really is an index of facility. To convert it to a difficulty
measure we would have to subtract the IF from the maximum
possible score on the item - i.e., 1.00 if we are thinking in terms of
decimal values, and 100 % if we are thinking in terms of percentage
values. The proportion answering incorrectly should be referred to as
the item difficulty. We will not use the latter notion, however, because
it is completely redundant once we have the IF value.
By pure logic (or mathematics, if you prefer), we can see that the IF
of any item has to fall between zero and one or between 0 % and
100 %. It is not possible for more than 100 % of the examinees to
answer an item correctly, nor for fewer than 0 % to do so. The worst
any group can do on an item is for all of
them to answer it incorrectly (IF = .00 = 0 %). The best they can do
is for all of them to answer it correctly (IF = 1.00 = 100 %). Thus, IF
necessarily falls somewhere between 0 and 1. We may say that IF
ranges from 0 to 1.
For reasons given above, however, an item that everyone answers
correctly or incorrectly tells us nothing about the variance among
examinees on whatever the test measures. Therefore, items falling
somewhere between about .15 and .85 are usually preferred. There is
nothing absolute about these values, but professional testers always
set some such limits and throw away or rewrite items that are judged
to be too easy or too difficult. The point of the test items is almost
always to yield as much variance among examinees as possible. Items
that are too easy or too difficult yield very little variance - in fact, the
amount of meaningful variance must decrease as the item approaches
an IF of 100 % or 0 %. The most desirable IF values, therefore, are
those falling toward the middle of the range of possible values. IF
values falling in the middle of the range guarantee some variance in
scores among the examinees.
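Since IF is nothing more than a per-item mean, its computation is easy to mechanize. The following is a minimal sketch in Python (the score matrix and the .15/.85 cut-offs are merely the illustrative values discussed above, not fixed requirements); it computes an IF for every item in a 0/1 scoring matrix and flags items falling outside the preferred range.

```python
# A minimal sketch of the IF computation: the proportion of examinees
# answering each item correctly. The score matrix and the .15/.85 bounds
# are illustrative values only.

def item_facility(scores):
    """scores: one list of 0/1 item scores per examinee."""
    n_examinees = len(scores)
    n_items = len(scores[0])
    return [sum(s[i] for s in scores) / n_examinees for i in range(n_items)]

if __name__ == '__main__':
    # Five examinees, four items (dummy data).
    scores = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 1, 0, 1],
    ]
    for i, facility in enumerate(item_facility(scores), start=1):
        flag = '' if 0.15 <= facility <= 0.85 else '  <- too easy or too hard'
        print(f'Item {i}: IF = {facility:.2f}{flag}')
    # Item 1 is answered correctly by everyone (IF = 1.00); it yields no
    # variance among examinees and would be rewritten or culled.
```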
However, merely obtaining variance is not enough. Meaningful
variance is required. That is, the variance must be reliable and it must
be valid. It must faithfully reflect variability among tested subjects on
the skill or knowledge that the test purportedly measures. This is
where another statistic is required for item evaluation.
(ii) Item discrimination. The fundamental issue in all testing and
measurement is to discriminate between larger and smaller quantities
of something, better and worse performances, success and failure,
more or less of whatever one wants to test or measure. Even when the
objective is to demonstrate mastery, as in a classroom setting where it
may be expected that everyone will succeed, the test cannot be a
measure of mastery at all unless it provides at least an opportunity for
failure or for the demonstration of something less than mastery. Or to
take another illustration, consider the case of engineers who 'test' the
strength of a railroad trestle by running a loaded freight train over it.
They don't expect the bridge to collapse. Nonetheless, the test
discriminates between the criterion of success (the bridge holding up)
and failure (the bridge collapsing). Thus, any valid test must
discriminate between degrees of whatever it is supposed to measure.
Even if only two degrees are distinguished - as in the case of mastery
versus something less (Valette, 1977) - discrimination between those
two degrees is still the principal issue.
In school testing where multiple choice tests are employed, it is
necessary to raise the question whether the variance produced by a
test item actually differentiates the better and worse performers, or
the more proficient examinees as against the less proficient ones.
What is required is an index of the validity of each item in relation to
some measure of whatever the item is supposed to be a measure.
Clearly if different test items are of the same type and are supposed to
measure the same thing, they should produce similar variances (see
Chapter 3 above for the definition of variance, and correlation). This
is the same as saying that the items should be correlated. That is, the
people who tend to answer one of the items correctly should also tend
to answer the other correctly and the people who tend to answer the
one item incorrectly should also tend to answer the other incorrectly.
If this were so for all of the items of a given type we would take the
degree of their correlation as an index of their reliability - or in some
terminologies their internal consistency. But what if the items could be
shown to correlate with some other criterion? What if it could be
shown that a particular item, for instance, or a batch of items were
correlated with some other measure of whatever the items purport to
measure? In the latter case, the correlation would have to be taken as
an index of validity - not mere internal consistency of the items.
What criterion is always available? Suppose we think of a test
aimed at assessing reading comprehension. Let's say the test consists
of 100 items. What criterion could be used as an index of reading
comprehension against which the validity of each individual item
could be assessed? In effect, for each subject who takes the test there
will be a score on the entire test and a score on each item of the test.
Presumably, if the subject does not answer certain items they will be
scored as incorrect. Now, which would be expected to be a better
measure of the subject's true reading comprehension ability, the total
score or a particular item score? Obviously, since the total score is a
composite of 100 items, it should be assumed to be a better (more
reliable and more valid) index of reading comprehension than any
single item on the test. Hence, since the total score is easily obtainable
and always available on any multiple choice test, it is the usual
criterion for assessing individual item reliabilities. Other criteria
could be used, however. For instance, the items on one test could
easily be assessed against the scores on some different test or tests
purporting to measure the same thing. In the latter instance, the other
test or tests would be used as bases for evaluating the validity of the
items on the first test.
In brief, the question of whether an individual test item
discriminates between examinees on some dimension of interest is
a matter of both reliability and validity. We cannot read an index of
item discrimination as anything more than an index of reliability,
however, unless the criterion against which the item is correlated has
some independent claims to validity. In the latter case, the index of
item discrimination becomes an index of validity over and above the
mere question of reliability.
The usual criterion selected for determining item discrimination is
the total test score. It is simply assumed that the entire test is apt to be
a better measure of whatever the test purports to measure than any
single test item by itself. This assumption is only as good as the
validity of the total test score. If the test as a whole does not measure
what it purports to measure, then high item discrimination values
merely indicate that the test is reliably measuring something - who
knows what. If the test on the whole is a valid measure of reading
comprehension on the other hand, the strength of each item
discrimination value may be taken as a measure of the validity of that
item. Or, to put the matter more precisely, the degree of validity of the
test as a whole establishes the limitations on the interpretation of item
discrimination values. As far as human beings are concerned, a test is
never perfectly valid, only more or less valid within limits that can
be determined only by inferential methods.
To return to the example of the 100 item reading comprehension
test, let us consider how an item discrimination index could be
computed. The usual method is to select the total test score as the
criterion against which individual items on the test will be assessed.
The problem then is to compute or estimate the strength of the
correlation between each individual item on the test and the test as a
whole. More specifically, we want to know the strength of the
correlation between the scores on each item in relation to the scores
on all the items.
Since 100 correlations would be a bit tedious to compute, especially
when each one would require the manipulation of at least twice as
many scores as there are examinees (that is, all the total scores plus all
the scores on the item in question), a simpler method would be
desirable if we were to do the job by hand. With the present
availability of computers, no one would be apt to do the procedure by
hand any more, but just for the sake of clarity the Flanagan (1939)
technique of estimating the correlation between the scores on each
item and the score on the total test will be presented in a step by step
fashion.
Prior to computing anything, the test of course has to be
administered to a group of examinees. In order to do a good job of
estimating the discrimination values for each test item the selected
test population (the group tested) should be representative of the
people for whom the test is eventually intended. Further, it should
involve a large enough number of subjects to ensure a good sampling
of the true variability in the population as a whole. It would not make
much sense to go to all the trouble of computing item discrimination
indices on a 100 item test with a sample of subjects of less than, say, 25.
Probably a sample of 50 to 100 subjects, however, would provide
meaningful (reliable and valid) data on the basis of which to assess the
validities of individual test items in relation to total test scores.
Once the test is administered and all the data are in hand, the first
step is to score the tests and place them in order from the highest score
to the lowest. If 100 subjects were tested, we would have 100 test
booklets ranking from the highest score to the lowest. If scores are
tied, it does not matter what order we place the booklets in for those
particular cases. However, all of the 98s must rank ahead of all of the
97s and so forth.
The next step (still following Flanagan's method) is to count off
from the top down to the score that falls at the 72½ percentile. In the
case of our data sample, this means that we count down to the student
that falls at the 28th position down from the top of the stack of
papers. We then designate that stack of papers that we have just
counted off as the High Scorers. This group will contain approximately
27½% of all the students who took the test. In fact it contains
the 27½% of the students who obtained the highest scores on the test.
Then, in similar fashion we count up from the bottom of the
booklets remaining in the original stack to position number 28 to
obtain the corresponding group that will be designated Low Scorers.
The Low Scorers will contain as near as we can get to exactly 27½% of
the people who achieved scores ranking at the bottom of the stack.
We now have distinguished between the 27½% (rounded off in this
case to 28%) of the students who got the highest scores and the 27½%
who got the lowest scores on the test. From what we already know of
correlation, if scores on all individual items are correlated with the
total score it follows that for any item, more of the High Scorers
should get it right than of the Low Scorers. That is, the students who
are good readers should tend to get an item right more often than the
students who are not so good at reading. We would be disturbed if we
found an item that good readers (High Scorers) tended to miss more
frequently than weak readers (Low Scorers). Thus, for each item we
count the number of persons in the High Scorers group who answered
it correctly and compare this with the number of persons in the Low
Scorers group who answered it correctly. What we want is an index of
the degree to which each item tends to differentiate High and Low
Scorers the same as the total score does - i.e., an estimate of the
correlation between the item scores and the total score.
For each item, the following formula will yield such an index:
ID = the· number of High Scorers who answered the item
correctly minus the number of Low Scorers who
answered the item correctly, divided by 27! % ofthe total
number of students tested
Flanagan showed that this method provides an optimum estimate of
the correlation between the item in question and the total test. Thus,
as in the case of product-moment correlation (see Chapter 3 above),
ID can vary from +1 to -1. Further, it can be interpreted as an
estimate of the computable correlation between the item and the total
score. Flanagan has demonstrated, in fact, that the method of
comparing the top 27½% against the bottom 27½% produces the best
estimate of the correlation that can be obtained by such a method
(better for example than comparing the top 50% against the bottom
50%, or the top third against the bottom third, and so on).
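For readers who would rather let a machine do the ranking and counting, the whole procedure can be condensed into a short Python sketch. The score matrix below is randomly generated dummy data, and rounding 27½% of the group to a whole number of examinees is an assumption of the illustration, not part of Flanagan's argument.

```python
# Item discrimination (ID) by Flanagan's method: rank examinees by total
# score, form High and Low groups of roughly 27.5% each, and compare the
# number of correct answers to a given item in the two groups.

def item_discrimination(scores, item_index, proportion=0.275):
    """scores: one list of 0/1 item scores per examinee."""
    # Rank the test booklets from the highest total score to the lowest.
    ranked = sorted(scores, key=sum, reverse=True)
    k = round(len(ranked) * proportion)      # size of each extreme group
    high, low = ranked[:k], ranked[-k:]
    high_correct = sum(s[item_index] for s in high)
    low_correct = sum(s[item_index] for s in low)
    return (high_correct - low_correct) / k  # ranges from -1 to +1

if __name__ == '__main__':
    import random
    random.seed(1)
    # 100 dummy examinees on a 10-item test: abler examinees (higher p)
    # tend to answer each item correctly more often.
    scores = [[1 if random.random() < p else 0 for _ in range(10)]
              for p in [i / 100 for i in range(100)]]
    print(f'ID for item 1: {item_discrimination(scores, 0):+.2f}')
```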
A specific example of some dummy data (i.e., made up data) for
one of the above test items will show better how ID is computed in
actual practice. Suppose we assume that item (3) above, based on the
text about Oliver Twist, is the item of interest. Further, that it is one
of 100 items constituting the reading comprehension test posited
earlier. Suppose we have already administered the test to 100
examinees and we have scored and ranked them.
After determining the High Scorers and the Low Scorers by the
method described above, we must then determine how many in each
group answered the item correctly. We begin by examining the
answers to the item given by students in the High Scorers group. We
look at each test booklet to find out whether the student in question
got the item right or wrong. If he got it right we add one to the number
of students in the High Scorers group answering the item correctly. If
he got it wrong, we disregard his score. Suppose that we find 28 out of
28 students in the High Scorers group answered the item correctly.
Then, we repeat the counting procedure for the Low Scorers.
Suppose that 0 out of 28 students in the Low Scorers group answered
the item correctly. The ID for item (3) is equal to 28 minus 0, divided
by 28, or +1. That is, in this hypothetical case, the item discriminated
perfectly between the better and not-so-good readers. We would be
inclined to conclude that the item is a good one.
Take another example. Consider the following dummy data on item
(5) above (also about the Oliver Twist text). Suppose that 14 of the
people in the High Scorers group and 14 of the people in the Low
Scorers group answered the item correctly (as keyed, that is,
assuming the 'correct' answer really is correct). The ID would be 14
minus 14, divided by 28, or 0/28 = 0. From this we would conclude
that the item is not producing any meaningful variance at all in
relation to the performance on the entire test.
Take one further example. Consider item (4) above on the Twist
text. Let us suppose that all of the better readers selected the wrong
answer - choice A. Further, that all of the poorer readers selected the
answer keyed by the examiner as the correct choice - say, choice D.
We would have an ID equal to 0 minus 28, divided by 28, or -1.
From this we would be inclined to conclude that the item is no good.
Indeed, it would be fair to say that the item is producing exactly the
wrong kind of variance. It is tending to place the low scorers on the
item into the High Scorers group for the total score, and the high
scorers on the item are actually ending up in the Low Scorers group
for the overall test.
From all of the foregoing discussion about ID, it should be obvious
that high positive ID values are desirable, whereas low or negative
values are undesirable. Clearly, the items on a test should be
correlated with the test as a whole. The stronger the correlations, the
more reliable the test, and to the extent that the test as a whole is valid,
the stronger the correlations of items with total, the more valid the
items must be. Usually, professional testers set a value of .25 or .35 as
a lower limit on acceptable IDs. If an item falls below the arbitrary
cut-off point set, it is either rewritten or culled from the total set of
items on the test.
(iii) Response frequency distributions. In addition to finding out
how hard or easy an item is, and besides knowing whether it is
correlated with the composite of item scores in the entire test, the test
author often needs to know how each and all of the distractors
performed in a given test administration. A technique for determining
whether a certain distractor is distracting any of the students or
not is simply to go through all of the test booklets (or answer sheets)
and see how many of the students selected the alternative in question.
A more informative technique, however, is to see what the
distribution of responses was for the High Scorers versus the Low
Scorers as well as for the group falling in between, call them the Mid
Scorers. In order to accomplish this, a response frequency
distribution can be set up as shown in Table 3 immediately below:
TABLE 3
Response Frequency Distribution Example One.

Item (3)                     A*    B    C    D    Omit
High Scorers (top 27½%)      28    0    0    0     0
Mid Scorers (mid 45%)        15   10   10    9     0
Low Scorers (low 27½%)        0    7    7    7     7
The table is based on hypothetical data for item (3) based on the
Oliver Twist text above. It shows that 28 of the High Scorers marked
the correct choice, namely A, and none of them marked B, C, or D,
and none of them failed to mark the item. It shows further that the
distribution of scores for the Mid group favored the correct choice A,
with B, C, and D functioning about equally well as distractors. No
one in the Mid group failed to mark the item. Finally, reading across
the last row of data in the chart, we see that no one in the Low group
marked the correct choice A, and equal numbers marked B, C, and D.
Also, 7 people in the Low group failed to mark the item at all.
IF and ID are directly computable from such a response frequency
distribution. We get IF by adding the figures in the column headed by
the letter of the correct alternative, in this case A. Here, the IF is 28
plus 15 plus 0, or 43, divided by 100 (the total number of subjects who
took the exam), which equals .43. The ID is 28 (the number of persons
answering correctly in the High group) minus 0 (the number
answering correctly in the Low group) which equals 28, divided by
27½% of all the subjects tested, or 28 divided by 28, which equals 1.
Thus, the IF is .43 and the ID is 1.
We would be inclined to consider this item a good one on the basis
of such statistics. Further, we can see that all of the distractors in the
item were working quite well. For instance, distractors B and C
pulled exactly 17 responses, and D drew 16. Thus, there would appear
to be no dead wood among the distractors.
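This bookkeeping, too, is easy to mechanize. The following Python sketch (the tallies are the hypothetical Table 3 data, and the function names are invented for the illustration) builds a response frequency table for one item and then derives IF and ID from it in the manner just described.

```python
# Response frequency distribution for one multiple choice item, with IF
# and ID derived from the tallies. Dummy data patterned on Table 3; the
# keyed answer, group sizes, and counts are illustrative only.

def tally(responses_by_group, alternatives=('A', 'B', 'C', 'D', 'Omit')):
    """responses_by_group: dict mapping group name -> list of responses."""
    return {group: {alt: members.count(alt) for alt in alternatives}
            for group, members in responses_by_group.items()}

def if_and_id(table, key, n_total, n_extreme):
    """Derive IF and ID from a response frequency table."""
    correct = sum(counts[key] for counts in table.values())
    item_facility = correct / n_total
    item_discrimination = (table['High'][key] - table['Low'][key]) / n_extreme
    return item_facility, item_discrimination

if __name__ == '__main__':
    # 28 High, 44 Mid, and 28 Low scorers (100 examinees); keyed answer A.
    groups = {
        'High': ['A'] * 28,
        'Mid': ['A'] * 15 + ['B'] * 10 + ['C'] * 10 + ['D'] * 9,
        'Low': ['B'] * 7 + ['C'] * 7 + ['D'] * 7 + ['Omit'] * 7,
    }
    table = tally(groups)
    facility, discrimination = if_and_id(table, 'A', n_total=100, n_extreme=28)
    print(table)
    print(f'IF = {facility:.2f}, ID = {discrimination:+.2f}')
    # Prints IF = 0.43 and ID = +1.00 for these Table 3 style data.
```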
To see better what the response frequency distribution can tell us
about specific distractors, let's consider another hypothetical
example. Consider the data presented on item (4) in Table 4.
TABLE 4
Response Frequency Distribution Example Two.

Item (4)                     A     B    C    D*   Omit
High Scorers (top 27½%)      28    0    0    0     0
Mid Scorers (mid 45%)        15   15    0   14     0
Low Scorers (low 27½%)        0    0    0   28     0
Reading across row one, we see that all of the High group missed the
item by selecting the same wrong choice, namely A. If we look back at
the item we can find a likely explanation for this. The phrase 'for a
long time after his birth' does seem to imply alternative A which
suggests that the child was sickly for 'months or possibly years'.
Therefore, distractor A should probably be rewritten. Similarly,
distractor A drew off at least 15 of the Mid group as well. Choice C,
on the other hand, was completely useless. It would probably have
changed nothing if that choice had not been among the field of
alternatives. Finally, since only the low scorers answered the item
correctly it should undoubtedly be completely reworked or
discarded.
E. Minimal recommended steps for multiple choice
test preparation
By now the reader probably does not require much further
convincing that multiple choice test preparation is no simple matter.
Thus, all we will do here is state in summary form the steps
considered necessary to the preparation of a good multiple choice
test.
(1) Obtain a clear notion of what it is that needs to be tested.
(2) Select appropriate item content and devise an appropriate
item format.
(3) Write the test items.
(4) Get some qualified person to read the test items for editorial
difficulties of vagueness, ambiguity, and possible lack of
clarity (this step can save much wasted energy on later steps).
(5) Rewrite any weak items or otherwise revise the test format to
achieve maximum clarity concerning what is required of the
examinee.
(6) Pretest the items on some suitable sample of subjects other
than the specifically targeted group.
(7) Run an item analysis over the data from the pretesting.
(8) Discard or rewrite items that prove to be too easy or too
difficult, or low in discriminatory power. Rewrite or discard
non-functional alternatives based on response frequency
distributions.
(9) Possibly recycle through steps (6) to (8) until a sufficient
number of good items has been attained.
(10) Assess the validity of the finished product via some one or
more of the techniques discussed in Chapter 3 above, and
elsewhere in this book.
(11) Apply the finished test to the target population. Treat the data
acquired on this step in the same way as the data acquired on
step (6) by recycling through steps (7) to (10) until optimum
levels of reliability and validity are consistently attained.
In view of the complexity of the tasks involved in the construction
of multiple choice tests, it would seem inadvisable for teachers with
normal instructional loads to be expected to construct and use such
tests for normal classroom purposes. Furthermore, it is argued that
such tests have certain instructional drawbacks.
F. On the instructional value of multiple choice tests
While multiple choice tests have rather obvious advantages in terms
of administrative and scoring convenience, anyone who wants to
make such tests part of the daily instructional routine must be willing
to pay a high price in test preparation and possibly genuine
instructional damage. It is the purpose of the multiple choices offered
in any field of alternatives to trick the unwary, ill-informed, or less
skillful learner. Oddly, nowhere else in the curriculum is it common
procedure for educators to recommend deliberate confusion of the
learner - why should it be any different when it comes to testing?
It is paradoxical that all of the popular views of how learning can
be maximized seem to go nose to nose with both fists flying against the
very essence of multiple choice test theory. If the test succeeds in
discriminating among the stronger and weaker students it does so by
decoying the weaker learners into misconceptions, half-truths, and
Janus-faced traps.
Dean H. Obrecht once told a little anecdote that very neatly
illustrates the instructional dilemma posed by multiple choice test
items. Obrecht was teaching acoustic phonetics at the University of
Rochester when a certain student of Germanic extraction pointed out
the illogicality of the term 'spectrogram' as distinct from 'spectrograph'.
The student observed that a spectrogram might be like a
telegram, i.e., something produced by the corresponding -graph, to
wit a telegraph or a spectrograph. On the other hand, the student
noted, the 'spectrograph' might be like a photograph for which there
is no corresponding photogram. 'Now which,' asked the bemused
student, 'is the machine and which is the record that it produces?'
Henceforth, Dr. Obrecht often complained that he could not be sure
whether it was the machine that was the spectrograph, or the piece of
paper.
What then is the proper use of multiple choice testing? Perhaps it
should be thoroughly re-evaluated as a procedure for educational
applications. Clearly, it has limited application in classroom
testing. The tests are difficult to prepare. Their analysis is tedious,
technically formidable, and fraught with pitfalls. Most importantly,
the design of distractors to trick the learners into confusing dilemmas
is counter-productive. It runs contrary to the very idea of education.
Is this necessary?
In the overall perspective multiple choice tests afford two principal
advantages: ease of administration and scoring. They cost a great
deal on the other hand in terms of preparation and counter-productive
instructional effects. Much research is needed to
determine whether the possibly contrary effects on learning can be
neutralized or even eliminated if the preparation of items is guided by
certain statable principles - for instance, what if all of the alternatives
were set up so that only factually incorrect distractors were used? It
might be that some types of multiple choice items (perhaps the sort
used in certain approaches to programmed instruction) are even
instructionally beneficial. But at this point, the instructional use of
multiple choice formats is not recommended.
KEY POINTS
1. There is a certain strained naturalness about multiple choice test formats
inasmuch as there does not seem to be any other way to ask a question.
2. However, the main argument in favor of using multiple choice tests is
their supposed 'objectivity' and their ease of administration and scoring.
3. In fact, multiple choice tests may not be any more reliable or valid than
similar tests in different formats - indeed, in some cases, it is known that
the open-ended formats tend to produce a greater amount of reliable and
valid test variance, e.g., ordinary cloze procedure versus multiple choice
variants (see Chapter 12, and the Appendix).
4. Multiple choice tests may be discrete point, integrative, or pragmatic -
there is nothing intrinsically discrete point about a multiple choice
format.
5. Pragmatic tasks, with a little imagination and a lot of work, can be
converted into multiple choice tests; however, the validity of the latter
tests must be assessed in all of the usual ways.
6. If one is going to construct a multiple choice test for language
assessment, it is recommended that the test author begin with a discourse
context as the basis for test items.
7. Items must be evaluated for content, clarity, and balance among the
alternatives they offer as choices.
8. Each set of alternatives should be evaluated for clarity, balance,
extraneous clues, and determinacy of the correct choice.
9. Texts, i.e., any discourse based set of materials, that discuss highly
technical, esoteric, super-charged, or otherwise distracting content
should probably be avoided in most instances.
10. Among the common pitfalls in item writing are selecting inappropriate
content; failure to include a thoroughly correct alternative; including
more than one plausible alternative; asking test takers to guess facts not
stated or implied; leaving unintentional clues as to the correct choice
among the field of alternatives; making the correct choice the longest or
shortest; including the opposite of the correct choice among the
alternatives; repeatedly referring to information in the correct choice in
other alternatives; and using ridiculous or hilarious alternatives.
11. Item analysis usually involves the examination of item facility indices,
"b
258
LANGUAGE TESTS AT SCHOOL
item discrimination indices, and response frequency distributions.
12. Item facility is simply the proportion of students answering the item
correctly (according to the way it was keyed by the test item writer).
13. Item discrimination by Flanagan's method is the number of students in
the top 27½% of the distribution (based on the total test scores)
answering the item correctly minus the number of students in the bottom
27½% answering it correctly, all divided by the number corresponding to
27½% of the distribution.
14. Item discrimination is an estimate of the correlation between scores on a
given item considered as a miniature test, and scores on the entire test. It
can also be construed in a more general sense as the correlation between
the item in question and any criterion measure considered to have
independent claims to validity.
15. Thus, ID is always a measure of reliability and may also be taken as a
measure of validity in the event that the total test score (or other
criterion) has independent claims to validity.
16. Response frequency distributions display alternatives against groups of
respondents (e.g., high, mid, and low). They are helpful in eliminating
non-functional alternatives, or misleading alternatives that are trapping
the better students.
17. Among the minimal steps for preparing multiple choice tests are the
following: (1) clarify what is to be tested; (2) decide on type of test to be
used; (3) write the items; (4) have another person read and critique the
items for clarity; (5) rewrite weak items; (6) pretest the items; (7) item
analyze the pretest results; (8) rewrite bad items; (9) recycle steps (6)
through (8) as necessary; (10) assess validity of product; (11) use the test
and continue recycling of steps (7) through (10) until sufficient reliability
and validity is attained.
18. Due to complexity of the preparation of multiple choice tests, and to
their lack of instructional value, they are not recommended for
classroom applications.
19.