Current Developments in Language Testing
FL 021 757
ED 365 144
ABSTRACT
The selection of papers on language testing includes:
"Language Testing in the 1990s: How Far Have We Come? How Much
Further Have We To Go?" (J. Charles Alderson); "Current
Research/Development in Language Testing" (John W. Oller, Jr.); "The
Difficulties of Difficulty: Prompts in Writing Assessment" (Liz
Hamp-Lyons, Sheila Prochnow); "The Validity of Writing Test Tasks"
(John Read); "Affective Factors in the Assessment of Oral
Interaction: Gender and Status" (Don Porter); "Authenticity in
Foreign Language Testing" (Peter Doye); "Evaluating Communicative
Tests" (Keith Morrow); "Materials-Based Tests: How Well Do They
Work?" (Michael Milanovic); "Defining Language Ability: The Criteria
for Criteria" (Geoff Brindley); "The Role of Item Response Theory in
Language Test Validation" (T. F. McNamara); "The International
English Language Testing System IELTS: Its Nature and Development"
(D. E. Ingram); "A Comparative Analysis of Simulated and Direct Oral
Proficiency Interviews" (Charles W. Stansfield); "Southeast Asian
Languages Proficiency Examinations" (James Dean Brown, H. Gary Cook,
Charles Lockhart, Teresita Ramos); "Continuous Assessment in the Oral
Communication Class: Teacher Constructed Test" (Shanta
Nair-Venugopal); and "What We Can Do with Computerized Adaptive
Testing...And What We Cannot Do!" (Michel Laurier). (MSE)
CURRENT DEVELOPMENTS IN LANGUAGE TESTING

Edited by SARINEE ANIVAN

SEAMEO REGIONAL LANGUAGE CENTRE

ANTHOLOGY SERIES 25
CURRENT DEVELOPMENTS IN LANGUAGE TESTING
Anthology Series 25
Published by
SEAMEO Regional Language Centre
RELCP383-91
Copyright ©
CONTENTS

Foreword  iv
Introduction  v
Language Testing in the 1990s: How Far Have We Come? How Much Further Have We To Go? (J Charles Alderson)  1
Current Research/Development in Language Testing (John W Oller, Jr)  27
The Difficulties of Difficulty: Prompts in Writing Assessment (Liz Hamp-Lyons and Sheila Prochnow)  58
The Validity of Writing Test Tasks (John Read)  77
Affective Factors in the Assessment of Oral Interaction: Gender and Status (Don Porter)  92
Authenticity in Foreign Language Testing (Peter Doye)  103
Evaluating Communicative Tests (Keith Morrow)  111
Materials-Based Tests: How Well Do They Work? (Michael Milanovic)  119
Defining Language Ability: The Criteria for Criteria (Geoff Brindley)  139
The Role of Item Response Theory in Language Test Validation (T F McNamara)  165
The International English Language Testing System (IELTS): Its Nature and Development (D E Ingram)  185
A Comparative Analysis of Simulated and Direct Oral Proficiency Interviews (Charles W Stansfield)  199
Southeast Asian Languages Proficiency Examinations (James Dean Brown, H Gary Cook, Charles Lockhart and Teresita Ramos)  210
Continuous Assessment in the Oral Communication Class: Teacher Constructed Test (Shanta Nair-Venugopal)  230
What We Can Do with Computerized Adaptive Testing ... And What We Cannot Do! (Michel Laurier)  244
List of Contributors  256
FOREWORD
Moreover, in countries where education is seen as central to socioeconomic development, it is important that tests be valid and reliable. It is our
belief that where validity and reliability are lacking, individuals as well as
educational programmes suffer, and society at large is the loser. Thus, testing
is an important tool in educational research and for programme evaluation,
and may even throw light on both the nature of language proficiency and
language learning.
Although the theme of the 1990 Seminar encompassed both testing and
programme evaluation, it has not been possible to cover both areas in one
anthology, and the present volume deals with papers on testing. Those that
deal with programme evaluation will be published in a separate volume later.
Earnest Lau
Director, RELC
INTRODUCTION
Naturally, as in all research, the approach taken and the basis of the
research is crucial to the findings. Oller points out some of the fundamental
problems of research in language testing. He argues for a greater understanding
of language proficiency from the point of view of semiotic abilities and processes,
as well as the different perspectives of the various people that are involved in the
test making and test taking processes. The variances that show up in language
proficiency tests could thus be correctly identified and their sources properly
attributed and controlled. The validity of the tests would be more secure and
According to Oller, language learning will take place when learners can
connect their text (discourse) with their own experience, and it is the congruence
between tests and experience that is crucial in determining the validity of tests.
So long as this congruence or connection exists, there is less need for real world
experience or authenticity to be inserted into the tests. Doye similarly offers the
view that complete authenticity is neither possible nor desirable. He calls for
some balance between authenticity and abstraction, since he believes that what

assessment, but there is a problem with reliability when raters tend to show
variability in their judgements. Brindley details the short-comings of these
procedures. Areas for future research are also discussed so that some of these
deficiencies can be reduced.
Measurement models such as Item Response Theory (IRT) and proficiency
long-term implications for their careers, and test-takers deserve reliable and
valid tests. Innovation and experimentation are, however, always necessary in
order for a discipline to progress. The impact of test results on testees should
always be part of the ethical consideration when writing tests.
LANGUAGE TESTING IN THE 1990s: HOW FAR HAVE WE COME? HOW MUCH FURTHER HAVE WE TO GO?

J Charles Alderson

INTRODUCTION
The metaphor I have chosen for my title relates to the notion of distance, of
movement from A to B, or to C or Z. Reviews of language testing often employ
a growth metaphor: papers and volumes are often called things like
for being a language tester, and tell strangers at cocktail parties that I was an
applied linguist, or a teacher trainer. No longer. I even find that my non-tester
colleagues are becoming interested in language testing, which they feel has
become a lively, interesting area of research and development activity. Some
have even expressed an interest in learning about how one goes about measuring
testing specialists. This is a sure sign that the field has matured, gained in
confidence and is able to walk without holding its mother's hand, and is a
development very much to be welcomed.
pursued the analogy, we would have to think about how and when language
testing might enter middle and then old age, and what it could look forward to in
the future. If testing is mature now, is it soon going to enter into decline and
decay? Even if language testing is still in its adolescence, and has maturity and
middle age to look forward to, the metaphor still implies the inevitability of
decay and death. It is perhaps interesting to note that I once compared the life
of a test to that of a human, suggesting that its existence was typically of the
order of fifteen years:
and eight years. Towards the end of this period, however, signs of
senescence appear in the shape of increasing criticism of the test's influence
on teaching and on students' ambitions and lives .... Pressure may then
build up within the test producing body itself ... for a change in test
specification, test content, test format ... It may be that the test no longer
fulfils its original function. Change may be instituted by academic applied
linguists ... or by the examination body itself, ... or it may be brought about
by direct rather than invited teacher involvement. Whoever the agent of
change, however, rebirth is then inevitable, usually after a gestation period
of two to three years. And so we have another innovation: another baby
test. However, the baby may have a very close resemblance to the parent,
or it may look very different indeed from its predecessor" (Alderson, 1986,
pp 96-97)
were true: that testing would decline and die, and that teachers and learners
could then go about their daily lives unencumbered by tests. There is, after all,
considerable resentment against tests, and their influence, and teachers in
particular bemoan the washback effect. Many teachers also believe that they
know what their learners have learned and have not learned, and how proficient
or otherwise they are. The implication is that because they know their learners,
they do not need tests, nor do they believe the information tests provide when it
is different from what they themselves believe. Clearly language learning is a
complex matter, as is the nature of the proficiency towards which learners are
striving. Equally clearly a language test is going to be only a small, probably
inadequate sample of what any learner has achieved or can do with language.
And so many teachers, and some learners, criticise tests for being
"unrepresentative" or even "misleading". Such teachers would be happy to see
testing die. So would those who feel that testing not only constrains the syllabus,
but also unduly restricts opportunities for learners, especially those learners who
perform less well on the tests - who "fail" them.
However much such people hope that testing will die, their hopes are
unlikely to be realised. Tests will probably always be needed as long as society is
obliged to make selection choices among learners (as in the case of university
entrance, for example), or as long as there is doubt about the validity or accuracy
how well they are doing with respect to the language, or some more or less
to improve the tests that we have, so that negative washback can become
positive, so that tests reflect what learners and teachers think learners have
learned and can do, and so that the decisions that are made on the basis of test
results are as fair and reasonable as they can possibly be. Which is why
reliability and validity are so important, and why it is important that publicly
available language tests and examinations should meet clearly established and
accepted standards. One of the first tasks that the proposed association of
language testing specialists will need to address, is that of standards: what
represents good practice in language testing, how is it to be identified, fostered and
maintained? What are the existing standards of our examining bodies and
language tests, and should and can these standards be improved? So one task
for the 1990s will certainly be to improve on the growing professionalism of
language testers and language tests, and to set standards to which tests might -
should - must - aspire.
However, this notion of aspiration suggests a different metaphor from that
of growth, development, maturation, then decline, decay and death. It suggests
aspiring to a goal, something distant that is getting closer, or high that is getting
nearer. Hence the metaphor of distance contained in my title, and the idea it
suggests of progress over distance. I am interested in running, and have done
quite a few long-distance events - half marathons, marathons, and ultra events. I
find short-distance running - sprinting - very painful, and do not enjoy it. Which
partly explains why I do long-distance running, and it may also explain why I
to elucidate it. There are many different, and indeed sometimes competing
language tests any better now than they used to be? Have we now achieved a
greater understanding of what the problems in language testing are, or how they
of achievement tests; not only with testers, and the problems of our
professionalism but also with testees, with students, and their interests,
perspectives and insights. As I said at a conference in Sri Lanka five years ago,
"testing is too important to be left to testers".
Since 1980, language testing has indeed been moving apace. We now have
have many. Peter Skehan's recent survey article lists 215 publications in the
Bibliography, of which only thirty-five were published before 1980. Grant
Henning's book A Guide to Language Testing has been recently complemented
by Lyle Bachman's volume on Fundamental Considerations in Language
The Construct Validation of Oral Proficiency Tests. Ten years later, in San
discriminant validation of oral and written tests, and above all extended
discussion of Bachman and Palmer's pilot work on the construct validation of
tests of speaking and reading. The reader may recall that the upshot of the
Bachman-Palmer study was that speaking was shown to be measurably different
from reading, but that there was evidence of method effects.
In 1990, the colloquium was much larger - 106 people attending compared
with 29 invited participants in 1980. Partly as a consequence, there was a much
iii) skills, ESP test content and test bias, approaches to the validation of reading tests
iv) test development and test analysis: a colloquium on the TOEFL-Cambridge Comparability Study, and another on the development of the new IELTS test, from the point of view of the role of grammar, the nature of the listening and speaking tests, and the issue of subject-specific testing
Yet although clearly more varied, it is not clear to me that the 1990
colloquium was an advance on the 1980 one. In some ways, it was a step
backwards, since in 1980 there was a common theme, with research papers
bearing on the same issue from a variety of angles, and potentially throwing light
on problems that might have been said to persist in oral testing. However, many
of the problems that were aired in 1980 are still current: the issue of oral
proficiency scales is eternal; it was, for example, addressed in the recent ELTS
Revision Project, and we will hear something about this at this conference from
David Ingram.
To turn to a national example. Almost exactly ten years ago, in May 1980,
Alan Davies and I hatched a plan to hold an invitational conference at Lancaster
with the aim of reviewing developments and issues in language testing. After
some deliberation and discussion, we agreed that the three main "issues" of
interest to British language testers, and hopefully more widely also, were:
communicative language testing; testing English for specific purposes; the
unitary competence hypothesis: testing general language proficiency. The
results of the conference were eventually published as "Issues in Language
Testing" (Alderson and Hughes 1981); out of that meeting also came the
Language Testing Newsletter, which eventually became the journal Language
Testing and at the same conference, discussions were held which led to the
Edinburgh ELTS Validation Study. I think we had a very definite sense that we
were at the beginning of interesting developments, and that much could happen.
A subsequent follow-up conference was held at Reading University on the same
three topics, and the proceedings were published as Current Developments in
Language Testing (Hughes and Porter, 1983).
At the end of 1989, the Special Interest Group in Testing within IATEFL
organised a Testing Symposium in Bournemouth, entitled Language Testing in
the 1990s: The Communicative Legacy. The symposium attracted a variety of
presentations from examination bodies in the United Kingdom, from teachers
involved in testing, and from testing researchers. In addition, a mini-colloquium
took place, where precirculated papers were reacted to by invited speakers. The
proceedings are about to be published in the ELT Documents series: the main
themes centre around three areas: oral testing, computer based testing, testing
themes are by now probably familiar from the previous meetings I have
mentioned: communicative testing and oral testing in particular, but also the
relationship between teaching and testing, and the nature of proficiency. The
newcomer is computer-based testing, and I shall come back to that topic shortly.
In general, however, I did not and do not get the impression that we have been
building upon previous research and previous discoveries and successes over the
past decade, and in this impression I am strengthened by Peter Skehan, who is
not only the author of an excellent survey article of language testing (Skehan,
1988), but also presented the opening overview paper at the Bournemouth
symposium. In his paper, Skehan claims that there has been little notable
progress in testing in the past decade, which he attributes in part to conservative
good case in point. The previous ELTS test had been based upon John Munby's
model for syllabus design - the communicative needs processor (Munby, 1978).
Brendan Carroll and his associates took the model and appear to have applied it
to the design of test specifications and test content. The Revision Project was
asked to re-examine this, on the grounds that the Munby model was old-fashioned,
out of date, and needed to be replaced. So one of our tasks was to
identify a model of language proficiency on which our test should or could safely
be based. Alderson and Clapham (1989) report on the results of our attempts to
The point of this anecdote is simply to reinforce the important point that
Skehan makes: language testing has not been well served by applied linguistic
theory, and has been forced to reach its own solutions and compromises, or to
make its own mistakes. Indeed, it is significant, in my view, that the most likely
influential model of second language proficiency to emerge at the end of the
1980s is the model proposed by Lyle Bachman in his 1990 book. I shall return to
this model shortly, but what is significant here is the fact that Bachman is a
later modifications, which model itself clearly has its origins in much
sociolinguistic thought, but as Skehan points out, it is surely significant that the
model had to be elaborated by a language tester, and is now beginning to be
operationalised through work on the TOEFL-Cambridge Comparability Study, of which more later.
So, in the United Kingdom as well as internationally, the impression I gain
is that although there has been much movement, a lot of it is sideways and
backwards, and not much of it is forwards. Are we going round and round in
circles?
procedure, who would wish to deny that such themes might not be just as
appropriate to the 1990 RELC Seminar? Communicative language testing
appears on the programme of this seminar in many disguises. The assessment of
oral proficiency is still topical, as is self assessment, the relationship between
teaching and testing, test development and test validation. I shall be very
interested as I listen to the many presentations at this seminar to see whether we
If by the end of the Seminar we have got our personal answers to these
questions, and if we have also a sense of an emerging consensus among language
testers of the answers, then we will not only have achieved a great deal, but will
also be in a position to move forward.
I do not wish to pre-empt your own thoughts on progress to date and the
need for further progress, but I fear that it would be remiss of me, given my title
and the expectations it may have aroused, not to offer my thoughts on what we
have and have not achieved to date. However, I shall do so only briefly, as I do
hope that this seminar will clarify my own thinking in this area. Nevertheless, let
me at least indicate the following areas of progress and lack of progress:
TEST CONTENT
The last ten years have seen apparent improvement in the content of
language tests, especially those that claim to be "communicative". Texts are
more "authentic", test tasks relate more to what people do with language in
"real-life", our tests are more performance related and our criteria for evaluating
performance are more relevant to language use. Advances in performance
testing have accompanied a movement away from "discrete-point", knowledge-focussed
tests and have benefited from more integrative approaches to language
assessment. However, we do not know that this apparent improvement is a real
establish that progress has indeed been made. Lyle Bachman and his co-workers
have developed a useful instrument for the TOEFL-Cambridge Comparability
Study, intended to identify and examine aspects of "Communicative Language
Ability" (based upon the Bachman model, Bachman 1990), but even this
instrument when perfected will only enable us to compare tests in content terms.
Much more work will need to be done before we can relate the empirical
analysis of test performance to an examination of test content. We are still very
far from being able to say "I know how to test grammatical competence", "I know
how to test the ability to read in a foreign language", and so on.
TEST METHOD
Research has clearly shown that there is such a thing as method effect in
testing. Given that we are not interested in measuring a person's ability to take a
test, we have, I believe, generally accepted that no one test method can be
thought of as superior, as a panacea. There have been interesting attempts to
devise new testing methods for particular purposes, which remain to be
validated, but which offer alternatives to the ubiquitous multiple-choice. We
have as yet no means, however, of estimating the method effect in a test score,
much less any way of predicting test method effects or of relating test method to
test purpose. The development of an instrument to identify possible test method
TEST ANALYSIS
We now know, or believe, that the answer to the question: what is language
proficiency? depends upon why one is asking the question, how one seeks to
answer it, and what level of proficiency one might be concerned with. It is
generally accepted that the UCH was overstated, and that proficiency consists of
both general and specific components. We know that speaking can be different
from reading. We also now know, thanks to the work of Gary Buck (Buck,
1990), that reading and listening can be empirically separated, provided that
certain conditions are met. We also know from various sources, including the
empirical work of Vollmer, Sang et al, and Mike Milanovic in Hong Kong, that
differentiated proficiency. Whereas intermediate level learners - the majority -
tend to show differentiated abilities across macro-skills, and therefore their test
performance tends to be multifactorial. However, we are still at the beginning of
this sort of research, and many carefully designed studies with large test
batteries, homogeneous subjects, and adequate background information on
learning histories and biodata will be needed before we can be more definitive in
our statements about the nature of language proficiency than that.
will tell us at this conference). We lack more sensitive measures, and we are
therefore unable even to make useful statements about the impact on learning
Language testing is a young discipline, and has only been taken seriously,
and taken itself seriously, in the past twenty years or so. As Alan Davies
result, the research that gets done tends to be done by doctoral students,
striving to be original, or by individual researchers working on their own
(this is a point also made by Skehan), or by researchers within organisations
like the Defense Language Institute in the USA, or the Centre for Applied
are some of the research studies reported in the ETS TOEFL Research
Report Series. However, understandably enough, the examination and test
these problems. The first problem will, of course, resolve itself as testing
becomes a more established field of study. There are reasons to believe, as I
suggested at the beginning, that this has already occurred, that language testing
is now sufficiently developed and mature, or language testers are sufficiently well
criticisms of parts of the model, but that should not prevent us from
endeavouring to operationalise aspects of it, in order to explore relationships
among them. The third problem is perhaps the most problematic: funding for
research and development. I have no easy solutions to that, but will be very
interested to hear what others have to say about this, from their institutional and
national perspectives.
DIRECTIONS
In what follows, I offer a few of my own thoughts, on what I believe to be
important areas for language testing to pay attention to in the next decade and
more.
1. Achievement and research in language learning
Now that testing has come of age, it is time for testers to make major
contributions to other areas of applied linguistics. Three related areas come to
mind immediately: programme evaluation, second language acquisition research
and classroom learning. In each case, it will be important for language testers to
pay much more attention to the development and researching of achievement
tests. In each case, what is needed is a set of carefully constructed, highly
specific tests which can be shown to be sensitive to learning. The concentration
of language testing researchers on developing and researching proficiency tests is
understandable: most funding bodies want proficiency tests, rather than tests
that relate reliably, validly and directly to specific achievement on particular
are not insuperable, and I believe that the 1990s will see much more
collaboration between language testers, second language acquisition researchers,
programme evaluators and language teachers than we have seen hitherto.
2. Washback

It is a commonplace to declare that tests have an impact on teaching -
washback is everywhere acknowledged and usually deplored. At the same time,
it is not uncommon to point out that tests can have both negative and positive
influences on the curriculum, a fact which has been used in some settings in
order to bring about innovation in the curriculum through the test. Usually, the
test is said to lag behind innovations and progress in materials, teacher training
and classroom practice, until the dissonance between the two becomes so
uncomfortable that the test has to change. In some settings, however, deliberate
innovation in the content and method of the examinations has been used to
reinforce or on some occasions even go in advance of changes in materials and
methods. However, in both sets of circumstances - where the test is held to have
negative washback on teaching and where the test is being used to bring about
classroom change - there is remarkably little evidence of the impact of the test.
What there is is largely anecdotal, and not the result of systematic empirical
research. What is needed, and there are signs that this will become an increased
focus for testing related research in the future, is research into the impact of
tests on classrooms. Do teachers "simply" use previous exam papers as textbook
material? If so, do they simply expect students to take the tests, and then to
receive the answers? How long in advance of the exam does such teaching
begin, and what do students think of it, and how do they benefit from it? Why
3. Test Content
The last few years have seen a resurgence in interest in the content validity
of tests. Over the past ten years there has developed a tendency for test
developers to devise taxonomies of the skills and content being tested by their
tests. Such taxonomies are typically contained in the Specifications of the test,
which are used for the guidance of item writers, and they have been heavily
influenced by the writings of curriculum and syllabus developers. The classic
example of these in the USA is Benjamin Bloom and his associates, in the
Taxonomy of Educational Objectives, and in the United Kingdom in EFL/ESP
in the work of John Munby and his Communicative Needs Processor. The
existence of taxonomies in test specifications has led to an attempt to test
individual skills and objectives in individual items, and to the concomitant claim
that certain items do indeed test certain skills/objectives.
Unfortunately, however, recent research has begun to cast doubt on these
claims, at least to the extent that it has proved somewhat difficult in some
circumstances to get "expert" judges to agree on what is being tested by
individual items. If judges do not agree with each other, or with the test
constructor, on what items are testing, then it becomes somewhat difficult to
substantiate claims as to the content validity of an item, and conceivably also a
test.
to show is that students approach test items in a highly individual way and,
moreover, that students get the correct answer for a variety of different reasons.
Sometimes they get the answer right without knowing the right answer,
sometimes they get the answer wrong whilst clearly displaying the ability being
tested. Even more worrisome, in some ways, is the fact that individual students
have been seen to get the answer right, yet have displayed abilities that were not
supposedly being tested, nor have they displayed evidence of the ability that the
test constructor believed was being tested.
If individuals respond to single items individually, revealing different skills
and abilities in so doing, and if "expert" judges disagree about what is being
tested by individual test items, then it is unclear whether we are justified (a) in
saying that a given item is testing a particular skill for any group of learners, and
(b) in grouping together the responses of different learners to the same item for
the purposes of item analysis (even facility values and discrimination indices). If
there are doubts about the legitimacy of grouping individual responses (which
are at least potentially different psycholinguistically) to one item, there must also
be doubts about the wisdom and indeed interpretability of grouping responses to
items to arrive at test scores for one individual, and even less to arrive at group
test results. Given that traditional test statistics - reliability indices and validity
coefficients and calculations - depend upon grouping data - perhaps it is small
wonder that factor analyses of large datasets of performance on large numbers
of items result more frequently than not in unifactorial structures, or in multifactorial views of proficiency that are difficult to interpret. This is at least an
4.
Fred Davidson and the work of the Comparability Study has led some to
reconsider their position. The finding, disappointing to some, that the factor
structure of test batteries is more unifactorial than theory would lead us to
expect is being accounted for by the notion that the nature of language
proficiency may vary depending upon the level of proficiency. Thus advanced
learners might be thought to have integrated their abilities in different skill
areas, and therefore to manifest a general proficiency, or at least a proficiency
together in order to make inferences about test content and construct and
therefore also about language proficiency, is a dubious exercise at best, and also
need for closer attention to the need to test language achievement rather than
proficiency, might well lead to an increase in the studies that are conducted on
populations in individual countries, within particular educational settings. This
might also enable us to focus more clearly on achievement and learning within
institutions and systems.
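To make the dimensionality point concrete, the sketch below simulates subtest scores for an intermediate-like group (two partly separate skill clusters) and an advanced-like group (one integrated general factor) and counts factors with a simple eigenvalue criterion. It is only an illustration of the idea discussed above: the subtest structure, sample size and loadings are invented, and the Kaiser rule is just one crude way of estimating dimensionality; a real study would use confirmatory methods and genuine test data.

import numpy as np

def n_factors(scores):
    # Count eigenvalues greater than 1 of the subtest correlation matrix
    # (the Kaiser criterion), a crude index of how many factors are present.
    corr = np.corrcoef(scores, rowvar=False)
    return int(np.sum(np.linalg.eigvalsh(corr) > 1.0))

rng = np.random.default_rng(0)
n = 300

# "Intermediate" group: two partly separate skill clusters (e.g. reading/grammar
# versus listening/speaking), so more than one factor should emerge.
f1, f2 = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))
intermediate = np.hstack([0.8 * f1, 0.8 * f1, 0.8 * f2, 0.8 * f2]) + 0.5 * rng.normal(size=(n, 4))

# "Advanced" group: one well-integrated general proficiency factor.
g = rng.normal(size=(n, 1))
advanced = 0.8 * g + 0.5 * rng.normal(size=(n, 4))

print("factors suggested, intermediate group:", n_factors(intermediate))  # typically 2
print("factors suggested, advanced group:    ", n_factors(advanced))      # typically 1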
5.
One major development since 1980 has been the advent of personal
computers utilising powerful and advanced micro-processors. Such computers
are increasingly being used not only for the calculation of test results and the
issuing of certificates, but also for test delivery and scoring. Computerised
adaptive testing is an important innovation, where the computer "tailors" the test
that any candidate takes to that candidate's ability level as revealed by his/her
by the exact word procedure. Whereas such test methods are coming under
increasing scrutiny and criticism elsewhere in language testing, the advent of the
computer has to date proved to be a conservative force in test development: test
The speed, memory, patience and accuracy of the computer would appear
to offer a variety of possibilities for innovation in test method that should be
actively explored in the 1990s. In addition, however, it is already clear that
delivering tests on computer allows the possibility for a blurring of the
distinction between a test and an exercise. The computer can assess a student's
response as soon as it has been made: this then allows the possibility for
immediate feedback to the student before he/she progresses to the next item. It
also allows the possibility of giving the student a "second chance", possibly for
reduced credit. The computer can also provide a variety of help facilities to
students: on-line dictionaries can be made easily available, and with not much
more effort, tailor-made dictionaries - directly relevant to the meanings of the
words in the particular context - can also be available, as can mother tongue
equivalents, and so on. In addition, the computer can deliver clues to learners
who request them. These clues can be specific to particular items, and can
consist of hints as to the underlying rules, as to meanings, as to possible
inferences, and so on. Again, the test developer has the possibility of allowing
access to such clues only for reduced credit. Moreover, the computer can also
offer the possibility of detailed exploration of a particular area of weakness. If a
student performs poorly on, say, two items in a particular area, the machine can
branch the student out into a diagnostic loop that might explore in detail the
student's understanding and weaknesses/strengths in such an area. If thought
desirable, it would be easy to branch students out of the test altogether into
some learning routine or set of explanations, and then branch them back in,
either when they indicated that they wished the test to continue, or once they had
performed at some pre-specified criterion level.
In short, a range of support facilities is imaginable through computers - and
indeed software already exists that allows the provision of some of these ideas.
The provision of such support raises serious questions about the distinction
between tests and exercises, and the consequences of providing support for our
understanding of candidates' proficiency and achievement. Since the computer
can also keep track of a student's use of such facilities, it is possible to produce
very detailed reports of progress through a test and of performance on it, thus
allowing the possibility of detailed diagnostic information. The big question at
present is: can teachers and testers use this information? Will it reveal things
about a student's proficiency or achievement or learning or test taking strategies
that will be helpful? We do not yet know, but we now have thc hardware, and
partly the software, to find out, and a duty to explore the possibilities and
consequences.
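As a concrete illustration of the possibilities described above, the sketch below simulates a computer-delivered test that selects each item adaptively, gives immediate feedback with a second chance and an optional clue at reduced credit, and logs every step for later diagnostic reporting. Everything in it is hypothetical: the item bank, the half-credit scoring rules and the crude Rasch-style ability update are invented for illustration and do not describe any existing testing system.

import math
import random

# Hypothetical item bank: difficulties in logits, answer keys, and clues.
ITEMS = [
    {"id": "q1", "difficulty": -1.0, "key": "b", "clue": "Think about the verb tense."},
    {"id": "q2", "difficulty": 0.0, "key": "a", "clue": "Re-read the first paragraph."},
    {"id": "q3", "difficulty": 1.2, "key": "c", "clue": "A synonym appears earlier in the text."},
]

def p_correct(ability, difficulty):
    # Rasch model: probability that a candidate of this ability answers correctly.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def pick_item(ability, unused):
    # Adaptive selection: the unused item whose difficulty is closest to the
    # current ability estimate.
    return min(unused, key=lambda item: abs(item["difficulty"] - ability))

def administer(answer_fn, ability=0.0, step=0.4):
    """Deliver the items adaptively, allowing a clue and a second chance,
    both at reduced credit, and log everything for diagnostic reporting."""
    unused, log = list(ITEMS), []
    while unused:
        item = pick_item(ability, unused)
        unused.remove(item)
        credit, clue_shown, correct = 1.0, False, False
        for attempt in (1, 2):                            # second chance on failure
            response, wants_clue = answer_fn(item, attempt, clue_shown)
            if wants_clue and not clue_shown:
                clue_shown, credit = True, credit * 0.5   # a clue costs half the credit
                response, _ = answer_fn(item, attempt, clue_shown)
            if response == item["key"]:
                correct = True
                break
            credit *= 0.5                                 # a wrong first try halves the credit
        score = credit if correct else 0.0
        # Crude ability update in the spirit of adaptive testing (not a full MLE).
        ability += step * (score - p_correct(ability, item["difficulty"]))
        log.append({"item": item["id"], "score": round(score, 2),
                    "clue": clue_shown, "ability": round(ability, 2)})
    return ability, log

# Simulated candidate: guesses among a/b/c and asks for a clue on a second attempt.
def simulated_candidate(item, attempt, clue_shown):
    return random.choice("abc"), (attempt == 2 and not clue_shown)

final_ability, report = administer(simulated_candidate)
for entry in report:
    print(entry)
print("final ability estimate:", round(final_ability, 2))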
6. Learner-centered testing

The very real possibility of the provision of support during tests, and the
tailoring of tests to students' abilities and needs, raises the important issue of
tailoring of tests to students' abilities and needs, raises the important issue of
student-centered testing. For a long time in language teaching there has been
talk of, and some exploration of the concept of, learner-centered teaching and
learning. This now becomes an issue in testing, and as teachers we will need to
programmes might be. What do they think they have learned during a
programme and how do they think they can demonstrate such learning? An
increased focus on such questions could help learners as well as teachers become
more aware of the outcomes of classroom learning, which would in turn inform
those who need to develop classroom progress and achievement tests.
Clearly such suggestions are revolutionary in many settings, and I am not
necessarily advocating that students design tests for themselves, their peers or
their successors, at least not immediately. It is often necessary to begin such a
development cautiously: one way already suggested and indeed being tried out
in various contexts is to ask students what they are doing when they respond to
test items. Another way is to ask students to comment and reflect on the
discrepancy between their test results or responses, and their own view of their
ability, or their peers' views or their teachers' views. Such explorations may well
help us to understand better what happens when a student meets a test item, and
that might help us to improve items. It might also help students to understand
their abilities better, and might even encourage students to contribute more
substantially to test development. The ideas may appear Utopian, but I would
argue that we would be irresponsible not to explore the possibilities.
7.
judges, this did not necessarily agree with the intentions of the test constructor,
nor with what students reported of their test-taking processes. It is much more
difficult than we may have thought to decide by content inspection alone what a
test is testing. Yet much testing practice assumes we can make such judgements.
In a second study, I showed that test writers, experienced test scorers, and
experienced teachers, were unable to agree on the difficulty of a set of items, for
a given population, and were unable to predict the actual difficulty of items. This
shows clearly the need for pre-testing of items, or at the very least for post-hoc
adjustments in test content, after an analysis of item difficulty. Declarations of
the suitability, or even unsuitability of a test for a given population are likely to
be highly inaccurate.
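The kind of post-hoc item analysis implied here is easy to sketch. The example below uses invented pre-test responses to compute facility values and item-total discrimination indices, and then checks invented judges' difficulty predictions against the empirical ordering with a rank correlation; none of the figures come from the studies described above.

import numpy as np
from scipy.stats import spearmanr

# Made-up pre-test data: 80 candidates answer 10 items whose true difficulty
# rises from -1.5 to +1.5 logits; 1 = correct, 0 = wrong.
rng = np.random.default_rng(1)
ability = rng.normal(size=(80, 1))
true_difficulty = np.linspace(-1.5, 1.5, 10)
p_correct = 1.0 / (1.0 + np.exp(-(ability - true_difficulty)))
responses = (rng.random((80, 10)) < p_correct).astype(int)

# Facility value: proportion of candidates answering each item correctly.
facility = responses.mean(axis=0)

# Discrimination: correlation of each item with the total on the other items.
total = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, i], total - responses[:, i])[0, 1]
    for i in range(responses.shape[1])
])

# Hypothetical judges' predicted difficulty ranks (1 = easiest ... 10 = hardest),
# compared with the empirical ordering obtained from the pre-test.
judged_rank = np.array([3, 1, 5, 2, 7, 4, 9, 6, 10, 8])
rho, _ = spearmanr(judged_rank, 1 - facility)

print("facility values     :", np.round(facility, 2))
print("discrimination index:", np.round(discrimination, 2))
print(f"judged vs empirical difficulty: Spearman rho = {rho:.2f}")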
In a third study, I investigated agreement among judges as to the suitability
of cut-offs for grades in school-leaving examinations. There was considerable
disagreement among judges as to what score represented a "pass", a "credit" and
a "distinction" at O Level. Interestingly, it proved possible to set cut-offs for the
new examination by pooling the judgements - the result of that exercise came
What will clearly be needed in the coming years is a set of studies into the
accuracy and nature of the range of judgements that language testers are
required to make, in order to identify ways in which such judgements can be
made reliable and also more valid. Testing depends upon judgements by
"experts". We need to know how to improve these judgements, and how to
guarantee their reliability and validity.
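By way of illustration, pooling cut-score judgements and quantifying how far the judges agree can be as simple as the sketch below; the judges' figures are invented, and the median and standard deviation are only one of several defensible ways of pooling and summarising such judgements.

import statistics

# Invented cut-score judgements (percentage marks) from six judges for three grades.
judgements = {
    "pass":        [40, 45, 38, 50, 42, 44],
    "credit":      [55, 60, 58, 65, 57, 59],
    "distinction": [70, 78, 72, 80, 75, 74],
}

for grade, scores in judgements.items():
    pooled = statistics.median(scores)   # pooled cut-off across judges
    spread = statistics.stdev(scores)    # how far the judges disagree
    print(f"{grade:12s} pooled cut-off = {pooled:5.1f}  "
          f"sd = {spread:4.1f}  range = {min(scores)}-{max(scores)}")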
8. Traditional concerns
of new ways to validate tests, both statistical through Item Response Theory,
Confirmatory Factor Analysis, Multitrait-Multimethod analyses of convergent
and discriminant validities, and the like, and qualitative, through introspective
studies, through comparisons with "real-life" language use, through increased
a set of standards for good practice in language testing. There already exist
general standards for educational and psychological testing - the APA, AERA
and NCME standards. However, these do not refer specifically to language
tests, nor, I believe, do they take account of, or accommodate to, the variety of
test development procedures that exist around the world. The TOEFL-Cambridge Comparability Study I have referred to revealed considerable
differences in approaches to test development, and to the establishing of
standards for tests in the United Kingdom and the USA. Similar cross-national
comparisons elsewhere would doubtless also reveal considerable differences.
What we need to do is not to impose one set of standards on other systems, but
to explore the advantages and disadvantages, the positives and negatives of the
different traditions that might emerge from a survey of current practice, and to
incorporate the positive features of current practice into a set of standards that
could - should? - be followed by those who develop language tests. There
already exists a well-documented and well-articulated psychometric tradition for
establishing test standards, especially but not exclusively in the USA. What we
now need to do is to identify the positive features of other traditions, and to
explore the extent to which these are compatible or incompatible with the
psychometric tradition.
Clearly this will take time, and considerable effort, and may well cause
some anguish. Just as does a long-distance run. The analogy may not be entirely
inappropriate, since the effort may well need stamina and determination. And
the end-point may well be distant. But I believe it to be worthwhile.
To summarise: Recent research is beginning to challenge some of the basic
assumptions we have made in the past 20-30 years: our judgements as experts
are suspect; our insights into test content and validity are challengeable; our
methods of test analysis may even be suspect. The apparent progress we think
we have made - that we celebrate at conferences and seminars like this one, that
we publish and publicise - may well not represent progress so much as activity,
sometimes in decreasing circles.
It may at times appear, it may even during this talk have appeared as if the
goal is reducing into the distance. Are we facing a mirage, in which our goals
appear tantalisingly close, yet recede as we try to reach them? I believe not, but
I do believe that we need patience and stamina in order to make progress. At
least language testing is now mature enough, confident enough, well trained
enough, to take part in the run, to begin the long distance journey. Did you
know that to take part in long-distance events, at least in the United Kingdom,
you have to be at least eighteen years old? Modern approaches to language
testing are at least that old. We should not despair, but should identify the
direction in which we want and need to move, and continue to work at it.
BIBLIOGRAPHY
Alderson, J Charles and Hughes, A (eds) (1981) Issues in Language Testing. ELT Documents 111. London: The British Council

Alderson, J Charles and Clapham, C (1989) "Applied Linguistics and Language Testing: A Case Study of the ELTS Test". Paper presented at the BAAL Conference, Lancaster, September 1989

Henning, G (1987) A Guide to Language Testing: Development, Evaluation, Research. Cambridge, MA: Newbury House

Hughes, A (1989) Testing for Language Teachers. Cambridge: Cambridge University Press

Hughes, A and Porter, D (eds) (1983) Current Developments in Language Testing. London: Academic Press

Jones, S, DesBrisay, M and Paribakht, T (eds) (1985) Proceedings of the 5th Annual Language Testing Colloquium. Ottawa: Carleton University

Oller, J W Jnr (ed) (1983) Issues in Language Testing Research. Rowley: Newbury House

Read, J A S (ed) (1981) Directions in Language Testing. Singapore: SEAMEO Regional Language Centre/Singapore University Press

Skehan, P (1988) Language Testing, Part 1 and Part 2. State of the Art Article, Language Teaching, pp 211-221 and pp 1-13

Skehan, P (1989) Progress in Language Testing: The 1990s. Paper presented at the IATEFL Special Interest Group in Language Testing Symposium on Language Testing in the 1990s: the Communicative Legacy. Bournemouth, November 1989

Stansfield, C W (ed) (1986) Technology and Language Testing. Washington, DC: TESOL

Weir, C J (1988) Communicative Language Testing. Exeter Linguistic Studies, Volume 11. Exeter: University of Exeter
CURRENT RESEARCH/DEVELOPMENT IN
LANGUAGE TESTING
John W. Oller, Jr
INTRODUCTION
Without question, the most important item on the present agenda for
language testing research and development is a more adequate theoretical
perspective on what language proficiency is and what sources of variance
contribute to its definition in any given test situation. Perhaps the least
developed idea with reference to the research has been the differentiation of
sources of variance that are bound to contribute to observed differences in
measures of language proficiency in different test situations.
GREETING
After ten years, it is a distinct pleasure to be back in Singapore again and to
attend once more an international conference at RELC on language testing. As
Charles Alderson reminded us, at least "a little" has happened in the interim
(since the 1980 conference), and we look forward to seeing what the next decade
may bring forth. We may hope that all of us who were able to attend this year
will be able to come back in ten years' time. We are saddened to note that
Michael Canale is no longer with us, and are reminded of our own mortality.

It is a "noble undertaking", as General Ratanakoses (Minister of Education
in Thailand and President of SEAMEO) told us yesterday, that we are embarked
upon, but a difficult one. Therefore, if we are to stay in it for the long haul, as
Alderson said, we will require a certain level of "stamina". The Director of the
SEAMEO Secretariat, Dr. Isman, and the Director of RELC have shown us what
that is all about.
what I will say here today, it is convenient that it may be put in all of the
grammatical persons which we might have need of in reference to a general
For instance, with respect to the first person, whether speaker or writer, it
would be best for that person to try to see things from the viewpoint of the
second person, the listener or reader. With reference to the second person, it
would be good to see things (or to try to) from the vantage point of the first.
From the view of a third person, it would be best to take both the intentions of
the first and the expectations of the second into consideration. And, as Ron
MacKay showed so eloquently in his paper at this meeting, even evaluators
(acting in the first person in most cases) are obliged to consider the position of
"stakeholders" (in the second person position). The stakeholders are the persons
who are in the position to benefit or suffer most from program evaluation. They
are the persons on the scene, students, teachers, and administrators, so it follows
from the generalized version of Peirce's maxim for writers (a sort of golden rule
for testers) that evaluators must act as if they were the stakeholders.
Therefore, with all of the foregoing in mind, I will attempt to express what I
have to say, not so much in terms of my own experience, but in terms of what we
have shared as a community at this conference. May it be a sharing which will go
on for many years in a broadening circle of friendships and common concerns. I
suppose that our common goal in the "noble undertaking" upon which we have
embarked from our different points of view converging here at RELC, is to
share our successes and our quandaries in such a way that all of us may benefit
and contribute to the betterment of our common cause as communicators,
teachers, educators, experimentalists, theoreticians, practitioners, language
testers, administrators, evaluators, and what have you.
more abstract set of terms drawn from the semiotic theory of Charles Sanders
Peirce. It is true that the term "pragmatics" has been at least partially
assimilated. It has come of age over the last two decades, and theoreticians
around the world now use it commonly. Some of them even gladly incorporate
its ideas into grammatical theory. I am very pleased to see that at RELC in 1990
there is a course listed on "Pragmatics and Language Teaching".
Well, it was Peirce who invented the term, and as we press on with the
difficult task of sinking a few pilings into solid logic in order to lay as strong a
foundation as possible for our theory, it may be worthwhile to pause a moment
to realize just who he was.
C. S. Peirce [1839-1914]
In addition to being the thinker who invented the basis for American
pragmatism, Peirce did a great deal else. His own published writings during his
75 years, amounted to 12,000 pages of material (the equivalent of 24 books of
500 pages each). Most of this work was in the hard sciences (chemistry, physics,
astronomy, geology), and in logic and mathematics. During his lifetime, however,
he was hardly known as a philosopher until after 1906, and his work in grammar
and semiotics would not become widely known until after his death. His
followers, William James [1842-1910] and John Dewey [1859-1952], were better
known during thcir lifetimes than Peirce himself. However, for those who have
studied the three of them, thcre can be little doubt that his work surpassed theirs
(see, for example, comments by Nagel, 1959).
Until the 1980s, Peirce was known almost exclusively through eight volumes
(about 4,000 pages) published by Harvard University Press between 1931 and
1958 under the title Collected Papers of Charles Sanders Peirce (the first six volumes
were edited by Charles Hartshorne and Paul Weiss, and volumes seven and eight
by Arthur W. Burks). Only Peirce scholars with access to the Harvard archives
could have known that those eight volumes represented less than a tenth of his
total output.
30
Peirce's writings have been published by Indiana University Press. The series is
titled Writings of Charles S. Peirce: A Chronological Edition and is expected,
when complete, to contain about twenty volumes. The first volume has been
edited by Max Fisch, et al., (1982) and the second by Edward C Moore, et al.,
(1984). In his Preface to the first volume (p. xi), Moore estimates that it would
require an additional 80 volumes (of 500 pages each) to complete the publication
of the remaining unpublished manuscripts of Peirce. This would amount to a
total output of 104 volumes of 500 pages each.
novels) consider Peirce to have been a philosopher. In fact, he was much more.
He earned his living from the hard sciences as a geologist, chcmist, and engineer.
His father, Benjamin Peirce, Professor of Mathematics at Harvard was widely
regarded as the premier mathematician of his day, yet the work of the son by all
measures seems to have surpassed that of the father (cf. Eisele, 1979). Among
Peirce himself saw abstract representation and inference as the same thing.
Inference, of course, is the process of supposing something on the warrant of
something else, for example, that there will be rain in Singapore because of the
symbol ... contains information. And ... all kinds of information involve
inference. Inference, then, is symbolization. They are the same notions" (1865,
in Fisch, 1982, p. 280). The central issue of classic pragmatism, the variety
advocated by Peirce, was to investigate "the grounds of inference" (1865, in Fisch,
PRAGMATIC MAPPING
the articulate linking of text (or discourse) in a target language (or in fact any
semiotic system whatever), with facts of experience known in some other manner
(i.e., through a different semiotic system or systems).
[Figure 1: TEXTS (representations of all sorts) linked across Einstein's Gulf with FACTS (the world of experience).]
the facts most of the time and we are seeking to discover others or we may
merely be speculating about them.
There are two interpretations of the figure that are of interest here. First,
there is the general theory that it suggests for the comprehension of semiotic
material, i.e., texts or discourse, in general, and second, there is the more specific
application of it to language testing theory which we are about to develop and
elaborate upon.
With respect to the first interpretation we may remark that the theory of
pragmatic mapping, though entirely neglected by reviewers like Skehan (1989),
offers both the necessary and sufficient conditions for language comprehension
and acquisition. In order for any individual to understand any text it is necessary
for that individual to articulately map it into his or her own personal experience.
That is, assuming we have in mind a particular linguistic text in a certain target
language, the comprehender/acquirer must determine the referents of referring
noun phrases (who, what, where, and the like), the deictic significances of verb
phrases (when, for how long, etc.), and in general the meanings of the text. The
case is similar with the producer(s) of any given text or bit of text. All of the
same connections must be established by generating surface forms in a manner
EINSTEIN'S GULF
vast extent, we do not know directly--only indirectly and inferentially through our
representations of it.
Another point worthy of a book or two, is that what the material world is,
or what any other fact in it is, i.e., what is real, in no way depends on what we
may think it to be. Nor does it depend on any social consensus. Thus, in spite of
the fact that our determination of what is in the material world (or what is
factual concerning it), is entirely dependent on thinking and social consensus
(and though both of these may be real enough for as long as they may endure),
reality in general is entirely independent of any thinking or consensus. Logic
requires, as shown independently by Einstein and Peirce (more elaborately by
Peirce), that what is real must be independent of any human representation of it.
But, we cannot develop this point further at the moment. We must press on to a
more elaborate view of the pragmatic mapping process and its bearing on the
concerns of language testers and program evaluators.
In fact, the simplest form of the diagram, Figure 1, shows why language
tests should be made so as to conform to the naturalness constraints proposed
earlier (Oller, 1979, and Doye, this conference). It may go some way to
explaining what Read (1982, p. 102) saw as perplexing. Every valid language test
that is more than a mere working over of surface forms of a target language
must require the linking of text (or discourse) with the facts of the test taker's
experience. This was called the meaning constraint. The pragmatic linking,
moreover, ought to take place at a reasonable speed--the time constraint. In his
[Figure: the general semiotic capacity subdivided into linguistic, kinesic, and sensory-motor semiotic capacities, each with its particular systems (L1, L2, ...; K1, ...; SM1, SM2, ..., SMn).]
of gestures, some aspects of which are universal and some of which are
conventional and must be acquired. Smiling usually signifies friendliness, tears
sadness, and so on, though gestures such as these are always ambiguous in a way
that linguistic forms are not ordinarily. A smile may be the ultimate insult and
tears may as well represent joy as sorrow. Sensory-motor representations are
what we obtain by seeing, hearing, touching, tasting, and smelling. They include
all of the visceral and other sensations of the body.
Sensory-Motor Capacity. Sensory-motor representations, as we learn from
the skepticism of Hume and Russell (see Oller, 1989 for elaboration). The
problem with sensory-motor representations is to determine what precisely they
are representations of. What do we see, hear, etc? The general logical form of
the problem is a Wh-question with an indeterminate but emphatic demonstrative
in it: namely, "What is that?" To see the indeterminacy in question, picture a
except to bring certain significances to our attention, language affords the kind
of abstract conceptual apparatus necessary to fully determine many of the facts
of experience. For instance, it is only by linguistic supports that we know that
today we are in Singapore, that it is Tuesday, April 9, 1990, that Singapore is an
island off the southern tip of Malaysia, and west of the Philippines and north of
Australia, that my name is John Oller, that Edith Hanania, Margaret Des Brisay,
Liz Parkinson, J4cet Singh, Ron MacKay, Adrian Palmer, Kanchana Prapphal,
P. W. J. Nababan, James Pandian, Tibor von Elek, and so forth, are in the
audience. We know who we are, how we got to Singapore, how we plan to leave
and where we would like to go back to after the meeting is over, and so forth.
Our knowledge of all of these facts is dependent on linguistic representations. If
any one of them were separated out from the rest, perhaps some reason could be
found to doubt it, but taken as a whole, the reality suggested by our common
representations of such facts is not the least bit doubtful. Anyone who pretends
to think that it is doubtful is in a state of mind that argumentation and logic will
not be able to cure. So wc will pass on.
Particular Systems and Their Texts. Beneath the three main universal
the writing process is not merely language proficiency per se, which is
represented as any given Li, in the diagram, but is also dependent on background
knowledge which may have next to nothing to do with any particular Li. The
Figure 3. A modular information processing expansion of the pragmatic mapping process. [The figure shows the general semiotic capacity subdivided into linguistic, kinesic, and sensory-motor semiotic capacities, with long-term and short-term memory, affective evaluation (+ or - with variable strength), the senses (sight, hearing, touch, taste, smell), and TEXTS (representations of all sorts) linked across Einstein's Gulf with FACTS (the world of experience).]
However, as more and more experience is gained, the growth will tend
to fill in gaps and deficiencies such that a greater and greater degree of
convergence will naturally be observed as individuals conform more and
more to the semiotic norms of the mature language users of the target
language community (or communities). For example, in support of this
general idea, Oltman, Stricker, and Barrows (1990) write concerning
the factor structure of the Test of English as a Foreign Language that
"the test's dimensionality depends on the examinee's overall level of
performance, with more dimensions appearing in the least proficient
populations of test takers" (p. 26). In addition, it may be expected that
as maturation progresses, for some individuals and groups, besides
increasing standardization of the communication norms, there will be a
continuing differentiation of specialized subject matter knowledge and
specialized skills owing to whatever differences in experience happen to obtain. For example, a person who reads many literary works and studies them intently is apt to develop some skills and kinds of knowledge that will not be common to all the members of a community. Or, a person who practices a certain program of sensory-motor skill, e.g., playing racquetball, may be expected to develop certain
to some extent co-dependent expectancy systems, what is understood passes into short-term memory while whatever is not understood is filtered out, as it were, even though it may in fact have been perceived. What is processed so as to
achieve a deep level translation into a general semiotic form goes into long term
memory. All the while information being processed is also evaluated affectively
for its content, i.e., whether it is good (from the vantage point of the processor)
or bad. In general, the distinction between a positive or negative marking, and the degree of that markedness, will determine the amount of energy devoted to the processing of the information in question. Things which are critical to the
survival and well-being of the organism will tend to be marked positively in
terms of affect and their absence will be regarded negatively.
Affect as Added to Cognitive Effects. The degree of importance associated
with the object (in a purely abstract and general sense of the term "object") will
be determined by the degree of positive or negative affect associated with it. To
some extent this degree of markedness and even whether a given object of
semiosis is marked positively or negatively will depend on voluntary choices
made by the processor. However, there will be universal tendencies favoring
survival and well-being of the organism. This means that on the positive side we
will tend to find objects that human beings usually regard as survival enhancing
known that contextually expected words, for instance, are easier to perceive than
unexpected ones (the British psychologist John Morton comes to mind in this
connection). In fact, either positive or negative expectations may be created by
context which either make it easier or in fact make it harder than average to
perceive a given item. These experiments carry over rather directly into the
whole genre of cloze testing to which we will return shortly. However, it can be
demonstrated that in addition to the effects of cognitive expectancies, affective
evaluations associated with stimuli also have additional significant and important
(a distinction made by James Dean Brown [1988] and alluded to by Palmer at
else who takes the time to read such essays find the ones written by better
motivated and better informed writers to also be that much more
comprehensible. All of which leads me to the most important and final diagram
for this paper, Figure 6.
[Figure 6. The three pragmatic perspectives: the First Person (author or originator), the Second Person (readers or consumers, alias interpreters), and the Third Person (community or disinterested persons), related to TEXTS and FACTS by direct and inferential access.]
first person or producer of discourse (or text) is obviously distinct from the
second person or consumer. What is not always adequately appreciated, as
Read points out in his paper at this meeting, is that variability in language tests
may easily be an indiscriminate mix from both positions when only one is
supposedly being tested. What is more, logically, there is a third position that is
shared by the community of users (who will find the text meaningful) and the
text itself. Incidentally, for those familiar with Searle's trichotomy in speech act
theory (a rather narrow version of pragmatic theory), we may mention that what
It will be noted that the first person is really the only one who has direct
access to whatever facts he or she happens to be representing in the production of a particular text. Hence, the first person also has direct access to the text. At the same time the text may be accessible directly to the person to whom it is addressed, but the facts which the text represents (or purports to represent in the case of fiction) are only indirectly accessible to the second person through
the representations of the first. That is, the second person must infer the
intentions of the first person and the facts (whatever either of these may be).
Inferences concerning those facts are based, it is hypothesized, on the sort of
semiotic hierarchy previously elaborated (Figures 1-5). Similarly, a third person
has direct access neither to the facts nor the intentions of the first person nor the
understandings of them by the second person. All of these points must be
inferred, though the text is directly accessible. The text, like the third person(s),
also logically is part of the world of facts from the point of view of the third
person, just as the first person and second person are part of that world. (For
anyone who may have studied Peirce's thinking, the three categories
differentiated here will be readily recognized as slightly corrupted, i.e., less
abstract and less general, versions of his perfectly abstract and general categories
of firstness, secondness, and thirdness.)
Going at these categories in a couple of different ways, I am sure that I can
make clearer both what is meant by them in general and how they are relevant to language testing, for instance when attention shifts to the third position, e.g., when we use language tests to investigate characteristics of textual structure.
The point that I want to make in this next section is that unless the two
other positions (beyond whichever of the three may already be in focus), and
possibly a great many subtle variables within each, are controlled, it is likely that
data drawn from any language testing application will be relatively meaningless.
Unfortunately this is the case with far too many studies. As Palmer emphasized
acquisition and whatever sorts of proficiency may be acquired, it appears that the
language teaching profession is long on methods, recipes, and hunches, and short
on theories that are clear enough to put to an experimental test.
of sorts were conducted using cloze procedure in one way or another. A data search turned up 192 dissertations, 409 studies in ERIC, and 116 in the PsychLit
database. At this conference there were a number of other studies that either
employed or prominently referred to cloze procedure (but especially see R. S.
Hidayat, S. Boonsatorn, Andrea Penaflorida, Adrian Palmer, David Nunan, and
J. D. Brown). We might predict that some of the many cloze studies in recent years, not to mention the many other testing techniques, would focus on the first person position, i.e., variability attributable to the producer(s) of a text (or
discourse); some on the second person position, variability attributable to the
consumer(s); and some on third position, variability attributable to the text itself.
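(For readers unfamiliar with the technique, the sketch below, written in Python, illustrates the generic fixed-ratio cloze procedure under discussion: every nth word of a passage is deleted and responses are scored by the exact-word method. The passage, deletion ratio, and scoring rule are illustrative assumptions only, not the procedure of any particular study cited here.)

import re

# Build a fixed-ratio cloze test: delete every `every`-th word, starting
# at word `start` (1-based), and keep an answer key of the deleted words.
def make_cloze(passage, every=6, start=6):
    words = passage.split()
    key = {}                       # blank number -> deleted word
    out = []
    blank = 0
    for i, word in enumerate(words, start=1):
        if i >= start and (i - start) % every == 0:
            blank += 1
            key[blank] = word
            out.append("__{}__".format(blank))
        else:
            out.append(word)
    return " ".join(out), key

# Exact-word scoring: a response counts only if it matches the deleted
# word exactly, ignoring case and punctuation.
def exact_word_score(responses, key):
    norm = lambda w: re.sub(r"\W", "", w.lower())
    return sum(1 for n, answer in key.items()
               if norm(responses.get(n, "")) == norm(answer))

# Invented example passage and responses, for illustration only.
passage = ("The quality of a cloze test depends on the passage chosen, "
           "the deletion ratio, and the scoring method that is applied.")
cloze_text, key = make_cloze(passage)
print(cloze_text)
print(exact_word_score({1: "test", 2: "a", 3: "method"}, key))   # prints 2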
speaker) and a reader (or listener) through text (or discourse) is always a
(third), testers (first position once removed) and tests (third position once removed), interlocutors and texts, raters (first position twice removed) and
testers and interlocutors and texts, etc., are in agreement. It is miraculous (as
Einstein observed decades ago, see Oller, 1989) that any correspondence (i.e., representational validity) should ever be achieved between any representations and any facts, but it cannot be denied that such well-equilibrated pragmatic mappings are actually common in human experience. They are also more
common than many skeptics want to admit in language testing research as well,
though admittedly the testing problem is relatively (and only relatively) more
complex than the basic communication problem. However, I believe that it is
important to see that logically the two kinds of problems are ultimately of the same class. Therefore, as testers (just as much as mere communicators) we seek
which is the same thing; cf. Oller, 1990), on the other hand, are hardly worth
discussing since they are so easy to achieve in their minimal forms as to be trivial
and empty criteria. Contrary to a lot of flap, classrooms are real places and what
takes place in them is as real as what takes place anywhere else (e.g., a train
station, restaurant, ballpark, or you name it!) and to that extent tests are as real
and authentic in their own right as any other superficial semiotic event.
Interviews are real enough. Conversations, texts, stories, and discourse in
general can be just as nonsensical and ridiculous outside the classroom (or the
interview, or whatever test) as in it. Granted we should get the silliness and
nonsense out of our teaching and our testing and out of the classroom (except perhaps when we are merely being playful, which no doubt has its place), but
reality and authenticity apart from a correspondence theory of truth, or the
pragmatic mapping theory outlined here, are meaningless and empty concepts.
Anything whatever that has any existence at all is ipso facto a real and
authentic fact. Therefore, any test no matter how valid or invalid, reliable or
unreliable, is ipso facto real and, in this trivial way, authentic. The question is
whether it really and authentically corresponds to facts beyond itself. But here
we introduce the whole theory of pragmatic mapping. We introduce all of
Peirce's theory of abduction, or the elaborated correspondence theory of truth.
Tests, curricula, classrooms, teachers and teaching are all real enough; the problem is to authenticate or validate them with reference to what they purport to represent.
With reference to that correspondence issue, without going into any more
detail than is necessary to the basic principles at stake, let me refer to a few
studies that show the profound differences across the several pragmatic
perspectives described in Figure 6. Then I will reach my conclusion concerning
all of the foregoing and hopefully justify in the minds of participants in the
conference and other readers of the paper the work that has gone into building
up the entire semiotic theory in the first place. There are many examples of
studies focussing on the first position, though it is the least commonly studied
position with cloze procedure. A dramatically clear example is a family of studies employing cloze procedure to discriminate speech samples drawn from normals from samples drawn from psychotics.
The First Person in Focus. When the first person is in focus, variability is
attributable to the author (or speaker) of the text (or discourse) on which the cloze test is based. In one such study, Maher, Manschreck, Weinstein, Schneyer, and Okunieff (1988; and see their references), the third position was partially
kinds of missing items and therefore might produce a ceiling effect. In such a
case any differences between the speech samples of normals and psychotics
would be run together at the top of the scale and thereby washed out. Indeed
the follow-up confirmed this expectation and it was concluded that less educated
would have been to adjust the level of difficulty of the task performed by the
normals and psychotics thereby producing more complex passages (in the third
position) to be cloze-rated.
Another pair of studies that focussed on first position variability with cloze
procedure sought to differentiate plagiarists from students who did their own
work in introductory psychology classes. In their first experiment (El), Standing
and Gorassini (1986) showed that students received higher scores on cloze
passages over their own work (on an assigned topic) than over someone else's.
Subjects were 16 undergraduates in psychology. In a follow-up with 22 cases, E2,
they repeated the design but used a "plagiarized" essay on a new topic. In both
cases, scores were higher for psychology students who were filling in blanks on
their own work.
proficient in the language at issue in the plagiarized material, might very well
escape detection. Motivation of the writers, the amount of experience they may
have had with the material, and other background knowledge are all
uncontrolled variables.
With respect to E2, the third position is especially problematic. Depending
on the level of difficulty of the text selected, it is even conceivable that it might
be easier to fill in the blanks in the "plagiarist's" work (the essay from an
extraneous source) than for some subjects to recall the exact word they
themselves used in a particularly challenging essay. There is also a potential
confounding of first and second positions in El and in E2. Suppose one of the
subjects was particularly up at the time of writing the essay and especially depressed, tired, or down at the time of the cloze test. Is it not possible that an
the filling in of the blanks (or vice versa) are potential confounding variables.
Nevertheless, there is reason to hold out hope that under the right conditions
cloze procedure might be employed to discourage if not to identify plagiarists,
and it should be obvious that countless variations on this theme, with reference
to the first position, are possible.
For a study with the second position in focus, consider Zinkhan, Locander, and Leigh (1986). They attempted
category, and one cognitive relating to knowledge and ability of the subjects (we may note that background knowledge and language proficiency are confounded here but not necessarily in a damaging way). Here, since the variability in advertising copy (i.e., third position) is taken to be a causal factor in getting people to remember the ad, it is allowed to vary freely. In this case, the first position effectively merges with the third, i.e., the texts to be reacted to. It is inferred then, on the basis of the performance of large numbers of measures aimed at the second position (the n of 420), what sorts of performances in writing or constructing ads are apt to be most effective in producing recall. In this instance, since the number of cases in the second position is large and randomly selected, the variability in second position scores is probably legitimately employed in the inferences drawn by the researchers as reflecting the true qualitative reactions of subjects to the ads.
Many, if not most, second language applications of cloze procedure focus on some aspect of the proficiency or knowledge of the reader or test taker. Another example is the paper by R. S. Hidayat at this conference who wrote, "... the text (or the writer through the text). To be able to do so a reader should contribute his knowledge to build a 'world' from information given by the text." I
would modify this statement only with respect to the "world" that is supposedly "built" up by the reader (and/or the writer). To a considerable extent both the
writer and the reader are obligated to build up a representation (on the writer's
building up a fictional concoction had best have in mind the common world of
ordinary experience. Even in the case of fiction writing, of course, this is also
This being the case, apparently, we may conclude that the first and third
positions were adequately controlled in Hidayat's study to produce the expected
outcome in the second position.
Stansfield's case, the oral tests, an Oral Proficiency Interview (OPI) and a
Simulated Oral Proficiency Interview (SOPI), are themselves aimed at
measuring variability in the performance of language users as respondents to the
interview situation--i.e., as takers of the test regarded as if in second position.
Though subjects are supposed to act as if they were in first position, since the interview is really under the control of the test writer (SOPI) or interviewer (OPI), subjects are really reactants and therefore are seen from the tester's point of view as being in second position. As Stansfield observes, with an ordinary OPI,
standardization of the procedure depends partly on training and largely on the
wits of the interviewer in responding to the output of each interviewee.
That is to say, there is plenty of potential variability attributable to the first
position. With the SOPI, variability from the first position is controlled fairly
rigidly since the questions and time limits are set and the procedure is more or less completely standardized (as Stansfield pointed out). To the extent that the procedure can be quite perfectly standardized, rater focus can be directed to the variability in proficiency exhibited by interviewees (second position) via the discourse (third position) that is produced in the interview. In other words, if the first position is controlled, variability in the third position can only be the
responsibility of the person in second position.
With the OPI, unlike the case of the SOPI, the interviewer (first position)
variability is confounded into the discourse produced (third position). Therefore,
it is all the more remarkable when the SOPI and OPI are shown to correlate at
such high levels (above .90 in most cases). What this suggests is that skilled
interviewers can to some extent factor their own proficiency out of the picture in
an OPI situation. Nevertheless, cautions from Ross and Berwick (at this
conference) and Bachman (1988) are not to be lightly set aside. In many
interview situations, undesirable variability stemming from the first position (the
interviewer or test designer) may contaminate the variability of interest in the
The Third Position in Focus. For a last case, consider a study by Henk,
Helfeldt, and Rinehart (1985) of the third position. The aim of the study was to
determine the relative sensitivity of cloze items to information ranging across
sentence boundaries. Only 25 subjects were employed (second position) and two
cloze passages (conflating variables of first and third position). The two
passages (third position) were presented in a normal order and in a scrambled
version (along the lines of Chihara, et al., 1977, and Chavez-Oller, et al., 1985).
The relevant contrast would be between item scores in the sequential versus
scrambled conditions. Provided the items are really the same and the texts are not different in other respects (i.e., in terms of extraneous variability stemming from first and/or second positions, or unintentional and extraneous adjustments between the scrambled and sequential conditions in the third position); that is, provided the tests are not too easy or too difficult (first position) for the subject sample tested (second position), and the subject sample does not have too little or too much knowledge (second position) concerning the content (supplied by the first position) of one or both texts, the design at least has the potential of uncovering some items (third position) that are sensitive to constraints ranging beyond sentence boundaries. But does it have the potential
here. This is never a legitimate research conclusion. Anyone can see the
difficulty of the line of reasoning if we transform it into an analogous syllogism
presented in an inductive order:
Specific case, first minor premise: I found no gold in California.
Specific case, second minor premise: I searched in two (or n) places (in
California).
General rule, or conclusion: There is no gold in California.
Anyone can see that any specific case of a similar form will be insufficient to
prove any general rule of a similar form. This is not a mere question of
statistics, it is a question of a much deeper and more basic form of logic.
CONCLUSION
Therefore, for reasons made clear with each of the several examples with
respect to each of the three perspectives discussed, for language testing research
commented that she'd have liked to get more from the lecture version of this
paper than she felt she received, I have two things to say. First, that I am glad
she said she wanted to receive more and flattered that "the time", as she said, "seemed to fly by" during the oral presentation (I had fun too!), and second, I hope that in years to come as she and other participants reflect on the presentation and the written version they will agree that there was even more to
be enjoyed, reflected upon, understood, applied, and grateful for than they were
able to understand on first pass. As Alderson correctly insists in his abstract, the
study of language tests and their validity "cannot proceed in isolation from developments in language education more generally" (apropos of which, also see Oller and Perkins, 1978, and Oller, in press). In fact, in order to proceed at all, I
am confident that we will have to consider a broader range of both theory and
research than has been common up till now.
REFERENCES
Bachman, Lyle. 1988. Problems in examining the validity of the ACTFL oral
proficiency interview. Studies in Second Language Acquisition 10:2. 149-164.
Burks, Arthur W., 1958. Collected Writings of Charles S. Peirce, Volumes VII and
VIII. Cambridge, Massachusetts: Harvard University.
Chavez-Oller, Mary Anne, Tetsuro Chihara, Kelley A. Weaver, and John W. Oller, Jr. 1985. When are cloze items sensitive to constraints across sentences? Language Learning 35:2. 181-206.
Chihara, Tetsuro, John W. Oller, Jr., Kelley A. Weaver, and Mary Anne Chavez-Oller. 1977. Are cloze items sensitive to constraints across sentence boundaries? Language Learning 27. 63-73.
Chomsky, Noam A. 1979. Language and Responsibility: Based on Conversations
with Mitsou Ronat. New York: Pantheon.
Eisele, Carolyn, ed. 1979. Studies in the Scientific and Mathematical Philosophy of
Charles S. Peirce. The Hague: Mouton.
Henk, William A., John P. Helfeldt, and Steven D. Rinehart. 1985. A metacognitive
Kamil, Michael L., Margaret Smith-Burke, and Flora Rodriguez-Brown. 1986. The
Nagel, Ernest. 1959. Charles Sanders Peirce: a prodigious but little known American philosopher. Scientific American 200. 185-192.
Oller, John W., Jr. 1979. Language Tests at School: A Pragmatic Approach. London: Longman.
Oller, John W., Jr. 1989. Language and Experience: Classic Pragmatism. Lanham, Maryland: University Press of America.
Oller, John W., Jr. 1990. Semiotic theory and language acquisition. Invited paper presented at the Forty-first Annual Georgetown Round Table on Languages and Linguistics, Washington, D.C.
Oller, John W., Jr. and Kyle Perkins. 1978. Language in Education: Testing the Tests. London: Longman.
Oltman, Philip K., Lawrence J. Stricker, and Thomas S. Barrows. 1990. Analyzing test structure by multidimensional scaling. Journal of Applied Psychology 75:1. 21-27.
Schumann, John. 1983. Art and science in second language acquisition research. In
INTRODUCTION
the United States but also worldwide, it is often claimed that the "prompt", the
question or stimulus to which the student must write a response, is a key
variable. Maintaining consistent and accurate judgments of writing quality, it is
argued, requires prompts which are of parallel difficulty. There are two
problems with this. First, a survey of the writing assessment literature, in both
L1 (Benton and Blohm, 1986; Brossell, 1983; Brossell and Ash, 1984; Crowhurst and Piche, 1979; Freedman, 1983; Hoetker and Brossell, 1986, 1989; Pollitt and Hutchinson, 1987; Quellmalz et al, 1982; Ruth and Murphy, 1988; Smith et al, 1985) and L2 (Carlson et al, 1985; Carlson and Bridgeman, 1986; Chiste and O'Shea, 1988; Cumming, 1989; Hirokawa and Swales, 1986; Park, 1988; Reid, 1989 (in press); Spaan, 1989; Tedick, 1989; Hamp-Lyons, 1990), reveals
conflicting evidence and opinions on this. Second (and probably causally prior),
we do not yet have tools which enable us to give good answers to the questions
of how difficult tasks on writing tests are (Pollitt and Hutchinson, 1985).
Classical statistical methods have typically been used, but are unable to provide
sufficiently detailed information about the complex interactions and behaviors
direction we see that language teachers and essay scorers often feel quite
strongly that they can judge how difficult or easy a specific writing test prompt is,
and are frequently heard to say that certain prompts are problematic because they are easier or harder than others. This study attempts to treat such
observations and judgments as data, looking at the evidence for teachers' and
raters' claims. If such claims are borne out, judgments could be of important help in establishing prompt difficulty prior to large-scale prompt piloting, and reducing the problematic need to discard many prompts because of failure at the
pilot stage.
II. BACKGROUND

III. METHOD
Since research to date has not defined what makes writing test topics
difficult or easy, our first step toward obtaining expert judgments had to be to
59
design a scale for rating topic difficulty. Lacking prior models to build on, we
chose a simple scale of 1 to 3, without descriptions for raters to use other than
1 = easy, 2 = average difficulty and 3 = hard. Next the scale and rating procedures
were introduced to 2 trained MELAB composition readers and 2 ESL writing
experts, who each used the scale to assign difficulty ratings to 64 MELAB topics
(32 topic sets). The four raters' difficulty ratings were then summed for each
topic, resulting in one overall difficulty rating per topic, from 4 (complete
agreement on a 1 = easy rating) to 12 (complete agreement on a 3 = hard rating). We then compared "topic difficulty" (the sum of judgments of the difficulty of
each topic) to actual writing scores obtained on those topics, using 8,497 cases
taken from MELAB tests administered in the period 1985-89.
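(The summing procedure just described can be made concrete with a short sketch. The Python fragment below, using invented topic labels and scores rather than the actual MELAB data, combines four raters' 1-3 judgments into a single 4-12 difficulty rating per topic and attaches that rating to each candidate's writing score.)

# Each topic receives four ratings on the 1 (easy) to 3 (hard) scale;
# summing them yields an overall difficulty rating between 4 and 12.
# Topic labels, ratings, and scores below are invented for illustration.
ratings = {
    "27A": [2, 3, 3, 3],
    "27B": [1, 2, 2, 1],
}
summed_difficulty = {topic: sum(r) for topic, r in ratings.items()}
print(summed_difficulty)            # {'27A': 11, '27B': 6}

# Each test record pairs the topic chosen with the writing score awarded,
# so difficulty ratings can be compared with actual writing scores.
records = [("27A", 8), ("27B", 10), ("27A", 9)]
paired = [(summed_difficulty[topic], score) for topic, score in records]
print(paired)                       # [(11, 8), (6, 10), (11, 9)]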
Next, we categorized the 64 prompts according to the type of writing task
type categories would be judged generally more difficult than others, and that
[Table 1: Topic Difficulty. For each of the 64 MELAB topics (32 A/B topic sets), the table lists the summed difficulty rating (4-12) assigned by the four judges.]
Most prompts had a difficulty score around the middle of the overall difficulty scale (i.e. 8). This is either because most prompts are moderately difficult, or, more likely, because of the low reliability of our judges'
And here was our first difficulty, and our first piece of interesting data: it seemed that claims that essay readers and language teachers can judge prompt difficulty, while not precisely untrue, are also not precisely true, and certainly not true enough for a well-grounded statistical study. When we looked at the data to discover whether the judgments of topic difficulty could predict writing score, using a two-way analysis of variance, in which writing score was the dependent variable and topic difficulty was the independent variable, we found that our predictions were almost exactly the reverse of what actually happened (see Table
2).
[Table 2: Analysis of variance of writing score by summed topic difficulty rating, with mean writing scores for each difficulty level from 4 to 12 (eta = .07, eta-squared = .005).]
[Figure 1: ANOVA, Topic Difficulty and Writing Score. Regression of writing score on summed topic difficulty judgments (mult R = .066).]
Further, while the effect of judged topic difficulty on writing score is significant (p=.0000), the magnitude of the effect is about 18 times smaller than would be expected, considering the relative lengths of the writing and topic difficulty scales. That is, since the writing scale is approximately twice as long as the topic difficulty scale (19 points vs. 11 points), we would expect, assuming "even" increases in proficiency (i.e. that writing proficiency increases in steps that are all of equal width), that every 1-point increase in topic difficulty would be associated with a 2-point decrease in writing score; instead, the coefficient for topic difficulty effect (.11456) indicates that a 1-point increase in topic difficulty is actually, on average, associated with only about a 1/10-point increase in writing score. Also, it should be noted that such an increase is of little practical consequence, since a change of less than a point in MELAB writing score would have no effect either on reported level of writing performance or on final MELAB score.
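(The comparison of expected and observed effect sizes can be reproduced with any ordinary least-squares routine. The sketch below, in Python with numpy, uses invented stand-in data rather than the 8,497 MELAB records; it fits writing score on summed difficulty and contrasts the fitted slope with the slope of roughly -2 that the "even steps" reasoning would predict.)

import numpy as np

# Invented stand-in data: summed difficulty ratings (4-12) and the
# corresponding MELAB writing scores. The real study used 8,497 records.
difficulty = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12, 8, 9, 7])
writing    = np.array([9, 9, 10, 9, 9, 10, 10, 10, 11, 9, 10, 9])

# Ordinary least-squares fit of writing score on difficulty.
slope, intercept = np.polyfit(difficulty, writing, 1)
print("fitted slope: {:.3f} writing-score points per difficulty point".format(slope))

# Slope that would follow from the "even steps" assumption in the text.
expected_slope = -2.0
print("expected slope under the even-steps assumption:", expected_slope)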
[Diagram: classification of prompts by response mode, crossing expository vs. argumentative with private vs. public.]
Table 3 shows the difficulty ratings for each category or "response mode":
[Table 3: Response modes and difficulty ratings. For each response-mode category (expository/private, expository/public, argumentative/private, argumentative/public, and combination), the table lists the topics assigned to that category with their summed difficulty ratings, together with the mean difficulty rating and mean writing score for the category.]
Figure 3. ANOVA: Topic Difficulty Judgments and Response Mode Difficulty Judgments
[Analysis of variance of summed topic difficulty judgments by response mode (expository/private, expository/public, argumentative/private, argumentative/public, combination); eta = .57, eta-squared = .32.]
Since the two sets of judgments were made by the same judges, albeit six months
apart, such a finding is to be expected.
would get the highest scores, with topics in the other categories falling in
between. To test this hypothesis, we conducted a two-way analysis of variance, in
which writing score was the dependent variable and topic type the independent
variable. The results of the ANOVA, shown in Figure 4, reveal that our
predictions were exactly the reverse of what actually happened: on average,
expository/private topics are associated with the lowest writing scores and
argumentative/public the highest.
Figure 4: ANOVA
[Analysis of variance of writing score by response mode (eta = .10, eta-squared = .01), with mean writing scores and counts for each category.]
with the highest difficulty ratings and of the hardest (argumentative/public) type would get the lowest writing scores. To test this, we again used a two-way analysis of variance, this time selecting writing score as the dependent variable and topic difficulty and topic type as the independent variables. It should be noted that in order to be able to use ANOVA for this
[Two-way analysis of variance of writing score by topic difficulty and topic type, with cell means and counts for each response-mode category at each level of difficulty.]
Table 4. Cell mean writing scores by response mode, in ascending order:
expository/public 8.27442
expository/private 8.99454
expository/private 9.19080
argumentative/private 2 9.30247
combination 9.40769
argumentative/private 1 9.60690
combination 9.62406
expository/public 9.64121
argumentative/public 9.88822
argumentative/public 9.97680
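(For readers who wish to replicate this kind of analysis, the sketch below, in Python with pandas and statsmodels, fits a two-way analysis of variance of the form just described, with writing score as the dependent variable and topic difficulty and topic type as crossed factors. The data frame is a small invented stand-in, and difficulty is collapsed into two bands purely to keep the illustration compact; neither choice reflects the actual study data.)

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Invented stand-in data: 16 candidates, four response-mode categories,
# difficulty collapsed into two bands ("lower"/"higher") for compactness.
data = pd.DataFrame({
    "writing":    [8, 9, 9, 10, 9, 10, 11, 10, 9, 8, 10, 11, 9, 10, 10, 11],
    "difficulty": ["lower", "lower", "higher", "higher"] * 4,
    "topic_type": ["exp_pri"] * 4 + ["exp_pub"] * 4 +
                  ["arg_pri"] * 4 + ["arg_pub"] * 4,
})

# Two-way ANOVA: main effects of difficulty and topic type, plus their
# interaction, on writing score.
model = smf.ols("writing ~ C(difficulty) * C(topic_type)", data=data).fit()
print(anova_lm(model, typ=2))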
IV. DISCUSSION
matched by our writing score data. What we did find were unexpected but interesting patterns which should serve both to inform the item writing stage of direct writing test development, and to define questions about the effects of topic
and that writers will have more difficulty writing on these. Yet, it may be that either what judges perceive as cognitively demanding to ESL writers is in fact not, or alternately, that it is not necessarily harder for ESL writers to write about the topics judged as more cognitively demanding. While some L1 studies have concluded that personal or private topics are easier for L1 writers than impersonal or public ones, and that argumentative topics are more difficult to write on than topics calling for other discourse modes, these L1 findings do not necessarily generalize to ESL writers.
Another possible explanation for the patterns we discovered is that perhaps
more competent writers choose hard topics and less competent writers choose
easy topics. In fact, there is some indication in our data that this may be true. We examined whether the topics judged more difficult had indeed been chosen by candidates with higher mean Part 3 scores. We found this to be true for 15 out of 32--nearly half--of the topic sets; thus, half of the time,
general language proficiency and topic choice could account for the definite
patterns of relationship we observed between judged topic difficulty, topic type
and writing performance. One of these 15 sets, set 27, was used in a study by
Spaan (1989), in which the same writers wrote on both topics in the set (A and
B). While she found that, overall, there was not a significant difference between
scores on the 2 topics, significant differences did occur for 7 subjects in her
study. She attributed these differences mostly to some subjects apparently
possessing a great deal more subject matter knowledge about one topic than the
other.
A further possible explanation for the relationship we observed between
difficulty judgments and writing scores could be that harder topics, while perhaps
more difficult to write on, push students toward better, rather than worse writing
performance. This question was also explored through an investigation of topic
difficulty judgments, mean Part 3 scores and mean writing scores for single
topics in our dataset. We found in our dataset 3 topics whose mean Part 3 scores were below average, but whose mean writing scores were average, and which were judged as "hard" (11 or 12, argumentative/public). One of these
topics asked writers to argue for or against US import restrictions on Japanese
cars; another asked writers to argue for or against governments treating illegal
aliens differently based on their different reasons for entering; the other asked
writers to argue for or against socialized medicine. The disparity between Part 3
and writing performance on these topics, coupled with the fact that they were
judged as difficult, suggests that perhaps topic difficulty was an intervening
variable positively influencing the writing performance of candidates who wrote
on these particular topics. To thoroughly test this possibility, future studies could
be conducted in which all candidates write on both topics in these sets.
A related possibility is that perhaps topic difficulty has an influence, not
necessarily on actual quality of writing performance, but on raters' evaluation of
that performance. That is, perhaps MELAB composition raters, consciously or subconsciously, adjust their scores to compensate for, or even reward, choice of
"extra credit" for having attempted a difficult topic. Whether or not these
concerns translate into actual scoring adjustments is an important issue for direct
writing assessment research.
V. CONCLUSION
In sum, the findings of this study provide us with information about topic difficulty judgments and writing performance without which we could not effectively proceed to design and carry out research aimed at answering the above questions. In other words, we must first test our assumptions about topic difficulty, allowing us to form valid constructs about topic difficulty effect; only then can we proceed to
carry out meaningful investigation of the effect of topic type and difficulty on
writing performance.
REFERENCES
Bachman, Lyle. 1990. Fundamental Considerations in Language Testing. London,
England: Oxford University Press.
Benton, S.L. and P.J. Blohm. 1986. Effect of question type and position on
measures of conceptual elaboration in writing. Research in the Teaching of
English. 20: 98-108
Bridgeman, Brent and Sybil Carlson. 1983. A Survey of Academic Writing Tasks
Required of Graduate and Undergraduate Foreign Students. TOEFL Research
Report No. 15. Princeton, New Jersey: Educational Testing Service.
Brossell, Gordon and Barbara Hooker Ash. 1984. An experiment with the wording
of essay topics. College Composition and Communication, 35: 423-425.
Carlson, Sybil and Brent Bridgeman. 1986. Testing ESL student writers. In Greenberg, Karen L., Harvey S. Weiner and Richard A. Donovan (Eds). Writing Assessment: Issues and Strategies (126-152). New York: Longman.
Chiste, Katherine and Judith O'Shea. 1988. Patterns of Question Selection and Writing Performance of ESL Students. TESOL Quarterly, 22(4): 681-684.
Crowhurst, Marion and Gene Piche. 1979. Audience and mode of discourse effects
on syntactic complexity of writing at two grade levels. Research in the Teaching
of English 13; 101-110.
Cumming, A. 1989. Writing expertise and second language proficiency. Language Learning, 39(1): 81-141.
Greenberg, Karen L. 1986. The development and validation of the TOEFL writing test: a discourse of TOEFL research reports 15 and 19. TESOL Quarterly, 20(3): 531-544.
Hamp-Lyons, Liz. 1988. The product before: task-related influences on the writer. In Robinson, P. (Ed). Academic Writing: Process and Product. London: Macmillan in association with the British Council.
Hirokawa, Keiko and John Swales. 1986. The effects of modifying the formality level of ESL composition questions. TESOL Quarterly, 20(2): 343-345.
Hoetker, James. 1982. Essay exam topics and student writing. College Composition
and Communication,33: 377-91
Hoetker, James and Gordon Brossell. 1989. The effects of systematic variations in
Lunsford, Andrea 1986. The past and future of writing assessment. In Greenberg,
Mohan, Bernard and Winnie Au Yeung Lo. 1985. Academic writing and Chinese students: transfer and developmental factors. TESOL Quarterly, 19(3): 515-534.
Park, Young Mok. 1988. Academic and ethnic background as factors affecting
writing performance. In Purves, Alan (Ed). Writing Across Languages and Cultures: Issues in Cross-Cultural Rhetoric. Newbury Park, California: Sage
Publications.
Pollitt, Alistair, Carolyn Hutchinson, and Noel Entwistle. 1985. What Makes Exam Questions Difficult: An Analysis of 'O' Grade Questions and Answers. Research Reports for Teachers, No. 2. Edinburgh: Scottish Academic Press.
1984. Toward a
domain referenced system for classifying composition assignments. Research in
the Teaching of English 18: 385-409.
Quellmalz, Edys. 1984. Toward a successful large scale writing assessment: where
are we now? where do we go from here? Educational Measurement: Issues and
Practice Spring 1984: 29-32, 35.
Quellmalz, Edys, Frank Capell and Chih-Ping Chou. 1982. Effects of discourse
Ruth, Leo and Sandra Murphy. 1988. Designing Writing Tasks for the Assessment of Writing. Norwood, New Jersey: Ablex Publishing Company.
_____. 1984. Designing topics for writing assessment: problems of meaning. College Composition and Communication 35: 410-422.
Smith, W. et al. 1985. Some effects of varying the structure of a topic on college students' writing. Written Communication 2(1): 73-89.
Spaan, Mary. 1989. Essay tests: What's in a prompt? Paper presented at 1989
TESOL convention, San Antonio, Texas, March 1989.
Tedick, Diane. 1989. Second language writing assessment: bridging the gap
between theory and practice. Paper presented at the Annual Convention of
the American Educational Research Association, San Francisco, California,
March 1989.
APPENDIX I
[MELAB (Michigan English Language Assessment Battery) composition rating scale: band descriptors for each score level, covering topic development, range and control of syntactic structures, morphological control, vocabulary, organization and connection, and spelling and punctuation.]
APPENDIX I (CONT'D)
Type 1: EXPOSITORY/PRIVATE
When you go to a party, do you usually talk a lot, or prefer to listen? What does this show about your personality?
Type 2: EXPOSITORY/PUBLIC
Imagine that you are in charge of establishing the first colony on the moon.
What kind of people would you choose to take with you? What qualities and
skills would they have?
Type 3: ARGUMENTATIVE/PRIVATE
A good friend of yours asks for advice about whether to work and make money
or whether to continue school. What advice would you give him/her?
Type 4: ARGUMENTATIVE/PUBLIC
What is your opinion of mercenary soldiers (those who are hired to fight for a country other than their own)?
Discuss.
INTRODUCTION
There is a long tradition in the academic world of using essays and other forms of written expression as a means of assessing student proficiency and achievement.
not been specifically trained or guided for the task -- although it has to be
admitted that this is an idea that dies hard in the context of academic assessment
in the university.
The modern concern about achieving reliability in marking has meant that
relatively less attention has been paid to the other major issue in the testing of
writing: how to elicit samples of writing from the students. Recent research on
writing involving both native speakers and second language learners raises a
number of questions about the setting of writing test tasks. The relevant
research involves not only the analysis of writing tests but also more basic studies
of the nature of the writing process. Some of the questions that arise are as
follows:
and then to consider the more general issue of what constitutes a valid writing
test task.
Carrell (e.g. 1984, 1987), Johnson (1981) and others - that background
knowledge is a very significant factor in the ability of second language readers to
comprehend a written text. Furthermore, testing researchers such as Alderson
and Urquhart (1985) and Hale (1988) have produced some evidence that a lack
of relevant background knowledge can affect performance on a test of reading
comprehension. Until recently, there have been few comparable studies of the
the test-takers, and so it is hoped that all of them will find at least one that
motivates them to write as best they can.
In order to provide a basis for analysing writing test tasks according to the
amount of content material that they provide, it is useful to refer to Nation's
(1990) classification of language learning tasks. In this system, tasks are
categorised according to the amount of preparation or guidance that the learners
are given. If we adapt the classification to apply to the testing of writing, there
are three task types that are relevant and they may be defined for our purposes
as follows:
1 Independent tasks: The learners are set a topic and

2 Guided tasks:
Foreign Language). In one of the two alternating versions of this test, the
candidates are presented with data in the form of a graph or a chart and are
asked to write an interpretation of it (Educational Testing Service, 1989: 9). For
example, a preliminary version of the test included three graphs showing changes
in farming in the United States from 1940 to 1980, the task being to explain how
the graphs were related and to draw conclusions from them.
3 Experience tasks:
Tasks of this kind are found in major EAP tests such as the International
English Language Testing Service (IELTS) and the Test in English for
Educational Purposes (TEEP). In both of these tests, writing tasks are linked
with tasks involving other skills to some extent in order to simulate the process
of academic study. For example, in the first paper of TEEP, the candidates
work with two types of input on a single topic: a lengthy written academic text
and a ten-minute lecture. In addition to answering comprehension questions about each of these sources, they are required to write summaries of the information presented in each one (Associated Examining Board, 1984). The same kind of test design, where a writing task requires the synthesizing of information from readings and a lecture presented previously on the same topic,
is found in the Ontario Test of ESL (OTESL) in Canada (Wesche, 1987).
Thus, the three types of task vary according to the amount and nature of the
part of a larger proficiency test battery. The test results do not provide a basis for
admission decisions; other measures and criteria are employed for that purpose.
The test is composed of three tasks, as follows:
Task 1 (Guided)
The first task, which is modelled on one by Jordan (1980: 49), is an example of
the guided type. The test-takers are given a table of information about three
grammar books. For each book the table presents the title, author name(s),
price, number of pages, the level of learner for whom the book is intended
(basic, intermediate or advanced) and some other features, such as the
availability of an accompanying workbook and the basis on which the content of
the book is organised. The task is presented to the learners like this: "You go
to the university bookshop and find that there are three grammar books
available. Explain which one is likely to be the most suitable one for you by
comparing it with the other two."
This is a guided task in the sense that the students are provided with key facts about the three grammar books to refer to as they write. Thus, the focus of their writing activity is on selecting the relevant information to use and organising the composition to support the conclusion that they have drawn about the most
suitable book for them.
Task 2 (Experience)
For the second task, the test-takers are given a written text of about 600 words,
which describes the process of steel-making. Together with the text, they
receive a worksheet, which gives them some minimal guidance on how to take
notes on the text. After a period of 25 minutes for taking notes, the texts are
collected and lined writing paper is distributed. Then the students have 30
minutes to write their own account of the process, making use of the notes that
they have made on the worksheet but not being able to refer to the original text.
It could be argued that this second task is another example of the guided type, in
the sense that the students are provided with a reference text that provides them with content to use in their writing. However, it can also be seen as a simple kind of experience task, because the test procedure is divided into two distinct
stages: first, reading and notetaking and then writing. While they are composing
their text, the students can refer only indirectly to the source text through the
notes that they have taken on it.
topic on which the test task is based as part of their regular classwork. The
topic used so far has been Food Additives. The classes spend about five hours
during the week engaged in such activities as reading relevant articles, listening
to mini-lectures by the teacher, taking notes, having a class debate and discussing
how to organise an answer to a specific question related to the topic. However,
the students do not practise the actual writing of the test task in class.
The week's activities are intended to represent a kind of mini-course on the topic
of Food Additives, leading up to the test on the Friday, when the students are
given an examination-type question related to the topic, to be answered within a
time limit of 40 minutes. The question is not disclosed in advance either to the
students or the teachers. A recent question was as follows:
Processed foods contain additives.
How safe is it to eat such foods?
This third task in the test is a clear example of an experience task. It provides
the students with multiple opportunities during the week to learn about the topic
(to acquire relevant prior knowledge, in fact), both through the class work and
any individual studying they may do. Of course this does not eliminate
differences in background knowledge among the students on the course.
However, it is considered sufficient if the students' interest in the topic is
stimulated and they learn enough to be able to write knowledgeably about it in
the test.
ability to marshal content material effectively but also to tailor their writing for
specific readers (or "audiences") and purposes. In addition, they need guidance
on such matters as the amount to be written and the form of text that they should
produce. If such specifications are needed for native speakers of English, then
the need is even greater in the case of foreign students who - as Hamp-Lyons
(1988: 35) points out - often lack knowledge of the discourse and pragmatic rules
that help to achieve success in an academic essay test.
At a theoretical level, Ruth and Murphy (1984) developed a model of the "writing assessment episode" involving three actors - the test-makers, the test-takers and the test-raters - and correspondingly three stages: the preparation of
the task, the student's response to it and the evaluation of the student's response.
The model highlights the potential for mismatch between the definition of the
task as intended by the test-makers and as perceived by the students. Assuming
that in a large-scale testing programme the test-raters are different people from
the test-makers, the way that the raters interpret the task is a further source of
variability in the whole process.
The tasks in the ELI writing test described earlier can be used to illustrate
some of the problems of interpretation that arise. For example, in Task 1, the
purpose of the task is not very clear. In real life, if one were making a decision
about which grammar book to buy, one would normally weigh up the various
considerations in one's mind or at most compose a list similar to the table in the
Some of the
students who took the test apparently had this difficulty and their solution was to
satisfactory, but the fact that some students responded in this way represents useful feedback on the adequacy of the task specification. The instinct of some of the teacher-raters was to penalize the letter-writers, on the grounds that there
was nothing in the specification of the task to suggest that a letter should be
written. However, one can equally argue that it was the task statement that was
at fault.
Another kind of interpretation problem has arisen with the third task, when
the question has been stated as follows:
Processed foods contain additives.
How safe is it to eat such foods?
devoted the first half of their composition to it, before moving on to the
question of the safety of the additives. Ruth and Murphy (1984: 417-418) noted
the same phenomenon with a similar writing prompt for LI students in the
United States. Whereas the majority of the students considered the opening
sentence of the prompt to be an operative part of the specification that needed
to be referred to in their essays, none of the teacher-raters thought it necessary
to do so.
Faced with such problems of interpretation, test-writers are in something of
a dilemma. In a writing test, it is obviously desirable to minimise the amount of
time that the students need to spend on reading the question or the task
statement.
Assume that you have just returned from a trip and are writing a letter
to a close friend. Describe a particularly memorable experience that
occurred while you were traveling.
draft, and criteria. Obviously not all of these dimensions need to be specified in
any individual writing stimulus, but they highlight the complexities involved,
especially when the test-setters and the test-takers do not share the same cultural
and educational backgrounds.
when she analysed student scripts from a pre-operational version of the test
using the Writer's Workbench, a computer text-analysis program which provided
data on discourse fluency, lexical choice and the use of cohesion devices in the
students' writing. The program revealed significant differences in the discourse
features of the texts produced in response to the two different tasks.
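(The Writer's Workbench analysis itself is not reproduced here, but the kind of comparison reported can be approximated with very simple text statistics. The Python sketch below computes rough proxies for discourse fluency, lexical choice and cohesion for two sample scripts so that task-related differences can be inspected; the connective list and the scripts are invented for illustration and are only a crude stand-in for the program described.)

import re

# A small, invented set of connectives used as a rough cohesion measure.
CONNECTIVES = {"however", "therefore", "moreover", "because", "although", "thus"}

def text_measures(script):
    # Proxies: words per sentence (fluency), type-token ratio (lexical
    # choice), and density of connectives (cohesion).
    sentences = [s for s in re.split(r"[.!?]+", script) if s.strip()]
    words = re.findall(r"[a-z']+", script.lower())
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "connective_density": sum(w in CONNECTIVES for w in words) / max(len(words), 1),
    }

# Invented sample scripts standing in for responses to two different tasks.
letter_script = ("I am writing to tell you about my trip. "
                 "It was great because we saw many things.")
opinion_script = ("Processed foods are convenient. However, additives may be "
                  "unsafe; therefore caution is needed.")
print("letter task:", text_measures(letter_script))
print("opinion task:", text_measures(opinion_script))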
ways by the test stimulus. In addition, tasks that were closer to personal
experience and spoken language (letter-writing or story-telling) were less
demanding than more formal ones, like expressing an opinion on a controversial
topic.
This variation according to formality was also present in Cumming's (1989)
informal topic (a personal letter) and the more academic ones (an argument
and a summary).
the field-specific topic were longer, more syntactically complex and of higher
overall quality (as measured by holistic ratings) than the essays on the general
topic. Furthermore, the field-specific essays provided better discrimination of
the three levels of ESL proficiency represented among the subjects. Tedick
contributes to the validity of the test by giving a broader basis for making
generalizations about the student's writing ability.
that the test-taker produces or, in other words, the product rather than the
process. Practical constraints normally mean that the students can be given only
a limited amount of time for writing and therefore they must write quite fast in
order to be able to produce an adequate composition. The preceding discussion
of different tasks and ways of specifying them has concentrated on the issue of
how to elicit the kind of texts that the test-setter wants, with little consideration
of the process by which the texts will be produced and whether that reflects the
way that people write in real life.
However, any contemporary discussion of writing assessment must take
account of the major developments that have occurred over the last fifteen years
in our understanding of writing processes. In the case of L1 writers, researchers
such as Britton et al. (1975), Murray (1978), Perl (1980) and Graves (1983) have
creating ideas and creating the language to express those ideas". Studies by
Zamel (1983), Raimes (1987), Cumming (1989) and others have found that L2
writers exhibit very much the same strategies in composing text as L1 writers do.
There are a number of implications of this research for the assessment of
writing. If the production of written text is not strictly linear but inherently
recursive in nature, normally involving cyclical processes of drafting and revising,
this suggests that test-takers need to be given time for thinking about what they
are writing; in addition, explicit provision needs to be made for them to revise
and rewrite what they have written.
The question, then, is how the writing process can be accommodated within
the constraints of the test situation. Of course, any test situation is different
from the context of real-world language processing, but the disjunction between
"natural" writing processes and the typical writing test is quite marked,
particularly if we are interested in the ability of students to write essays, research
reports and theses, rather than simply to perform in examination settings. Even
a substantially increased time allocation for completing a test task does not alter
the fact that the students are being required to write under constraints that do
not normally apply to the writing process.
There are various ways in which one can reduce the effects of the time
constraints. One means is to limit the demands of the writing test task by
providing support as part of the task specification. The provision of content
material, as is done with guided and experience tasks, is one way of reducing the
complexity of the writing task and allowing the test-taker to focus on the
structure and organization of the text.
Another, more radical approach which is gaining ground is to move away
from a reliance on timed tests for writing assessment. This may not be possible
REFERENCES
Alderson, J. C. and Urquhart, A. H. (1985). The effect of students' academic discipline on their performance on ESP reading tests. Language Testing, 192-204.
Carrell, P. L. (1987).
Cumming, A. (1989).
Princeton,
Hale, G. A. (1988). Student major field and text content: interactive effects on
Jacobs, H. L., Zingraf, S. A., Wormuth, D. R., Hartfiel, V. F., & Hughey, J. B.
(1981). Testing ESL composition: A practical approach. Rowley, MA:
Newbury House.
Kelly, P. (1989). Theory, research and pedagogy in ESL writing. In C.N. Candlin and T. F. McNamara (Eds.) Language, learning and community. Sydney: National Centre for English Language Teaching and Research, Macquarie University.
In S. Anivan (Ed.)
Purves, A. C., Soter, A., Takata, S., and Vahapassi, A. (1984). Towards a domain-referenced system for classifying composition assignments. Research in the Teaching of English, 18.4, 385-416.
422.
Stansfield, C. W. and Ross, J. (1988). A long-term research agenda for the Test of Written English. Language Testing 5, 160-186.
INTRODUCTION
might call linguistic 'models' - and are themselves devices for generating
descriptions of the individual language user's ability in terms of the underlying
model. So language tests, too, must simplify what they assess.
Sometimes the descriptions produced by a language test are in terms of
numbers, eg '72%' or perhaps '59% in Writing, 72% in Reading' (although it is
difficult to know what such descriptions of linguistic ability could mean, they are
the various aspects of the description fall are not God-given, inherent in the
nature of language or linguistic ability, so much as imposed on a continuum of
confusing and unruly data by language specialists. Language ability does not fall
neatly into natural pre-existing categories, but has to be forced into man-made
of a language ability, then, will be one which leaves a great deal out! What such
theoretical model - one which, in the features it chooses to highlight, and in the
way it relates those features one to another, attempts to capture the essence of
the language ability. The questions for a test, then, are: how elaborate a model
should it be based on if it is to avoid the criticism that it leaves out of account
crucial features of the language ability to be measured; and on the other hand
how much complexity can it afford to report before it runs the risk of being
unusable?
communicative competence tests, communicative performance tests, or simply - and conveniently - communicative tests, and views of what those various terms
the sorts of characteristics such tests ought to have. We may cite just the
following few as being fairly typical:
(a) Tests will be based on the needs (or wants) of learners. It would be
unreasonable to assess a learner's ability to do through English
something which he has no need or wish to do. A principle such as this
suggests that the different needs of different learners may call for
different types of linguistic ability at different levels of performance; in
principle tests incorporating this idea will vary appropriately for each
new set of needs in the number and type of abilities they assess, and in
their appraisal of what constitutes a satisfactory level of performance.
(c) Tests will employ authentic texts, or texts which embody fundamental
features of authenticity. These 'fundamental features' may well include
appropriate format and appropriate length, both of which will vary with
the type of text. Concerning length in particular, longer texts are said to
Tests already exist which seek to embody all these and other features of
natural communication for more or less well-defined groups of learners. The
challenge is great and the difficulties formidable. Bachman (1990) has criticised
sample them adequately for all test-takers, and perhaps not even for a single
test-taker.
In the light of the already-existing difficulties posed for test construction,
and of such criticisms, and of the need for a useful, practical test to avoid
excessive complexity, we must think very carefully indeed before proposing that
tests should incorporate yet another level of complexity by including information
on the effects of affective factors in the descriptions which they yield.
Affective Factors
Affective factors are emotions and attitudes which affect our behaviour.
We may distinguish between two kinds: predictable and unpredictable.
Unpredictable: Most teachers will be familiar with the kinds of affective factor
which produce unpredictable and unrepresentative results in language tests, eg. a
performance the best that they are capable of, will obviously detract from the
reliability of the description of abilities yielded by the test.
Clearly, if we can find ways of minimizing the effects of such unpredictable
Predictable: There may be another set of affective factors which are predictable
in their effects on the quality of communication, and which can therefore be built
adequate and appropriate rating scales. Whilst the latter are undeniably
important, the more fundamental point that the quality of spoken language
performance may vary predictably with features of the interlocutor tends to go
unnoticed. Research in this area is practically non-existent, although the results
would be of importance beyond language testing for our understanding of the
nature of linguistic performance.
Locke chose to consider the effect of the gender of the interviewer on the
interviewee. Four male postgraduate Iraqi and Saudi students at the University
of Reading were each interviewed twice, once by a male and once by a female
interviewer. The four interviewers were all of comparable age. Two students
were interviewed by a male interviewer first, and the other two by a female
interviewer first; in this way it was hoped that any order effect could be
discounted. Then, it was necessary for each interview to be similar enough to
allow meaningful comparison of results, but not so similar that the second
interview would be felt to be a simple repeat of the first, with a consequent
practice effect. A 'same-but-different' format was therefore necessary. Each
interview was given the same structure, and the general topic-area was also the
same, but the specific content of the first and second interviews was different.
in both scoring methods and there was a high level of agreement between the
two raters.
Investigation 2: These results demanded both replication and deeper
exploration. The writer therefore carried out a slightly larger investigation with
thirteen postgraduate Algerian students at Reading (11 males and two females).
This time, interviewers were cross-categorized not only by gender, but also by
whether or not the student was acquainted with them and by a rough
categorization of their personality as 'more outgoing' or 'more reserved'. Once
again, the age of interviewers was comparable.
first interview, and unacquainted in the second, with the other half of the
students having the reverse experience; and again roughly half of the students
received an 'outgoing' interviewer first, followed by a 'reserved' interviewer, with
the remainder having the reverse experience. The interviews were again
designed to be 'same-but-different', were video-recorded, shuffled, and rated
using two methods of assessment.
What was not clear from Locke's study and could only be trivially
investigated in this one was whether any gender effect was the result of
interviewees' reactions to males versus females, or to own-gender versus
opposite-gender interviewers. In this respect, it was particularly unfortunate that
more female students could not be incorporated in the study: female students of
the same cultural background as the males were not available. Nevertheless,
while expressing all the caution necessary when considering the results of only
two students, the results for the two female students were interesting. For one
of the women, no difference was observable by either scoring method with the
male and female interviewers. The other woman was rated more highly when
interviewed by the man. Neither woman could be seen to go against the trend
established in the men.
A very tentative conclusion to be drawn from these two limited studies
would seem to be that, in the interview situation at least, young adult male Arab
with Japanese students, but would not be surprised to find an age-effect, ie. we
might expect students to achieve higher spoken-English ratings when interviewed
by older interviewers, as such interviewers would be accorded greater respect.
As in the previous studies, each student was given two short 'same-but-different' interviews, one by a male interviewer, one by a female. Half of the
students were interviewed by a male first, half by a female first, and all
interviews were video-recorded.
next it would not be (neutral status). Each interviewer (I) interviewed four
students, thus:
                  1st interview                  2nd interview
Student # 1       Male I # 1 (high status)       Female I # 1 (neutral status)
Student # 2       Female I # 1 (high status)     Male I # 1 (neutral status)
Student # 3       Female I # 1 (neutral status)  Male I # 1 (high status)
Student # 4       Male I # 1 (neutral status)    Female I # 1 (high status)
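For illustration, and assuming the fully crossed design implied here (which gender interviews first, crossed with which interview carries the high-status introduction), the four schedules could be generated as follows; the student numbering is arbitrary.

# Minimal sketch of the counterbalanced interview schedule (assumed design).
from itertools import product

def schedule():
    plans = []
    for first_gender, high_status_first in product(("male", "female"), (True, False)):
        second_gender = "female" if first_gender == "male" else "male"
        plans.append([
            (first_gender, "high" if high_status_first else "neutral"),
            (second_gender, "neutral" if high_status_first else "high"),
        ])
    return plans

for i, plan in enumerate(schedule(), start=1):
    print(f"Student #{i}: 1st {plan[0]}, 2nd {plan[1]}")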
name, and with academic titles where relevant (eg. Dr Smith). A brief
"96
themselves' in both status conditions, their status being suggested to the student
purely through the mode of introduction together with minor dress differences.
Videos of these interviews are currently being rated on holistic and analytic
scales, as before. On this occasion, however, the holistic scales used are those
developed by Weir for the oral component of the Test of English for Educational
Purposes (see Weir, 1988), and in order to facilitate comparisons, the videos
have not been shuffled. Multiple rating is being undertaken, with an equal
number of male and female raters. Thus far, only two sets of ratings have been
obtained, one by a male rater and one by a female.
While it is as yet much too early to draw any solid conclusions, some
Secondly, there is a slight tendency on both rating scales and with both
raters for students to achieve higher ratings when being interviewed by males,
but this is by no means as clear-cut as in the earlier investigations, and on the
analytic scales there is considerable disagreement between the raters on which
criteria, or for which students, this tendency manifests itself. Nevertheless, some
tendency is there.
Finally - and this, perhaps, is the most surprising finding - there is some
slight tendency on the analytic scale, and a more marked tendency on the holistic
scale, for students to achieve higher ratings with interviewers who were not
marked for high status!
If this latter suggestion is borne out when the analysis is complete, and if it
is reinforced when more substantial studies are undertaken, it will raise some
perplexing questions of interpretation. One possibility might be that it is not
rather specific factors such as 'gender' or 'age', and not even a rather more
general factor such as 'status' which affect the quality of language production
directly, but some much more general, very abstract factor such as 'psychological
distance'. Thus the more 'distant' an interlocutor is perceived to be, the poorer
the ratings that will be achieved. All kinds of secondary factors might contribute
to this notion of 'distance', in varying strengths, but an interlocutor who is 'same
gender', 'same age', 'known to speaker', 'same status', etc. might be expected to
elicit higher-rated language than one who is 'other gender', 'older', 'unknown to
speaker', 'higher status', etc.
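Purely as a sketch of how such a 'distance' notion might be operationalised, one could treat each secondary factor as a mismatch between speaker and interlocutor and combine the mismatches with weights. The factors and weights below are invented for illustration and are not values from the studies reported here.

# Hypothetical weighted 'psychological distance' index (illustrative only).
FACTORS = {            # weight of each mismatch (assumed values)
    "gender": 1.0,
    "age": 0.5,
    "acquaintance": 1.5,
    "status": 1.0,
}

def distance(mismatches: dict) -> float:
    """mismatches maps factor name -> True if speaker and interlocutor differ."""
    return sum(w for f, w in FACTORS.items() if mismatches.get(f, False))

# e.g. an older, unknown, higher-status interlocutor of the other gender:
print(distance({"gender": True, "age": True, "acquaintance": True, "status": True}))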
Whatever the primary and secondary factors which ultimately emerge, if the
nature and degree of effect can be shown to be consistent in any way for a
models of spoken language performance for those speakers, and that tests of this
performance will need to take such predictable factors into account.
Let us now consider what such 'taking account' might involve, and finally
relate the whole issue to our underlying concern with the complexity of tests.
on very sensitive matters. This should not surprise us, and is not a unique
byproduct of the particular affective factor chosen by way of illustration; the
reader is reminded that affective factors are matters of emotion and attitude,
and it is not only the testee who is subject to their effects!
with the effects of affective factors in cases where it was predictable that the
factors would have a marked effect, but not predictable how great or in what
direction the effect would be. Thus a 'distance' effect might be great in some
individuals, or in people from some cultural backgrounds, but slight in others;
great 'distance' might depress the quality of performance in some learners, but
raise it in others.
It might at first glance appear that such a 'full play' solution would also have
the attraction of making it unnecessary to do the research to find out what the
their effects, and their field of operation (what topic-areas, what cultural
backgrounds, etc) will be necessary to inform the selection process.
considered; it may be that some or all of the factors affecting the spoken
language will be shown to have significant effects on performance in the written
language, too, to the same or different degrees. Alternatively, there may be a
quite different set of affective factors for the written language. And in both
media, the term 'performance' may be understood to involve both reception and
production. The potential for test complexity if all are to be reflected in test
content, structure and administration is quite awesome. Even the 'full-play'
proposal of the previous section, related to a 'status' or 'distance' effect alone,
would double at a stroke the number of interviewers required in any situation.
Nevertheless, a description of a learner's linguistic performance which ignored
this dimension of complexity would be leaving out of account something
important.
But yes, in the end, practicality will have to win the day. Where the number
of people taking the test is relatively small, and where the implications of the
results are not critical in some sense, it is unlikely that affective factors will be, or
could be, seriously and systematically taken into account. But where the test is a
large one, where the results can affect the course of lives or entail the
expenditure of large sums of money, and where specifiable affective factors are
REFERENCES
Bachman, L. 1990. Fundamental Considerations in Language Testing. London:
CUP.
1.
They also agree that among the criteria mentioned above validity is the
most important, for unless a test is valid it has no function. The validity of a test
depends on the degree to which it measures what it is supposed to measure. A
good test must serve the purpose that it is intended for, otherwise it is useless.
However reliable the results may be, however objective the scoring may be, if the
test does not measure what the test user wants to know it is irrelevant.
In our context most of the test users are foreign language teachers who
want to know how well their students have learnt the foreign language. For this
purpose they employ tests. My phrase "how well the students have learnt the
foreign language" disguises the complexity of the task. In the past twenty or
thirty years we have all learnt to accept communicative competence as the
overall aim of foreign language instruction. Students are supposed to learn to
understand and use the foreign language for purposes of communication. This
general aim can, of course, be broken down into a number of competencies in
listening, speaking, reading and writing.
In most countries the school curricula for foreign language instruction are
formulated in terms of communicative competencies, and a logical consequence
of this is that testing, too, is organized according to these competencies. This
approach to testing has been called the "curricular approach". The foreign
language curriculum is taken as the basis for the construction of foreign
language tests. On the assumption that the actual teaching follows the content
prescriptions laid down in the curriculum it seems plausible also to determine
the content of the tests on the basis of the curriculum. This takes us back to the
concept of validity. If the content of a test corresponds to the content prescribed
by the curriculum it is said to possess "curricular validity" or "content validity".
2. Authenticity
to work at that task, and observe how well he does it and the quality and
Let us try to apply the ideas expressed in this passage to a very common
task that is to be found in any foreign language curriculum: Asking the way in an
English speaking environment.
If we want to find out whether students are able to perform this speech act
the safest way would be to take them to an English speaking town, place them in
a situation where they actually have to ask the way and see whether they perform
the task successfully and to which degree of perfection. We all know that this is
hardly ever possible, except for language courses that are being held in an
English speaking country. In the great majority of cases the teaching and
learning of English takes place in a non-English environment. Therefore the
second case mentioned by Cureton comes up when the tester tries to invent a
realistic situation in which the learners have to perform operations congruent
with the ones they would have to perform in situations normal to the task.
Absolute congruence would exist when the tasks in the test situation and in the
corresponding real-life situation would actually be identical. In this extreme case
the test situation and the tasks in it are called authentic. An authentic test is
1"
6
104
of a good test. They derive it from the generally accepted criterion of validity
and regard authenticity as the most important aspect of validity in foreign-language testing.
To quote just one author who takes this view: Brendan J Carroll:
Brendan Carroll's whole book can be seen as one great attempt to ensure
authenticity in language testing.
3. Limits to authenticity
real-life situation is impossible and that (b) there are other demands that
necessarily influence our search for optimal forms of testing and therefore
relativize our attempt to construct authentic tests.
Re (a) Why is a complete congruence of test situation and real-life situation
impossible? The answer is simple: because a language test is a social event that
has - as one of its characteristics - the intention to examine the competence of
language learners. In D Pickett's words: "By virtue of being a test, it is a special
and formalised event distanced from real life and structured for a particular
purpose. By definition it cannot be the real life it is probing."3
The very fact that the purpose of a test is to find out whether the learner is
capable of performing a language task distinguishes it considerably from the
corresponding performance of this task outside the test situation. Even if we
succeed in manipulating the testees to accept the illocutionary point of a speech
act they are supposed to perform, they will, in addition, always have in mind the
other illocutionary point that is inherent to a test, namely to prove that they are
capable of doing what is demanded of them.
An example of a test that examines the students' competence in asking for a
If you are asked to find the area of a field 50 metres x 200 metres you do
not have to get up and walk all over the field with a tape measure. You will
not be concerned with whether it is bounded by a hedge or a fence, whether
it is pasture or planted, whether it is sunny or wet or whether it is Monday
or Thursday. These incidentals are irrelevant to the task of measurement,
for which the basic information is ready to hand, and we know that the
We have to concede that the decision about what are irrelevant incidentals
is easier to make in the case of an arithmetic problem than in a communicative
task, as communicative performance is always embedded in concrete situations
with a number of linguistic as well as non-linguistic elements. But the arithmetic
problem and the communicative task have one thing in common: Normally, ie.,
outside the artificial classroom setting, they occur in real-life situations that are
characterized by a small number of essential features and a great number of
incidentals which differ considerably from one situation to the next. And if we
want to grasp the essential features of a task, we have to abstract from the
incidentals. In this respect abstraction is the counterpoint to authenticity in
testing.
pragmatics can be of great help, I think. Its analyses of speech acts have
demonstrated that every speech act has its own specific structure with certain
characteristic features. It is on these characteristics that we have to concentrate
if we wish to test the learners' competence in performing this particular act.
4. Examples
elicit information from a hearer. The two preparatory conditions for the
performance of a question are that the speaker does not know the answer and
that it is not obvious that the hearer will provide the information without being
asked. The propositional content of a question depends on what information the
speaker needs, of course.
Now, we all know that teaching as well as testing the ability to ask questions
is often practised in a way that disregards these conditions. A very common way
is to present a number of sentences in which certain parts are underlined and to
invite the students to ask for these parts.
govern the performance of a question. First of all, the speech act demanded
cannot be regarded as an attempt to elicit information. Secondly, the testees do
in fact know the answer because it is given to them in the statements. It is
even underlined, which normally means that the piece of information given is
especially important - a fact that stresses the non-realistic character of the task.
And there is an additional negative feature: the procedure complicates the
task for all those learners who find themselves incapable of imagining that they
do not possess precisely the information that is given to them and of behaving
accordingly, ie. of pretending that they need it.
To conclude: The questions that the students have to ask in this test are no
questions at all. The conditions under which they have to perform their speech
acts are so basically different from those of real questions that the test cannot be
regarded as a means to examine the students' competence in asking questions.
Let us look at the next example which could serve as an alternative to the
previous one:
Holburne Museum is situated xx xxxxxxx xxxxxx.
It belongs XX XXX XXXXXXXXXX XX XXXX.
The difference between the two types of test is minimal on the surface, but
decisive as regards the speech acts that are required to perform the task. By a
very simple design, namely through replacing the underlined parts of the
sentences by words that are illegibly written, the second type marks a
considerable step forward in the direction of an authentic test: The questions
that the learners have to ask are real questions in so far as the two main
conditions of the speech act 'QUESTION' as elaborated by Searle are fulfilled.
First, they can be counted as attempts to elicit information and, second, the
testees do not know the answers yet. What is still missing is an addressee to
whom the questions might be addressed. Illegible statements are quite common,
but one would hardly ever try to obtain the lacking information by a list of
written questions. To make this test still more realistic, one could present the
statements not in writing, but in spoken form with a muffled voice that fails to be
clear precisely at those points where one wishes the students to ask their
questions. In this case all the essential conditions of the speech act
"QUESTION" would be fulfilled. But the test is still far from being authentic.
But to come back to our central problem: How far do we want to go in our
efforts to create authenticity?
In the middle part of my talk, I tried to explain why absolute authenticity, ie.
complete congruence between the test situation and the so-called real life
situation is neither possible nor desirable.
However much, for validity's sake, we might want to achieve authenticity in
our tests, any attempt to reach it will necessarily arrive at a point where it
becomes clear that there are limits to authenticity for the simple reason that a
language test - by its very purpose and structure - is a social event that is
essentially different from any other social event in which language is used.
Very fortunately, we need not be afraid of protests from our students. They
reasons I have presented we should give up our efforts to achieve the impossible
and be satisfied with finding the right balance between authenticity and
abstraction.
REFERENCES
Foreign Language run by the RSA between 1980 and 1988, and subsequently run
levels for each series of the examination, and candidates are able to choose
which modules at which level they wish to enter at any time. This structure
reflects the experience of language teachers that the performance of students is
not uniform across skill areas.
overriding importance of authenticity both of text (as input) and of task (in
processing this input).
1.4 Criterion-referenced
The essential question which a communicative test must answer is whether or
not (or how well) a candidate can use language to communicate meanings. But
"communicating meanings" is a very elusive criterion indeed on which to base
judgements. It varies both in terms of "communicating" (which is rarely a black
and white, either/or matter) and in terms of "meanings" (which are very large
and probably infinite in number). In other words, a communicative test which
wishes to be criterion-referenced must define and delimit the criteria. This is a
major undertaking, which for CCSE has led to statements for each of the four
levels in each of the four skill areas of the tasks and text types which the
candidates are expected to handle as well as (crucially) the degree of skill with
which they will be expected to operate.
1.5 To reflect and encourage good classroom practice
Reference has already been made above to the educational effect of testing
through the promotion of positive washback into the classroom. In the case of
CCSE, this is a major concern underlying the design of the tests; indeed in many
ways the tests themselves have drawn on "good" classroom practice in an attempt
to disseminate this to other classrooms. This conscious feedback loop between
teaching and testing, in terms not only of content but also of approach, is a vital
mechanism for educational development.
It will be clear from the preceding section that there is a conscious and
deliberate rationale underlying the construction of the CCSE tests. However, a
rationale can be wrong and some of the bases on which the tests are constructed
within each skill area decisions have to be made about the contribution of
individual tasks involving specific text types to the overall performance, there will
still be scope for investigations of what makes up the underlying "l", "r", "s" and
"w" factors in the listening, reading, speaking and writing tests.
The main justification for the apparent complexity of the structure of CCSE is
being penalised for what they cannot do. This is a stance which reflects a
philosophical rather than an empirical starting point. Nonetheless, it is a
common experience of teachers that many students have differential abilities in
different language skill areas; it is important both practically and educationally
that this is recognised.
caused by the focus on the use of authentic tasks and texts. The first is the
question of consistency of level of the tasks/texts used both within a particular
test, and across tests in the same skill area in different series; the second is the
be undertaken to improve the reliability of the CCSE tests; but they would
conflict directly with the authenticity criterion. Once again, it seems that in test
design terms, the way that is chosen reflects a basic educational philosophy.
From one standpoint, reliability is crucial; authenticity can be brought in to the
extent that it is possible, but remains secondary. From another, the essential
characteristic is authenticity; while recognising that total authenticity of task can
receives due attention, but in the final analysis it is not the overriding factor in
the design of the test.
CCSE scheme; the relationship between the construct and the content of any
particular set of tests is open to empirical investigation - but unfortunately the
construct itself is probably not. Similarly, in terms of content, the relationship
between the specifications and the content of any particular set of papers can be
investigated; but the more fundamental question of how far the specifications
reflect the "real world" is not a matter of straightforward analysis. Concurrent
validity is largely irrelevant because there are few other tests available against
which CCSE can be measured. Face validity can be claimed - and claimed to be
extremely important - but not proved. It may seem that CCSE should be open
to investigation in terms of predictive validity since it is in essence making claims
about the ability of candidates to carry out "real world" tasks. If they pass the
test and can in fact carry out these tasks in the real world then the test may be
said to have "predicted" this.
However, a moment's thought will show that this is in fact an impossible
requirement. How would a researcher judge whether or not tasks had been
"carried out" in the real world? The only way would be by evaluating
performance on individual instances of the task - in which case all the problems
of specification, reliability and generalisability which arise for CCSE would arise
again. In a very real sense, the test would be validating itself against itself.
3. EPISTEMOLOGICAL TRADITIONS
In the preceding section we have seen how in three very basic respects, the
design of a communicative test is based on factors which go beyond, or are not
susceptible to, conventional language testing research: the overall design is
founded on educational rather than testing requirements; reliability is secondary
on construct grounds to authenticity; and the fundamental validity is not open to
straightforward investigation.
This situation seems to raise some rather interesting questions about the kind
of research which is appropriate to the investigation of language tests.
3.1 Language testing as pure science
Since the 1960's and the development of the "scientific" approach to testing,
the predominant model of research has been one based on the precise
and how performance on tests relates to the real world, one might be tempted to
echo the comments of a Nobel Prize winning economist about his own subject:
It seems to me that the reason for this sad state of affairs may well lie with
"Under normal conditions the research scientist is not an innovator but a solver
of puz2les, and the puzzles upon which hc (sic) concentrates arc just those which
he believes can be both stated and solved within the existing scientific tradition" 3
At its best, this existing scientific tradition has encouraged researchers into
what has become known as "McNamara's fallacy" of "making the measurable
The first is the recent development (or at least the recent dissemination) of
the ideas behind chaos theory. An exposition of this theory would be out of
place here 6 (and in detail beyond my present understanding of it). But I find a
set of ideas which leads to the insight that conventional science finds it impossible
to make a definitive statement about the length of a coastline (because of the
problems of scale; the larger the scale, the longer the length because the more
"detail" is included), or a firm prediction about the temperature of a cup of "hot"
coffee in a minute's time (because of the variability of convection), let alone
what the weather is going to be like next week (because of the cumulative effect
of a whole range of unpredictable and in themselves "trivial" events), an
extremely powerful heuristic in thinking about language and language testing.
Perhaps the key concept is "sensitive dependence upon initial conditions" - a way
of saying that in looking at the world, everything depends on precisely where you
start from. Nothing could be more appropriate for our field.
perhaps opens up new areas. How far does existing research into language
testing concern itself with the ethics of the test?
"Ethical validity has two aspects. Firstly is the research relevant to basic
human concerns? Is it committed to values that make a difference to the quality
of life for people now and in the future?...Secondly, do we behave morally while
doing and applying the research...And do we deploy the results in ways that
respect the rights and liberties of persons?" (Heron: 1982 p.1)
There are many questions raised here about the design and implementation
of language tests, but perhaps the most obvious area of ethical involvement is the
question of washback. It seems to me that a test which imposes (overtly or
covertly) an impoverished or unrealistic classroom regime on students preparing
for it is indeed "making a difference to the quality of life for people now and in
the future" (though a difference of the wrong sort). This reinforces my view 8
that an important area of investigation in considering the validity of a test is an
investigation of the classroom practice which it gives rise to.
A large part of Heron's paper is taken up with considering "Procedures for
focus on the acquisition and analysis of "hard" data, to include a concern with
those aspects of the nature of language itself which may not be susceptible to
such procedures, and with the effects which tests and testing may have on the
consumers. Testing and research which reflects this move will no longer be
concerned simply to measure; rather it will establish a framework which will
permit informed and consistent judgements to be made.
It would be unrealistic to claim that the CCSE scheme meets all the criteria
(implicit and explicit) set out in this last section. But it is perhaps reasonable to
claim that it illustrates the legitimacy of asking questions about tests other than
those which language testing researchers conventionally ask. The next step is to
find some answers.
NOTES
MATERIALS-BASED TESTS:
HOW WELL DO THEY WORK?
Michael Milanovic
INTRODUCTION
face and content validity. Materials-based tests arise out of trends in the
development of language teaching materials. In recent years the most dominant
generator of materials-bascd tests, in the British context at least has been the
hand, tend to be static. The range of item formats does not change
dramatically.
It is important to note that the distinctions made above are not clear cut.
An item format may be materials-based when it is first developed in that it
represents current trends in teaching methodology or views of the nature of
language competence. If it then becomes established, and continues to be used,
despite changes in methodology or views of language, it is no longer materials-based. Ideally, tests should be materials-based, psycholinguistically-based and
measurement-based concurrently. Only when this is the case, can we claim to
have reliable and valid tests.
Hamp-Lyons (1989) distinguishes between two types of language testing
research. The first is for the purposes of validating tests that will be
operationally used. The second, which she calls metatesting, she defines as
having its purpose in:
"... the investigation of how, why and when language is acquired or learne4
not acquired or not learned, the ways and contexts in which, and the purposes
for which, it is used and stored, and other such psycholinguistic questions".
ways to prepare for a test is to practice the items in the test. It is a well
established fact that the multiple-choice test format does not inspire innovative
methodology, and that it has had a largely negative effect on classrooms all over the
world. Unhappily, it is still widely considered the best testing has to offer
because it satisfies the need for measurement accountability and is economical
to administer and mark.
In test validation research the problem of relating testing materials to useful
The first principle, start from somewhere, suggests that the test constructor
get the best out of students, rather than the worst. Swain feels it is
important to try and make the testing experience less threatening and potentially
harmful. The fourth principle, work for washback, requires that test writers
should not forget that test content has a major effect on classroom practice and
that they should work towards making that effect as positive as possible. Clearly,
these four principles cannot be satisfied by using only indirect measures such as
multiple-choice items. We have to turn towards other item types.
It must be said that these are powerful features of the approach taken by
examining boards in Britain. Examinations are not perceived as the property of
boards alone. Ownership is distributed between the boards, methodologists and
teachers, all of whom accept responsibility for the effect that the examinations
have on the consumer - the students taking examinations - and the educational
process. Many examining boards in the United Kingdom try to reflect language
in use in many of the item types they use. This has been done in response to
pressure from teachers demanding an approach that reflects more closely recent
trends in methodology. The trend towards more realistic test items has not
always been backed up by the equally important need to validate such tests. The
combination of innovation and appropriate validation procedures is a challenge
yet to be fully faced.
Even so, the examples cited above show that parts of the testing world are
trying to move towards tests that look more valid and try to reflect both real life
content valid language tests, a disturbing lack of attention has been paid to
making such tests reliable, or establishing their construct validity. In the
following I will describe a project that attempted to produce a test battery that
was based, to some extent at least, on the real world needs of the test takers. It
took place in the British Council language teaching institute in Hong Kong.
The British Council language institute in Hong Kong is the largest of its
kind in the world. There are between 9,000 and 12,000 students registered in any
one term. In the region of 80% of the students are registered in what are loosely
called General English courses. In fact this term is misleading. Through a fairly
standard ESP type of investigation into the language needs of the students, it was
possible to show that two main categories of student were attending courses.
These were low to middle grade office workers, and skilled manual workers.
This meant that the courses could be designed with these two main categories in
mind. A much smaller third category was also identified, though this overlapped
heavily with the first two. This category was students learning English for varied
performance descriptions were generated. These formed the basis for test
specifications and the generation of teaching materials.
TEST CONTENT
to say that each item in the course needs to be tested. Unfortunately, in the
minds of many teachers and students a test needs to cover all aspects of a course
to be valid or fair. If the test is a discrete-point grammar test, testing a discrete-point grammar course, then this may be possible, if not desirable (Carroll, 1961).
In almost any other context it is simply not possible to test all that has been
taught in the time available for testing. In deciding test content the following
points need to be considered:
iii
The item types that appear in a test must be familiar to both teachers
and students.
All too often operationally used tests do not resemble teaching materials in
style and format. If teaching a language aims to prepare learners for real-world
The argument above raises the question of whether test items should be
task-based or discrete-point. As teaching becomes more whole-task-based it is
inevitable that test items must follow. However, this causes two sets of problems
from a testing point of view. Firstly, how is the tester to sample effectively from
all the task-based activities and to what extent are the results obtained
generalizable? These problems have been discussed at length over the years but
no satisfactory solution has been reached.
Secondly, in real life, a task is generally either successfully completed or
not. In class, the teacher can focus on any aspect of the task in order to improve
student performance. In the testing context, however, the task may provide only
one mark if treated as a unity, as long as an overall criterion for success can be
defined - and whether this is possible is a moot point. Such a task may take
several minutes or longer to complete. If the test in which it resides is to be used
(A sample telephone message form, with spaces headed Attention, Tel. No. and Message, is reproduced here.)
Clearly, for the task to have been successfully completed all the relevant
information needs to be present. Unfortunately this is rarely the case - mistakes
are made, information is missing. It would be difficult to score such an item
dichotomously and achieve a reasonable distribution of scores or provide enough
information for effective test validation.
The views of both students and teachers are important in test construction.
It is difficult to involve students in test construction, but it is of great importance
that their views are sought after pre-testing or test administration in order that
understood. For any approach to testing to succeed, therefore, three factors are
of vital importance:
i    Teachers must gain some familiarity with the principles and practice
of language testing. This is perhaps best achieved through some form of basic
training course;
ii   Teachers must be involved in the process of test design, item format
selection, and the writing of test items;
iii  Teachers must be familiar with the life cycle of a test and aware of the
fact that good test construction cannot be haphazard.
Council institute in Hong Kong, there were more than one hundred teachers
employed at any one time and so training and involvement had to take place by
degrees. However, it was anticipated that the credibility of the tests and the
process of consultation would be better accepted when those who were actually
involved in working on the tests mixed with teachers who were not involved. The
more teachers could be made to feel a personal commitment to the tests, the
more people there were who would be available to explain and defend them as
necessary. The image of the test constructor in the ivory tower having no contact
with the teaching body had to be dispelled as fully as possible. Thus it was that
there were generally between four and six teachers involved in test construction
in any one term.
A MATERIALS-BASED TEST
One of the tests in the battery developed in Hong Kong will now be
described in order to illustrate some of the points made earlier. The A3
Progress test, like all the others, is divided into four basic parts. A3 level
students have a fairly low standard of English; therefore the test tasks they have
to perform are of a rather basic kind. Every attempt was made, however, to
keep these tasks realistic and relevant.
The Listening Test, a copy of which appears in appendix I, comprises three
item types. The first simulates a typical telephone situation that the students are
likely to encounter, the second a face to face exchange at a hotel reception desk,
`-f1, 126
1 iJtl
and the third a face to face exchange between a travel agency clerk and a tourist
booking a day tour. The skills tested are listed below:
Taking telephone messages
This involves:
two skills, since the students' competence in both was likely to affect
tasks. The tape recordings were made in studio conditions and various sound
effects incorporated to make them more realistic.
The Grammar Test caused some concern. It was decided that the tests
should include a section on grammar, or perhaps more appropriately, accuracy.
The communicative approach has been much criticized by teachers and students
for its perceived lack of concern for the formal features of language. In the
Hong Kong context, it was very important to the students that there should be
something called grammar in the tests. From the theoretical point of view, it was
also felt that emphasis should be placed on more formal features of language.
How they should be tested was the difficult question. If standard discrete-point
multiple-choice items were used, the washback effect on the classroom would
have been negative in the sense that the multiple-choice approach to grammar
teaching was not a feature of the teaching method in the British Council. It was
also thought better to use an item type which was text-based as opposed to
sentence-based. To this end a variation on the cloze procedure was developed
for use in the lower level progress tests. It was given the name 'banked cloze'
because, above each text, there was a bank of words, normally two or three more
than there were spaces in the text. Students chose a word from the bank to
match one of the spaces. Each text was based on some authentic text-type
relevant to and within the experience of the students. These were:
An article from Student News.
A newspaper article.
A description of an office layout.
A letter to a friend.
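A minimal sketch of how a banked cloze item of this kind might be assembled is given below; the passage, deletion rate and distractors are invented for illustration, and the actual items were of course written and moderated by teachers rather than generated automatically.

# Illustrative banked cloze generator (assumed passage and distractors).
import random

def banked_cloze(text: str, nth: int = 5, distractors=("quickly", "green", "although")):
    words = text.split()
    gapped, bank = [], []
    for i, w in enumerate(words, start=1):
        if i % nth == 0:          # delete every nth word
            bank.append(w)
            gapped.append("_____")
        else:
            gapped.append(w)
    bank = list(bank) + list(distractors)   # two or three more words than gaps
    random.shuffle(bank)
    return " ".join(gapped), bank

passage = ("Dear Mei, thank you for your letter. I started my new job at the "
           "travel agency last week and I like the work very much.")
gapped_text, word_bank = banked_cloze(passage)
print("Bank:", ", ".join(word_bank))
print(gapped_text)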
It should be pointed out that the same format was not used at higher levels.
not very realistic. However, it was a teaching device commonly used in the
institute, and thus familiar to the students. Furthermore, it focused attention on
the sociolinguistic aspects of language and allowed for a degree of controlled
creativity on the part of the student. The marking was carried out on two levels.
If the response was inappropriate it received no marks, regardless of accuracy.
If it was appropriate, then the marks were scaled according to accuracy. Only a
response that was both appropriate and wholly accurate could receive full marks.
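The two-level marking scheme can be summarised in a short sketch; the numerical scale below is invented for illustration, and the appropriacy and accuracy judgements would of course come from a trained rater rather than from code.

# Two-level marking: appropriacy first, then accuracy (illustrative scale).
def score(appropriate: bool, accuracy: float, full_marks: int = 4) -> float:
    """accuracy is the rater's judgement on a 0.0-1.0 scale."""
    if not appropriate:
        return 0.0                      # inappropriate responses score nothing
    return round(full_marks * accuracy, 1)

print(score(appropriate=False, accuracy=1.0))   # 0.0 - accurate but inappropriate
print(score(appropriate=True, accuracy=0.5))    # 2.0 - appropriate, partly accurate
print(score(appropriate=True, accuracy=1.0))    # 4.0 - full marks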
The types of functional responses that the students were expected to make
are listed below:
giving directions;
asking about well being;
offering a drink;
asking for preference;
asking about type of work/job;
asking about starting time;
asking about finishing time;
giving information about own job;
giving information about week-end activities.
Reading and Writing were the final two skills areas in this test. An attempt
was made here to integrate the activity as much as possible, and to base the task
on realistic texts. Students were asked to fill in a visa application form using a
important as the material. They were also able to see that students almost
enjoyed this sort of activity, and immediately understood its relevance to their
day-to-day lives. Informal feedback from teachers, after the introduction of the
test, indicated that it had encouraged a greater focus on the use of authentic
materials and realistic tasks in the classroom. It seemed that positive washback
was being achieved.
context. With regard to involving teachers and integrating testing into the school
environment, there is also very little guidance available. Alderson and Waters
teachers in the stages of this treatment, that some degree of training and
sensitization was achieved. Listed below are the six stages of test preparation.
believe they are appropriate to many situations where teaching and testing
interact.
Stage 1
to discuss any ideas that the teachers may have, to take into account any
feedback regarding the tests already operating and decide on a topic area that
each teacher could focus on in order to prepare items for the next meeting.
Stage 2
The teachers write first draft items in light of Stage 1 discussions, their
experience of the materials and students, the course outlines and performance
objectives.
Stage 3
Stage 4
administrators alike that test construction can be accomplished quickly and that
the product will still be quite acceptable. Unfortunately, due to a number of
factors such as the unpredictability of the students, the shortsightedness of the
test writer, the lack of clarity in instructions, this is rarely the case. Initial
moderation helps to make teachers aware of some of the difficulties; trialling
informally with their own classes is an invaluable addition to this sensitization
of
process. Moreover, teachers have the opportunity of observing the reactions
which
they
attempt
to
do
them.
Both
of
students to the items and the way in
these factors are very important in the construction of task-based tests that
attempt to have a positive washback effect on the classroom.
Stage 5
After initial trialling, the moderation team meets again, and in light of the
experience gained so far prepares a pretest version of a test or part of a test.
Stage 6
The moderation team meets to discuss the results of the pretest and decide
on the final form of the test items.
Any test item generally takes at least six months from inception to
the greatest of care because the test results have a very real influence on the
students in question. They are able to bear witness to the fact that no test can be
produced without due care and attention. To begin with, most of them believe
the approach to be unnecessarily long drawn out and tedious, but as they work
on items and become fully aware of the fallibility of tests and test constructors,
their attitudes change.
I made the claim earlier that materials-based tests need to function at least
as well as measurement-based tests, from a statistical point of view. Even if the
same degree of economy of marking cannot be achieved, this is outweighed, in
an institutional context, by the considerable educational benefits.
Some basic test statistics for five progress tests from the battery in question
are presented below. Each test was analyzed in two ways. Firstly, it was treated
as a unity, in the sense that none of the sections were analyzed separately. This
means that the mean, standard deviation, reliability and standard error of
measurement were established for the whole test. Then each section was treated
as a separate test. This meant that there were four separate analyses of
Listening, Grammar, Appropriacy, and Reading and Writing.
Table 1

Basic statistics for the five progress tests, each analyzed first as a whole test
(WT) and then section by section: Listening (LIS), Grammar (GRM), Appropriacy
(APP) and Reading and Writing (RD/WT). For each analysis the table reports
the mean score (X), the standard deviation (SD), the Kuder-Richardson 20
reliability coefficient (KR20), the number of items in the test or subtest (NQ)
and the number of students in the sample (NS).
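For readers unfamiliar with the statistics in Table 1, the following sketch shows how KR20 and the standard error of measurement can be computed from a matrix of dichotomously scored items; the small score matrix is invented for illustration and is not data from the battery.

# KR20 and SEM from dichotomous item scores (illustrative data only).
from statistics import mean, pstdev
from math import sqrt

def kr20(scores):
    """Kuder-Richardson 20 reliability for dichotomous items.

    scores: list of per-student lists, one 0/1 entry per item."""
    k = len(scores[0])                      # number of items (NQ)
    totals = [sum(row) for row in scores]   # each student's total score
    var_total = pstdev(totals) ** 2
    item_vars = 0.0
    for j in range(k):
        p = mean(row[j] for row in scores)  # proportion correct on item j
        item_vars += p * (1 - p)
    return (k / (k - 1)) * (1 - item_vars / var_total)

def sem(scores):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    totals = [sum(row) for row in scores]
    return pstdev(totals) * sqrt(1 - kr20(scores))

data = [[1, 1, 0, 1, 1], [1, 0, 0, 1, 0], [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0], [1, 1, 0, 0, 1], [0, 1, 1, 1, 1]]
print(f"KR20 = {kr20(data):.2f}, SEM = {sem(data):.2f}")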
It is clear from these figures that the tests are very reliable. The reasons for
this are as follows:
i.   much time and effort was put into planning and moderation;
ii.
iii. teachers were involved in the process of test writing from the earliest
stages;
iv.
this context, the most straightforward of these is to attempt to show that the
In the case of the tests in the battery described here this was done by
computing students' scores on subtest tasks and then treating these tasks as
mini-tests in their own right. If the tasks grouped together according to the skills
they were said to be testing, then this would provide evidence that performance
could be accounted for by different underlying skills. A factor analysis for the
A3 test is illustrated in Table 2.
Table 2

Factor loadings of the A3 subtest tasks on four factors. The Reading and Writing
tasks, together with two of the Appropriacy tasks, load principally on one factor;
the Listening tasks load on a second, and the Grammar tasks (with a further
Appropriacy task) on a third.
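As an illustration of the kind of analysis summarised in Table 2, the sketch below uses an unrotated principal-components solution as a simple stand-in for the factor analysis actually carried out, with randomly generated scores in place of real data; the task names and latent structure are invented.

# Stand-in for the reported factor analysis: PCA of simulated task scores.
import numpy as np

rng = np.random.default_rng(0)
tasks = ["RdWrt1", "RdWrt2", "Listen1", "Listen2", "Gram1", "Gram2"]

# Three latent 'skills' plus noise, so that related tasks correlate.
latent = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.7, size=(200, 6))
scores = np.column_stack([latent[:, 0], latent[:, 0],
                          latent[:, 1], latent[:, 1],
                          latent[:, 2], latent[:, 2]]) + noise

# Standardise, take the correlation matrix and decompose it.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order] * np.sqrt(eigvals[order])

for name, row in zip(tasks, loadings[:, :3]):    # first three components
    print(f"{name:8s} " + "  ".join(f"{v:+.2f}" for v in row))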
Interestingly, at this fairly low level of proficiency, it is clear that
CONCLUSION
The results and procedures described here show that materials-based tests
can work. In an educational context, where possible, such tests should be used in
Canale (1985), amongst others, has pointed out that there is often a
mismatch between teaching/learning materials and those that appear in
Canale also focuses on the type of situation that current achievement testing
often represents:
"... it is frequently a crude, contrived, confusing threatening, and above all
intrusive event that replaces what many learners (and teachers) find to be
more rewarding and constructive opportunities for lecming and use".
The problems that Canale outlines, which are also of concern to Swain
(1985), are major difficulties in the acceptability of testing as an important and
useful part of the educational process. Several strategies can be adopted to
overcome these problems.
Secondly, the materials used in tests should always reflect the types of
activities that go on in the classroom and/or the lives of the students taking the
test. In this way both teachers and students will have a better chance of seeing
the relevance of tests.

Thirdly, teachers' sometimes inadequate understanding of testing purposes,
procedures and principles is often a major barrier to the successful integration
of testing into the curriculum. In order to overcome this problem, teachers need
to be actively encouraged to get involved in test writing projects, and there needs
to be a heavy emphasis on their training. Such a strategy not only improves the
quality of tests, in terms of reliability and validity as illustrated earlier, but also
means that more teachers will become familiar with testing as a discipline that is
integrated into the education process and not apart from it.
BIBLIOGRAPHY

Alderson, J C. 1983. The cloze procedure and proficiency in English as a foreign language. In Oller, J W (ed), 1983.

Alderson, J C and A Waters. 1983. A course in testing and evaluation for ESP teachers, or 'How bad were my tests?' In A Waters (ed), 1983.

Bachman, L F, et al. 1988. Task and ability analysis as a basis for examining content and construct comparability in two EFL proficiency test batteries. Language Testing, 5, 2, pp. 128-159.

Cziko, G.

Hamp-Lyons, Liz. 1989a. Applying the partial credit method of Rasch analysis: language testing and accountability. Language Testing, 6, 1, pp. 109-118.

Milanovic, M.

Morrow, K.
INTRODUCTION
In recent years, there has been a move towards the wider use of criterion-referenced (CR) methods of assessing second language ability which allow
learners' language performance to be described and judged in relation to defined
behavioural criteria. This is in line with the concern among language testers to
provide meaningful information about what testees are able to do with the
language rather than merely providing test scores. However, while criterion-referencing has enabled language testers to be more explicit about what is being
assessed, it has also raised a number of problems.

This paper aims to illustrate and discuss the nature of these problems.
CRITERION-REFERENCING
when measured in this manner is the behaviour which defines each point
along the achievement continuum. The term 'criterion', when used this way,
does not necessarily refer to final end-of-course behaviour. Criterion levels
This early definition of CRA highlights several key elements which are
reflected in various kinds of language assessment instruments: first, proficiency
This raises the issue of whether assessment criteria should take as their
reference point what learners do, what linguists and teachers think learners do
or what native speakers do. This point will be taken up later.
According to some authors, however, the differences between norm-referenced assessment and CRA are not as great as conventionally
imagined. Rowntree (1987: 185-6), for example, notes that criterion levels are
frequently established by using population norms:
(Brown 1981: 7). If the latter view is accepted, then it would be possible to
imagine situations in which CRA assessment did not concern itself with elements
of learners' communicative performance (eg. if the syllabus were grammatically-
assessment. However, in the case of second language learners who have to use
the language in society on a daily basis there are clearly arguments for
accentuating methods of CRA which allow them to gain feedback on their ability
to perform real-life tasks (see Brindley 1989: 91-120 for examples).
DEFINING CRITERIA
The easiest way to define criteria and descriptors for language assessment is
to use those already in existence. There is no shortage of models and examples.
For proficiency testing, literally thousands of rating scales, band scales and
performance descriptors are used throughout the world. An equivalent number
of skills taxonomies, competency checklists, objectives grids etc, are available for
classroom use.
Like tests, some proficiency scales seem to have acquired popular validation
by virtue of their longevity and extracts from them regularly appear in other
scales. The original scale used in conjunction with the Foreign Service Institute
Oral Interview (FSI 1968), in particular, seems to have served as a source of
inspiration for a wide range of other instruments with a similar purpose but not
necessarily with a similar target group. Both the Australian Second Language
Proficiency Rating Scale (ASLPR) (Ingram 1984) and the ACTFL Proficiency
Guidelines (Hiple 1987) which aim to describe in the first case the proficiency of
Byrnes (1987), for example, claims that the ACTFL/ETS scale is built on a
"hierarchy of task universals" .
Apart from their lack of empirical underpinning, the validity of rating scale
descriptors (in particular the ACTFL/ETS Oral Proficiency Interview) has been
the incremental and lockstep nature of level descriptions fails to take into
account the well documented variability and "backsliding" which occur in
have shown systematic variability according to the learner's psychosociological orientation (Meisel et al. 1981); emotional investment in the
topic (Eisenstein and Starbuck 1989); the discourse demands of the task
(Brown and Yule 1989); desired degree of social convergence/divergence
(Rampton 1987); planning time available (Ellis 1987); and ethnicity and
status of interlocutor (Beebe 1983)
faced with issues like: is 'some' more than 'a few' but fewer than 'several' or
'considerable' or 'many'? How many is 'many'?
the descriptions are highly context dependent and thus do not permit
generalisation about underlying ability (Bachman and Savignon 1986;
Skehan 1989). Methods such as the oral interview confuse trait and method
(Bachman 1988).
in the absence of concrete upper and lower reference points, criterion-referencing is not possible. Bachman (1989: 17) points out that criterion-referencing requires the definition of the end points of an absolute scale of
ability (so-called "zero" and "perfect" proficiency). Yet in practice, no-one
has zero proficiency, since some language abilities are universal. Similarly,
native speakers vary widely in ability, which makes the "perfect speaker" an
equally tenuous concept.
Clearly the validity of the criteria on which proficiency descriptions are built
is by no means universally accepted. However, the controversy surrounding the
construct validity of proficiency rating scales and performance descriptors is
merely a manifestation of the fundamental question that CRA has to face: how
behaviour which forms the basis of the domain. As such, they are open to
question on the same grounds as the proficiency descriptions described above.
In addition, some testers would claim that performance testing associated with
Bachman (1989) argues that the only way to develop adequate CR procedures for
assessing proficiency is to define levels which are independent of particular contexts, 'in terms of the relative
presence or absence of the abilities that constitute the domain' rather than 'in
terms of actual individuals or actual performance' (Bachman 1989: 256). An
example of such a scale is given below.
Cohesion                     Vocabulary
No cohesion                  1 Small vocabulary
                               (... vocabulary limitations)
                             2 Vocabulary of moderate size
                             3 Large vocabulary
                               (Seldom misses or searches for words)
                             4 Extensive vocabulary
However such scales, too, are clearly fraught with problems as Bachman
and Savignon (1986: 388) recognize when they admit the difficulty of 'specifying
the degree of control and range in terms that are specific enough to distinguish
levels clearly and for raters to interpret consistently'. The sample scales, in fact,
manifest many of the same problems which arise in the design of more
conventional proficiency rating scales. The terminology used is very imprecise
and relativistic ('limited'; 'frequently'; 'confusing' etc) and in the absence of
precise examples of learners' language use at each of the levels, problems of
rater agreement would inevitably arise. In fact, since the levels do not specify
particular contexts, structure, functions and so on, raters would not have any
concrete criteria to guide them. The difficulties of reaching agreement between
raters would, consequently, be likely to be even more acute.
Consult expert judges
A common approach is to ask expert judges to identify and sometimes to weight the key features of
learner performance which are to be assessed. Experienced teachers tend to be
the audience most frequently consulted in the development and refining of
criteria and performance descriptions (eg. Westaway 1988; Alderson 1989;
Griffin 1989). In some cases they may be asked to generate the descriptors
themselves by describing key indicators of performance at different levels of
proficiency. In others, test developers may solicit comments and suggestions
from teachers for modification of existing descriptors on the basis of their
knowledge and experience.
In ESP testing, test users may also be surveyed in order to establish patterns of
language usage and difficulty, including the relative importance of language tasks
and skills. The survey results then serve as a basis for test specifications. This
procedure has been followed in the development of tests of English for academic
purposes by, inter alia, Powers (1986), Hughes (1988) and Weir (1983, 1988).
Problems
Who are the experts?
The idea of using "expert judgement" appeals to logic and common sense.
However it poses the question of who the experts actually are. Conventionally it
is teachers who provide "expert" judgements, although increasingly other non-
teacher test users are being involved in test development. There are obvious
reasons, of course, for appealing to teacher judgements. They are not difficult to
obtain since teachers are on hand, they are familiar with learners' needs and
problems, they are able to analyse language and they can usually be assumed to
be aware of the purposes and principles of language testing, even though they
may not always be sympathetic to it. Although less obviously "expert" in the
sense of being further removed from the language learning situation and less
familiar with linguistic terminology, test users who interact with the target group
(such as staff in tertiary institutions or employers) can similarly be presumed
likely to have some idea of the language demands which will be made on the
testee and thus to be able to provide usable information for test developers.
But in addition to teachers and test users, it could also be argued that
testees/learners themselves are "experts" on matters relating to their own
language use and that their perceptions should also be considered in drawing up
Native speakers are not language analysts. Nor are most learners. It is
hardly surprising, therefore, that the test users' perceptions of language needs
tend to be stated in rather vague terms. This is exemplified by an examination
by Brindley, Neeson and Woods (1989) of the language-related comments of 63
university supervisors' monitoring reports on the progress of foreign students.
They found that the vast majority of the comments were of the general kind
("has problems with writing English"; "English expression not good"), though a
few lecturers were able to identify particular goal-related skills ("has difficulty
following lecturers-speak very fast").
which a test item was testing a particular skill and the level of difficulty
represented by the item (agreement would constitute evidence for content
validity).
Studies aimed at investigating how expert judgements are made, however, cast
some doubt on the ability of expert judges to agree on any of these issues. Alderson
(1988), for example, in an examination of item content in EFL reading tests, found
that judges were unable to agree not only on what particular items were testing but
also on the level of difficulty of items or skills and the assignment of these to a
particular level. Devenney (1989), who investigated the evaluative judgements of
ESL teachers and students of ESL compositions, found both within-group and
between-group differences in the criteria which were used. He comments:

Implicit in the notion of interpretive communities are these assumptions: (1)
a clear set of shared evaluative criteria exists, and (2) it will be used by
members of the interpretive community to respond to text. Yet this did not
prove to be the case for either ESL teachers or students
claims which can be made about the capacity of language tests to predict
communicative ability (in the broader sense) in real-life settings. Second, if real-
trying to build this notion into assessment criteria. In this regard, the use of
"task fulfilment" as a criterion in the IELTS writing assessment scales is a
promising step in this direction (Westaway 1988).
off a final-s
Their assessments were largely global, the language abstract and rarely
substantiated by reference to anything concrete:
she did and something for the effort (or lack of effort) made in the
preparation, although neither is mentioned in the guidelines.
perhaps because they are perceived as part of their educator's role. Specific
assessment criteria may be developed rigorously and clearly spelled out, yet the
teachers appear to be operating with their own constructs and applying their own
criteria in spite of (or in addition to) those which they are given. This tendency
may be quite widespread and seems to be acknowledged by Clark and Grognet
(1985: 103) in the following comment on the external validity of the Basic English
Skills Test for non-English-speaking refugees in the USA:
On the assumption that the proficiency-rating criterion is probably somewhat
unreliable in its own right, as well as based to some extent on factors not
directly associated with language proficiency per se (for example, student
personality, diligence in completing assignments etc) even higher validity
coefficients might be shown using external criteria more directly and accurately
reflecting language proficiency
Further support for the contention that teachers operate with their own criteria
He comments that
raters seem to be influenced by their teaching background and the nature of
the criteria used can differ from rater to rater. Consensus moderation
procedures appear to have controlled this effect to some degree but not
completely.
CONCLUSION
may not reflect what is known about the nature of language learning and use and
they may not be consistently interpreted and applied even by expert judges.
If the ideal of CRA is to be attained, it is necessary to develop criteria and
descriptors which not only reflect current theories of language learning and
language use but which also attempt to embody multiple perspectives on
communicative ability. As far as the first of these requirements is concerned,
Bachman and his colleagues have put forward a research agenda to develop
operational definitions of constructs in Bachman's model of communicative
language proficiency and validate these through an extensive program of test
development and research (see, for example, Bachman and Clark 1987;
Bachman et al 1988; Bachman 1990). One of the main virtues of this model, as
Skehan (1990) points out, is that it provides a framework within which language
testing research can be organised. It is to be hoped that the model will enable
language testers to systematically investigate the components of language ability
as manifested in tests and that the results of such research will be used to inform
the specifications on which assessment instruments are based.
Second language acquisition (SLA) research can also make a contribution
to the development of empirically-derived criteria for language assessment which
reflect the inherent variability and intersubjectivity of language use. First,
research into task variability of the type reported in Tarone (1989), Tarone and
Yule (1989) and Gass et al (1989a; 1989b) provides valuable insights into the
role that variables such as interlocutor, topic, social status and discourse domain
might exercise on proficiency. Investigation of factors affecting task difficulty
might also provide a more principled basis for assigning tasks to levels, a major
al 1980; Ludwig 1982; Eisenstein 1983) and error gravity (eg James 1977;
Chastain 1980; Davies 1983). However such studies have tended to examine the
in the creation of performance criteria which reflect those used in real life.
Information of this kind is of critical importance since in many cases, it is the
judgements of native speakers that will determine the future of language
learners, not so much those of teachers. At the same time, it is important to try
to establish to what extent non-linguistic factors such as personality, social status,
ethnicity, gender etc affect judgements of proficiency and the extent to which
these factors can be related to linguistic ones (Clark and Lett 1987).
basis for establishing assessment criteria which are consistent with the
sessions in which raters compare and justify their ratings, as Griffin (1989) reports. But there is always the
possibility that agreement might conceal fundamental differences. As Barnwell
(1985) comments:
raters who agree on the level at which a candidate can he placed may offer
very different reasons for their decisions
Given, as we have seen, that different judges may operate with their own
personalized constructs irrespective of the criteria they are given, it would be a
mistake to assume that high inter-rater reliability constitutes evidence of the
construct validity of the scales or performance descriptors that are used. At the
same time, studies requiring teachers, learners and native speakers to
externalize the criteria they (perhaps unconsciously) use to judge language
ability would help to throw some light on how judgements are actually made by a
learning process. In the field of general education, the results of research into
ACKNOWLEDGEMENT
I would like to thank Charles Alderson for his helpful comments on an earlier
version of this paper.
REFERENCES
Alderson, J C. 1989. Bands and scores. Paper presented at IATEFL Language Testing Symposium, Bournemouth, 17-19 November.

Caulley, D., Orton, I., and L Claydon. 1988. Evaluation of English oral CAT. Melbourne: La Trobe University.

Chastain, K. 1980. Native speaker reaction to instructor-identified student second language errors. Modern Language Journal, 64, pp. 210-215.

Devenney, R. 1989. How ESL teachers and peers evaluate and respond to student writing.

Dickinson, L. 1987.

Douglas, D and L Selinker. 1985. Principles for language tests within the 'discourse domains' theory of interlanguage: research, test construction and interpretation. Language Testing, 2, 2, pp. 205-226.

Gass, S., C Madden, D Preston and L Selinker (Eds). 1989a. Variation in Second Language Acquisition: Discourse and Pragmatics. Clevedon, Avon: Multilingual Matters.

Gass, S., C Madden, D Preston and L Selinker (Eds). 1989b. Variation in Second Language Acquisition: Psycholinguistic Issues. Clevedon, Avon: Multilingual Matters.

Glaser, R., and D J Klaus. 1962. Assessing human performance. In R Gagne (Ed.), Psychological Principles in Systems Development. New York: Holt, Rinehart and Winston.

James, C V.

Jones, R L. 1985. Some basic considerations in testing oral proficiency. In New

McNamara, T F.

Meisel, J, H Clahsen and M Pienemann. 1981. On determining developmental stages in natural second language acquisition. Studies in Second Language Acquisition, 3, pp. 109-135.

Messick, S. 1989. Meaning and values in test validation: the science and ethics of assessment. Educational Researcher, 18, 2, pp. 5-11.

Pienemann, M., M Johnston and G Brindley. 1988. Constructing an acquisition-based assessment procedure. Studies in Second Language Acquisition, 10, 2, pp. 217-243.

Raffaldini, T. 1988. The use of situation tests as measures of communicative ability. Studies in Second Language Acquisition, 10, 2, pp. 197-216.

Skehan, P. 1988. State of the art: language testing. Part 1. Language Teaching, 21, 2, pp. 211-221.

Tarone, E and G Yule. 1989. Focus on the Language Learner. Oxford: Oxford University Press.

van Lier, L. 1988. Reeling, writhing, fainting and stretching in coils: oral proficiency interviews as conversation. TESOL Quarterly, pp. 489-508.

Westaway, G.

Zuengler, J.
INTRODUCTION
The last decade has seen increasing use of Item Response theory in the
examination of the qualities of language tests. Although it has sometimes been
seen exclusively as a tool for improved investigation of the reliability of tests
trialled materials for it. There are four sub-tests, one each for Speaking,
Listening, Reading and Writing. The format of the new test is described in
McNamara (1989b). The validation of the Speaking and Writing sub-tests is
discussed in McNamara (1990).
has focused on the advantages of IRT over classical theory in investigating the
reliability of tests (eg Henning, 1984). More significant is the use of IRT to examine
aspects of the validity, in particular the construct validity, of tests.
de Jong and Glas (1987) examined the construct validity of tests of foreign
language listening comprehension by comparing the performance of native and
non-native speakers on the tests. It was hypothesized in this work that native
speakers would have a greater chance of scoring right answers on items: this was
largely borne out by the data. It was further hypothesized that items identified in the analysis as
showing 'misfit' would not behave in the same way in relation to native
speaker performance as items not showing misfit (that is, on 'misfitting' items
native speaker performance would show greater overlap with the performance of
non-native speakers); this was also confirmed. The researchers conclude (de
Jong and Glas, 1987: 191):
The ability to evaluate a given fragment of discourse in order to understand
language and should not be discounted. The point is that interpretation of the
results of IRT analysis must be informed by an in principle understanding of the
relevant constructs.
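A rough illustration of this kind of check is sketched below. Everything in it is invented for the example (response matrices, group sizes, and which items are flagged as misfitting); it simply shows how one might compare item facility values for native and non-native samples and ask whether the native-speaker advantage shrinks on misfitting items.

```python
import numpy as np

rng = np.random.default_rng(5)
n_items = 40
misfit = np.zeros(n_items, dtype=bool)                   # items flagged by an IRT analysis
misfit[rng.choice(n_items, 5, replace=False)] = True     # (invented flags)

def facility(p_base, advantage):
    """Probability of a correct response, clipped to a sensible range."""
    return np.clip(p_base + advantage, 0.05, 0.95)

p_item = rng.uniform(0.3, 0.8, size=n_items)
# In this simulation native speakers have a clear advantage on well-fitting items
# but almost none on misfitting ones.
advantage = np.where(misfit, 0.02, 0.20)
native = (rng.random((200, n_items)) < facility(p_item, advantage)).astype(int)
non_native = (rng.random((400, n_items)) < facility(p_item, 0.0)).astype(int)

gap = native.mean(axis=0) - non_native.mean(axis=0)      # facility difference per item
print(f"mean facility gap, well-fitting items: {gap[~misfit].mean():.2f}")
print(f"mean facility gap, misfitting items:   {gap[misfit].mean():.2f}")
```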
In the area of speaking, the use of IRT analysis in the development of the
Interview Test of English as a Second Language (ITESL) is reported in Adams,
Griffin and Martin, 1987; Griffin, Adams, Martin and Tomlinson, 1988. These
as vocabulary. One would have liked to see different kinds of items added
until the procedure showed that the limit of the unidimensionality criterion
had now been reached.
Nunan (1988: 56) is quite critical of the test's construct validity, particularly
in the light of current research in second language acquisition:
The major problem that I have with the test... [is] that it fails adequately to
reflect the realities and complexities of language development.
Griffin has responded to these criticisms (cf Griffin, 1988 and the discussion
in Nunan, 1988). However, more recently, Hamp-Lyons (1989) has added her
voice to the criticism of the ITESL. She summarizes her response to the study
by Adams, Griffin and Martin (1987) as follows (1989: 117):
..This study... is a backward step for both language testing and language
teaching.
She takes the writers to task for failing to characterize properly the
dimension of 'grammatical competence' which the study claims to have
validated; like Spolsky and Nunan, she finds the inclusion of some content areas
puzzling in such a test. She argues against the logic of the design of the research
project (1989: 115):
Their assumption that if the data fit the psychometric model they de facto
validate the model of separable grammatical competence is questionable. If
you construct a test to test a single dimension and then find that it does indeed
test a single dimension, how can you conclude that this dimension exists
independently of other language variables? The unidimensionality, if that is
really what it is, is an artifact of the test development.
...the limitations of the partial credit model, especially the question of the
unidimensionality assumption of the partial credit model, the conditions
under which that assumption can be said to be violated, and the significance
of this for the psycholinguistic questions they are investigating... They need to
note that the model is very robust to violations of unidimensionality.
She further (1989: 116) criticizes the developers of the ITESL for their
failure to consider the implications of the results of their test development
project for the classroom and the curriculum from which it grew.
However, when Adams, Griffin and Martin (1987: 25) refer to using
information derived from the test
in monitoring and developing profiles,
they may be claiming a greater role for the test in the curriculum. If so, this
requires justification on a quite different basis, as Hamp-Lyons is right to point
out. Again, a priori arguments about the proper relationship between testing and
teaching must accompany discussion of research findings based on IRT analysis.
A more important issue for this paper is Hamp-Lyons's argument about the
unidimensionality assumption. Here it seems that she may have misinterpreted
the claims of the model, which hypothesizes (but does not assume in the sense of
'take for granted' or 'require') a single dimension of ability and difficulty. Its
analysis of test data represents a test of this hypothesis in relation to the data.
The function of the fit t-statistics, a feature of IRT analysis, is to indicate the
probability of a particular pattern of responses (to an item or on the part of an
individual candidate). Where items or individuals are found in this way to be disconfirming the hypothesis, this may be
interpreted in a number of ways. In relation to items, it may indicate (1) that
the item is poorly constructed; (2) that if the item is well-constructed, it does
not form part of the same dimension as defined by other items in the test, and is
therefore measuring a different construct or trait. In relation to persons, it may
indicate (1) that the performance on a particular item was not indicative of the
candidate's ability in general, and may have been the result of irrelevant factors
such as fatigue, inattention, failure to take the test item seriously, factors which
Henning (1987: 96) groups under the heading of response validity; (2) that the
ability of the candidates involved cannot be measured appropriately by the test
instrument, that the pattern of responses cannot be explained in the same terms
as applied to other candidates, that is, there is a heterogeneous test population in
terms of the hypothesis under consideration; (3) that there may be surprising
gaps in the candidate's knowledge of the areas covered by the test; this
information can then be used for diagnostic and remedial purposes.
A further point to note is that the dimension so defined is a measurement
Placement Examination. There were 150 multiple choice items, thirty in each of
five sub-tests: Listening Comprehension, Reading Comprehension, Grammar
Accuracy, Vocabulary Recognition and Writing Error Detection. Relatively few
details of each sub-test are provided, although we might conclude that the first
two sub-tests focus on language use and the other three on language usage.
This assumes that inferencing is required to answer questions in the first two
(Listening and Reading) sub-tests. One might reasonably conclude that the
majority of test items may be used to construct a single continuum of ability and
difficulty. We must say 'the majority' because in fact the Rasch analysis does
identify a number of items as not contributing to the definition of a single
underlying continuum; unfortunately, no analysis is offered of these items, so we
are unable to conclude whether they fall into the category of poorly written items
or into the category of sound items which define some different kind of ability.
performance on the majority of items in the test, Henning et al. report two
other findings. First, factor analytic studies on previous versions of the test
showed that the test as a whole demonstrated a single factor solution. Secondly,
the application of a technique known as the Bejar technique for exploring the
1987b). Henning et al. nevertheless conclude that the fact that a single
dimension of ability and difficulty was defined by the Rasch analysis of their data
despite the apparent diversity of the language subskills included in the tests
shows that Rasch analysis is (Henning, Hudson and Turner, 1985: 152)
(Note again in passing that the analysis by this point in the study is examining a
rather different aspect of the possible inappropriateness or otherwise of IRT in
relation to language test data than that proposed earlier in the study, although
now closer to the usual grounds for dispute). The problem here, as Hamp-Lyons
is right to point out, is that what Henning et al. call 'robustness' and take to be a
virtue leads to conclusions which, looked at from another point of view, seem
worrying. That is, the unidimensional construct defined by the test analysis
seems in some sense to be at odds with the a priori construct validity, or at least
the face validity, of the test being analysed, and at the very least needs further
discussion. However, as has been shown above, the results of the IRT analysis in
the Henning study are ambiguous, the nature of the tests being analysed is not
clear, and the definition of a single construct is plausible on one reading of the
sub-tests' content. Clearly, as the results of the de Jong and Glas study show
(and whether or not we agree with their interpretation of those results), IRT
analysis is capable of defining different dimensions of ability within a test of a
single language sub-skill, and is not necessarily 'robust' in that sense at all, that
is, the sense that troubles Hamp-Lyons.
In a follow-up study, Henning (1988: 95) found that fit statistics for both
was confirmed by factor analysis.) However, it is not clear why fit statistics
should have been used in this study; the measurement model's primary claims
are about the estimates of person ability and item difficulty, and it is these
estimates which should form the basis of argumentation (cf the advice on this
point in relation to item estimates in Wright and Masters, 1982: 114-117).
In fact, the discussions of Hamp-Lyons and Henning are each marked by a
failure to distinguish two types of model: a measurement model and a model of
the various skills and abilities potentially underlying test performance. These are
not at all the same thing. The measurement model posited and tested by IRT
analysis deals with the question, 'Does it make sense in measurement terms to
sum scores on different parts of the test? Can all items be summed
meaningfully? Are all candidates being measured in the same terms?' This is the
'unidimensionality' assumption; the alternative position requires us to say that
of language abilities, that is, in the light of current models of the constructs
models such as IRT, and both kinds of analysis have the potential to illuminate
the nature of what is being measured in a particular language test.
example when they appear to overturn long- or dearly-held beliefs about the
nature of aspects of language proficiency. Also, without wishing to enter into
what Hamp-Lyons (1989) calls

the hoary issue of whether language competence is unitary or divisible,
functions and setting criteria for competence that are more or less easy to
meet.
Data from 196 candidates who took the Listening sub-test in August, 1987
were available for analysis using the Partial Credit Model (Wright and Masters,
1982) with the help of facilities provided by the Australian Council for Education
Research. The material used in the test had been trialled and subsequently
revised prior to its use in the full session of the OET. Part A of the test
consisted of short answer questions on a talk about communication between
different groups of health professionals in hospital settings. Part B of the test
involved a guided history taking in note form based on a recording of a
consultation between a doctor and a patient suffering headaches subsequent to a
serious car accident two years previously. Full details of the materials and the
trialling of the test can be found in McNamara (in preparation).
The analysis was used to answer the following question:
in the usual form of information about 'misfitting' items and persons. In the
second analysis, Part A and Part B were each treated as separate tests, and
estimates of item difficulty and person ability were made on the basis of each
There were a maximum of forty-nine score points from the thirty-two items.
Using data from Part A only, scores from five candidates who got perfect scores
or scores of zero were excluded, leaving data from 191 candidates. There were a
maximum of twenty-four score points from twelve items. Using data from Part
B only, scores of nineteen candidates with perfect scores were excluded, leaving
data from 177 candidates. There were a maximum of twenty-five score points
from twenty items. Table 1 gives summary statistics from each analysis. The
Test reliability of person separation (the proportion of the observed variance in
logit measurements of ability which is not due to measurement error; Wright
and Masters, 1982: 105-106), termed the 'Rasch analogue of the familiar KR20
index' by Pollitt and Hutchinson (1987: 82), is higher for the test as a whole than
for either of the two parts treated independently. The figure for the test as a
whole is satisfactory (.85).
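The person separation figure quoted here can be computed directly from the logit ability estimates and their standard errors. The sketch below is a minimal rendering of the Wright and Masters definition; the estimates and standard errors are simulated, not the OET data.

```python
import numpy as np

def person_separation_reliability(abilities, std_errors):
    """Proportion of the observed variance in the logit ability estimates that is
    not due to measurement error (cf Wright and Masters 1982: 105-106)."""
    abilities = np.asarray(abilities, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    observed_var = abilities.var(ddof=1)        # variance of the ability estimates
    error_var = np.mean(std_errors ** 2)        # mean square measurement error
    return (observed_var - error_var) / observed_var

# Illustrative values only: 194 candidates, SD of abilities about 1.33 logits.
rng = np.random.default_rng(2)
b = rng.normal(1.46, 1.33, size=194)            # ability estimates (logits)
se = rng.uniform(0.4, 0.6, size=194)            # standard errors of the estimates
print(f"person separation reliability = {person_separation_reliability(b, se):.2f}")
```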
Table 1

                                Parts A and B   Part A   Part B
Number of candidates                 194          191      177
Number of items                       32           12       20
Maximum score points                  49           24       25
Mean (raw scores)                   34.2         14.4     19.4
SD (raw scores)                      9.5          5.3      4.5
Mean (logits)                       1.46         0.86     1.67
SD (logits)                         1.33         1.44     1.25
RMSE of ability estimates            .48          .71      .75
Person separation
reliability (like KR-20)             .85          .74      .60
[Table: numbers of misfitting items and persons for Parts A and B combined, Part A, and Part B; two items (#7 and #12) are identified as misfitting.]
The analysis reveals that the number of misfitting items is low. The same is
true for misfitting persons, particularly for the test as a whole and Part A
considered independently. Pollitt and Hutchinson (1987: 82) point out that we
would normally expect around 2% of candidates to generate fit values above +2.
On this analysis, then, it seems that when the test data are treated as a single
test, the item and person fit statistics indicate that all the items except two
combine to define a single measurement dimension; and the overwhelming
majority of candidates can be measured meaningfully in terms of the dimension
the two parts of the sub-test treated separately should be independent of the
part of the test on which they are made. Two statistical tests were used for this
purpose.
The first test was used to investigate the research hypothesis of a perfect
correlation between the ability estimates arrived at separately by treating the
data from Part A of the test independently of the data from Part B of the test.
The correlation between the two sets of ability estimates was calculated,
corrected for attenuation by taking into account the observed reliability of the
two parts of the test (Part A: .74, Part B: .60 - cf Table 1 above). (The
procedure used and its justification are explained in Henning, 1987: 85-86.) Let
the ability estimate of Person n on Part A of the test be denoted by bnA, and the
ability estimate of Person n on Part B of the test by bnB. The correlation corrected
for attenuation is then given by

    Rxy = rxy / sqrt(rxx * ryy)

where rxy is the observed correlation between the two sets of ability estimates
and rxx and ryy are the observed reliabilities of the two parts of the test.
The correlation thus corrected for attenuation was found to be > 1, and
hence may be reported as 1. This test, then, enables us to reject the hypothesis
that there is not a perfect linear relationship between the ability estimates from
each part of the test, and thus offers support for the research hypothesis that the
true correlation is 1.
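A minimal sketch of this correction is given below. The ability estimates are simulated; the reliabilities .74 and .60 are the Part A and Part B figures from Table 1, and the formula is the standard correction for attenuation (the observed correlation divided by the square root of the product of the two reliabilities), as described by Henning (1987).

```python
import numpy as np

def corrected_correlation(b_part_a, b_part_b, rel_a, rel_b):
    """Correlation between two sets of ability estimates, corrected for attenuation
    due to the unreliability of each part: r' = r_AB / sqrt(rel_A * rel_B)."""
    r_ab = np.corrcoef(b_part_a, b_part_b)[0, 1]
    return r_ab / np.sqrt(rel_a * rel_b)

# Simulated estimates for 174 candidates with usable data on both parts.
rng = np.random.default_rng(3)
true_ability = rng.normal(0.0, 1.3, size=174)
b_a = true_ability + rng.normal(0.0, 0.7, size=174)          # Part A estimates
b_b = true_ability + 0.81 + rng.normal(0.0, 0.75, size=174)  # Part B, easier by .81 logits
print(f"corrected correlation = {corrected_correlation(b_a, b_b, 0.74, 0.60):.2f}")
# With data of this kind the corrected value can exceed 1, as reported above.
```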
The correlation test is only a test of the linearity of the relationship between
the estimates. As a more rigorous test of the equality of the ability estimates, a
chi-square test was done. Let the 'true' ability of person n be denoted by Bn. Then bnA
and bnB are estimates of Bn. It follows from maximum likelihood estimation
theory (Cramer, 1946) that, because bnA and bnB are maximum likelihood
estimators of Bn (in the case when both sets of estimates are centred about a
mean of zero),
    bnA ~ N(Bn, σ²nA)   and   bnB ~ N(Bn, σ²nB)

where σnA and σnB are the standard errors of the estimates of the ability of Person n
on Part A and Part B of the test respectively.
From Table 1, the mean logit score on Part B of the test is 1.67, while the
mean logit score on Part A of the test is .86. As the mean ability estimates for
the scores on each part of the test have thus not been set at zero (due to the fact
that items, not people, have been centred), allowance must be made for the
relative difficulty of each part of the test (Part B was considerably less difficult
than Part A). On average, then, bnB - bnA = .81. It follows that if the
hypothesis that the estimates of ability from the two parts of the test are identical
is true, then bnB - bnA - .81 = 0. It also follows from above that
    zn = (bnB - bnA - .81) / sqrt(σ²nB + σ²nA) ~ N(0, 1)
if the differences between the ability estimates (corrected for the relative
difficulty of the two parts of the test) are converted to z-scores, as in the above
formula. If the hypothesis under consideration is true, then the resulting set of
z-scores will have a unit normal distribution; a normal probability plot of these
z-scores can be done to confirm the assumption of normality. These z-scores for
each candidate are then squared to get a chi-square value for each candidate. In
order to evaluate the hypothesis under consideration for the entire set of scores,
then the test statistic is the sum of these squared z-scores, which is distributed as
chi-square with N - 1 degrees of freedom, where N = 174.
of fit statistics, the tests chosen are appropriate, as they depend on ability
estimates directly.
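The chi-square procedure just described might be sketched as follows. The ability estimates and standard errors below are simulated, the .81 shift is the adjustment for the relative difficulty of the two parts mentioned above, and the use of scipy's chi-square distribution is simply a convenient way of obtaining the probability.

```python
import numpy as np
from scipy import stats

def equality_chi_square(b_a, b_b, se_a, se_b, shift=0.81):
    """Test the hypothesis that the Part A and Part B estimates measure the same
    underlying ability: z_n = (b_nB - b_nA - shift) / sqrt(se_nB^2 + se_nA^2),
    and the sum of the squared z-scores is referred to a chi-square distribution."""
    z = (b_b - b_a - shift) / np.sqrt(se_b ** 2 + se_a ** 2)
    chi_sq = float(np.sum(z ** 2))
    df = len(z) - 1
    return chi_sq, df, stats.chi2.sf(chi_sq, df)

# Simulated data for 174 candidates.
rng = np.random.default_rng(4)
true_b = rng.normal(0.0, 1.3, size=174)
se_a = rng.uniform(0.6, 0.8, size=174)
se_b = rng.uniform(0.6, 0.8, size=174)
b_a = true_b + rng.normal(0.0, se_a)
b_b = true_b + 0.81 + rng.normal(0.0, se_b)
chi_sq, df, p = equality_chi_square(b_a, b_b, se_a, se_b)
print(f"chi-square = {chi_sq:.1f} on {df} df, p = {p:.3f}")
```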
[Figure 1: map of item difficulty estimates for the thirty-two Listening items, plotted on a logit scale running from about -3.0 to +5.0.]
Figure 1 reveals that the two Parts of the test occupy different areas of the
map, with some overlap. For example, of the eight most difficult items, seven
are from Part A of the test (Part A contains twelve items); conversely, of the
eight easiest items, seven are from Part B of the test (Part B has twenty items).
It is clear then that differing areas of ability are tapped by the two parts of the
test. This is most probably a question of the content of each part; Part A
involves following an abstract discourse, whereas Part B involves understanding
details of concrete events and personal circumstances in the case history. The
two types of listening task can be viewed perhaps in terms of the continua more
or less cognitively demanding and more or less context embedded proposed by
Cummins (1984). The data from the test may be seen as offering support for a
similar distinction in the context of listening tasks facing health professionals
working through the medium of a second language. The data also offer evidence
in support of the content validity of the test, and suggest that the two parts are
sufficiently distinct to warrant keeping both. Certainly, in terms of backwash
effect, one would not want to remove the part of the test which focuses on the
consultation, as face-to-face communication with patients is perceived by former
test candidates as the most frequent and the most complex of the communication
tasks facing them in clinical settings (McNamara, 1989b).
The interpretation offered above is similar in kind to that offered by Pollitt
CONCLUSION
An IRT Partial Credit analysis of a two-part ESP listening test for health
professionals has been used in this study to investigate the controversial issue of
test unidimensionality, as well as the nature of listening tasks in the test. The
analysis involves the use of two independent tests of unidimensionality, and both
confirm the finding of the usual analysis of the test data in this case, that is, that
it is possible to construct a single dimension using the items on the test for the
issues involved, suggest that the misgivings sometimes voiced about the
limitations or indeed the inappropriateness of IRT for the analysis of language
test data may not be justified. This is not to suggest, of course, that we should be
uncritical of applications of the techniques of IRT analysis.
Moreover, the analysis has shown that the kinds of listening tasks presented to
candidates in the two parts of the test represent significantly different tasks in terms
180
of the level of ability required to deal successfully with them. This further confirms
the useful role of IRT in the investigation of the content and construct validity of
language tests.
REFERENCES
Adams, R J, P E Griffin and L Martin (1987). A latent trait method for measuring a dimension in second language proficiency. Language Testing 4,1: 9-27.

Andrich, D (1978b). Scaling attitude items constructed and scored in the Likert tradition. Educational and Psychological Measurement 38: 665-680.

Henning, G (1988). The influence of test and sample dimensionality on latent trait person ability and item difficulty calibrations. Language Testing 5,1: 83-99.

Henning, G, T Hudson and I Turner (1985). Item response theory and the assumption of unidimensionality for language tests. Language Testing 2,2: 141-154.

McNamara, T F (1990). Item Response Theory and the validation of an ESP test for health professionals. Paper presented at the Language Testing Research Colloquium, San Francisco, March 2-5.
INTRODUCTION
in all British Council offices around the world, in the Australian Education
Centres and IDP offices being established in many Asian and Pacific countries,
and in other centres where trained administrators are available. The test is
designed to assess the English proficiency (both general and special purpose) of the large and
growing number of international students wishing to study, train or learn English
in Australia, Britain, and other English-speaking countries. The test seeks, first,
to establish whether such students have sufficient English to undertake training
and academic programmes without their study or training being unduly inhibited
by their English skills; second, to provide a basis on which to estimate the nature
wishing to learn English; and, fourth, when the students exit from an English
course, to provide them with an internationally used and recognized statement of
English proficiency.
The form that the IELTS takes has been determined by the purposes it has
to serve, evaluations of earlier tests, needs analyses of end-users of the test's
results, and significant changes in applied linguistics through the 1980s (eg,
growing scepticism about the practicality of Munby-type needs
analysis (Munby 1978) and ESP theory).
The purposes the IELTS serves require that it provides a valid and reliable
measure of a person's practical proficiency both for general and for academic
purposes. Because of the large number of students involved, the test has to be
readily and rapidly administered and marked, preferably by clerical rather than
more expensive professional staff, and able to be administered on demand,
worldwide, en masse (except for the Speaking test), and often in remote
localities by persons with little training, little professional supervision, and no
access to sophisticated equipment such as language laboratories. In addition,
once the test has been taken and scored, its results must be readily interpretable
even by non-professionals such as institutions' admission officers.
The large number of candidates to whom the test has to be administered
also implies a diversity of academic or training fields within which candidates'
English proficiency should be measured. While face validity would suggest that
the test should have modules relevant to the specific academic or training fields
candidates wish to enter, the sheer diversity of those fields (one count, for
example, indicated 34 different fields within Engineering) makes a hard version
of specific purpose testing impractical while, in any case, there has been growing
scepticism through the 1980s about the validity of hard versions of ESP theory.
In addition to the constraints imposed by the purposes the test serves, its
form was influenced by the pre-development studies made in the course of
Training sub-test focus especially around Band level 4 while the EAP/ESP
component focuses around Band level 6 without, in both cases, excluding the
possibility of information being provided about candidates above and below
these levels. In addition, within each sub-test, different phases of the sub-test
are focused around different band levels in order to provide graduation within
the sub-test as well as between the General and Modular components.
As already indicated, the purposes the IELTS has to serve necessitate that
it assess both general proficiency and ESP/EAP proficiency. Consequently, the
[Figure 1: band-level focus of each sub-test. The General component (Speaking and Listening) and the General Training module focus around Bands 3-6; the Modular (EAP/ESP) components focus around Bands 5-7.]
The nature and purpose of the modular components raise some difficult
issues for the form of the test. For reasons already indicated, the three
ESP/EAP modules are less specific than, for instance, the former ELTS test and
favour the features and academic skills of the broader rather than more specific
discipline areas. In addition to the reasons already stated for this, the less
specific nature of the Ms and their greater focus on EAP rather than ESP make
them more compatible with their use with both undergraduates and graduates.
with the broad range of academic tasks and register features of the broad
discipline area to be entered. However, the M's were considered inappropriate
for persons at lower levels of academic development (upper Secondary School),
for persons entering vocational training such as apprenticeships, or for persons
participating in on-the-job attachments. For these, the emphasis is on general
proficiency and consequently they take the G component together with the
General Training (GT) module which, like the M's, assesses reading and writing
significant academic demands would take the relevant M rather than the GT (eg,
persons entering a diploma-level TAFE course, persons going on attachment to
a scientific laboratory, or subject-specialist teachers going for practical training,
work experience or on exchange in a school). To prevent candidates' taking the
General Training module rather than one of the other modules in the belief that
it is easier to score higher on the more general test, an arbitrary ceiling of Band
6 has been imposed on the General Training module. The logic, practical value
and validity of this decision have yet to be fully tested and will, undoubtedly, be
an issue for future consideration by the International Editing committee.
The effect of the pattern of sub-tests just outlined is to enable the IELTS to
provide a comprehensive measure of general proficiency in all four macroskills
using the two General component sub-tests in Listening and Speaking and the
General Training sub-test in Reading and Writing. For Australian purposes, this
makes the test relevant to ELICOS needs where, for persons entering or exiting
from a general English course, a comprehensive test of general proficiency is
needed. The availability of the EAP/ESP sub-tests in Reading and Writing
means that candidates at higher proficiency levels in all four macroskills can also
have their proficiency comprehensively assessed though with Reading and
Writing being assessed in broad ESP/EAP contexts. It is regrettable that the
decision to limit maximum General Training scores to Band 6 prevents persons
with high general proficiency in Reading and Writing but no ESP or EAP
development from having their skills fully assessed.
The current form of the IELTS as released in November 1989 is the result
of more than two years' work but the permanent administrative structures allow
for considerable on-going development and review activity and, in any case, the
test specifications give considerable freedom to item-writers to vary actual item
types within clearly stated guidelines. No decision concerning the form and
detailed specifications of the test has been taken lightly but, nevertheless, the
need to adhere strictly to the original release date of October-November 1989
inevitably meant that some issues will be subject to further trial and investigation
and this, together with the continual review process, will mean that the test is not
static (and hence rapidly dated) but is in a state of constant evolution.
ability to perform specified item types, and to encourage the sort of innovation
on the part of item-writers that might lead to progressive improvement in what
we would claim is already a valuable test.
candidates whose first language is not English and who are applying to
undertake study or training through the medium of English. It is primarily
intended to select candidates who meet specified proficiency requirements
for their designated programmes. Its secondary purpose is to be a semi-diagnostic test designed to reveal broad areas in which problems with
English language use exist, but not to identify the detailed nature of those
problems.
"The test battery consists of a General and a Modular section. The General
section contains tests of general language proficiency in the areas of listening
and speaking; the Modular section consists of tests of reading and writing for
academic purposes."
impractical even for the four broad fields catered for in the modular component
of the test. Consequently, Listening is part of the general component and has
two stages, the first relating to social situations and the second to course-related
situations. A variety of item types may be used including information transfer
(such as form-filling, completing a diagram, following routes on a map), more
(Australia, Britain and Canada). Discourse styles differ through the test and
include conversation, monologue, and formal and informal lectures. Utterance
rates are graduated through the test from the lower to middle range of native
speaker rates and the contextual features listed ensure that candidates are
required to cope with varied accents, different utterance rates, varied but
relevant situations, and register shifts.
The Speaking sub-test is discussed in detail elsewhere (Ingram 1990). In
brief, it is a direct test of oral proficiency in which the specifications and test
outline seek to exercise more control over the interviewer's options than in more
traditional approaches using, for example, the ASLPR or FSI Scales. The
interview lasts eleven to fifteen minutes, is in five phases, and includes activities
that progressively extend the candidate and give him or her the opportunity to
lead the discussion. After a short introductory phase in which the interviewer
elicits basic personal information, Phase Two gives candidates the opportunity to
provide more extended speech about some familiar aspect of their own culture
or some familiar topic of general interest. Phase Three uses information gap
tasks to have the candidate elicit information and, perhaps, solve a problem.
Phase Four draws on a short curriculum vitae filled out by candidates before the
interview in order to have them speculate about their future, express attitudes
and intentions, and discuss in some detail their field of study and future plans.
There is a very short concluding phase entailing little more than the exchange of
good wishes and farewell. Assessment is by matching observed language
behaviour against a band scale containing nine brief performance descriptions
from 1 (Non-Speaker) to 9 (Expert Speaker). Interviewers are native speakers,
trained ESL teachers who have undergone short formal training in administering
the Speaking test with an additional requirement to work through an interviewer
training package at regular intervals. All interviews are audio-recorded with a
10% sample being returned to Australia or Britain for monitoring of interview
quality and moderation of assessments assigned. This sub-test is of particular
interest from a test design point of view since it draws on the developments in
"direct", interview-based assessment of the last two decades but seeks to control
the interview to maximize validity and reliability for large-scale administration
often in remote locations using minimally trained interviewers. Of particular
importance is the attempt made in the interview to surrender initiative from the
interviewer to the candidate (especially in Phase 3) because of the importance, in
English-speaking academic environments, of students' being willing to ask
questions and seek information for themselves.
Three of the modular tests assess reading and writing in ESP-EAP contexts
while the General Training module (to be discussed subsequently) assesses them
in general and training contexts. It was noted earlier that these tests each have
to be appropriate for candidates in a wide range of disciplines and for both those
entering and those continuing their field. Consequently, though reading or
stimulus materials are chosen from within the broad discipline areas of the
target population of the module, neutral rather than highly discipline-specific
texts are to be used, which excludes such materials as are found in textbooks, so
that item-writers are required to choose "(scientific) magazines, books, academic
band scale score. Whether this more analytic approach to assessing writing
proficiency is superior to a more global band scale approach, or whether the
increased complexity and time demands of the scoring procedure mitigate any
possible benefit from the analytic approach, has yet to be convincingly
demonstrated by research, and this will be a matter for the
International Editing Committee now that the test has been formally released.
The approach to writing assessment used in the IELTS is to be discussed in a
workshop presentation elsewhere in the 1990 RELC Seminar.
The General Training test focuses on Bands 3 to 6 (rising through the test)
with, as already noted, a ceiling of Band 6 on scores that can be assigned. The
tasks are required to focus on those skills and functions relevant to survival in
English speaking countries and in training programmes including, amongst
others, following and responding to instructions, identifying content and main
ideas, retrieving general factual information, and identifying the underlying
theme or concept. Texts should not be specific to a particular field, may be
journalistic, and should include types relevant to training contexts and survival in
English speaking countries (eg, notices, posters, straightforward forms and
documents, institutional handbooks, and short newspaper articles). Some of the
possible item types in the reading test may include inserting headings,
information transfer, multiple choice, short-answer questions with exhaustively
specifiable answcrs, and summary completion. In the Writing test, the skills and
sub-tests. Their choice is constrained by several factors that are not always
compatible, including economy of administration and scoring and the need to
provide a realistic measure of the candidate's practical proficiency. To reduce
costs, the tests are designed to be clerically, and hence essentially objectively,
markable except for Speaking and Writing which are assessed directly against
Band Scales by ESL teachers trained and accredited to assess. Some use of
multiple-choice questions remains though all items are contextualized and item-writers are required at all times to consider the realism of the task the candidate
is being asked to undertake. Item-writers are also required at all times to relate
the variety of needs it was intended to cover, that it have a wide proficiency
range and provide a measure of proficiency in all four macroskills. The General
component and General Training consequently focus around Band 4 and the
three academic sub-tests focus around Band 6 with the band scale spread shown
in Figure 1 (without excluding the possibility in all sub-tests of information being
to cater for a wider range of candidates. In other words, though the IELTS is
designed to select candidates for academic courses and training programmes, it
does not just adopt a threshold approach but is designed to make a statement
about candidates' proficiencies whatever their level might be.
The new test, as the name indicates, is the result of an international project
involving three English-speaking countries with their own distinctive dialects and
in Lancaster. Hence, the present writer was based in Lancaster for thirteen
months and Patrick Griffin for six weeks, an Australian Working Party provided
importance placed on the international nature of the test, its development, and
its on-going management. A number of factors arising from the internationalism
of the test are worthy of note.
First, participating countries have had to adjust their requirements for test
content to accommodate each other's needs. Thus, for instance, the non-academic module in the former ELTS test has been changed to meet Australian
ELICOS needs so that the IELTS General Training module includes a lower
proficiency range, and focuses mainly around general proficiency in reading and
writing.
Second, care is taken to ensure that place names or anything else identified
with one country are balanced by features identified with the other and
unnecessary references to one or the other country are avoided so as to ensure
that nothing is done that could lead the test to be associated with any one of the
participating countries or to bias the test towards candidates going or going back
to any of the countries.
culture that underlies the language and constitutes the meaning system.
However, an international test must avoid country-specific assumptions based on
just one of the cultures involved. To do this, however, requires careful editing by
testers from all participating countries since it is often impossible for a person,
however expert but locked in his or her own culture, even to recognize when
knowledge of certain aspects of his or her culture is assumed. On one occasion
during the writing of an early item in a draft Listening test, for example, the
present writer, coming from Australia where mail is delivered once a day,
Monday to Friday, interpreted a question and marked a multiple choice answer
quite differently from what was intended by the British writer who assumed that
mail was normally delivered once or twice a day, six days a week. In another
question in another sub-test, a person familiar with Australian culture would
have marked a different multiple-choice answer because of assumptions arising
from the fact that, in Australia, large numbers of sheep are trucked on road
trains and so a large flock could be "driven" from one place to another whereas,
if they had been walked there (which the British test writer intended), the verb
used would have been "to drove".
Fourth, the Specifications for all the sub-tests include a section on cultural
appropriacy which emphasizes the need for the test to be equally appropriate for
students going to any of the participating countries and the need to avoid
country-specific cultural knowledge and lexical and other items identified with
any one variety of English. The section on cultural appropriacy also emphasizes
the need to avoid topics or materials that may offend on religious, political or
cultural grounds and to observe international guidelines for non-sexist language.
VI
CONCLUSION
this is probably the first time that a major test has been developed and
maintained in an international project of this sort, in which, in addition to
technical cooperation, the project has sought and continues to seek to draw
equally on the testing expertise available in the participating countries, to foster
that expertise on a wide scale by the deliberate involvement of applied linguists
across each nation, and to develop a test compatible with the needs, dialects and
cultures of the participating countries. Third, certain features of the test itself
are of interest, not least the structured controls on the Speaking test and the
attempt to give candidates the opportunity to take initiative during the interview.
Fourth, there has been a deliberate attempt throughout the development process
to consider the washback effect of the test on English language teaching in other
countries and to adopt test techniques that are more likely to have a favourable
influence on the teaching of English. Finally, the sheer magnitude of the project,
the large number of candidates that will be taking the test, and the need for
much on-going test monitoring and regeneration will provide a considerable
stimulus to the development of language testing as a skilled activity in both
countries. It is, for instance, not coincidental that the recently established
Languages Institute of Australia (a nationally funded centre for research and
information in applied linguistics in Australia) includes within it two language
testing units established in two of the institutions most involved in Australia's
contribution to the development of the International English Language Testing
System.
INTRODUCTION
speaking proficiency in a second language. The OPI, and the scale on which it is
scored, is the precursor of the Australian Second Language Proficiency Ratings
(ASLPR).
The measure I have called a SOPI (Stansfield, 1989) is a tape-recorded test
the ILR skill level descriptions. Thus, the examinee is asked to give directions to
someone using a map, to describe a particular place based on a drawing, and to
narrate a sequence of events in the present, past, and future using drawings in
the test booklet as a guide. Parts five and six of the SOPI require examinees to
tailor their discourse strategies to selected topics and real-life situations. These
parts assess the examinee's ability to handle the functions and content that
characterize the Advanced and Superior levels of the ACTFL guidelines, or
levels two through four of the ILR skill level descriptions. Like the OPI, the
SOPI can end with a wind-down. This is usually one or more easy questions
designed to put the examinee at ease and to facilitate the ending of the
the scores on the two types of test were statistically compared. The results
showed the correlation between the SOPI and the OPI to be .93.
Shortly after arriving at the Center for Applied Linguistics (CAL) in 1986, I
read Clark's report on this project and realized that these favorable results
merited replication by other researchers in situations involving other test
developers and learners of other languages. As a result, I applied for a grant
from the US Department of Education to develop similar tests in four other
languages. Fortunately, the grant was funded, and in August 1987 I began the
development of a similar semi-direct interview test of Portuguese, called the
Portuguese Speaking Test (Stansfield, et al., 1990).
Three forms of this test and an OPI were administered to 30 adult learners
of Portuguese at four institutions. Each test was also scored by two raters. In
this study a correlation of .93 between the two types of test was also found. In
addition, the SOPI showed itself to be slightly more reliable than the OPI, and
raters reported that the SOPI was easier to rate, since the format of the test did
not vary with each examinee.
forms of the test were developed for use in Hebrew language schools for
immigrants to Israel, and two forms were developed for use in North America.
The first two forms were administered to 20 foreign students at the University of
Tel Aviv and the other two forms were administered to 10 students at Brandeis
University and 10 students at the University of Massachusetts at Amherst. Each
group also received an OPI. The correlation between the OPI and this SOPI for
the Israeli version was .89, while the correlation for the US version was .94.
Parallel-form and interrater reliability were also very high. The average
interrater reliability was .94 and parallel form reliability was .95. When
examinees' responses on different forms were scored by different raters, the
reliability was .92.
noticed that examinees tended to assign a name to the person they were
speaking with. As a result, we gave each interlocutor, as appropriate, a name on
the operational forms. To validate the test, 16 adult learners of Indonesian were
administered two forms of the IST and an OPI. The correlation with the OPI
was .95. Reliability was also high, with interrater reliability averaging .97, and
parallel-form reliability averaging .93 for the two raters. When different forms
and different raters were used, the reliability was also .93.
The development of two forms of the Hausa Speaking Test also posed
reliability (.91) in scoring the test and indicated that they believed it elicited an
adequate sample of language from which to assign a rating.
A comparison of the two types of test demonstrates that the SOPI can offer
Reliability. The SOPI has shown itself to be at least as reliable and sometimes
more reliable than the OPI. During the development of the Chinese Speaking
Test (Clark and Li, 1986) the OPI showed an interrater reliability of .92, while
the four forms of the SOPI showed an interrater reliability of .93. On the
Portuguese SOPI that I developed, the interrater reliability for three forms
varied from .93 to .98, while the reliability of the OPI was .94. In addition, some
raters reported that it was sometimes easier to reach a decision regarding the
appropriate score for an examinee who was taking the SOPI than for an
examinee who was taking the OPI. This is because the OPI requires that each
examinee be given a unique interview, whereas the format and questions on an
SOPI are invariant. Under such circumstances, it is often easier to arrive at a
decision on the score. The situation is similar to scoring a batch of essays on the
same topic versus scoring essays on different topics. The use of identical
questions for each examinee facilitates the rater's task. I should be careful to
point out that although the rater's task is made easier by the use of identical
questions, competent raters are able to apply the scale reliably when different
questions are used. Thus, the use of a common test for all examinees does not
guarantee an improvement in reliability over the face-to-face interview.
The length of the speech sample may also facilitate a decision on a rating.
The OPI typically takes about 20 minutes to administer and produces about 15
the OPI, the same interviewer typically rates and scores the test. Yet this
interviewer may not be the most reliable or accurate rater. In the SOPI, one can
have the tape scored by the most reliable rater, even if this rater lives in a
different city or region of the country.
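The interrater reliability figures cited throughout this section are product-moment correlations between pairs of raters scoring the same examinees. As a rough illustration only (not the procedure actually used at CAL, and with invented ratings), such a coefficient could be computed as follows:

```python
# Illustrative sketch: interrater reliability as the Pearson correlation between
# two raters' scores for the same examinees. All ratings below are invented.
from statistics import mean
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rater_1 = [2.0, 2.5, 3.0, 1.5, 2.0, 3.5, 2.5, 1.0]   # hypothetical proficiency ratings
rater_2 = [2.0, 2.0, 3.0, 1.5, 2.5, 3.5, 2.5, 1.5]

print(f"Interrater reliability (r) = {pearson(rater_1, rater_2):.2f}")
```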
Validity. Many factors can affect the validity of a measure of oral proficiency.
The consideration of several factors explains why the SOPI may be as valid as
the OPI.
The SOPI usually produces a longer sample of examinee speech. When this
is the case, the more extensive sample may give it greater content validity.
are of personal interest or within his or her range of awareness. Or, the
interviewer and the interviewee may have very little in common. Finally, if the
interview is too short, it will not adequately sample the language skills of the
interviewee. All of these factors can affect the validity of the OPI.
of topic, if done too abruptly, can seem awkward and disconcerting to the
interviewee. This is not the case when the topic is switched naturally, but such
natural changes in topic of the conversation can only be brought about a limited
number of times (4-8) within the span of a 20 minute conversation. As a result,
In the OPI the examinee speaks directly to a human being. However, the
examinee is fully aware that he or she is being tested, which automatically
creates unnatural circumstances. As van Lier (1989) has noted, in the OPI the
aim is to have a successful interview, not a successful conversation. Thus, even
the OPI is not analogous to a real conversation. The SOPI, on the other hand,
would seem even less natural, since it is neither a conversation nor an interview.
In short, neither format produces a "natural" or "real-life" conversation.
As mentioned above, the interview usually contains two role plays that are
the SOPI correlates so highly with it, even when the OPI is conducted by
experienced, expert interviewers. The explanation probably lies in the
limitations of the OPI itself. Since the SOPI does not measure interactive
language, and the two tests measure the same construct, then the examinee's
skill in verbal interaction must not play a significant role on the OPI.
Consideration of the relationship between interviewer and interviewee on the
OPI suggests this is indeed the case. The interviewer typically asks all the
questions and maintains formal control over the direction of the conversation.
The interviewee plays the subservient role, answering questions and responding
to prompts initiated by the interviewer with as much information as possible. He
to that format. Similarly, if the two test types seem to elicit language that is
qualitatively different, then it would be helpful to know this as well. Currently,
we have available tapes containing examinee responses under both formats.
Elana Shohamy and her associates are currently planning a qualitative study of
the Hebrew tapes. We are willing to make the tapes in Chinese, Portuguese,
Hausa and Indonesian available to other serious researchers. The results of such
Practicality. The SOPI offers a number of practical advantages over the OPI.
The OPI must be administered by a trained interviewer, whereas any teacher,
aide, or language lab technician can administer the SOPI. This may be
especially useful in locations where a trained interviewer is not available. In the
US, this is often the case in languages that are not commonly taught, which are
those for which I have developed SOPI tests thus far.
Another advantage is that the SOPI can be simultaneously administered to
not available locally, one will have to be brought to the examinees from a
distance, which can result in considerable expenditure in terms of the cost of
travel and the interviewer's time. The fact that the SOPI makes it possible to
administer the test simultaneously to groups obviates the need for several
interviewers who would interview a number of examinees within a short period
of time.
CONCLUSION
standardized, semi-direct format of the test does not permit the extensive
probing that may be necessary to distinguish between the highest levels of
proficiency on the ILR scale, such as levels 4, 4+, and 5.
The purpose of testing may also play a role in the selection of the
agency for public schools in the state of Texas, agrees with me on this point.
Recently, it awarded CAL a contract to develop SOPI tests in Spanish and
French for teacher certification purposes in Texas).
When conducting research on language gains or language attrition, use of
the SOPI would permit one to record the responses of an examinee at different
points in time, such as at six-month intervals. These responses could then be
analyzed in order to determine their complexity. In this way, the SOPI would
When scores will not be used for important purposes, and a competent
point I have made here is that when quality control is essential, and when it
cannot be assured for all examinees using the OPI, then the SOPI may be
preferable, given the high degree of quality control it offers. When quality
control can be assured, or when it is not a major concern, or when assessment at
very low and very high ability levels is required, or when practical considerations
do not dictate test type, then the OPI may be preferable.
REFERENCES
Shohamy, E., Gordon, C., Kenyon, D. M., and Stansfield, C. W. (1989). The
Van Lier, L. (1989). Reeling, Writing, Drawling, Stretching, and Fainting in Coils:
Oral Proficiency Interviews as Conversation. TESOL Quarterly, 23(3), 489-508.
[Rating scale used in PST scoring: Novice; Intermediate Low, Intermediate Mid, Intermediate High; Advanced, Advanced Plus; Superior; High Superior. The High Superior rating, which is not part of the ACTFL scale, is used in PST scoring for examinees who clearly exceed the requirements for a rating of Superior.]
SOUTHEAST ASIAN
LANGUAGES PROFICIENCY EXAMINATIONS
James Dean Brown
H. Gary Cook
Charles Lockhart
Teresita Ramos
ABSTRACT
This study (N = 218) explored the score distributions for each test on the
proficiency batteries for each language, as well as differences between the
distributions for the pilot (1988) and revised (1989) versions. The relative
reliability estimates of the pilot and revised versions were also compared as were
the various relationships among tests across languages.
The results are discussed in terms of the degree to which the scores on the
Liskin-Gasparro 1982, and/or ILR 1982). Though the ACTFL guidelines are
somewhat controversial (e.g., see Savignon 1985; Bachman and Savignon 1986),
they provided a relatively simple paradigm within which we could develop and
describe these tests in terms familiar to all of the teachers involved in the
project, as well as to any language teachers who might be required to use the
tests in the future.
The central research questions investigated in this study were as follows:
(1)
How are the scores distributed for each test of the proficiency
battery for each language, and how do the distributions differ
between the pilot (1988) and revised (1989) versions?
(2)
To what degree are the tests reliable? How does the reliability differ
between the pilot and revised versions?
(3)
(4)
To what degree are the tests parallel across languages?
(5)
To what degree are the tests valid for purposes of testing overall
proficiency in these languages?
(6)
To what degree are the strategies described here generalizable to test
development projects for other languages?
METHOD
A test development project like this has many facets. In order to facilitate
the description and explanation of the project, this METHOD section will be
organized into a description of the subjects used for norming the tests, a section
on the materials involved in the testing, an explanation of the procedures of the
project, and a description of the statistical procedures used to analyze, improve and reanalyze the tests.
Subjects
A total of 218 students were involved in this project: 101 in the pilot stage
of this project and 117 in the validation stage.
The 101 students involved in the pilot stage were all students in the SEASSI
program during the summer of 1988 at the University of Hawaii at Manoa. They
were enrolled in the first year (45.5%), second year (32.7%) and third year
(21.8%) language courses in Indonesian (n = 26), Khmer (n = 21), Tagalog (n
= 14), Thai (n = 17) and Vietnamese (n = 23). There were 48 females (47.5%)
and 53 males (52.5%). The vast majority of these students were native speakers
of English. The 117 students involved in the validation stage of this project were
all students in the SEASSI program during summer 1989. They
were enrolled in the first year (48.7%), second year (41.0%) and third year
(10.3%) language courses in Indonesian (n = 54), Khmer (n = 18), Tagalog (n
= 10), Thai (n = 23) and Vietnamese (n = 12). There were 57 females (48.7%)
and 60 males (51.3%).
In general, all of the groups in this study were intact classes. To some
degree, the participation of the students depended on the cooperation of their
teachers. Since that cooperation was not universal, the samples in this project
Materials
There were two test batteries employed in this project. The test of focus
was the SEASSIPE. However, the Modern Language Aptitude Test (MLAT),
developed by Carroll and Sapon (1959), was also administered. Each will be
described in turn.
Description of the SEASSIPE. The SEASSIPE battery for each language
presently consisted of four tests: multiple-choice listening, oral interview
procedure, dictation and cloze test. In order to make the tests as comparable as
possible across the five languages, they were all developed first in an English
prototype version. The English version was then translated into the target
language with an emphasis on truly translating the material into that language
such that the result would be natural Indonesian, Khmer, Tagalog, Thai or
Vietnamese. The multiple-choice listening test presented the students with aural
statements or questions in the target language, and they were then asked what
they would say (given four responses to choose from). The pilot versions of the
test all contained 36 items, which were developed in 1988 on the basis of the
ACTFL guidelines for listening (see APPENDIX A). The tests were then
administered in the 1988 SEASSI. During 1989, the items were revised using
distractor efficiency analysis, and six items were eliminated on the basis of
overall item statistics. Thus the revised versions of the listening test all
contained a total of 30 items.
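Distractor efficiency analysis of the kind mentioned here typically tabulates how often each option of a multiple-choice item is chosen and flags options that attract almost no examinees. The sketch below is a generic illustration under that assumption; the response data, answer key and 5% threshold are invented and are not the authors' actual figures.

```python
# Generic sketch of distractor efficiency analysis on invented answer sheets.
from collections import Counter

responses = [            # hypothetical answer sheets, one string per examinee
    "ABCDA", "ABCDB", "CBCDA", "ABADA", "ABCDC", "DBCDA",
]
key = "ABCDA"            # hypothetical answer key

for item in range(len(key)):
    choices = Counter(sheet[item] for sheet in responses)
    facility = choices[key[item]] / len(responses)   # proportion answering correctly
    dead = [opt for opt in "ABCD"                    # distractors chosen by < 5%
            if opt != key[item] and choices[opt] / len(responses) < 0.05]
    print(f"Item {item + 1}: facility={facility:.2f}, "
          f"choice counts={dict(choices)}, non-functioning distractors={dead}")
```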
The oral interview procedure was designed such that the interviewer would
ask students questions at various levels of difficulty in the target language (based
36 questions, this scale had 0 to 3 points (one each for three categories:
accuracy, fluency, and meaning). On the revised version of the interview, 12
questions were eliminated. Hence on the revised version, the students were
rated on a 0-72 scale including one point each for accuracy, fluency and meaning
based on a total of 24 interview questions.
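As a small worked illustration of that revised scoring scheme (24 questions, with up to one point each for accuracy, fluency and meaning, giving a 0-72 range), consider the following sketch; the ratings are invented.

```python
# Assembling a 0-72 interview total from invented per-question ratings.
NUM_QUESTIONS = 24
CATEGORIES = ("accuracy", "fluency", "meaning")

# hypothetical ratings: 1 = criterion met on that question, 0 = not met
ratings = [{"accuracy": 1, "fluency": 1, "meaning": 0} for _ in range(NUM_QUESTIONS)]

total = sum(q[c] for q in ratings for c in CATEGORIES)
print(f"Interview total: {total} out of {NUM_QUESTIONS * len(CATEGORIES)}")  # maximum is 72
```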
created in the target language by translating the English passage and deleting
every 13th word for a total of 30 blanks. The pilot and revised versions of this
test each had the same number of items. However, blanks that proved
ineffective statistically or linguistically in the pilot versions were changed to more
promising positions in the revised tests (see Brown 1988b for more on cloze test
improvement strategies).
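A fixed-ratio deletion procedure of the sort described (every 13th word, 30 blanks) can be sketched as follows. This is only an illustration: the placeholder passage, the untouched 26-word lead-in and the helper function are assumptions for the sketch, not the authors' actual materials or tools.

```python
# Rough sketch of fixed-ratio cloze construction (every nth word deleted).
def make_cloze(text, nth=13, max_blanks=30, lead_in=26):
    """Replace every nth word (after an untouched lead-in) with a numbered blank.
    The lead-in length is an assumption; the paper does not specify one."""
    words = text.split()
    answers, out = [], []
    for i, word in enumerate(words):
        position = i - lead_in
        if position >= 0 and (position + 1) % nth == 0 and len(answers) < max_blanks:
            answers.append(word)                 # keep the deleted word as the key
            out.append(f"__({len(answers)})__")  # numbered blank in the test text
        else:
            out.append(word)
    return " ".join(out), answers

passage = "word " * 450          # stand-in passage long enough for 30 blanks
cloze_text, key = make_cloze(passage)
print(f"{len(key)} blanks created")
```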
As mentioned above, these four tests were developed for each of five
languages taught in the SEASSI. To the degree that it was possible, they were
made parallel across languages. The goal was that scores should be comparable
across languages so that, for instance, a score of 50 on the interview procedure
for Tagalog would be approximately the same as a score of 50 on the Thai test.
experimental. Hence the results of the pilot project were used primarily to
improve the tests and administration procedures in a revised version of each test.
The scores were reported to the teachers to help in instructing and grading the
students. However, the teachers were not required, in any way, to use the
results, and the results were NOT used to judge the effectiveness of instruction.
Teachers' input was solicited and used at all points in the test development
process.
Description of the MLAT. The short version of the MLAT was also
administered in this study. Only the last three of the five tests were administered
as prescribed for the short version by the original authors. These three tests are
entitled spelling clues, words in sentences and paired associates.
Procedures
The overall plan for this project proceeded on schedule in four main stages and a
number of smaller steps.
Stage one: Design. The tests were designed during June 1988 at the
University of Hawaii at Manoa by J D Brown, Charles Lockhart and Teresita
Ramos with the cooperation of teachers of the five languages involved (both in
of July 1988 and the tests were actually administered in SEASSI classes on
August 5, 1988. This stage was the responsibility of T. Ramos with the help of C.
Lockhart.
Stage three: Validation. The on-going validation process involved the
collection and organization of the August 5th data, as well as teacher ratings of
Analyses
The analyses for this study were conducted using the QuattroPro
spreadsheet program (Borland 1989), as well as the ABSTAT (Bell-Anderson
1989) and SYSTAT (Wilkinson 1988) statistical programs. These analyses fall
into four categories: descriptive statistics, reliability statistics, correlational
analyses, and analysis of covariance.
Because of the number of tests involved when we analyzed four tests each
in two versions (1988 pilot version and 1989 revised version) for each of five
languages (4 x 5 x 2 = 40), the descriptive statistics reported here are limited to
the number of items, the number of subjects, the mean and the standard
deviation. Similarly, reliability statistics have been limited to the Cronbach alpha
coefficient (see Cronbach 1970) and the Kuder and Richardson (1973) formula
RESULTS
Summary descriptive statistics are presented in Table 1 for the pilot and
revised versions of the four tests for each of the five languages. The languages
are listed across the top of the table with the mean and standard deviation for
each given directly below the language headings. The mean provides an
indication of the overall central tendency, or typical behavior of a group, and the
standard deviation gives an estimate of the average distance of students from the
mean (see Brown 1988a for more on such statistics). The versions (i.e., the pilot
versions administered in summer of 1988 or the revised versions administered in
summer of 1989) and tests (Listening, Oral Interview, Dictation and Cloze Test)
are labeled down the left side of the table along with the number of items (k) in
parentheses.
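For readers who want a concrete illustration of the two statistics reported in Table 1, the following minimal sketch (with invented scores) computes them with Python's standard library:

```python
# Worked illustration of the descriptive statistics used in Table 1.
from statistics import mean, stdev

scores = [12, 15, 17, 18, 20, 22, 25]                # hypothetical test scores
print(f"mean = {mean(scores):.2f}")                  # central tendency
print(f"standard deviation = {stdev(scores):.2f}")   # spread about the mean
```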
[Table 1. SEASSIPE descriptive statistics: number of items (k), mean and standard deviation for the pilot (1988) and revised (1989) versions of the Listening, Oral Interview, Dictation and Cloze tests in each of the five languages. The individual values are not legible in this copy.]
Notice that, for each test, there is considerable variation across versions and
languages not only in the magnitude of the means but also among the standard
deviations. It seems probable that the disparities across versions (1988 and
1989) are largely due to the revision processes, but they may in part be caused by
differences in the samples used during the two SEASSIs.
[Table 2. SEASSIPE test reliability estimates for the pilot (1988) and revised (1989) versions of each test in each language. The individual coefficients are not legible in this copy; na = not applicable.]
listening and cloze tests in this study). However, for any test which has a
weighted scoring system (like the Interview tests in this study), another version of
alpha must be applied -- in this case, one based on the odd-even variances (see
Cronbach 1970).
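A minimal sketch of the two estimates referred to here is given below: ordinary Cronbach alpha computed from item variances, and a two-part version of alpha based on odd- and even-numbered half-tests that can be used when items carry weighted scores. The score matrix is invented, and this is an illustration of the general formulas rather than the authors' own computations.

```python
# Cronbach alpha from item variances, and a two-part (odd-even) version of alpha.
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: list of items, each a list of scores across examinees."""
    k = len(item_scores)
    totals = [sum(col) for col in zip(*item_scores)]          # total score per examinee
    item_var = sum(pvariance(item) for item in item_scores)   # sum of item variances
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))

def odd_even_alpha(item_scores):
    """Alpha treating the odd- and even-numbered items as two half-tests."""
    odd = [sum(vals) for vals in zip(*item_scores[0::2])]
    even = [sum(vals) for vals in zip(*item_scores[1::2])]
    total = [o + e for o, e in zip(odd, even)]
    return 2 * (1 - (pvariance(odd) + pvariance(even)) / pvariance(total))

data = [                     # 4 hypothetical items scored for 6 examinees
    [1, 0, 1, 1, 0, 1],
    [2, 1, 2, 3, 1, 2],
    [1, 1, 0, 1, 0, 1],
    [3, 2, 3, 3, 1, 2],
]
print(f"alpha = {cronbach_alpha(data):.2f}, odd-even alpha = {odd_even_alpha(data):.2f}")
```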
TABLE 3: SEASSIPE TEST INTERCORRELATIONS FOR EACH LANGUAGE
[The correlation matrix (intercorrelations of the Listening, Oral Interview, Dictation and Cloze Test scores for the 1988 and 1989 versions in each language, with asterisks marking coefficients significant at p < .05) is not legible in this copy.]
A correlation coefficient indicates the degree to which two sets of numbers are related. A coefficient of 0.00
indicates that the numbers are totally unrelated. A coefficient of +1.00 indicates
that they are completely related (mostly in terms of being ordered in the same
way). A coefficient of -1.00 indicates that they are just as strongly related, but in
opposite directions, i.e., as one set of numbers becomes larger, the other set
grows smaller. Naturally, coefficients can vary throughout this range from -1.00
to 0.00 to +1.00.
Notice that the languages are labeled across the top with Listening (L),
Oral Interview (O) and Dictation (D) also indicated for each language. The
versions (1988 or 1989) and tests (Oral Interview, Dictation and Cloze Test) are
also indicated down the left side. To read the table, remember that each
correlation coefficient is found at the intersection of the two variables that were
being examined. This means, for instance, that the .54 in the upper-left corner
indicates the degree of relationship between the scores on the Oral Interview
and Listening tests in Indonesian in the 1988 pilot version.
There is only a 5 percent probability that the coefficients with asterisks occurred by chance alone. Put yet another way, there is a 95
percent probability that the coefficients with asterisks occurred for other than
chance reasons. Those coefficients without asterisks can be interpreted as being
zero.
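The asterisk convention described here can be reproduced by testing each observed coefficient against zero at the .05 level. The sketch below is one way of doing so, assuming SciPy is available; the score vectors are invented.

```python
# Checking whether an observed correlation differs from zero at the .05 level.
from scipy.stats import pearsonr

listening = [14, 18, 22, 16, 25, 19, 21, 15, 23, 20]   # hypothetical scores
oral_intv = [40, 52, 60, 45, 66, 50, 58, 42, 61, 55]

r, p = pearsonr(listening, oral_intv)
flag = "*" if p < 0.05 else ""
print(f"r = {r:.2f}{flag} (two-tailed p = {p:.3f}, n = {len(listening)})")
```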
[Analysis of covariance summary table: between-subjects and within-subjects sources (Language, the MLAT covariate, and subjects within groups) with their sums of squares, degrees of freedom, mean squares and F ratios; asterisks mark F ratios significant at p < .05. The individual values are not reliably legible in this copy.]
The significant F ratio for Language indicates that there was a significant difference among the
languages across the four tests. This means in effect that at least one of the
differences in means shown in Table 1 is due to other than chance factors (with
95 percent certainty). Of course, many more of the differences may also be
significant, but there is no way of knowing which they are from this overall
analysis. It should suffice to recognize that a significant difference exists
somewhere across languages. The lack of asterisks after the F ratios for the
MLAT indicates that there was no significant difference detected in language
aptitude (as measured by the MLAT) among the groups of students taking the five
languages.
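For readers unfamiliar with the procedure, a much-simplified, between-subjects version of this kind of analysis of covariance (language as the grouping factor, the MLAT as the covariate, a single test as the dependent variable) might look like the sketch below. It is not the repeated-measures design the authors used, it assumes pandas and statsmodels are available, and the data frame is invented.

```python
# Simplified between-subjects ANCOVA sketch: listening score by language,
# controlling for MLAT aptitude. All data below are invented.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "language": ["Indonesian", "Khmer", "Tagalog", "Thai", "Vietnamese"] * 6,
    "mlat":     [23, 30, 27, 25, 29, 31, 22, 28, 26, 24] * 3,
    "listening": [15, 18, 20, 17, 22, 24, 14, 19, 21, 16,
                  18, 20, 23, 15, 21, 25, 13, 17, 22, 19,
                  16, 19, 24, 18, 20, 23, 15, 18, 21, 17],
})

model = smf.ols("listening ~ C(language) + mlat", data=df).fit()
print(anova_lm(model, typ=2))   # F tests for the language factor and the covariate
```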
Thus the assumptions were found to be met for the univariate analysis of
covariance procedures in a repeated measures design, and the results were
further confirmed using multivariate procedures. It is therefore with a fair
amount of confidence that these results are reported here.
[Table: mean, standard deviation and n for each test (Listening, Oral Interview, Dictation, Cloze Test) by year of study (1st, 2nd and 3rd year). The individual values are not reliably legible in this copy.]
One other important result was found in this study: the tests do appear to
reflect the differences in ability found between levels of language study. This is
an important issue for overall proficiency tests like the SEASSIPE because they
should be sensitive to the types of overall differences in language ability that
DISCUSSION
The purpose of this section will be to interpret the results reported above
with the goal of providing direct answers to the original research questions posed
at the beginning of this study. Consequently, the research questions will be
restated and used as headings to help organize the discussion.
(1)
How are the scores distributed for each test of the proficiency battery for each
language, and how do the distributions differ between the pilot (1988) and
revised (1989) versions?
The results in Table 1 indicate that most of the current tests are reasonably
well-centered and have scores that are fairly widely dispersed about the central
tendency. Several notable exceptions seem to be the 1989 Oral Interviews for
Indonesian and Khmer, both of which appear to be negatively skewed (providing
classic examples of what is commonly called the ceiling effect -- see Brown 1988a
for further explanation). It is difficult, if not impossible, to disentangle whether
the differences found between the two versions of the test (1988 and 1989) are
due to the revision processes in which many of the tests were shortened and
improved, or to differences in the samples used during the two SEASSIs.
(2) To what degree are the tests reliable? How does the reliability differ between
the pilot and revised versions?
Table 2 shows an array of reliability coefficients for the 1988 pilot version
and 1989 revised tests that are all moderate to very high in magnitude. The
lowest of these is for the 1989 Indonesian Listening test. It is low enough that
the results for this test should only be used with extreme caution until it can be
administered again to determine whether the low reliability is a result of bad test
design or some aspect of the sample of students who took the test.
have generally, though not universally, improved test reliability either in terms of
that these correlation coefficients for Thai are based on very small samples (due
mostly to the fact that students at the lowest level were not taught to write in
Thai), and that these correlation coefficients were not statistically significant at
the p < .05 level. They must therefore be interpreted as correlation coefficients
that probably occurred by chance alone, or simply as correlations of zero.
(4) To what degree are the tests parallel across languages?
The analysis of covariance indicated a significant difference among the mean test scores across the five languages
despite the efforts to control for initial differences in language aptitude (the
MLAT covariate). A glance back at Table 1 will indicate the magnitude of such
differences.
One possible cause for these differences is that the tests have changed
during the process of development. Recall that all of these tests started out as
the same English language prototype. It is apparent that, during the processes
(5) To what degree are the tests valid for purposes of testing overall proficiency in
these languages?
The intercorrelations among the tests for each language (see Table 3)
indicate that moderate to strong systematic relationships exist among many of
the tests in four of the five languages being tested in this project (the exception is
Thai). However, this type of correlational analysis is far from sufficient for
analysing the validity of these tests. If there were other well-established tests of
the skills being tested in these languages, it would be possible to administer
those criterion tests along with the SEASSIPE tests and study the correlation
coefficients between our relatively new tests and the well-established measures.
Such information could then be used to build arguments for the criterion-
related validity of some or all of these measures. Unfortunately, no such well-established criterion measures were available at the time of this project.
However, there are results in this study that do lend support to the
construct validity of these tests. The fact that the tests generally reflect
differences between levels of study (as shown in Table 3) provides evidence for
the construct validity (the differential groups type) of these tests.
Nevertheless, much more evidence should be gathered on the validity of the
various measures in this study. An intervention study of their construct validity
could be conducted, and factor analysis might also be used profitably to explore the variance
structures of those relationships.
The point is that there are indications in this study of the validity of the tests
involved. However, in the study of validity, it is important to build arguments from
a number of perspectives on an ongoing basis. Hence, in a sense, the study of
validity is never fully complete as long as more evidence can be gathered and
stronger arguments can be constructed.
(6) To what degree are the strategies described here generalizable to test
development projects for other languages?
From the outset, this project was designed to provide four different types of
proficiency tests -- tests that would be comparable across five languages. The
intention was to develop tests that would produce scores that were comparable
We now know that the use of English language prototypes for the
development of these tests may have created problems that we did not foresee.
One danger is that such a strategy avoids the use of language that is authentic in
the target language. For instance, a passage that is translated from English for
use in a Khmer cloze test may be on a topic that would never be discussed in the target
culture, may be organized in a manner totally alien to Khmer, or may simply
seem stilted to native speakers of Khmer because of its rhetorical structure.
These problems could occur no matter how well-translated the passage might be.
Ultimately, the tests did not turn out to be similar enough across languages
to justify using this translation strategy. Thus we do not recommend its use in
further test development projects. It would probably have been far more
profitable to use authentic materials from the countries involved to develop tests
directly related to the target languages and cultures.
CONCLUSION
to better serve the population of students and teachers who are the ultimate
users of such materials.
One final point must be stressed: we could never have successfully carried
out this project without the cooperation of the many language teachers who
volunteered their time while carrying out other duties in the Indo-Pacific Languages
department, or the SEASSIs held at the University of Hawaii at Manoa. We owe each
of these language teachers a personal debt of gratitude. Unfortunately, we can only
thank them as a group for their professionalism and hard work.
REFERENCES
Carroll, J B and S M Sapon (1959). Modern language aptitude test. New York:
The Psychological Corporation.
APPENDIX A

ACTFL (1986) Generic Proficiency Descriptions: Speaking and Listening
[The appendix reproduces the full ACTFL generic descriptors for speaking (Novice Low through Superior) and listening (Novice Low through Distinguished); the text is too degraded in this copy to be reproduced legibly.]
INTRODUCTION
It would be quite naive and perhaps even imprudent to suggest, then, that all
teachers will also by extension make naturally good testers given Spolsky's (1975)
The contention therefore is that the teacher who has had some
responsibility for course design and implementation is in many ways pre-eminently qualified to construct tests for the course, particularly if it is backed by
experience and shared knowledge in the field. Since the target group is known at
first hand, needs can be fairly accurately specified on the basis of introspection
and experience. The backwash effect of teacher-made tests on teaching can only
be beneficial. As the teacher in this case is also responsible for course content
(and like all other teachers across the board has the best interests of her students
at heart), she will certainly teach what is to be tested, test what is taught and
'bias for best' in the use of test procedures and situations. The only possible
danger lurking in this happy land is the possibility of a teacher who willy-nilly
teaches the test as well and thereby nullifies its value as a measuring instrument.
BACKGROUND
situation in the country which while overtly ESL also manifests many hybrids of
integrated teaching of the four language skills. These students have also had a
minimum of eleven years of instruction in English as a subject in school. There
is also invariably the case of the mature student who has probably had 'more'
English instruction, having been subject chronologically to a different system of
education in the country's history.
Course Objectives
The oral communication course comprises two levels, each level taught
over two semesters consecutively. The general aim of level I is to provide a
language learning environment for the acquisition of advanced oral skills and
that of level II to augment and improve upon the skills acquired in level I, thus
providing a learning continuum for the acquisition of advanced oral skills. At
this juncture it must be pointed out that in the integrated program of the first
year there is an oral fluency component. In other words, the students in the
second year have already been thrown into the 'deep end' as it were, and the
assumption is that upon entry to Level I they have more than banal or survival
skills in oral communication. The reality is that students, in spite of the first year
of fairly intensive instruction and exposure, enter the second year with varying
levels of abilities. The task at hand for the second year oral skills programme is
quite clear: raise levels of individual oral ability, bridge varying levels of
individual abilities and yet help students to develop at their own pace. Hence the
need to see the language class as a language acquisition environment, bearing in
mind that contact and exposure with the language outside the class is not
optimal. The main objective in Level I is to achieve a high level of oral fluency
in the language with an accompanying level of confidence and intelligibility, the
latter being viewed with some urgency since native vernaculars are increasingly
used for social communication outside the classroom and Bahasa Malaysia
remains the language of instruction for courses in all other disciplines. The main
objective of Level II is to achieve a high level of oral language ability. Both these
objectives are further broken down into specific objectives for both levels. The
tests are pegged against these objectives.
The specific objectives of Level I of the course are as follows:
1. interact and converse freely among themselves and other speakers of the
language
Having said that, why then has listening comprehension been included as a
not only comprehend all standard varieties of the language but also make
level and a videotaped presentation for the second level of either one of two
THE TESTS
Some Considerations
prescribed by the oral skills domains. Therefore each test would sample
different behaviour or skills in the form of different speech modes and the task
specifications will vary from test type to test type. However, all tests will test for
both linguistic and communicative ability.
"It is difficult to totally separate the two criteria, as the linguistic quality of
It is quite clear that as the view of the language underlying the teaching is
The tests are administered at various points in the semesters that roughly
coincide with points on the course where the skills to be tested have already been
"Content validity refers to the ability of a test to measure what has been
By spacing out the tests in relation to the content, not only is the teacher-tester able to 'fit' the test to the content, she is also able after each test to obtain
valuable feedback for the teaching of the subsequent domains that have been
arranged in a cyclical fashion. Hence learning and performance is also on a
cumulative basis because each skill taught and learnt or acquired presupposes
and builds on the acquisition and the development of the preceding skills. It is
on these bases that the tests have been developed and administered over a
period of time. They are direct tests of performance that are communicative in
Test Types
Level I
Extended/impromptu speech
Group discussion
End-of-semester project
There are three speaking tasks of this type. Students speak for about 2
minutes on the first, 2-3 on the second and 3-5 on the third. The tasks test for
three modes of speech as follows:
(i) and (ii) are tested at the beginning of the first level mainly for diagnostic
However, the topics for (iii) are of a slightly controversial nature, such as
Should smoking be banned in all public places?
Do women make better teachers?
Both (ii) and (iii) are rated for global ability to communicate in the mode,
which is the overall ability of the student to persuade or justify reasons taken for
a stand in the case of the latter and to describe, report and narrate in the case of
the former.
The group discussion test is administered in the second half of the semester
as by this time there has been plenty of practice in the interaction mode, as the
modus operandi of Level I is small group work. It tests specifically for oral
interaction skills. The topics for group discussion tests are also based on the
tacit principle that the content should be either familiar or known and not pose
problems in the interaction process. Though the amount of communication (size
of contribution) and substantiveness are rated as criteria, content per se is not
rated. Group discussion in Level I tests lower order interaction skills that are
discernible at the conversational level.
The group discussion test has been modelled on the lines of the Bagrut
group discussion test with some modifications (see Shohamy, E., Reyes, T. and
Bejerano, Y. 1986 and Gefen, R. 1987). In Level I the topics are of matters
that either concern or pose a problem to the test takers as UKM students.
Hence there is sufficient impetus to talk about them and this 'guarantees'
initiation by all members of the group in the discussion. Topics in the form of
statements are distributed just before the tests from a prepared pool of topics.
Each topic comes with a set of questions. Students are allowed to read the
questions in advance but discussion on the topic and questions before the test is
not permitted. These questions function as cues to direct and manage the
interaction. They need not be answered. In fact students may want to speak on
other aspects of the topic. An example of the topic and questions is as follows:
Scholarships should be awarded on need and not on merit.
(a) Are both equally important considerations?
(b) Should students have a say in who gets scholarships, i.e., have student
representatives on scholarship boards?
(c)
Groups of 4 took 15-20 minutes to round off the discussion and groups of 5 took about 20-25
minutes. However, it is desirable not to cut off the discussion after 20-25
minutes, as extra time (usually an extra 5 minutes) helped to confirm ratings.
Rating is immediate on the score sheets prepared for the test (see Appendix C
ii). A variation of the topics with maximum backwash effect on learning is to use
books that have been recommended for extensive reading as stimulus for group
discussion. This has been trialled as a class activity.
It can be seen that the oral interview test is noticeably absent in the
sampling of speech interactions for Level I of the course and probably begs the
question why, as it is a common and well established test for testing oral
interaction. Suffice to say that it is firstly one of the tests administered in the
first year integrated program (and therefore sampled). Secondly, the group
discussion appears to be a more valid (face and content) test of oral interaction
in relation to the course objectives.
Since a premium is placed on intelligibility/comprehensibility, the end-of-semester
project tests for overall verbal communicative ability in the rehearsed
speech genre, in the form of a news magazine that is audio-taped for assessment
and review. The news magazine may be presented either as a collage of items of
news and views of events and activities on campus, or thematically, e.g. sports on
campus, cultural activities, student problems, etc.
Level II
Group discussion
Public speaking
Debates
End-of-semester project
In the second level the group discussion test is administered early in the
semester, and the results are used to determine how much more practice is needed in
improving interaction skills before proceeding to the more formal performance-oriented
speech genres. The topics for the group discussion in the second level
are of a more controversial nature than in the first. Although cognitive load is
The debate is placed at the end of the semester and is usually viewed by the
students as a finale of sorts to their oral communication skills. As with the
public speaking test, topics and teams (for the debates) are made known well in
advance, and students work on the topics cooperatively for the latter. The
backwash effect on the acquisition of social and study skills is tremendous, as
students are informed that ratings reflect group effort in the debating process.
Both tests 2 and 3 are rated immediately and video-taped for both review and
record purposes.
Rating scales have been constructed for all the tests developed. A look at
the criteria and the rating scales (see appendices) for the various tests discussed
above shows that the criteria for each test vary, although some (mainly
linguistic) recur, as each test samples different types of communicative ability.
Working over a period of time (i.e. two years = four semesters), it has been
possible to specify what criteria should be used to rate each test and therefore
what sorts of rating scales to produce. It has also been possible to select specific
components from the broader criteria identified for each rating scale. In this
sense each test has evolved pedagogically (mainly) and psychologically over a
period of time to become more comprehensive in terms of the test (task)
specifications. Feedback in the form of student responses (and reactions) to each
task has also helped the tests to jell, as it was used to make changes, especially
to the criteria and subsequently the rating scales, so as to reflect a wider possible
range of responses for each test.
Obviously, comprehensiveness of criteria should not be at the expense of the
feasibility of rating scales and the practicality of scoring procedures. Too many
descriptors can make it difficult for a rater to evaluate the performance in any
one task. Using all these simultaneously to make an immediate judgement is no easy task.
CONCLUSION
Test Anxiety
A certain amount of anxiety has been removed from the testing situations in
the course, firstly because of the ongoing nature of the assessments and secondly
because of the wider sampling of the speech genres.
'There is ... evidence in the literature that the format of a task can unduly
affect the performance of some candidates. This makes it necessary to
include a variety of test formats for assessing each construct... In this
Practitioners know that not only do levels of test anxiety vary from situation
to situation and from testee to testee, but it may not even be possible to eliminate
anxiety as an affective variable. However, in order to further reduce test anxiety
and to 'bias for best', students are informed at the beginning of each level about
course objectives and expectations, and test types and task specifications are explained.
Feedback is also provided after each test, although actual scores obtained are not
divulged.
Other Matters
All tests of courses on the university curriculum (cumulative or otherwise)
are seen as achievement tests, with scores and grades awarded accordingly.
There is a certain amount of tension between rating according to specified
criteria and the subsequent conversion of the weightage of the components of
these criteria into scores. However, despite this constraint it is still possible to
speak of a student's profile of performance in the oral communication class from
level to level. At the end of the second year similar judgements can be made of
them as potential students for the BA in English Studies.
The oral communication course has also been offered more recently as an
elective to other students and therefore involves more teachers. While the
REFERENCES
Alderson, C. 1981. Report of the Discussion on Communicative Language Testing.
Clark, J. L. D. 1975.
Gefen, R. 1987. "Oral Testing - The Battery Approach". Forum 25.2: 24-27.
Krashen, S. and Terrell, T. 1983. The Natural Approach: Language Acquisition in the Classroom.
Skehan, P. 1988. Language Testing: Part I, 1-10; Language Testing: Part II, 1-13. Language Teaching Abstracts.
Shohamy, E. 1988. "A Proposed Framework for Testing the Oral Language of Second/Foreign Language Learners". Studies in Second Language Acquisition 10.2: 165-180.
INTRODUCTION
will be less obtrusive and will provide constant advice to the learners and
teachers. In this paper we shall attempt to explain how CAT works and what is
the underlying theory. The various steps involved in implementing an adaptive
test will be described with examples from a placement test that we have
developed in French.
CAT takes full advantage of these two properties of the computer. Let us
suppose we want to assign a student to a group that would suit his/her needs by
means of a conventional placement test.
We do not know a priori at which level the student could be placed; he/she
could be an absolute beginner in the language or an "educated native". In this
case, the test should probably include some difficult items, as well as some easy
ones. In fact, given the student's level, how many of the items of a two-hour test
are relevant? Probably less than 25%. Some of the items will be too easy,
particularly if the student is at an advanced level. From the student's point of
view, those items are boring and unchallenging; from the psychometric point of view,
they do not bring valuable information because the outcome is too predictable.
On the other hand, some items will be too difficult, particularly for beginners,
who will feel frustrated because they find that the test is "over their heads";
again, there is very little information on the student's level that can be drawn
from these items.
Adaptive testing has also been called "tailored testing" because it aims at
presenting items that suit the students' competence and that are informative. In
an open-ended test, this means items on which the chance of answering correctly will
be approximately fifty/fifty. This approach to testing problems might bring to
mind Binet's multi-stage intelligence tests, which were developed at the beginning
of the century. For language teachers, it may also resemble recent oral interview
procedures in which the examiner is encouraged to adapt the exchange to the
examinees' performance (Educational Testing Service 1985).
Adjusting the test is in fact a complex process that CAT seeks to replicate.
For this task, we need:
an item bank and a selection procedure. To model these, the most widely used
theory is Item Response Theory (IRT). Despite its mathematical complexity, IRT is
conceptually attractive and very interesting for CAT. The theory was first
labeled "latent trait theory" by Birnbaum (1968) because it assumes that a test
measures an underlying, unobservable trait. The ability scale is expressed in terms
of standard deviations and ranges from roughly -3 to +3. Figure 1 shows the
curve for an "Intermediate" level item. The inflection point of this ICC is around
0 which corresponds to the sample mean. Since the subject's ability and the item
difficulty are expressed on the same scale, we say that the difficulty of the item
(the parameter b) is 0.
Figure 1: Item Characteristic Curve (the plot marks the item's difficulty on the ability scale and a guessing floor of about 0.2)
If an item clearly separates the advanced students from the beginners the
curve should be very steep; if it does not, the curve will be flatter. In other
words, the slope of the ICC corresponds to the discrimination (the parameter a).
An item with a discrimination index of 1 or more is a very good item. Finally, we
see that, in this particular case, the curve will never reach the bottom line. This
is due to the fact that the item is a multiple choice question which involves some
guessing.
This is expressed with a third parameter (parameter c). A m/c item with
five options should have a c around .2. Of course, in reality, such a regular curve
is never found. The degree to which the data for an item conforms to an ICC is
the "item fit". Misfitting items should be rejected.
Once the parameters are known, we can precisely draw the ICC using the
basic IRT formula:

    P_i(θ) = c_i + (1 - c_i) / (1 + e^{-D a_i (θ - b_i)})
where θ (theta) represents the subject's ability and D is a scaling constant set
at 1.7. A simpler formula for a less complex but generally less accurate model
has been proposed by G. Rasch (1960). The Rasch model is a one-parameter
model; it assumes that there is no guessing and that all the items discriminate
equally. Under this model, only the difficulty has to be estimated.
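As a concrete illustration, here is a minimal sketch in Python of the three-parameter formula above (our own illustration, not part of any published test software; the item parameter values in the example are invented). The Rasch model is obtained by fixing a at 1 and c at 0.

    import math

    D = 1.7  # scaling constant mentioned above

    def p_correct(theta, a, b, c):
        """3PL probability of a right answer for ability theta,
        discrimination a, difficulty b and guessing parameter c."""
        return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

    # The "Intermediate" item of Figure 1: difficulty b = 0; a five-option
    # multiple-choice item, so the guessing floor c is around .2.
    print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.2))   # 0.6 at the item's difficulty
    print(p_correct(theta=-3.0, a=1.0, b=0.0, c=0.2))  # close to the guessing floor c

At θ = b the probability is c + (1 - c)/2, which is why the inflection point of the ICC in Figure 1 sits at the item's difficulty.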
- Planning the bank: Are we measuring more than one common trait? If
so, then several item banks should be set up. At this stage, we must also
make sure that the items can be administered, answered and marked both
with a "paper-and-pencil" format and with a computerized version. Since
field testing is expensive, a great deal of attention must be paid to the
wording of the items. For large item banks, several versions using
"anchor items" will be necessary.
- Field testing and item analysis: The items will be tried out on a small
sample - 100 to 200 subjects. Classical item analysis using proportions of
correct answers and correlations is helpful in order to eliminate bad items
from the next version. At this stage, some dimensionality analysis can be
conducted to make sure the test (or sub-test) is measuring a single trait.
- Updating the bank: new items may be added, some others deleted. The
user should also be able to browse in the bank and modify an item
without having to rewrite it.
- Importing items: When a set of items is located in another file, there
should be provision to execute a mass transfer into the bank.
- Listing the items: Each item can be seen individually on the screen.
Yet the user can also call up a list of the items. Each line will show the
code of an item, its parameters, and a cue to recall the question.
In addition, our system calculates the "Match index". According to Lord
(1970), this value corresponds to the ability at which the item is the most
efficient.
- Obtaining the item information: Under IRT, one can tell how much
information can be obtained at different points of the ability scale. As
information accumulates at a specific ability point, the estimate at that
point becomes increasingly reliable.
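A rough sketch of this idea, using the standard 3PL information function and the p_correct function defined earlier (again our own illustration, not the system's code): test information is the sum of item informations, the standard error of the ability estimate is its inverse square root, and the ability at which an item is most efficient, in the spirit of the "Match index" above, can be located by a simple scan of the scale.

    import math

    D = 1.7

    def p_correct(theta, a, b, c):
        return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

    def item_information(theta, a, b, c):
        """Information contributed by a 3PL item at ability theta."""
        p = p_correct(theta, a, b, c)
        return (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

    def standard_error(theta, items):
        """items: list of (a, b, c) tuples already administered."""
        return 1.0 / math.sqrt(sum(item_information(theta, a, b, c) for a, b, c in items))

    def most_efficient_ability(a, b, c):
        """Ability point (scanned over -3..+3) at which the item is most informative."""
        grid = [i / 100.0 for i in range(-300, 301)]
        return max(grid, key=lambda t: item_information(t, a, b, c))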
construct exists, even though a more refined evaluation should probably divide
this general competence into various components such as grammatical
competence, discourse competence or sociolinguistic competence
(Canale and Swain 1980). The format of the test is affected by the medium, the
micro-computer. The three sub-tests contain multiple-choice items because we
want to minimize the use of the keyboard and because open-ended answers are
too unpredictable to be properly processed in this type of test. The organization
and the content of the test also reflect the fact that we had to comply with IRT
requirements.
At the beginning of the test, the student will be asked some information about his/her
background in the second/foreign language:
How many years did the student study the language?
Did he/she ever live in an environment where this language is spoken?
If so, how long ago?
Then the program prompts the student to rate his/her general proficiency
level on a seven-category scale ranging from "Beginner" to "Very advanced". All
this information is used to obtain a preliminary estimation that will be used for
the selection of the first item of the first sub-test. Tung (1986) has shown that
the more precise the preliminary estimation, the more efficient the adaptive
test.
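How the background answers and the self-rating are combined is not spelled out here; purely as a hypothetical illustration (the function, its weights and its clamping are all our assumptions), a starting value on the -3 to +3 scale could be derived mainly from the seven-category self-rating and nudged by the background answers.

    def preliminary_theta(self_rating, years_of_study, lived_in_l2_environment):
        """Hypothetical starting estimate on the -3..+3 scale.
        self_rating: 1 ("Beginner") to 7 ("Very advanced")."""
        theta = -3.0 + (self_rating - 1) * 1.0        # spread the 7 categories over -3..+3
        theta += min(years_of_study, 4) * 0.1         # small credit for years of study
        if lived_in_l2_environment:
            theta += 0.5
        return min(3.0, max(-3.0, theta))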
The first sub-test contains short paragraphs followed by an m/c question to
measure the student's comprehension. According to Jafarpur (1987), this "short
context technique" is a good way to measure general proficiency. Figure 2
illustrates how the adaptive procedure works. At the beginning of the sub-test,
after an example and an explanation, an item with a difficulty index close to the
preliminary estimation is submitted.
Figure 2: Sample run of the adaptive procedure (columns: item code, response record, theta estimate, information, error)
In the example, the first item was failed (U = 0) and the program then
selected an easier one. When at least one right and one wrong answer have been
obtained, the program uses a more refined procedure to calculate the student's
ability. The next item will be one which has not been presented as yet and that is
the closest to the new estimation. The procedure goes on until the pre-set
threshold of information is reached. When this quantity of information is
attained, the measure is precise enough and the program switches to the next
sub-test.
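The loop just described can be condensed into the following sketch (our reconstruction for illustration only, not the actual program): pick the unused item whose difficulty is closest to the current estimate, re-estimate the ability after each response (here with a crude maximum-likelihood grid search once both a right and a wrong answer exist), and stop when the accumulated information reaches a pre-set threshold. The bank format, the ask() callback and the threshold value are assumptions made for the example; p_correct and item_information are the functions sketched earlier, repeated implicitly here.

    def estimate_theta(responses):
        """Crude maximum-likelihood estimate by grid search over the -3..+3 scale.
        responses: list of ((a, b, c), u) with u = 1 for right, 0 for wrong."""
        def loglik(t):
            return sum(math.log(p_correct(t, a, b, c)) if u
                       else math.log(1.0 - p_correct(t, a, b, c))
                       for (a, b, c), u in responses)
        return max((i / 20.0 for i in range(-60, 61)), key=loglik)

    def run_adaptive_test(bank, ask, preliminary_theta, info_threshold=10.0):
        """bank: {item_code: (a, b, c)}; ask(item_code) returns 1 (right) or 0 (wrong)."""
        theta, responses, remaining = preliminary_theta, [], dict(bank)
        while remaining:
            # next item: not yet presented, difficulty closest to the current estimate
            code = min(remaining, key=lambda k: abs(remaining[k][1] - theta))
            params = remaining.pop(code)
            responses.append((params, ask(code)))
            rights = [u for _, u in responses]
            if any(rights) and not all(rights):
                theta = estimate_theta(responses)      # the "more refined procedure"
            else:                                      # all right or all wrong so far
                theta = min(3.0, max(-3.0, theta + (0.5 if rights[-1] else -0.5)))
            info = sum(item_information(theta, a, b, c) for (a, b, c), _ in responses)
            if info >= info_threshold:                 # measure precise enough: stop
                break
        return theta

In practice the grid search would be replaced by a Newton-Raphson maximum-likelihood routine or a Bayesian update, and the information threshold would be tuned to the standard error one is prepared to accept.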
The same procedure is used for the second part with the estimation from
the previous sub-test as a starting value. On the second sub-test, a situation is
presented in English and followed by four grammatically correct statements in
French. The student must select the one that is the most appropriate from a
semantic and sociolinguistic point of view. Raffaldini (1988) found this type of
situational test a valuable addition to a measure of proficiency. Once we
have obtained sufficient information, the program switches to the third sub-test,
which is a traditional fill-the-gap exercise. This format is found on most of the
current standardized tests and is a reliable measure of lexical and grammatical
aspects of the language. Immediately after the last sub-test, the result will
appear on the screen.
Since a normal curve deviate is meaningless for a student, the result will be
expressed as one of the fourteen labels or strata that the test recognizes along
the ability range: "Absolute beginner, Absolute beginner +, Almost beginner ...
Very advanced +".
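Purely as an illustration of that last step (the program's actual cut-offs are not given here), the estimate could be mapped onto fourteen strata of equal width across the -3 to +3 scale; the label list below is a placeholder rather than a reproduction of the test's own wording.

    def label_for(theta, labels, lo=-3.0, hi=3.0):
        """Map an ability estimate to one of len(labels) equal-width strata.
        Equal widths are an assumption; the test's real cut-offs may differ."""
        width = (hi - lo) / len(labels)
        index = min(len(labels) - 1, max(0, int((theta - lo) / width)))
        return labels[index]

    # fourteen_labels would hold the strata named in the text, from
    # "Absolute beginner" up to "Very advanced +":
    # label_for(-2.8, fourteen_labels) -> the first label; label_for(2.9, ...) -> the last.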
Both the students and the program administrators appreciate that the result
is given right away. The student receives immediate feedback on what he/she
did, and the result can be kept confidential. Since there are no markers, there are
no marking costs, and the adaptive test requires only about
40% of the items of the equivalent conventional test. Finally, the adaptive
procedure means that the student is constantly faced with a realistic challenge:
the items are never too difficult or too easy. This means less frustration,
particularly for beginners. With a more sophisticated instrument than the one
we designed, one could even find other positive aspects of CAT. For example,
with IRT it is possible to recognize misfitting subjects or inappropriate patterns
itself and because of the psychometric model. With the combination of the
present technology and IRT, it is hard to imagine how a test could use anything
other than m/c items or very predictable questions. The medium, the computer,
not only affects the type of answers but also the content of the test. In our test,
the sample required for parameter calibration may be too large. Using a Rasch model may help to
reduce the sample size, usually at the expense of the model fit, but the field
testing will always be very demanding. Therefore, CAT is certainly not
applicable to small-scale testing.
A further requirement (the assumption of local independence) is that an answer
to one item should never affect the probability of getting a right
answer on another item. Cloze tests usually do not meet this requirement
because finding a correct word in a context increases the chance of finding the
next word.
Finally, when all the theoretical problems have been solved, some practical
problems may arise. For example, for many institutions the cost of the
development and implementation of an adaptive test could be too high. Madsen
(1986) studied students' attitudes and anxiety towards a computerized test;
attention must be paid to these affective effects.
CONCLUSION
for some integrative tests of receptive skills, particularly if the result will not
affect the student's future or can be complemented with more direct measures.
In short, a CAT will always be a CAT; it will never be a watchdog.
NOTES
1
REFERENCES
Assessment Systems Corporation (1984) User's manual for the MicroCAT testing
system. St. Paul, MN.
Baker F.B. (1985) The basics of item response theory. Portsmouth, NH.
Birnbaum A. (1968) Some latent trait models and their use in inferring an
examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental
test scores. Reading, MA: Addison-Wesley.
Bunderson C.V., Inouye D.K. & Olsen J.B. (1989) The four generations of
computerized educational measurement. In R.L. Linn (Ed.) Educational
Measurement, 3rd ed. (pp. 367-408). New York: American Council on
Education - Macmillan Publishing.
Dandonelli P. & Rumizen M. (1989) There's more than one way to skin a CAT:
Development of a computer-adaptive French test in reading. Paper presented
at the CALICO Conference, Colorado Springs, CO.
Educational Testing Service (1985) The ETS Oral Interview Book. Princeton, NJ.
Henning G. (1986) Item banking via DBase II: the UCLA ESL proficiency
examination experience. In C. Stansfield (Ed.) Technology and language
testing (pp. 69-78). Washington, D.C.: TESOL.
Henning G., Hudson T. & Turner J. (1985) Item response theory and the
assumption of unidimensionality for language tests. Language Testing, 2,
141-154.
Lord F.M. (1977) Practical applications of item characteristic curve theory. Journal
of Educational Measurement, 14, 117-138.
Rasch G. (1960) Probabilistic models for some intelligence and attainment tests.
Copenhagen: Danish Institute for Educational Research.
LIST OF CONTRIBUTORS
J Charles Alderson
Department of Linguistics &
Modern Languages
University of Lancaster
Bailrigg, Lancaster LA1 4YW
United Kingdom
Geoff Brindley
National Centre for English
Language Teaching & Research
Macquarie University
North Ryde, NSW 2109
Australia
Peter Doye
Technische Universität Braunschweig
Seminar für Englische und
Französische Sprache
Bültenweg 74/75
3300 Braunschweig
Germany
David E Ingram
Centre for Applied
Linguistics & Languages
Griffith University
Nathan, Queensland 4111
Australia
Michel Laurier
National Language Testing Project
Department of French
Carleton University
Ottawa, Ontario K1S 5B6
Canada
Liz Hamp-Lyons
Sheila B Prochnow
English Language Institute
3001 North University Building
University of Michigan
Ann Arbor, Michigan 48109-1057
USA
T F McNamara
Language Testing Unit
Department of Linguistics
& Language Studies
University of Melbourne
Parkville, Victoria 3052
Australia
Michael Milanovic
Evaluation Unit
University of Cambridge
Local Examinations Syndicate
Syndicate Buildings
1 Hills Road
Cambridge CB1 2EU
United Kingdom
Keith Morrow
Bell Education Trust
Hillscross, Red Cross Lane
Cambridge CB2 2QX
United Kingdom
John W Oller Jr
Department of Linguistics
University of New Mexico
Albuquerque, NM 87131
USA
Don Porter
Centre for Applied Linguistics
University of Reading
Whiteknights, Reading RG6 2AA
United Kingdom
John Read
English Language Institute
Victoria University of Wellington
P 0 Box 600, Wellington
New Zealand
Charles W Stansfield
Division of Foreign Language
Education & Testing
Center for Applied Linguistics
1118 22nd Street, N W
Washington DC 20037
USA
Shanta Nair-Venugopal
English Department
Language Centre
Universiti Kebangsaan Malaysia
Bangi, Selangor Darul Ehsan 43600
Malaysia