
Language tests at school

1979


LONGMAN GROUP LIMITED London
Associated companies, branches and representatives throughout the world

© Longman Group Ltd. 1979

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the Copyright owner.

First published 1979
ISBN 0-582-55365-2
ISBN 0-582-55294-X Pbk

Printed in Great Britain by Butler & Tanner Ltd, Frome and London.

Oller, John W
Language tests at school. - (Applied linguistics and language study.)
1. Language and languages - Ability testing
I. Title II. Series
407'.6 P53 78-41005

ACKNOWLEDGEMENTS

We are grateful to the following for permission to reproduce copyright material: Longman Group Ltd., for an extract from Bridge Series: Oliver Twist edited by Latif Doss; the author, John Pickford, for his review of 'The Scandaroon' by Henry Williamson from Bookcase, broadcast on the BBC World Service January 6th 1973, read by John Pickford; Science Research Associates Inc., for extracts from 'A Pig Can Jig' by Donald Rasmussen and Lynn Goldberg in The SRA Reading Program - Level A Basic Reading Series © 1964, 1970 Donald E. Rasmussen and Lenina Goldberg, reprinted by permission of the publisher Science Research Associates Inc.; Board of Education of the City of New York, from 'New York City Language Assessment Battery'; reproduced from the Bilingual Syntax Measure by permission, copyright © 1975 by Harcourt Brace Jovanovich, Inc., all rights reserved; Center for Bilingual Education, Northwest Regional Educational Laboratory, from 'Oral Language Tests for Bilingual Students'; McGraw-Hill Book Company, from 'Testing English as a Second Language' by Harris; McGraw-Hill Book Company, from 'Language Testing' by Lado; Language Learning (North University Building), from 'Problems in Foreign Language Testing'; Newbury House Publishers, from 'Oral Interview' by Ilyin; 'James Language Dominance Test', copyright 1974 by Peter James, published by Teaching Resources Corporation, Boston, Massachusetts, U.S.A.; 'Black American Cultural Attitude Scale', copyright 1973 by Perry Alan Zirkel, Ph.D., published by Teaching Resources Corporation, Boston, Massachusetts, U.S.A.

Contents

Chapter 1 Introduction
A. What is a language test?
B. What is language testing research about?
C. Organization of this book
Key Points
Discussion Questions
Suggested Readings

PART ONE THEORY AND RESEARCH BASES FOR PRAGMATIC LANGUAGE TESTING

Chapter 2 Language Skill as a Pragmatic Expectancy Grammar
A. What is pragmatics about?
B. The factive aspect of language use
C. The emotive aspect
D. Language learning as grammar construction and modification
E. Tests that invoke the learner's grammar
Key Points
Discussion Questions
Suggested Readings

Chapter 3 Discrete Point, Integrative, or Pragmatic Tests
A. Discrete point versus integrative testing
B. A definition of pragmatic tests
C. Dictation and cloze procedure as examples of pragmatic tests
D. Other examples of pragmatic tests
E. Research on the validity of pragmatic tests
   1. The meaning of correlation
   2. Correlations between different language tests
   3. Error analysis as an independent source of validity data
Key Points
Discussion Questions
Suggested Readings

Chapter 4 Multilingual Assessment
A. Need
B. Multilingualism versus multidialectalism
C. Factive and emotive aspects of multilingualism
D. On test biases
E. Translating tests or items
F. Dominance and proficiency
G. Tentative suggestions
Key Points
Discussion Questions
Suggested Readings

Chapter 5 Measuring Attitudes and Motivations
A. The need for validating affective measures
B. Hypothesized relationships between affective variables and the use and learning of language
C. Direct and indirect measures of affect
D. Observed relationships to achievement and remaining puzzles
Key Points
Discussion Questions
Suggested Readings

PART TWO THEORIES AND METHODS OF DISCRETE POINT TESTING

Chapter 6 Syntactic Linguistics as a Source for Discrete Point Methods
A. From theory to practice, exclusively?
B. Meaning-less structural analysis
C. Pattern drills without meaning
D. From discrete point teaching to discrete point testing
E. Contrastive linguistics
F. Discrete elements of discrete aspects of discrete components of discrete skills - a problem of numbers
Key Points
Discussion Questions
Suggested Readings

Chapter 7 Statistical Traps
A. Sampling theory and test construction
B. Two common misinterpretations of correlations
C. Statistical procedures as the final criterion for item selection
D. Referencing tests against nonnative performance
Key Points
Discussion Questions
Suggested Readings

Chapter 8 Discrete Point Tests
A. What they attempt to do
B. Theoretical problems in isolating pieces of a system
C. Examples of discrete point items
D. A proposed reconciliation with pragmatic testing theory
Key Points
Discussion Questions
Suggested Readings

Chapter 9 Multiple Choice Tests
A. Is there any other way to ask a question?
B. Discrete point and integrative multiple choice tests
C. About writing items
D. Item analysis and its interpretation
E. Minimal recommended steps for multiple choice test preparation
F. On the instructional value of multiple choice tests
Key Points
Discussion Questions
Suggested Readings

PART THREE PRACTICAL RECOMMENDATIONS FOR LANGUAGE TESTING

Chapter 10 Dictation and Closely Related Auditory Tasks
A. Which dictation and other auditory tasks are pragmatic?
B. What makes dictation work?
C. How can dictation be done?
Key Points
Discussion Questions
Suggested Readings

Chapter 11 Tests of Productive Oral Communication
A. Prerequisites for pragmatic speaking tests
B. The Bilingual Syntax Measure
C. The Ilyin Oral Interview and the Upshur Oral Communication Test
D. The Foreign Service Institute Oral Interview
E. Other pragmatic speaking tasks
Key Points
Discussion Questions
Suggested Readings

Chapter 12 Varieties of Cloze Procedure
A. What is the cloze procedure?
B. Cloze tests as pragmatic tasks
C. Applications of cloze procedure
D. How to make and use cloze tests
Key Points
Discussion Questions
Suggested Readings

Chapter 13 Essays and Related Writing Tasks
A. Why essays?
B. Examples of pragmatic writing tasks
C. Scoring for conformity to correct prose
D. Rating content and organization
E. Interpreting protocols
Key Points
Discussion Questions
Suggested Readings

Chapter 14 Inventing New Tests in Relation to a Coherent Curriculum
A. Why language skills in a school curriculum?
B. The ultimate problem of test validity
C. A model: the Mount Gravatt reading program
D. Guidelines and checks for new testing procedures
Key Points
Discussion Questions
Suggested Readings

Appendix The Factorial Structure of Language Proficiency: Divisible or Not?
A. Three empirically testable alternatives
B. The empirical method
C. Data from second language studies
D. The Carbondale Project, 1976-7
E. Data from first language studies
F. Directions for further empirical research

References
Index

LIST OF FIGURES

Figure 1 A cartoon drawing illustrating the style of the Bilingual Syntax Measure
Figure 2 A hypothetical view of the amount of variance in learning to be accounted for by emotive versus factive sorts of information
Figure 3 A dominance scale in relation to proficiency scales
Figure 4 Example of a Likert-type attitude scale intended for children
Figure 5 Componential breakdown of language proficiency proposed by Harris (1969, p. 11)
Figure 6 Componential analysis of language skills as a framework for test construction from Cooper (1968, 1972, p. 337)
Figure 7 'Language assessment domains' as defined by Silverman et al (1976, p. 21)
Figure 8 Schematic representation of constructs posited by a componential analysis of language skills based on discrete point test theory, from Oller (1976c, p. 150)
Figure 9 The ship/sheep contrast by Lado (1961, p. 57) and Harris (1969, p. 34)
Figure 10 The watching/washing contrast, Lado (1961, p. 57)
Figure 11 The pin/pen/pan contrast, Lado (1961, p. 58)
Figure 12 The ship/jeep/sheep contrast, Lado (1961, p. 58)
Figure 13 'Who is watching the dishes?' (Lado, 1961, p. 59)
Figure 14 Pictures from the James Language Dominance Test
Figure 15 Pictures from the New York City Language Assessment Battery, Listening and Speaking Subtest
Figure 16 Pictures 5, 6 and 7 from the Bilingual Syntax Measure
Figure 17 Sample pictures for the Orientation section of the Ilyin Oral Interview (1976)
Figure 18 Items from the Upshur Oral Communication Test
Figure 19 Some examples of visual closure - seeing the overall pattern or Gestalt

LIST OF TABLES

Table 1 Intercorrelations of the Part Scores on the Test of English as a Foreign Language Averaged over Forms Administered through April, 1967
Table 2 Intercorrelations of the Part Scores on the Test of English as a Foreign Language Averaged over Administrations from October, 1966 through June, 1971
Table 3 Response Frequency Distribution Example One
Table 4 Response Frequency Distribution Example Two
Table 5 Intercorrelations between Two Dictations, Spelling Scores on the Same Dictations, and Four Other Parts of the UCLA English as a Second Language Placement Examination Form 2D Administered to 145 Foreign Students in the Spring of 1971

TABLES IN THE APPENDIX

Table 1 Principal Components Analysis over Twenty-two Scores on Language Processing Tasks Requiring Listening, Speaking, Reading, and Writing as well as Specific Grammatical Decisions
Table 2 Varimax Rotated Solution for Twenty-two Language Scores
Table 3 Principal Components Analysis over Sixteen Listening Scores
Table 4 Varimax Rotated Solution for Sixteen Listening Scores
Table 5 Principal Components Analysis over Twenty-seven Speaking Scores
Table 6 Varimax Rotated Solution over Twenty-seven Speaking Scores
Table 7 Principal Components Solution over Twenty Reading Scores
Table 8 Varimax Rotated Solution over Twenty Reading Scores
Table 9 Principal Components Analysis over Eighteen Writing Scores
Table 10 Varimax Rotated Solution over Eighteen Writing Scores
Table 11 Principal Components Analysis over Twenty-three Grammar Scores
Table 12 Varimax Rotated Solution over Twenty-three Grammar Scores
Acknowledgements

George Miller once said that it is an 'ill-kept secret' that many people other than the immediate author are involved in the writing of a book. He said that the one he was prefacing was no exception, and neither is this one. I want to thank all of those many people who have contributed to the inspiration and energy required to compile, edit, write, and re-write many times the material contained in this book. I cannot begin to mention all of the colleagues, students, and friends who shared with me their time, talent, and patience along with all of the other little joys and minor agonies that go into the writing of a book. Neither can I mention the teachers of the extended classroom whose ideas have influenced the ones I have tried to express here. For all of them this is probably a good thing, because it is not likely that any of them would want to own some of the distillations of their ideas which find expression here. Allowing the same discretion to my closer mentors and collaborators, I do want to mention some of them. My debt to my father and to his Spanish program, published by Encyclopaedia Britannica Educational Corporation, will be obvious to anyone who has used and understood the pragmatic principles so well exemplified there. I also want to thank the staff at Educational Testing Service, and the other members of the Committee of Examiners for the Test of English as a Foreign Language, who were kind enough both to tolerate my vigorous criticisms of that test and to help fill in the many lacunae in my still limited understanding of the business of tests and measurement. I am especially indebted to the incisive thinking and challenging communications with John A. Upshur of the University of Michigan, who chaired that committee for three of the four years that I served on it. Alan Hudson of the University of New Mexico and Robert Gardner of the University of Western Ontario stimulated my interest in much of the material on attitudes and sociolinguistics which has found its way into this book. Similarly, doctoral research by Douglas Stevenson, Annabelle Scoon, Rose Ann Wallace, Frances Hinofotis, and Herta Teitelbaum has had a significant impact on recommendations contained here. Work in Alaska with Eskimo children by Virginia Streiff, in the African primary schools by John Hofman, in Papua New Guinea by Jonathon Anderson, in Australia by Norman Hart and Richard Walker, and in Canada and Europe by John McLeod has fundamentally influenced what is said concerning the testing of children.
In addition to the acknowledgements due to people for contributing to the thinking embodied here, I also feel a debt of gratitude towards those colleagues who indirectly contributed to the development of this book by making it possible to devote concentrated periods of time thinking and studying on the topics discussed here. In the spring of 1974, Russell Campbell, now Chairman of the TESL group at UCLA, contributed to the initial work on this text by inviting me to give a series of lectures on pragmatics and language testing at the American University, English Language Institute in Cairo, Egypt. Then, in the fall semester of 1975, a six week visit to the University of New Mexico by Richard Walker, Deputy Director of Mount Gravatt College of Advanced Education in Brisbane, Australia, resulted in a stimulating exchange with several centers of activity in the world where child language development is being seriously studied in relation to reading curricula. The possibility of developing tests to assess the suitability of reading materials to a given group of children and the discussion of the relation of language to curriculum in general is very much a product of the dialogue that ensued. More recently, the work on this book has been pushed on to completion thanks to a grant from the Center for English as a Second Language and the Department of Linguistics at Southern Illinois University in Carbondale. Without that financial support and the encouragement of Patricia Carrell, Charles Parish, Richard Daesch, Kyle Perkins, and others among the faculty and students there, it is doubtful that the work could have been completed. Finally, I want to thank the students in the testing courses at Southern Illinois University and at Concordia University in Montreal who read all or parts of the manuscript during various stages of development. Their comments, criticisms, and encouragement have sped the completion of the work and have improved the product immensely.

Preface

It is, in retrospect, with remarkable speed that the main principles and assumptions have become accepted of what can be called the teaching and learning of language as communication. Acceptance, of course, does not imply practical implementation, but distinctions between language as form and language as function, the meaning potential of language as discourse and the role of the learner as a negotiator of interpretations, the match to be made between the integrated skills of communicative actuality and the procedures of the classroom, among many others, have all been widely announced, although not yet adequately described. Nonetheless, we are certain enough of the plausibility of this orientation to teaching and learning to suggest types of exercise and pedagogic procedures which are attuned to these principles and assumptions. Courses are being designed, and textbooks written, with a communicative goal, even if, as experiments, they are necessarily partial in the selection of principles they choose to emphasise. Two matters are, however, conspicuously lacking. They are connected, in that the second is an ingredient of the first. Discussion of a communicative approach has been very largely concentrated on syllabus specifications and, to a lesser extent, on the design of exercise types, rather than on any coherent and consistent view of what we can call the communicative curriculum.
Rather than examine the necessary interdependence of the traditional curriculum components: purposes, methods and evaluations, from the standpoint of a communicative view of language and language learning, we have been happy to look at the components singly, and at some much more than others. Indeed, and this is the second matter, evaluation has hardly been looked at at all, either in terms of assessment of the communicative abilities of the learner or the efficacy of the programme he is following. There are many current examples, involving materials and methods aimed at developing communicative interaction among learners, which are preceded, interwoven with or followed by evaluation instruments totally at odds with the view of language taken by the materials and the practice with which they are connected. Partly because we have not taken the curricular view, and partly because we have directed our innovations towards animation rather than evaluation, teaching and testing are out of joint. Teachers are unwilling to adopt novel materials because they can see that they no longer emphasise exclusively the formal items of language structure which make up the 'psychometric-structuralist' (in Spolsky's phrase) tests another generation of applied linguists have urged them to use. Evaluators of programmes expect satisfaction in terms of this testing paradigm even at points within the programme when quite other aspects of communicative ability are being encouraged. Clearly, this is an undesirable and unproductive state of affairs. It is to these twin matters of communication and curriculum that John Oller's major contribution to the Applied Linguistics and Language Study Series is addressed. He poses two questions: how can language testing relate to a pragmatic view of language as communication, and how can language testing relate to educational measurement in general? Chapter 1 takes up both issues; in a new departure for this Series John Oller shows how language testing has general relevance for all educationalists, not just those concerned with language. Indeed, he hints here at a topic he takes up later, namely the linguistic basis of tests of intelligence, achievement and aptitude. Language testing, as a branch of applied linguistics, has cross-curricular relevance for the learner at school. The major emphasis, however, remains the connection to be made between evaluation, variable learner characteristics, and a psycho-socio-linguistic perspective on 'doing' language-based tasks. This latter perspective is the substance of the four Chapters in Part One of the book. Beginning from a definition of communicative proficiency in terms of 'accuracy' in a learner's 'expectancy grammar' (by which Oller refers to the learner's predictive competence in formal, functional and strategic terms) he proceeds to characterise communication as a functional, context-bound and culturally-specific use of language involving an integrated view of receptive and productive skills. It is against such a yardstick that he is able, both in Chapters 3 and 4 of Part One, and throughout Part Two, to offer a close, detailed and well-founded critical assessment of the theories and methods of discrete point testing. Such an approach to testing,
Oller concludes, is a natural corollary of a view of language as form and usage, rather than of process and use. If the view of language changes to one concerned with the communicative properties of language in use, then our ways of evaluating learners' competences to communicate must also change. In following Spolsky's shift towards the 'integrative-sociolinguistic' view of language testing, however, John Oller does not avoid the frequently-raised objection that although such tests gain in apparent validity, they do so at a cost of reliability in scoring and handling. The immensely valuable practical recommendations for pragmatically-orientated language tests in Part Three of the book constantly return to this objection, and show that it can be effectively countered. What is more, and this is a strong practical theme throughout the argument of the book, it is necessary to invoke a third criterion in language testing, that of classroom utility. Much discrete point testing, he argues, is not only premised on an untenable view of language for the teacher of communication, but in requiring time-consuming and often arcane pre-testing, statistical evaluation and rewriting techniques, poses quite impractical burdens on the classroom teacher. What is needed are effective testing procedures, linked to the needs of particular instructional programmes, reflecting a communicative view of language learning and teaching, but which are within the design and administrative powers of the teacher. Pragmatic tests must be reliable and valid: they need also to be practicable and to be accessible without presupposing technical expertise. If, as the examples in Part Three of the book show, they can be made to relate and be relevant to other subjects in the curriculum than merely language alone, then the evaluation of pragmatic and communicative competence has indeed cross-curricular significance. Finally, a word on the book's organisation; although it is lengthy, the book has clear-cut divisions: the Introduction in Chapter 1 provides an overview; Part One defines the requirements on pragmatic testing; Part Two defines and critically assesses current and overwhelmingly popular discrete point tests, and the concluding Part Three exemplifies and justifies, in practical and technical terms, the shape of alternative pragmatic tests. Each Chapter is completed by a list of Key Points and Discussion Questions, and Suggested Readings, thus providing the valuable and necessary working apparatus to sustain the extended and well-illustrated argument.

Christopher N Candlin, General Editor
Lancaster, July 1978

Author's Preface

A simple way to find out something about how well a person knows a language (or more than one language) is to ask him. Another is to hire a professional psychometrist to construct a more sophisticated test. Neither of these alternatives, however, is apt to satisfy the needs of the language teacher in the classroom nor any other educator, whether in a multilingual context or not. The first method is too subject to error, and the second is too complicated and expensive. Somewhere between the extreme simplicity of just asking and the development of standardized tests there ought to be reasonable procedures that the classroom teacher could use confidently. This book suggests that many such usable, practical classroom testing procedures exist, and it attempts to provide language teachers and educators in bilingual programs or other multilingual contexts access to those procedures. There are several textbooks about language testing already on the market.
All of them are intended primarily for teachers of foreign languages or English as a second language, and yet they are generally based on techniques of testing that were not developed for classroom purposes but for institutional standardized testing. The pioneering volume by Robert Lado, Language Testing (1961), the excellent book by Rebecca Valette, Modern Language Testing: A Handbook (1967), the equally useful book by David Harris, Testing English as a Second Language (1969), Foreign Language Testing: Theory and Practice (1972) by John Clark, Testing and Experimental Methods (1977) by J. P. B. Allen and Alan Davies, and Valette's 1977 revision of Modern Language Testing all rely heavily (though not exclusively) on techniques and methods of constructing multiple choice tests developed to serve the needs of mass production. Further, the books and manuals oriented toward multilingual education, such as Oral Language Tests for Bilingual Students (1976) by R. Silverman, et al., are typically aimed at standardized published tests. It would seem that all of the previously published books attempt to address classroom needs for assessing proficiency in one or more languages by extending to the classroom the techniques of standardized testing. The typical test format discussed is generally the multiple-choice discrete-item type. However, such techniques are difficult and often impracticable for classroom use. While Valette, Harris, and Clark briefly discuss some of the so-called 'integrative' tests like dictation (especially Valette), composition, and oral interview, for the most part they concentrate on the complex tasks of writing, pre-testing, statistically evaluating, and re-writing discrete point multiple-choice items. The emphasis is reversed in this book. We concentrate here on pragmatic testing procedures which generally do not require pretesting, statistical evaluation, or re-writing before they can be applied in the classroom or some other educational context. Such tests can be shown to be as appropriate to monolingual contexts as they are to multilingual and multicultural educational settings.¹ Most teachers, whether in a foreign language classroom or in a multilingual school, do not have the time nor the technical background necessary for multiple choice test development, much less for the work that goes into the standardization of such tests. Therefore, this book focusses on how to make, give, and evaluate valid and reliable language tests of a pragmatic sort. Theoretical and empirical reasons are given, however, to establish the practical foundation and to show why teachers and educators can confidently use the recommended testing procedures without a great deal of prerequisite technical training. Although such training is desirable in its own right, and is essential to the researcher in psychometry, psycholinguistics, sociolinguistics, or education per se, this book is meant as a handbook for those many teachers and educators who do not have the time to master fully (even if that were possible) the highly technical fields of statistics, research design, and applied linguistics. The book is addressed to educators at the consumer end of educational research. It tries to provide practical information without presupposing technical expertise. Practical examples of testing procedures are given wherever they are appropriate.
¹ Since it is believed that a multilingual context is normally also multicultural, and since it is also the case that language and culture are mutually dependent and inseparable, the term 'multilingual' is often used as an abbreviation for the longer term 'multilingual-multicultural' in spite of the fact that the latter term is often preferred by many authors these days.

The main criterion of success for the book is whether or not it is useful to educators. If it also serves some psychometrists, linguists and researchers, so much the better. It is hoped that it will fill an important gap in the resources available to language teachers and educators in multilingual and in monolingual contexts. Thoughtful suggestions and criticism are invited and will be seriously weighed in the preparation of future editions or revisions.

John Oller
Albuquerque, New Mexico, 1979

1 Introduction

A. What is a language test?
B. What is language testing research about?
C. Organization of this book

This introduction discusses language testing in relation to education in general. It is demonstrated that many tests which are not traditionally thought of as language tests may actually be tests of language more than of anything else. Further, it is claimed that this is true both for students who are learning the language of the school as a second language, and for students who are native speakers of the language used at school. The correctness of this view is not a matter that can be decided by preferences. It is an empirical issue, and substantiating evidence is presented throughout this book. The main point of this chapter is to survey the overall topography of the crucial issues and to consider some of the angles from which the salient points of interest can be seen in bold relief. The overall organization of the rest of the book is also presented here.

A. What is a language test?

When the term 'language test' is mentioned, most people probably have visions of students in a foreign language classroom poring over a written examination. This interpretation of the term is likely because most educated persons and most educators have had such an experience at one time or another. Though only a few may have really learned a second language (and practically none of them in a classroom context), many have at least made some attempt. For them, a language test is a device that tries to assess how much has been learned in a foreign language course, or some part of a course. But written examinations in foreign language classrooms are only one of the many forms that language tests take in the schools. For any student whose native language or language variety is not used in the schools, many tests not traditionally thought of as language tests may be primarily tests of language ability. For learners who speak a minority language, whether it is Spanish, Italian, German, Navajo, Digueno, Black English, or whatever language, coping with any test in the school context may be largely a matter of coping with a language test. There are, moreover, language aspects to tests in general even for the student who is a native speaker of the language of the test. In one way or another, practically every kind of significant testing of human beings depends on a surreptitious test of ability to use a particular language. Consider the fact that the psychological construct of 'intelligence' or IQ, at least insofar as it can be measured, may be no more than language proficiency. In any case, substantial research (see Oller and Perkins, 1978) indicates that language ability probably accounts for the lion's share of variability in IQ tests. It remains to be proved that there is some unique and meaningful variability that can be associated with other aspects of intelligence once the language factor has been removed. And, conversely, it is apparently the case that the bulk of variability in language proficiency test scores is indistinguishable from the variability produced by IQ tests. In Chapter 3 we will return to consider what meaningful factors language skill might consist of (also see the Appendix on this question). It has not yet been shown conclusively that there are unique portions of variability in language test scores that are not attributable to a general intelligence factor.
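The quantitative sense of 'accounting for variability' can be made explicit: in correlational terms, the proportion of variance one test accounts for in another is the square of their correlation. The figures below are a minimal illustrative sketch only, and are not taken from Oller and Perkins (1978):

```python
# Illustrative only: the proportion of variance one measure accounts
# for in another is the square of their correlation, so even a strong
# correlation leaves some variance unexplained.
r = 0.80                  # hypothetical correlation between a language test and an IQ test
shared_variance = r ** 2  # proportion of overlapping variance
print(f"shared variance: {shared_variance:.0%}")  # prints: shared variance: 64%
```

On such hypothetical figures, 64 per cent of the variability in the IQ scores would be predictable from the language test alone, which is the sense in which language ability could claim the 'lion's share' of IQ variance.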
Psycholinguistic and sociolinguistic research rely on language test results in obvious ways. As Upshur (1976) has pointed out, so does research on the nature of psychologically real grammars. Whether the criterion measure is a reaction time to a heard stimulus, the accuracy of attempts to produce, interpret, or recall a certain type of verbal material, or the amount of time required to do so, or a mark on a scale indicating agreement or disagreement with a given statement, language proficiency (even if it is not the object of interest per se) is probably a major factor in the design, and may often be the major factor. The renowned Russian psychologist, A. R. Luria (1959), has argued that even motor tasks as simple as pushing or not pushing a button in response to a red or green light may be rather directly related to language skill in very young children. On a much broader scale, achievement testing may be much more a problem of language testing than is commonly thought. For all of these reasons the problems of language testing are a large subset of the problems of educational measurement in general. The methods and findings of language testing research are of crucial importance to research concerning psychologically real grammatical systems, and to all other areas that must of necessity make assumptions about the nature of such systems. Hence, all areas of educational measurement are either directly or indirectly implicated. Neither intelligence measurement, achievement testing, aptitude assessment, nor personality gauging, attitude measurement, or just plain classroom evaluation can be done without language testing. It therefore seems reasonable to suggest that educational testing in general can be done better if it takes the findings of language testing research into account.

B. What is language testing research about?

In general, the subject matter of language testing research is the use and learning of language. Within educational contexts, the domain of foreign language teaching is a special case of interest. Multilingual delivery of curricula is another very important case of interest. However, the phenomena of interest to research in language testing are yet more pervasive. Pertinent questions of the broader domain include:

(1) How can levels of language proficiency, stages of language acquisition (first or second), degrees of bilingualism, or language competence be defined?
(2) How are earlier stages of language learning different from later stages, and how can known or hypothesized differences be demonstrated by testing procedures?
(3) How can the effects of instructional programs or techniques (or environmental changes in general) be demonstrated empirically?
(4) How are levels of language proficiency and the concomitant social interactions that they allow or deny related to the acquisition of knowledge in an educational setting?

This is not to say that these questions have been or even will be answered by language testing research, but that they are indicative of some of the kinds of issues that such research is in a unique position to grapple with. Three angles of approach can be discerned in the literature on language testing research. First, language tests may be examined as tests per se. Second, it is possible to investigate learner characteristics using language tests as elicitation procedures. Third, specific hypotheses about psycholinguistic and sociolinguistic factors in the performance of language based tasks may be investigated using language tests as research tools. It is important to note that regarding a subject matter from one angle rather than another does not change the nature of the subject matter, and neither does it ensure that what can be seen from one angle will be interpretable fully without recourse to other available vantage points. The fact is that in language testing research, it is never actually possible to decide to investigate test characteristics, or learner traits, or psycholinguistic and sociolinguistic constraints on test materials without making important assumptions about all three, regardless of which happens to be in focus at the moment. In this book, we will be concerned with the findings of research from all three angles of approach. When the focus is on the tests themselves, questions of validity, reliability, practicality, and instructional value will be considered. The validity of a test is related to how well the test does what it is supposed to do, namely, to inform us about the examinee's progress toward some goal in a curriculum or course of study, or to differentiate levels of ability among various examinees on some task. Validity questions are about what a test actually measures in relation to what it is supposed to measure. The reliability of a test is a matter of how consistently it produces similar results on different occasions under similar circumstances. Questions of reliability have to do with how consistently a test does what it is supposed to do, and thus cannot be strictly separated from validity questions. Moreover, a test cannot be any more valid than it is reliable. A test's practicality must be determined in relation to the cost in terms of materials, time, and effort that it requires. This must include the preparation, administration, scoring, and interpretation of the test. Finally, the instructional value of a test pertains to how easily it can be fitted into an educational program, whether the latter involves teaching a foreign language, teaching language arts to native speakers, or verbally imparting subject matter in a monolingual or multilingual school setting.
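To make the reliability criterion concrete, here is a minimal sketch of one classical way of estimating it, the test-retest method: the same test is given to the same examinees on two occasions, and the two sets of scores are correlated. The function and all scores below are invented for illustration and come from no study cited in this book.

```python
# A minimal sketch: test-retest reliability estimated as the Pearson
# correlation between two administrations of the same test to the
# same ten examinees (all scores are hypothetical).
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

first_sitting = [52, 61, 47, 75, 68, 59, 80, 43, 66, 71]
second_sitting = [55, 58, 50, 78, 64, 62, 77, 45, 69, 70]

print(f"test-retest reliability estimate: {pearson(first_sitting, second_sitting):.2f}")
```

A value near 1.0 means the test ranks examinees consistently from one occasion to the next; and since a test cannot be more valid than it is reliable, this figure also bounds the validity the test can claim.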
sort of analysis, often referred to as 'error analysis' (Richards, 1970a, 1971) may subsequently enter into the process of prescribing therapeutic intervention - possibly a classroom procedure. When the focus of language testing research is the verbal material in the test itself, questions 'usually relate to the psychologically real grammatical constraints on particular phonological (or graphological) sequences, syllable structures, vocabulary items, phrases, clauses, and higher level patterns of discourse. Sociological constraints may also be investigated with respect to the negotiability of those elements and sequences of them in interactional exchanges between human beings or groups of them. For any of the stated purposes of research, and ofcourse there are others which are not mentioned, the tests may be relatively formal devices or informal elicitation procedures. They· may require the production or comprehension of verbal sequences, or both. The language material may be heard, spoken, read, written (or possibly merely thought), or some combination of these. The task may require recognition only, or imitation, or delayed recall, memorization, meaningful conversational response, learning and long term storage, or some combination of these .. Ultimately, any attempt to apply the results of language testing research must consider the total spectrum of tests qua tests, learner characteristics, and the psychological and sociological constraints on test materials. Inferences concerning psychologically real grammars cannot be meaningful apart from the results oflan.guage tests viewed from all three angles of research outlined above. Whether or not a particular language test is valid (or better, the degree to which it' is valid or not valid), whether or not an achievement test or aptitude test, or personality inventory, or IQ test, or whatever other sort of test one chooses to consider is a language test, is dependent on what language competence really is and what sorts of verbal sequences present a challenge to that competence. This is essentially the question that Spolsky (1968) raised in the paper entitled: 'What does it mean to know a language? Or, How do you get someone to perform his competence?' 'p 6 LANGUAGE TESTS AT SCHOOL C. Organization of this book A good way, perhaps the only acceptable way to develop a test of ~ given ability is to start with a clear definition of the capacity in question. Chapter 2 begins Part One:. on Theory and Research Bases for Pragmatic Language Testing by proposing a definition for language proficiency. It introduces the notion of an expectancy grammar as a way of characterizing the psychologically real system that governs the use of a language in an individual who knows that language. Although it is acknowledged that the details of such a system are just beginning to be understood, certain pervasive characteristics of expectancy systems can be helpful in explaining why certain kinds of language tests apparently work as well as they do, and how to devise other effective testing procedures that take account of those salient characteristics of functional language proficiency. In Chapter 3, it is hypothesized that a valid language test must press the learner's internalized expectancy system into action and must further challenge its limits of efficient functioning in order to discriminate among degrees of efficiency. 
Although it is suggested that a statistical average of native performance on a language test is usually a reasonable upper bound on attainable proficiency, it is almost always possible and is sometimes essential to discriminate degrees of proficiency among native speakers, e.g. at various stages of child language learning, or among children or adults learning to read, or among language learners at any stage engaged in normal inferential discourse processing tasks. Criterion referenced testing, where passing the test or some portion of it means being able to perform the task at the criterion level of adequacy (which may be native-like performance in some cases), is also discussed. Pragmatic language tests are defined and exemplified as tasks that require the meaningful processing of sequences of elements in the target language (or tested language) under normal time constraints. It is claimed that time is always involved in the processing of verbal sequences. Chapter 4 extends the discussion to questions often raised in reference to bilingual-bicultural programs and other multilingual contexts in general. The implications of the now famous Lau versus Nichols case are discussed. Special attention is given to the role of socio-cultural attitudes in language acquisition and in education in general. It is hypothesized that, other things being equal, attitudes expressed and perceived in the schools probably account for more variance in rate and amount of learning than do educational methodologies related to the transmission of the traditionally conceived curriculum. Some of the special problems that arise in multilingual contexts are considered, such as cultural bias in tests, difficulties in translating test items, and methods of assessing language dominance. Chapter 5 concludes Part One with a discussion of the measurement of attitudes and motivations. It discusses in some detail questions related to the hypothesized relationship between attitudes and language learning (first or second), and considers such variables as the context in which the language learning takes place, and the types of measurement techniques that have been used in previous research. Several hypotheses are offered concerning the relationship between attitudes, motivations, and achievement in education. Certain puzzling facts about apparent interacting influences in multilingual contexts are noted. Part Two, Theories and Methods of Discrete Point Testing, takes up some of the more traditional and probably less theoretically sound ways of approaching language testing. Chapter 6 discusses some of the difficulties associated with testing procedures that grew out of contrastive linguistics, syntax based structure drills, and certain assumptions about language structure and the learning of language from early versions of transformational linguistics. Pitfalls in relying too heavily on statistics for guiding test development are discussed in Chapter 7. It is shown that different theoretical assumptions may result in contradictory interpretations of the same statistics. Thus it is argued that such statistical techniques as are normally applied in test development, though helpful if used with care, should not be the chief criterion for deciding test format. An understanding of language use and language learning must take priority in guiding format decisions. Chapter 8 shows how discrete point language tests may produce distorted estimates of language proficiency.
In fact, it is claimed that some discrete point tests are probably most appropriate as measures of the sorts of artificial grammars that learners are sometimes encouraged to internalize on the basis of artificial contexts of certain syntax dominated classroom methods. In order to measure communicative effectiveness for real-life settings in and out of the classroom, it is reasoned that the language tests used in the classroom (or in any educational context) must reflect certain crucial properties of the normal use of language in ways that some discrete point tests apparently cannot. Examples of discrete point items which attempt to examine the pieces of language structure apart from some of their systematic interrelationships are considered. The chapter concludes by proposing that the diagnostic aims of discrete point tests can in fact be achieved by so-called integrative or pragmatic tests. Hence, a reconciliation between the apparently irreconcilable theoretical positions is possible. In conclusion to Part Two on discrete point testing, Chapter 9 provides a natural bridge to Part Three, Practical Recommendations for Language Testing, by discussing multiple choice testing procedures which may be of the discrete point type, or the integrative type, or anywhere on the continuum in between the two extremes. However, regardless of the theoretical bent of the test writer, multiple choice tests require considerable technical skill and a good deal of energy to prepare. They are in some respects less practical than some of the pragmatic procedures recommended in Part Three precisely because of the technical skills and the effort necessary to their preparation. Multiple choice tests need to be critiqued by some native speaker other than the test writer. This is necessary to avoid the pitfalls of ambiguities and subtle differences of interpretation that may not be obvious to the test writer. The items need to be pretested, preferably on some group other than the population which will ultimately be tested with the finished product (this is often not feasible in classroom situations). Then, the items need to be statistically analyzed so that non-functional or weak items can be revised before they are used and interpreted in ways that affect learners. In some cases, recycling through the whole procedure is necessary even though all the steps of test development may have been quite carefully executed. Because of these complexities and costs of test development, multiple choice tests are not always suitable for meeting the needs of classroom testing, or for broader institutional purposes in some cases.
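As a rough picture of what that statistical screening involves, the sketch below computes two classical indices for each item: item facility (the proportion of examinees answering it correctly) and a simple discrimination index (the difference in facility between the highest- and lowest-scoring thirds of the group). The thresholds and answer data are invented for illustration; Chapter 9 treats item analysis and its interpretation properly.

```python
# A minimal sketch of classical item analysis for scored multiple
# choice data: flag items that are too easy, too hard, or that fail
# to separate high scorers from low scorers (all figures hypothetical).

def item_analysis(responses, too_easy=0.9, too_hard=0.3, min_disc=0.2):
    """responses: one list of 0/1 item scores per examinee."""
    totals = [sum(r) for r in responses]
    ranked = [r for _, r in sorted(zip(totals, responses), key=lambda p: p[0])]
    third = max(1, len(ranked) // 3)
    lower, upper = ranked[:third], ranked[-third:]
    report = []
    for i in range(len(responses[0])):
        facility = sum(r[i] for r in responses) / len(responses)
        disc = (sum(r[i] for r in upper) - sum(r[i] for r in lower)) / third
        weak = facility > too_easy or facility < too_hard or disc < min_disc
        report.append((i + 1, round(facility, 2), round(disc, 2), weak))
    return report

# Hypothetical scored answer sheets: six examinees by four items.
sheets = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
]
for item, facility, disc, weak in item_analysis(sheets):
    print(f"item {item}: facility={facility} discrimination={disc} revise={weak}")
```

Items flagged in this way are the 'non-functional or weak' items that would be revised or discarded before the test is used in earnest.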
The reader who is interested mainly in classroom applications of pragmatic testing procedures may want to begin reading at Chapter 10 in Part Three. However, unless the material in the early chapters (especially 2 through 9) is fairly well grasped, the basis for many of the recommendations in Part Three will probably not be appreciated fully. For instance, many educators seem to have acquired the impression that certain pragmatic language tests, such as those based on the cloze procedure for example (see Chapter 12), are 'quick and dirty' methods of acquiring information about language proficiency. This idea, however, is apparently based only on intuitions and is disconfirmed by the research discussed in Part One. Pragmatic tests are typically better on the whole than any other procedures that have been carefully studied. Whereas the prevailing techniques of language testing that educators are apt to be most familiar with are based on the discrete point theories, these methods are rejected in Part Two. Hence, were the reader to skip over to Part Three immediately, he might be left in a quandary as to why the pragmatic testing techniques discussed there are recommended instead of the more familiar discrete point (and typically multiple choice) tests. Although pragmatic testing procedures are in some cases deceptively simple to apply, they probably provide more accurate information concerning language proficiency (and even specific achievement objectives) than the more familiar tests produced on the basis of discrete point theory. Moreover, not only are pragmatic tests apparently more valid, but they are more practicable. It simply takes less premeditation, and less time and effort, to prepare and use pragmatic tests. This is not to say that great care and attention is not necessary to the use of pragmatic testing procedures, quite the contrary. It is rather to say that hour for hour and dollar for dollar the return on work and money expended in pragmatic testing will probably offer higher dividends to the learner, the educator, and the taxpayer. Clearly, much more research is needed on both pragmatic and discrete point testing procedures, and many suggestions for possible studies are offered throughout the text. Chapter 10 discusses some of the practical classroom applications of the procedure of dictating material in the target language. Variations of the technique which are also discussed include the procedure of 'elicited imitation' employed with monolingual, bilingual, and bidialectal children. In Chapter 11 attention is focussed on a variety of procedures requiring the use of productive oral language skills. Among the ones discussed are reading aloud, story retelling, dialogue dramatization, and conversational interview techniques; the Foreign Service Institute Oral Interview, the Ilyin Oral Interview, the Upshur Oral Communication Test, and the Bilingual Syntax Measure are also discussed. The increasingly widely used cloze procedure and variations of it are considered in Chapter 12. Basically the procedure consists of deleting words from prose (or auditorily presented material) and asking the examinee to try to replace the missing words. Because of the simplicity of application and the demonstrated validity of the technique, it has become quite popular in recent years. However, it is probably not any more applicable to classroom purposes than some of the procedures discussed in other chapters. The cloze procedure is sometimes construed to be a measure of reading ability, though it may be just as much a measure of writing, listening and speaking ability (see Chapter 3, and the Appendix).
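For readers who want a concrete picture before reaching Chapter 12, here is a minimal sketch of the common fixed-ratio variant, in which every nth word is deleted; the deletion rate, blank format, and function below are assumptions made for illustration, not a prescription from the text.

```python
# A minimal sketch of fixed-ratio cloze construction: delete every
# nth word of a passage and keep the deleted words as a scoring key.

def make_cloze(text, n=7, blank="______"):
    """Return the mutilated passage and the list of deleted words."""
    words = text.split()
    key = []
    for i in range(n - 1, len(words), n):
        key.append(words[i])
        words[i] = blank
    return " ".join(words), key

passage = ("When the term language test is mentioned, most people "
           "probably have visions of students in a foreign language "
           "classroom poring over a written examination.")
cloze_text, answers = make_cloze(passage, n=7)
print(cloze_text)
print("key:", answers)
```

Scoring may then credit only the exact word deleted, or any contextually acceptable replacement, a choice that Chapter 12 takes up.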
Chapter 14 considers ways in which testing procedures can be related to curricu1a. In particu1ar it asks, 'How can effective testing procedures be invented or adapted to the needs of an instructional program?' Some general guidelines are tentatively suggested for both developing or adapting testing proceq.ures and for studying their effectiveness in relation to particufar i!ducational objectives. To illustrate one of the ways. that curriculum (learning, teaching, and testing) can be related to a comprehensive sort of validation research, the Mount Gravatt reading research project is discussed. Tliis project provides a rich source of data concerning preschool children, and children in the early grades. By carefully studying the kinds of language games that children at various age levels can play and win (Upshur, 1973), that is, the kinds of things that they can explore verbally and with success, Norman Hart, Richard Walker, and their colleagues have provided a model for relating theories of language learning via carefu1 research to the practical educational task of teaching reading. There are, of course, spin off benefits to all other areas of the curricu1um because of the fundamental part played by language use in every area of the educational process. It is strongly urged that language testing procedures, especially for assessing the language skills of children, be carefully examined in the light of such research. Throughout the text, wherever technical projects are ref6rred to, details of a technical sort are either omitted or are explained in non- 11 technical language. More complete research reports are often referred to in the text (also see the Appendix) and should be consu1ted by anyone interested in applying the recommendations contained here to the testing of specific research hypotheses. However, for classroom purposes (where at least some of the technicalities of research design are luxuries) the suggestions ~ffered here are intended to be sufficient. Additional readings; usually of a non-technical sort, are suggested at the end of each chapter. A complete list of technical reports and other works referred to in the text is included at the end of the book in a separate Bibliography. The latter includes all of the Suggested Readings at the end of el).ch chapter, plus many items not given in the Suggested Readings lists. An Appendix reporting on recent empirical study of many of the pressing questions raised in the body of the text is included at the end. The fundamental question addressed there is whether or not language proficiency can be parsed up into components, skills, and the like. Further, it is asked whether language proficiency is distinct from IQ, achievement, and other educational constructs. The Appendix is not inchldedin the body of the text precisely because it is somewhat technical. It is to be expected that a book dealing with a subject matter that is changing as rapidly as the field of language testing research shou1d soon be outdated. However, it seems that most current research is pointing toward the refinement of existing pragmatic testing procedures and the discovery of new ones and new applications. It seems unlikely that there will be a substantial return to the strong versions of discrete point theories and methods of the 1960s and early 1970s. In any event the emphasis here is on pragmatic language testing because it is believed that such procedures offer a richer yield of information. KEY POINTS 1. 
KEY POINTS

1. Any test that challenges the language ability of an examinee can, at least in part, be construed as a language test. This is especially true for examinees who do not know or normally use the language variety of the test, but is true in a broader sense for native speakers of the language of the test.
2. It is not known to what extent language ability may be co-extensive with IQ, but there is evidence that the relationship is a very strong one. Hence, IQ tests (and many other varieties of tests as well) may be tests of language ability more than they are tests of anything else.
3. Language testing is crucial to the investigation of psychologically real grammars, to research in all aspects of distinctively human symbolic behavior, and to educational measurement in general.
4. Language testing research may focus on tests, learners, or constraints on verbal materials.
5. Among the questions of interest to language testing research are: (a) how to operationally define levels of language proficiency, stages of language learning, degrees of bilingualism, or linguistic competence; (b) how to differentiate stages of learning; (c) how to measure possible effects of instruction (or other environmental factors) on language learning; (d) how language ability is related to the acquisition of knowledge in an educational setting.
6. When the research is focussed on tests, validity, reliability, practicality, and instructional value are among the factors of interest.
7. When the focus is on learners and their developing language systems, tests may be viewed as elicitation procedures. Data elicited may then be analyzed with a view toward providing detailed descriptions of learner systems, and/or diagnosis of teaching procedures (or other therapeutic interventions) to facilitate learning.
8. When research is directed toward the verbal material in a given test or testing procedure, the effects of psychological or sociological constraints built into the verbal sequences themselves (or constraints which are at least implicit to language users) are at issue.
9. From all of the foregoing it follows that the results of language tests, and the findings of language testing research, are highly relevant to psychological, sociological, and educational measurement in general.

DISCUSSION QUESTIONS

1. What tests are used in your school that require the comprehension or production of complex sequences of material in a language? Reading achievement tests? Aptitude tests? Personality inventories? Verbal IQ tests? Others? What evidence exists to show that the tests are really measures of different things?
2. Are tests in your school sometimes used for examinees whose native language (or language variety) is not the language (or language variety) used in the test? How do you think such tests ought to be interpreted?
3. Is it possible that a non-verbal test of IQ could have a language factor unintentionally built into it? How are the instructions given? What strategies do you think children or examinees must follow in order to do the items on the test? Are any of those strategies related to their ability to code information verbally? To give subvocal commands?
4. In what ways do teachers normally do language testing (unintentionally) in their routine activities? Consider the kinds of instructions children or adults must execute in the classroom.
5. Take any standardized test used in any school setting. Analyze it for its level of verbal complexity. What instructions does it use? Are they more or less complex than the tasks which they define or explain? For what age level of children or for what proficiency level of second language learners would such instructions be appropriate? Is guessing necessary in order to understand the instructions of the test?
6. Consider any educational research project that you know of or have access to. What sorts of measurements did the research use? Was there a testing technique? An observational or rating procedure? A way of recording behaviors? Did language figure in the measurements taken?
7. Why do you think language might or might not be related to capacity to perform motor tasks, particularly in young children? Consider finding your way around town, or around a new building, or around the house. Do you ever use verbal cues to guide your own stops, starts, turns? Subvocal ones? How about in a strange place or when you are very tired? Do you ever ask yourself things like, Now what did I come in here for?
8. Can you conceive of any way to operationalize notions like language competence, degree of bilingualism, stages of learning, effectiveness of language teaching, rate of learning, level of proficiency without language tests?
9. If you were to rank the criteria of validity, reliability, practicality and instructional value in their order of importance, what order would you put them in? Consider the fact that validity without practicality is certainly possible. The same is true for validity without instructional value. How about instructional value without validity?
10. Do you consider the concept of intelligence or IQ to be a useful theoretical construct? Do you believe that researchers and theorists know what they mean by the term apart from some test score? How about grammatical knowledge? Is it the same sort of construct?
11. Can you think of any way(s) that time is normally involved in a task like reading a novel - when no one is holding a stop-watch?

SUGGESTED READINGS

1. George A. Miller, 'The Psycholinguists,' Encounter 23, 1964, 29-37. Reprinted in Charles E. Osgood and Thomas A. Sebeok (eds.) Psycholinguistics: A Survey of Theory and Research Problems. Bloomington, Indiana: Indiana University, 1965.
2. John W. Oller, Jr. and Kyle Perkins, Language in Education: Testing the Tests. Rowley, Massachusetts: Newbury House, 1978.
3. Bernard Spolsky, 'Introduction: Linguists and Language Testers' in Advances in Language Testing: Series 2, Approaches to Language Testing. Arlington, Virginia: Center for Applied Linguistics, 1978, v-x.

2 Language Skill as a Pragmatic Expectancy Grammar

A. What is pragmatics about?
B. The factive aspect of language use
C. The emotive aspect
D. Language learning as grammar construction and modification
E. Tests that invoke the learner's grammar

Understanding what is to be tested is prerequisite to good testing of any sort. In this chapter, the object of interest is language as it is used for communicative purposes - for getting and giving information about (or for bringing about changes in) facts or states of affairs, and for expressing attitudes toward those facts or states of affairs. The notion of expectancy is introduced as a key to understanding the nature of psychologically real processes that underlie language use. It is suggested that expectancy generating systems are constructed and modified in the course of language acquisition. Language proficiency
is thus characterized as consisting of such an expectancy generating system. Therefore, it is claimed that for a proposed measure to qualify as a language test, it must invoke the expectancy system or grammar of the examinee.

A. What is pragmatics about?

The newscaster in Albuquerque who smiled cheerfully while speaking of traffic fatalities, floods, and other calamities was expressing an entirely different attitude toward the facts he was referring to than was probably held by the friends and relatives of the victims, not to mention the more compassionate strangers who were undoubtedly watching him on the television. There might not have been any disagreement about what the facts were, but the potent contrast in the attitudes of the newscaster and others probably accounts for the brevity of the man's career as the station's anchorman. Thus, two aspects of language use need to be distinguished. Language is usually used to convey information about people, things, events, ideas, states of affairs, and attitudes toward all of the foregoing. It is possible for two or more people to agree entirely about the facts referred to or the assumptions implied by a certain statement but to disagree markedly in their attitudes toward those facts. The newscaster and his viewers probably disagreed very little or not at all concerning the facts he was speaking of. It was his manner of speaking (including his choice of words) and the attitude conveyed by it that probably shortened his career.

Linguistic analysis has traditionally been concerned mainly with what might be called the factive (or cognitive) aspect of language use. The physical stuff of language which codes factive information usually consists of sequences of distinctive sounds which combine to form syllables, which form words, which get hooked together in highly constrained ways to form phrases, which make up clauses, which also combine in highly restricted ways to yield the incredible diversity of human discourse. By contrast, the physical stuff of language which codes emotive (or affective, attitudinal) information usually consists of facial expression, tone of voice, and gesture. Psychologists and sociologists have often been interested more in the emotive aspect of language than in the cognitive complexities of the factive aspect. Cognitive psychology and linguistics, along with philosophy and logic, on the other hand, have certainly concentrated on the latter. Although the two aspects are intricately interrelated, it is often useful and sometimes essential to distinguish them.

Consider, for instance, the statement that Some of Richard's lies have been discovered. This remark could be taken to mean that there is a certain person named Richard (whom we may infer to be a male human), who is guilty of lying on more than one or two occasions, and some of whose lies have been found out. In addition, the remark implies that there are other lies told by Richard which may be uncovered later. Such a statement relates in systematic ways to a speaker's asserted beliefs concerning certain states of affairs. Of course, the speaker may be lying, or sincere but mistaken, or sincere and correct, and these are only some of the many possibilities. In any case, however, as persons who know English, we understand the remark about Richard partly by inferring the sorts of facts it would take to make such a statement true.
Such inferences are not perfectly understood, but there is no doubt that language users make them. A speaker (or writer) must make them in order to know what his listener (or reader) will probably understand, and the listener (or reader) must make them in order to know what a speaker (or writer) means.

In addition to the factive information coded in the words and phrases of the statement, a person who utters that statement may convey attitudes toward the asserted or implied states of affairs, and may further code information concerning the way the speaker thinks the listener should feel about those states of affairs. For instance, the speaker may appear to hate Richard, and to despise his lies (both those that have already been discovered and the others not yet found out), or he may appear detached and impersonal. In speaking, such emotive effects are achieved largely by facial expression, tone of voice, and gesture, but they may also be achieved in writing by describing the manner in which a statement is made or by skillful choice of words. The latter, of course, is effective either in spoken or written discourse as a device for coding emotive information. Notice the change in the emotive aspect if the word lies is replaced by half-truths: Some of Richard's half-truths have been discovered. The disapproval is weakened still further if we say: Some of Richard's mistakes have been discovered, and further still if we change mistakes to errors of judgement.

In the normal use of language it is possible to distinguish two major kinds of context. First, there is the physical stuff of language which is organized into a more or less linear arrangement of verbal elements skillfully and intricately interrelated with a sequence of rather precisely timed changes in tone of voice, facial expression, body posture, and so on. To call attention to the fact that in human beings even the latter so-called 'paralinguistic' devices of communication are an integral part of language use, we may refer to the verbal and gestural aspects of language in use as constituting linguistic context. With reference to speech it is possible to decompose linguistic context into verbal and gestural contexts. With reference to writing, the terms linguistic context and verbal context may be used interchangeably.

A second major type of context has to do with the world, outside of language, as it is perceived by language users in relation to themselves and valued other persons or groups. We will use the term extralinguistic context to refer to states of affairs constituted by things, events, people, ideas, relationships, feelings, perceptions, memories, and so forth. It may be useful to distinguish objective aspects of extralinguistic context from subjective aspects. On the one hand, there is the world of existing things, events, persons, and so forth, and on the other, there is the world of self-concept, other-concept, interpersonal relationships, group relationships, and so on. In a sense, the two worlds are part of a single totality for any individual, but they are not necessarily so closely related. Otherwise, there would be no need for such terms as schizophrenia, or paranoia. Neither linguistic nor extralinguistic contexts are simple in themselves, but what complicates matters still further and makes meaningful communication possible is that there are systematic correspondences between linguistic contexts and extralinguistic ones.
That is, sequences of linguistic elements in normal uses of language are not haphazard in their relation to people, things, events, ideas, relationships, attitudes, etc., but are systematically related to states of affairs outside of language. Thus we may say that linguistic contexts are pragmatically mapped onto extralinguistic contexts, and vice versa.

We can now offer a definition of pragmatics. Briefly, it addresses the question: how do utterances relate to human experience outside of language? It is concerned with the relationships between linguistic contexts and extralinguistic contexts. It embraces the traditional subject matter of psycholinguistics and also that of sociolinguistics. Pragmatics is about how people communicate information about facts and feelings to other people, or how they merely express themselves and their feelings through the use of language for no particular audience, except possibly an omniscient God. It is about how meaning is both coded and in a sense invented in the normal intercourse of words and experience (to borrow a metaphor from Dewey, 1929).

B. The factive aspect of language use

Language, when it is used to convey information about facts, is always an abbreviation for a richer conceptualization. We know more about objects, events, people, relationships, and states of affairs than we are ever fully able to express in words. Consider the difficulty of saying all you know about the familiar face of a friend. The fact is that your best effort would probably fail to convey enough information to enable someone else to single out your friend in a large crowd. This simply illustrates the fact that you know more than you are able to say.

Here is another example. Not long ago, a person whom I know very well was involved in an accident. He was riding a ten-speed bicycle around a corner when he was hit head-on by a pick-up. The driver of the truck cut the corner at about thirty miles an hour leaving no room for the cyclist to pass. The collision was inevitable. Blood gushed from a three inch gash in the top of his head and a blunt handlebar was rammed nearly to the bone in his left thigh. From this description you have some vivid impressions about the events referred to. However, no one needs to point out the fact that you do not know as much about the events referred to as the person who experienced them. Some of what you do know is a result of the linguistic context of this paragraph, and some of what you know is the result of inferences that you have correctly made concerning what it probably feels like to have a blunt object rammed into your thigh, or to have a three inch gash in your head, but no matter how much you are told or are able to infer it will undoubtedly fall short of the information that is available to the person who experienced the events in his own flesh. Our words are successful in conveying only part of the information that we possess.

Whenever we say anything at all we leave a great deal more unsaid. We depend largely for the effect of our communications not only on what we say but also on the creative ability of our listeners to fill in what we have left unsaid. The fact is that a normal listener supplies a great deal of information by creative inference and in a very important sense is always anticipating what the speaker will say next.
Similarly, the speaker is always anticipating what the listener will infer and is correcting his output on the basis of feedback received from the listener. Of course, some language users are more skillful in such things than others. We are practically always a jump or two ahead of the person that we are listening to, and sometimes we even outrun our own tongues when we are speaking. It is not unusual in a speech error for a speaker to say a word several syllables ahead of what he intended to say, nor is it uncommon for a listener to take a wrong turn in his thinking and fail to understand correctly, simply because he was expecting something else to be said.

It has been shown repeatedly that tampering with the speaker's own feedback of what he is saying has striking debilitating effects (Chase, Sutton, and First, 1959). The typical experiment illustrating this involves delayed auditory feedback or sidetone. The speaker's voice is recorded on a tape and played back a fraction of a second later into a set of headphones which the speaker is wearing. The result is that the speaker hears not what he is saying, but what he has just said a fraction of a second earlier. He invariably stutters and distorts syllables almost beyond recognition. The problem is that he is trying to compensate for what he hears himself saying in relation to what he expects to hear. After some practice, it is possible for the speaker to ignore the delayed auditory feedback and to speak normally by attending instead to the so-called kinesthetic feedback of the movements of the vocal apparatus and presumably the bone-conducted vibrations of the voice.

The pervasive importance of expectations in the processing of all sorts of information is well illustrated in the following remark by the world renowned neurophysiologist, Karl Lashley:

... The organization of language seems to me to be characteristic of almost all other cerebral activity. There is a series of hierarchies of organization; the order of vocal movements in pronouncing the word, the order of words in the sentence, the order of sentences in the paragraph, the rational order of paragraphs in a discourse. Not only speech, but all skilled acts seem to involve the same problems of serial ordering, even down to the temporal coordinations of muscular contractions in such a movement as reaching and grasping (1951, p. 187).

A major aspect of language use that a good theory must explain is that there is, in Lashley's words, 'a series of hierarchies of organization.' That is, there are units that combine with each other to form higher level units. For instance, the letters in a written word combine to form the written word itself. The word is not a letter and the letter is not a word, but the one unit acts as a building block for the other, something like the way atoms combine to form molecules. Of course, atoms consist of their own more elementary building blocks and molecules combine in complex ways to become the building blocks of a great diversity of higher order substances. Words make phrases, and the phrases carry new and different meanings which are not part of the separate words of which they are made. For instance, consider the meanings of the words head, red, the, and beautiful. Now consider their meanings again in the phrase the beautiful redhead as in the sentence, She's the beautiful redhead I've been telling you about.
At each higher level in the hierarchy, as John Dewey (1929) put it, new meanings are bred from the copulating forms. This in a nutshell is the basis of the marvelous complexity and novelty of language as an instrument for coding information and for conceptualizing.

Noam Chomsky, eminent professor of linguistics at the Massachusetts Institute of Technology, is mainly responsible for the emphasis in modern linguistics on the characteristic novelty of sentences. He has argued convincingly (cf. especially Chomsky, 1972) that novelty is the rule rather than the exception in the everyday use of language. If a sentence is more than part of a ritual verbal pattern (such as, 'Hello. How are you?'), it is probably a novel concoction of the speaker which probably has never been heard or said by him before. As George Miller (1964) has pointed out, a conservative estimate of the number of possible twenty-word sentences in English is on the order of the number of seconds in one hundred million centuries. Although sentences may share certain structural features, any particular one that happens to be uttered is probably invented new on the spot. The probability that it has been heard before and memorized is very slight.

The novelty of language, however, is a kind of freedom within limits. When the limits on the creativity allowed by language are violated, many versions of nonsense result. They may range from unpronounceable sequences like gbntmbwk (unpronounceable in English at least) to pronounceable nonsense such as nox ems glerf onmo kebs (from Osgood, 1955). They may be syntactically acceptable but semantically strange concoctions like the much overused example of Jabberwocky or Chomsky's (now trite) illustrative sentence, Colorless green ideas sleep furiously.1

A less well known passage of nonsense was invented by Samuel Foote, one of the best known humorists of the eighteenth century, in order to prove a point about the organization of memory. Foote had been attending a series of lectures by Charles Macklin on oratory. On one particular evening, Mr Macklin boasted that he had mastered the principles of memorization so thoroughly that he could repeat any paragraph by rote after having read it only once. At the end of the lecture, the unpredictable Foote handed Mr Macklin a piece of paper on which he had composed a brief paragraph during the lecture. He asked Mr Macklin to kindly read it aloud once to the audience and then to repeat it from memory. So Mr Macklin read:

So she went into the garden to cut a cabbage leaf to make an apple pie; and at the same time a great she-bear coming up the street pops its head in the shop. 'What! No soap!' So he died, and she very imprudently married the barber; and there were present the Picninnies, the Joblillies, and the Garcelies, and the Great Panjandrum himself, with the little round button at the top; and they all fell to playing the game of catch-as-catch-can, till the gunpowder ran out at the heels of their boots (Samuel Foote, ca. 1754; see Cooke, 1902, p. 221f).2

The incident probably improved Mr Macklin's modesty and it surely instructed him on the importance of the reconstructive aspects of verbal memory. We don't just happen to remember things in all their detail; rather, we remember a kind of skeleton, or possibly a whole hierarchy of skeletons, to which we attach the flesh of detail by a creative and reconstructive process.

1 At one Summer Linguistics Institute, someone had bumper stickers printed up with Chomsky's famous sentence. One of them found its way into the hands of my brother, D. K. Oller, and eventually onto my bumper, to the consternation of many Los Angeles motorists.

2 I am indebted to my father, John Oller, Sr., for this illustration. He used it often in his talks on language teaching to show the importance of meaningful sequence to recall and learning. He often attributed the prose to Mark Twain, and it is possible that Twain used this same piece of wit to debunk a supposed memory expert in a circus contest as my father often claimed. I have not been able to document the story about Twain, though it seems characteristic of him. No doubt he knew of Foote and may have consciously imitated him.
That process, like all verbal and cognitive activities, is governed largely by what we have learned to expect. The fact is that she-bears rarely pop their heads into barber shops, nor do people cut cabbage leaves to make apple pies. For such reasons, Foote's prose is difficult to remember. It is because the contexts, outside of language, which are mapped by his choice of words are odd contexts in themselves. Otherwise, the word sequences are grammatical enough.

Perhaps the importance of our normal expectancies concerning words and what they mean is best illustrated by nonsense which violates those expectancies. The sequence gbntmbwk forces on our attention things that we know only subconsciously about our language - for example, the fact that g cannot immediately precede b at the beginning of a word, and that syllables in English must have a vowel sound in them somewhere (unless shhhh! is a syllable). These are facts we know because we have acquired an expectancy grammar for English. Our internalized grammar tells us that glerf is a possible word in English. It is pronounceable and is parallel to words that exist in the language such as glide, serf, and slurp; still, glerf is not an English word. Our grammatical expectancies are not completely violated by Lewis Carroll's phrase, the frumious bandersnatch, but we recognize this as a novel creation. Our internalized grammar causes us to suppose that frumious must be an adjective that modifies the noun bandersnatch. We may even imagine a kind of beast that, in the context, might be referred to as a frumious bandersnatch. Our inferential construction may or may not resemble anything that Carroll had in mind, if in fact he had any 'thing' in mind at all. The inference here is similar to supposing that Foote was referring to Macklin himself when he chose the phrase the Great Panjandrum himself, with the little round button at the top. In either case, similar grammatical expectancies are employed.

But it may be objected that what we are referring to here as grammatical involves more than what is traditionally subsumed under the heading grammar. However, we are not concerned here with grammar in the traditional sense as being something entirely abstract and unrelated to persons who know languages. Rather, we are concerned with the psychological realities of linguistic knowledge as it is internalized in whatever ways by real human beings. By this definition of grammar, the language user's knowledge of how to map utterances pragmatically onto contexts outside of language and vice versa (that is, how to map contexts onto utterances) must be incorporated into the grammatical system. To illustrate, the word horse is effective in communicative exchanges if it is related to the right sort of animal. Pointing to a giraffe and calling it a horse is not an error in syntax, nor even an error in semantics (the speaker and listener may both know the intended meaning). It is the pragmatic mapping of a particular exemplar of the category GIRAFFE (as an object or real world thing, not as a word) that is incorrect. In an important sense, such an error is a grammatical error.

The term expectancy grammar calls attention to the peculiarly sequential organization of language in actual use. Natural language is perhaps the best known example of the complex organization of elements into sequences and classes, and sequences of classes which are composed of other sequences of classes and so forth. The term pragmatic expectancy grammar further calls attention to the fact that the sequences of classes of elements, and hierarchies of them which constitute a language, are available to the language user in real life situations because they are somehow indexed with reference to their appropriateness to extralinguistic contexts.

In the normal use of language, no matter what level of language or mode of processing we think of, it is always possible to predict partially what will come next in any given sequence of elements. The elements may be sounds, syllables, words, phrases, sentences, paragraphs, or larger units of discourse. The mode of processing may be listening, speaking, reading, writing, or thinking, or some combination of these. In the meaningful use of language, some sort of pragmatic expectancy grammar must function in all cases. A wide variety of research has shown that the more grammatically predictable a sequence of linguistic elements is, the more readily it can be processed. For instance, a sequence of nonsensical syllables as in the example from Osgood, Nox ems glerf onmo kebs, is more difficult than the same sequence with a more obvious structure imposed on it, as in The nox ems glerfed the onmo kebs. But the latter is still more difficult to process than, The bad boys chased the pretty girls. It is easy to see that the gradation from nonsense to completely acceptable sequences of meaningful prose can vary by much finer degrees, but these examples serve to illustrate that as sequences of linguistic elements become increasingly more predictable in terms of grammatical organization, they become easier to handle. Not only are less constrained sequences more difficult than more constrained ones, but this generalization holds true regardless of which of the four traditionally recognized skills we are speaking of. It is also true for learning. In fact, there is considerable evidence to suggest that as organizational constraints on linguistic sequences are increased, ease of processing (whether perceiving, producing, learning, recalling, etc.) increases at an accelerating rate, almost exponentially. It is as though our learned expectations enable us to lie in wait for elements in a highly constrained linguistic context and make much shorter work of them than would be possible if they took us by surprise. As we have been arguing throughout this chapter, the constraints on what may follow in a given sequence of linguistic elements go far beyond the traditionally recognized grammatical ones, and they operate in every aspect of our cognition.
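The relation between constraint and ease of processing can be made concrete with a small computational illustration. What follows is our own sketch, not part of the original argument: a toy bigram model, trained on a few invented sentences, assigns a 'surprisal' (in bits) to each transition between words, so that sequences conforming to learned expectancies score low and Osgood-style nonsense scores high. The corpus, the test sentences, and the function surprisal are all hypothetical and purely illustrative.

    import math
    from collections import Counter

    # Tiny invented corpus; "." marks sentence boundaries.
    corpus = ("the bad boys chased the pretty girls . "
              "the pretty girls liked the bad boys .").split()

    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(set(corpus))

    def surprisal(sentence):
        # Mean per-word surprisal in bits; add-one smoothing keeps unseen
        # words and transitions merely surprising rather than impossible.
        words = sentence.split()
        total = 0.0
        for prev, word in zip(words, words[1:]):
            p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab + 1)
            total += -math.log2(p)
        return total / (len(words) - 1)

    print(surprisal("the bad boys chased the pretty girls"))  # low: conforms to expectancies
    print(surprisal("nox ems glerf onmo kebs"))               # high: violates them

Under such a model the constrained sentence receives a low average surprisal and the nonsense sequence a high one, which parallels, in a crude quantitative form, the claim that more predictable sequences are easier to handle.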
To illustrate, the word horse is effective in communicative exchanges if it is related to the right sort of animal. Pointing to a giraffe and calling it a horse is not an error in syntax, nor even an error in semantics (the speaker and listener may both know the intended meaning). It is the pragmatic mapping of a particular exemplar of the category GIRAFFE (as an object or real world thing, not as a word) that is incorrect. In an important sense, such an error is a grammatical error. The term expectancy grammar calls attention to the peculiarly sequential organization oflanguage in actual use. Natural language is perhaps the best known example of the complex organization of elements into sequences and classes, and sequences of classes which are composed of other sequences of classes and so forth. The term pragmatic expectancy grammar further calls attention to the fact that the sequences of classes of elements, and hierarchies of them which constitute a language are available to the language user in real life situations because they are somehow indexed with reference to their appropriateness to extralinguistic contexts. LANGUAGE SKILL AS A PRAGMATIC EXPECTANCY GRAMMAR 25 In the normal use oflanguage, no matter what level oflanguage or mode of processing we think of, it is always possible to predict partially what will come next in any given sequence of elements. The elements may be sounds, syllables, words, phrases, sentences, paragraphs, or larger units of discourse. The mode of processing may be listening, speaking, reading, writing, or thinking, or some combination of these. In the meaningful use oflanguage, some sort of pragmatic expectancy grammar must function in all cases. A wide variety of research has shown that the more grammatically predictable a sequence oflinguistic elements is, the more readily it can be processed. For instance, a sequence of nonsensical syllables ·as in the example from Osgood, Nox ems glerf onmo kebs, is more difficult than the same sequence with a more obvious structure imposed on it, as in The nox ems glerfed the onmo kebs. But the latter is still more difficult to process than, The bad boys chased the pretty girls. It is easy to see that the gradation from nonsense to completely acceptable sequences of meaningful prose can vary by much finer degrees, but these examples serve to illustrate that as sequences of linguistic elements become increasingly more predictable in terms of grammatical organization, they become easier to handle. Not only are less constrained sequences more difficult than more constrained ones, but this generalization holds true regardless of which of the four traditionally recognized skills we are speaking of. It is also true for learning. In fact, there is considerable evidence to suggest that as organizational constraints on linguistic sequences are increased, ease of processing (whether perceiving, producing, learning, recalling, etc.) increases at an accelerating rate, almost exponentially. It is as though our learned expectations enable us to lie in wait for elements in a highly constrained linguistic context and make much shorter work of them than would be possible if they took us by surprise. As we have been arguing throughout this chapter, the constraints on what may follow in a given sequence of linguistic elements go far beyond the traditionally recognized grammatical ones, and they operate in every aspect of our cognition. 
In his treatise on thinking, John Dewey (1910) argued that the 'central factor in thinking' is an element of expectancy. He gives an example of a man strolling along on a warm day. Suddenly, the man notices that it has become cool. It occurs to him that it is probably going to rain; looking up, he sees a dark cloud between himself and the sun. He then quickens his steps (p. 61). Dewey goes on to define thinking as 'that operation in which 26 LANGUAGE SKILL AS A PRAGMATIC EXPECTANCY GRAMMAR LANGUAGE TESTS AT SCHOOL 27 Although the facts referred to by such terms as exaggerating and lying may be the same facts in certain practical cases, the attitudes expressed toward those facts by selecting one or the other term are quite different. Indeed, the accompanying facial expression and tone of voice may convey attitudinal information so forcefully as to even contradict the factive claims of a statement. For instance, the teacher who says to a young child in an irritated tone of voice to 'Never mind about the spilled glue! It won't be any trouble to clean it up!' conveys one message factively and a very different one emotively. In this case, as in most such cases, the tone of voice is apt to speak louder than the words. Somehow we are more willing to believe a person's manner of speaking than whatever his words purport to say. It is as if emotively coded messages were higher ranking and therefore more authoritative messages. As Watzlawick, Beavin, and Jackson (1967) point out in their book on the Pragmatics of Human Communication, it is part of the function of emotive messages to provide instructions concerning the interpretation of factively coded information. Whereas the latter can usually be translated into propositional forms such as 'This is what I believe is true', or 'This is what I believe is not true', the former can usually be translated into propositional forms about interpersonal relationships, or about how certain factive statements are to be read. For instance, a facetious remark may be said in such a manner that the speaker implies 'Take this remark as ajoke' or 'I don't really believe this, and you shouldn't either.' At the same time, people are normally coding information emotively about the way they see each other as persons. Such messages can usually be translated into such propositional forms as 'This is the way I see you' Or 'This is the way I see myself' or 'This is the way I see you seeing me' and so on. Although attitudes toward the self, toward others, and toward the things that the self and others say may be more difficult to pin down than are tangible states of affairs, they are nonetheless real. In fact, Watzlawick et ai, contend that emotive messages concerning such ~bstract aspects of interpersonal realities are probably much more Important to the success of communicative exchanges than the factively coded messages themselves. If the self in relationship to ?thers is satisfactorily defined, and if the significant others in lllteractionalrelationships confirm one's definition of self and others communication concerning factive information can take place: Otherwise, relationship struggles ensue. Marital strife over whether or not one party loves the other, children's disputes about who said presentfacts suggest other facts (or truths) in such a way as to induce~ belief in the latter upon the ground or the warrant of the former' (p. 8f). C. The emotive aspect To this point, we have been concerned primarily with the factive aspect oflanguage and cognition. 
However, much of what has been said applies as well to the emotive aspect oflanguage use. Nonetheless there are contrasts in the coding of the two types of information. While factive information is coded primarily in distinctly verbal sequences, emotive information is coded primarily in gestures, tone of voice, facial expression, and the like. Whereas verbal sequences consist of a finite set of distinctive sounds (or features of sounds), syllables, words, idioms, and collocations, and generally of discrete and countable sequences of elements, the emotive coding devices are typically non-discrete and are more or less continuously variable. For example, a strongly articulated p in the word pat hardly changes the meaning of the word, nor does it serve much better to distinguish a pat on the head from a cat in the garage. Shouting the word garage does not imply a larger garage, nor would whispering it necessarily change the meaning of the word in terms of its factive value. Either you have a garage to talk about or you don't, and there isn't much point in distinguishing cases in between the two extremes. With emotive information things are different. A wildly brandished fist is a stronger statement than a mere clenched fist. A loud shout means a stronger degree of emotion than a softer tone. In the kinds of devices typically used to code emotion information, variability in the strength of the symbol is analogically related to similar variabilityin the attitude that is symbolized. In both speaking and writing, choice of words also figures largely in the coding of attitudinal information. Consider the differences in the attitudes elicited by the following sentences: (l) Some people say it is better to explain our point of view as well as give the news; (2) Some people say it is better to include some propaganda as well as give the news. Clearly, the presuppositions and implications of the two sentences are somewhat different, but they could conceivably be used in reference to exactly the same immediate states of affairs or extralinguistic contexts. Of the people polled in a certain study, 42.8 %agreed with the first, while only 24.7 %agreed with the second (Copi, 1972, p. 70). The 18.1 %difference is apparently attributable to the difference between explaining and propagandizing. m 28 LANGUAGE TESTS AT SCHOOL what and whether or not he or she meant it, labor and management disagreements about fair wages, and the arms race between the major world powers, are all examples of breakdowns in factive communication once relationship struggles begin. What is very interesting for a theory of pragmatic expectancy grammar is that in normal communication, ways of expressing attitudes are nearly perfectly coordinated with ways of expressing factive information. As a person speaks, boundaries between linguistic segments are nearly perfectly synchronized with changes in bodily postures, gestures, tone of voice, and the like. Research by Condon and Ogston (1971) has shown that the coordination of gestures and verbal output is so finely grained that even the smallest movements of the hands and fingers are nearly perfectly coincident with boundaries in linguistic segments clear down to the level of the phoneme. Moreover, through sound recordings and high resolution motion photography they have been able to demonstrate that when the body movements and facial gestures of a speaker and hearer 'are segmented and displayed consecutively, the speakers and hearer look like puppets moved by the same set of strings' (p. 
158). The demonstrated coordination of mechanisms that usually code factive information and devices that usually code emotive information shows that the anticipatory planning of the speaker and the expectations of the listener must be in close harmony in normal communication. Moreover, from the fact that they are so synchronized we may infer something of the complexity of the advance planning and hypothesizing that normal internalized grammatical systems must enable language users to accomplish. Static grammatical devices which do not incorporate an element of real time would seem hard put to explain some of the empirical facts which demand explanation. Some sort of expectancy grammar, or a system incorporating temporal constraints on linguistic contexts, seems to be required.

D. Language learning as grammar construction and modification

In a sense language is something that we learn, and in another it is a medium through which learning occurs. Colin Cherry (1957) has said that we never feel we have fully grasped an idea until we have 'jumped on it with both verbal feet.' This seems to imply that language is not just a means of expressing ideas that we already have, but rather that it is a means of discovering ideas that we have not yet fully discovered. John Dewey argued that language was not just a means of 'expressing antecedent thought', rather that it was a basis for the very act of creative thinking itself. He wryly observed that the things that a person says often surprise himself more than anyone else. Alice in Through the Looking Glass seems to have the same thought instilled in her own creative imagination through the genius of Lewis Carroll. She asks, 'How can I know what I am going to say until I have already said it?'

Because of the nature of human limitations and because of the complexities of our universe of experience, in order to cope with the vastness of that diversity the mind categorizes and systematizes elements into hierarchies and sequences of them. Not only is the universe of experience more complex than we can perceive it to be at a given moment of time, but the depths of our memories have registered untold millions of details about previous experience that are beyond the grasp of our present consciousness. Our immediate awareness can be thought of as an interface between external reality and the mind. It is like a corridor of activity where incoming elements of experience are processed and where the highly complex activities of thinking and language communication are effected. The whole of our cognitive experience may be compared to a more or less constant stream of complex and interrelated objects passing back and forth through this center of activity.

Because of the connections and interrelationships between incoming elements, and since they tend to cluster together in predictable ways, we learn to expect certain kinds of things to follow from certain others. When you turn the corner on the street where you live you expect to see certain familiar buildings, yards, trees, and possibly your neighbor's dog with teeth bared. When someone speaks to you and you turn in his direction, you expect to see him by looking in the direction of the sound you have just heard. These sorts of expectations, whether they are learned or innate, are so commonplace that they seem trivial. They are not, however. Imagine the shock of having to face a world in which such expectations stopped being correct.
Think what it would be like to walk into your living room and find yourself in a strange place. Imagine walking toward someone and getting farther from him with every step. The violations of our commonest expectations are horror-movie material that makes earthquakes and hurricanes seem like Disneyland.

Man's great advantage over other organisms, which are also prisoners of time and space, is his ability to learn and use language to systematize and organize experience more effectively. Through the use of language we may broaden or narrow the focus of our attention much the way we adjust the focus of our vision. We may think in terms of this sentence, or today, or this school year, or our lifetime, or known history, and so on. Regardless of how broad or narrow our perspective, there is a sequence of elements attended to by our consciousness within that perspective. The sequence itself may consist of relatively simple elements, or sets of interrelated and highly structured elements, but there must be a sequence because the totality of even a relatively simple aspect of our universe is too complex to be taken in at one gulp. We must deal with certain things ahead of others. In a sense, we must take in elements single file at a given rate, so that within the span of immediate consciousness, the number of elements being processed does not exceed certain limits.

In a characteristic masterpiece publication, George Miller (1956) presented a considerable amount of evidence from a wide variety of sources suggesting that the number of separate things that our consciousness can handle at any one time is somewhere in the neighborhood of seven, plus or minus one or two. He also pointed out that human beings overcome this limitation in part by what he calls 'chunking'. By treating sequences or clusters of elements as unitary chunks (or members of paradigms or classes) the mind constructs a richer cognitive system. In other words, by setting up useful categories of sequences, and categories of sequences of categories, our capacity to have correct expectations is enhanced - that is, we are enabled to have correct expectations about more objects, or more complex sorts of objects (in the most abstract sense of 'object') without any greater cost to the cognitive system.

All of this is merely a way of talking about learning. As sequences of elements at one level are organized into classes at a higher order of abstraction, the organism can be said to be constructing an appropriate expectancy grammar, or learning. A universal consequence of the construction and modification of an appropriate expectancy grammar is that the processing of sequences of elements that conform to the constraints of the grammar is thus enhanced. Moreover, it may be hypothesized that highly organized sequences of elements that are presented in contexts where the basis for the organization can be discovered will be more conducive to the construction of an appropriate expectancy grammar than the presentation of similar sequences without appropriate sorts of context.

We are drawn to the generalization that there is an extremely important parallel between the normal use of language and the learning of a language. The learner is never quite in the position of having no expectations to begin with. Even the newborn infant apparently has certain innate expectancies, e.g., that sucking its mother's breast will produce a desired effect.
In fact, experiments by Bower (1971, 1974) seem to point to the conclusion that an infant is born with certain expectations of a much more specific sort - for example, the expectation that a seen object should have some tangible solidity to it. He proved that infants at surprisingly early ages were astonished when they passed their hands through the space occupied by what appeared to be a tangible object. However, his experiments show that infants apparently have to learn to expect entities (such as mother) to appear in only one place at one time. They also seem to have to learn that a percept of a moving object is caused by the same object as the percept of that same moving object when it comes to rest.

The problem, it would seem, from an educational point of view is how to take advantage of the expectancies that a learner has already acquired in trying to teach new material. The question is, what does the learner already know, and how can that knowledge be optimally utilized in the presentation of new material? It has been demonstrated many times over that learning of verbal material is enhanced if the meaningfulness of the material is maximized from the learner's point of view. An unpronounceable sequence of letters like gbntmbwk is more difficult to learn and to recall than, say, nox ems glerf, in spite of the fact that the latter is a longer sequence of letters. The latter is easier because it conforms to some of the expectations that English speakers have acquired concerning phonological and graphological elements. A phrase like colorless green ideas conforms less well to our acquired expectancies than beautiful fall colors. Given appropriate contexts for the latter and the lack of them for the most part for the former, the latter should also be easier to learn to use appropriately than the former. A nonsensical passage like the one Mr Foote invented to stump Mr Macklin would be more difficult to learn than normal prose. The reason is simple enough. Learners know more about normal prose before the learning task begins.

Language programs that employ fully contextualized and maximally meaningful language necessarily optimize the learner's ability to use previously acquired expectancies to help discover the pragmatic mappings of utterances in the new language onto extralinguistic contexts. Hence, they would seem to be superior to programs that expect learners to acquire the ability to use a language on the basis of disconnected lists of sentences in the form of pattern drills, many of which are not only unrelated to meaningful extralinguistic contexts, but which are intrinsically unrelatable.

If one carefully examines language teaching methods and language learning settings which seem to be conducive to success in acquiring facility in the language, they all seem to have certain things in common. Whether a learner succeeds in acquiring a first language because he was born in the culture where that language was used, or was transported there and forced to learn it as a second language; whether a learner acquires a second language by hiring a tutor and speaking the language incessantly, or by marrying a tutor, or by merely maintaining a prolonged relationship with someone who speaks the language; whether the learner acquires the language through the command approach used successfully by J. J.
Asher (1969, 1974), or the old silent method (Gattegno, 1963), or through a set of films of communicative exchanges (Oller, 1963-65), or by joining in a bilingual education experiment (Lambert and Tucker, 1972), certain sorts of data and motivations to attend to them are always present. The learner must be exposed to linguistic contexts in their peculiar pragmatic relationships to extralinguistic contexts, and the learner must be motivated to communicate with people in the target language by discovering those pragmatic relationships.

Although we have said little about education in a broader sense, everything said to this point has a broader application. In effect, the hypothesis concerning pragmatic expectancy grammar as a basis for explaining success and failure in language learning and language teaching can be extended to all other areas of the school curriculum in which language plays a large part. We will return to this issue in Chapter 14 where we discuss reading curricula and other language based parts of curricula in general. In particular we will examine research into the developing language skills of children in Brisbane, Australia (Hart, Walker, and Gray, 1977).

E. Tests that invoke the learner's grammar

When viewed from the vantage point assumed in this chapter, language testing is primarily a task of assessing the efficiency of the pragmatic expectancy grammar the learner is in the process of constructing. In order for a language test to achieve validity in terms of the theoretical construct of a pragmatic expectancy grammar, it will have to invoke and challenge the efficiency of the learner's developing grammar. We can be more explicit. Two closely interrelated criteria of construct validity may be imposed on language tests: first, they must cause the learner to process (either produce or comprehend, or possibly to comprehend, store, and recall, or some other combination) temporal sequences of elements in the language that conform to normal contextual constraints (linguistic and extralinguistic); second, they must require the learner to understand the pragmatic interrelationship of linguistic contexts and extralinguistic contexts.

The two validity requirements just stated are like two sides of the same coin. The first emphasizes the sequential constraints specified by the grammar, and the second emphasizes the function of the grammar in relating sequences of elements in the language to states of affairs outside of language. In subsequent chapters we will often refer to these validity requirements as the pragmatic naturalness criteria. We will explore ways of accomplishing such assessment in Chapter 3, and in greater detail in Part Three which includes Chapters 10 through 14. Techniques that fail to meet the naturalness criteria are discussed in Part Two - especially in Chapter 8. Multiple choice testing procedures are discussed in Chapter 9.

KEY POINTS
1. To understand the problem of constructing valid language tests, it is essential to understand the nature of the skill to be tested.
2. Two aspects of language in use need to be distinguished: the factive (or cognitive) aspect of language use has to do with the coding of information about states of affairs by using words, phrases, clauses, and discourse; the emotive (or affective) aspect of language use has to do with the coding of information about attitudes and interpersonal relationships by using facial expression, gesture, tone of voice, and choice of words.
These two aspects of language use are intricately interrelated.
3. Two major kinds of context are distinguished: linguistic context consists of verbal and gestural aspects; and extralinguistic context similarly consists of objective and subjective aspects.
4. The systematic correspondences between linguistic and extralinguistic contexts are referred to as pragmatic mappings.
5. Pragmatics asks how utterances (and of course other forms of language in use) are related to human experience.
6. In relation to the factive aspect of coding information about states of affairs outside of language, it is asserted that language is always an abbreviation for a much more complete and detailed sort of knowledge.
7. An important aspect of the coding of information in language is the anticipatory planning of the speaker and the advance hypothesizing of the listener concerning what is likely to be said next.
8. A pragmatic expectancy grammar is defined as a psychologically real system that sequentially orders linguistic elements in time and in relation to extralinguistic contexts in meaningful ways.
9. As linguistic sequences become more highly constrained by grammatical organization of the sorts illustrated, they become easier to process.
10. Whereas coding devices for factive information are typically digital (either on or off, present or absent), coding devices for emotive information are usually analogical (continuously variable). A tone of voice which indicates excitement may vary with the degree of excitement, but a digital device for, say, referring to a pair of glasses cannot be whispered to indicate very thin corrective lenses and shouted to indicate thick ones. The word eyeglasses does not have such a continuous variability of meaning, but a wild-eyed shout probably does mean a greater degree of intensity than a slightly raised voice.
11. Where there is a conflict between emotively coded information and factive level information, the former usually overrides the latter.
12. When relationship struggles begin, factive level communication usually ends. Examples are the wage-price spiral and the arms race.
13. The coding of factive and emotive information is very precisely synchronized, and the gestural movements of speaker and listener in a typical communicative exchange are also timed in surprisingly accurate cadence.
14. Some sort of grammatical system incorporating the element of real time and capable of substantial anticipatory-expectancy activity seems required to explain well known facts of normal language use.
15. Language is both an object and a tool of learning. Cherry suggests that we not only express ideas in words, but that we in fact discover them by putting them into words.
16. Language learning is construed as a process of constructing an appropriate expectancy generating system. Learning is enhancing one's capacity to have correct expectations about the nature of experience.
17. It is hypothesized that language teaching programs (and by implication educational programs in general) will be more effective if they optimize the learner's opportunities to take advantage of previously acquired expectancies in acquiring new knowledge.
18. It is further hypothesized that the data necessary to language acquisition are what are referred to in this book as pragmatic mappings - i.e., the systematic correspondences between linguistic and extralinguistic contexts.
In addition to opportunity, the only other apparent necessity is sufficient motivation to operate on the requisite data in appropriate ways.
19. Valid language tests are defined as those tests which meet the pragmatic naturalness criteria - namely, those which invoke and challenge the efficiency of the learner's expectancy grammar, first by causing the learner to process temporal sequences in the language that conform to normal contextual constraints, and second by requiring the learner to understand the systematic correspondences of linguistic contexts and extralinguistic contexts.

DISCUSSION QUESTIONS
1. Why is it so important to understand the nature of the skill you are trying to test? Can you think of examples of tests that have been used for educational or other decisions but which were not related to a careful consideration of the skill or knowledge they purported to assess? Study closely a test that is used in your school or that you have taken at some time in the course of your educational experience. How can you tell if the test is a measure of what it purports to measure? Does the label on the test really tell you what it measures?
2. Look for examples in your own experience illustrating the importance of grammatically based expectancies. Riddles, puns, jokes, and parlor games are good sources. Speech errors are equally good illustrations. Consider the example of the little girl who was asked by an adult where she got her ice cream. She replied, 'All over me,' as she looked sheepishly at the vanilla and chocolate stains all over her dress. How did her expectations differ from those of the adult who asked the question?
3. Keep track of listening or reading errors where you took a wrong turn in your thinking and had to do some retreading farther down the line. Discuss the source of such wrong turns.
4. Consider the sentences: (a) The boy was bucked off by the pony, and (b) The boy was bucked off by the barn (example from Woods, 1970). Why does the second sentence require a mental double-take? Note similar examples in your reading for the next few days. Write down examples and be prepared to discuss them with your class.

SUGGESTED READINGS
1. George A. Miller, 'The Magical Number Seven Plus or Minus Two: Some Limits on Our Capacity for Processing Information,' Psychological Review 63, 1956, 81-97.
2. Donald A. Norman, 'In Retrospect,' Memory and Attention. New York: Wiley, 1969, pp. 177-181.
3. Part VI of Focus on the Learner. Rowley, Mass.: Newbury House, 1973, pp. 265-300.
4. Bernard Spolsky, 'What Does It Mean to Know a Language or How Do You Get Someone to Perform His Competence?' In J. W. Oller, Jr. and J. C. Richards (eds.) Focus on the Learner. Rowley, Mass.: Newbury House, 1973, 164-76.

3 Discrete Point, Integrative, or Pragmatic Tests

A. Discrete point versus integrative testing
B. A definition of pragmatic tests
C. Dictation and cloze procedure as examples of pragmatic tests
D. Other examples of pragmatic tests
E. Research on the validity of pragmatic tests
1. The meaning of correlation
2. Correlations between different language tests
3. Error analysis as an independent source of validity data

Not all that glitters is gold, and not everything that goes by the name is twenty-four karat. Neither are all tests which are called language tests necessarily worthy of the name, and some are better than others.
This chapter deals with three classes of tests that are called measures of language - but it will be argued that they are not equal in effectiveness. It is claimed that only tests which meet the pragmatic naturalness criteria defined in Chapter 2 are language tests in the most fundamental sense of what language is and how it functions.

A. Discrete point versus integrative testing

In recent years, a body of literature on language testing has developed which distinguishes two major categories of tests. John Carroll (1961, see the Suggested Readings at the end of this chapter) was the person credited with first proposing the distinction between discrete point and integrative language tests. Although the types are not always different for practical purposes, the theoretical bases of the two approaches contrast markedly and the predictions concerning the effects and relative validity of different testing procedures also differ in fundamental ways depending on which of the two approaches one selects. The contrast between these two philosophies, of course, is not limited to language testing per se, but can be seen throughout the whole spectrum of educational endeavor.

Traditionally, a discrete point test is one that attempts to focus attention on one point of grammar at a time. Each test item is aimed at one and only one element of a particular component of a grammar (or perhaps we should say hypothesized grammar), such as phonology, syntax, or vocabulary. Moreover, a discrete point test purports to assess only one skill at a time (e.g., listening, or speaking, or reading, or writing) and only one aspect of a skill (e.g., productive versus receptive or oral versus visual). Within each skill, aspect, and component, discrete items supposedly focus on precisely one and only one phoneme, morpheme, lexical item, grammatical rule, or whatever the appropriate element may be. (See Lado, 1961, in Suggested Readings at the end of this chapter.) For instance, a phonological discrete item might require an examinee to distinguish between minimal pairs, e.g., pill versus peel, auditorily presented. An example of a morphological item might be one which requires the selection of an appropriate suffix such as -ness or -ity to form a noun from an adjective like secure, or sure. An example of a syntactic item might be a fill-in-the-blank type where the examinee must supply the suffix -s as in He walk ___ to town each morning now that he lives in the city.1

The concept of an integrative test was born in contrast with the definition of a discrete point test. If discrete items take language skill apart, integrative tests put it back together. Whereas discrete items attempt to test knowledge of language one bit at a time, integrative tests attempt to assess a learner's capacity to use many bits all at the same time, and possibly while exercising several presumed components of a grammatical system, and perhaps more than one of the traditionally recognized skills or aspects of skills. However, to base a definition of integrative language testing on what would appear to be its logical antithesis and in fact its competing predecessor is to assume a fairly limiting point of view. It is possible to look to other sources for a theoretical basis and rationale for so-called integrative tests.

1 Other discrete item examples are offered in Chapter 8 where we return to the topic of discrete point tests and examine them in greater detail.

B.
B. A definition of pragmatic tests

The term pragmatic test has sometimes been used interchangeably with the term integrative test in order to call attention to the possibility of relating integrative language testing procedures to a theory of pragmatics, or pragmatic expectancy grammar. Whereas integrative testing has been somewhat loosely defined in terms of what discrete point testing is not, it is possible to be somewhat more precise in saying what a pragmatic test is: it is any procedure or task that causes the learner to process sequences of elements in a language that conform to the normal contextual constraints of that language, and which requires the learner to relate sequences of linguistic elements via pragmatic mappings to extralinguistic context.

Integrative tests are often pragmatic in this sense, and pragmatic tests are always integrative. There is no ordinary discourse situation in which a learner might be asked to listen to and distinguish between isolated minimal pairs of phonological contrasts. There is no normal language use context in which one's attention would be focussed on the syntactic rules involved in placing appropriate suffixes on verb stems or in moving the agent of an active declarative sentence from the front of the sentence to the end in order to form a passive (e.g., The dog bit John in the active form becoming John was bitten by the dog in the passive). Thus, discrete point tests cannot be pragmatic, and conversely, pragmatic tests cannot be discrete point tests. Therefore, pragmatic tests must be integrative. But integrative language tasks can be conceived which do not meet one or both of the naturalness criteria which we have imposed in our definition of pragmatic tests. If a test merely requires an examinee to use more than one of the four traditionally recognized skills and/or one or more of the traditionally recognized components of grammar, it must be considered integrative. But to qualify as a pragmatic test, more is required.

In order for a test user to say something meaningful (valid) about the efficiency of a learner's developing grammatical system, the pragmatic naturalness criteria require that the test invoke and challenge that developing grammatical system. This requires processing sequences of elements in the target language (even if it is the learner's first and only language) subject to temporal contextual constraints. In addition, the tasks must be such that for examinees to do them, linguistic sequences must be related to extralinguistic contexts in meaningful ways. Examples of tasks that do not qualify as pragmatic tests include all discrete point tests; the rote recital of sequences of material without attention to meaning; and the manipulation of sequences of verbal elements, possibly in complex ways, but in ways that do not require awareness of meaning. In brief, if the task does not require attention to meaning in temporally constrained sequences of linguistic elements, it cannot be construed as a pragmatic language test. Moreover, the constraints must be of the type that are found in normal uses of the language, not merely in some classroom setting that may have been contrived according to some theory of how languages should be taught. Ultimately, the question of whether or not a task is pragmatic is an empirical one. It cannot be decided by theory-based preferences or opinion polls.
C. Dictation and cloze procedure as examples of pragmatic tests

The traditional dictation, rooted in the distant past of language teaching, is an interesting example of a pragmatic language testing procedure. If the sequences of words or phrases to be dictated are selected from normal prose, or dialogue, or some other natural form of discourse (or perhaps if the sequences are carefully contrived to mirror normal discourse, as in well-written fiction), and if the material is presented orally in sequences that are long enough to challenge the short term memory of the learners, a simple traditional dictation meets the naturalness requirements for pragmatic language tests. First, such a task requires the processing of temporally constrained sequences of material in the language, and second, the task of dividing up the stream of speech and writing down what is heard requires understanding the meaning of the material - i.e., relating the linguistic context (which in a sense is given) to the extralinguistic context (which must be inferred).

Although an inspection of the results of dictation tests with appropriate statistical procedures (as we will see below) shows the technique to be very reliable and highly valid, it has not always been looked on with favor by the experts. For example, Robert Lado (1961) said:

Dictation ... on critical inspection ... appears to measure very little of language. Since the word order is given ... it does not test word order. Since the words are given ... it does not test vocabulary. It hardly tests the aural perception of the examiner's pronunciation because the words can in many cases be identified by context. ... The student is less likely to hear the sounds incorrectly in the slow reading of the words which is necessary for dictation (p. 34).

Other authors have tended to follow Lado's lead:

As a testing device ... dictation must be regarded as generally uneconomical and imprecise (Harris, 1969, p. 5).

Some teachers argue that dictation is a test of auditory comprehension, but surely this is a very indirect and inadequate test of such an important skill (Anderson, 1953, p. 43).

Dictation is primarily a test of spelling (Somaratne, 1957, p. 48).

More recently, J. B. Heaton (1975), though he cites some of the up-to-date research on dictation in his bibliography, devotes less than two pages to dictation as a testing procedure and concludes that

dictation ... as a testing device measures too many different language features to be effective in providing a means of assessing any one skill (p. 186).

Davies (1977) offers much the same criticism of dictation. He suggests that it is too imprecise in diagnostic information, and further that it is apt to have an unfortunate 'washback' effect (namely, in taking on 'the aura of language goals'). Therefore, he argues, it may be desirable to abandon such well-worn and suspect techniques for less familiar and less coherent ones (p. 66). In the rest of the book edited by Allen and Davies (1977) there is only one other mention of dictation. Ingram (1977) in the same volume pegs dictation as a rather weak sort of spelling test (see p. 20).

If we were to rely on an opinion poll, the weight of the evidence would seem to be against dictation as a useful language testing procedure. However, the validity of a testing procedure is hardly the sort of question that can be answered by taking a vote. Is it really necessary to read the material very slowly as is implied by Lado's remarks? The answer is no. It is possible to read slowly, but it is not necessary to do so. In fact, unless the material is presented in sequences long enough to challenge the learner's short term memory, and quickly enough to simulate the normal temporal nature of speech sequences, then perhaps dictation would become a test of spelling as Somaratne and Ingram suggest. However, it is not even necessary to count spelling as a criterion for correctness. Somaratne's remark seems to imply that one must, but research shows that one shouldn't. We will return to this question in particular, namely, the scoring of dictation, and other practical questions in Chapter 10.

The view that a language learner can take dictation (which is presented in reasonably long bursts, say, five or more words between pauses, and where each burst is given at a conversational rate) without doing some very active and creative processing is credible only from the vantage point of the naive examiner who thinks that the learner automatically knows what the examiner knows about the material being dictated. As the famous Swiss linguist pointed out three quarters of a century ago,

... the main characteristic of the sound chain is that it is linear. Considered by itself it is only a line, a continuous ribbon along which the ear perceives no self-sufficient and clear-cut division ... (quoted from lectures compiled by de Saussure's students, Bally, Sechehaye, and Riedlinger, 1959, pp. 103-104).

To prove that the words of a dictation are not necessarily 'given' from the learner's point of view, one only needs to try to write dictation in an unknown language. The reader may try this test: have a speaker of Yoruba, Thai, Mandarin, Serbian or some other language which you do not know say a few short sentences with pauses between them long enough for you to write them down or attempt to repeat them. Try something simple like, Say man, what's happening, or How's life been treating you lately, at a conversational rate. If the proof is not convincing, consider the kinds of errors that non-native speakers of English make in taking dictation. In a research report circulated in 1973, Johansson gave examples of vocabulary errors: eliquants, elephants, and elekvants for the word eloquence. It is possible that the first rendition is a spelling error, but that possibility does not exist for the other renditions. At the phrase level, consider of appearance for of the period, person in facts for pertinent facts, less than justice, lasting justice, last in justice for just injustice. Or when a foreign student at UCLA writes, to find particle man living better and mean help man and boy tellable damage instead of to find practical means of feeding people better and means of helping them avoid the terrible damage of windstorms, does it make sense to say that the words and their order were 'given'?

Though much research remains to be done to understand better what learners are doing when they take dictation, it is clear from the above examples that whatever mental processes they are performing must be active and creative. There is much evidence to suggest that there are fundamental parallels between tasks like taking dictation and using language in a wide variety of other ways.
Among closely related testing procedures are sentence repetition tasks (or 'elicited imitation') which have been used in the testing of children for proficiency in one or more languages or language varieties. We return to this topic in detail in Chapter 10. All of the research seems to indicate that in order for examinees to take dictation, or to repeat utterances that challenge their short term memory, it is necessary not only to make the appropriate discriminations in dividing up the continuum of speech, but also to understand the meaning of what is said.

Another example of a pragmatic language testing procedure is the cloze technique. The best known variety of this technique is the sort of test that is constructed by deleting every fifth, sixth, or seventh word from a passage of prose. Typically each deleted word is replaced by a blank of standard length, and the task set the examinee is to fill in the blanks by restoring the missing words. Other varieties of the procedure involve deleting specific vocabulary items, parts of speech, affixes, or particular types of grammatical markers.

The word cloze was invented by Wilson Taylor (1953) to call attention to the fact that when an examinee fills in the gaps in a passage of prose, he is doing something similar to what Gestalt psychologists call 'closure', a process related to the perception of incomplete geometric figures, for example. Taylor considered words deleted from prose to present a special kind of closure problem. From what is known of the grammatical knowledge the examinee brings to bear in solving such a closure problem, we can appreciate the fact that the problem is a very special sort of closure.

Like dictation, cloze tests meet both of the naturalness criteria for pragmatic language tests. In order to give correct responses (whether the standard of correctness is the exact word that originally appeared at a particular point, or any other word that fully fits the context of the passage), the learner must operate ___ the basis of both immediate and long-range ___ constraints. Whereas some of the blanks in a cloze test (say of the standard variety deleting every nth word) can be filled by attending only to a few words on either side of the blank, as in the first blank in the preceding sentence, other blanks in a typical cloze passage require attention to longer stretches of linguistic context. They often require inferences about extralinguistic context, as in the case of the second blank in the preceding sentence.

The word on seems to be required in the first blank by the words operate and the basis of, without any additional information. However, unless long range constraints are taken into account, the second blank offers many possibilities. If the examinee attended only to such constraints as are afforded by the words from operate onward, it could be filled by such words as missile, legal, or leadership. The intended word was contextual. Other alternatives which might have occurred to the reader, and which are in the general semantic target area, might include temporal, verbal, extralinguistic, grammatical, pragmatic, linguistic, psycholinguistic, sociolinguistic, psychological, semantic, and so on. In taking a cloze test, the examinee must utilize information that is inferred about the facts, events, ideas, relationships, states of affairs, social settings and the like that are pragmatically mapped by the linguistic sequences contained in the passage.
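Since the standard every-nth-word deletion just described is entirely mechanical, it is easy to set out concretely. The following sketch (in Python; the passage, the deletion rate of five, and the function name are invented for illustration, not taken from any published test) constructs a cloze passage and an exact-word scoring key:

    # Build a cloze test by replacing every nth word with a blank of
    # standard length; keep the deleted words as the exact-word key.
    def make_cloze(passage, n=5, blank="______"):
        words = passage.split()
        key = []
        for i in range(n - 1, len(words), n):   # every nth word
            key.append(words[i])
            words[i] = blank
        return " ".join(words), key

    passage = ("The examinee must utilize information that is inferred "
               "about the facts, events, ideas, and relationships that "
               "are pragmatically mapped by the linguistic sequences "
               "of the passage.")
    test_text, key = make_cloze(passage, n=5)
    print(test_text)   # the passage with blanks
    print(key)         # the deleted words, for exact-word scoring

Scoring by the contextually appropriate ('acceptable word') standard mentioned above would of course require human judgement about each response rather than a mechanical key.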
Examples of cases where extralinguistic context and the linguistic context of the passage are interrelated are obvious in so-called deictic words such as here and now, then and there, this and that, pronouns that refer to persons or things, tense indicators, aspect markers on verbs, adverbs of time and place, determiners and demonstratives in general, and a host of others. For a simple example, consider the sentence, A horse was fast when he was tied to a hitching post, and the same animal was also fast when he won a horse-race. If such a sentence were part of a larger context, say on the difficulties of the English language, and if we deleted the first a, the blank could scarcely be filled with the definite article the because no horse has been mentioned up to that point. On the other hand, if we deleted the the before the words same animal, the indefinite article could not be used because of the fact that the horse referred to by the phrase A horse at the beginning of the sentence is the same horse referred to by the phrase the same animal. This is an example of a pragmatic constraint. Consider the oddity of saying, The horse was fast when he was tied to a hitching post, and a same animal was also fast when he won a horse-race.

Even though the pragmatic mapping constraints involved in normal discourse are only partially understood by the theoreticians, and though they cannot be precisely characterized in terms of grammatical systems (at least not yet), the fact that they exist is well-known, and the fact that they can be tested by such pragmatic procedures as the cloze technique has been demonstrated (see Chapter 12).

All sorts of deletions of so-called content words (e.g., nouns, adjectives, verbs, and adverbs), and especially grammatical connectors such as subordinating conjunctions, negatives, and a great many others, carry with them constraints that may range backward or forward across several sentences or more. Such linguistic elements may entail restrictions that influence items that are widely separated in the passage. This places a strain on short term memory which presses the learner's pragmatic expectancy grammar into operation. The accuracy with which the learner is able to supply correct responses can therefore be taken as an index of the efficiency of the learner's developing grammatical system. Ways of constructing, administering, scoring, and interpreting cloze tests and a variety of related procedures for acquiring such indices are discussed in Chapter 12.

D. Other examples of pragmatic tests

Pragmatic testing procedures are potentially innumerable. The techniques discussed so far - dictation, cloze, and variations of them - by no means exhaust the possibilities. Probably they do not even begin to indicate the range of reasonable possibilities to be explored. There is always a danger that minor empirical advances in educational research in particular may lead to excessive dependence on procedures that are associated with the progress. However, in spite of the fact that some of the pragmatic procedures thus far investigated do appear to work substantially better than their discrete point predecessors, there is little doubt that pragmatic tests can also be refined and expanded. It is important that the procedures which now exist and which have been studied should not limit our vision concerning other possibilities.
Rather, they should serve as guideposts for subsequent refinement and development of still more effective and more informative testing procedures. Therefore, the point of this section (and in a broader sense, this entire book) is not to provide a comprehensive list of possible pragmatic testing procedures, but rather to illustrate some of the possible types of procedures that meet the naturalness criteria concerning the temporal constraints on language in use, and the pragmatic mapping of linguistic contexts onto extralinguistic ones. Below, in section E of this chapter, we will discuss evidence concerning the validity of pragmatic tests. (Also, see the Appendix.)

Combined cloze and dictation. The examinee reads material from which certain portions have been deleted and simultaneously (or subsequently) hears the same material without deletions, either live or on tape. The examinee's task is to fill in the missing portions the same as in the usual cloze procedure, but he has the added support of the auditory signal to help him fill in the missing portions. Many variations on this procedure are possible. Single words, or even parts of words, or sequences of words, or even whole sentences or longer segments may be deleted. The less material one deletes, presumably, the more the task resembles the standard cloze procedure, and the more one deletes, the more the task looks like a standard dictation.

Oral cloze procedure. Instead of presenting a cloze passage in a written format, it is possible to use a carefully prepared tape recording of the material with numbers read in for the blanks, or with pauses where blanks occur. Or, it is possible merely to read the material up to the blank, give the examinee the opportunity to guess the missing word, record the response, and at that point either tell the examinee the right answer (i.e., the missing word), or simply go on without any feedback as to the correctness of the examinee's response. Another procedure is to arrange the deletions so that they always come at the end of a clause or sentence. Any of these oral cloze techniques have the advantage of being usable with non-literate populations.

Dictation with interfering noise. Several varieties of this procedure have been used, and for a wide range of purposes. The best known examples are the versions of the Spolsky-Gradman noise tests used with non-native speakers of English. The procedure simply involves superimposing white noise (a wide spectrum of random noise sounding roughly like radio static or a 'shhhh'-ing sound at a constant level) onto taped verbal material. If the linguistic context under the noise is fully meaningful and subject to the normal extralinguistic constraints, this procedure qualifies as a pragmatic testing technique. Variations include noise throughout the material versus noise over certain portions only. It is argued, in any event, that the noise constitutes a situation somewhat parallel to many of the everyday contexts where language is used in less than ideal acoustic conditions, e.g., trying to have a conversation in someone's living room when the television and air conditioner are producing a high level of competing noise, or trying to talk to or hear someone else in the crowded lobby of a hotel, or trying to hear a message over a public address system in a busy air terminal, etc.
Paraphrase recognition. In one version, examinees are asked to read a sentence and then to select from four or five alternatives the best paraphrase for the given sentence. The task may be made somewhat more difficult by having examinees read a paragraph or longer passage and then select from several alternatives the one which best represents the central meaning or idea of the given passage. This task is somewhat similar to telling what a conversation was about, or what the main ideas of a speech were, and the like. Typically, such tests are interpreted as being tests of reading comprehension. However, they are pragmatic language tests inasmuch as they meet the naturalness criteria related to meaning and temporal constraints. A paraphrase recognition task may be either in a written format or an oral format or some combination of them. An example of an oral format comes from the Test of English as a Foreign Language produced by Educational Testing Service, Princeton, New Jersey. Examinees hear a sentence like, John dropped the letter in the mailbox. Then they must choose between (a) John sent the letter; (b) John opened the letter; (c) John lost the letter; (d) John destroyed the letter.² Of course, considerably more complicated items are possible. The discrete point theorist might object that since the first stimulus is presented auditorily and since the choices are then presented in a written format, it becomes problematic to say what the test is a test of - whether listening comprehension, or reading comprehension, or both. This is an issue that we will return to in Chapters 8 and 9, and which will be addressed briefly in the section on the validity of pragmatic tests below. Also, see the Appendix.

² This example and subsequent ones from the TOEFL are based on mimeographed hand-outs prepared by the staff at Educational Testing Service to describe the new format of the TOEFL in relation to the format used from 1961-1975.

Question answering. In one section of the TOEFL, examinees are required to select the best answer from a set of written alternatives to an auditorily presented question (either on record or tape). For instance, the examinee might hear, When did Tom come here? In the test booklet he reads, (a) By taxi; (b) Yes, he did; (c) To study history; and (d) Last night. He must mark on his answer sheet the letter corresponding to the best answer to the given question.

A slightly different question answering task appears in a different section of the test. The examinee hears a dialogue such as:

MAN'S VOICE: Hello Mary. This is Mr Smith at the office. Is Bill feeling any better today?
WOMAN'S VOICE: Oh, yes, Mr Smith. He's feeling much better now. But the doctor says he'll have to stay in bed until Monday.
THIRD VOICE: Where is Bill now?

Possible answers from which the examinee must choose include: (a) At the office; (b) On his way to work; (c) Home in bed; and (d) Away on vacation.

Perhaps the preceding example, and other multiple choice examples, may seem somewhat contrived. For this and other reasons to be discussed in Chapter 9, good items of the preceding type are quite difficult to prepare. Other formats which allow the examinee to supply answers to questions concerning less contrived contexts may be more suitable for classroom applications. For instance, sections of a television or radio broadcast in the target language may be taped. Questions formed in relation to those passages could be used as part of an interview technique aimed at testing oral skills.
A colorful, interesting, and potentially pragmatic testing technique is the Bilingual Syntax Measure (Burt, Dulay, and Hernandez, 1975). It is based on questions concerning colorful cartoon-style pictures like the one shown in Figure 1. The test is intended for children between the ages of four and nine, from kindergarten through second grade. Although the authors of the test have devised a scoring procedure that is essentially aimed at assessing control of less than twenty so-called functors (morphological and syntactic markers like the plural endings on nouns, or tense markers on verbs), the procedure itself is highly pragmatic. First, questions are asked in relation to specific extralinguistic contexts in ways that require the processing of sequences of elements in English, or Spanish, or possibly some other language. Second, those meaningful sequences of linguistic elements in the form of questions must be related to the given extralinguistic contexts in meaningful ways. For instance, in relation to a picture such as the one shown in Figure 1, the child might be asked something like, How come he's so skinny? The questioner indicates the guy pushing the wheelbarrow. The situation is natural enough and seems likely to motivate a child to want to respond. We return to the Bilingual Syntax Measure and a number of related procedures in Chapter 11.

Figure 1. A cartoon drawing illustrating the style of the Bilingual Syntax Measure.

Oral interview. In addition to asking specific questions about pictured or real situations, oral tests may take a variety of other forms. In effect, every opportunity a learner is given to talk in an educational setting can be considered a kind of oral language test. The score on such a test may be only the subjective impression that it makes on the teacher (or another evaluator), or it may be based on some more detailed plan of counting errors. Surprisingly perhaps, the so-called objective procedures are not necessarily more reliable. In fact, they may be less reliable in some cases. Certain aspects of language performances may simply lend themselves more to subjective judgement than they do to quantification by formula. For instance, Richards (1970b) has shown that naive native speakers are fairly reliable judges of word frequencies. Also, it has been known for a long time that subjective rankings of passages of prose are sometimes more reliable than rankings (for relative difficulty) based on readability formulas (Klare, 1974). An institutional technique that has been fairly well standardized by the Foreign Service Institute uses a training procedure for judges who are taught to conduct interviews and to judge performance on the basis of carefully thought-out rating scales. This procedure is discussed, along with the Ilyin Oral Interview (Ilyin, 1972) and Upshur's Oral Communication Test (no date), in Chapter 11.

Composition or essay writing. Most free writing tasks necessarily qualify as pragmatic tests. Because it is frequently difficult to judge examinees relative to one another when they may have attempted to say entirely different sorts of things, and because it is also difficult to say what constitutes an error in writing, various modified writing tasks have been used. For example, there is the so-called dehydrated sentence, or dehydrated essay. The examinee is given a telegraphic message and is asked to expand it.
An instance of the dehydrated sentence is child/ride/bicycle/off embankment/last month. A dehydrated narrative might continue, was taken to hospital/lingered near death/family reunited/back to school/two weeks in hospital. Writing tasks may range from the extreme case of allowing examinees to select their own topic and to develop it, to maximally controlled tasks like filling in blanks in a pre-selected (or even contrived) passage prepared by the teacher or examiner. The blanks might require open-ended responses on the order of whole paragraphs, or sentences, or phrases, or words. In the last case, we have arrived back at a rather obvious form of cloze procedure. Another version of a fairly controlled writing task involves either listening to or reading a passage and then trying to reproduce it from recall. If the original material is auditorily presented, the task becomes a special variety of dictation. This procedure and a variety of others are discussed in greater detail in Chapter 13.

Narration. One of the techniques sometimes used successfully to elicit relatively spontaneous speech samples is to ask subjects to talk about a frightening experience or an accident where they were almost 'shaded out of the picture' (Paul Anisman, personal communication). With very young children, story re-telling, which is a special version of narration, has been used. It is important that such tasks seem natural to the child, however, in order to get a realistic attempt from the examinee. For instance, it is important that the person to whom the child is expected to re-tell the story is not the same person who has just told the story in the first place (he obviously knows it). It should rather be someone who has not (as far as the child is aware) heard the story before - or at least not the child's version.

Translation. Although translation, like other pragmatic procedures, has not been favored by the testing experts in recent years, it still remains, in at least some of its varieties, a viable pragmatic procedure. It deserves more research. It would appear from the study by Swain, Dumas, and Naiman (1974) that if it is used in ways that approximate its normal application in real life contexts, it can provide valuable information about language proficiency. If the sequences of verbal material are long enough to challenge the short-term memory of the examinees, it would appear that the technique is a special kind of pragmatic paraphrase task.

E. Research on the validity of pragmatic tests

We have defined language use and language learning in relation to the theoretical construct of a pragmatic expectancy grammar. Language use is viewed as a process of interacting plans and hypotheses concerning the pragmatic mapping of linguistic contexts onto extralinguistic ones. Language learning is viewed as a process of developing such an expectancy system. Further, it is claimed that a language test must invoke and challenge the expectancy system of the learner in order to assess its efficiency. In all of this discussion, we are concerned with what may be called the construct validity of pragmatic language tests. If they were to stand completely alone, such considerations would fall far short of satisfactorily demonstrating the validity of pragmatic language tests. Empirical tests must be applied to the tests themselves to determine whether or not they are good tests according to some purpose or range of purposes (see Oller and Perkins, 1978). In addition to construct validity, which is related to the question of whether the test meets certain theoretical requirements, there is the matter of so-called content validity and of concurrent validity.

Content validity is related to the question of whether the test requires the examinee to perform tasks that are really the same as or fundamentally similar to the sorts of tasks one normally performs in exhibiting the skill or ability that the test purports to measure. For instance, we might ask of a test that purports to measure listening comprehension for adult foreign students in American universities: does the test require the learner to do the sort of thing that it supposedly measures his ability to do? Or, for a test that purports to measure the degree of dominance of bilingual children in classroom contexts that require listening and speaking, we might ask: does the test require the children to say and do things that are similar in some fundamental way to what they are normally required to do in the classroom? These are questions about content validity.

With respect to concurrent validity, the question of interest is to what extent do tests that purport to measure the same skill(s), or component(s) of a skill (or skills), correlate statistically with each other? Below, we will digress briefly to consider the meaning of statistical correlation. An example of a question concerning concurrent validity would be: do several tests that purport to measure the same thing actually correlate more highly with each other than with a set of tests that purport to measure something different? For instance, do language tests correlate more highly with each other than with tests that are labeled IQ tests? And vice versa. Do tests which are labeled tests of listening comprehension correlate better with each other than they do with tests that purport to measure reading comprehension? And vice versa.

A special set of questions about concurrent validity relate to the matter of test reliability. In the general sense, concurrent validity is about whether or not tests that purport to do the same thing actually do accomplish the same thing (or better, the degree to which they accomplish the same thing). Reliability of tests can be taken as a special case of concurrent validity. If all of the items on a test labeled as a test of writing ability are supposed to measure writing ability, then there should be a high degree of consistency of performance on the various items on that test. There may be differences of difficulty level, but presumably the type of skill to be assessed should be the same. This is like saying there should be a high degree of concurrent validity among items (or tests) that purport to measure the same thing. In order for a test to have a high degree of validity of any sort, it can be shown that it must first have a high degree of reliability.

In addition to these empirical (and statistically determined) requirements, a good test must also be practical and, for educational purposes, we might want to add that it should also have instructional value. By being practical we mean that it should be usable within the limits of time and budget available. It should have a high degree of cost effectiveness.
This may be accomplished in a foreign language classroom by diagnosing student progress (and teacher effectiveness) in more specific ways. In some cases the test itself becomes a teaching procedure in the most obvious sense. In multilingual contexts better knowledge of student abilities to process information coded verbally in one or. more languages can help motivate curricular decisions. Indeed, in monolingual contexts curricular decisions need to be related as much as is possible to the communication skills of students (see Chapter 14). It has been facetiously observed that what we are concerned with when we add the requirements of practicality and instructional value is something we might call true validity, or valid validity. With so many kinds of validity being discussed in the literature today, it does not seem entirely inappropriate to ask somewhat idealistically (and sad to say, not superfluously) for a valid variety of validity that teachers and educators may at least aim for. Toward this end we might examine the results of theoretical investigations of construct validity, practical analyses of the content of tests, and careful study of the intercorrelations among a wide variety of testing procedures to address questions of concurrent validity. 3 We will return to the matter of validity of pragmatic tests arid their patterns of interrelationship as determined by concurrent validity studies after a brief digression to consider the meaning of correlation in the statistical sense of the term. The reade~ who has· some background in statistics·or in the mathematics underlying statistical correlation may want to skip over the next eleven paragraphs and go directly to the discussion of results of statistical correlations between various tests that have been devised to assess language skills. 53 statistically trained rell;der to understand the meaning of correlation enough to appreciate some of the interesting findings of recent research on the reliability and validity of various language testing techniques. There are many excellent texts that deal with correlation more thoroughly and with its application to research designs. The interested reader may want to consult one of the many available references.4; No attempt is made here to achieve· any sort of mathematical rigor - and perhaps it is worth noting that most practical applications of statistical procedures do not conform to all of the nicetie~ necessary for mathematical precision attainable in theory (see Nunnally, 1967, pp. 7-10; for a discussion of this point). Few researchers, however, would therefore deny the usefulness of the applications. Here we are concerned with simple correlation, also known as Pearson product-moment .correlation. To understand the meaning of this statistic, it is first necessary to understand the simpler statistics of the arithmetic mean, the variance, and the standard deviation on which it is based. The arithmetic mean of a set of scores is computed by adding up all of the scores in the set of interest and dividing by the number of scores in the set. This procedure provides a measure of central tendency of the scores. It is like an answer to the question, if we were to take all the amounts of whatever the test measures and distribute an equal amount to each examinee, how much would each one get with none left over? Whereas the mean is an index of where the true algebraic center of the scores is, the variance is an index of how much scores tend to differ from that central point. 
Since the true degree of variability of possible scores on a test tends to be somewhat larger than the variability of scores made by a given group of examinees, the computation oftest variance must correct for this bias. Without going into any detail, it has been proved mathematically that the best estimate of true test variance can be made as follows: first, subtract the mean score from each of the scores on the test and record each of the resulting deviations from the mean (the deviations will be positive quantities for scores larger than the mean, and negative quantities for scores less than the mean); second, square each of the deviations (i.e., multiply each deviation by itself) and record the :t;"esult each time; third, add up all of the squares (note that all of the quantities must be either zero or a positive value since 1, The meaning of correlation. The "purpose here is not to teach the reader to apply correlation necessarily,_ but to help the non3 Another variety of validity sometimes referred to in the literature is face validity. Harris (1969) defines it as 'simply the way the test looks - to the examinees, test administrators, educators, and the like' (p. 21). Since these kinds of opimons are often based on mere experiences with things that have been called tests of such and such a skill in the past, Harris notes that they are not a very important part of determining the validity of tests. Such opinions are ultimately important only to the extent that they affect performance on the test. Where judgements of face validity can be shown to be ill-informed, they should not serve as a basis for the evaluation of testing procedures at all. An excellent text written principally· for ed~cators is Merle Tate, Statistics in Education and Psychology: A First Course. New York: Macmillan, 1965, especially Chapter VII. Or, see Nunnally (l967),or Kerlinger and Pedhazur (1973). 4 / .i, 54 LANGUAGE TESTS AT SCHOOL the square of a negative value is always a positive value); fourth, divide the sum of squares by the number of scores minus one (the subtraction of one at this point is the correction of estimate bias noted at the beginning of this paragraph). The result is the best estimate of the true variance in the population sampled. The standard deviation of the same set of scores is simply the square root of the variance (i.e., the positive number which times itself equals the variance). Hence, the standard deviation and the variance are interconvertible values (the one can be easily derived from the other). Each of them provides an index ofthe overall tendency of the scores to vary from the mathematically defined central quantity (the mean). Conceptually, computing the standard deviation is something like answering the question: if we added to the mean and subtracted from the mean amounts of whatever the test measures, how much would we have to add and subtract on the average to obtain the original set of scores? It can be shown mathematically that for normal distributions of scores, the mean and the standard deviation tell everything there is to know about the distribution of scores. The mean defines the central point about which the scores tend to cluster and their tendency to vary from that central point is the standard deviation. We can now say what Pearson product-moment correlation means. 
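The steps just listed translate directly into a short computation. The following sketch (in Python; the scores are hypothetical, invented purely for illustration) is a minimal rendering of the procedure described above:

    # Mean, best estimate of true test variance (sum of squared
    # deviations divided by one less than the number of scores),
    # and standard deviation as the square root of the variance.
    def mean(scores):
        return sum(scores) / len(scores)

    def variance(scores):
        m = mean(scores)
        deviations = [x - m for x in scores]       # step one
        squares = [d * d for d in deviations]      # step two
        return sum(squares) / (len(scores) - 1)    # steps three and four

    def standard_deviation(scores):
        return variance(scores) ** 0.5

    scores = [72, 85, 90, 66, 78, 81]              # hypothetical test scores
    print(mean(scores), variance(scores), standard_deviation(scores))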
We can now say what Pearson product-moment correlation means. Simply stated, it is an index of the tendency for the scores of a group of examinees on one test to covary (that is, to differ from their respective mean in similar direction and magnitude) with the scores of the same group of examinees on another test. If, for example, the examinees who tend to make high scores on a certain cloze test also tend to make high scores on a reading comprehension test, and if those who tend to make low scores on the reading test also tend to make low scores on the cloze, the two tests are positively correlated. The square of the correlation between any two tests is an index of the variance overlap between them. Perfect correlation will result if the scores of examinees on two tests differ exactly in proportion to each other from their respective means.

One of the conceptually simplest ways to compute the product-moment correlation between two sets of test scores is as follows: first, compute the standard deviation for each test; second, for each examinee, compute the deviation from the mean on the first test and the deviation from the mean on the second test; third, multiply the deviation from the mean on test one times the deviation from the mean on test two for each examinee (whether the value of the deviation is positive or negative is important in this case because it is possible to get negative values on this operation); fourth, add up the products of deviations from step three (note that the resulting quantity is conceptually similar to the sum of squares of deviations in the computation of the variance of a single set of scores); finally, divide the quantity from step four by the standard deviation of test one times the standard deviation of test two times one less than the number of examinees. The resulting value is the correlation between the two tests.

Correlations may be positive or negative. We have already considered an example of positive correlation. An instance of negative correlation would result if we counted correct responses on, say, a cloze test, and errors, say, on a dictation. Thus, a high score on the cloze test would (if the tests were correlated positively, as in the previous example) correspond to a low score on the other. High scorers on the cloze test would typically be low scorers on the dictation (that is, they would make fewer errors), and low scorers on the cloze would be high scorers on the dictation (that is, they would make many errors). However, if the score on the cloze test were converted to an error count also, the correlation would become positive instead of negative. Therefore, in empirical testing research, it is most often the magnitude of correlation between two tests that is of interest rather than the direction of the relationship. However, the value of the correlation (plus or minus) becomes interesting whenever it is surprising. We will consider several such cases in Chapter 5 when we discuss empirical research with attitudes and motivations.
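Again the steps can be set out concretely. The sketch below reuses the mean and standard_deviation functions from the preceding example; the two sets of scores are hypothetical, invented purely for illustration:

    # Product-moment correlation by the steps just given: multiply each
    # examinee's two deviations, sum the products, and divide by the
    # product of the two standard deviations and one less than the
    # number of examinees.
    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        products = [(x - mx) * (y - my) for x, y in zip(xs, ys)]
        return sum(products) / (standard_deviation(xs)
                                * standard_deviation(ys)
                                * (len(xs) - 1))

    cloze = [31, 42, 55, 28, 47, 39]        # hypothetical cloze scores
    dictation = [60, 71, 88, 55, 80, 65]    # hypothetical dictation scores
    print(pearson(cloze, dictation))        # a value between -1 and +1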
What about the magnitude of correlations? When should a correlation be considered high or low? Answers to such questions can be given only in relation to certain purposes, and then only in general and somewhat imprecise terms. In the first place, the size of correlations cannot be linearly interpreted. A correlation of .90 is not three times larger than a correlation of .30 - rather it is nine times larger. It is necessary to square the correlation in each case in order to make a more meaningful comparison. Since .90 squared is .81 and .30 squared is .09, and since .81 is nine times larger than .09, a correlation of .90 is actually nine times larger than a correlation of .30. Computationally (or perhaps we should say mathematically), a correlation is like a standard deviation, while the square of the correlation (or the coefficient of determination, as it is called) is on the same order as the variance. Indeed, the square of the correlation of two tests is an index of the amount of variance overlap between the two tests - or put differently, it is an index of the amount of variance that they have in common. (For more thorough discussion, see Tate, 1965, especially Chapter VII.)

With respect to reliability studies, correlations above .95 between, say, two alternate forms of the same test are considered quite adequate. Statistically, such a correlation means that the test forms overlap in variance at about the .90 level. That is, ninety percent of the total variance in both tests is present in either one by itself. One could feel quite confident that the tests would tend to produce very similar results if administered to the same population of subjects. What can be known from the one is almost identical to what can be known from the other, with a small margin of error. On the other hand, a reliability index of .60 for alternate forms of the same test would not be considered adequate for most purposes. The two tests in this latter instance are scarcely interchangeable. It would hardly be justifiable to say that they are very reliable measures of whatever they are aimed at assessing. (However, one cannot say that they are necessarily measuring different things on the basis of such a correlation. See Chapter 7 on statistical traps.)

In general, whether the question concerns reliability or validity, low correlations are less informative than high correlations. An observed low correlation between two tests that are expected to correlate highly is something like the failure of a prospector in search of gold. It may be that there is no gold, or it may be that the prospector simply hasn't turned the right stones or panned the right spots in the stream. A low correlation may result from the fact that one of the tests is too easy or too hard for the population tested. It may mean that one of the tests is unreliable, or that both of them are unreliable. Or a low correlation may result from the fact that one or both tests do not measure what they are supposed to measure (i.e., are not valid), or merely that one of them (or both) has (or have) a low degree of validity.

A very high correlation is less difficult to interpret. It is more like a gold strike. The richer the strike, that is, the higher the correlation, the more easily it can be interpreted. A correlation of .85 or .90 between two tests that are superficially very different would seem to be evidence that they are tapping the same underlying skill or ability. In any event, it means at face value that the two tests share .72 or .81 of the total variance in both tests. That is, between 72 and 81 percent of what can be known from the one can be known equally well from the other.
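The squaring principle discussed in the preceding paragraphs can be verified directly, using the figures cited in the text:

    # The square of a correlation (the coefficient of determination)
    # indexes the proportion of variance two tests share.
    for r in (0.30, 0.85, 0.90, 0.95):
        print(r, "squared is", round(r * r, 4))
    # prints .09, .7225, .81, and .9025 - the figures used in the text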
A further point regarding the interpretation of reliability estimates should be made. Insofar as a reliability estimate is accurate, its square may be interpreted as the amount of non-random variance in the test in question. It follows that the validity of a test can never exceed its reliability, and further that validity indices can equal reliability indices only in very special circumstances - namely, when all the reliable (non-random) variance in one test is also generated by the other. We shall return to this very important fact about correlations as reliability indices and correlations as validity indices below. In the meantime, we should keep in mind that a correlation between two tests should normally be read as a reliability index if the two tests are considered to be different forms of the same test or testing procedure. However, if the two tests are considered to be different tests or testing procedures, the correlation between them should normally be read as a validity index.

2. Correlations between different language tests. One of the first studies that showed surprisingly high correlations between substantially different language tests was done by Rebecca Valette (1964) in connection with the teaching of French as a foreign language at the college level. She used a dictation as part of a final examination for a course in French. The rest of the test included: (1) a listening comprehension task in a multiple choice format that contained items requiring (a) identification of a phrase heard on tape, (b) completion of sentences heard on tape, and (c) answering of questions concerning paragraphs heard on tape; (2) a written sentence completion task of the fill-in-the-blank variety; and (3) a sentence writing task where students were asked to answer questions in the affirmative or negative or to follow instructions entailed in an imperative sentence like, Tell John to come here, where a correct written response might be, John, come here.

For two groups of subjects, all first semester French students, one of which had practiced taking dictation and the other of which had not, the correlations between dictation scores and the other test scores combined were .78 and .89, respectively. Valette considered these correlations to be notably high and concluded that the 'dictée' was measuring the same basic overall skills as the longer and more difficult to prepare French examination. Valette concluded that the difference in the two correlations could be explained as a result of a practice effect that reduced the validity of dictation as a test for students who had practiced taking dictation. However, the two groups also had different teachers, which suggests another possible explanation for the differences. Moreover, Kirn (1972), in a study of dictation as a testing technique at UCLA, found that extensive practice in taking dictation in English did not result in substantially higher scores. Another possible explanation for the differences in correlations between Valette's two groups might be that dictation is a useful teaching procedure, in which case the difference might be evidence for real learning.

Nevertheless, one of the results of Valette's study has been replicated on numerous occasions with other tests and with entirely different populations of subjects - namely, that dictation does correlate at surprisingly high levels with a vast array of other language tests. For instance, in a study at UCLA, a dictation task included as part of the UCLA English as a Second Language Placement Examination Form 1 correlated better with every other part of that test than any other two parts correlated with each other (Oller, 1970, Oller and Streiff, 1975). This would seem to suggest that dictation was accounting for more of the total variance in the test than any other single part of that test. The correlation between dictation and the total score on all other test parts not including the dictation (Vocabulary, Grammar, Composition, and Phonology; for description see Oller and Streiff, 1975, pp. 73-5) was .85. Thus the dictation was accounting for no less than 72% of the variance in the entire test.

In a later study, using a different form of the UCLA placement test (Form 2C), dictation correlated as well with a cloze test as either of them did with any of the other subtests on the ESLPE 2C (Oller and Conrad, 1971). This was somewhat surprising in view of the striking differences in format of the two tests. The dictation is heard and written, while the cloze test is read with blanks to be filled in. The one test utilizes an auditory mode primarily whereas the other uses mainly a visual mode. Why would they not correlate better with more similar tasks than with each other? For instance, why would the cloze test not correlate better with a reading comprehension task or a vocabulary task (both were included among the subtests on the ESLPE 2C)? The correlation between cloze and dictation was .82 while the correlations between cloze and reading, and cloze and vocabulary, were .80 and .59, respectively. This surprising result confirmed a similar finding of Darnell (1968) who found that a cloze task (scored by a somewhat complicated procedure to be discussed in Chapter 12) correlated better with the Listening Comprehension subscore on the TOEFL than it did with any other part of that examination (which also includes a subtest aimed at reading comprehension and one aimed at vocabulary knowledge). The correlation between the cloze test and the Listening Comprehension subtest was .73 and the correlation with the total score on all subtests of the TOEFL combined was .82. In another study, of the UCLA ESLPE 2A Revised, correlations of .74, .84, and .85 were observed between dictations and cloze tests (Oller, 1972). Also, three different cloze tests used with different populations of subjects (above 130 in number in each case) correlated above .70 in six cases with grammar tasks and paraphrase recognition tasks. The cloze test was scored by the contextually-appropriate ('acceptable word') method; see Chapter 12.

While Valette was looking at the performance of students in a formal language classroom context where the language being studied was not spoken in the surrounding community, the studies at UCLA and the one by Darnell (at the University of Colorado) examined populations of students in the United States who were in social contexts where English was in fact the language of the surrounding community. Yet the results were similar in spite of the contrasts in tested populations and regardless of the contrasts in the tests used (the various versions of the UCLA ESLPE, the TOEFL, and the foreign language French exams). Similar results are available, however, from still more diverse settings. Johansson (1972) reported on the use of a combined cloze and dictation procedure which produced essentially the same results as the initial studies with dictation at UCLA. He found that his combined cloze and dictation procedure correlated better with scores on several language tests than any of the other tests correlated with each other. It is noteworthy that Johansson's subjects were Swedish college students who had learned English as a foreign language. The correlation between his cloze-dictation and a traditional test of listening comprehension was .83.

In yet another context, Stubbs and Tucker (1974) found that a cloze test was generally the best predictor of various sections on the American University at Beirut English Entrance Examination. Their subject population included mostly native speakers of Arabic learning English as a foreign or second language. The cloze test appeared to be superior to the more traditional parts of the EEE in spite of greater ease of preparation of the cloze test. In particular, the cloze blanks seemed to discriminate better between high scorers and low scorers than did the traditional discrete point types of items (see Chapter 9 on item analysis and especially item discrimination).

A study by Pike (1973) with such diverse techniques as oral interview (FSI type), essay ratings, cloze scores, the subscores on the TOEFL and a variety of other tasks yielded notably strong correlations between tasks that could be construed as pragmatic tests. He tested native speakers of Japanese in Japan, and Spanish speakers in Chile and Peru. There were some interesting surprises in the simple correlations which he observed. For instance, the essay scores correlated better with the subtest labeled Listening Comprehension for all three populations tested than with any of the other tests, and the cloze scores (by Darnell's scoring method) correlated about as highly with interview ratings as did any other pairs of subtests in the data.

The puzzle remains. Why should tests that look so different in terms of what they require people to do correlate so highly? Or more mysterious still, why should tests that purport to measure the same skill or skills fail to correlate as highly with each other as they correlate with other tests that purport to measure very different skills? A number of explanations can be offered, and the data are by no means all in at this point. It would appear that the position once favored by discrete point theorists has been excluded by experimental studies - that position was that different forms of discrete point tests aimed at assessing the same skill, or aspect of a skill, or component of a skill, ought to correlate better with each other than with, say, integrative (especially, pragmatic) tests of a substantially different sort. This position now appears to have been incorrect. There is considerable evidence to show that in a wide range of studies with a substantial variety of tests and a diverse selection of subject populations, discrete point tests do not correlate as well with each other as they do with integrative tests. Moreover, integrative tests of very different types (e.g., cloze versus dictation) correlate even more highly with each other than they do with language tests which discrete point theory would identify as being more similar. The correlations between diverse pragmatic tests, in other words, generally exceed the correlations observed between quite similar discrete point tests.
This would seem to be a strong disproof of the early claims of discrete point theories of testing, and one will search in vain for an explanation in the strong versions of discrete point approaches (see especially Chapter 8, and in fact all of Part Two). Having discarded the strong version of what might be termed the discrete point hypothesis - namely, that tests aimed at similar elements, components, aspects of skills, or skills ought to correlate more highly than tests that are apparently requiring a greater diversity of performances - we must look elsewhere for an explanation of the pervasive results of correlation studies. Two explanations have been offered. One is based on the pragmatic theory advocated in Chapter 2 of this book, and the other is a modified version of the discrete point argument (Upshur, 1976, discusses this view though it is doubtful whether or not he thinks it is correct). From an experimental point of view, it is obviously preferable to avoid advocacy and let the available data or obtainable data speak for themselves (Platt, 1964).

One hypothesis is that pragmatic language tests must correlate highly if they are valid language tests. Therefore, the results of correlation studies can be easily understood, or at least straightforwardly interpreted, as evidence of the fundamental validity of the variety of language tests that have been shown to correlate at such remarkably high levels. The reason that a dictation and a cloze test (which are apparently such different tasks) intercorrelate so strongly is that both are effective devices for assessing the efficiency of the learner's developing grammatical system, or language ability, or pragmatic expectancy grammar, or cognitive network of the language, or whatever one chooses to call it. There is substantial empirical evidence to suggest that there may be a single unitary factor that accounts for practically all of the variance in language tests (Oller, 1976a). Perhaps that factor can be equated with the learner's developing grammatical system. One rather simple but convincing source of data on this question is the fact that the validity estimates on pragmatic tests of different sorts (i.e. the correlations between different ones) are nearly equal to the reliability estimates for the same tests. From this it follows that the tests must be measuring the same thing to a substantial extent. Indeed, if the validity estimates were consistently equal to, or nearly equal to, the reliability estimates, we would be forced to conclude that the tests are essentially measures of the same factor. This is an empirical question, however, and another plausible alternative remains to be ruled out. Upshur (1976) suggests that perhaps the grammatical system of the learner will account for a large and substantial portion of the variance in a wide variety of language tests. This central portion of variance might explain the correlations mentioned above; but there could still be meaningful portions of variance left which would be attributable to components of grammar or aspects of language skill, or the traditional skills themselves.
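The force of the comparison between validity and reliability estimates can be stated more formally with the classical correction for attenuation - a standard psychometric formula introduced here as an illustrative gloss, not a formula the author invokes by name. If \(r_{xx}\) and \(r_{yy}\) are the reliabilities of two tests and \(r_{xy}\) their observed intercorrelation, the estimated correlation between the abilities underlying them is

\[ \hat{r}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} \]

If validity coefficients run as high as the reliabilities themselves (say, \(r_{xy} = .85\) with \(r_{xx} \approx r_{yy} \approx .85\)), then \(\hat{r}_{xy} \approx 1\), and within the limits of measurement error the two tests would have to be regarded as measures of the same factor.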
Lofgren (1972) concluded that 'there appear to be four main factors which are significant for language proficiency. These have been named knowledge of words and structures, intelligence, pronunciation, and fluency' (p. 11). He used a factor analytic approach (a sophisticated variation on correlation techniques) to test Lado's idea 'that language can be broken down into smaller components in order to find common elements' (p. 8). In particular, Lofgren wanted to test the view that language skills could be differentiated into listening, speaking, reading, writing, and possibly translating factors. His evidence would seem to support either the unitary pragmatic factor hypothesis, or the central grammatical factor with meaningful peripheral components as suggested by Upshur. His data seem to exclude the possibility that meaningful variances will emerge which are unique to the traditionally recognized skills. Clearly, more research is needed in relation to this important topic (see the Appendix).

Closely related to the questions about the composition of language skill (and these questions have only recently been posed with reference to native speakers of any given language) are questions about the important relation of language skill(s) to IQ and other psychological constructs. If pragmatic tests are actually more valid tests than other widely used measures of language skills, perhaps these new measurement techniques can be used to determine the relationship between ability to perform meaningful language proficiency tasks and ability to answer questions on so-called IQ tests, and educational tests in general. Preliminary results reported in Oller and Perkins (1978) seem to support a single factor solution where language proficiency accounts for nearly all of the reliable variance in IQ and achievement tests.

Most of the studies of language test validity have dealt with second or foreign language learners who are either adults or postadolescents. Extensions to native speakers and to children who are either native or non-native speakers of the language tested are more recent. Many careful empirical investigations are now under way or have fairly recently been reported with younger subjects. In a pioneering doctoral study at the University of New Mexico, Craker (1971) used an oral form of the cloze procedure to assess language skills of children at the first grade level from four ethnic backgrounds. She reported significant discrimination between the four groups, suggesting that the procedure may indeed be sensitive to variations in levels of proficiency for children who are either preliterate or are just beginning to learn to read. Although data are lacking on many of the possible pragmatic testing procedures that might be applied with children, the cloze procedure has recently been used with literate children in the elementary grades in contexts ranging from the Alaskan bush country to the African primary school. Streiff (1977) investigated the effects of the availability of reading resources on reading proficiency among Eskimo children from the third to sixth grade using cloze tests as the criterion measure for reading proficiency. Hofman (1974) used cloze tests as measures of reading proficiency in Uganda schools from grades 2 through 9. Data were collected on children in 14 schools (12 African and 2 European). Since the tests were in a second language for many of the African children, and in the native language for many of the European children, some interesting comparisons are possible. Concerning test reliabilities and internal consistencies of the various cloze tests used, Hofman reports somewhat lower values for the 2nd graders, but even including them, the average reliability estimate for all nine test batteries is .91 - and none is below .85. These data were based on a mean sample size of 264 subjects. The smallest number for any test battery was 232.
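Internal consistency estimates of the sort Hofman reports can be computed in several ways, and the source does not say which formula he used. A minimal sketch, assuming each cloze blank is scored right or wrong and using Cronbach's alpha as the internal consistency index (the function name and the toy scores below are illustrative only, not Hofman's data):

    import numpy as np

    def cronbach_alpha(items):
        # items: one row per cloze blank (scored 0 or 1),
        # one column per examinee.
        items = np.asarray(items, dtype=float)
        k = items.shape[0]                           # number of blanks
        item_vars = items.var(axis=1, ddof=1).sum()  # sum of item variances
        total_var = items.sum(axis=0).var(ddof=1)    # variance of total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)

    # Toy example: 5 blanks scored for 6 examinees.
    scores = [[1, 1, 0, 1, 0, 1],
              [1, 0, 0, 1, 0, 1],
              [1, 1, 0, 1, 1, 1],
              [0, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0]]
    print(cronbach_alpha(scores))

On samples of the size Hofman describes (over 200 examinees per battery), values above .85, like those he reports, indicate that the blanks are measuring largely the same underlying ability.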
In what may be the most interesting study of language proficiency in young children to date, Swain, Lapkin, and Barik (1976) have reported data closely paralleling results obtained with adult second language learners. In their research, 4th grade bilinguals (or English speaking children who are becoming bilingual in French) were tested. Tests of proficiency in French were used and correlated with a cloze test in French scored by the exact and acceptable word methods (see Chapter 12 for elaboration on scoring methods). Proficiency tests for English ability were also correlated with a cloze test in English (also scored by both methods). In every case, the correlations between cloze scores and the other measures of proficiency used (in both languages) were higher than the correlations between any of the other pairs of proficiency tests. This result would seem to support the conclusion that the cloze tests were simply accounting for more of the available meaningful variance in both the native language (English) and in the second language (French). The authors conclude, 'this study has indicated that the cloze tests can be used effectively with young children ... the cloze technique has been shown to be a valid and reliable means of measuring second [language proficiency]'.

In a closely related line of research, Swain, Dumas, and Naiman (1974) used elicited imitation and elicited translation with young English speaking learners of French (each task was prompted by the teacher or experimenter; neither was spontaneous). They reasoned that if the sentences children were to repeat exceeded immediate memory span, then elicited imitation ought to be a test both of comprehension and production skills. Translation, on the other hand, could be done in two directions, either from the native language of the children (in this case English) to the target language (French), or from the target language to the native language. The first sort of translation, they reasoned, could be taken as a test of productive ability in the target language, whereas the second could be taken as a measure of comprehension ability in the target language (presumably, if the child could understand something in French he would have no difficulty in expressing it in his native language, English). In order to rule out the possibility that examinees might be using different strategies for merely repeating a sentence in French as opposed to translating it into English, Swain, Dumas, and Naiman devised a careful comparison of the two procedures. One group of children was told before each sentence whether they were to repeat it or to translate it into English. If subjects used different strategies for the two tasks, this procedure would allow them to plan their strategy before hearing the sentence. A second group was given each sentence first and told afterward whether they were to translate it or repeat it. Since the children in this group did not know what they were to do with the sentence beforehand, it would be impossible for them consistently to use different strategies for the imitation task and the translation task. Differences in error types or in the relative difficulty of different syntactic structures might have implied different strategies of processing, but there were no differences. The imitation task was somewhat more difficult, but the types of errors and the rank ordering of syntactic structures were similar for both tasks and for both groups. (There were no significant differences at all between the two groups.) There were some striking similarities in performance on the two rather different pragmatic tasks.
'For example, 75% of the children who imitated the past tense by "a + the third person of the present tense of the main verb" made the same error in the production task. ... Also, 69% of the subjects who inverted pronoun objects in imitation made similar inversions in production. ... In sum, these findings lead us to reject the view held by Fraser, Bellugi, and Brown (1963) and others that imitation is only a perceptual-motor skill' (Swain, Dumas, and Naiman, 1974, p. 72). Moreover, in a study by Naiman (1974), children were observed to make many of the same errors in spontaneous speech as they made in elicited translation from the native language to the target language. In yet another study, Dumas and Swain (1973) 'demonstrated that when young second language learners similar to the ones in Naiman's study (Naiman, 1974) were given English translations of their own French spontaneous productions and asked to translate these utterances into French, 75% of their translations matched their original spontaneous productions' (Swain, Dumas, and Naiman, 1974, p. 73).

Although much more study needs to be done, and with a greater variety of subject populations and techniques of testing, it seems reasonable to say that there is already substantial evidence from studies of learner outputs that quite different pragmatic tasks may be tapping the same basic underlying competence. There would seem to be two explanations for differences in performance on tasks that require saying meaningful things in the target language versus merely indicating comprehension by, say, translating the meaning of what someone else has just said in the target language into one's own native language. Typically, it is observed that tasks of the latter sort are easier than the former. Learners can often understand things that they cannot say; they can often repeat things that they could not have put together without the support of a model; they may be able to read a sentence which is written down that they could not have understood at all if it were spoken; they may be able to comprehend written material that they obviously could not have written. There seems to be a hierarchy of difficulties associated with different tasks. The two explanations that have been put forth parallel the two competing explanations for the overlap in variance on pragmatic language tests. Discrete point testers have long insisted on the separation of tests of the traditionally recognized skills. The extreme version of this argument is to propose that learners possess different grammars for different skills, aspects of skills, components of aspects of skills, and so on right down to the individual phonemes, morphemes, etc. The disproof of this view seems to have already been provided now many times over. Such a theoretical argument cannot embrace the data from correlation studies or the data from error analyses. However, it seems that a weaker version cannot yet be ruled out. It is possible that there is a basic grammatical system underlying all uses of language, but that there remain certain components which are not part of the central core that account for what are frequently referred to as differences in productive and receptive repertoires
(e.g. phonology for speaking versus for listening), or differences in productive and receptive abilities (e.g. speaking and writing versus listening and reading), or differences in oral and visual skills (e.g. speaking and listening versus reading and writing), or components that are associated with special abilities such as the capacity to do simultaneous translation, or to imitate a wide variety of accents, and so on, and on. This sort of reasoning would harmonize with the hypothesis discussed by Upshur (1976) concerning unique variances associated with tests of particular skills or components of grammar. Another plausible alternative exists, however, and was hinted at in the article by Swain, Dumas, and Naiman (1974).
Consider the rather simple possibility that the differences in difficulties associated with different language tasks may be due to differences in the load they impose on the brain. It is possible that the grammatical system (call it an expectancy grammar, or call it something else) functions with different levels of efficiency in different language tasks - not because it is a different grammar - but because of differences in the load it must bear (or help consciousness to bear) in relation to different tasks. No one would propose that because a man can carry a one hundred pound weight up a hill faster than he can carry a one hundred and fifty pound weight that he therefore must have different pairs of legs for carrying different amounts of weight. It wouldn't even seem reasonable to suggest that he has more weight-moving ability when carrying one hundred pounds rather than when carrying one hundred and fifty pounds. Would it make sense to suggest that there is an additional component of skill that makes the one hundred pound weight easier to carry? The case of, say, speaking versus listening skill is obviously much more complex than the analogy, which is intentionally a reduction to absurdity. But the argument can apply. In speaking, the narrow corridor of activity known as attention or consciousness must integrate the motor coordination of signals to the articulators telling them what moves to make and in what order; when to turn the voice on and off; when to push air and when not to; syllables must be timed, monitoring to make certain the right ones get articulated in the right sequence; facial expressions, tones, gestures, etc. must be synchronized with the stream of speech; and all of the foregoing must be coordinated with certain ill-defined intentions to communicate (or with pragmatic mappings of utterances onto extralinguistic contexts, if you like). In listening, somewhat less is required. While the speaker must both plan and monitor his articulatory output to make sure it catches up with what he intended to say in form and meaning, all the while continuing to plan and actively construct further forms and meanings, the listener needs only to monitor his inferences concerning the speaker's intended meanings, and to help him do this, the listener has the already constructed sensory signals (that he can hear and see) which the speaker is outputting. A similar explanation can be offered for the fact that reading is somewhat less taxing than writing, and that reading is somewhat easier than listening, and so forth for each possible contrast. Swain, Dumas, and Naiman (1974) anticipate this sort of explanation when they talk about 'the influence of memory capacity on some of the specific aspects of processing involved in tasks of imitation and translation' (p. 75).

Crucial experiments to force the choice between the two competing explanations for hierarchies of difficulties in different language tasks remain to be done. Perhaps a double-edged approach from both correlation studies and error analysis will disprove one or the other of the two competing alternatives, or possibly other alternatives remain to be put forward. In the meantime, it would seem that the evidence from error analysis supports the validity of pragmatic tasks. Certainly, it would appear that studies of errors on different elicitation procedures are capable of putting both of the interesting alternatives in the position of being vulnerable to test. This is all that science requires (Platt, 1964). Teachers and educators, of course, require more. We cannot wait for all of the data to come in, or for the crucial experiments to be devised and executed. Decisions must be made and they will be made either wisely or unwisely, for better or for worse. Students in classrooms cannot be left to sit there without a curriculum, and the decisions concerning the curriculum must be made with or without the aid of valid language tests. The best available option seems to be to go ahead with the sorts of pragmatic language tests that have proved to yield high concurrent validity statistics and to provide a rich supply of information concerning the learner's developing language proficiency.

In the next chapter we will consider aspects of language assessment in multilingual contexts. In Part Two, we will discuss reasons for rejecting certain versions of discrete point tests in favor of pragmatic testing, and in Part Three, specific pragmatic testing procedures are discussed in greater detail. The relevance of the procedures recommended there to educational testing in general is discussed at numerous points throughout Part Three.

KEY POINTS
1. Discrete point tests are aimed at specific elements of phonology, syntax, or vocabulary within a presumed aspect (productive or receptive, oral or visual) of one of the traditionally recognized language skills (listening, speaking, reading, or writing).
2. The strong version of the discrete point approach argues that different test items are needed to assess different elements of knowledge within each component of grammar, and different subtests are needed for each different component of each different aspect of each different skill. Theoretically, many different tests are required.
3. Integrative tests are defined as antithetical to discrete point tests. Integrative tests lump many elements and possibly several components, aspects and skills together and test them all at the same time.
4. While it can be argued that discrete point tests and integrative tests are merely two extremes on a continuum, pragmatic tests constitute a special class of integrative tests. It is possible to conceive of a discrete point test as being more or less integrative, and an integrative test as being more or less discrete, but pragmatic tests are more precisely defined.
5. Pragmatic language tests must meet two naturalness criteria: first, they must require the learner to utilize normal contextual constraints on sequences in the language; and, second, they must require comprehension (and possibly production also) of meaningful sequences of elements in the language in relation to extralinguistic contexts.
6. Discrete point tests cannot be pragmatic tests.
7. The question whether or not a task is pragmatic is an empirical one. It can be decided by logic (that is, by definition) and by experiment, but not by opinion polls.
8. Dictation and cloze procedure are examples of pragmatic tests. First, they meet the requirements of the definition, and second, they function in experimental applications in the predicted ways.
9. Other examples include combinations of cloze and dictation, oral cloze tasks, dictation with interfering noise, paraphrase recognition, question answering, oral interview, essay writing, narration, and translation.
10. A test is valid to the extent that it measures what it is supposed to measure. Construct validity has to do with the theoretical justification of a testing procedure; content validity has to do with the faithfulness with which a test reflects the normal uses of language to which it is related as a measure of an examinee's skill; concurrent validity has to do with the strength of correlations between tests that purport to measure the same thing.
11. Correlation is a statistical index of the tendency of scores on two tests to vary proportionately from their respective means. It is an index of the square root of variance overlap, or variance common to two tests.
12. The square of the simple correlation between two tests is an unbiased estimate of their variance overlap. The technical term for the square of the correlation is the coefficient of determination. Correlations cannot be compared linearly, but their squares can be.
13. A high correlation is more informative and easier to interpret than a low one. While a high correlation does mean that some sort of strong relationship exists, a low correlation does not unambiguously mean that a strong relationship does not exist between the tested variables. There are many more explanations for low correlations than for high ones.
14. Generally, pragmatic tests of apparently very different types correlate at higher levels with each other than they do with other tests. However, they also seem to correlate better with the more traditional discrete point tests than the latter do with each other. Hence, pragmatic tests seem to be generating more meaningful variance than discrete item tests.
15. Two possibilities seem to exist: there may be a large factor of grammatical knowledge of some sort in every language test, with certain residual variances attributable to specific components, aspects, skills, and the like; or, language skill may be a relatively unitary factor and there may be no unique meaningful variances that can be attributed to specific components, etc.
16. The relation of language proficiency to intelligence (or more specifically to scores on so-called IQ tests) remains to be studied more carefully.
17. Scores on achievement tests and educational measures of all sorts should also be examined critically with careful experimental procedures.
18. Error analysis, or interlanguage analysis, can provide additional validity data on language tests.
If language tests are viewed as elicitation procedures, and if errors are analyzed carefully, it is possible to make very specific observations about whether certain tests measure different things or the same thing.
19. Available data suggest that very different pragmatic tasks, such as spontaneous speech, or elicited imitation, or elicited translation, tend to produce the same kinds of learner errors. Differences in difficulty across tasks may be explained by considering the relative load on mental mechanisms.
20. Teachers and educators can't wait for all of the research data to come in. At present, pragmatic testing seems to provide the most promise as a reliable, valid, and usable approach to the measurement of language ability.

DISCUSSION QUESTIONS
1. What sort of testing procedure is more common in your school or in educational systems with which you are familiar: discrete point testing, or integrative testing? Are pragmatic tests used in classrooms that you know of? Have you used dictation or cloze procedure in teaching a foreign language? Some other application? For instance, assessing reading comprehension, or the suitability of materials aimed at a certain grade level?
2. Discuss ways that you evaluate the language proficiency of children in your classroom, or students at your university, or perhaps how you might estimate your own language proficiency in a second language you have studied.
3. What are some of the drawbacks or advantages to a phonological discrimination task where the examinee hears a sentence like, He thinks he will sail his boat at the lake, and must decide whether he heard sell or sail? Try writing several items of this type and discuss the factors that enter into determining the correct answer. What expectancy biases may arise?
4. In what way is understanding the sentence, Shove off, Smith, I'm tired of talking to you, dependent on knowing the meaning of off? Could you test such knowledge with an appropriate discrete item?
5. Try doing a cloze test and a dictation. Reflect on the strategies you use in performing one or the other. Or give one to a class and ask them to tell you what they attended to and what they were doing mentally while they filled in the blanks or wrote down what they had heard.
6. Propose other forms of tasks that you think would qualify as pragmatic tasks. Consider whether they meet the two naturalness criteria. If so, try them out and see how they work.
7. Can you think of possible objections to a dictation with noise? What are some arguments for and against such a procedure?
8. Given a low correlation between two language tests, say, .30, what are some of the possible conclusions? Suppose the correlation is .90. What could you conclude for the latter?
9. Keep a diary on listening errors and speech errors that you or people around you make. Do the processes appear to be distinct? Interrelated?
10. Discuss Alice's query about not knowing what she would say next till she had already said it. How does this fit the strategies you follow when you speak? In what cases might the statement apply to your own speech?
11. Can a high correlation between two tests be taken as an indication of test validity? When is it merely an indication of test reliability? What is the difference? Can a test be valid without being reliable? How about the reverse?
12. Try collecting samples of learner outputs in a variety of ways. Compare error types.
Do dissimilar pragmatic tests elicit similar or dissimilar errors?
13. Consider the pros and cons of the long despised technique of translation as a teaching and as a testing device. Are the potential uses similar? Can you define clearly abuses to be avoided?
14. What sorts of tests do you think will yield superior diagnostic information to help you to know what to do to help a learner by teaching strategies? Consider what you do now with the test data that are available. Are there reading tests? Vocabulary tests? Grammar tests? What do you do differently because of the information that you get from the tests you use?
15. Compute the means, variances, and standard deviations for the following sets of scores:

            George   Sarah   Mary
    Test A     1        2      3
    Test B     5       10     15

(Note that these data are highly artificial. They are contrived purely to illustrate the meaning of correlation while keeping the computations extremely simple and manageable.)
16. What is the correlation between tests A and B?
17. What would your interpretation of the correlation be if Tests A and B were alternate forms of a placement test? What if they were respectively a reading comprehension test and an oral interview?
18. Repeat questions 15, 16, and 17 with the following data:

            George   Sarah   Mary
    Test C     5       10     15
    Test D     7        6      5

(Bear in mind that correlations would almost never be done on such small numbers of subjects. Answers: the correlation between A and B is +1.00; between C and D it is -1.00.)
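For readers who want to check their answers to questions 15 through 18, a short script will do the arithmetic (any calculator or statistics package would serve equally well; the code is offered only as a convenience and is not part of the original exercises):

    def mean(xs):
        return sum(xs) / len(xs)

    def pearson_r(xs, ys):
        # Pearson product-moment correlation of two score lists.
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var_x = sum((x - mx) ** 2 for x in xs)
        var_y = sum((y - my) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    print(pearson_r([1, 2, 3], [5, 10, 15]))   # Tests A and B -> 1.0
    print(pearson_r([5, 10, 15], [7, 6, 5]))   # Tests C and D -> -1.0

Because Test B is a perfect linear function of Test A (each score is five times the other), the correlation is +1.00; Test D falls by exactly one point for every five-point rise in Test C, so their correlation is -1.00.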
SUGGESTED READINGS
1. Anne Anastasi, Psychological Testing. New York: Macmillan, revised edition, 1976. See especially Chapters 5 and 6 on test validity.
2. John B. Carroll, 'Fundamental Considerations in Testing for English Language Proficiency of Foreign Students,' Testing, Washington, D.C.: Center for Applied Linguistics, 1961, 31-40. Reprinted in H. B. Allen and R. N. Campbell (eds.) Teaching English as a Second Language: A Book of Readings. New York: McGraw Hill, 1972, 313-320.
3. L. J. Cronbach, Essentials of Psychological Testing. New York: Harper and Row, 1970. See especially the discussion of different types of validity in Chapter 5.
4. Robert Lado, Language Testing. New York: McGraw Hill, 1961. See especially his discussion of discrete point test rationale, pp. 25-29 and 39-203.
5. John W. Oller, Jr., 'Language Testing,' in Ronald Wardhaugh and H. Douglas Brown (eds.) Survey of Applied Linguistics. Ann Arbor, Michigan: University of Michigan, 1976, 275-300.
6. Robert L. Thorndike, 'Reliability,' Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1964. Reprinted in Glenn H. Bracht, Kenneth D. Hopkins, and Julian C. Stanley (eds.) Perspectives in Educational and Psychological Measurement. Englewood Cliffs, N.J.: Prentice-Hall, 1972, 66-73.

4 Multilingual Assessment

A. Need
B. Multilingualism versus multidialectalism
C. Factive and emotive aspects of multilingualism
D. On test biases
E. Translating tests or items
F. Dominance and proficiency
G. Tentative suggestions

Multilingualism is a pervasive modern reality. Ever since that cursed Tower was erected, the peoples of the world have had this problem. In the United States alone, there are millions of people in every major urban center whose home and neighborhood language is not one of the majority varieties of English. Spanish, Italian, German, Chinese and a host of other 'foreign' languages have actually become American languages. Furthermore, Navajo, Eskimo, Zuni, Apache, and many other native languages of this continent can hardly be called 'foreign' languages. The implications for education are manifold. How shall we deliver curricula to children whose language is not English? How shall we determine what their language skills are?

A. Need

Zirkel (1976) concludes an article entitled 'The why's and ways of testing bilinguality before teaching bilingually' with the following paragraph:

The movement toward an effective and efficient means of testing bilinguality before teaching bilingually is in progress. In its wake is the hope that in the near future 'equality of educational opportunity' will become more meaningful for linguistically different pupils in our nation's elementary schools (p. 328).

Earlier he observes, however, that 'a substantial number of bilingual programs do not take systematic steps to determine the language dominance of their pupils' (p. 324). Since the 1974 Supreme Court ruling in the case of Lau versus Nichols, the interest in multilingual testing in the schools of the United States has taken a sudden upswing. The now famous court case involved a contest between a Chinese family in San Francisco and the San Francisco school system. The following quotation from the Court's Syllabus explains the nature of the case:

The failure of the San Francisco school system to provide English language instruction to approximately 1,800 students of Chinese ancestry who do not speak English denies them a meaningful opportunity to participate in the public educational program and thus violates §601 of the Civil Rights Act of 1964, which bans discrimination based 'on the ground of race, color, or national origin' (Lau vs. Nichols, 1974, No. 72-6520).

On page 2 of an opinion by Mr. Justice Stewart, concurred in by The Chief Justice and Mr. Justice Blackmun, it is suggested that 'no specific remedy is urged upon us. Teaching English to the students of Chinese ancestry who do not speak the language is one choice. Giving instruction to this group in Chinese is another' (1974, No. 72-6520). Further, the Court argued:

Basic English skills are at the very core of what these public schools teach. Imposition of a requirement that, before a child can effectively participate in the educational program, he must already have acquired those basic skills, is to make a mockery of public education. We know that those who do not understand English are certain to find their classroom experiences wholly incomprehensible and in no way meaningful (1974, No. 72-6520, p. 3).

As a result of the interpretation rendered by the Court, the U.S. Office of Civil Rights convened a Task Force which recommended certain so-called 'Lau Remedies'. Among other things, the main document put together by the Task Force requires language assessment procedures to determine certain facts about language use, and it requires the rating of bilingual proficiency on a rough five point scale (1. monolingual in a language other than English; 2. more proficient in another language than in English; 3. balanced bilingual in English and another language; 4. more proficient in English than in another language; 5. monolingual in English).

Multilingual testing seems to have come to stay for a while in U.S. schools, but as Zirkel and others have noted, it has come very late. It is late in the sense of antiquated and inhumane educational programs that placed children of language backgrounds other than English in classes for the 'mentally retarded' (Diana versus California State Education Department, 1970, No. C-7037), and it is late in terms of bilingual education programs that were started in the 1960s and even in the early 1970s on a hope and a promise but without adequate assessment of pupil needs and capabilities (cf. John and Horner, 1971, and Shore, 1974, cited by Zirkel, 1976). In fact, as recently as 1976, Zirkel observes, 'a fatal flaw in many bilingual programs lies in the linguistic identification of pupils at the critical point of the planning and placement process' (p. 324). Moreover, as Teitelbaum (1976), Zirkel (1976), and others have often noted, typical methods of assessment such as surname surveys (to identify Spanish speakers, for instance) or merely asking about language preferences (e.g., teacher or student questionnaires) are largely inadequate. The one who is most likely to be victimized by such inadequate methods is the child in the school. One second grader indicated that he spoke 'English-only' on a rating sheet, but when casually asked later whether his parents spoke Spanish, the child responded without hesitation: Si, ellos hablan español - pero yo, no ('Yes, they speak Spanish - but not me').
Doing all of it the way it was done last year is proof only of the disappointing fact that a: system that purports to'teach doesn't necessarily learn. Indeed, it is merely another commtmton the equally discouraging fact that many students in the schools (and universities no doubt) must learn in spite of the system which becomes an adversary instead of a servant to the needs oflearners. The problems are not unique to the United States. They are worldwide problems. Hofman (1974) in reference to the schools of Rhodesia says, 'It is important to get some idea, one that should have been reached ten years ago, of the state of English in the primary school' (p. 10). In the case of Rhodesia, and the argument can easily be extended to many of the world's nations, Hofman questions the blind and uninformed language policy imposed on teachers and children in the schools. In the case of Rhodesia, at least until the time his report was written, English was the required school language from 1st grade onward. Such policies have recently been challenged in many parts of the world (not just in the case of a super-imposed English) and reports of serious studies examining important variables are beginning to appear (see for instance, Bezanson and Hawkes, 1976, and Streiff, 1977). This is not to say that there may not be much to be gained by thorough knowledge of one of the languages of the world currently enjoying much power and prestige (such as English is at the present moment), burthere are many questions concerning the price that must be paid for such knowledge. Such questions can scarcely be posed without serious multilingual testing on a much wider scale than has been common up till now. scale (1. monolingual in a language other than English; 2. more proficient in another language than in English; 3. balanced bilingual in English and another language; 4. more proficient in English than in another language; 5. monolingual in English). Multilingual testing seems to have come to stay for a while in U.S. schools, but as Zirkel and others have noted, it has come very late. It is late in the sense of antiquated and inhumane educational pro~rams that placed children of language backgrounds other than English in classes for the 'mentally retarded' (Diana versus California State Education Department, 1970, No. C-7037), and it is late in terms of bilingual education programs that were started in the 1960s and even in the early 1970s on a hope and a promise but without adequate assessment of pupil needs and capabilities(cf. John and Horne'r, 1971, and Shore, 1974, cited by Zirkel, 1976). In fact, as recently as , 1976, Zirkel observes, 'a fatal flaw in many bilingual programs lies in the linguistic identification of pupils at the critical point of the planning and placement process' (p. 324). Moreover, as Teitelbaum (1976), Zirkel (1976), and others have often noted, typical methods of assessment such as surname surveys (to identify Spanish speakers, for instance) or merely asking about language preferences (e.g., teacher or student questionnaires) are largely inadequate. The one who is most likely to be victimized by such inadequate methods is the child in,the school. One second grader indicated that he spoke 'English-only' .on a rating sheet, but when 'casually asked later whether his parents spoke Spanish, the child responded without he'sitation: Si, ellos hablan espafiol- pero yo, no.' 
B. Multilingualism versus multidialectalism

The problems of language testing in bilingual or multilingual contexts seem to be compounded to a new order of magnitude each time a new language is added to the system. In fact, it would seem that the problems are more than just doubled by the presence of more than one language, because there must be social and psychological interactions between different language communities producing complexities not present in monolingual communities. However, there are some parallels between so-called monolingual communities and multilingual societies. The former display a rich diversity of language varieties in much the way that the latter exhibit a variety of languages. To the extent that differences in language varieties parallel differences in languages, there may be less contrast between the two sorts of settings than is commonly thought. In both cases there is the need to assess performance in relation to a plurality of group norms. In both cases there is the difficulty of determining what group norms are appropriate at different times and places within a given social order. It has sometimes been argued that children in the schools should be compared only against themselves and never against group norms, but this argument implicitly denies the nature of normal human communication. Evaluating the language ability of an individual by comparing him only against himself is a little like clapping with one hand. Something is missing. It only makes sense to say that a person knows a language in relation to the way that other persons who also know that language perform when they use it. Becoming a speaker of a particular language is a distinctively socializing process. It is a process of identifying with and to some degree functioning as a member of a social group.

In multilingual societies, where many mutually unintelligible languages are common fare in the market places and urban centers, the need for language proficiency testing as a basis for informing educational policy is perhaps more obvious than in so-called monolingual societies. However, the case of monolingual societies, which are typically multidialectal, is deceptive. Although different varieties of a language may be mutually intelligible in many situations, in others they are not. At least since 1969, it has been known that school children who speak different varieties of English perform about equally badly in tests that require the repetition of sentences in the other group's variety (Baratz, 1969).
Unfortunately for children who speak a non-majority variety of English, all of the other testing in the schools is done in the majority variety (sometimes referred to as 'standard English'). An important question currently being researched is the extent to which educational tests in general may contain a built-in language variety bias, and related to it is the more general question concerning how much of the variance in educational tests in general can be explained by variance in language proficiency (see Stump, 1978, Gunnarsson, 1978, and Pharis and Perkins, in press; also see the Appendix).

The parallel between multilingualism and multidialectalism is still more fundamental. In fact, there is a serious question of principle concerning whether it is possible to distinguish languages and dialects. Part of the trouble lies in the fact that for any given language (however we define it), there is no sense at all in trying to distinguish it from its dialects or varieties. The language is its varieties. The only sense in which a particular variety of a language may be elevated to a status above other varieties is in the manner of Orwell's pigs - by being a little more equal, or in this case, a little more English or French or Chinese or Navajo or Spanish or whatever. One of the important rungs on the status ladder for a language variety (and for a language in the general sense) is whether or not it is written and whether or not it can lay claim to a long literary tradition. Other factors are who happens to be holding the reins of political power (obviously the language variety they speak is in a privileged position), and who has the money and the goods that others must buy with their money. The status of a particular variety of English is subject to many of the same influences that the status of English (in the broader sense) is controlled by. The question, where does language X (or variety X) leave off and language Y (or variety Y) begin, is a little like the question, where does the river stop and the lake begin? The precision of the answer, or lack of it, is not so much a product of clarity or unclarity of thought as it is a product of the nature of the objects spoken of. New terms will not make the boundaries between languages, or dialects, or between languages and language varieties any clearer. Indeed, it can be argued that the distinction between languages as disparate as Mandarin and English (or Navajo and Spanish) is merely a matter of degree. For languages that are more closely related, such as German and Swedish, or Portuguese and Spanish, or Navajo and Apache, it is fairly obvious that their differences are a matter of degree. However, in relation to abstract grammatical systems that may be shared by all human beings as part of their genetic programming, it may be the case that all languages share much of the same universal grammatical system (Chomsky, 1965, 1972).

Typically, varieties of a language that are spoken by minorities are termed 'dialects' in what sometimes becomes an unintentional (or possibly intentional) pejorative sense. For example, Ferguson and Gumperz (1971) suggest that a dialect is a 'potential language' (p. 34).
But the language spoken by inner city blacks - that's another matter. A similar tendency is apparent in the remark by The Chief Justice in the Lau case where he refers to the population of Chinese speaking children in San Francisco (the 1,800 who were not being taught English) as 'linguistically deprived' children: (Lau versus Nichols, 1974, No. 72-6520, p. 3). Such remarks may reflect a modicum of truth, but deep within they seem to arise from ethno-centric prejudices that define the way I do it (or the way we do it) as intrinsically better than the way anyone else does it. It is not difficult to extend such intimations to 'deficit theories' of social difference like those advocated by Bernstein (1960), Bereiter and Engleman (1967), Herrnstein (1971), and others. Among the important questions that remain unanswered and that are crucial to the differentiation of multilingual and monolingual societies are the following: to what extent does normal educational testing contain a language variety bias? And further, to what extent is that bias lesser or greater than the language bias in educational testing for children who come from a non-English speaking background? Are the two kinds of bias really different in type or merely in degree? C. Factive and emotive aspects of multilingualism Difficulties in communication between social groups of different language backgrounds (qialects or language varieties included) are apt to arise in two ways: first, there may be a failure to communicate on the factive level; or second,there may be a failure to communicate on the emotive level as well as the factive level. If a child comes to a school from a cultural and linguistic background that is substantially different from the background of the majority of teachers and students in the school, he brings to the communication contexts of the school many sorts of expectations that will be inappropriate to many aspects of the exchanges that he might be expected to initiate or participate in. Similarly, the representatives of the majority culture and possibly other minority backgrounds will bring other sets of expectations to the communicative exchanges that must take place. In such linguistically plural contexts, the likelihood of misinterpretations and breakdowns in communication is increased. On the MULTILINGUAL ASSESSMENT 81 factive level, the actual forms of the language(s) may present some difficulties. The surface forms of messages may sometimes be uninterpretable, or they may sometimes be misinterpreted. Such problems may make it difficult for the child, teacher, and others in the system to c;ommunicate the factive-level information that is usually the focus of classroom activities - namely, transmitting the subject matter content of the curriculum. Therefore, such factive level communication problems may account for some portion of the variance in the school performance of children from ethnically and culturally different backgrounds, i.e., their generally lower scores on educational tests. As Baratz (1969) has shown, however, it is important to keep in mind the fact that if the tables were turned, if the majority were suddenly the minority, their scores on educational tests might be expected to plummet to the same levels as are typical· of minorities in today's U.S. schools. Nevertheless, there is another important cluster of factors that probably affect variance in learning far more drastically than problems of factive level communication. 
There is considerable evidence to suggest that the more elusive emotive or attitudinal level of communication may be a more important variable than the surface form of messages concerning subject matter. This emotive aspect of communication in the schools directly relates to the self-concept that a given child is developing, and also it relates to group loyalties and ethnic identity. Though such factors are difficult to measure (as we shall see in Chapter 5), it seems reasonable to hypothesize that they may account for more of the variance in learning in the schools that can be accounted for by the selection of a particular teaching methodology for instilling certain subject matter (factive level communication). As the research in the Canadian experiments has shown, if the socio-cultural (emotive level) factors are not in a turmoil and if the child is receiving adequate encouragement and support at home, etc., the child can apparently learn a whole new way of coding information factively (a new linguistic system) rather incidentally and can acquire the subject matter and skills taught in the schools without great difficulty (Lambert and Tucker, 1972, Tucker and Lambert, 1973, Swain, 1976a, 1976b). However, the very different experience of children in schools, say, in the Southwestern United States where many ethnic minorities do not experience such success requires a different interpretation. Perhaps the emotive messages that the child is bombarded with in the Southwest help explain the failure of the schools. Pertinent questions r 82 are: how does the child see his culture portrayed in the curriculum? How does the child see himself in relation to the other children who may be defined as successful by the system? How does the child's , home experience match up with the experience in the school? It is hypothesized that variance in rate oflearning is probably more sensitive to the emotive level messages communicated in facial expressions, tones of voice, deferential treatment of some children in a classroom, biased representation of experiences in the school curriculum, and so on, than to differences in factive level methods of presenting subject matter, This may be more true for minority children than it is for children who are part of the majority. A similar view has been suggested by Labov (1972) and by Goodman and Buck (1973). The hypothesis and its consequences can be visualized as shown in Figure 2 where the area enclosed by the larger circle represents the total amount of variance in learning to be accounted for (obviously the Figure is a metaphor, not an explanation or model). The area enclosed by the smaller concentric circle represents the hypothetical amount of variance that mi.ght be explainedQy emotive message factors. Among these are messages that the child perceives concerning his own worth, the value of his people and culture, the viability of his language as a means" of communication, and the validity of his life experience. The,area in the outer ring represents the hypothetical portion of variance in learning that may be accounted for by appeal to factive level aspects of communication in the schools, such as methods of teaching, .subject -matter taught, language of presentation of the material, IQ, initial achievement levels, etc. 
Of all the ways struggles for ethnic identity manifest themselves, and of all the messages that can be communicated between different social groups in theit mutual struggles to identify and define themselves, William James (as cited by Watzlawick, et ai, 1967, p. 86) suggested that the most crFel possible message one human being can communicate to another (or one group to another) is simply to pretend that the other individual (or group) does not exist. Examples _ are too common for comfoit in the history of education. Consider the, statement that Columbus discovered America in 1492 (Banks, 1972). Then ask, who doesn't count? (Clue: Who was already here before: Columbus ever got the wind in his sails?) James said, , 'No more fiendish punishment could' be devised, ... than that one should be turned loose in a society and remain absolutely unnoticed by all the members thereof' (as cited by Watzlawick, et ai, 1967, p. 86). I .MULTILINGUAL ASSESSMENT LANGUAGE TESTS AT SCHOOL which can attributed 83 be to METHODS Representation of the validity of the child's own the child's own people experience Portrayal of the Portrayal of the viability of the value of the child's own child's own culture Figure 2. A hypothetical view of the amount of variance in learning to be accounted for by emotive versus factive sorts of information (methods of conveying subject matter are represented as explaining variance in the outer ring, while the bulk is explained by emotive factors). The interpretation of low scores on tests, therefore, needs to take account of possible emotive conflicts. While a high·- score on a language test or any other educational test probably can be confidently interpreted as indicating a high degree of skill in communicating factive information as well as a good deal of harmony between the child and the school situation on the emotive level, a low score cannot be interpreted so easily. In this respect, low scores on tests are somewhat like low correlations between tests (see the discussion in Chapter 3, section E, and again in Chapter 7, section B); they leave a greater number of options open. A low score may occur because the test was not reliable or valid, or because it was not suited to the child in difficulty level, or because it created emotive reactions that interfered with the cognitive task, or possibly because F 84 i ': ·ii LANGUAGE TESTS AT SCHOOL the child is really weak in the skill tested. The last interpretation, however, should be used with caution and only after the other reasonable alternatives have been ruled out by careful study. It is important to keep in mind the fact that an emotive-level conflict IS more likely to call for changes in the educational system and the way that it affects children than for changes in the children. In some cases it may be that simply providing the child with ample opportunity to learn the ~anguage or language variety of the educational system is the best solution; in others, it may be necessary to offer instruction in the child's native language, or in the majority language and the minority language; and no doubt other untried possibilities exist. If the emotive factors are in harmony between the school and the child's experience, there is some reason to believe that mere exposure to the unfamiliar language may generate the desired progress (Swain, 1976a, 1976b). 
In the Canadian experiments, English speaking children who are taught in French for the first several years of their school experience, learn the subject matter about as well as monolingual French speaking children, and they also incidentally acquire French. The term 'submersion' has recently been offered by Swain (1976b) to characterize the experience of minority children in the Southwestern United States who do not fare so welL the children are probably not all that different, but the social contexts of the two situations are replete with blatant contrasts (Fishman, 1976). D. On test biases A great deal has been written recently concerning cultural bias in tests (Briere, 1972, Condon, 1973). No doubt much of what is being said is true. However, some welk-meaning groups have gone so far as to suggest a 'moratorium on ~ll testing of minority children.' Their argument goes something like this. Suppose a child has learned a language that is very different from the language used by the representatives of the majority language and culture in the schools. When the child goes to school, he is systematically discriminated against (whether intentionally or not is irrelevant to the argument). All of the achievement tests, all of the classes, all of the informal teacher and peer evaluations that influence the degree of Success or failure of the child is in a language (or language variety) that he has not yet learned. The entire situation is culturally biased against the child. He is regularly treated in a prejudicial way by the school system MULTILINGUAL ASSESSMENT 85 as a whole. So, some ~urge that we should aim to get the cultural bias out of the school situation as much as possible and especially out of the tests. A smaller group urges that all testing should be stopped indefinitely pending investigation ~f other educational alternatives. The arguments supporting such proposals are persuasive, but the suggested solutions do not solve the problems they so graphically point out. Consider the possibility of getting the cultural bias out .of language proficiency tests. Granted that language tests though they may vary in the pungency of their cultural flavor all have cultural bias built into them. They have cultural bias because they present sequences of linguistic elements of a certain language (or language variety) in specific .contexts. In. fact, it is the purpose of such tests to discriminate between various levels of skill, often, to discriminate between native and non-native performance. A language test is therefore intentionally biased against those who do not speak the language or who do so at different levels of skill. Hence, getting the bias out of language tests, if pushed to the logical limits, is to get language tests to stop functioning as language tests. On the surface, preventing the cultural bias and the discrimination between groups that such tests provide might seem like a good idea, but in the long run it will create more problems than it can solve. For one, it will do harm to the children in the schools who most need help in coping with the majority language system by pretending that crucial communication problems do not exist. At the same time it would also preclude the majority culture representatives in schools from being exposed to the challenging opportunities of trying to cope in settings that use a minority language system. 
If this latter event is to occur, it will be necessary to evaluate developing learner proficiencies (of the majority types) in terms of the norms that exist for the minority language(s) and culture(s). The more extreme alternative of halting testing in the schools is no real solution either. What is needed is more testing that is based on carefully constructed tests, administered with particular questions in mind, and followed by deliberate and careful analysis. Part of the difficulty is the lack of adequate data - not an overabundance of it. For instance, until recently (Oller and Perkins, 1978) there were no data on the relative importance of language variety bias, or just plain language bias, in educational testing in general. There was always plenty of evidence that such a factor must be important to a vast array of educational tests, but how important? Opinions to the effect that it is not very important, or that it is of great importance, merely accent the need for empirical research on the question. It is not a question that can be decided by vote - not even at the time-honored 'grass roots level' - but that is where the studies need to be done.

There is another important way that some of the facts concerning test biases and their probable effects on the school performance of certain groups of children may have been over-zealously interpreted. These latter interpretations relate to an extension of a version of the strong contrastive analysis hypothesis familiar to applied linguists. The reasoning is not unappealing. Since children who do poorly on tests in school, and on tasks such as learning to read, write, and do arithmetic, are typically (or at least often) children who do not use the majority variety of English at home, their failure may be attributed to differences between the language (or variety of English) that they speak at home and the language that is used in the schools. Goodman (1965) offered such an explanation for the lower performance of inner city black children on reading tests. Reed (1973) seemed to advocate the same view. They suggested that structural contrasts in sequences of linguistic elements common to the speech of such children accounted for their lower reading scores. Similar arguments have been popular for years as an explanation of the 'difficulties' of teaching or learning a language other than the native language of the learner group. There is much controverting evidence, however, for either application of the contrastive analysis hypothesis (also see Chapter 6 below). For instance, contrary to the prediction that would follow from the contrastive analysis hypothesis, in two recent studies, black children understood majority English about as well as white children, but the white children had greater difficulty with minority black English (Norton and Hodgson, 1973, and Stevens, Ruder, and Tew, 1973). While Baratz (1969) showed that white children tend to transform sentences presented in black English into their white English counterparts, and similarly, black children render sentences in white English into their own language variety, it would appear from her remarks that at least the black children had little difficulty understanding white English. This evidence does not conclusively eliminate the position once advocated by Goodman and Reed, but it does suggest the possibility of looking elsewhere for an explanation of reading problems and other difficulties of minority children in the schools.
For example, is it not possible that sociocultural factors that are of an essentially non-linguistic sort might play an equal if not greater part in explaining school performance? One might ask whether black children in the communities where differences have been observed are subject to the kinds of reinforcement and punishment contingencies that are present in the experience of comparable groups of children in the majority culture. Do they see their parents reading books at home? Are they encouraged to read by parents and older siblings? These are tentative attempts at phrasing the right questions, but they hint at certain lines of research. As to the contrastive explanation of the difficulties of language learners in acquiring a new linguistic system, a question should suffice. Why should Canadian French be so much easier for some middle class children in Montreal to acquire than English is for many minority children in the Southwest? The answer probably does not lie in the contrasts between the language systems. Indeed, as the data continue to accumulate, it would appear that many of the children in bilingual programs in the United States (perhaps most of the children in most of the programs) are dominant in English when they come to school. The contrastive explanation is clearly inapplicable to those cases. For a review of the literature on second language studies and the contrastive analysis approaches, see Oller (1979). For a systematic study based on a Spanish-English bilingual program in Albuquerque, New Mexico, see Teitelbaum (1976).

If we reject the contrastive explanation, what then? Again it seems we are led to emotive aspects of the school situation in relation to the child's experience outside of the school. If the child's cultural background is neglected by the curriculum, if his people are not represented or are portrayed in an unfavorable light or are just simply misrepresented (e.g., the Plains Indian pictured in a canoe in a widely used elementary text, Joe Sando, personal communication), if his language is treated as unsuitable for educational pursuits (possibly referred to as the 'home language' but not the 'school language'), probably just about any teaching method will run into major difficulties. It is in the area of cultural values and ways of expressing emotive level information in general (e.g., ethnic identity, approval, disapproval, etc.) where social groups may contrast more markedly and in ways that are apt to create significant barriers to communication between groups. The barriers are not so much in the structural systems of the languages (nor yet in the educational tests) as they are in the belief systems and ways of living of different cultures. Such differences may create difficulties for the acceptance of culturally distinct groups and the people who represent them. The failure of the minority child in school (or the failure of the school) is more likely to be caused by a conflict between cultures and the personalities they sustain than by a lack of cognitive skills or abilities (see Bloom, 1976).

In any event, none of the facts about test bias lessens the need for sound language proficiency testing. Those facts merely accent the demands on educators and others who are attempting to devise tests and interpret test results. And, alas, as Upshur (1969b) noted, test is still a four letter word.
E. Translating tests or items

Although it is possible under some conditions to translate tests with little apparent loss of information, and without drastically altering the task set the examinees, the translation of items for standardized multiple choice tests is beset by fundamental problems of principle. First we will consider the problems of translating tests item by item from one multiple choice format into another, and then we will return to consider the more general problem of translating pragmatic tasks from one language to another. It may seem surprising at the outset to note that the former translation procedure is probably not feasible while the latter can be accomplished without great difficulty.

A doctoral dissertation completed in 1974 at the University of New Mexico investigated the feasibility of translating the Boehm Test of Basic Concepts from English into Navajo (Scoon, 1974). The test attempts to measure the ability of school children in the early grades to handle such notions as sequence in time (before versus after) and location in space (beside, in front of, behind, under, and so on). It was reasoned by the original test writer that children need to be able to understand such concepts in order to follow everyday classroom instructions, and to carry out simple educational tasks. Scoon hoped to be able to get data from the translated test which would help to define instructional strategies to aid the Navajo child in the acquisition and use of such concepts. Even though skilled bilinguals in English and Navajo helped with the translation task, and though allowances were made for unsuccessful translations, dialect variations, and the like, the tendency for the translated items to produce results similar to the original items was surprisingly weak. It is questionable whether the two tests can be said to be similar in what they require of examinees. Some of the items that were among the easiest ones on the English test turned out to be very difficult on the Navajo version, and vice versa. The researcher began the project hoping to be able to diagnose learning difficulties of Navajo children in their own language. The study evolved into an investigation of the feasibility of translating a standardized test in a multiple choice format from English into Navajo. Scoon concluded that translating standardized tests is probably not a feasible approach to the diagnosis of educational aptitudes and skills. All of this would lead us to wonder about the wisdom of translating standardized tests of 'intelligence' or achievement. Nevertheless, such translations exist.

There are several reasons why translating a test, item by item, is apt to produce a very different test than the one the translators began with. Translation of factive-level information is of course possible. However, much more is required. Translation of a multiple choice test item requires not only the maintenance of the factive information in the stem (or lead-in part) of the item, but the maintenance of it in roughly the same relation to the paradigm of linguistic and extralinguistic contexts that it calls to mind. Moreover, the relationships between the distractors and the correct answer must remain approximately the same in terms of the linguistic and extralinguistic contexts that they call to mind and in terms of the relationships (similarities and differences) between all of those contexts.
While it may sometimes be difficult to maintain the factive content of one linguistic form when translating it into another language, this may be possible. However, to maintain the paradigm of interrelationships between linguistic and extralinguistic contexts in a set of distractors is probably not just difficult - it may well be impossible. Translating delicately composed test items (on some of the delicacies, see Chapter 9) is something like trying to translate a joke, a pun, a riddle, or a poem. As Robert Frost once remarked, when a poem is translated, 'the poetry is often lost' (Kolers, 1968, p. 4). With test items (a lesser art form) it is the meaning and the relationship between alternative choices which is apt to be lost. A translation of a joke or poem often has to undergo such changes that if it were literally translated back into the source language it would not be recognizable. With test items it is the meaning of the items in terms of their effects on examinees that is apt to be changed, possibly beyond recognition. A very common statement about a very ordinary fact in English may be an extremely uncommon statement about a very extraordinary fact in Navajo. A way of speaking in English may be incomprehensible in Navajo; for instance, the fact that you cut down a tree before you cut it up, which is very different from cutting up in a class or cutting down your teacher. Conversely, a commonplace saying in Navajo may be enigmatic if literally translated into English. Successful translation of items requires maintaining roughly the same style level, the same frequency of usage of vocabulary and idiom, comparable phrasing and reference complexities, and the same relationships among alternative choices. In some cases this simply cannot be done. Just as a pun cannot be directly translated from one language into another precisely because of the peculiarities of the particular expectancy grammar that makes the pun possible, a test item cannot always be translated so as to achieve equal effect in the target language. This is due quite simply to the fact that the real grammars of natural languages are devices that relate to paradigms of extralinguistic contexts in necessarily unique ways.

The bare tip of the iceberg can be illustrated by data from word association experiments conducted by Paul Kolers and reported in Scientific American in March 1968. He was not concerned with the word associations suggested by test items, but his data illustrate the nature of the problem we are considering. The method consists of presenting a word to a subject such as mesa (or table) and asking him to say whatever other word comes to mind, such as silla (or chair). Kolers was interested in determining whether pairs of associated words were similar in both of the languages of a bilingual subject. In fact, he found that they were the same in only about one-fifth of the cases. Actually he complicated the task by asking the subject to respond in the same language as the stimulus word on two tests (one in each of the subject's two languages), and in the opposite language in two other tests (e.g., once in English to Spanish stimulus words, and once in Spanish to English stimulus words). The first two tests can be referred to as intralingual and the second pair as interlingual. The chart given below illustrates typical responses in Spanish and English under all four conditions.
It shows that while the response in English was apt to be girl to the stimulus boy, in Spanish the word muchacho generated the response hombre. As is shown in the chart, the interlingual associations tended to be the same in about one out of five cases.

              INTRALINGUAL                              INTERLINGUAL
    ENGLISH-ENGLISH    SPANISH-SPANISH       ENGLISH-SPANISH    SPANISH-ENGLISH
    table - dish       mesa - silla          table - silla      mesa - chair
    boy - girl         muchacho - hombre     boy - niña         muchacho - trousers
    king - queen       rey - reina           king - reina       rey - queen
    house - window     casa - madre          house - blanco     casa - mother

TYPICAL RESPONSES in a word-association test were given by a subject whose native language was Spanish. He was asked to respond in Spanish to Spanish stimulus words, in English to the same words in English, and in each language to stimulus words in the other.

In view of such facts, it is apparent that it would be very difficult indeed to obtain similar associations between sets of alternative choices on multiple choice items. Scoon's results showed that the attempt at translating the Boehm test into Navajo did not produce a comparable test. This, of course, does not prove that a comparable test could not be devised, but it does suggest strongly that other methods for test development should be employed. For instance, it would be possible to devise a concept test in Navajo by writing items in Navajo right from the start instead of trying to translate items from an existing English test.

Because of the diversity of grammatical systems that different languages employ, it would be pure luck if a translated test item should produce highly similar effects on a population of speakers who have internalized a very different grammatical system for relating language sequences to extralinguistic contexts. We should predict in advance that a large number of such translation attempts would produce markedly different effects in the target language.

Is translation therefore never feasible? Quite the contrary. Although it is difficult to translate puns, jokes, or isolated test items in a multiple choice format, it is not terribly difficult to translate a novel, or even a relatively short portion of prose or discourse.
Despite the idiosyncrasies of language systems with respect to the subtle and delicate interrelationships of their elements that make poetry and multiple choice test items possible, there are certain robust features that all languages seem to share. All of them have ways of coding factive (or cognitive) information and all of them have ways of expressing emotive (or affective) messages, and in doing these things, all languages are highly organized. They are both redundant and creative (Spolsky, 1968). According to recent research by John McLeod (1975), many languages are probably about equally redundant. These latter facts about similarities of linguistic systems suggest the possibility of applying roughly comparable pragmatic testing procedures across languages with equivalent effects. For instance, there is empirical evidence in five languages that translations and the original texts from which they were translated can be converted into cloze tests of roughly comparable difficulty by following the usual procedures (McLeod, 1975; on procedures see also Chapter 12 of this book). The languages that McLeod investigated included Czech, Polish, French, German, and English. Moreover, using different methods, Oller, Bowen, Dien, and Mason (1972) showed that it is probably possible to create cloze tests of roughly comparable difficulty (assuming the educational status, age, and socioeconomic factors are controlled) even when the languages are as different as English, Thai, and Vietnamese.

In a different application of cloze procedure, Klare, Sinaiko, and Stolurow (1972) recommend 'back translation' by a translator who has not seen the original passage. McLeod (1975) used this procedure to check the faithfulness of the translations used in his study and referred to it as 'blind back translation'. Whereas in the item-by-item translation that is necessary for multiple choice tests of the typical standardized type there will be a systematic one-for-one correspondence between the original test items and the translation version, this is not possible in the construction of cloze tests over translations of equivalent passages. In fact it would violate the normal use of the procedure to try to obtain equivalences between individual blanks in the passages.

Other pragmatic testing procedures besides the cloze technique could perhaps also be translated between two or more languages in order to obtain roughly equivalent tests in different languages. Multiple choice tests that qualify as pragmatic tasks should be translatable in this way (for examples, see Chapter 9). Passages in English and Fante were equated by a translation procedure to test reading comprehension in a standard multiple choice format by Bezanson and Hawkes (1976). Kolers (1968) used a brief paragraph carefully translated into French as a basis for investigating the nature of bilingualism. He also constructed two test versions in which either English or French phrases were interpolated into the text in the opposite language. The task required of subjects was to read each passage aloud. Kolers was interested in whether passages requiring switching between English and French would require more time than the monolingual passages. He determined that each switch on the average required an extra third of a second. The task, however, and the procedure for setting up the task could be adapted to the interests of bilingual or multilingual testing in other settings.
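As a concrete sketch of how roughly equivalent cloze tests can be built over a passage and its translation, consider the following. The every-seventh-word deletion ratio, the sample sentences, and the function name are illustrative assumptions, not details drawn from McLeod's study (his actual procedures are the ones described in Chapter 12); the point is only that the same mechanical procedure is applied independently to each language version.

```python
# A minimal sketch of fixed-ratio cloze construction applied independently
# to a passage and to its translation (illustrative assumptions throughout;
# see Chapter 12 for the procedures actually at issue).

def make_cloze(passage, nth=7, blank="______"):
    """Replace every nth word with a blank; return the mutilated text
    and the deleted words, which serve as the scoring key."""
    words = passage.split()
    key = []
    for i in range(nth - 1, len(words), nth):
        key.append(words[i])
        words[i] = blank
    return " ".join(words), key

english = "One morning the farmer led his old horse down the narrow road to the market in the village"
french = "Un matin le fermier mena son vieux cheval par la route etroite vers le marche du village"

english_test, english_key = make_cloze(english)
french_test, french_key = make_cloze(french)
# Note that the blanks in the two versions need not (and should not) be
# forced into one-for-one correspondence across the languages.
```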
Ultimately, the arguments briefly presented here concerning the feasibility of setting up equivalent tests by translation between two or more languages are related to the controversy about discrete point and integrative or pragmatic tests. Generally speaking, it should be difficult to translate discrete point items in a multiple choice (or other) format while maintaining equivalence, though it is probably quite feasible to translate pragmatic tests and thereby to obtain equivalent tests in more than one language. For any given discrete item, translation into another language will produce (in principle and of necessity) a substantially different item. For any given pragmatic testing procedure, on the other hand, translation into another language (if it is done carefully) can be expected to produce a substantially similar test. In this respect, discrete items are delicate whereas pragmatic procedures are robust. Of course, the possibility of translating tests of either type merits a great deal more empirical study.

F. Dominance and proficiency

We noted above that the 'Lau Remedies' require data concerning language use and dominance. The questions of importance would seem to be, what language does the child prefer to use (or feel most comfortable using) with whom and in what contexts of experience? And, in which of two or more languages is the child most competent? The most common way of getting information concerning language use is by either interviewing the children and eliciting information from them, or by addressing a questionnaire to the teacher(s) or some other person who has the opportunity to observe the child. Spolsky, Murphy, Holm, and Ferrel (1972) offer a questionnaire in Spanish and one in Navajo (either of which can also be used in English) to 'classify' students roughly according to the same basic categories that are recommended in the 'Lau Remedies' (see Spolsky, et al, p. 79). Since the 'Remedies' came more than two years later than the Spolsky, et al article, it may be safe to assume that the scale recommended in the 'Remedies' derives from that source. The teacher questionnaire involves a similar five point scale (see Spolsky, et al, p. 81).

Two crucial questions arise. Can children's responses to questions concerning their language-use patterns be relied upon for the important educational decisions that must be made? And, second, can teachers judge the language ability of children in their classrooms (bear in mind that in many cases the teachers are not bilingual themselves; in fact, Spolsky, 1974, estimates that only about 5% of the teachers on the Navajo reservation and in BIA schools speak Navajo)? Related to these crucial questions is the empirical problem of devising careful testing procedures to assess the validity of self-reported data (by the child) and teacher-reported judgements. Spolsky, et al made several assumptions concerning the interview technique which they used for assessing dominance in Spanish and English:

1. '... bilingual dominance varies from domain to domain.' Subscores were therefore given for the domains of home, neighborhood, and school.
2. A child's report of his own language use is likely to be quite accurate.
3. Vocabulary fluency (word-naming) is a good measure of knowledge of a language and it is a good method of comparing knowledge of two languages.
4. The natural bias of the schools in Albuquerque as a testing situation favors the use of English; this needs to be counteracted by using a Spanish speaking interviewer (p. 78).

If such assumptions can be made safely, it ought to be possible to make similar assumptions in other contexts and with little modification extend the 'three functional tests of oral proficiency' recommended by Spolsky, et al. Yet they themselves say, somewhat ambiguously: 'Each of the tests described was developed for a specific purpose and it would be unwise to use it more widely without careful consideration, but the general principles involved may prove useful to others who need tests that can serve similar purposes' (p. 77). One is inclined to ask, what kind of consideration? Obviously a local meeting of the School Board or some other organization will not suffice to justify the assumptions listed above, or to guarantee the success of the testing procedures, with or without adaptations to fit a particular local situation (Spolsky, 1974). What is required first is some careful logical investigation of possible outcomes from the procedures recommended by Spolsky, et al, and other procedures which can be devised for the purpose of cross-validation. Second, empirical study is required as illustrated, for example, in the Appendix below.

Zirkel (1974) points out that it is not enough merely to place a child on a dominance scale. Simple logic will explain why. It is possible for two children to be balanced bilinguals in terms of such a scale but to differ radically in terms of their developmental levels. An extreme case would be children at different ages. A more pertinent case would be two children of the same age and grade level who are balanced bilinguals (thus both scoring at the mid point of the dominance scale, see p. 76 above), but who are radically different in language skill in both languages. One child might be performing at an advanced level in both languages while the other child is performing at a much lower level in both languages. Measuring for dominance only would not reveal such a difference. No experimentation is required to show the inadequacy of any procedure that merely assesses dominance - even if it does the job accurately, and it is doubtful whether some of the procedures being recommended can do even the job of dominance assessment accurately. Besides, there are important considerations in addition to mere language dominance which can enter the picture only when valid proficiency data are available for both languages (or each of the several languages in a multilingual setting). Moreover, with care to insure test equivalence across the languages assessed, dominance scores can be derived directly from proficiency data - the reverse is not necessarily possible. Hence, the question concerning how to acquire reliable information concerning language proficiency in multilingual contexts, including the important matter of determining language dominance, is essentially the same question we have been dealing with throughout this book. In order to determine language dominance accurately, it is necessary to impose the additional requirement of equating tests across languages.
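Zirkel's objection is easy to see with invented numbers; the children and scores in the following sketch are hypothetical, constructed only to illustrate the logic:

```python
# Hypothetical illustration of Zirkel's point: identical dominance,
# radically different proficiency. All scores are invented.
children = {
    "child_1": {"A": 92, "B": 90},  # performing at an advanced level in both
    "child_2": {"A": 31, "B": 29},  # performing at a much lower level in both
}
for name, scores in children.items():
    dominance = scores["A"] - scores["B"]  # the A-minus-B difference score
    print(name, "dominance:", dominance, "proficiencies:", scores)
# Both children show a difference of +2 and land at the balanced mid point
# of a dominance scale, yet they differ radically in skill. Dominance can
# be computed from paired proficiency scores, but not the reverse.
```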
Preliminary results of McLeod (1975), Klare, Sinaiko, and Stolurow (1972), Oller, Bowen, Dien, and Mason (1972), and Bezanson and Hawkes (1976) suggest that careful translation may offer a solution to the equivalence problem, and no doubt there are other approaches that will prove equally effective. There are pitfalls to be avoided, however. There is no doubt that it is possible to devise tests that do not accomplish what they were designed to accomplish - that are not valid. Assumptions of validity are justifiable only to the extent that assumptions of lack of validity have been disposed of by careful research. On that theme let us reconsider the four assumptions quoted earlier in this section.

What is the evidence that bilingual dominance 'varies from domain to domain'? In 1969, Cooper reported that a word-naming task (the same sort of task used in the Spanish-English test of Spolsky, et al, 1972) which varied by domains such as 'home' versus 'school' or 'kitchen' versus 'classroom' produced different scores depending on the domain referred to in a particular portion of the test. The task set the examinee was to name all the things he could think of in the 'kitchen', for example. Examinees completed the task for each domain (five in all in Cooper's study) in both languages without appropriate counterbalancing to avoid an order effect. Since there were significant contrasts between relative abilities of subjects to do the task in Spanish and English across domains, it was concluded that their degree of dominance varied from one domain to another. This is a fairly broad leap of inference, however. Consider the following question: does the fact that I can name more objects in Spanish that I see in my office than objects that I can see under the hood of my car mean that I am relatively more proficient in Spanish when sitting in my office than when looking under the hood of my car? What Cooper's results seem to show (and Teitelbaum, 1976, found similar results with a similar task) is that the number of things a person can name in reference to one physical setting may be smaller or greater than the number that the same person can name in reference to another physical setting. This is not evidence of a very direct sort about possible changes in language dominance when sitting in your living room, or when sitting in a classroom. Not even the contrast in 'word-naming' across languages is necessarily an indication of any difference whatsoever in language dominance in a broader sense. Suppose a person learned the names of chess pieces in a language other than his native language, and suppose further that he does not know the names of the pieces in his native language. Would this make him dominant in the foreign language when playing chess?

A more important question is not whether there are contrasts across domains, but whether the 'word-naming' task is a valid indication of language proficiency. Insufficient data are available. At face value such a task appears to have little relation to the sorts of things that people normally do with language, children especially. Such a task does not qualify as a pragmatic testing procedure because it does not require time-constrained sequential processing, and it is doubtful whether it requires mapping of utterances onto extralinguistic contexts in the normal ways that children might perform such mappings - naming objects is relatively simpler than even the speech of median-ranged three-and-a-half year old children (Hart, 1974, and Hart, Walker, and Gray, 1977). Teitelbaum (1976) correlated scores on word-naming tasks (in Spanish) with teacher-ratings and self-ratings differentiated by four domains ('kitchen, yard, block, school'). For a group of kindergarten through 4th grade children in a bilingual program in Albuquerque (nearly 100 in all), the correlations ranged from .15 to .45. Correlations by domain with scores on an interview task, however, ranged from .69 to .79. These figures hardly justify the differentiation of language dominance by domain. The near equivalence of the correlations across domains with a single interview task seems to show that the domain differentiation is pointless. Cohen (1973) has adapted the word-naming task slightly to convert it into a story-telling procedure by domains. His scoring is based on the number of different words used in each story-telling domain. Perhaps other scoring techniques should also be considered.

The second assumption quoted above was that a child's own report of his language use is apt to be 'quite accurate'. This may be more true for some children than for others. For the children in Teitelbaum's study neither the teacher's ratings nor the children's own ratings were very accurate. In no case did they account for more than 20% of the variance in more objective measures of language proficiency (since variance overlap is the square of the correlation, even a correlation of .45 amounts to only about 20% of the variance in common). What about the child Zirkel (1976) referred to? What if some children are systematically indoctrinated concerning what language they are supposed to use at school and at home, as some advocates of the 'home language/school language' dichotomy advocate? Some research with bilingual children seems to suggest that at an early age they may be able to discriminate appropriately between occasions when one language is called for and occasions when the other language is required (e.g., answering in French when spoken to in French, but in English when spoken to in English, Kinzel, 1964), without being able to discuss this ability at a more abstract level (e.g., reporting when you are supposed to speak French rather than English). Teitelbaum's data reveal little correlation between questions about language use and scores on more objective language proficiency measures. Is it possible that a bilingual child is smart enough to be sensitive to what he thinks the interviewer expects him to say? Upshur (1971a) observes, 'it isn't fair to ask a man to cut his own throat, and even if we should ask, it isn't reasonable to expect him to do it. We don't ask a man to rate his proficiency when an honest answer might result in his failure to achieve some desired goal' (p. 58). Is it fair to expect a child to respond independently of what he may think the interviewer wants to hear?

We have dealt earlier with the third assumption quoted above, so we come to the fourth. Suppose that we assume the interviewer should be a speaker of the minority language (rather than English) in order to counteract an English bias in the schools. There are several possibilities. Such a provision may have no effect, the desired effect (if indeed it is desired, as it may distort the picture of the child's true capabilities along the lines of the preceding paragraph), or an effect that is opposite to the desired one. The only way to determine which result is the actual one is to devise some empirical measure of the relative magnitude of a possible interviewer effect.
G. Tentative suggestions

What methods then can be recommended for multilingual testing? There are many methods that can be expected to work well and deserve to be tried - among them are suitable adaptations of the methods discussed in this book in Part Three. Some of the ones that have been used with encouraging results include oral interview procedures of a wide range of types (but designed to elicit speech from the child and data on comprehension, not necessarily the child's own estimate of how well he speaks a certain language) - see Chapter 11. Elicited imitation (a kind of oral dictation procedure) has been widely used - see Chapter 10. Versions of the cloze procedure (particularly ones that may be administered orally) are promising and have been used with good results - see Chapter 12. Variations on composition tasks and story telling or retelling have also been used - see Chapter 13. No doubt many other procedures can be devised - Chapter 14 offers some suggestions and guidelines.

In brief, what seems to be required is a class of testing procedures providing a basis for equivalent tests in different languages that will yield proficiency data in both languages and that will simultaneously provide dominance scores of an accurate and sensitive sort. Figure 3 offers a rough conceptualization of the kinds of equivalent measures needed. If Scale A in the diagram represents a range of possible scores on a test in language A, and if Scale B represents a range of possible scores on an equivalent test in language B, the relative difference in scores on A and B can provide the basis for placement on the dominance scale C (modelled after the 'Lau Remedies' or the Spolsky, et al, 1972, scales).

Figure 3. A dominance scale in relation to proficiency scales. (Scales A and B represent equivalent proficiency tests in languages A and B, each running from 0% to 100%, while Scale C represents a dominance scale marked at the points A, Ab, AB, Ba, and B, as required by the Lau Remedies. It is claimed that the meaning of C can only be adequately defined in relation to scores on A and B.)

It would be desirable to calibrate both of the proficiency scales with reference to comparable groups of monolingual speakers of each language involved (Cowan and Zarmed, 1976, followed such a procedure) in order to be able to interpret scores in relation to clear criteria of performance. The dominance scale can be calibrated by defining distances on that scale in terms of units of difference in proficiency on Scales A and B. This can be done as follows: first, subtract each subject's score on A from the score on B. (If the tests are calibrated properly, it is not likely that anyone will get a perfect score on either test, though there may be some zeroes.) Then rank order the results. They should range from a series of positive values to a series of negative values. If the group tested consists only of children who are dominant in one language, there will be only positive or only negative values, but not both. The ends of the rank will define the ends of the dominance scale (with reference to the population tested) - that is, the A and B points on Scale C in Figure 3. The center point, AB on Scale C, is simply the zero position in the rank. That is the point at which a subject's scores in both languages are equal.
The points between the ends and the center, namely, Ab and Ba, can be defined by finding the mid point in the rank between that end (A or B) and the center (AB). The degree of accuracy with which a particular subject can be classed as A = monolingual in A, Ab = dominant in A, AB = equally bilingual in A and B, Ba = dominant in B, B = monolingual in B, can be judged quite accurately in terms of the standard error of differences on Scales A and B. The standard error of the differences can be computed by finding the variance of the differences (A minus B, for each subject) and then dividing it by the square root of the number of subjects tested. If the distribution of differences is approximately normal, chances are better than 99 in 100 that a subject's true degree of bilinguality will fall within the range of plus or minus three standard errors above or below his actual attained score on Scale C. If measuring off ±3 standard errors from a subject's attained score still leaves him close to, say, Ab on the Scale, we can be confident in classifying him as 'dominant in A'. Thus, if the average standard error of differences in scores on tests A and B is large, the accuracy of Scale C will be less than if the average standard error is small. A general guideline might be to require at least six standard errors between each of the five points on the dominance scale. It remains to be seen, however, what degree of accuracy will be possible. For suggestions on equating scales across languages, see Chapter 10, pp. 289-295.¹

¹ Also see discussion question number 10 at the end of Chapter 10. Much basic research is needed on these issues.
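To make the recipe concrete, here is a minimal sketch in code. Everything in it is illustrative: the scores are invented, the rank-midpoint computation is one plausible reading of 'the mid point in the rank', and the standard error follows the wording above (the variance of the differences divided by the square root of the number of subjects) rather than any other convention.

```python
# A minimal sketch of the dominance-scale calibration described above.
# All scores are hypothetical; this follows the text's recipe, not a
# validated or standard procedure.
import math
from bisect import bisect_left

def calibrate_dominance(scores_a, scores_b):
    # Difference scores (A minus B), placed in rank order.
    diffs = sorted(a - b for a, b in zip(scores_a, scores_b))
    n = len(diffs)
    ab_pos = bisect_left(diffs, 0)   # zero position in the rank (point AB)
    a_pos, b_pos = n - 1, 0          # ends of the rank (points A and B)
    ab_mid = (a_pos + ab_pos) // 2   # Ab: mid point in the rank, A to AB
    ba_mid = (b_pos + ab_pos) // 2   # Ba: mid point in the rank, B to AB
    # Standard error of the differences, computed as the text describes.
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / n
    se = variance / math.sqrt(n)
    points = {"B": diffs[b_pos], "Ba": diffs[ba_mid], "AB": 0,
              "Ab": diffs[ab_mid], "A": diffs[a_pos]}
    return points, se

# Hypothetical scores for ten children on equivalent 0-100 tests.
lang_a = [88, 75, 64, 59, 51, 47, 40, 33, 27, 15]
lang_b = [20, 31, 39, 45, 50, 55, 61, 70, 78, 90]
points, se = calibrate_dominance(lang_a, lang_b)
# Per the guideline above, adjacent points on Scale C should lie at least
# six standard errors apart before classifications based on plus or minus
# three standard errors around an attained score are trusted.
print(points, "standard error:", round(se, 2))
```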
KEY POINTS

1. There is a serious need for multilingual testing in the schools not just of the United States, but in many nations.
2. In the Lau versus Nichols case in 1974, the Supreme Court ruled that the San Francisco Schools were violating a section of the Civil Rights Code which 'bans discrimination based on the ground of race, color, or national origin' (1974, No. 72-6520). It was ruled that the schools should either provide instruction in the native language of the 1,800 Chinese speaking children in question, or provide special instruction in the English language.
3. Even at the present, in academic year 1978-9, many bilingual programs and many schools which have children of multilingual backgrounds are not doing adequate language assessment.
4. There are important parallels between multilingual and multidialectal societies. In both cases there is a need for language assessment procedures referenced against group norms (a plurality of them).
5. In a strong logical sense, a language is its varieties or dialects, and the dialects or varieties are languages. A particular variety may be elevated to a higher status by virtue of the 'more equal syndrome', but this does not necessitate that other varieties must therefore be regarded as less than languages - mere 'potential languages'.
6. Prejudices can be institutionalized in theories of 'deficits' or 'deprivations'. The pertinent question is, from whose point of view? The institutionalization of such theories into discriminatory educational practices may well create real deficits.
7. It is hypothesized that, at least for the minority child, and perhaps for the majority child as well, variance in learning in the schools may be much more a function of the emotive aspects of interactions within and outside of schools than it is a function of methods of teaching and presentation of subject matter per se.
8. When emotive level struggles arise, factive level communication usually stops altogether.
9. Ignoring the existence of a child or social group is a cruel punishment. Who discovered America?
10. Getting the cultural bias out of language tests would mean making them into something besides language tests. However, adapting them to particular cultural needs is another matter.
11. Contrastive analysis based explanations for the generally lower scores of minority children on educational tests run into major empirical difficulties. Other factors appear to be much more important than the surface forms of different languages or language varieties.
12. Translating discrete point test items is roughly comparable to translating jokes, or puns, or poems. It is difficult if not impossible.
13. Translating pragmatic tests or testing procedures, on the other hand, is more like translating prose or translating a novel. It can be done, not easily perhaps, but at least it is feasible.
14. 'Blind back translation' is one procedure for checking the accuracy of translation attempts.
15. Measures of multilingual proficiencies require valid proficiency tests. The validity of proposed procedures is an empirical question. Assumptions must be tested, or they remain a threat to every educational decision based on them.
16. Measuring dominance is not enough. To interpret the meaning of a score on a dominance scale, it is useful to know the proficiency scores which it derives from.
17. It is suggested that a dominance scale of the sort recommended in the Lau Remedies can be calibrated in terms of the average standard error of the differences in test scores against which the scale is referenced.
18. Similarly, it is recommended that scores on the separate proficiency tests be referenced against (and calibrated in terms of) the scores of monolingual children who speak the language of the test. (It is realized that this may not be possible to attain in reference to some very small populations where the minority language is on the wane.)

DISCUSSION QUESTIONS

1. How are curricular decisions regarding the delivery of instructional materials made in your school(s)? Schools that you know of?
2. If you work in or know of a bilingual program, what steps are taken in that program to assess language proficiency? How are the scores interpreted in relation to the curriculum? If you knew that a substantial number of the children in your school were approximately equally proficient in two languages, what curricular decisions would you recommend? What else would you need to know in order to recommend policies?
3. If you were asked to rank priorities for budgetary expenditures, where would language testing come on the list for tests and measurements? Or, would there be any such list?
4. What is wrong with 'surname surveys' as a basis for determining language dominance? What can be said about a child whose name is Ortiz? Smith? Reitzel? What about asking the child concerning his language preferences? What are some of the factors that might influence the child's response? Why not just have the teachers in the schools judge
the proficiency of the children?
5. What price would you say would be a fair one for being able to communicate in one of the world's power languages (especially English)? Consider the case for the child in the African primary schools as suggested by Hofman. Is it worth the cost?
6. Try to conceive of a language test that need not make reference to group norms. How would such a test relate to educational policy?
7. Consider doing a study of possible language variety bias in the tests used in your school. Or perhaps a language bias study, or a combination of the two, would be more appropriate. What kinds of scores would be available for the study? IQ? Aptitude? Achievement? Classroom observations? What sorts of language proficiency measures might you use? (Anyone seriously considering such a study is urged to consult Part Three, and to ask the advice of someone who knows something about research design before actually undertaking the project. It could be done, however, by any teacher or administrator capable of persuading others that it is worth doing.) Are language variety biases in educational testing different from language biases?
8. Begin to construct a list of the sociocultural factors that might be partially accountable for the widely discussed view that has been put forth by Jensen and Herrnstein to the effect that certain races are superior in intelligence. What is intelligence? How is it measured with reference to your school or school populations that you know of?
9. Spend a few days as an observer of children in a classroom setting. Note ways in which the teacher and significant others in the school communicate emotive information to the children. Look especially for contrasts in what is said and what is meant from the child's perspective. Consider the school curriculum with a view to its representation of different cultures and language varieties. Observe the behaviour of the children. What kinds of messages do they pick up and pass on? What kinds of beliefs and attitudes are they being encouraged to accept or reject?
10. Discuss the differences between 'submersion' and 'immersion' as educational approaches to the problem of teaching a child a new language and cultural system. (See Barik and Swain, 1975, in Suggested Readings at the end of this chapter.)
11. Consider an example, or several of them, of children who are especially low achievers in your school or in a school that you know of. Look for sociocultural factors that may be related to the low achievement. Then, consider some of the high achievers in the same context. Consider the probable effects of sociocultural contexts. Do you see a pattern emerging?
12. To what extent is 'cultural bias' in tests identical to 'experiential bias', i.e., simply not having been exposed to some possible set of experiences? Can you find genuine cultural factors that are distinct from experiential biases? Examine a widely used standardized test. If possible discuss it with someone whose cultural experience is very different from your own. Are there items in it that are clearly biased? Culturally? Or merely experientially?
13. Discuss factors influencing children who learn to read before they come to school. Consider those factors in relation to what you know of children who fail to learn to read after they come to school. Where does language development fit into the picture? Books in the home? Models who are readers that the child might emulate?
Do children of widely different dialect origins in the United States (or elsewhere) learn to read much the same material? Consider Great Britain, Ireland, Australia, Canada, and other nations.
14. Consider the fact that test is a four letter word (Upshur, 1969b). Why? How have tests been misused, or simply used, so as to make them seem so ominous, portentous, even wicked? Reconsider the definition of language tests offered in Chapter 1. Can you think of any innocuous and benign procedures that qualify as tests? How could such procedures be used to reduce the threat of tests?
15. Try translating a few discrete items on several types of tests. Have someone else who has not seen the original items do a 'blind back translation'. Check the comparability of the results. Try the same with a passage of prose. Can corrections be made to clear up the difficulties that arise? Are there substantial contrasts between the two tasks?
16. Try a language interview procedure that asks children how well they understand one or more languages and when they use it. A procedure such as the one suggested for Spanish-English bilinguals by Spolsky, et al (1972, see Suggested Readings for Chapter 4) should suffice. Then test the children on a battery of other measures - teacher evaluations of the type suggested by Spolsky, et al (1972) for Navajo-English bilinguals might also be used. Intercorrelate the scores and determine empirically their degree of variance overlap. To what extent can the various procedures be said to yield the same information?
17. What kinds of tests can you conceive of to assess the popular opinion that language dominance varies from domain to domain? Be careful to define 'domain' in a sociolinguistically meaningful and pragmatically useful way.
18. Devise proficiency tests in two or more languages and attempt to calibrate them in the recommended ways - both against the scores of monolingual reference groups (if these are accessible) and in terms of the average standard error of the differences on the two tests. Relate them to a five point dominance scale such as the one shown in Figure 3 above. Correlate scores on the proficiency tests that you have devised with other standardized tests used in the school from which your sample population was drawn. To what extent are variances on other tests accounted for by variances in language proficiencies (especially in English, assuming that it is the language of practically all standardized testing in the schools)?

SUGGESTED READINGS

1. H. C. Barik and Merrill Swain, 'Three-year Evaluation of a Large Scale Early Grade French Immersion Program: the Ottawa Study,' Language Learning 25, 1975, 1-30.
2. Andrew Cohen, 'The Sociolinguistic Assessment of Speaking Skills in a Bilingual Education Program,' in L. Palmer and B. Spolsky (eds.) Papers on Language Testing. Washington, D.C.: TESOL, 1975, 173-186.
3. Paul A. Kolers, 'Bilingualism and Information Processing,' Scientific American 218, 1968, 78-84.
4. 'OCR Sets Guidelines for Fulfilling Lau Decision,' The Linguistic Reporter 18, 1975, 1, 5-7. (Gives addresses of Lau Centers and quotes the text of the 'Lau Remedies' recommended by the Task Force appointed by the Office of Civil Rights.)
5. Patricia J. Nakano, 'Educational Implications of the Lau v. Nichols Decision,' in M. Burt, H. Dulay, and M. Finocchiaro (eds.) Viewpoints on English as a Second Language. New York: Regents, 1977, 219-234.
6. Bernard Spolsky, 'Speech Communities and Schools,' TESOL Quarterly 8, 1974, 17-26.
7. Bernard Spolsky, Penny Murphy, Wayne Holm, and Allen Ferrel, 'Three Functional Tests of Oral Proficiency,' TESOL Quarterly 6, 1972, 221-235. (Also in Palmer and Spolsky, 75-90, see reference 2 above. Page references in this text are to the latter source.)
8. Perry A. Zirkel, 'A Method for Determining and Depicting Language Dominance,' TESOL Quarterly 8, 1974, 7-16.
9. Perry A. Zirkel, 'The Why's and Ways of Testing Bilinguality Before Teaching Bilingually,' The Elementary School Journal, March 1976, 323-330.

5 Measuring Attitudes and Motivations

A. The need for validating affective measures
B. Hypothesized relationships between affective variables and the use and learning of language
C. Direct and indirect measures of affect
D. Observed relationships to achievement and remaining puzzles

A great deal of research has been done on the topic of measuring the affective side of human experience. Personality, attitudes, emotions, feelings, and motivations, however, are subjective things, and even our own experience tells us that they are as changeable as the wind. The question here is whether or not they can be measured. Further, what is the relationship between existing measures aimed at affective variables and measures aimed at language skill or other educational constructs?

A. The need for validating affective measures

No one seems to doubt that attitudinal factors are related to human performances.¹ In the preceding chapter we considered the hypothesis that emotive or affective factors play a greater part in determining success or failure in schools than do factive or cognitive factors (particularly teaching methodologies). The problem is how to determine what the affective factors might be. It is widely believed that a child's self-concept (confidence or lack of it, willingness to take social risks, and all around sense of well-being) must contribute to virtually every sort of school performance - or performance outside of the school for that matter. Similarly, it is believed that the child's view of others (whether of the child's own ethnic and social background, or of a different cultural background, whether peers or non-peers) will influence virtually every aspect of his interpersonal interactions in positive and negative ways that contribute to success or failure (or perhaps just to happiness in life, which, though a vaguer concept, may be a better one).

¹ Throughout this chapter and elsewhere in the book, we often use the term 'attitudes' as a cover term for all sorts of affective variables. While there are many theories that distinguish between many sorts of attitudinal, motivational and personality variables, all of them are in the same boat when it comes to validation.

It is not difficult to believe that attitude variables are important to a wide range of cognitive phenomena - perhaps the whole range - but it is difficult to say just exactly what attitude variables are. Therefore, it is difficult to prove by the usual empirical methods of science that attitudes actually have the importance usually attributed to them. In his widely read and often cited book, Beyond Freedom and Dignity, B. F. Skinner (1971) offers the undisguised thesis that such concepts as 'freedom' and 'dignity', not to mention 'anxiety', 'ego', and the kinds of terms popular in the attitude literature, are merely loose and misleading ways of speaking about the kinds of events that control behavior.
He advocates sharpening up our thinking and our ways of controlling behavior in order to save the world - 'not ... to destroy [the] environment or escape from it ... [but] to redesign it' (p. 39). To Skinner, attitudes are dispensable intervening variables between behavior and the consequences of behavior. They can thus be done away with. If he were quite correct, it ought to be possible to observe behaviors and their consequences astutely enough to explain all there is to explain about human beings. However, there is a problem for Skinner's approach: even simpler systems than human beings are not fully describable in that way. Try observing the input and output of a simple desk calculator, for example, and see if it is possible to determine how it works inside - then recall that human beings are much more complex than desk calculators. Only very radical and very narrow (and therefore largely uninteresting and very weak) theories are able to dispense completely with attitudes, feelings, personalities, and other difficult-to-measure internal states and motives of human beings. It seems necessary, therefore, to take attitudes into account.

The problem begins, however, as soon as we try to be very explicit about what an attitude is. Shaw and Wright (1967) say that 'an attitude is a hypothetical, or latent, variable' (p. 15). That is to say, an attitude is not the sort of variable that can be observed directly. If someone reports that he is angry, we must either take his word for it, or test his statement on the basis of what we see him doing to determine whether or not he is really angry. In attitude research, the chain of inference is often much longer than just the inference from a report to an attitude or from a behavioral pattern to an attitude. The quality or quantity of the attitude can only be inferred from some other variable that can be measured. For instance, a respondent is frequently asked to evaluate a statement about some situation (or possibly a mere proposition about a very general state of affairs). Sometimes he is asked to say how he would act or feel in a given described situation. In so-called 'projective' techniques, it is further necessary for some judge or evaluator to rate the response of the subject for the degree to which it displays some attitude. In some of these techniques there are so many steps of inference where error might arise that it is amazing that such techniques ever produce usable data, but apparently they sometimes do. Occasionally, we may be wrong in judging whether someone is happy or sad, angry or glad, anxious or calm, but often we are correct in our judgements, and a trained observer may (not necessarily will) become very skilled in making such judgements.

Shaw and Wright (1967) suggest that 'attitude measurement consists of the assessment of an individual's responses to a set of situations. The set of situations is usually a set of statements (items) about the attitude object, to which the individual responds with a set of specific categories, e.g., agree and disagree .... The ... number derived from his scores represents his position on the latent attitude variable' (p. 15). Or, at least, that is what the researcher hopes - and often it is what the researcher merely asserts.
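Shaw and Wright's description can be made concrete with a small scoring sketch. The category weights below are illustrative only; they are not drawn from any particular published scale:

```python
# Illustrative category weights for an agree-disagree (Likert-type) scale.
RESPONSE_WEIGHTS = {
    "strongly disagree": 1,
    "disagree": 2,
    "undecided": 3,
    "agree": 4,
    "strongly agree": 5,
}

def attitude_score(responses):
    """Sum the category weights over all items of the scale."""
    return sum(RESPONSE_WEIGHTS[r] for r in responses)

# One respondent's answers to a five-item scale:
answers = ["agree", "undecided", "strongly agree", "agree", "disagree"]
print(attitude_score(answers))  # 18: the 'position on the latent variable'
```

The arithmetic is the easy part; whether the resulting number locates anyone on a 'latent attitude variable' is precisely what remains to be shown.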
For example, Gardner and Lambert (1972) assert that the degree of a person's agreement with the statement that 'Nowadays more and more people are prying into matters that should remain personal and private' is a reflection of their degree of 'anti-democratic ideology' (p. 150). The statement appears in a scale supposed by its authors (Adorno, Frenkel-Brunswik, Levinson, and Sanford, 1950) to measure 'prejudice and authoritarianism'. The scale was used by Gardner and Lambert in the 1960s in the states of Maine, Louisiana, and Connecticut. Who can deny that the statement was substantially true and was becoming more true as the Nixon regime grew and flourished?

The trouble with most attitude measures is that they have never been subjected to the kind of critical scrutiny that should be applied to any test that is used to make judgements (or to refute judgements) about human beings. In spite of the vast amounts of research completed in the last three or four decades on the topic of attitudes, personality, and measures of related variables, precious little has been learned that is not subject to severe logical and empirical doubts. Shaw and Wright (1967), who report hundreds of attitude measures along with reliability statistics and validity results, where available, lament their 'impression that much effort has been wasted .... ' and that 'nowhere is this more evident than in relation to the instruments used in the measurement of attitudes' (p. ix). They are not alone in their disparaging assessment of attitude measures. In his preface to a tome of over a thousand pages on Personality: Tests and Reviews, Oscar K. Buros (1970) says:

Paradoxically, the area of testing which has outstripped all others in the quantity of research over the past thirty years is also the area in which our testing procedures have the generally least accepted validity .... In my own case, the preparation of this volume has caused me to become increasingly discouraged at the snail's pace at which we are advancing compared to the tremendous progress being made in the areas of medicine, science, and technology. As a profession, we are prolific researchers; but somehow or other there is very little agreement about what is the resulting verifiable knowledge (p. xix).

In an article on a different topic and addressed to an entirely different audience, John R. Platt offered some comments that may help to explain the relative lack of progress in social psychology and in certain aspects of the measurement of sociolinguistic variables. He argued that a thirty year failure to agree is proof of a thirty year failure to do the kind of research that produces the astounding and remarkable advances of fields like 'molecular biology and high-energy physics' (1964, p. 350). The fact is that the social sciences generally are among the 'areas of science that are sick by comparison because they have forgotten the necessity for alternative hypotheses and disproof' (p. 350). He was speaking of his own field, chemistry, when he coined the terms 'The Frozen Method, The Eternal Surveyor, The Never Finished, The Great Man with a Single Hypothesis, The Little Club of Dependents, The Vendetta, The All Encompassing Theory which Can Never Be Falsified' (p. 350), but do these terms not have a certain ring of familiarity with reference to the social sciences? What is the solution? He proposes a return to the old-fashioned method of inductive inference - with a couple of embellishments.
It is necessary to form multiple working hypotheses instead of merely popularizing the ones that we happen to favor, and instead of trying merely to support or, worse yet, to prove hypotheses (which, strictly speaking, is a logical impossibility for interesting empirical hypotheses), we should be busy eliminating the plausible alternatives - alternatives which are rarely addressed in the social sciences. As Francis Bacon stressed so many years ago, and Platt re-emphasizes, science advances only by disproofs. Is it the failure to disprove that explains the lack of agreement in the social sciences? Is there a test that purports to be a measure of a certain construct? What else might it be a measure of? What other alternatives are there that need to be ruled out? Such questions have usually been neglected. Has a researcher found a significant difference between two groups of subjects? What plausible explanations for the difference have not been ruled out?

In addition to forming multiple working hypotheses (which will help to keep our thinking impartial as well as clear), it is necessary always to recycle the familiar steps of the Baconian method: (1) form clear hypotheses; (2) design crucial experiments to eliminate as many as possible; (3) carry out the experiments; and (4) 'refine the possibilities that remain' (Platt, 1964, p. 347) - and do it all over again, and again, and again, always eliminating some of the plausible alternatives and refining the remaining ones. By such a method, the researcher enhances the chances of formulating a more powerful explanatory theory on each cycle from data to theory and back to data with maximum efficiency. Platt asks if there is any virtue in plodding almost aimlessly through thirty years of work that might be accomplished in two or three months with a little forethought and planning.

In the sequel to the first volume on personality tests, Personality Tests and Reviews II, Buros (1974) offers the following stronger statement:

It is my considered belief that most standardized tests are poorly constructed, of questionable or unknown validity, pretentious in their claims, and likely to be misused more often than not (p. xvii).

In another compendium, one that reviews some 3,000 sources of psychological tests, E. Lowell Kelly writes in the Foreword (see Chun, Cobb, and French, 1975):

At first blush, the development of such a large number of assessment devices in the past 50 years would appear to reflect remarkable progress in the development of the social sciences. Unfortunately, this conclusion is not justified, ... nearly three out of four of the instruments are used but once and often only by the developer of the instrument. Worse still, ... the more popular instruments tend to be selected more frequently not because they are better measuring instruments, but primarily because of their convenience (p. v).

It is one thing to say that a particular attitude questionnaire, or rating procedure of whatever sort, measures a specific attitude or personality characteristic (e.g., authoritarianism, prejudice, anxiety, ego strength/weakness, or the like), but it is an entirely different matter to prove that this is so. Indeed, it is often difficult to conceive of any test whatsoever that would constitute an adequate criterion for 'degree of authoritarianism' or 'degree of empathy', etc.
Often the only validity information offered by the designer or user of a particular procedure for assessing some hypothetical (at best latent) attitudinal variable is the label that he associates with the procedure. Kelly (1975) classes this kind of validity, along with several other types, as what he calls 'pseudo-validity'. He refers to this kind of validity as 'nominal validity' - it consists in 'the assumption that the instrument measures what its name implies'. It is related closely to 'validity by fiat' - 'assertion by the author (no matter how distinguished!) that the instrument measures variable X, Y, or Z' (p. vii, from the Foreword to Chun, et al, 1975). One has visions of a king laying the sword blade on either shoulder of the knight-to-be and declaring to all ears, 'I dub thee Knight.' Hence the test author gives 'double' (or 'dubious', if you prefer) validity to his testing instrument - first by authorship and then by name.

Is there no way out of this difficulty? Is it not possible to require more of attitude measures than the pseudo-validities which characterize so many of them? In a classic paper on 'construct validity', Cronbach and Meehl (1955) addressed this important question. If we treat attitudes as theoretical constructs and measures of attitudes (or measures that purport to be measures of attitudes) as tests, then the tests are subject to the same sorts of empirical and theoretical justification that are applied to any construct in scientific study. Cronbach and Meehl say, 'a construct is some postulated attribute of people assumed to be reflected in performance' (p. 283). They continue, 'persons who possess this attribute will, in situation X, act in manner Y (with a stated probability)' (p. 284).

Among the techniques for assessing the validity of a test of a postulated construct are the following: select a group whose behavior is well known (or can be determined) and test the hypothesis that the behavior is related to an attitude (e.g., the classic example used by Thurstone and others was church-goers versus non-church-goers). Or, if the attitude or belief of a particular group is known, some behavioral criterion can be predicted and the hypothesized outcome can be tested by the usual method. These techniques are rather rough and ready, and can be improved upon generally (in fact they must be improved upon if attitude measures are to be more finely calibrated) by devising alternative measures of the attitude or construct and assessing the degree (and pattern) of variance overlap on the various measures by correlation and related procedures (e.g., factor analysis and multiple regression techniques). For instance, as Cronbach and Meehl (1955) observe, 'if two tests are presumed to measure the same construct, a correlation between them is predicted ... If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies' (p. 287). Following this latter procedure, scales which are supposed to assess the same construct can be correlated to determine whether in fact they share some variance.
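A minimal sketch of this convergence check follows. The data are simulated: three hypothetical scales are built from one common latent trait plus measurement error, so the intercorrelation matrix and the loadings on the largest factor behave in the way Cronbach and Meehl describe. Nothing here reproduces any study cited in this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
trait = rng.normal(size=50)  # a latent construct for 50 simulated subjects

# Three hypothetical scales, each equal to the trait plus measurement error.
scales = np.vstack([
    trait + rng.normal(scale=0.5, size=50),
    trait + rng.normal(scale=0.8, size=50),
    trait + rng.normal(scale=1.2, size=50),
])

r = np.corrcoef(scales)               # the matrix of intercorrelations
eigvals, eigvecs = np.linalg.eigh(r)  # eigenvalues in ascending order
loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])  # loadings on the largest factor
                                                  # (an eigenvector's sign is arbitrary)
print(np.round(r, 2))
print(np.round(loadings, 2))
```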
The extent of the correlation, or of a scale's tendency to correlate with a mathematically defined factor (or to 'load' on such a factor), may be taken as a kind of index of the validity of the measure. Actually, the latter techniques are related to the general internal consistency of measures that purport to measure the same thing. To prove a high degree of internal consistency between a variety of measures of, say, 'prejudice' is not to prove that the measures are indeed assessing degree of prejudice, but if any one of them can be shown on independent grounds to be a measure of prejudice, then confidence in all of the measures is thereby strengthened. (For several applications of rudimentary factor analysis, see the Appendix.)

Ultimately, there may be no behavioral criterion which can be proposed as a basis for validating a given attitude measure. At best there may be only a range of criteria which, taken as a whole, cause us to have confidence in the measurement technique (for most presently used measures, however, there is no reason at all to have confidence in them). As Cronbach and Meehl (1955) put it, 'personality tests, and some tests of ability, are interpreted in terms of attributes for which there is no criterion' (p. 299). In such cases, it is principally through statistical and empirical methods of assessing the internal consistency of such devices that their construct validity is to be judged. They give the example of the measurement of temperature. One criterion we might impose on any measuring device is that it should show a higher temperature for water when it is boiling than when it is frozen solid, or when it feels hot to the touch rather than cold to the touch - but the criterion can be proved to be more crude than the degrees of temperature a variety of thermometers are capable of displaying. Moreover, the subjective judgement of temperature is more subject to error than are measurements derived directly from the measuring techniques.

The trouble with attitudes is that they are so out of reach, and at the same time they are apparently subject to a kind of fluidity that allows them to change (or perhaps be created on the spot) in response to different social situations. Typically, it is the effects of attitudes that we are interested in rather than the attitudes per se; or alternatively, it is the social situations that give rise to both the attitudes and their effects which are the objects of interest. Recently, there has been a surge of interest in the topic of how attitudes affect language use and language behavior. We turn now to that topic, and return below to the kinds of measures of attitudes that have been widely used in the testing of certain research hypotheses related to it.

B. Hypothesized relationships between affective variables and the use and learning of language

At first look, the relationship of attitudes to the use and learning of language may appear to be a small part of the problem of attitudes in general - what about attitude toward self, significant others, situations, groups, etc.? But a little reflection will show that a very wide range of variables is encompassed by the topic of this subsection. Moreover, they are variables that have received much special attention in recent years.
In addition to concern for attitudes and the way they influence a child's learning and use of his own native language (or possibly his native languages in the event that he grows up with more than one at his command), there has been much concern for the possible effects that attitudes may have on the learning of a second or third language and the way such attitudes affect the choice of a particular linguistic code (language or style) in a particular context. Some of the concern is generated by actual research results - for instance, results showing a significant relationship between certain attitude measures or motivation indices and attainment in a second language. Probably most of it, however, is generated by certain appealing arguments that often accompany meager or even contradictory research findings - or no findings at all.

A widely cited and popular position is that of Guiora and his collaborators (see Guiora, Paluszny, Beit-Hallahmi, Catford, Cooley, and Dull, 1975). The argument is an elaborate one and has many interesting ramifications and implications, but it can be capsulized by a few selected quotations from the cited article. Crucial to the argument are the notions of empathy (being able to put yourself in someone else's shoes - apropos of Shakespeare's claim that 'a friend is another self'), and language ego (that aspect of self awareness related to the fact that 'I' sound a certain way when 'I' speak and that this is part of 'my' identity):

We hypothesized that this ability to shed our native pronunciation habits and temporarily adopt a different pronunciation is closely related to empathic capacity (p. 49).

One wonders whether having an empathic spirit is a necessary criterion for acquiring a native-like pronunciation in another language. A sufficient one? But the hypothesis has a certain appeal as we read more.

With pronunciation viewed as the core of language ego, and as the most critical contribution of language ego to self-representation, we see that the early flexibility of ego boundaries is reflected in the ease of assimilating native-like pronunciation by young children; the later reduced flexibility is reflected in the reduction of this ability in adults (p. 46).

Apart from some superfluity and circularity (or perhaps because of it), so far so good. They continue,

At this point we can link empathy and pronunciation of a second language. As conceived here, both require a temporary relaxation of ego boundaries and thus a temporary modification of self-representation. Although psychology traditionally regards language performance as a cognitive-intellectual skill, we are concerned here with that particular aspect of language behavior that is most critically tied to self-representation (p. 46).

But the most persuasive part of the argument comes two pages later:

... superimposed upon the speech sounds of the words one chooses to utter are sounds which give the listener information about the speaker's identity. The listener can decide whether one is sincere or insincere. Ridicule the way I sound, my dialect, or my attempts at pronouncing French and you will have ridiculed me.
(One might, however, be regarded as oversensitive if one spoke very bad French, mightn't one?)

Ask me to change the way I sound and you ask me to change myself. To speak a second language authentically is to take on a new identity. As with empathy, it is to step into a new and perhaps unfamiliar pair of shoes (p. 48).

What about the empirical studies designed to test the relationship between degree of empathy and acquisition of unfamiliar phonological systems? What does the evidence show? The empirical data is reviewed by Schumann (1975), who, though he is clearly sympathetic with the general thesis, finds the empirical evidences for it excessively weak. The first problem is in the method for measuring 'empathy' by the Micro Momentary Expression test - a procedure that consists of watching silent psychiatric interviews and pushing a button every time a change is noticed in the expression on the face of the person being interviewed by the psychiatrist (presumably the film does not focus on the psychiatrist, but Guiora, et al, are not clear on this in their description of the so-called measure of empathy). Reasonable questions might include: is this purported measure of empathy correlated with other purported measures of the same construct? Is the technique reliable? Are the results similar on different occasions under similar circumstances? Can trained 'empathizers' do the task better than untrained persons? Is there any meaningful discrimination on the MME between persons who are judged on other grounds to be highly empathetic and persons who are judged to be less so? Apparently, the test, as a measure of empathy, must appeal to 'nominal validity'.

A second problem with the empirical research on the Guiora, et al, hypothesis is the measure of pronunciation accuracy, the Standard Thai Procedure:

The STP consists of a master tape recording of 34 test items (ranging in length from 1 to 4 syllables) separated by a 4 second pause. The voicer is a female Thai native speaker.... Total ... testing time is 4½ minutes.... (p. 49). The scoring procedure is currently under revision. The basic evaluation method involves rating tone, vowel and consonant quality for selected phonetic units on a scale of 1 (poor), 2 (fair), or 3 (native-like). Data tapes are rated independently by three native Thai speakers, trained in pronunciation evaluation.... A distinct advantage of the STP is that it can be used with naive subjects. It bypasses the necessity of first teaching subjects a second language (p. 50).

A distinct advantage to test people who have not learned the language? Presumably the test ought to discriminate between more and less native-like pronouncers of a language that the subjects have already learned. Does it? No data are offered. Obviously, the STP would not be a very direct test of, say, the ability to pronounce Spanish utterances with a native-like accent - or would it? No data are given. Does the test discriminate between persons who are judged to speak 'seven languages in Russian' (said with a thick Russian accent) and persons who are judged by respective native speaker groups to speak two or more languages so as to pass themselves for a native speaker of each of the several languages? Reliability and validity studies of the test are conspicuously absent.

The third problem with the Guiora, et al, hypothesis is that the empirical results that have been reported are either only weakly interpretable as supporting their hypothesis, or they are not interpretable at all. (Indeed, it makes little sense to try to interpret the meaning of a correlation between two measures about which it cannot be said with any confidence what they are measures of.) Their prediction is that empathetic persons will do better on the STP than less empathetic persons.
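Before turning to those results, it is worth noting that the reliability questions just raised are not exotic. Even the most rudimentary check on the STP could begin by intercorrelating the three raters' scores. A sketch, with invented ratings on the 1-to-3 scale for the first twelve items of one subject's tape (none of these figures come from Guiora, et al):

```python
import numpy as np

# Invented ratings (1 = poor, 2 = fair, 3 = native-like) by three raters.
rater_a = [3, 2, 2, 1, 3, 2, 1, 1, 2, 3, 2, 2]
rater_b = [3, 2, 1, 1, 3, 2, 2, 1, 2, 3, 2, 1]
rater_c = [2, 2, 2, 1, 3, 3, 1, 1, 1, 3, 2, 2]

r = np.corrcoef(np.array([rater_a, rater_b, rater_c]))
mean_r = r[np.triu_indices(3, k=1)].mean()  # average of the three pairwise rs

print(np.round(r, 2))
print(round(float(mean_r), 2))
```

High inter-rater correlations would not, of course, establish that the ratings measure native-likeness - only that the raters agree with one another.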
Two empirical studies attempted to establish the connection between empathy and attained skill in pronunciation - both are discussed more fully by Schumann (1975) than by Guiora, et al (1975). The first study, with Japanese learners, failed to show any relationship between attained pronunciation skill and the scores on the MME (the STP was not used). The second study, a more extensive one, found significant correlations between the MME and rated pronunciation in Spanish (for 109 subjects at the Defense Language Institute in a three month intensive language course), Russian (for 201 subjects at DLI), Japanese (13 subjects at DLI), Thai (40 subjects), and Mandarin (38). There was a difficulty, however. The correlations were positive for the first three languages and negative for the last two. This would seem to indicate that if the MME measures empathy, for some groups it is positively associated with the acquisition of pronunciation skills in another language, and for other groups it is negatively associated with the same process. What can be concluded from such data? Very little.

[...] want to acquire another (see Lambert, 1955, and Gardner and Lambert, 1959). A similar claim was offered for ethnocentrism, although it was expected to be negatively correlated with attainment in the target language. Neither of these hypotheses (if they can be termed hypotheses) has proved to be any more susceptible to empirical test than the one about types of motivation. The arguments in each case have a lot of appeal, but the hypotheses themselves do not seem to be borne out by the data - the results are either unclear or contradictory. We will return to consider some of the specific measuring instruments used by Gardner and Lambert in section C.

Yet another research tradition in attitude measurement is that of Joshua Fishman and his co-workers. Perhaps his most important argument is that there must be a thoroughgoing investigation of the factors that determine who speaks what language (or variety) to whom and under what conditions. In a recent article by Cooper and Fishman (1975), twelve issues of interest to researchers in the area of 'language attitude' are discussed. The authors define language attitudes either narrowly, as related to how people think people ought to talk (following Ferguson, 1972), or broadly as 'those attitudes which influence language behavior and behavior toward language' (pp. 188-9). The first definition they consider to be too narrow and the second 'too broad' (p. 189). Among the twelve questions are: (1) Does attitude have a characteristic structure? ... ; (2) To what extent can language attitudes be componentialized? ... ; (3) Do people have characteristic beliefs about their language? (e.g., that it is well suited for logic, or poetry, or some special use); (4) Are language attitudes really attitudes toward people who speak the language in question, or toward the language itself? Or some of each? (5) What is the effect of context on expressed stereotypes? Is one language considered inappropriate in some social settings but appropriate in others? (6) Where do language attitudes come from? Stereotypes that are irrational? Actual experiences? (7) What effects do they have? ('Typically, only modest relationships are found between attitude measures and the overt behaviors which such scores are designed to predict' p. 191); (8) Is one language apt to be more effective than another for persuading bilinguals under different circumstances?
(9) 'What relationships exist among attitude scores obtained from different classes of attitudinal measurements (physiological or psychological reaction, situational behavior, verbal report)?' (p. 191); (10) How about breaking down the measures into ones that assess reactions 'toward an object, toward a situation, and toward the object in that situation' (p. 191)? What would the component structure be like then? (11) 'Do indirect measures of attitude (i.e., measures whose purpose is not apparent to the respondent) have higher validities than direct measures'? ... (p. 192); (12) 'How reliable are indirect measures ... ?' (p. 192).

It would seem that attitude studies generally share certain strengths and weaknesses. Among the strengths is the intuitive appeal of the argument that people's feelings about themselves (including the way they talk) and about others (and the way they talk) ought to be related to their use and learning of any language(s). Among the weaknesses is the general lack of empirical vulnerability of most of the theoretical claims that are made. They stand or fall on the merits of their intuitive appeal and quite independently of any experimental data that may be accumulated. They are subject to wide differences of interpretation, and there is a general (nearly complete) lack of evidence on the validity of purported measures of attitudinal constructs. What data are available are often contradictory or uninterpretable - yet the favored hypothesis still survives. In brief, most attitude studies are not empirical studies at all. They are mere attempts to support favored 'working' hypotheses - the hypotheses will go right on working even if they turn out to predict the wrong (or worse yet, all possible) experimental outcomes.

Part of the problem with measures of attitudes is that they require subjects to be honest in the 'throat-cutting' way - to give information about themselves which is potentially damaging (consider Upshur's biting criticism of such a procedure, cited above on p. 98). Furthermore, they require subjects to answer questions or react to statements that sometimes place them in the double-bind of being damned any way they turn - the condemnation which intelligent people can easily anticipate may be light or heavy, but why should subjects be expected to give meaningful (indeed, non-pathological) responses to such items? We consider some of the measurement techniques that are of the double-bind type in the next section. As Watzlawick, Beavin, and Jackson (1967) reason, such double-bind situations are apt to generate pathological responses ('crazy' and irrational behaviors) in people who by all possible non-clinical standards are 'normal'.

And there is a further problem which, even if the others could be resolved, remains to be dealt with - it promises to be knottier than all of the rest. Whose value system shall be selected as the proper criterion for the valuation of scales or the interpretation of responses? By whose standards will the questions be interpreted? This is the essential validity problem of attitude inventories, personality measures, and affective valuations of all sorts. It knocks at the very doors of the unsolved riddles of human existence. Is there meaning in life? By whose vision shall it be declared? Is there an absolute truth? Is there a God? Will I be called to give an accounting for what I do? Is life on this side of the grave all there is? What shall I do with this man Jesus?
What shall I do with the moral precepts of my own culture? Yours? Someone else's? Are all solutions to the riddles of equivalent value? Is none of any value? Who shall we have as our arbiter? Skinner? Hitler? Kissinger?

Shaw and Wright (1967) suggest that 'the only inferential step' in the usual techniques for the measurement of attitudes 'is the assumption that the evaluations of the persons involved in scale construction correspond to those of the individuals whose attitudes are being measured' (p. 13). One does not have to be a logician to know that the inferential step Shaw and Wright are describing is clearly not the only one involved in the measurement of attitudes. In fact, if that were the only step involved, it would be entirely pointless to waste time trying to devise measurement instruments - just ask the test constructors to divulge their own attitudes at the outset. Why bother with the step of inference? There are other steps that involve inferential leaps of substantial magnitude, but the one they describe as the 'only' one is no doubt the crucial one. There is a pretense here that the value judgements concerning what is a prejudice and what is not a prejudice, or what is anxiety and what is not anxiety, what is aggressiveness or acquiescence, what is strength and what is weakness, etc. ad infinitum, can be acceptably and impartially determined by some group consensus. What group will the almighty academic community pick? Or will the choice of a value system for affective measurement in the schools be made by political leaders? By Marxists? Christians? Jews? Blacks? Whites? Chicanos? Navajos? Theists? Atheists? Upper class? Middle class? Humanists? Bigots? Existentialists? Anthropologists? Sexists? Racists? Militarists? Intellectuals?

The trouble is the same one that Plato discussed in his Republic - it is not a question of how to interpret a single statement on a questionnaire (though this is a question of importance for each such statement), it is a question of how to decide cases of disagreement. Voting is one proposal, but if the minority (or a great plurality of minorities, down to the level of individuals) get voted down, shall their values be repressed in the institutionalization of attitude measures?

C. Direct and indirect measures of affect

So many different kinds of techniques have been developed for the purpose of trying to get people to reveal their beliefs and feelings that it would be impossible to be sure that all of the types had been covered in a single review. Therefore, the intent of this section is to discuss the most widely used types of attitude scales and other measurement techniques and to try to draw some conclusions concerning their empirical validities - particularly for the measures that have been used in conjunction with language proficiency and the sorts of hypotheses considered in section B above.

Traditionally, a distinction is made between 'direct' and 'indirect' attitude measures. Actually, we have already seen that there is no direct way of measuring attitudes, nor can there ever be. This is not so much a problem with measuring techniques per se as it is a problem of the nature of attitudes themselves. There can be no direct measure of a construct that is evidenced only indirectly, and subjectively even then.

As qualities of human experience, emotions, attitudes, and values are notoriously ambiguous in their very expression. Joy or sadness may be evident by tears.
A betrayal or an unfeigned love may be demonstrated with a kiss. Disgust or rejoicing may be revealed by laughter. Approval or disapproval by a smile. Physiological measures might offer a way out, but they would have to be checked against subjective judgements. An increase in heart rate might be caused by fear, anger, surprise, etc. But even if a particular constellation of glandular secretions, palmar sweating, galvanic skin response, and other physiological responses were thought to indicate some emotional state, presumably the test would have to be validated against subjective judgement by asking the patient displaying the pertinent constellation of factors, 'Do you feel angry now?' As D. M. MacKay (1951) noted in his insightful article 'Mindlike Behavior in Artefacts,' we could know all about the inner workings of a neon sign without knowing the meaning of the words that it displays. The problem of attitudes is like that; it is distinctly a matter of interpretation.

What measures of attitudinal variables have been used or recommended in studies of language use and language learning? We have mentioned a few above; we will look more closely at a number of them below. They include the notoriously unreliable 'projective' techniques such as the Thematic Apperception Test (see Anastasi, 1976, pp. 565-7 and 579-85), which is a Rorschach-like test consisting of 19 deliberately vague pictures concerning which subjects are asked to tell stories. They also include the ranking of statements concerning how well they describe the respondent. More often, attitude scales are of the more traditional agree-disagree type originated by Likert (1932). Recently, a variant on the Likert scale technique has been developed by Zirkel and Jackson (1974) for use with children from four different ethnic backgrounds. It employs line drawings of faces supposedly ranging from very happy to very sad.

First, we will consider three of the attitude scales of the Likert type used by Gardner and Lambert (1972) - the F Scale, the E Scale, and the Anomie Scale. The instructions used for each scale were similar:

The following statements are opinions which have often been expressed by students about your age. They cover a wide range of topics and it has been found that many people agree with each statement and many disagree. There are no right or wrong answers. You are asked to mark each statement in the left-hand margin according to your agreement or disagreement as follows:

+1: slight support, agreement
+2: moderate support, agreement
+3: strong support, agreement
-1: slight opposition, disagreement
-2: moderate opposition, disagreement
-3: strong opposition, disagreement

MODIFIED F SCALE
1. Obedience and respect for authority are the most important virtues children should learn.
2. What youth needs most is strict discipline, rugged determination, and the will to work and fight for family and country.
3. Nowadays when so many different kinds of people move around and mix together so much, a person has to protect himself especially carefully against catching an infection or disease from them.
4. What this country needs most, more than laws and political programs, is a few courageous, tireless, devoted leaders in whom the people can put their faith.
5. No weakness or difficulty can hold us back if we have enough will power.
6. Human nature being what it is, there will always be war and conflict.
7. A person who has bad manners, habits, and breeding can hardly expect to get along with decent people.
8. People can be divided into two distinct classes: the weak and the strong.
9. There is hardly anything lower than a person who does not feel a great love, gratitude, and respect for his parents.
10. The true American way of life is disappearing so fast that force may be necessary to preserve it.
11. Nowadays more and more people are prying into matters that should remain personal and private.
12. If people would talk less and work more, everybody would be better off.
13. Most people don't realize how much our lives are controlled by plots hatched in secret places.

According to Gardner and Lambert (1972, p. 150), all of the foregoing statements 'reflect antidemocratic ideology'. In fact, the original authors of the scale developed items from research on 'the potentially fascistic individual' (Adorno, et al, 1950, p. 1) which 'began with anti-Semitism in the focus of attention' (p. 2). Sources for the items were subject protocols from 'factual short essay questions pertaining to such topics as religion, war, ideal society, and so forth; early results from projective questions; finally, and by far the most important, material from the interviews and the Thematic Apperception Tests' (p. 225). The thirteen items given above were selected from Forms 45 and 40 of the Adorno, et al (1950) F Scale, consisting of some 46 items, according to Gardner and Lambert (1972). Actually, item 10 given above was from Form 60 (an earlier version of the F Scale).

There are two major sources of validity evidence on the Fascism Scale (that is, the F Scale) that are easily accessible. First, there are the intercorrelations between the early versions of the F Scale and measures that were supposed (by the original authors of the F Scale) to be similar in content; second, there are the data in the several correlation tables offered by Gardner and Lambert, which can be examined. According to Adorno, et al (1950, pp. 222-24), the original purpose in developing the F Scale was to try to obtain a less obvious measure of 'antidemocratic potential' (p. 223, their italics) than was available in the E Scale (or Ethnocentrism Scale) which they had already developed.
Immediately following is the Gardner and Lambert adaptation of a selected set of the questions on the E Scale which was used in much of their attitude research related to language use and language learning. Below, we return to the question of the validity of the F Scale in relation to the E Scale:

MODIFIED E SCALE
1. The worst danger to real Americanism during the last fifty years has come from foreign ideas and agitators.
2. Now that a new world organization is set up, America must be sure that she loses none of her independence and complete power as a sovereign nation.
3. Certain people who refuse to salute the flag should be forced to conform to such a patriotic action, or else be imprisoned.
4. Foreigners are all right in their place, but they carry it too far when they get too familiar with us.
5. America may not be perfect, but the American way has brought us about as close as human beings can get to a perfect society.
6. It is only natural and right for each person to think that his family is better than any other.
7. The best guarantee for our national security is for America to keep the secret of the nuclear bomb.

These items were selected by Gardner and Lambert from original items recommended for the final form of the Adorno, et al E Scale. The original authors reason, 'the social world as most ethnocentrists see it is arranged like a series of concentric circles around a bull's-eye. Each circle represents an ingroup-outgroup distinction; each line serves as a barrier to exclude all outside groups from the center, and each group is in turn excluded by a slightly narrower one. A sample "map" illustrating the ever-narrowing ingroup would be the following: Whites, Americans, native-born Americans, Christians, Protestants, Californians, my family, and finally - I' (p. 148). Thus, the items on the E Scale are expected to reveal the degree to which the respondent is unable 'to identify with humanity' (p. 148).

How well does the E Scale, and its more indirect counterpart the F Scale, accomplish its purpose? One way of testing the adequacy of both scales is to check their intercorrelation. This was done by Adorno, et al, and they found correlations ranging from a low of .59 to a high of .87 (1950, p. 262). They concluded that if the tests were lengthened, or corrected for the expected error of measurement in any such test, they should intercorrelate at the .90 level (see their footnote, p. 264). From these data the conclusion can be drawn that if either scale is tapping an 'authoritarian' outlook, both must be.

However, the picture changes radically when we examine the data from Gardner and Lambert (1972). In studies with their modified (in fact, shortened) versions of the F and E Scales, the correlations were .33 (for 96 English speaking high school students in Louisiana), .39 (for 145 English speaking high school students in Maine), .33 (for 142 English speaking high school students in Connecticut), .33 (for 80 French-American high school students in Louisiana), and .46 (for 98 French-American high school students in Maine). In none of these studies does the overlap in variance on the two tests exceed 22%, and the pattern is quite different from the true relationship posited by Adorno, et al between F and E (the variance overlap should be nearly perfect). What is the explanation?

One idea that has been offered previously (Liebert and Spiegler, 1974, and their references) and which seems to fit the data from Gardner and Lambert relates to a subject's tendency merely to supply what are presumed to be the most socially acceptable responses. If the subject were able to guess that the experimenter does in fact consider some responses more appropriate than others, this would create some pressure on sensitive subjects to give the socially acceptable responses. Such pressure would tend to result in positive correlations across the scales. Another possibility is that subjects merely seek to appear consistent from one answer to the next. The fact that one has agreed or disagreed with a certain item on either the E or F Scale may set up a strong tendency to respond as one has responded on previous items - a so-called 'response set'. If the response set factor were accounting for a large portion of the variance in measures like E and F, then this would also account for the high correlations observed between them. In either event, shortening the tests as Gardner and Lambert did would tend to reduce the amount of variance overlap between them, because it would necessarily reduce the tendency of the scales to establish a response set, or it would reduce the saliency of socially desirable responses. All of this could happen even if neither test had anything to do with the personality traits they are trying to measure. Along this line, Crowne and Marlowe (1964) report:

Acquiescence has been established as a major response determinant in the measurement of such personality variables as authoritarianism (Bass, 1955, Chapman and Campbell, 1957, Jackson and Messick, 1958). The basic method has been to show, first of all, that a given questionnaire - say the California F Scale (Adorno, et al, 1950) - has a large proportion of items keyed agree (or true or yes). Second, half the items are reversed, now being scored for disagreement. Correlations are then computed between the original and the reversed items. Failure to find high negative correlations is, then, an indication of the operation of response acquiescence. In one study of the F Scale, in fact, significant positive correlations - strongly indicative of an acquiescent tendency - were found (Jackson, Messick, and Solley, 1957) (p. 7).

Actually, the failure to find high negative correlations is not necessarily indicative only of a response acquiescence tendency; there are a number of other possibilities, but all of them are fatal to the claims of validity for the scale in question.
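The reversal check described in the quotation can be simulated in a few lines. In this sketch the respondents are invented: their agreement is driven partly by a trait and partly by sheer yea-saying, and the correlation between agreement with the original and the reversed items shows which influence dominates. The numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
trait = rng.normal(size=100)       # the trait the scale claims to measure
yea_saying = rng.normal(size=100)  # acquiescence: agreeing regardless of content

# Agreement (on a -3..+3 style continuum) with original and reversed items.
agree_original = trait + yea_saying + rng.normal(scale=0.5, size=100)
agree_reversed = -trait + yea_saying + rng.normal(scale=0.5, size=100)

r = np.corrcoef(agree_original, agree_reversed)[0, 1]
print(round(float(r), 2))  # strongly negative only if content, not acquiescence, dominates
```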
Another problem with scales like E and F involves the tendency for respondents to differentiate factive and emotive aspects of statements with which they are asked to agree or disagree. One may agree with the factive content of a statement and disagree with the emotive tone (both of which, in the case of written questionnaires, are coded principally in choice of words). Consider, for instance, the factive content and emotive tone of the following statements (the first version is from the Anomie Scale which is discussed below):

A. The big trouble with our country is that it relies, for the most part, on the law of the jungle: 'Get him before he gets you.'
B. The most serious problem of our people is that too few of them practice the Golden Rule: 'Do unto others as you would have them do unto you.'
C. The greatest evil of our country is that we exist, for the most part, by the law of survival: 'Speak softly and carry a big stick.'

Whereas the factive content of the preceding statements is similar in all cases, and though each might be considered a rough paraphrase of the others, they differ greatly in emotive tone. Concerning such differences (which they term 'content' and 'style' respectively), Crowne and Marlowe (1964) report:

Applying this differentiation to the assessment of personality characteristics or attitudes, Jackson and Messick (1958) contended that both stylistic properties of test items and habitual expressive or response styles of individuals may outweigh the importance of item content. The way an item is worded - its style of expression - may tend to increase its frequency of endorsement (p. 8).

Their observation is merely a special case of the hypothesis which we discussed in Chapter 4 (p. 82f) on the relative importance of factive and emotive aspects of communication.

Taking all of the foregoing into account, the validity of the E and F Scales is in grave doubt. The hypothesis that they are measures of the same basic configuration of personality traits (or at least of similar configurations associated with 'authoritarianism') is not the only hypothesis that will explain the available data - nor does it seem to be the most plausible of the available alternatives. Furthermore, if the validity of the E and F Scales is in doubt, their pattern of interrelationship with other variables - such as attained proficiency in a second language - is essentially uninterpretable.

A third scale used by Gardner and Lambert is the Anomie Scale adapted partly from Srole (1951, 1956):

ANOMIE SCALE
1. In the U.S. today, public officials aren't really very interested in the problems of the average man.
2. Our country is by far the best country in which to live. (The scale is reversed on this item and on number 8.)
3. The state of the world being what it is, it is very difficult for the student to plan for his career.
4. In spite of what some people say, the lot of the average man is getting worse, not better.
5. These days a person doesn't really know whom he can count on.
6. It is hardly fair to bring children into the world with the way things look for the future.
7. No matter how hard I try, I seem to get a 'raw deal' in school.
8. The opportunities offered young people in the United States are far greater than in any other country.
9. Having lived this long in this culture, I'd be happier moving to some other country now.
10. In this country, it's whom you know, not what you know, that makes for success.
11. The big trouble with our country is that it relies, for the most part, on the law of the jungle: 'Get him before he gets you.'
12. Sometimes I can't see much sense in putting so much time into education and learning.

This test is intended to measure 'personal dissatisfaction or discouragement with one's place in society' (Gardner and Lambert, 1972, p. 21).
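Scoring the scale is mechanical once the response format (the -3 to +3 agreement categories given earlier) and the reversed items (2 and 8, per the note in the scale itself) are fixed. A minimal sketch, with one invented respondent:

```python
REVERSED = {2, 8}  # per the note given with item 2 of the scale

def anomie_score(ratings):
    """ratings: item number (1-12) mapped to agreement from -3 to +3."""
    return sum(-v if item in REVERSED else v for item, v in ratings.items())

example = {1: 2, 2: 3, 3: 1, 4: 2, 5: 1, 6: -1,
           7: 0, 8: 2, 9: -2, 10: 3, 11: 1, 12: -1}
print(anomie_score(example))  # 1; higher totals = more expressed anomie
```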
their other works reprinted there) have consistently predicted that hIgher sc~res on ~he preceding scale should correspond to higher performance In learnmg a second language - i.e., that degree of anomie .~nd attainment of proficiency in a second language should be p~sltIvely correlated however, they have predicted that the correlatlOns for the ~ and F Scales with attainment in a second language should be negatIve.,T~e difficulty is that other authors have argued.t~at s~ores on the Ano~e Scale and the E and F Scales should be pOSItiVely mtercorrelated ~Ith each other - that, in fact, 'anomia is a fa~tor ~elated to ~he formatlOn of negative rejective attitJ,ldes toward nnnonty groups (Srole, 1956, p.712). . ' Srole cites a correlation of .43 between a scale deSIgned to meas~r~ prejudice toward minorities and his 'Anomia S~ale'-(b~th 'anonne and 'anomia' are used in designating the scale m the hterature); as evidence that 'lostness is one of the basic conditions out of which some types of political authoritarianism eme~ge' (~. 714, footnote 20): Yet other authors have predicted no rel~tIonshlp at all betwe,en Anomie scores and F scores (Christie and Gels, 1970, p. 360). Again, we seem to be wrestling with hypotheses th~t are flavored more by the pr(;!ferences of a particular research tech~l~ue than they are by substantial research data. Even 'nominal' vahdity ~annot be i~voked when the researchers fail to agree on the meanmg of the name associated with a particular questionnaire. Gardner and Lambert (1972) report ,generally positive correlatio~s b.etween Anomie scores and E and F scores. This, rather than contnbutmg to a sense of confidence in the Anomie Scale, merely makes it, ~oo, suspect of a possible response set factor - or ~ tende~cy to glVe socIally acceptable responses, or possibly to gIve C~~slstent responses. to similarly scaled items (negatively toned or POSItiVely. toned). In bnef, there is little or no evidence to show that the scale m fact measures what it purports to measure. Christie and Geis (1970) suggest that the F Scale was possibly the most studied measure of attitudes for the preceding twenty year period (p. 38). One wonders how such a measu~e sur~ves in th~ f~ce of data which indicate that it has no substantial claIms to vahdlty. Further, one wonders why, if such a studied test has produces! such a . conglomeration of contradictory findings, any ~ne should expect to be able to whip together an attitude measure (WIth much le~s study) that will do any better. The problems are not merely techmcal ones, associated with test reliability and validity, they are also moral ones 129 having to do with the uses to which such tests are intended to be put. The difficulties are considered severe enough for Shaw and Wright (1967) to put the following statement in a conspicuous location at the end of the Preface to their book on Scales for the Measurement of Attitudes: The attitude scales in this book are recommended for research purposes and for group testing. We believe that the available information and supporting research does not warrant the application of many of these scales as measures of individual attitude for the purpose of diagnosis or personnel selection or for any other individual assessment process (p. xi). 
In spite of disclaimers like Shaw and Wright's, application of such measurement techniques to the diagnosis of individual performances - e.g., prognosticating the likelihood of individual success or failure in a course of study with a view to selecting students who are more likely to succeed in 'an overcrowded program' (Brodkey and Shore, 1976) - is sometimes suggested:

A problem which has arisen at the University of New Mexico is one of predicting the language-learning behavior of students in an overcrowded program which may in the near future become highly selective. This paper is a progress report on the design of an instrument to predict good and poor language learning behavior on the basis of personality. Subjects are students in the English Tutorial Program, which provides small sized classes for foreign, Anglo-American, and minority students with poor college English skills. Students demonstrate a great range of linguistic styles, including English as a second language, English as a second dialect, and idiosyncratic problems, but all can be characterized as lacking skill in the literate English of college usage - a difficult 'target' language (p. 153).

In brief, Brodkey and Shore set out to predict teacher ratings of students (on 15 positively scaled statements with which the teacher must agree or disagree) on the basis of the student's own preferential ordering of 40 statements about himself - some of the latter are listed below. The problem was to predict which students were likely to succeed. Presumably, students judged likely to succeed would be given preference at time of admission. Actually, the student was asked to sort the 40 statements twice - first, in response to how he would like to be, and second, how he was at the time of performing the task. A third score was derived by computing the difference between the first two scores. (There is no way to determine on the basis of information given by Brodkey and Shore how the items were scaled - that is, it cannot be determined whether agreeing with a particular statement contributed positively or negatively to the student's score.)

THE Q-SORT STATEMENTS (from Appendix B of Brodkey and Shore, 1976, pp. 161-2)
1. My teacher can probably see that I am an interesting person from reading my essays.
2. My teachers usually enjoy reading my essays.
3. My essays often make me feel good.
4. My next essay will be written mainly to please myself.
5. My essays often leave me feeling confused about my own ideas.
6. My writing will always be poor.
7. No matter how hard I try, my grades don't really improve much.
8. I usually receive fair grades on my assignments.
10. My next essay will be written mainly to please my teacher.
11. I dislike doing the same thing over and over again.
18. I often get my facts confused.
19. When I feel like doing something I go and do it now.
22. I have trouble remembering names and faces.
28. I am more interested in the details of a job than just getting it done.
29. I sometimes have trouble communicating with others.
30. I sometimes make decisions too quickly.
31. I like to get one project finished before starting another.
32. I do my best work when I plan ahead and follow the plan.
34. I try to get unpleasant tasks out of the way before I begin working on more pleasant tasks.
36. I always try to do my best, even if it hurts other people's feelings.
37. I sometimes hurt other people's feelings without knowing it.
38. I often let other people's feelings influence my decisions.
39. I am not very good at adapting to changes.
40. I am usually very aware of other people's feelings.
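The two-sort scoring described above can be reconstructed only conjecturally, since Brodkey and Shore do not report how the items were scaled. In rough terms, though, the procedure amounts to something like the following sketch (statement numbers and rank values are invented, and the summed absolute difference is only one plausible reading of their 'difference' score):

    # Conjectural sketch of the two-sort scoring; all values invented.
    # Each statement receives a rank in each sort, say 1 = least like
    # me ... 9 = most like me.
    ideal = {4: 9, 6: 1, 32: 8}    # 'how I would like to be'
    actual = {4: 5, 6: 4, 32: 6}   # 'how I am at present'

    # The third score was the difference between the two sorts; one
    # natural version is the summed absolute discrepancy.
    discrepancy = sum(abs(ideal[k] - actual[k]) for k in ideal)
    print(discrepancy)  # 9 on these invented values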
On the basis of what possible theory of personality can the foregoing statements be associated with a definition of the successful student? Suppose that some theory is proposed which offers an unambiguous basis for scaling the items as positive or negative. What is the relationship of an item like 37 to such a theory? On the basis of careful thought one might conclude that statement 37 is not a valid description of any possible person, since if such a person hurt other people's feelings without knowing it, how would he know it? In such a case the score might be either positively or negatively related to logical reasoning ability - depending on whether the item is positively or negatively scaled. Note also the tendency throughout to place the student in the position of potential double-binds. Consider item 28 about the details of a job versus getting it done. Agreeing or disagreeing may be true and false at the same time. Further, consider the fact that if the items related to the teacher's attitudes are scaled appropriately (that is, in accord with the teacher's attitudes about what a successful learner is), the test may be a measure of the subject's ability to perceive the teacher's attitudes - i.e., to predict the teacher's evaluation of the subject himself. This would introduce a high degree of correlation between the personality measure (the Q-sort) and the teacher's judgements (the criterion of whether or not the Q-sort is a valid measure of the successful student) - but the correlation would be an artefact (an artificially contrived result) of the experimental procedure. Or consider yet another possibility. Suppose the teacher's judgements are actually related to how well the student understands English - is it not possible that the Q-sort task might in fact discriminate among more and less proficient users of the language? These possibilities might combine to produce an apparent correlation between the 'personality' measure and the definition of 'success'.

No statistical correlation (in the sense of Chapter 3 above) is reported by Brodkey and Shore (1976). They do, however, report a table of correspondences between grades assigned in the course of study (which themselves are related to the subjective evaluations of teachers stressing 'reward for effort, rather than achievement alone', p. 154). Then they proceed to an individual analysis of exceptional cases: the Q-sort task is judged as not being reliable for '5 Orientals, 1 reservation Indian, and 3 students previously noted as having serious emotional problems' (p. 157). The authors suggest, 'a general conclusion might be that the Q-sort is not reliable for Oriental students, who may have low scores but high grades, and is slightly less reliable for women than men ... [for] students 30 or older, ... Q-sort scores seemed independent of grades ...' (p. 157). No explanations are offered for these apparently deviant cases, but the authors conclude nonetheless that 'the Q-sort is on the way to providing us with a useful predictive tool for screening Tutorial Program applicants' (p. 158).

In another study, reported in an earlier issue of the same journal, Chastain (1975) correlated scores on several personality measures with grades in French, German, and Spanish for students numbering 80, 72, and 77 respectively. In addition to the personality measures (which included scales purporting to assess anxiety, outgoingness, and creativity), predictor variables included the verbal and quantitative sub-scores on the Scholastic Aptitude Test, high school rank, and prior language experience. Chastain observes, 'surprising as it may seem, the direction of correlation was not consistent [for test anxiety]' (p. 160). In one case it was negative (for 15 subjects enrolled in an audio-lingual French course, -.48), and in two others it was positive (for the 77 Spanish students, .37; and for the 72 German students, .21). Chastain suggests that 'perhaps some concern about a test is a plus while too much anxiety can produce negative results' (p. 160). Is his measure valid?

Chastain's measure of Test Anxiety came from Sarason (1958, 1961). An example item given by Sarason (1958) is 'While taking an important examination, I perspire a great deal' (p. 340). In his 1961 study, Sarason reports correlations between 13 measures of 'intellectual ability' and the Test Anxiety scale along with five other measures of personality (all of them subscales on the Autobiographical Survey). For two separate studies with 326 males and 412 females (all freshman or sophomore students at the University of Washington, Seattle), no correlations above .30 were reported. In fact, Test Anxiety produced the strongest correlations with high school grade averages (divided into six categories) and with scores on Cooperative English subtests. The highest correlation was -.30 between Test Anxiety and the ACE Q (1948, presumably a widely used test since the author gives only the abbreviation in the text of the article). These are hardly encouraging validity statistics. A serious problem is that correlations of above .4 between the various subscores on the Autobiographical Survey may possibly be explained in terms of response set. There is no reason for concluding that Test Anxiety (as measured by the scale of the same name) is a substantial factor in the variance obtained in the various 'intellectual' variables. Since in no case did Chastain's other personality variables account for as much as 10% of the variance in grades, they are not discussed here. We only note in passing that he is probably correct in saying that 'course grade may not be synonymous with achievement' (p. 159) - in fact it may be sensitive to affective variables precisely because it may involve some affectively based judgement (see especially the basis for course grades recommended by Brodkey and Shore, 1976, p. 154).

We come now to the empathy measure used by Guiora, Paluszny, Beit-Hallahmi, Catford, Cooley, and Dull (1975) and by Guiora and others. In studying the article by Haggard and Isaacs (1966), where the original MME (Micro-Momentary Expression) test had its beginnings, it is interesting to note that for highly skilled judges the technique adapted by Guiora, et al had average reliabilities of only .50 and .55. The original authors (Haggard and Isaacs) suggest that 'it would be useful to determine the extent to which observers differ in their ability to perceive accurately rapid changes of facial expressions and the major correlates of this ability' (p. 164). Apparently, Guiora and associates simply adapted the test to their own purpose with little change in its form and without attempting (or at least without reporting attempts) to determine whether or not it constituted a measure of empathy.

From their own description of the MME, several problems become immediately apparent. The subject is instructed to push a button, which is attached to a recording device, whenever he sees a change in facial expression on the part of a person depicted on film. The first obvious trouble is that there is no apparent way to differentiate between hits and misses - that is, there is no way to tell for sure whether the subject pushed the button when an actual change was taking place or merely when the subject thought a change was taking place. In fact, it is apparently the case that the higher the number of button presses, the higher the judged empathy of the subject. Isn't it just as reasonable to assume that an inordinately high rate of button presses might correspond to a high rate of false alarms? In the data reported by Haggard and Isaacs, even highly skilled judges were not able to agree in many cases on when changes were occurring, much less on the meaning of the changes (the latter would seem to be the more important indicator of empathy). They observe, 'it is more difficult to obtain satisfactory agreement when the task is to identify and designate the impulse or affect which presumably underlies any particular expression or expression change' (p. 158). They expected to be able to tell more about the nature and meaning of changes when they slowed down the rate of the film. However, in that condition (a condition also used by Guiora and associates, see p. 51) the reliability was even lower on the average than it was in the normal speed condition (.50 versus .55, respectively).

Since it is axiomatic (though perhaps not exactly true for all empirical cases) that the validity of a test cannot exceed the square of its reliability, the validity estimates for the MME would have to be in the range of .25 to .30 - and this would be only for the case where the test is taken as a measure of someone's ability to notice changes in facial expressions, or better, as a measure of interjudge agreement on the task of noticing changes in facial expressions. The extrapolation from such judgements to 'empathy' as the construct to be measured by the MME is a wild leap indeed. No validity estimates are possible on the basis of available data for an inferential jump of the latter sort.
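The range of .25 to .30 just cited is simple arithmetic on the two reported reliabilities, taking the text's axiom at face value. (In much psychometric writing the ceiling is given as the square root of the reliability rather than its square; the parenthetical 'though perhaps not exactly true' may acknowledge as much. The check below simply follows the formula as stated.)

    # Validity ceiling per the stated axiom: reliability squared.
    for reliability in (0.50, 0.55):
        print(f"reliability {reliability:.2f} -> "
              f"validity ceiling {reliability ** 2:.2f}")
    # reliability 0.50 -> validity ceiling 0.25
    # reliability 0.55 -> validity ceiling 0.30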
Another widely used measure of attitudes - one that is somewhat less direct than questions or statements concerning the attitudes of the subject toward the object or situation of interest - is the semantic differential technique, which was introduced by Osgood, Suci, and Tannenbaum (1957) for a wider range of purposes. In fact, they were interested in the measurement of meaning in a broader sense. Their method, however, was adapted to attitude studies by Lambert and Gardner, and by Spolsky (1969a). Several follow up studies on the Spolsky research are discussed in Oller, Baca, and Vigil (1977). Gardner and Lambert (1972) reported the use of seven point scales of the following type (subjects were asked to rate themselves, Americans, how they themselves would like to be, French-Americans, and their French teacher):

SEMANTIC DIFFERENTIAL SCALES, BIPOLAR VARIETY
1. Interesting _:_:_:_:_:_:_ Boring
2. Prejudiced _:_:_:_:_:_:_ Unprejudiced
3. Brave _:_:_:_:_:_:_ Cowardly
4. Handsome _:_:_:_:_:_:_ Ugly
5. Colorful _:_:_:_:_:_:_ Colorless
6. Friendly _:_:_:_:_:_:_ Unfriendly
7. Honest _:_:_:_:_:_:_ Dishonest
8. Stupid _:_:_:_:_:_:_ Smart
9. Kind _:_:_:_:_:_:_ Cruel
10. Pleasant _:_:_:_:_:_:_ Unpleasant
11. Polite _:_:_:_:_:_:_ Impolite
12. Sincere _:_:_:_:_:_:_ Insincere
13. Successful _:_:_:_:_:_:_ Unsuccessful
14. Secure _:_:_:_:_:_:_ Insecure
15. Dependable _:_:_:_:_:_:_ Undependable
16. Permissive _:_:_:_:_:_:_ Strict
17. Leader _:_:_:_:_:_:_ Follower
18. Mature _:_:_:_:_:_:_ Immature
19. Stable _:_:_:_:_:_:_ Unstable
20. Happy _:_:_:_:_:_:_ Sad
21. Popular _:_:_:_:_:_:_ Unpopular
22. Hardworking _:_:_:_:_:_:_ Lazy
23. Ambitious _:_:_:_:_:_:_ Not Ambitious

Semantic differential scales of a unipolar variety were used by Gardner and Lambert (1972) and by Spolsky (1969a) and others (see Oller, Baca, and Vigil, 1977). In form they are very similar to the bipolar scales except that the points of the scales have to be marked with some value such as 'very characteristic' or 'not at all characteristic', or possibly 'very much like me' or 'not at all like me'. Seven point and five point scales have been used.

In an evaluation of attitudes toward the use of a particular language, Lambert, Hodgson, Gardner, and Fillenbaum (1960) used a 'matched guise' technique. Fluent French-English bilinguals recorded material in both languages. The recordings from several speakers were then presented in random order and subjects were asked to rate the speakers. (Subjects were, of course, unaware that each speaker was heard twice, once in English and once in French.)

SEMANTIC DIFFERENTIAL SCALES, UNIPOLAR VARIETY
1. Height    very little _:_:_:_:_:_:_ very much

and so on for the attributes: good looks, leadership, thoughtfulness, sense of humor, intelligence, honesty, self-confidence, friendliness, dependability, generosity, entertainingness, nervousness, kindness, reliability, ambition, sociability, character, and general likability.

Spolsky (1969a) and others have used similar lists of terms presumably defining personal attributes: helpful, humble, stubborn, businesslike, shy, nervous, kind, friendly, dependable, and so forth. The latter scales in Spolsky's studies, and several modeled after his, were referenced against how subjects saw themselves to be, how they would like to be, and how they saw speakers of their native language and speakers of a language they were in the process of acquiring.

How reliable and valid are the foregoing types of scales? Little information is available. Spolsky (1969a) reasoned that scales such as the foregoing should provide more reliable data than those which were based on responses to direct questions concerning a subject's agreement or disagreement with a statement rather bald-facedly presenting a particular attitude bias, or than straightforward questions about why subjects were studying the foreign language and the like. The semantic differential type scales were believed to be more indirect measures of subject attitudes and therefore more valid than more direct questions about attitudes. The former, it was reasoned, should be less susceptible to distortion by sensitive respondents. Data concerning the tendency of scales to correlate in meaningful ways are about the only evidence concerning the validity of such scales.
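In schematic terms, the kind of clustering evidence at issue can be pictured with a small simulation. The sketch below uses entirely hypothetical data (Python and numpy serve merely for illustration): ratings that share a latent trait correlate more strongly with one another than with ratings of a dissimilar trait, which is the pattern reported in the actual studies discussed next.

    # Invented data: ratings on semantically similar scales share a
    # latent factor, so they correlate more with each other than with
    # dissimilar scales - the 'clustering' described in the text.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    positive = rng.normal(size=n)  # latent 'positive trait' factor
    negative = rng.normal(size=n)  # latent 'negative trait' factor

    def rating(factor):
        # Each scale rating = latent factor + scale-specific noise.
        return factor + rng.normal(scale=0.8, size=n)

    scales = np.stack([
        rating(positive),  # 'kind'
        rating(positive),  # 'friendly'
        rating(positive),  # 'dependable'
        rating(negative),  # 'stubborn'
        rating(negative),  # 'nervous'
        rating(negative),  # 'shy'
    ])

    # The correlation matrix shows two blocks of similar scales.
    print(np.corrcoef(scales).round(2))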
For instance, in the reported data negatively valued scales such as 'stubborn', 'nervous', and 'shy' tend to cluster together (by correlation and factor analysis techniques), indicating at least that subjects are differentiating the semantic values of scales in meaningful ways. Similarly, scales concerning highly valued positive traits such as 'kind', 'friendly', 'dependable', and the like also tend to be more highly correlated with each other than with very dissimilar traits. There is also evidence that views of persons of different national, ethnic, or linguistic backgrounds differ substantially in ways that are characteristic of known attitudes of certain groups. For instance, Oller, Baca, and Vigil (1977) report data showing that a group of Mexican American women in a Job Corps program in Albuquerque generally rate Mexicans substantially higher than they rate Americanos (Anglo-Americans) on the same traits. It is conceivable that such scales could be used to judge the strength of self-concept, attitude toward other groups, and similar constructs. However, much more research is needed before such measures are put forth as measures of particular constructs. Furthermore, they are subject to all of the usual objections concerning self-reported data.

Little research has been done with the measurement of attitudes in children (at least this is true in relation to the questions and research interests discussed in section B above). Recently, however, Zirkel and Jackson (1974) have offered scales intended for use with children of Anglo, Black, Native American, and Chicano heritages. These scales are of the Likert type (agree versus disagree on a five point scale with one position for 'don't know'). The innovation in their technique involves the use of line drawings of happy versus sad faces as shown in Figure 4. Strickland (1970) may have been the first to use such a method with children. It is apparently a device for obtaining scaled responses to attitude objects (such as food, dress, games, well known personalities who are models of a particular cultural group, and symbols believed important in the definition of a culture). The scales are used for non-readers and preliterate children.

[Figure 4. Example of a Likert-type attitude scale intended for children. (From Zirkel (1973), Black American Cultural Attitude Scale. The scale has five points with a possible 'don't know' answer as represented at the extreme left - scales are referenced against objects presumed important to defining cultural attitudes and of a sort that can be pictured easily.)]

The Cultural Attitude Scales exist in four forms (one for each of the above designated ethnic groups). The Technical Report indicates test-retest reliabilities ranging from .52 to .61, and validity coefficients ranging from .15 to .46. These are not impressive if one considers that only about 4% to 25% of the variance in the scales is apt to be related to the attitudes they purport to measure. No figures on reliability or validity are given in the Test Manual. The authors caution, 'the use of the Cultural Attitude Scales to diagnose the acculturation of individual children in the classroom is at this time very precarious' (p. 27). It seems to be implied that 'acculturation' is a widely accepted goal of educational programs, and this is questionable. Further, it seems to be suggested that the scales might someday be used to determine the level of acculturation of individual children - this implication seems unwarranted. There is not any reason to expect reliable measures of such matters ever to be forthcoming. Nonetheless, the caution is commendable.

The original studies with the Social Distance Scale (Bogardus, 1925, 1933), from which the Cultural Attitude Scales very indirectly derive, suggest that stereotypes of outgroups are among the most stable attitudes and that the original scale was sufficiently reliable and valid to use with some confidence (see Shaw and Wright, 1967). With increasing awareness of cross-cultural sensitivities, it may be that measures of social distance would have to be recalibrated for today's societal norms (if indeed such norms exist and can be defined), but the original studies and several follow ups have indicated reliabilities in the range of .90 and 'satisfactory' validity according to Newcomb (1950, as cited by Shaw and Wright, 1967, p. 408). The original scales required the respondent to indicate whether he would marry into, have as close friends, as fellow workers, as speaking acquaintances, or as visitors to his country, or would debar from visiting his country, members of a specific minority or designated outgroup. The Bogardus definition of 'social distance', however, is considerably narrower than that proposed more recently by Schumann (1976). The latter is, at this point, a theoretical construct in the making and is therefore not reviewed here.

D. Observed relationships to achievement and remaining puzzles

The question addressed in this section is, how are affective variables related to educational attainment in general, and in particular to the acquisition of a second language? Put differently, what is the nature and the strength of observed relationships?

Gardner (1975) and Gardner, Smythe, Clement, and Gliksman (1976) have argued that attitudes are somewhat indirectly related to attainment of proficiency in a second language. Attitudes, they reason, are merely one of the types of factors that give rise to motivations, which are merely one of the types of factors which eventually result in attainment of proficiency in a second language. By this line of reasoning, attitudes are considered to be causally related to achievement of proficiency in a second language even though the relationship is not apt to be a very strong one. In a review of some 33 surveys of 'six different grade levels (grades 7 to 12) from seven different regions across Canada' (p. 32) involving no less than about 2,000 subjects, the highest average correlation between no less than 12 different attitude scales and two measures of French achievement in no case exceeded .29 (Gardner, 1975). Thus, the largest amount of variance in language proficiency scores that was predicted on the average by the attitude measures was never greater than 8½%. This result is not inconsistent with the claim that the relationship between attitude measures and attainment in a second language is quite indirect. However, such a result also leaves open a number of alternative explanations. It is possible that the weakness of the observed relationships is due to the unreliability or lack of validity of the attitude measures. If this explanation were correct, there might be a much stronger relationship between attitudes and attained proficiency than has or ever would become apparent using those attitude measures. Another possibility is that the measures of language proficiency used are themselves low in reliability or validity. Yet another possibility is that attitudes do not cause attainment of proficiency but rather are caused by the degree of proficiency attained - though weakly perhaps. And many other possibilities exist.
Backman (1976), in a refreshingly different approach to the assessment of attitudes, offers what she calls the 'chicken or egg' puzzle. Do attitudes in fact cause behaviors in some way, or are attitudes rather the result of behaviors? Savignon (1972) showed that in the foreign language classroom positive attitudes may well be a result of success rather than a cause. It seems quite plausible that success in learning a second language might itself give rise to positive feelings toward the learning situation and everything (or everyone) associated with it. Similarly, failure might engender less positive feelings. Yet another plausible alternative is that attitudes and behaviors may be complexly interrelated such that both of them influence each other. Bloom (1976) prefers this latter alternative. Another possibility is that attitudes are associated with the planning of actions and the perception of events in some way that influences the availability of sensory data and thus the options that are perceivable or conceivable to the doer or learner. The research of Manis and Dawes (1961), showing that cloze scores were higher for subjects who agreed with the content of a passage than for those who disagreed, would seem to lend credence to this last suggestion. Manis and Dawes concluded that it wasn't just that subjects didn't want to give right answers to items on passages with which they disagreed, but that they were in fact less able to give right answers. Such an interpretation would also fit the data from a wide variety of studies revealing expectancy biases of many sorts. However, Doerr (in press) raises some experimental questions about the Manis and Dawes design.

Regardless of the solution to the (possibly unsolvable) puzzle about what causes what, it is still possible to investigate the strength of the relationship between attitudes as expressed in response to questionnaires and scores on proficiency measures of second language ability for learners in different contexts. It has been observed that the relationship is apparently stronger in contexts where learners can avail themselves of opportunities to talk with representatives of the target language group(s) than it is in contexts where the learners do not have an abundance of such opportunities. For instance, a group of Chinese adults in the Southwestern United States (mostly graduate students on temporary visas) performed somewhat better on a cloze test in English if they rated Americans high on a factor defined by such positive traits as helpfulness, sincerity, kindness, reasonableness, and friendliness (Oller, Hudson, and Liu, 1977). Similarly, a group of Mexican-American women enrolled in a Job Corps program in Albuquerque, New Mexico scored somewhat higher on a cloze test if they rated themselves higher on a factor defined by traits such as calmness, conservativeness, religiosity, shyness, humility, and sincerity (Oller, Baca, and Vigil, 1977). The respective correlations between the proficiency measures and the attitude factors were .52 and .49. In the cases of two populations of Japanese subjects learning English as a foreign language in Japan, weak or insignificant relationships between similar attitude measures and similar proficiency measures were observed (Asakawa and Oller, 1978, and Chihara and Oller, 1978). In the first mentioned studies, where learners were in a societal context rich in occasions where English might be used, attitudinal variables seemed somewhat more closely related to attained proficiency than in the latter studies, where learners were presumably exposed to fewer opportunities to communicate in English with representatives of the target language culture(s).

These observed contrasts between second language contexts (such as the ones the Chinese and Mexican-American subjects were exposed to) and foreign language contexts (such as the ones the Japanese subjects were exposed to) are by no means empirically secure. They seem to support the hunch that attitudes may have a greater importance to learning in some contexts than in others, and the direction of the contrasts is consistently in favor of the second language contexts. However, the pattern of sociocultural variables in the situations referred to is sufficiently diverse to give rise to doubts about their comparability. Further, the learners in the second language contexts generally achieved higher scores in English.

Probably the stickiest and most persistent difficulty in obtaining reliable data on attitudes is the necessity to rely on self-reports, or worse yet, someone else's evaluative and second-hand judgement. The problem is not just one of honesty. There is a serious question whether it is reasonable to expect someone to give an unbiased report of how they feel or think or behave when they are smart enough to know that such information may in some way be used against them, but this is not the only problem. There is the question of how reliable and valid a person's judgements are even when there is no aversive stimulus or threat to his security. How well do people know how they would behave in such and such a hypothetical situation? Or, how does someone know how they feel about a statement that may not have any relevance to their present experience? Are average scores on such tasks truly indicative of group tendencies in terms of attitudes and their supposed correlative behaviors, or are they merely indicative of group tendencies in responding to what may be relatively meaningless tasks?

The foregoing questions may be unanswerable for precisely the same reasons that they are interesting questions. However, there are other questions that can be posed concerning subjective self-ratings that are more tractable. For instance, how reliable are the self-ratings of subjects of their own language skills in a second language, say? Can subjects reliably judge their own ability in reading, writing, or speaking and listening tasks? Frequently in studies of attitude, the measures of attitude are correlated only with subjects' own reports of how well they speak a certain language in a certain context, with no objective measures of language skill whatsoever. Or, alternatively, subjects are asked to indicate when they speak language X and to whom and in what contexts. How reliable are such judgements? In the cases where they can be compared to actual tests of language proficiency, the results are not too encouraging.
In a study with four different proficiency measures (grammar, vocabulary, listening comprehension, and cloze) and four self-rating scales (listening, speaking, reading, and writing), in no case did a correlation between a self-rating scale and a proficiency test reach .60 (there were 123 subjects) - this indicates less than 36% overlap in variance on any of the self-ratings with any of the proficiency tests (Chihara and Oller, 1978). In another study (Oller, Baca, and Vigil, 1977), correlations between a single self-rating scale and a cloze test in English scored by exact and acceptable methods (see Chapter 12) were .33 and .37 respectively (subjects numbered 60).

Techniques which require others to make judgements about the attitudes of a person or group seem to be even less reliable than self-ratings. For instance, Jensen (1965) says of the most widely used 'projective' technique for making judgements about personality variables, 'put frankly, the consensus of qualified judgement is that the Rorschach is a very poor test and has no practical worth for any of the purposes for which it is recommended by its devotees' (p. 293). Careful research has shown that the Rorschach (administered to about a million persons a year in the 1960s in the U.S. alone, according to Jensen, at a total cost of 'approximately 25 million dollars', p. 292) and other projective techniques like it (such as the Thematic Apperception Test, mentioned at the outset of this chapter) generate about as much variability across trained judges as they do across subjects. In other words, when trained psychologists or psychiatrists use projective interview techniques such as the TAT or Rorschach to make judgements about the personality of patients or clients, the judges differ in their judgements about as much as the patients differ in their judged traits. On the basis of such tests it would be impossible to tell the difference (even if there was one) between the level of, say, anxiety in Mr Jones and Mr Smith. In the different ratings of Mr Jones, he would appear about as different from himself as he would from Mr Smith.²

In conclusion, the measurement of attitudes does not seem to be a promising field - though it offers many challenges. Chastain urges in the conclusion to his 1975 article that 'each teacher should do what he or she can to encourage the timid, support the anxious, and loose the creative' (p. 160). One might add that the teacher will probably be far more capable of determining who is anxious, timid, creative - and we may add empathetic, aggressive, outgoing, introverted, eager, enthusiastic, shy, stubborn, inventive, egocentric, fascistic, ethnocentric, kind, tender, loving, and so on - without the help of the existing measures that purport to discriminate between such types of personalities. Teachers will be better off relying on their own intuitions based on a compassionate and kind-hearted interest in their students.

² In spite of the now well-known weaknesses of the Rorschach (Jensen, 1965) and the TAT (Anastasi, 1976), Ervin (1955) used the TAT to draw conclusions about contrasts between bilingual subjects' performances on the test in each of their two languages. The variables on which she claimed they differed were such things as 'physical aggression', 'escaping blame', 'withdrawal', and 'assertions of independence' (p. 391).
She says, 'it was concluded that there are systematic differences in the content of speech of bilinguals according to the language being spoken, and that the differences are probably related to differences in social roles and standards of conduct associated with the respective language communities' (p. 391). In view of recent studies on the validity of the TAT (for instance, see the remarks by Anastasi, 1976, pp. 565-587), it is doubtful that Ervin's results could be replicated. Perhaps someone should attempt a similar study to see if the same pattern of results will emerge. However, as long as the validity of the TAT and other projective techniques like it is in serious doubt, the results obtained are necessarily insecure. Moreover, it is not only the validity of such techniques as measures of something in particular that is in question - their reliability as measures of anything at all is in question.

KEY POINTS
1. It is widely believed that attitudes are factors involved in the causation of behavior and that they are therefore important to success or failure in schools.
2. Attitudes toward self and others are probably among the most important.
3. Attitudes cannot be directly observed, but must be inferred from behavior.
4. Usually, attitude tests (or attempts to measure attitudes) consist of asking a subject to say how he feels about or would respond to some hypothetical situation (or possibly merely a statement that is believed to characterize the attitude in question in some way).
5. Although attitude and personality research (according to Buros, 1970) received more attention from 1940-1970 than any other area of testing, attitude and personality measures are generally the least valid sort of tests.
6. One of the essential ingredients of successful research that seems to have been lacking in many of the attitude studies is the readiness to entertain multiple hypotheses instead of a particular favored viewpoint.
7. Appeal to the label on a particular 'attitude measure' is not satisfactory evidence of validity.
8. There are many ways of assessing the validity of a proposed measure of a particular hypothetical construct such as an attitude or motivational orientation. Among them are the development of multiple methods of assessing the same construct and checking the pattern of correlation between them; checking the attitudes of groups known to behave differently toward the attitude object (e.g., the institutionalized church, or possibly the schools, or charitable organizations); and repeated combinations of the foregoing.
9. For some attitudes or personality traits there may be no adequate behavioral criterion. For instance, if a person says he is angry, what behavioral criterion will unfailingly prove that he is not in fact scared instead of angry?
10. Affective variables that relate to the ways people interact through language compass a very wide range of conceivable affective variables.
11. It has been widely claimed that affective variables play an important part in the learning of second languages. The research evidence, however, is often contradictory.
12. Guiora, et al tried to predict the ease with which a new system of pronunciation would be acquired on the basis of a purported measure of empathy. Their arguments, however, probably have more intuitive appeal than the research justifies.
13. In two cases, groups scoring lower on Guiora's test of empathy did better in acquiring a new system of pronunciation than did groups scoring higher on the empathy test. This directly contradicts Guiora's hypothesis.
14. Brown (1973) and others have reported that self-concept may be an important factor in predicting success in learning a foreign or second language.
15. Gardner and Lambert have argued that integratively motivated learners should perform better in learning a second language than instrumentally motivated learners. The argument has much appeal, but the data confirm every possible outcome - sometimes integratively motivated learners do excel, sometimes they do not, and sometimes they lag behind instrumentally motivated learners. Could the measures of motivation be invalid? There are other possibilities.
16. The fact that attitude scales require subjects to be honest about potentially damaging information makes those scales suspect of a built-in distortion factor (even if the subjects try to be honest).
17. A serious moral problem arises in the valuation of the scales. Who is to judge what constitutes a prejudicial view? An ethnocentric view? A bias? Voting will not necessarily help. There is still the problem of determining who will be included in the vote, or who is a member of which ethnic, racial, national, religious, or non-religious group. It is known that the meaning of responses to a particular scale is essentially uninterpretable unless the value of the scale can be determined in advance.
18. Attitude measures are all necessarily indirect measures (if indeed they can be construed as measures at all). They may, however, be straightforward questions such as 'Are you prejudiced?' or they may be cleverly designed scales that conceal their true purpose - for example, the F, or Fascism, Scale by Adorno, et al.
19. A possible interpretation of the pattern of correlations for items on the F Scale and the E Scale (Adorno, et al, and Gardner and Lambert) is response set. Since all of the statements are keyed in the same direction, a tendency to respond consistently though independently of item content would produce some correlation among the items and overlap between the two scales. However, the correlations might have nothing to do with fascism or ethnocentrism, which is what the two scales purport to measure.
20. It has been suggested that the emotive tone of statements included in attitude questionnaires may be as important to the responses of subjects as is the factive content of those same statements.
21. As long as the validity of purported attitude measures is in question, their pattern of interrelationship with any other criteria (say, proficiency attained in a second language) remains essentially uninterpretable.
22. Concerning the sense of lostness presumably measured by the Anomie Scale, all possible predictions have been made in relation to attitudes toward minorities and outgroups. The scale is moderately correlated with the E and F Scales in the Gardner and Lambert data, which may merely indicate that the Anomie Scale too is subject to a response set factor.
23. Experts rarely recommend the use of personality inventories as a basis for decisions that will affect individuals.
24. A study of anxiety by Chastain (1975) revealed conflicting results: for one group higher anxiety was correlated with better instead of worse grades.
25. Patterns of correlation between scales of the semantic differential type as applied to attitude measurement indicate some possible validity. Clusters of variables that are distilled by factor analytic techniques show semantically similar scales to be more closely related to each other than to semantically dissimilar scales.
26. Attitude scales for children of different ethnic backgrounds have recently been developed by Zirkel and Jackson (1974). No validity or reliability information is given in the Test Manual. Researchers and others are cautioned to use the scales only for group assessment - not for decisions affecting individual children.
27. There may be no way to determine the extent to which attitudes cause behavior or are caused by it, or both. The nature of the relationship may differ according to context.
28. There is some evidence that attitudes may be more closely associated with second language attainments in contexts that are richer in opportunities for communication in the target language than in contexts that afford fewer opportunities for such interaction.
29. In view of all of the research, teachers are probably better off relying on their own compassionate judgements than on even the most highly researched attitude measures.

DISCUSSION QUESTIONS
1. Reflect back over your educational experience. What factors would you identify as being most important in your own successes and failures in school settings? Consider the teachers you have studied under. How many of them really influenced you for better or for worse? What specific events can you point to that were particularly important to the inspirations, discouragements, and day-to-day experiences that have characterized your own education? In short, how important have attitudes been in your educational experience and what were the causes of those attitudes in your judgement?
2. Read a chapter or two from B. F. Skinner's Verbal Behavior (1957) or Beyond Freedom and Dignity (1971), or better yet, read all of his writings on one or the other topic. Discuss his argument about the dispensability of intervening variables such as ideas, meaning, attitudes, feelings, and the like. How does his argument apply to what is said in Chapter 5 of this book?
3. What evidences would you accept as indicating a belligerent attitude? A cheerful outgoing personality? Can such evidences be translated into more objective or operational testing procedures?
4. Discuss John Platt's claims about the need for disproof in science. Can you make a list of potentially falsifiable hypotheses concerning the nature of the relationship between attitudes and learning? What would you take as evidence that a particular view had indeed been disproved? Why do you suppose that disproofs are so often disregarded in the literature on attitudes? Can you offer an explanation for this? Are intuitions concerning the nature and effects of attitudes apt to be less reliable than tests which purport to measure attitudes?
5. Pick a test that you know of that is used in your school and look it up in the compendia by Buros (see the bibliography at the end of this book). What do the reviewers say concerning the test? Is it reliable? Does it have a substantial degree of validity? How is it used in your school? How does the use of the test affect decisions concerning children in the school?
6. Check the school files to see what sorts of data are recorded there on the results of personality inventories (The Rorschach? The Thematic Apperception Test?). Do a correlation study to determine what is the degree of relationship between scores on available personality scales and other educational measures.
7. Discuss the questions posed by Cooper and Fishman (p. 118f, this chapter). What would you say they reveal about the state of knowledge concerning the nature and effects of language attitudes and their measurement?
8. Consider the moral problem associated with the valuation of attitude scales. Brodkey and Shore (1976) say that 'A student seems to exhibit an enjoyment of writing for its own sake, enjoyment of solitary work, a rejection of Puritanical constraints, a good deal of self-confidence, and sensitivity in personal relationships' (p. 158). How would you value the statements on the Q-sort (p. 130 above) with respect to each of the attitudinal or personality constructs offered by Brodkey and Shore? What would you recommend the English Tutorial Program do with respect to subjects who perform 'poorly' by someone's definition on the Q-sort? The Native American? The subjects over 30? Orientals?
9. Discuss with a group the valuation of the scale on the statement: 'Human nature being what it is, there will always be war and conflict.' Do you believe it is moral and just to say that a person who agrees strongly with this statement is to that extent fascistic or authoritarian or prejudiced? (See the F Scale, p. 122f.)
10. Consider the meaning of disagreement with the statement: 'In this country, it's whom you know, not what you know, that makes for success.' What are some of the bases of disagreement? Suppose subjects think success is not possible? Will they be apt to agree or disagree with the statement? Is their agreement or disagreement characteristic of Anomie in your view? Suppose a person feels that other factors besides knowing the right people are more important to success. What response would you expect him to give to the above statement on the Anomie Scale? Suppose a person felt that success was inevitable for certain types of people (such a person might be regarded as a fatalist or an unrealistic optimist). What response would you predict? What would it mean?
11. Pick any statement from the list given from the Q-sort on p. 130. Consider whether in your view it is indicative of a person who is likely to be a good student or not a good student. Better yet, take all of the statements given and rank them from most characteristic of good students to least characteristic. Ask a group of teachers to do the same. Compare the rank orderings for degree of agreement.
12. Repeat the procedure of question 11 with the extremes of the scale defined in terms of Puritanism (or, say, positive self-concept) versus non-Puritanism (or, say, negative self-concept) and again compare the rank orders achieved.
13. Consider the response you might give to a statement like: 'While taking an important examination, I perspire a great deal.' What factors might influence your degree of agreement or disagreement independently of the degree of anxiety you may or may not feel when taking an important examination? How about room temperature? Body chemistry (some people normally perspire a great deal)? Your feelings about such a subject? Your feelings about someone who is brazen enough to ask such a socially obtuse question? The humor of mopping your brow as you mark the spot labeled 'never' on the questionnaire?
14. What factors do you feel enter into the definition of empathy? Which of those are potentially measurable by the MME? (See the discussion in the text on pp. 133-4.)
15. How well do you think semantic differential scales, or Likert-type (agree-disagree) scales in general, can be understood by subjects who might be tested with such techniques? Consider the happy-sad faces and the neutral response of 'I don't know' indicated in Figure 4. Will children who are non-literate (pre-readers) be able to perform the task meaningfully in your judgement? Try it out on some relatively non-threatening subject such as recess (play-time, or whatever it is called at your school) compared against some attitude object that you know the children are generally less keen about (e.g., arithmetic? reading? sitting quietly?). The question is, can the children do the task in a way that reflects their known preferences (if indeed you do know their preferences)?
16. How reliable and consistent are self-reports about skills, feelings, attitudes, and the like in general? Consider yourself or persons whom you know well, e.g., a spouse, sibling, room-mate, or close friend. Do you usually agree with a person's own assessment of how he feels about such and such? Are there points of disagreement? Do you ever feel you have been right in assessing the feelings of others who claimed to feel differently than you believed they were actually feeling? Has someone else ever enlightened you on how you were 'really' feeling? Was he correct? Ever? Do you ever say everything is fine when in fact it is lousy? What kinds of social factors might influence such statements? Is it a matter of honesty or kindness or both or neither in your view?
17. Suppose you have the opportunity to influence a school board or other decision-making body concerning the use or non-use of personality tests in schools. What kinds of decisions would you recommend?

SUGGESTED READINGS
1. Anne Anastasi, 'Personality Tests,' Part 5 of the book by the same author entitled Psychological Testing (4th ed). New York: Macmillan, 1976, 493-621. (Much of the information contained in this very thorough book is accessible to the person not trained in statistics and research design. Some of it is technical, however.)
2. H. Douglas Brown, 'Affective Variables in Second Language Acquisition,' Language Learning 23, 1973, 231-44. (A thoughtful discussion of affective variables that need to be more thoroughly studied.)
3. Robert L. Cooper and Joshua A. Fishman, 'Some Issues in the Theory and Measurement of Language Attitudes,' in L. Palmer and B. Spolsky (eds.) Papers on Language Testing: 1967-1974. Washington, D.C.: Teachers of English to Speakers of Other Languages, 187-98.
4. John Lett, 'Assessing Attitudinal Outcomes,' in June K. Phillips (ed.) The Language Connection: From the Classroom to the World. ACTFL Foreign Language Education Series. New York: National Textbook, in press.
5. Paul Pimsleur, 'Student Factors in Foreign Language Learning: A Review of the Literature,' Modern Language Journal 46, 1962, 150-9.
6. Sandra J. Savignon, Communicative Competence. Montreal: Marcel Didier, 1972.
7. John Schumann, 'Affective Factors and the Problem of Age in Second Language Acquisition,' Language Learning 25, 1975, 209-35. (Follows up on Brown's discussion and reviews much of the recent second language literature on the topic of affective variables - especially the work of Guiora, et al.)
PART TWO
Theories and Methods of Discrete Point Testing

7 Syntactic Linguistics as a Source for Discrete Point Methods

A. From theory to practice, exclusively?
B. Meaning-less structural analysis
C. Pattern drills without meaning
D. From discrete point teaching to discrete point testing
E. Contrastive linguistics
F. Discrete elements of discrete aspects of discrete components of discrete skills - a problem of numbers

This chapter explores the effects of the linguistic theory that contended language was primarily syntax-based - that meaning could be dispensed with. That theoretical view led to methods of language teaching and testing that broke language down into ever so many little pieces. The pieces and their patterns were supposed to be taught in language classrooms and tested in the discrete items of discrete sections of language tests. The question is whether language can be treated in this way without destroying its essence. Humpty Dumpty illustrated that some things, once they are broken apart, are exceedingly difficult to put together again. Here we examine the theoretical basis of taking language apart to teach and test it piece by piece. In Chapter 8, we will return to the question of just how feasible this procedure is.

A. From theory to practice, exclusively?

Prevailing theories about the nature of language influence theories about language learning, which in their turn influence ways of teaching and testing language. As Upshur observed (1969a), the direction of the influence is usually from linguistic theory to learning theory to teaching methods and eventually to testing. As a result of the direction of influence, there have been important time lags - changes in theory at one end take a long time to be realized in changes at the other end. Moreover, just as changes in blueprints are easier than changes in buildings, changes in theories have often been made without any appreciable change in tests. Although the chain of influence is sometimes a long and indirect one, with many intervening variables, it is possible to see the unmistakable marks of certain techniques of linguistic analysis not only on the pattern drill techniques of teaching that derive from those methods of analysis, but also on a wide range of discrete point testing techniques.

The unidirectional influence from theory to practice is not healthy. As John Dewey put it many years ago:

That individuals in every branch of human endeavor be experimentalists engaged in testing the findings of theorists is the sole guarantee for the sanity of the theorist (1916, p. 442).

Language theorists are not immune to the bite of the foregoing maxim. In fact, because it is so easy to speculate about the nature of language, and because it has been such an immensely popular pastime with philosophers, psychologists, logicians, linguists, educators and others, theories of language - perhaps more than other theories - need to be constantly challenged and put to the test in every conceivable laboratory. Surely the findings of classroom teachers (especially language teachers, or teachers who have learned the importance of language to all aspects of an educational curriculum) are as important to the theories of language as the theories themselves are to what happens in the classroom. Unfortunately, the direction of influence has been much too one-sided.
Too often the teacher is merely handed untried and untested materials that some theory says ought to work - too often the materials don't work at all, and teachers are left to invent their own curricula while at the same time they are expected to perform the absorbing task of delivering it. It's something like trying to write the script, perform it, direct the production, and operate the theater all at the same time. The incredible thing is that some teachers manage surprisingly well.

If the direction of influence between theory and practice were mutual, the interaction would be fatal to many of the existing theories. This would wound the pride of many a theorist, but it would generally be a healthy and happy state of affairs. As we have seen, empirical advances are made by disproofs (Platt, 1964, citing Bacon). They are not made by supporting a favored position - perhaps by refining a favored position some progress is made, but what progress is there in empirical research that merely supports a favored view while pretending that there are no plausible competing alternatives? The latter is not empirical research at all. It is a form of idol worship where the theory is enshrined and the pretence of research is merely a form - a ritual. Platt argues that a theory which cannot be 'mortally endangered' is not alive. We may add that empirical research that does not mortally endanger the hypotheses (or theories) to which it is addressed is not empirical research at all.

How have linguistic theories influenced theories about language learning and subsequently (or simultaneously in some cases) methods of language teaching and testing? Put differently, what are some of the salient characteristics of theoretical views that have influenced practices in language teaching and testing? How have discrete point testing methods, in particular, found justification in language teaching methods and linguistic theories? What are the most important differences between pragmatic testing methods and discrete point testing methods?

The crux of the issue has to do with meaning. People use language to inform others, to get information from others, to express their feelings and emotions, to analyze and characterize their own thoughts in words, to explain, cajole, reply, explore, incite, disturb, encourage, plan, describe, promise, play, and much much more. The crucial question, therefore, for any theory that claims to be a theory of natural language (and, as we have argued in Part One, for any test that purports to assess a person's ability to use language) is how it addresses this characteristic feature of language - the fact that language is used in meaningful ways. Put somewhat differently, language is used in ways that put utterances in pragmatic correspondences with extra-linguistic contexts. Learning a language involves discovering how to create utterances in accord with such pragmatic correspondences.

B. Meaning-less structural analysis

We will briefly consider how the structural linguistics of the Bloomfieldian era dealt with the question of meaning, and then we will consider how language teaching and eventually language testing methodologies were subsequently affected. Bloomfield (1933, p. 139) defined the meaning of a linguistic form as

the situation in which the speaker utters it, and the response it calls forth in the hearer.
The behavioristic motivation for such strict attention to observables will be obvious to anyone familiar with the basic tenets of behaviorism (see Skinner, 1953, and 1957). There are two major problems, however, with such a definition of meaning. For one, it ignores the inferential processes that are always involved in the association of meanings with linguistic utterances, and for another, it fails to take account of the importance of situations and contexts that are part of the history of experience that influences the inferential connection of utterances to meanings.

The greatest difficulties of the Bloomfieldian structuralism, however, arise not directly from the definition of meaning that Bloomfield proposed, but from the fact that he proposed to disregard meaning in his linguistic theory altogether. He argued that 'in order to give a scientifically accurate definition of meaning we should have to have a scientifically accurate knowledge of everything in the speaker's world' (p. 139). Therefore, he contended that meaning should not constitute any part of a scientific linguistic analysis. The implication of his definition was that since the situations which prompt speech are so numerous, the number of meanings of the linguistic units which occur in them must consequently be so large as to render their description infeasible. Hence, Bloomfieldian linguistics tried to set up inventories of phonemes, morphemes, and certain syntactic patterns without reference to the ways in which those units were used in normal communication.

What Bloomfield appeared to overlook was the fact that the communicative use of language is systematic. If it were not, people could not communicate as they do. While it may be impossible to describe each of an infinitude of situations, just as it is impossible to count up to even the lowest order of infinities, it is not impossible in principle to characterize a generative system that will succeed where simple enumeration fails. The problem of language description (or better, the characterization of language) is not a problem of merely enumerating the elements of a particular analytical paradigm (e.g., the phonemes or distinctive sounds of a given language). The problem of characterizing language is precisely the one that Bloomfield ruled out of bounds - namely, how people learn and use language meaningfully.

What effect would such thinking have on language teaching and eventually on language testing? A natural prediction would be that it ought to lead to a devastating disregard for meaning in the pragmatic sense of the word. Indeed it did. But before we examine critically some of the debilitating effects on language teaching, it is necessary to recognize that Bloomfield's deliberate excision of meaning from linguistic analyses was neither a short-lived nor a narrowly parochial suggestion - it was widely accepted and persisted well into the 1970s as a definitive characteristic of American linguistics. Though Bloomfield's limiting assumption was certainly not accepted by all American linguists and was severely criticized or ignored in certain European traditions of considerable significance,¹ his particular variety of structural linguistics was the one that unfortunately was to pervade the theories and methods of language teaching in the United States for the next forty or so years (with few exceptions, in fact, until the present).

¹ For instance, Edward Sapir (1921) was one of the Americans who was not willing to accept Bloomfield's limiting assumption about meaning. The Prague School of linguistics in Czechoslovakia was a notable European stronghold which was little enamored with Bloomfield's formalism (see Vachek, 1966 on the Prague group).
The commitment to a meaning-less linguistic analysis was strengthened by Zellig Harris (1951), whose own thinking was apparently very influential in certain similar assumptions of Noam Chomsky, a student of Harris. Harris believed that it would be possible to do linguistic analyses on the basis of purely formal criteria having to do with nothing except the observable relationships between linguistic elements and other linguistic elements. He said:

The whole schedule of procedures ... which is designed to begin with the raw data and end with a statement of grammatical structure, is essentially a twice-made application of two major steps: the setting up of elements and the statement of the distribution of these elements relative to each other ... The elements are determined relatively to each other, and on the basis of the distributional relations among them (Harris, 1951, p. 61).

There is a problem, however, with Harris's method. How will the first element be identified? It is not possible to identify an unidentified element on the basis of its yet to be discovered relationships to other yet to be discovered elements. Neither is it conceivable, as Harris recommends (1951, p. 7), that 'this operation can be carried out ... only if it is carried out for all elements simultaneously'. If it cannot work for one unidentified element, how can it possibly work for all of them? The fact is that it cannot work at all. Nor has anyone ever successfully applied the methods Harris recommended. It is not a mere procedural difficulty that Harris's proposals run aground on; it is the procedure itself that creates the difficulty. It is intrinsically unworkable and viciously circular (Oller, Sales, and Harrington, 1969). Further, how can unidentified elements be defined in terms of themselves or in terms of their relationships to other similarly undefined elements? We will see below that Harris's recommendations for a procedure to be used in the discovery of the grammatical elements of a language, however indirectly and through whatever chain of inferential steps, have been nearly perfectly translated into procedures for teaching languages in classroom situations - procedures that work about as well as Harris's methods worked in linguistics.

Unfortunately, Bloomfield's limiting assumption about meaning did not end its influence in the writings of Zellig Harris, but persisted right on through the Chomskyan revolution and into the early 1970s. In fact, it persists even today in teaching methods and standardized instruments for assessing language skills of a wide variety of sorts, as we will see below. Chomsky (1957) found what he believed were compelling reasons for treating grammar as an entity apart from meaning. He said:

I think that we are forced to conclude that grammar is autonomous and independent of meaning (p. 17).

and again at the conclusion to his book Syntactic Structures:

Grammar is best formulated as a self-contained study independent of semantics (p. 106).

He was interested in

attempting to describe the structure of language with no explicit reference to the way in which this instrument is put to use (p. 103).

Although Chomsky stated his hope that the syntactic theory he was elaborating might eventually have 'significant interconnections with a parallel semantic theory' (p. 103), his early theorizing made no provision for the fact that words and sentences are used for meaningful purposes - indeed, that fact was considered, only to be summarily disregarded.
Furthermore, he later contended that the communicative function of language was subsidiary and derivative, that language as a syntactically governed system had its real essence in some kind of 'inner totality' (1964, p. 58), and that native speakers of a language are capable of producing 'new sentences ... that are immediately understood by other speakers although they bear no physical resemblance to sentences which are "familiar"' (Chomsky, 1966, p. 4).

The hoped-for 'semantic theory' which Chomsky alluded to in several places seemed to have emerged in 1963 when Katz and Fodor published 'The Structure of a Semantic Theory'. However, they too contended that a speaker was capable of producing and understanding indefinitely many sentences that were 'wholly novel to him' (their italics, p. 481). This idea, inspired by Chomsky's thinking, is an exaggeration of the creativity of language - or an understatement, depending on how the coin is turned. If everything about a particular utterance is completely new, it is not an utterance in any natural language, for one of the most characteristic facts about utterances in natural languages is that they conform to certain systematic principles. By this interpretation, Katz and Fodor have overstated the case for creativity. On the other hand, for everything about a particular utterance to be completely novel, that utterance would have to conform to none of the pragmatic constraints or lower order phonological rules, syntactic patterns, semantic values and the like. By this rendering, Katz and Fodor have underestimated the ability of the language user to be creative within the limits set by his language. The fact is that precisely because utterances are used in communicative contexts in particular correspondences to those contexts, practically everything about them is familiar - their newness consists in the fact that they constitute new combinations of known lexical elements and known sequences of grammatical categories. It is in this sense that Katz and Fodor's remark can be read as an understatement. The meaningful use of utterances in discourse is always new and is constantly a source of information and meaning that would otherwise remain undisclosed.

The continuation of Bloomfield's limiting assumption about meaning was only made quite clear in a footnote to An Integrated Theory of Linguistic Descriptions by Katz and Postal (1964). In spite of the fact that they claimed to be integrating Chomsky's syntactic theory with a semantic one, they mentioned in a footnote that 'we exclude aspects of sentence use and comprehension that are not explicable through the postulation of a generative mechanism as the reconstruction of the speaker's ability to produce and understand sentences. In other words, we exclude conceptual features such as the physical and sociological setting of utterances, attitudes, and beliefs of the speaker and hearer, perceptual and memory limitations, noise level of the settings, etc.' (p. 4).
It would seem that in fact they excluded just about everything of interest to an adequate theory of language use and learning, and to methods of language teaching and testing.

It is interesting to note that by 1965, Chomsky had wavered from his originally strong stand on the separation of grammar and meaning to the position that 'the syntactic and semantic structure of natural languages evidently offers many mysteries, both of fact and of principle, and any attempt to delimit these domains must certainly be quite tentative' (p. 163). In 1972, he weakened his position still further (or made it stronger from a pragmatic perspective) by saying, 'it is not clear at all that it is possible to distinguish sharply between the contribution of grammar to the determination of meaning, and the contribution of so-called "pragmatic considerations", questions of fact and belief and context of utterance' (p. 111). From the argument that grammar and meaning were clearly autonomous and independent, Chomsky had come a long way indeed. He did not correct the earlier errors concerning Bloomfield's assumption about meaning, but at least he came to the position of questioning the correctness of such an assumption. It remains to be seen how long it will take for his relatively recently acquired skepticism concerning some of his own widely accepted views to filter back to the methods of teaching and testing for which his earlier views served as after-the-fact supports, if not indeed foundational pillars. It may not have been Chomsky's desire to have his theoretical thinking applied as it has been (see his remarks at the Northeast Conference, 1966, reprinted in 1973), but can anyone deny that it has been applied in such ways? Moreover, though some of his arguments may have been very badly misunderstood by some applied linguists, his argument about the autonomy of grammar is simple enough not to be misunderstood by anyone.

C. Pattern drills without meaning

What effects, then, have the linguistic theories briefly discussed above had on methods of language teaching and subsequently on methods of language testing? The effects are direct, obvious, and unmistakable.
From meaning-less linguistic analysis comes meaning-less pattern drill to instill the structural patterns or the distributional 'meanings' of linguistic forms as they are strung together in utterances. In the Preface (by Charles C. Fries) to the first edition of English Sentence Patterns (see Lado and Fries, 1957), we read:

The 'grammar' lessons here set forth ... consist basically of exercises to develop habits ...

What kinds of habits? Exactly the sort of habits that Harris believed were the essence of language structure and that his 'distributional analysis' aimed to characterize. The only substantial difference was that in the famed Michigan approach to teaching English as a second or foreign language, it was the learner in the classroom who was to apply the distributional discovery procedure (that is, the procedure for putting the elements of language in proper perspective in relation to each other). Fries continues:

The habits to be learned consist of patterns or molds in which the 'words' must be grasped. 'Grammar' from the point of view of these materials is the particular system of devices which a language uses to signal one of its various layers of meaning - structural meaning ( ... ). 'Knowing' this grammar for practical use means being able to produce and to respond to these signals of structural meaning. To develop such habits efficiently demands practice and more practice, especially oral practice. These lessons provide the exercises for a sound sequence of such practice to cover a basic minimum of production patterns in English (p. v).

In his Foreword to the later edition, Lado suggests,

The lessons are most effective when used simultaneously with English Pattern Practices, which provides additional drill for the patterns introduced here (Lado and Fries, 1957, p. iii).

In his introduction to the latter mentioned volume, Lado continues, concerning the Pattern Practice materials, that

they represent a new theory of language learning, the idea that to learn a new language one must establish orally the patterns of the language as subconscious habits. These oral practices are directed specifically to that end. (His emphasis, Lado and Fries, 1958, p. xv)

... in these lessons, the student is led to practise a pattern, changing some element of that pattern each time, so that normally he never repeats the same sentence twice. Furthermore, his attention is drawn to the changes, which are stimulated by pictures, oral substitutions, etc., and thus the pattern itself, the significant framework of the sentence, rather than the particular sentence, is driven intensively into his habit reflexes. It would be false to assume that Pattern Practice, because it aims at habit formation, is unworthy of the educated mind, which, it might be argued, seeks to control language through conscious understanding. There is no disagreement on the value of having the human mind understand in order to be at its learning best. But nothing could be more enslaving and therefore less worthy of the human mind than to have it chained to the mechanics of the patterns of the language rather than free to dwell on the message conveyed through language. It is precisely because of this view that we discover the highest purpose of pattern practice: to reduce to habit what rightfully belongs to habit in the new language, so that the mind and the personality may be freed to dwell in their proper realm, that is, on the meaning of the communication rather than the mechanics of the grammar (pp. xv-xvi).

And just how do these pattern practices work? An example or two will display the principle adequately. For instance, here is one from Lado and Fries (1957):

Exercise 1c.1. (To produce affirmative short answers.) Answer the questions with YES, HE IS; YES, SHE IS; YES, IT IS; ... For example:

Is John busy? YES, HE IS.
Is the secretary busy? YES, SHE IS.
Is the telephone busy? YES, IT IS.
Am I right? YES, YOU ARE.
Are you and John busy? YES, WE ARE.
Are the students homesick? YES, THEY ARE.
Are you busy? YES, I AM.

(Continue:)
1. Is John busy?
2. Is the secretary busy?
3. Is the telephone busy?
4. Are you and John busy?
5. Are the students homesick?
6. Are you busy?
7. Is the alphabet important?
8. Is Mary tired?
9. Is she hungry?
10. Are you tired?
11. Is the teacher right?
12. Are the students busy?
13. Is the answer correct?
14. Am I right?
15. Is Mr. Brown a doctor?
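The purely formal character of such a drill can be made vivid with a short computational sketch. The following Python fragment is hypothetical (it comes from none of the texts under discussion, and every name in it is invented for illustration); it produces the keyed responses for Exercise 1c.1 by consulting nothing but the surface form of each cue. No meaning is represented anywhere, which is precisely the point at issue.

# A minimal sketch of the mechanics of a substitution drill such as
# Exercise 1c.1. The keyed answers can be produced by purely formal
# rules: match a pronoun to the surface subject and echo the verb.
# No meanings are represented; the program only dramatizes how little
# the drill demands of the learner.

PRONOUNS = {
    "John": "HE", "Mr. Brown": "HE", "Mary": "SHE", "she": "SHE",
    "the secretary": "SHE", "the telephone": "IT", "the alphabet": "IT",
    "the answer": "IT", "the students": "THEY",
    "the teacher": "SHE",          # keyed arbitrarily, like the drill itself
}

def short_answer(question: str) -> str:
    """Produce the keyed affirmative short answer for a question of
    the form 'Is X ...?' / 'Are X ...?' / 'Am I ...?'."""
    q = question.rstrip("?").strip()
    verb, rest = q.split(" ", 1)
    if verb == "Am":                       # Am I right? -> YES, YOU ARE.
        return "YES, YOU ARE."
    if rest.startswith("you and"):         # Are you and John ...? -> WE
        return "YES, WE ARE."
    if rest.startswith("you"):             # Are you ...? -> I AM
        return "YES, I AM."
    for subject, pronoun in PRONOUNS.items():
        if rest.lower().startswith(subject.lower()):
            be = "IS" if verb == "Is" else "ARE"
            return f"YES, {pronoun} {be}."
    return "YES, IT IS."                   # default; still the 'correct' form

for q in ["Is John busy?", "Is the telephone busy?", "Are you tired?",
          "Am I right?", "Is the alphabet important?"]:
    print(q, "->", short_answer(q))

That a dozen lines of string matching suffice to 'pass' the drill dramatizes how little the exercise demands: nothing in it requires, or rewards, access to meaning.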
Suppose that the well-meaning student wants to discover the meaning of the sentences that are presented as Lado and Fries suggest. How could it be done? How, for instance, will it be possible for the learner to discover the differences between a phone being busy and a person being busy? Or between being a doctor and being correct? Or between being right and being tired? Or between the appropriateness of asking a question like, 'Are you hungry?' on certain occasions, but not 'Are you the secretary?'

While considering these questions, consider a series of pattern drills selected more or less at random from a 1975 text entitled From Substitution to Substance. The authors purport to take the learner from 'manipulation to communication' (Paulston and Bruder, 1975, p. 5). This particular series of drills (supposedly progressing from more manipulative and mechanical types of drills to more meaningful and communicative types) is designed to teach adverbs that involve the manner in which something is done, as specified by with plus a noun phrase: He opened the door with a key.

Model: C [Cue] can/church key
       R [Response] He opened the can with a church key.
T [Teacher says]        S [Student responds]
bottle/opener           He opened the bottle with an opener.
box/his teeth           He opened the box with his teeth.
letter/knife            He opened the letter with a knife.

M1 [Mechanical type of practice]
Teaching Point: Contrast WITH + N/BY + N
Model: C He used a plane to go there.
       R He went there by plane.
       C He used his teeth to open it.
       R He opened it with his teeth.
T [Teacher]                          S [Student]
He used a telegram to answer it.     He answered it by telegram.
He used a key to unlock it.          He unlocked it with a key.
He used a phone to contact her.      He contacted her by phone.
He used a smile to calm them.        He calmed them with a smile.
He used a radio to talk to them.     He talked to them by radio.

M2 [Meaningful drill according to the authors]
Teaching Point: Use of HOW and Manner Adverbials
Model: C [Cue] open a bottle
       R1 [one possible response] How do you open a bottle?
       R2 [another] (With an opener.)
       C finance a car
       R1 How do you finance a car?
       R2 (With a loan from the bank.) (By getting a loan.)
T [Teacher]
light a fire
sharpen a pencil
make a sandwich
answer a question

C [Communicative drill according to the authors]
Teaching Point: Communicative Use
Model: C [Cue] How do you usually send letters to your country?
       R (By airmail.) (By surface mail.)
       C How does your friend listen to your problems?
       R (Patiently.) (With a smile.)
T [Teacher]
How do you pay your bills here?
How do you find your apartment here?
How will you go on your next vacation?
How can I find a good restaurant?

In the immediately foregoing pattern drills the point of the exercise is specified in each case. The pattern that is being drilled is the only motivation for the collection of sentences that appears in a particular drill. That is why at the outset of this section we used the term 'syntactically-based pattern drills'. The drills that are selected are not exceptional in any case. They are in fact characteristic of the major texts in English as a second language and practically all of the texts produced in recent years for the teaching of foreign languages (in ESL/EFL see National Council of Teachers of English, 1973, Bird and Woolf, 1968, Nadler, Marelli, and Haynes, 1971, Rutherford, 1968, Wright and McGillivray, 1971, Rand, 1969a, 1969b, and many others).
They all present learners with lists of sentences that are similar in structure (though not always identical, as we will see shortly) but which are markedly different in meaning. How will the learner be able to discover the differences in meaning between such similar forms? Of course, we must assume that the learner is not already a native speaker - otherwise there would be no point in studying ESL/EFL or a foreign language. The native speaker knows the fact that saying, 'He used a plane to go there,' is a little less natural than saying, 'He went there in an airplane,' or 'He flew,' but how will the non-native speaker discover such things on the basis of the information that can be made available in the drill?

Perhaps the authors of the drill are expecting the teacher to act out each sentence in some way, or creatively to invent a meaningful context for each sentence as it comes up. Anyone who has tried it knows that the hour gets by before you can make it through just a few sentences. Inventing contexts for sentences like, 'Is the alphabet important?' and 'Is the telephone busy?' is like trying to write a story where the characters, the plot, and all of the backdrops change from one second to the next. It is not just difficult to conceive of a context in which students can be homesick, the alphabet important, the telephone, the secretary, and John busy, Mary tired, me right, Brown a doctor and so on, but before you get to page two the task is impossible. If it is difficult for a native speaker to perform the task of inventing contexts for such bizarre collections of utterances, why should anyone expect a non-native speaker who doesn't know the language to be able to do it? The simple truth is that they cannot do it. It is no more possible to learn a language by such a method than it is to analyze a language by the form of distributional analysis proposed by Harris. It is necessary to get some data in the form of pragmatic mappings of utterances onto meaningful contexts - failing that, it is not possible either to analyze a language adequately or to learn one at all.

Worse yet, the typical (not the exceptional, but the ordinary everyday garden variety) pattern drill is bristling with false leads about similarities that are only superficial and will lead almost immediately to unacceptable forms - the learner of course won't know that they are unacceptable, because he is not a native speaker of the language and has little or no chance of ever discovering where he went wrong. The bewildered learner will have no way of knowing that for a person to be busy is not like a telephone being busy. What information is there to prevent the learner from drawing the reasonable conclusion that if telephones can be busy in the way that people can, then televisions, vacuum cleaners, telegraphs, and typewriters can be busy in the same sense? What will keep the learner from having difficulty distinguishing the meanings of alphabet, telephone, secretary, and doctor if he doesn't already know the meanings of those words? If the learner doesn't already know that a doctor, a secretary, and a teacher are phrases that refer to people with different occupational statuses, how will the drill help him to discover this information? What, from the learner's point of view, is different about homesick and a doctor? What will prevent the learner from saying Are the students doctor and Is the alphabet busy?
Meaning would prevent such absurdities, but the nature of the pattern drill encourages them. It is an invitation to confusion. Without meaning to help keep similar forms apart, the result is simple. They cannot be kept apart - they become mixed together indiscriminately. This is not because learners are lacking in intelligence; rather, it is because they are in fact quite intelligent, and they use their intelligence to classify similar things together and to keep different things apart. But how can they keep different things apart when it is only the superficial similarities that have been constantly called to their attention in a pattern drill?

The drills proposed by Paulston and Bruder are more remarkable than the ones by Lado and Fries, because the Paulston-Bruder drills are supposed to become progressively more meaningful - but they do not. They merely become less obviously structured. The responses and the stimuli to elicit them don't become any more meaningful. The responses merely become less predictable as one progresses through the series of drills concerning each separate point of grammatical structure. Who but a native speaker of English will know that opening a door with a key is not very much like opening a can with a church key? The two sentences are alike in fact primarily in terms of the way they sound. If the non-native knew the meanings before doing the drill there would be no need for doing the drill - but if he does need the drill, it will do him absolutely no good and probably some harm. What will keep him from saying, He opened the can with a hand? Or, He opened the bottle with a church key? Or, He opened the car with a door? Or, He opened the faucet with a wrench? If he can say, He opened the letter with a letter opener, why not, He opened the box with a box opener? Or, He opened the plane with a plane opener? If the learner is encouraged to say, He used a key to unlock it, why not, He used a letter to write her, or He used a call to phone her?

In the drill above labelled M1, which the authors describe as a mechanical drill, the object is to contrast phrases like with a knife and by telegram. Read over the drill and then try to say what will prevent the learner from coming up with forms like, He went there with a plane, He contacted her with a phone, He unlocked it by key, He calmed them by smile, He used radio to talk to them, He used phone to contact her, He contacted her with telegram, etc.
In the drill labelled M2, which is supposed to be somewhat more meaningful, additional traps are neatly laid for the non-native speaker who is helplessly at the mercy of the patterns laid out in the drill. When asked how to light a fire, sharpen a pencil, make a sandwich, or answer a question, he has still fresh in his memory the keyed answers to the questions about how to open a bottle or finance a car. He has heard that you can finance a car by getting a loan or that you can open a bottle with an opener. What is to prevent the unsuspecting and naive learner from saying that you can open a bottle by getting an opener - structurally the answer is flawless and in line with what he has just been taught. The answer is even creative. But pragmatically it is not quite right. Because of a quirk of the language that native speakers have learned, getting an opener does not imply using it, though with an opener in response to the question How do you open a bottle? does imply the required use of the opener. What will keep the learner from creatively inventing forms like, by a match in response to the question about starting a fire? Or by a pencil sharpener in answer to how you sharpen a pencil? Would it not be perfectly reasonable if, when asked How do you answer a question? the learner replied with an answer? or by answering?

The so-called 'Communicative Use' drill offers even more interesting traps. How do you send your letters? By an airplane of course. Or sometimes I send them with a ship. When I'm in a hurry though I always send them with a plane. How does your friend listen to your problems? By patience mostly, but sometimes he does so by smiling. Bills? Well, I almost always bill them by mail or with a car - sometimes in an airmail. My apartment? Easy. I found it by a friend. We went there with a car. My next vacation? With an airplane. My girl friend is wanting to go too, but she goes by getting a loan with a bank. A good restaurant? You can go with a taxi. Is there any need to say more? Is there an English teacher alive anywhere who cannot write reams on the topic? What then, Oh Watchman of the pattern drill? The pattern drill without meaning, my Son, is as a door opening into darkness and leading nowhere but to confusion.

If the preceding examples were exceptional, there might be reason to hope that pattern drills of the sort illustrated above might be transformed into more meaningful exercises. There is no reason for such a hope, however. Pattern drills which are unrelated and intrinsically unrelatable to meaningful extralinguistic contexts are confusing precisely because they are well written - that is, in the sense that they conform to the principles of the meaning-less theory of linguistic analysis on which they were based. They are unworkable as teaching methods for the same reason that the analytical principles on which they are based are unworkable as techniques of linguistic analysis. The analytical principles that disregard meaning are not just difficult to apply; they are fundamentally inapplicable to the objects of interest - namely, natural languages.

D. From discrete point teaching (meaning-less pattern drills) to discrete point testing

One might have expected that the hyperbole of meaning-less language was fully expressed in the typical pattern drills that characterized the language teaching of the 1950s and to a lesser extent is still characteristic of most published materials today. However, a further step toward complete meaninglessness was possible, and it was advocated by two leading authorities of the 1960s. Brooks (1964) and Morton (1960, 1966) urged that the minds of the learners who were manipulating the pattern drills should be kept free and unencumbered by the meanings of the forms they were practicing. Even Lado and Fries (1957, 1958) at least argued that the main purpose of pattern drills was not only to instill 'habits' but was to enable learners to say meaningful things in the language. But Brooks and Morton developed the argument that skill in the purely manipulative use of the language, as taught in pattern drills, would have to be fully mastered before proceeding to the capacity to use the language for communicative purposes. The analogy offered was the practicing of scales and arpeggios by a novice pianist, before the novice could hope to join in a concerto or to use the newly acquired habits expressively.
Clark (1972) apparently accepted this two-stage model in relation to the acquisition of listening comprehension in a foreign language. Furthermore, he extended the model as a justification for discrete point and integrative tests:

Second-level ability cannot be effectively acquired unless first-level perception of grammatical cues and other formal interrelationships among spoken utterances has become so thoroughly learned and so automatic that the student is able to turn most of his listening attention to 'those elements which seem to him to contain the gist of the message' (Rivers, 1967, p. 193, as quoted by Clark, 1972, p. 43).

Clark continues:

Testing information of a highly diagnostic type would be useful during the 'first stage' of instruction, in which sound discriminations, basic patterns of spoken grammar, items of functional vocabulary, and so forth were being formally taught and practised.... As the instructional emphasis changes from formal work in discrete aspects to more extensive and less controlled listening practice, the utility (and also the possibility) of diagnostic testing is reduced in favor of evaluative procedures which test primarily the students' comprehension of the 'general message' rather than the apprehension of certain specific sounds or sound patterns (p. 43).

How successful has this two-stage dichotomy proved to be in language teaching? Stockwell and Bowen hinted at the core of the difficulty in their introduction to Rutherford (1968):

The most difficult transition in learning a language is going from mechanical skill in reproducing patterns acquired by repetition to the construction of novel but appropriate sentences in natural social contexts. Language teachers ... not infrequently ... fumble and despair, when confronted with the challenge of leading students comfortably over this hurdle.

What if the hurdle were an unnecessary one? What if it were a mere artefact of the attempt to separate the learning of the grammatical patterns of the language from the communicative use of the language? If we asked how often children are exposed to meaningless non-contextualized language of the sort that second language learners are so frequently expected to master in foreign language classrooms, the answer would be, never. Are pattern drills, therefore, necessary to language learning? The answer must be that they are not. Further, pattern drills of the non-contextualized and non-contextualizable variety are probably about as confusing as they are informative.

If, as we have already seen above, pattern drills are associated with the 'first stage' of a two-stage process of teaching a foreign language, and if the so-called 'diagnostic tests' (or discrete point tests) are also associated with that first stage, it only remains to show the connection between the pattern drills and the discrete point items themselves.
Once this is accomplished, we will have illustrated each link in the chain from certain linguistic theories to discrete point methods of language testing.

Perhaps the area of linguistic analysis which developed the most rapidly was the level of phonemics. Accordingly, a whole tradition of pattern drills was created. It was oriented toward the teaching of 'pronunciation', especially the minimal phonemic contrasts of various target languages. For instance, Lado and Fries (1954) suggested:

A very simple drill for practicing the recognition of ... distinctive differences can be made by arranging minimal pairs of words on the blackboard in columns thus:

man    men
lass   less
lad    led
pan    pen
bat    bet
sat    set

(The words they used were offered in phonetic script but are presented here in their normal English spellings.) The authors continue:

The teacher pronounces pairs of words in order to make the student aware of the contrast. When the teacher is certain that the students are beginning to hear these distinctions he can then have them actively participate in the exercise (p. iv).

In a footnote the reader is reminded:

Care must be taken to pronounce such contrasts with the same intonation on both words so that the sole difference between the words will be the sound under study (op cit).

It is but a short step to test items addressed to minimal phonological contrasts. Lado and Fries point out in fact that a possible test item is a picture of a woman watching a baby versus a woman washing a baby. In such a case, the examinee might hear the statement, The woman is washing the baby, and point to or otherwise indicate the picture to which the utterance is appropriate.

Harris (1969) observes that the minimal pair type of exercise, of the sort illustrated above, 'is, in reality, a two-choice "objective test", and most sound discrimination tests are simply variations and expansions of this common classroom technique' (pp. 32-3). Other variations for which Harris offers examples include heard pairs of words where the learner (or in this case, the examinee) must indicate whether the two words are the same or different; or a heard triplet where the learner must indicate which of three words (e.g., jump, chump, jump) was different from the other two; or a heard sentence in which either member of a minimal pair might occur (e.g., It was a large ship, versus, It was a large sheep) where the examinee must indicate either a picture of a large ship or a large sheep depending on what was heard. Harris refers to the last case as an example of testing minimal pairs 'in context' (pp. 33-4). It is not difficult to see, however, that the types of contexts in which one might expect to find both ships and sheep are relatively few in number - certainly a very small minority of the possible contexts in which one might expect to find ships without sheep or sheep without ships.

Vocabulary teaching by discrete point methods also leads rather directly to discrete point vocabulary tests. For instance, Bird and Woolf (1968) include a substitution drill set in a sentence frame of That's a ___, or This is a ___ with such items as chair, pencil, table, book, and door. It is a short step from such a drill to a series of corresponding test items. For example, Clark (1972) suggests a test item where a door, chair, table, and bed are pictured. Associated with each picture is a letter which the student may mark on an answer sheet for easy scoring. The learner hears in French, Voici une chaise, and should correspondingly mark the letter of the picture of the chair on the answer sheet.
Other item types suggested by Harris (1969) include: a word followed by several brief definitions from which the examinee must select the one that corresponds to the meaning of the given word; a definition followed by several words from which the examinee must select the one closest in meaning to the given definition; a sentence frame with an underlined word and several possible synonyms from which the examinee must choose the best alternative; and a sentence frame with a blank to be filled by one of several choices, where all but one of the choices fail to agree with the meaning requirements of the sentence frame.

Test items of the discrete point type aimed at assessing particular grammatical rules have often been derived directly from pattern drill formats. For example, in their text for teaching English as a foreign language in Mali, Bird and Woolf (1968) recommend typical transformation drills from singular statements to plural ones (e.g., Is this a book? to Are these books? and reverse, see p. 14a); from negative to negative interrogative (e.g., John isn't here, to Isn't John here? see p. 83); from interrogative to negative interrogative (e.g., Are we going? to Aren't we going? see p. 83); statement to question (e.g., He hears about Takamba, to What does he hear about? see p. 131); and so forth. There are many other types of possible drills in relation to syntax, but the fact that drills of this type can be and have been translated more or less directly into test items is sufficient perhaps to illustrate the trend. Spolsky, Murphy, Holm, and Ferrel (1972, 1975) give examples of test items requiring transformations from affirmative form to negative, or to question form, from present to past, or from present to future as part of a 'functional test of oral proficiency' for adult learners of English as a second language.

Many other examples could be given illustrating the connection between discrete point teaching and discrete point testing, but the foregoing examples should be enough to indicate the relationship, which is simple and fairly direct. Discrete point testing derives from the pattern drill methods of discrete point teaching and is therefore subject to many of the same difficulties.
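Seen abstractly, every one of the item formats just surveyed reduces to the same thing: a prompt, a short list of forced choices, and a key. The sketch below is hypothetical - the class and the sample items are invented for illustration, drawing on the examples cited above - but it makes that reduction explicit.

# A hypothetical sketch of discrete point items as keyed forced choices.
# Each item, whatever its label ('phonology', 'vocabulary', 'grammar'),
# reduces to a prompt, a list of options, and a keyed answer.

from dataclasses import dataclass

@dataclass
class DiscretePointItem:
    label: str       # the component the item is said to test
    prompt: str      # what the examinee hears or reads
    options: list    # the forced choices
    key: int         # index of the keyed answer

items = [
    DiscretePointItem("phonology", "It was a large ship.",
                      ["picture of a ship", "picture of a sheep"], 0),
    DiscretePointItem("phonology", "jump, chump, jump",
                      ["first", "second", "third"], 1),
    DiscretePointItem("vocabulary", "Voici une chaise.",
                      ["door", "chair", "table", "bed"], 1),
    DiscretePointItem("grammar", "John isn't here. ->",
                      ["Isn't John here?", "Is John here not?"], 0),
]

def score(responses):
    """Count keyed answers; the total says nothing about whether the
    labels name genuinely separable skills."""
    return sum(1 for item, r in zip(items, responses) if r == item.key)

print(score([0, 1, 1, 0]))   # -> 4

The point is not that such items cannot be written - plainly they can - but that the component labels attached to them are assigned by fiat, an issue taken up in section F below.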
E. Contrastive linguistics

One of the strongholds of the structural linguistics of the 1950s, and perhaps to a lesser extent the 1960s, was contrastive analysis. It has had less influence on work in the teaching of English as a second language in the United States than it has had on the teaching of English as a foreign language and the teaching of other foreign languages. There is no way to apply contrastive analysis to the preparation of materials for teaching English as a second language when the language backgrounds of the students range from Mandarin Chinese, to Spanish, to Vietnamese, to Igbo, to German, etc. It would be impossible for any set of materials to take into account all of the contrasts between all of the languages that are represented in many typical college level classes for ESL in the U.S. However, the claims of contrastive analysis are still relatively strong in the teaching of foreign languages, and in recent years they have been reasserted in relation to the teaching of the majority variety of English as a second dialect to children who come to school speaking some other variety.

The basic idea of contrastive analysis was stated by Lado (1957). It is the assumption that

we can predict and describe the patterns that will cause difficulty in learning, and those that will not cause difficulty, by comparing systematically the language and culture to be learned with the native language and culture of the student. In our view, the preparation of up-to-date pedagogical and experimental materials must be based on this kind of comparison (p. vii).

Similar claims were offered by Politzer and Staubach (1961), Lado (1964), Strevens (1965), Rivers (1964, 1968), Barrutia (1969), and Bung (1973). All of these authors were concerned with the teaching of foreign languages. More recently, the claims of contrastive analysis have been extended to the teaching of reading in the schools. Reed (1973) says,

the more 'radically divergent' the non-standard dialect (i.e., the greater the structural contrast and historical autonomy vis-a-vis standard English), the greater the need for a second language strategy in teaching Standard English (p. 294).

Farther on she reasons that

unless the learner is enabled to bring to the level of consciousness, i.e., to formalize his intuitions about his dialect, it is not likely that he will come to understand and recognize the systematic points of contrast and interference between his dialect and the Standard English he must learn to control (p. 294).

Earlier, Goodman (1965) offered a similar argument based on contrastive analysis. He said,

the more divergence there is between the dialect of the learner and the dialect of learning, the more difficult will be the task of learning to read (as cited by Richards, 1972, p. 250).

(Goodman has since changed his mind; see Goodman and Buck, 1973.)

If these remarks were expected to have the same sort of effects on language testing that other notions concerning language teaching have had, we should expect other suggestions to be forthcoming about related (in fact derived) methods of language testing. Actually, the extension to language testing was suggested by Lado (1961) in what was perhaps the first major book on the topic. He reasoned that language tests should focus on those points of difference between the language of the learners and the target language. First, the 'linguistic problems' were to be determined by a 'contrastive analysis' of the structures of the native and target languages. Then, the test

... will have to choose a few sounds and a few structures at random hoping to give a fair indication of the general achievement of the student (p. 28).

More recently, testing techniques that focus on discrete points of difference between two languages or two dialects have generally fallen into disfavor. Exceptions, however, can be found. One example is the test used by Politzer, Hoover, and Brown (1974) to assess degree of control of two important dialects of English. Such items of difference between the majority variety of English and the minority variety at issue included the marking of possessives (e.g., John's house versus John house), and the presence or absence of the copula in the surface form (e.g., He goin' to town versus He is going to town). Interestingly, except for the manifest influence of contrastive analysis and discrete point theory on the scoring of the test used by Politzer et al, it could be construed as a pragmatic test (i.e., it consisted of a sequential text where the task set the learner was to repeat sequences of material presented at a conversational rate).
Among the most serious difficulties for tests based on contrastive linguistics is that they should be suited (in theory at least) to only one language background - namely, the language on which the contrastive analysis was performed. Upshur (1962) argues that this very fact results in a peculiar dilemma for contrastively based tests: either the tests will not differentiate ability levels among students with the same native language background and experience, 'or the contrastive analysis hypothesis is invalid' (p. 127). Since the purpose of all tests is to differentiate success and failure, or degrees of one or the other, any contrastively based test is therefore either not a test or not contrastively based. A more practical problem for contrastively based tests is that learners from different source languages (or dialects) would require different tests. If there are many source languages, contrastively based tests become infeasible.

Further, from the point of view of pragmatic language testing as discussed in Part One above, the contrastive analysis approach is irrelevant to the determination of what constitutes an adequate language test. It is an empirical question as to whether tests can be devised which are more difficult for learners from one source language than for learners from other source languages (where proficiency level is a controlled variable). Wilson (in press) discusses this problem and presents some evidence suggesting that contrastive analysis is not helpful in determining which test items will be difficult for learners of a certain language background. In view of the fact that contrastive analysis has proved to be a poor basis for predicting errors that language learners will make, or for hierarchically ranking points of structure according to their degree of difficulty, it seems highly unlikely that it will ever provide a substantial basis for the construction of language tests. At best, contrastive analysis provides heuristics only for certain discrete point test items. Any such items must then be subjected to the same sorts of validity criteria as are any other test items.

F. Discrete elements of discrete aspects of discrete components of discrete skills - a problem of numbers

One of the serious difficulties of a thoroughgoing analytical model of discrete point testing is that it generates an inordinately large number of tests. If, as Lado (1961) claimed, we 'need to test the elements and the skills separately' (p. 28), and if, as he further argued, we need separate tests for supposedly separate components of language ability, and for both productive and receptive aspects of those components, we wind up needing a very large number of tests indeed. It might seem odd to insist on pronunciation tests for speaking and separate pronunciation tests for listening, but Lado did argue that the 'linguistic problems' to be tested 'will differ somewhat for production and for recognition' and that therefore 'different lists are necessary to test the student's pronunciation in speaking and in listening' (p. 45). Such distinctions, of course, are also common to tests in speech pathology.
Hence, what is required by discrete point testing theory is a set of items for testing the elements of phonology, lexicon (or vocabulary), syntax, and possibly an additional component of semantics (depending on the theory one selects), times the number of aspects one recognizes (e.g., productive versus receptive), times the number of separate skills one recognizes (e.g., listening, speaking, reading, writing, and possibly others). In fact, several different models have been proposed. Figure 5 below illustrates the componential analysis suggested by Harris (1969); Figure 6 shows the analysis suggested by Cooper (1972); Figure 7 shows a breakdown offered by Silverman, Noa, Russell, and Molina (1976); and finally, Figure 8 shows a slightly different model of discrete point categories proposed by Oller (1976c) in a discussion of possible research recommended to test certain crucial hypotheses generated by discrete point and pragmatic theories of language testing.

[Figure 5. Componential breakdown of language proficiency proposed by Harris (1969, p. 11): the skills of listening, speaking, reading, and writing crossed with the components phonology/orthography, structure, vocabulary, and rate and general fluency.]

[Figure 6. Componential analysis of language skills as a framework for test construction, from Cooper (1968, 1972, p. 337): knowledge components (phonology, syntax, semantics, total) crossed with the skills of listening, speaking, reading, and writing.]

[Figure 7. 'Language assessment domains' as defined by Silverman et al (1976, p. 21): receptive and productive modalities (listening and speaking in the auditory/articulatory mode; reading and writing in the visual/manual mode), each crossed with phonology, structure, and vocabulary.]

[Figure 8. Schematic representation of constructs posited by a componential analysis of language skills based on discrete point test theory, from Oller (1976c, p. 150): listening and speaking crossed with phonology, structure, and vocabulary, and reading and writing crossed with graphology, structure, and vocabulary.]

Harris's model would require (in principle) sixteen separate tests or subtests; Cooper's would require twenty-four separate tests or subtests; the model of Silverman et al would require sixteen; and the model of Oller would require twelve. What classroom teacher has time to develop so many different tests? There are other grave difficulties in the selection of items for each test and in determining how many items should appear in each test, but these are discussed in Chapter 7, section A.
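The arithmetic behind these totals is bare multiplication of the posited dimensions, as a minimal sketch shows (the factorizations for Harris's and Oller's models follow the figure descriptions above; Cooper's and Silverman's totals are simply those reported in the text, and the function name is invented):

# The number of separate tests demanded by a componential model is
# just the product of the sizes of its posited dimensions.

from math import prod

def subtests_required(*dimension_sizes: int) -> int:
    """Number of separate tests a componential model demands: the
    product of the sizes of its posited dimensions."""
    return prod(dimension_sizes)

# Harris (1969): four skills crossed with four components
print(subtests_required(4, 4))    # -> 16
# Oller (1976c): four skills crossed with three components each
print(subtests_required(4, 3))    # -> 12
# Cooper (1972) and Silverman et al (1976) slice the pie differently,
# arriving at twenty-four and sixteen subtests respectively.

Each newly posited aspect multiplies, rather than adds to, the number of tests required - which is why the totals climb so quickly.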
The question is, are all the various tests or subtests necessary? We don't normally use only our phonology, or vocabulary, or grammar; why must they be taught and tested separately? An empirical question which must be answered in order to justify such componentialization of language skill is whether tests that purport to measure the same component of language skill (or the same aspect, modality, or whatever) are in fact more highly correlated with each other than with tests that purport to measure different components. Presently available research results show many cases where tests that purport to measure different components or aspects (etc.) correlate as strongly or even more strongly than do tests that purport to measure the same components. These results are devastating to the claims of construct validity put forth by advocates of discrete point testing.

For example, Pike (1973) found that scores on an essay task correlated more strongly with the Listening Comprehension subscore on the TOEFL than with the Reading Comprehension subscore for three different groups of subjects. This sort of result controverts the prediction that tasks in the reading-writing modality ought to be more highly intercorrelated with each other than with tasks in the listening-speaking modality (of course, the TOEFL Listening Comprehension subtest does require reading). Perhaps more surprisingly, Darnell (1968) found that a cloze task was more highly correlated with the Listening Comprehension section on the TOEFL than with any of the other subscores. Oller and Conrad (1971) got a similar result with the UCLA ESL Placement Examination Form 2C. Oller and Streiff (1975) found that a dictation task was more strongly correlated with each other part of the UCLA ESLPE Form 1 than any of the other parts were with each other. This was particularly surprising to discrete point theorizing in view of the fact that the dictation was the only section of the test that required substantial listening comprehension. Except for the phonological discrimination task, which required distinguishing minimal pairs in sentence sized contexts, no other subsection required listening comprehension at all.

In conclusion, if tasks that bear a certain label (e.g., reading comprehension) correlate as well with tasks that bear different labels (e.g., listening comprehension, or vocabulary, or oral interview, etc.) as they do with each other, what independent justification can be offered for their distinct labels or for the positing of separate skills, aspects, and components of language? The only justification that comes to mind is the questionable theoretical bias of discrete point theory. Such a justification is a variety, a not too subtle variety in fact, of validity by fiat, or nominal validity - for instance, the statement that a 'listening comprehension test' is a test of 'listening comprehension' because that is what it was named by its author(s); or that a 'reading' test is distinct from a 'grammar' test because they were assigned different labels by their creators. A 'vocabulary' test is different from a 'grammar' or 'phonology' test because they were dubbed differently by the theorists who rationalized the distinction in the first place.

There is a better basis for labeling tests. Tests may be referred to, not in terms of the hypothetical constructs that they are supposed to measure, but rather in terms of the sorts of operations they actually require of learners or examinees. A cloze test requires learners to fill in blanks. A dictation requires them to write down phrases, clauses or other meaningful segments of discourse. An imitation task requires them to repeat or possibly rephrase material that is heard. A reading aloud task requires reading aloud. And so forth. A synonym matching task is the most popular form of task usually called a 'Vocabulary Test'. If tests were consistently labeled according to what they require learners to do, a great deal of argumentation concerning relatively meaningless and misleading test labels could be avoided. Questions about test validity are difficult empirical questions which are only obscured by the careless assignment of test labels on the basis of untested theoretical (i.e., theory based) biases.
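The empirical check called for here is easy to state in computational terms: compute the intercorrelations of the subtest scores and ask whether same-label pairs systematically exceed different-label pairs. The sketch below uses invented scores driven by a single general factor (a deliberate caricature of the findings of Pike, Darnell, and Oller and Streiff); any real battery would substitute its own data.

# A minimal sketch of the convergent/discriminant check described in
# the text: do subtests with the same label correlate more highly
# with each other than with differently labeled subtests? All scores
# here are invented for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n = 50                                # hypothetical examinees
g = rng.normal(size=n)                # a single general proficiency factor

# Four subtests, two labels; all four are built mostly from the same
# general factor, as the studies cited suggest real batteries often are.
scores = {
    "listening_A": g + 0.5 * rng.normal(size=n),
    "listening_B": g + 0.5 * rng.normal(size=n),
    "reading_A":   g + 0.5 * rng.normal(size=n),
    "reading_B":   g + 0.5 * rng.normal(size=n),
}

names = list(scores)
r = np.corrcoef([scores[k] for k in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        same = names[i].split("_")[0] == names[j].split("_")[0]
        print(f"{names[i]:12s} {names[j]:12s} r={r[i, j]:.2f}"
              f"  {'same label' if same else 'different label'}")

When the same-label correlations fail to exceed the cross-label ones - as in the studies cited - the labels carry no construct validity of their own.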
In the final analysis, the question of whether or not language skill can be split up into so many discrete elements, components, and so forth, is an empirical question. It cannot be decided a priori by fiat. It is true that discrete point tests can be made from bits and pieces of language, just as pumpkin pie can be made from pumpkins. The question is whether the bits and pieces can be put together again, and the extent to which they are characteristic of the whole. We explore this question more thoroughly in Chapter 8 below. Here, the intent has been merely to show that discrete point theories of testing are derived from certain methods of teaching which in their turn derive from certain methods of linguistic analysis.

KEY POINTS
1. Theories of language influence theories of language learning, which in their turn influence theories of language teaching, which in their turn influence theories and methods of language testing.
2. A unidirectional influence from theory to practice (and not from practical findings to theories) is unhealthy.
3. A theory that cannot be mortally endangered is not alive.
4. Language is typically used for meaningful purposes - therefore, any theory of language that hopes to attain a degree of adequacy must countenance this fact.
5. Structural analysis less meaning led to pattern drills without meaning.
6. Bloomfield's exclusion of meaning from the domain of interest to linguistics was reified in the distributional discovery procedure recommended by Zellig Harris.
7. The insistence on grammar without meaning was perpetuated by Chomsky in 1957 when he insisted on a grammatical analysis of natural languages with no reference to how those languages were used in normal communication.
8. Even when semantic notions were incorporated into the 'integrated theory' of the mid sixties, Katz and Fodor, and Katz and Postal insisted that the speaker's knowledge of how language is used in relation to extralinguistic contexts should remain outside the pale of interest.
9. Analysis of language without reference to meaning or context led to theories of language learning which similarly tried to get learners to internalize grammatical rules with little or no chance of ever discovering the meanings of the utterances they were encouraged to habitualize through manipulative pattern drill.
10. Pattern drills, like the linguistic analyses from which they derive, focussed on 'structural meaning' - the superficial meaning that was associated with the distributional patterns of linguistic elements relative to each other and largely independent of any pragmatic motivation for uttering them.
11. Typical pattern drills based on the syntactic theories referred to above are essentially noncontextualizable - that is, there is no possible context in which all of the diverse things that are included in a pattern drill could actually occur. (See the drills in most any ESL/EFL text - refer to the list of references given on p. 161 above.)
12. It is quite impossible, not just difficult, for a non-native speaker to infer pragmatic contexts correctly for the sentences in a typical syntax based pattern drill unless he happens to know already what the drill purports to be teaching.
13. Syntactically motivated pattern drills are intrinsically structured in ways that will necessarily confuse learners concerning similar forms with different meanings - there is no way for the learner to discover the pragmatic motivations for the differences in meaning.
14. The hurdle between manipulative drills and communicative use of the utterances in them (or the rules they are supposed to instill in the learner) is an artefact of meaning-less pattern drills in the first place.
15. Discrete point teaching, particularly the syntax based pattern drill approach, has been more or less directly translated into discrete point testing.
16. Contrastive linguistics contended that the difficult patterns of a particular target language could be determined in advance by a diligent and careful comparison of the native language of the learner with the target language.
17. The notion of contrastive linguistics was extended to the teaching of reading and to language tests in the claim that the points of difficulty (predicted by contrastive analysis) should be the main targets for teaching and for test items.
18. A major difficulty for contrastive linguistics is that it has never provided a very good basis for prediction. In many cases where the predictions are clear they are wrong, and in others, they are too vague to be of any empirical value.
19. Another serious difficulty is that every different native language background theoretically requires a different language test to assess knowledge of the same target language (e.g., English). Further, there seems to be no reason to expect a comparison of English and Spanish to provide a good test of either English or Spanish; rather, what appears to be required is something that can be arrived at independently of any comparison of the two languages - a test of one or the other language.
20. Many problems of test validity can be avoided if tests are labeled according to what they require examinees to do instead of according to what the test author thinks the test measures.
21. Finally, discrete point test theories require many subtests which are of questionable validity. Whether there exists a separable (and separately testable) component of vocabulary, another of grammar, and separable skills of listening, reading, writing, and so forth must be determined on an empirical basis.

DISCUSSION QUESTIONS
1. Take a pattern drill from any text. Ask what the motivation for the drill was. Consider the possibility (or impossibility) of contextualizing the drill by making obvious to the learner how the sentences of the drill might relate to realistic contexts of communication. Can it be done? If so, demonstrate how. If not, explain why not. Can contextualized or contextualizable drills be written? If so, how would the motivation for such drills differ from the motivation for noncontextualizable drills?
2. Examine pragmatically motivated drills such as those included in El espanol por el mundo (Oller, 1963-65). Study the way in which each sentence in the drill is relatable (in ways that can be and in fact are made obvious to the learner) to pragmatic contexts that are already established in the mind of the learner. How do such drills differ in focus from the drills recommended by other authors who take syntax as the starting point rather than meaning (or pragmatic mapping of utterance onto context)?
3. Work through a language test that you know to be widely used. Consider which of the subparts of the test rely on discrete point theory for their justification. What sorts of empirical studies could you propose to see if the tests in question really measure what they purport to measure?
What specific predictions would you make concerning the intercorrelations of tests with the same label as opposed to tests with different labels? What if the labels are not more than mere labels?
4. Consider the sentences of a particular pattern drill in a language that you know well. Ask whether the sentences of the drill would be apt to occur in real life. For instance, as a starter, consider the likelihood of ever having to distinguish between She's watching the baby versus She's washing the baby. Can you conceive of more likely contrasts? What contextual factors would cause you to prefer She's giving the baby a bath, or She's bathing the infant, etc.?
5. Following up on question 4, take a particular sentence from a pattern drill and try to say all that you know about its form and meaning. For instance, consider what other forms it calls to mind and what other meanings it excludes or calls to mind. Compare the associations thus developed for one sentence with the set of similar associations that come to mind in the next sentence in the drill, and the next and the next. Now, analyze some of the false associations that the drill encourages. Try to predict some of the errors that the drill will encourage learners to make. Do an observational study where the pattern drill is used in a classroom situation and see if the false associations (errors) in fact arise. Or alternatively, record the errors that are committed in response to the sentences of a particular drill and see if you can explain them after the fact in terms of the sorts of associations that are encouraged by the drill.
6. As an exercise in distributional analysis, have someone give you a simple list of similarly structured sentences in a language that you do not know. Try to segment those sentences without reference to meaning. See if you can tell where the word boundaries are, and see if it is possible to determine what the relationships between words are - i.e., try to discover the structural meanings of the utterances in the foreign language without referring to the way those utterances are pragmatically mapped onto extralinguistic contexts. Then, test the success of your attempt by asking an informant to give you a literal word for word translation (or possibly a word for morpheme or phrase translation) along with any less literal translation into English. See how close your understanding of the units of the language was to the actual structuring that can be determined on the basis of a more pragmatic analysis.
7. Take any utterance in any meaningful linguistic context and assign to it the sort of tree structure suggested by a phrase structure grammar (or a more sophisticated grammatical system if you like). Now consider the question, what additional sorts of knowledge do I normally bring to bear on the interpretation of such sentences that is not captured by the syntactic analysis just completed? Extend the question. What techniques might be used to make a learner aware of the additional sorts of clues and information that native speakers make use of in coding and decoding such utterances? The extension can be carried one step further. What techniques could be used to test the ability of learners to utilize the kinds of clues and information available to native speakers in the coding and decoding of such utterances?
8. Is there a need for pattern drills without meaning in language teaching?
If one chose to dispense with them completely, how could pattern drills be constructed so as to maximize awareness of the pragmatic consequences of each formal change in the utterances in the pattern drill? Consider the naturalness constraints proposed for pragmatic language tests. Could not similar naturalness constraints be imposed on pattern drills? What kinds of artificiality might be tolerable in such a system and what kinds would be intolerable?
9. Take any discrete test item from any discrete test. Embed it in a context (if that is possible; for some items it is very difficult to conceive of a realistic context). Consider then the degree of variability in possible choices for the isolated discrete item and for the same item in context. Which do you think would produce results most commensurate with genuine ability to communicate in the language in question? Why? Alternatively, take an item from a cloze test and isolate it from its context (this is always possible). Ask the same questions.

SUGGESTED READINGS
1. J. Stanley Ahman, and Marvin D. Glock, Measuring and Evaluating Educational Achievement 2nd Ed. Boston: Allyn and Bacon, 1975. See especially Chapters 2, 3, and 4.
2. John L. D. Clark, Foreign Language Testing: Theory and Practice. Philadelphia: Center for Curriculum Development, 1972, 25-113.
3. David P. Harris, Testing English as a Second Language. New York: McGraw Hill, 1969, 1-11, 24-54.
4. J. B. Heaton, Writing English Language Tests. London: Longman, 1975.
5. Robert J. Silverman, Joslyn K. Noa, Randall H. Russell, and John Molina, Oral Language Tests for Bilingual Students: An Evaluation of Language Dominance and Proficiency Instruments. Portland, Oregon: Center for Bilingual Education (USOE, Department of HEW), 18-28.

7 Statistical Traps

A. Sampling theory and test construction
B. Two common misinterpretations of correlations
C. Statistical procedures as the final criterion for item selection
D. Referencing tests against non-native performance

There is an old adage that 'Figures don't lie', to which some sage added, 'But liars do figure'. While in the right contexts both of these statements are true, both oversimplify the problems faced by anyone who must deal with the sorts of figures known as statistics produced by the figurers known as statisticians. Although most statistics can easily be misleading, most statisticians probably do not intend to be misleading even when they produce statistics that are likely to be misinterpreted. The book by Huff (1954), How to Lie with Statistics, would probably be more widely read if it were on the topic, How to Lie without Statistics. Statistics per se is a dry and forbidding subject. The difficulty is probably not related to deliberate deception but to the difficulty of the objects of interest - namely, certain classes of numbers known as statistics. This chapter examines several common but misleading applications of statistics and statistical concepts. We are thinking here of language testing, but the problems are in fact very general in educational testing.

A. Sampling theory and test construction

One of the misapplications of statistical notions to language testing relates to the construction of tests. Discrete point theorists have suggested that

the various 'parts' of the domain of 'language proficiency' must be defined and represented in appropriate proportions on the test ... Item analysis statistics take on a different meaning ... The concern is ...
with how well the items on the test represent the content domain (Petersen and Cartier, 1975, p. 108f) ... At first glance, it might appear that, in principle, a test of general proficiency in a foreign language should be a sample of the entire language at large. In practice, obviously, this is neither necessary nor desirable. The average native speaker gets along quite well knowing only a limited sample of the language at large, so our course and test really only need to sample that sample. (Further discussion on what to sample will follow after touching on the problem of how to sample [p. 111]).

The reader will begin to appreciate the difficulty of determining just what domains to sample (within the sample of what the native speaker knows) a couple of paragraphs farther on in the Petersen-Cartier argument:

Most language tests, including DLI's tests [i.e., the tests of the Defense Language Institute], therefore make a kind of stratified random sample, assuring by plan that some items test grammatical features, some test phonological features, some test vocabulary, and so on. Thus, for example, the DLI's English Comprehension Level tests are constructed according to a fairly complex sampling matrix which requires that specific percentages of the total number of 120 items be devoted to vocabulary, sound discrimination, grammar, idioms, listening comprehension, reading comprehension, and so on.

The authors continue to point out that there was once an attempt to determine the feasibility of establishing a universal item-selection matrix of this sort for all languages, or perhaps for all languages of a family, so that the problem of making a stratified sample for test construction purposes could be reduced to a somewhat standard procedure. However,

such a procedure has not, as yet, been found, and until it is, we must use some method for establishing a rational sample of a language in our tests (p. 112).

The solution arrived at is to 'consider the present DLI courses as rational samples of the language and to sample them ... for the item objectives in our tests of general ability' (p. 113).¹

¹ The authors indicate in personal communication that they reached this decision only 'with great reluctance'. They did not in their words 'arrive at these conclusions naively but, in fact, only with some considerable pain'. Therefore, the foregoing and following remarks should be interpreted accordingly.

Several crucial questions arise: What would a representative proportion of the phonology of the language be? How many items would be required to sample the component of phonology if such a component is in fact considered to be part of the larger domain of language proficiency? What percentage of the items on a language test should address the component of vocabulary (in order to be representative in the way that this component is sampled)? How many items should address the grammar component? What is a representative sampling of the grammatical structures of a language? How many noun phrases should it contain? Verb phrases? Left-embedded clauses? Right-branching relative clauses? One is struck by the possibility that in spite of what anyone may say, there may be no answers to these questions. The suspicion that no answers exist is heightened by the admission that no procedure has yet been found which will provide a rational basis for 'making a stratified sample'.
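The mechanics of such a sampling matrix are easy enough to state; it is the weights that resist justification. A minimal sketch (in Python, with percentage weights invented purely for illustration - they are not the DLI's actual figures) shows that once the weights are simply assumed, nothing remains but trivial arithmetic:

    # A sketch of what a stratified item-selection matrix amounts to in
    # practice. The category labels follow the quotation above; the
    # percentage weights are hypothetical - nothing in discrete point
    # theory says what they ought to be, which is precisely the problem.

    TOTAL_ITEMS = 120

    weights = {
        'vocabulary': 0.25,
        'sound discrimination': 0.10,
        'grammar': 0.25,
        'idioms': 0.10,
        'listening comprehension': 0.15,
        'reading comprehension': 0.15,
    }

    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must exhaust the test

    for category, w in weights.items():
        print(f'{category:25s} {round(w * TOTAL_ITEMS):3d} items')

Every figure in the weights table is an unsupported decision; the questions that follow concern where such figures could possibly come from.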
Indeed, if this is so, on what possible rational basis could specific percentages of a certain total number of items be determined? Clearly, the motivation for separate items aimed at 'phonological features, vocabulary, grammar, idioms, listening comprehension, reading comprehension' and so forth is some sort of discrete point analysis that is presumed to have validity independent of any particular language test or portion thereof. However, the sort of questions that must be answered concerning the possible existence of such components of language proficiency cannot be answered by merely invoking the theoretical biases that led to the hypothesized components of language proficiency in the first place. The theoretical biases themselves must be tested - not merely assumed.

The problem for anyone proposing a test construction method that relies on an attempt to produce a representative (or even a 'rational') 'sampling of the language' is not merely a matter of how to parse up the universe of 'language proficiency' into components, aspects, skills, and so forth, but once having arrived at some division by whatever methods, the most difficult problem still remains - how to recognize a representative or rational sample in any particular portion of the defined universe. It can be shown that any procedure that might be proposed to assess the representativeness of a particular sample of speech is doomed to failure because by any possible methods, all samples of real speech are either equally representative or equally unrepresentative of the universe of possible utterances. This is a natural and inevitable consequence of the fact that speech is intrinsically nonrepetitive. Even an attempt to repeat a communicative exchange by using the same utterances with the same meanings is bound to fail because the context in which utterances are created is constantly changing. Utterances cannot be perfectly repeated. Natural discourse of the sort that is characteristic of the normal uses of human languages is even less repeatable - by the same logic. The universe to be sampled from is not just very large, it is infinitely large and nonrepetitive.²

² This is not the same as saying that every sentence is 'wholly novel' or 'totally unfamiliar' - a point of view that was argued against in Chapter 6 above. To say that normal communicative events cannot be repeated is like saying that you cannot relive yesterday or even the preceding moment of time. This does not mean that there is nothing familiar about today or that there will be nothing familiar about tomorrow. It is like saying, however, that tomorrow will not be identical to today, nor is today quite like yesterday, etc.

To speak of a rational sampling of possible utterances is like speaking of a rational sampling of future events or even of historical happenings. The problem is not just where to dip into the possible sea of experience, but also how to know when to stop dipping - i.e., when a rational sample has been achieved.

To make the problem more meaningful, consider sampling the possible sentences of English. It is known that the number of sentences must be infinite unless some upper bound can be placed on the length of an allowable sentence. Suppose we exclude all sentences greater than twenty words in length. (To include them only strengthens the argument, but since we are using the method of reduction to absurdity, tying our hands in this way is a reasonable way to start.) Miller (1964) conservatively estimated that the number of grammatical twenty-word sentences in English is roughly 10²⁰. This figure derives directly from the fact that if we interrupt someone who is speaking, on the average there are about 10 words that can be used to form an appropriate grammatical continuation. The number 10²⁰ exceeds the number of seconds in 100,000,000 centuries.
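Miller's arithmetic is easy to verify. A quick sketch (the ten-continuations-per-word figure is Miller's reported average, not a measured constant):

    # Reproducing the order-of-magnitude arithmetic behind Miller's (1964)
    # estimate of the number of grammatical twenty-word English sentences.

    continuations_per_word = 10   # average grammatical one-word continuations
    sentence_length = 20

    num_sentences = continuations_per_word ** sentence_length   # 10**20

    # Seconds in 100,000,000 centuries (365.25-day years):
    seconds = 100_000_000 * 100 * 365.25 * 24 * 3600            # about 3.2e17

    print(f'{num_sentences:.1e}')             # 1.0e+20
    print(f'{seconds:.1e}')                   # 3.2e+17
    print(f'{num_sentences / seconds:.0f}')   # a few hundred

So 10²⁰ exceeds the number of seconds in 100,000,000 centuries by a factor of several hundred.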
Petersen and Cartier (1975) suggested that one way around the sampling problem was to 'sample the sample' of language that happened to appear in the particular course of study at the Defense Language Institute. Rather than trying to determine what would be a representative sample of the course, suppose we just took the whole course of an estimated '30 hours a week for 47 weeks' (p. 112). Suppose further that we make the contrary to fact assumption that in the course students are exposed to a minimum of one twenty-word sentence each second. If we then considered the entire course of study as a kind of language test, it would only be possible for such a test to cover about five million sentences - less than .000000000001 percent (one trillionth of one percent) of the possible twenty-word sentences in English. In what realistic sense could this 'sample' be considered representative of the whole?

We are confronted here not just with a difficulty of how to select a 'sample', because it doesn't make any difference how we select the sample. It can never be argued that any possible sample of sentences is representative of the whole. If we take a larger unit of discourse as our basic working unit, the difficulty of obtaining a representative sample becomes far greater since the number of possible discourses is many orders of magnitude larger than the number of possible twenty-word sentences. We are thus forced to the conclusion that the discrete point sampling notion is about as applicable to the problem of constructing a good language test as the method of listing sentences is to the characterization of the grammar of a particular natural language. No list will ever cover enough of the language to be interesting as a theory of grammar. It is fundamentally inadequate in principle - not just incomplete. Such a list could never be completed, and a test modeled after the same fashion could never be long enough even if we extended it for the duration of a person's life expectancy. In this vein, consider the fact that the number of twenty-word sentences by itself is about a thousand times larger than the estimated age of the earth in seconds (Miller, 1964).

The escape from the problem of sampling theory is suggested in the statement by Petersen and Cartier (1975) that 'the average native speaker gets along quite well knowing only a limited sample of the language at large, so our course and test really only need to sample that sample' (p. 111). Actually, the native speaker knows far more than he is ever observed to perform with his language. The conversations that a person has with others are but a small fraction of the conversations that one could have if one chose to do so. Similarly, the utterances that one comprehends are but an infinitesimal fraction of the utterances that one could understand if they were ever presented.
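The coverage figure in the first paragraph above can likewise be checked in a few lines (the one-sentence-per-second exposure rate is the contrary to fact assumption made there):

    # Checking the course-coverage arithmetic for the hypothetical course
    # 'sample' of '30 hours a week for 47 weeks'.

    course_seconds = 30 * 47 * 3600    # 5,076,000 seconds of instruction
    sentences_seen = course_seconds    # one twenty-word sentence per second (assumed)

    possible_sentences = 10 ** 20      # Miller's estimate

    coverage_percent = sentences_seen / possible_sentences * 100
    print(f'{sentences_seen:,}')       # 5,076,000 - about five million
    print(f'{coverage_percent:.0e}')   # on the order of 1e-12 percent

The course 'sample', in other words, covers a few trillionths of one percent of the possible twenty-word sentences alone.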
Indeed, the native speaker knows far more of his language than he will ever be observed to use - even if he talks a lot and if we observe him all the time. This was the original motivation for the distinction between competence and performance in the Chomskyan linguistics of the 1950s and 1960s.

The solution to the difficulty is not to shift our attention from many utterances (say, all of the ones uttered or understood by a given native speaker) to fewer utterances (say, all of those utterances presented to a language learner in a particular course of study at the DLI), but to remove our attention to an entirely different sort of object - namely, the grammatical system (call it a cognitive network, generative grammar, expectancy grammar, or interlanguage system) that the learner is in the process of internalizing and which, when it is mature (i.e., like that of the native speaker), will generate not only the few meaningful utterances that happen to occur in a particular segment of time, but the many that the native speaker can say and understand. Instead of trying to construct a language test that will 'representatively' or 'rationally sample' the universe of 'language', we should simply construct a test that requires the language learner to do what native speakers do with discourse (perhaps any discourse will do). Then the interpretation of the test is related not to the particular discourse that we happened to select, nor even to the universe of possible discourses in the sense of sampling theory, but rather to the efficiency of the learner's internalized grammatical system in processing discourse. The validity of the test is related to how well it enables us to predict the learner's performance in other discourse processing tasks. We can differentiate between segments of discourse that are easy to process and segments that are more difficult to process. Further, we can differentiate between segments of discourse that would be appropriate to the subject matter of mathematics, as opposed to say, geography, or gardening as opposed to architecture. The selection of segments of discourse appropriate to a particular learner or group of learners would depend largely on the kinds of things that would be expected later on of the same learner or group. Perhaps in this sense it would be possible to 'sample' types of discourse - but this is a very different sort of sampling than 'sampling' the phonological contrasts of a language, say, in appropriate proportion to the vocabulary items, grammatical rules, and the like. A language user can select portions of college level texts for cloze tests, but this is not the sort of 'sampling' that we have been arguing against. To make the selection analogous to the sort of sampling we have been arguing against is in principle impossible, not just indefensible; one would have to search for prose with a specific proportion of certain phonological contrasts, vocabulary items, grammatical rules and the like in relation to a course syllabus for teaching the language of the test.

In short, sampling theory is either inapplicable or not needed. Where it might seem to apply, e.g., in the case of phonological contrasts, or vocabulary, it is not at all clear how to justify the weighting of subtests in relation to each other. Where sampling theory is clearly inapplicable, e.g., at the sentence or discourse levels, it is also obviously not needed.
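By contrast, turning a selected passage into a cloze test requires no weighting decisions at all. A minimal sketch of the standard every-nth-word deletion procedure (the passage and the deletion rate are arbitrary choices here, not prescriptions):

    # A minimal every-nth-word cloze constructor - the kind of 'selection'
    # that involves no stratified sampling matrix whatever.

    def make_cloze(text, nth=7, blank='______'):
        '''Replace every nth word with a blank; return test text and key.'''
        words = text.split()
        key = []
        for i in range(nth - 1, len(words), nth):
            key.append(words[i])
            words[i] = blank
        return ' '.join(words), key

    passage = ('The validity of the test is related to how well it enables '
               'us to predict performance in other discourse processing tasks.')
    test_text, answer_key = make_cloze(passage, nth=7)
    print(test_text)
    print(answer_key)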
No elaborate sampling technique is needed to determine whether a learner can read college texts at a defined level of comprehension - nor is sampling theory necessary to the definition of the degree of comprehension that a given learner may exhibit.

B. Two common misinterpretations of correlations

Correlations between language tests have sometimes been misinterpreted in two ways: first, low correlations have sometimes been taken to mean that two tests with different labels are in fact measuring different skills, or aspects or components of skills; and, second, high correlations between dissimilar language processing tasks have sometimes been interpreted as indicating mere reliability or even a lack of validity. Depending upon the starting assumptions of a particular theoretical viewpoint, the same statistics may yield different interpretations - or may seem to support different, even mutually contradictory conclusions. When such contradictions arise, either one or both of the contradictory viewpoints must be wrong or poorly articulated. For instance, it is not reasonable to interpret a low correlation between two tests as an indication that the two tests are measuring different skills and also that neither of them is reliable (or that one of them is not reliable). Since reliability is a prerequisite to validity, a given statistic cannot be taken as an indication of low reliability and high validity (yet this is sometimes suggested in the literature as we will see below). Similarly, a high correlation between two dissimilar language tests cannot be dismissed as a case of high reliability but low validity all in the same breath. This latter argument, however, is more complex, so we will take the simpler case first.

Suppose that the correlation between a particular test that bears the label 'Grammar' (or 'English Structure') and another language test that bears the label 'Vocabulary' (or 'Lexical Knowledge') is observed to be comparatively low, say, .40. Can it be concluded from such a low correlation that the two tests are therefore measuring different skills? Can we say on the basis of the low correlation between them that the so-called 'Grammar' test is a test of a 'grammar' component of language proficiency while the so-called
It could also result if one of the tests were unreliable; or if one of them were poorly calibrated with respect to the tested subjects (i.e., too easy or too difficult); or if one of the tests were in fact a poor measure of what it purported to measure even though it might be reliable; and so forth. In any case, a low correlation between two tests (even if it is expected on the basis of some theoretical reasoning) is relatively uninformative. It certainly cannot be taken as an indication of the validity of the correlated tests. Consider using recitation of poetry as a measure of empathy and quality of artistic taste as a measure of independence - would a low correlation between the two measures be a basis for claiming that one or both must be valid? Would a low correlation justify the assertion that the two tests are measures of different factors? The point is that low correlations would be expected if neither test were a measure of anything at all. A low correlation between a 'Grammar' test and a 'Vocabulary' test might well be the product of poor tests rather than an independence of hypothesized components of proficiency. In fact, many failures to achieve very high correlations would not prove by any means that very high correlations do not in fact exist between something we might call 'knowledge of vocabulary' and something else that we might call 'knowledge of grammar'. Indeed the two kinds of knowledge might be one kind in reality and no number of low correlations between language tests labeled 'Grammar' tests on the one hand and 'Vocabulary' tests on the other would suffice to exclude such a possibility - and so on for all possible dichotomies. In fact the observation of low correlations between language tests where high correlations might be expected are a little like fishing trips which produce small catches where large catches were expected. Small catches do not prove that big catches do not exist. The larger catches may simply be beyond the depths of previously used nets. To carry the example a little farther, consider the following remark by Bolinger (1975) in his introductory text on linguistics: 'There is a vast amount of grammatical detail still to be dug out of the lexiconso much that by the time we are through there may be little point in talking about grammar and lexicon as if they were two different things' (p. 299). Take a relatively common word in English like SPEAKER A: SPEAKER B : 'But do you think they'll go on building?' 'Yes, I do because the contractor has to meet 'his deadline. I think _ _ _ _ _ _ _ _ __ (A) the people continue to will (B) will they want to continue (C) to continue the people will (D) they will want to continue In an important sense, knowing a word is knowing how to use it in a meaningful context, a context that is subject to the normal syntactic (and other) constraints of a particular language. Does it make sense then to insist on testing word-knowledge independent of the constraints that govern the relationships between words in discourse ? Is it possible? Even if it does turn out to be possible, the proof that it has been accomplished will have to come from sources of evidence other than mere low correlations between tests labeled 'Grammar' tests and tests called 'Vocabulary' tests. In particular, it will have to be shown that there exists some unique and meaningful variance associated with tests of the one type that is not also associated with tests of the other type - this has not yet been done. 
In spite of the foregoing considerations, some researchers have contended that relatively low correlations between different language tests have more substantial meaning. For instance, the College Entrance Examination Board and Educational Testing Service recommended in their 1969 Manual for Studies in Support of Score Interpretation (for the TOEFL) that it may be desirable to 'study the intercorrelations among the parts [of the TOEFL] to determine the extent to which they are in fact measuring different abilities for the group tested' (p. 6).

The hint that low correlations might be taken as evidence that different subtests are in fact measuring different components of language proficiency or different skills is confirmed in two other reports. For instance, on the basis of the data in Table 1, the authors of the TOEFL Interpretive Information (Revised 1968) conclude: 'it appears that Listening Comprehension is measuring some aspect of English proficiency different from that measured by the other four parts, since the correlations of Listening Comprehension with each of the others are the four lowest in the table' (p. 14).

TABLE 1
Intercorrelations of the Part Scores on the Test of English as a Foreign Language, Averaged over Forms Administered through April 1967. (From Manual for TOEFL Score Recipients. Copyright © 1973 by Educational Testing Service. All rights reserved. Reprinted by permission.)

Subscores                     (2)   (3)   (4)   (5)
(1) Listening Comprehension   .62   .53   .63   .55
(2) English Structure               .73   .66   .79
(3) Vocabulary                            .70   .77
(4) Reading Comprehension                       .72
(5) Writing Ability

Later, in an update of the same interpretive publication, Manual for TOEFL Score Recipients (1973), on the basis of a similar table (see Table 2 below), it is suggested that 'the correlations between Listening Comprehension and the other parts of the test are the lowest. This is probably because skill in listening comprehension may be quite independent of skills in reading and writing; also it is not possible to standardize the administration of the Listening Comprehension section to the same degree as the other parts of the test' (p. 15).

TABLE 2
Intercorrelations of the Part Scores on the Test of English as a Foreign Language, Averaged over Administrations from October 1966 through June 1971. (From Manual for TOEFL Score Recipients. Copyright © 1973 by Educational Testing Service. All rights reserved. Reprinted by permission.)

Subscores                     (2)   (3)   (4)   (5)
(1) Listening Comprehension   .64   .56   .65   .60
(2) English Structure               .72   .67   .78
(3) Vocabulary                            .69   .74
(4) Reading Comprehension                       .72
(5) Writing Ability

Here the authors offer what amount to mutually exclusive explanations. Both cannot be correct.
If the low correlations between Listening Comprehension and the other subtests are the product of unreliability in the techniques for giving the Listening test (i.e., poor sound reproduction in some testing centers, or merely inconsistent procedures), then it is not reasonable to use the same low correlations as evidence that the Listening test is validly measuring something that the other subtests are not measuring. What might that something be? Clearly, if it is lack of consistency in the procedure that produces the low correlations, it is not listening ability as distinct from reading, or vocabulary knowledge, or grammar knowledge, etc. On the other hand, if the low correlations are produced by real differences in the underlying skills presumed to be tested, the administrative procedures for the Listening test must have substantial reliability. It just can't go both ways.

In the 1973 manual, the authors continue the argument that the tests are in fact measuring different skills by noting that 'none of the correlations ... [in our Table 2] are as high as the reliabilities of the part scores', from which they conclude that 'each part is contributing something unique to the total score' (p. 15). The question that is still unanswered is what that 'something unique' is, and whether in the case of each subtest it is in any way related to the label on that subtest. Is the Reading Comprehension subtest more of a measure of reading ability than it is of writing ability or grammatical knowledge or vocabulary knowledge or mere test-taking ability or a general proficiency factor or intelligence? The fact that the reliability coefficients are higher than the correlations between different part scores is no proof that the tests are measuring different kinds of knowledge. In fact, they may be measuring the same kinds of knowledge, and their low intercorrelations may indicate merely that they are not doing it as well as they could. In any event, it is axiomatic that validity cannot exceed reliability - indeed the general rule of thumb is that validity coefficients are not expected to exceed the square of the reliabilities of the intercorrelated tests (Tate, 1965). If a certain test has some error variance in it and a certain other test also has some error variance in it, the error is apt to be compounded in their intercorrelation. Therefore, the correlation between two tests can hardly be expected to exceed their separate reliabilities. It can equal them only in the very special case that the tests are measuring exactly the same thing.

From all of the foregoing, it is possible to see that low (or at least relatively low) correlations between different language tests can be interpreted as indications of low reliability or validity, but hardly as proof that the tests are measuring different things. If one makes the mistake of interpreting low correlations as evidence that the tests in question are measuring different things, how will one interpret higher correlations when they are in fact observed between equally diverse tests? For instance, if the somewhat lower correlations between the TOEFL Listening Comprehension subtest and the Reading Comprehension subtest (as compared against the intercorrelations between the Reading Comprehension subtest and the other subtests) represented in Tables 1 and 2 above are taken as evidence that the Listening test measures some skill that the Reading test does not measure, and vice versa, then how can we explain the fact that the Listening Comprehension subtest correlates more strongly with a cloze test (usually regarded as a reading comprehension measure) than the latter does with the Reading Comprehension subtest (see Darnell, 1968)?
Once high correlations between apparently diverse tests are discovered, the previous interpretations of low correlations as indicators of a lack of relationship between whatever skills the tests are presumed to measure are about as convincing as the arguments of the unsuccessful fisherman who said there were no fish to be caught. The fact is that the fish are in hand. Surprisingly high correlations have been observed between a wide variety of testing techniques with a wide range of tested populations. The techniques range over a whole family of procedures under the general rubric of cloze testing, dictation, elicited imitation, essay writing, and oral interview. Populations have ranged from children and adult second language learners, to children and adults tested in their native language (see the Appendix).

What then can be made of such high correlations? Two interpretations have been offered. One of them argues that the strong correlations previously observed between cloze and dictation, for instance, are merely indications of the reliability of both procedures and proof in fact that they are both measuring basically the same thing - further, that one of them is therefore not needed since they both give the same information. A second interpretation is that the high correlations between diverse tests must be taken as evidence not only of reliability but also of substantial test validity. In the first case, it is argued that part scores on a language proficiency test should produce low intercorrelations in order to attain validity, and in the second that just the reverse is true. It would seem that both positions cannot be correct.

It is easy to see that the expectation of low correlations between tests that purport to measure different skills, components, or aspects of language proficiency in accord with discrete point test philosophy necessitates some method of explaining away high correlations when they occur. The solution of treating the correlations merely as indications of test reliability, as Rand (1972) has done, runs into very serious logical trouble. Why should we expect a dictation which requires auditory processing of sequences of spoken material to measure the same thing as a cloze test which requires the learner to read prose and replace missing words? To say that these tests are tests of the same thing, or to interpret a high correlation between them as an indication of reliability (alone, and not something more) is to saw off the limb on which the whole of discrete point theory is perched. If cloze procedure is not different from dictation, then what is the difference between speaking and listening skills? What basis could be offered for distinguishing a reading test from a grammar test? Are such tasks more dissimilar than cloze and dictation? If we were to follow this line of reasoning just a short step further, we would be forced to conclude that low correlations between language tests of any sort are indicators of low reliabilities perforce. This is a conclusion that no discrete point theorist, however, could entertain, as it obliterates all of the distinctions that are crucial to discrete point testing.
However, if the competing viewpoint of pragmatic or integrative test philosophy turns out to be more correct, test constructors should interpret low correlations as probable indicators oflow validities and should seek to construct language tests that maximize the intercorrelation of part scores. This does not mean that what is required is a test that consists of only one sort of task (e.g., reading without speaking, listening, or writing). On the contrary, unless one is interested only in ability to read with comprehension or something of the sort, t6 learn how well an individual understands, speaks, reads, and writes a language, it may well be necessary (or at least highly desirable) to require him to do all four sorts of performances. The question here is not which or how many tasks should be included on a comprehensive language test, but what sort of interrelationship between performances on language tasks in general should be expected. If an individual happens to be much better at listening tasks than at speaking tasks, or at reading and writing tasks than at speaking and listening tasks, we would be much more apt t6 discover this fact with valid language tests than with non-valid ones. However, the case of a particular individual, who may show marked differences in ability to perform different language tasks, is not an argument against the possibility of a very high correlation between those same tasks for an entire population of subjects, or for subjects in general. What if we go down the other road? What' if we assume that part scores on language tests that intercorrelate highly are therefore redundant and that one of the highly correlated test scores should be eliminated from the test? The next step would be to look for some new subtest (~r to construct one or more than one) which would assess some different component oflanguage skill not included in the redundant tests. In addition to sawing the limb out from under the discrete point test philosophy, we would be making a fundamental error in the definition of reliability versus validity. Furtherm'ore, we would be faced with the difficult (and probably intrjnsically insoluble) problem oftrying to decide how much weight to assign to which subskill, component, or aspect, etc. We have discussed this latter difficulty in section A above. The matter of confusing reliability and validity is the second point a / / t STATISTICAL TRAPS 195 to be attended to in this section. Among the methods for assessing test reliability are: the test-retest method; the technique of correlating one half ofa test with the other half of the same test (e.g., correlating the average score on even numbered items with the average on odd numbered items for each presumed homogeneous portion of a test); and the alternate forms method (e.g., correlating different forms of the same test). In all of these cases, and in fact in all other measures of reliability (Kuder-Richardson formulas and other internal consistency measures included), reliability by definition has ts> do with tests or portions of tests that are in some sen~e the same or fundamentally similar. To interpret high correlations between substantially different tests, or tests that require the performance of substantially different tasks, as mere indicators of reliability is to redefine reliability in an unrecognizable way. If one accepts such a definition, then how will one ever distInguish between measures of reliability and measures of validity? The distinction, which is a necessary one, evaporates. 
In the case of language tests that require the performance of substantially different tasks, to interpret a high correlation between them as an indication of reliability alone is to treat the tasks as the same when they are not, and to ignore the possibility that even more diverse pragmatic language tasks may be equally closely related. In the case of language tests, high correlations are probably the result of an underlying proficiency factor that relates to a psychologically real grammatical system. If such a factor exists, the ultimate test of validity of any language test is whether or not it taps that factor, and how well it does so. The higher the correlations obtained between diverse tasks, the stronger the confidence that they are in fact tapping such a factor. The reasoning may seem circular, but the circularity is only apparent. There are independent reasons for postulating the underlying grammatical system (or expectancy grammar) and there are still other bases for determining what a particular language test measures (e.g., error analysis). The crucial empirical test for the existence of a psychologically real grammar is in fact performance on language tests (or call them tasks) of different sorts. Similarly, the chief criterion of validity for any proposed language test is how well it assesses the efficiency of the learner's internalized grammatical system (or in the terms of Part One of this book, the learner's expectancy system). On the basis of present research (see Oller and Perkins, 1978) it seems likely that Chomsky (1972) was correct in arguing that language abilities are central to human intelligence. Further, as is discussed in greater detail in the Appendix, it is apparently the case that language ability, school achievement, and IQ all constitute a relatively unitary factor. However, even if this latter conclusion were not sustained by further research, the discrete point interpretations of correlations as discussed above would still have to be radically revised. The problems there are logical and analytical, whereas the unitary factor hypothesis is an empirical issue that requires experimental study.

C. Statistical procedures as the final criterion for item selection

Perhaps because of its distinctive jargon, perhaps because of its bristling mathematical formulas, or perhaps out of blind respect for things that one does not understand fully, statistical procedures (like budgets and curricula as we noted in Chapter 4) are sometimes elevated from the status of slaves to educational purposes to the status of masters which define the purposes instead of serving them. In Chapter 9 below we return to the topic of item analysis in greater detail. Here it is necessary to define briefly the two item statistics on which the fate of most test items is usually decided - i.e., whether to include a particular item, exclude it, or possibly rewrite it and pretest it again. The first item statistic is the simple percentage of students answering correctly (item facility) or the percentage answering incorrectly (item difficulty). For the sake of parsimony we will speak only of item facility (IF) - in fact, a little reflection will show that item difficulty is merely another way of expressing the same numerical value that is expressed as IF. The second item statistic that is usually used by professional testers in evaluating the efficiency of an item is item discrimination (ID).
This latter statistic has to do with how well the item tends to separate the examinees who are more proficient at the task in question from those examinees who are less proficient. There are numerous formulas for computing IDs, but all of them are in fact methods of measuring the degree of correlation between the tendency to get high and low scores on the total test (or subtest) and the tendency to answer correctly or incorrectly on a particular test item. The necessary assumption is that the whole test (or subtest) is apt to be a better measure of whatever the item attempts to measure (and the item can be considered a kind of miniature test) than any single item. If a given item is valid (by this criterion) it must correlate positively and significantly with the total test. If it does so, it is possible to conclude that high scorers on the test will tend to answer the item in question correctly more frequently than do low scorers; if the correlation between the item and the test is negative, the relationship is reversed - examinees who get high scores on the test will tend to miss the item more frequently than examinees who get low scores on the test.

For reasons that are discussed in more detail in Chapter 9, items with very low or very high IFs and/or items with very low or negative IDs are usually discarded from tests. Very simply, items that are too easy or too hard provide little or no information about the range of proficiencies in a particular group of examinees, and items that have nil or negative IDs either contribute nothing to the total amount of meaningful variance in the test (i.e., the tendency of the test to spread the examinees over a scale ranging from less proficient to more proficient) or they in fact tend to depress the meaningful variance by cancelling out some of it (in the case of negative IDs).

Probably any multiple choice test, or other test that is susceptible to item analysis, can be significantly improved by the application of the above criteria for eliminating weak items. In fact, as is argued in Chapter 9, multiple choice tests which have not been subjected to the requirements of such item analyses should probably not be used for the purposes of making educational decisions - unfortunately, they are used for such purposes in many educational contexts. The appropriate use of item analysis then is to eliminate (or at least to flag for revision) items that are for whatever reason inconsistent with the test as a whole, or items that are not calibrated appropriately to the level of proficiency of the population to be tested.

But what about the items that are left unscathed by such analyses? What about the items that seem to be appropriate in IF and ID? Are they necessarily, therefore, valid items? If such statistics can be used as methods for eliminating weak items, why not use them as the final criteria for judging the items which are not eliminated as valid - once and for all? There are several reasons why acceptable item statistics cannot be used as the final basis for judging the validity of items to be included in tests. It is necessary that test items conform to minimal requirements of IF and ID, but even if they do, this is not a sufficient basis for judging the items to be 'valid' in any fundamental sense.
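Both statistics can be computed in a few lines. A sketch (the response matrix is fabricated; ID is taken here as the correlation of item scores with total scores - one common choice among the numerous formulas mentioned above):

    # Item facility (IF) and item discrimination (ID) for a small
    # fabricated response matrix. ID is computed as the point-biserial
    # correlation of each item's 0/1 scores with examinees' total scores.

    from statistics import correlation, mean   # Python 3.10+

    # rows = examinees, columns = items (1 = correct, 0 = incorrect)
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
    ]

    totals = [sum(row) for row in responses]

    for j in range(len(responses[0])):
        item_scores = [row[j] for row in responses]
        IF = mean(item_scores)                  # proportion answering correctly
        ID = correlation(item_scores, totals)   # correlation with total score
        print(f'Item {j + 1}: IF = {IF:.2f}, ID = {ID:+.2f}')

(The item's own score is included in the total here; some formulas correct for that inclusion.)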
One of the reasons that item statistics cannot be used as final criteria for item selection - and perhaps the most fundamental reason - relates to the assumption on which ID is based. Suppose that a certain Reading Comprehension test (or one that bears the label) is a fairly poor test of what it purports to measure. It follows that even items that correlate perfectly with the total score on such a test must also be poor items. For instance, if the test were really a measure of the learner's ability to recall dates or numbers mentioned in the reading selection and to do simple arithmetic operations to derive new dates, the items with the highest IDs might in actuality be the ones with the lowest validities (as tests of reading comprehension).

Another reason that item statistics cannot be relied on for the final selection of test items is that they may in fact push the test in a direction that it should not go. For example, suppose that one wants to test knowledge of words needed for college-level reading of texts in mathematics. (We ignore for the sake of the argument at this point the question of whether a 'vocabulary' test as distinct from other types of tests is really more of a measure of vocabulary knowledge than of other things. Indeed, for the sake of the argument at this point, let us assume that a valid test of 'vocabulary knowledge' can be constructed.) By selecting words from mathematics texts, we might construct a satisfactory test. But suppose that for whatever reason certain words like ogle, rustle, shimmer, sheen, chiffonier, spouse, fettered, prune, and pester creep into the examination, and suppose further that they all produce acceptable item statistics. Are they therefore acceptable items for a test that purports to measure the vocabulary necessary to the reading of mathematics texts? Or, to change the example radically, do words like ogle and chiffonier belong in the TOEFL Vocabulary subtest? With the right field of distractors, either of these might produce quite acceptable item statistics - indeed the first member of the pair did appear in the TOEFL. Is it a word that foreign students applying to American universities need to know? As a certain Mr Jones pointed out, it did produce very acceptable item statistics. Should it therefore be included in the test? If such items are allowed to stand, then what is to prevent a test from gravitating further and further away from common forms and usages to more and more esoteric terms that produce acceptable item statistics?

To return to the example concerning the comprehension of words in mathematics texts, it is conceivable that a certain population of very capable students will know all the words included in the vocabulary test. Therefore, the items would turn out to be too easy by
Not necessarily, because it is possible that the subjects to be tested do not know any of the words in the mathematics texts - e.g., they may not know the language of the texts. To claim therefore that the test is not valid and/or that it is too difficult is like claiming that a tape measure is not a good measure of length, and/or that it is not short enough, because it cannot be used to measure the distance between adjacent points on a line. For all of the foregoing reasons, test item statistics alone cannot be used to select or (in the limiting cases just discussed) to reject items. The interpretation of test item statistics must be subordinate to higher considerations. Once a test is available which has certain independent claims to validity, item analysis may be a useful tool for refining that test and for attaining slightly (perhaps significantly) higher levels of validity. Such statistics are by no means, however, final criteria for the selection of items. D. Referencing tests against non-native performance The evolution of a test or testing procedure is largely determined by the assumptions on which that test or procedure is based. This is particularly true of institutional or standardized tests because of their longer survival expectancy as compared against classroom tests that are usually used only once and in only one form. Until now, discrete point test philosophy has been the principal basis underlying standardized tests of all sorts. Since discrete point theories of language testing were generally articulated in relation to the performance of non-native speakers of a . given target language, most of the language tests based on such theorizing have been developed in reference to the performance of non-native speakers of the language in question. Generally, this has been justified on the basis of the assumption that native speakers 1 200 LANGUAGE TESTS AT SCHOOL ShQUld perfQrm flawlessly, Qr nearly sO', Qn language t~sks that are nQrmally included in such tests. HQwever, native' speakers Qf a language dO' vary in their ability to' perfQrm language related tasks. A child Qf six years may be just as much a native speaker Qf English as an adult Qf age twenty-five, but we dO' nQt expect the child to' be able to' dO' all Qf the things with English that we may expect Qf the adult hence, there are differences in skill attributable to' age Qr maturatiQn. Neither dO' we expect an illiterate farmer whO' has nQt had the educatiQnal QPPQrtunities Qf an urbanite Qf cQmparable abilities to' be able tQ"read at the same grade level and with equal cQmprehensiQnhence, there are differences due to' educatiQn and experience. FurthermQre, recent empirical ,research, especially Stump (1978) has shQwn that nQrmal native speakers dO' vary greatly in proficiency and that this variance may be identical with what has fQrmerly been called IQ and/or achievement. Thus, we must cQnclude that there is a real chQice: language tests can either be referenced against the perfQrmance Qf native speakers Qr they may be referenced against the perfQrmance Qf nQn-native speakers. Put mQre cQncretely, the effectiveness Qf a test itetn (Qr a subtest, Qr an entire battery Qftests) may be judged in terms QfhQW it furictiQns with natives Qr nQn-natives in producing a range Qf SCQresQr in prQducing meaningful variance between better perfQrl)1ers and WQrse performers. If a nQn-native reference PQPulatiQn is used, test writers will tend to' prepare items that maximize the variability within that PQPulatiQn. 
If native speakers are selected as a reference population, test writers will tend to arrange items so as to maximize the variability within that population. Or more accurately, the test writers will attempt to make the test(s) in either case as sensitive as possible to the variance in language proficiency that is actually characteristic of the population against which the test is referenced.

In general, the attempt to maximize the sensitivity of a test to true variabilities in tested populations is desirable. This is what test validity is about. The rub comes from the fact that in the case of language tests, the ability of non-native speakers to answer certain discrete test items correctly may be unrelated to the kinds of ability that native speakers display when they use language in normal contexts of communication.

There are a number of salient differences between the performance of native speakers of a given language and the performance of non-natives who are at various stages of development in acquiring the same language as a second or foreign language. Among the differences is the fact that native speakers generally make fewer errors, less severe errors, and errors which have no relationship to another language system (i.e., native speakers do not have foreign accents, nor do they tend to make errors that originate in the syntactic, semantic, or pragmatic system of a competing language). Native speakers are typically able to process material in their native language that is richer in organizational complexities than the typical non-native can handle (other things such as age, educational experience, and the like being equal). Non-natives have difficulty in achieving the same level of skill that native speakers typically exhibit in handling jokes, puns, riddles, irony, sarcasm, facetious humor, hyperbole, double entendre, subtle innuendo, and so forth. Highly skilled native speakers are less susceptible to false analogies (e.g., pronounciation for pronunciation, ask it to him on analogy with forms like tell it to him) and are more capable of making the appropriate analogies afforded by the richness of their own linguistic system (e.g., the use of parallel phrasing across sentences, the use of metaphors, similes, contrasts, comparisons, and so on).

Because of the contrasts in native and non-native performance which we have noted above, and many others, the effect of referencing a test against one population rather than the other may be quite significant. Suppose the decision is made to use non-natives as a reference population - as, for instance, the TOEFL test writers decided to do in the early 1960s. What will be the effect on the eventual form of the test items? How will they be apt to differ from test items that are referenced against the performance of native speakers? If the variance in the performance of natives is not completely similar to the variance in the performance of non-natives, it follows that items which work well in relation to the variance in one will not necessarily work well in relation to the variance in the other. In fact, we should predict that some of the items that are easy for native speakers should be difficult for non-natives and vice versa - some of the items that are easy for non-natives should be more difficult for native speakers. This last prediction seems anomalous. Why should non-native speakers be able to perform better on any language test item than native speakers?
From the point of view of a sound theory of psycholinguistics, the fact is that native speakers should always outperform non-natives, other things being equal. However, if a given test of language proficiency is referenced against the performance of non-native speakers, and if the variance in their performance is different from the variance in the performance of natives, it follows that some of the items in the test will tend to gravitate toward portions of variance in the reference population that are not characteristic of normal language use by native speakers. Hence, some of the items on a test referenced against non-native performance will be more difficult for natives than for non-natives, and many of the items on such tests may have little or nothing to do with actual ability to communicate in the tested language.

Why is there reason to expect variance in the language skills of non-native speakers to be somewhat skewed as compared against the variance in native performance (due to age, education, and the like)? For one thing, many non-native speakers - perhaps most non-natives who were the reference populations for tests like the TOEFL, the Modern Language Association Tests, and many other foreign language tests - are exposed to the target language primarily in somewhat artificial classroom contexts. Further, they are exposed principally to materials (syntax based pattern drills, for instance) that are founded on discrete point theories of teaching and analyzing languages. They are encouraged to form generalizations about the nature of the target language that would be very uncharacteristic of native speaker intuitions about how to say and mean things in that same language. No native speaker, for example, would be apt to confuse going to a dance in a car with going to a dance with a girl, but non-natives may be led into such confusions. Forms like going to a foreign country with an airplane and going to a foreign country in an airplane are often confused due to classroom experience - see Chapter 6, section C.

The kinds of false analogies, or incorrect generalizations, that non-natives are lured into by poorly conceived materials combined with good teaching might be construed as the basis for what could be called a kind of freak grammar - that is, a grammar that is suited only for the rather odd contexts of certain teaching materials and that is quite ill-suited for the natural contexts of communication. If a test then is aimed at the variance in performance that is generated by more or less effective internalization of such a freak grammar, it should not be surprising that some of the items which are sensitive to the knowledge that such a grammar expresses would be impervious to the knowledge that a more normal (i.e., native speaker) grammar specifies. Similarly, tests that are sensitive to the variance in natural grammars might well be insensitive to some of the kinds of discrete point knowledge characteristically taught in language materials.

If the foregoing predictions were correct, interesting contrasts in native and non-native performance on tests such as the TOEFL should be experimentally demonstrable. In a study by Angoff and Sharon (1971), a group of native speaking college students at the University of New Mexico performed less well than non-natives on 21% of the items in the Writing Ability section of that examination.
The Writing Ability subtest of the TOEFL consists primarily of items aimed at assessing knowledge of discrete aspects of grammar, style, and usage. The fact that some items are harder for natives than for non-natives draws into question the validity of those items as measures of knowledge that native speakers possess. Apparently some of the items are in fact sensitive to things that non-natives are taught but that native speakers do not normally learn. If the test were normed against the performance of native speakers in the first place, this sort of anomaly could not arise. By this logic, native performance is a more valid criterion against which to judge the effectiveness of test items than non-native performance is.

Another sense in which the performance of non-native speakers may be skewed (i.e., characteristic of unusual or freak grammars) is in the relationship between skills, aspects of skills, and components of skills. For instance, the fact that scores on a test of listening comprehension correlate less strongly with written tests of reading comprehension, grammatical knowledge (of the discrete point sort), and so-called writing ability (as assessed by the TOEFL, for instance), than the latter tests correlate with each other (see Tables 1 and 2 in section B above) may be a function of experiential bias rather than a consequence of the factorial structure of language proficiency. Many non-native speakers who are tested on the TOEFL, for example, may not have been exposed to models who speak fluent American English. Furthermore, it may well be that experience with such fluent models is essential to the development of listening skill hand in hand with speaking, reading, and writing abilities. Hence, it is possible that the true correlation between skills, aspects, and components of skills is much higher under normal circumstances than has often been assumed. Further, the best approach if this is true would be to make the items and tests maximally sensitive to the meaningful variance present in native speaker performance (e.g., that sort of variance that is due to normal maturation and experience).

In short, referencing tests against the performance of non-native speakers, though statistically an impeccable decision, is hardly defensible from the vantage point of deeper principles of validity and practicality.
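The screening procedure implied by this logic is simple enough to state in code. The sketch below (Python; the item names and pass rates are invented for illustration, not data from the TOEFL or any other test) flags any item that natives answer correctly less often than non-natives - the sort of anomaly reported by Angoff and Sharon (1971).

```python
# Minimal sketch: flag items that are HARDER for native speakers than
# for non-natives - the anomaly Angoff and Sharon (1971) observed on
# 21% of TOEFL Writing Ability items. All figures here are invented.

items = {
    # item id: (proportion of natives correct, proportion of non-natives correct)
    "item_01": (0.98, 0.61),
    "item_02": (0.95, 0.72),
    "item_03": (0.58, 0.74),   # anomalous: natives do worse
    "item_04": (0.47, 0.66),   # anomalous: natives do worse
}

anomalous = [name for name, (nat, non) in items.items() if nat < non]

print(f"{len(anomalous)} of {len(items)} items "
      f"({100 * len(anomalous) / len(items):.0f}%) are harder for natives:")
for name in anomalous:
    nat, non = items[name]
    print(f"  {name}: natives {nat:.2f} vs non-natives {non:.2f}")
```

On the argument of this section, any item caught by such a screen is measuring something other than the proficiency that native speakers, by definition, possess.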
In a fundamental and indisputable sense, native speaker performance is the criterion against which all language tests must be validated because it is the only observable criterion in terms of which language proficiency can be defined. To choose non-native performance as a criterion whenever native performance can be obtained is like using an imitation (even if it is a good one) when the genuine article is ready to hand. The choice of native speaker performance as the criterion against which to judge the validity of language proficiency tests, and as a basis for refining and developing them, guarantees greater facility in the interpretation of test scores, and more meaningful test sensitivities (i.e., variance). Another incidental benefit of referencing tests against native performance is the exertion of a healthy pressure on materials writers, teachers, and administrators to teach non-native speakers to do what natives do - i.e., to communicate effectively - instead of teaching them to perform discrete point drills that have little or no relation to real communication.

Of course, there is nothing surprising to the successful classroom teacher in any of these observations. Many language teachers have been devoting much effort, and have done so for many years, to making all of their classroom activities as meaningful, natural, and relevant to the normal communicative uses of language as possible.

KEY POINTS
1. Statistical reasoning is sometimes difficult and can easily be misleading.
2. There is no known rational way of deciding what percentage of items on a discrete point test should be devoted to the assessment of a particular skill, aspect, or component of a skill. Indeed, there cannot be any basis for componential analysis of language tests into phonology, syntax, and vocabulary subtests, because in normal uses of language all components work hand in hand and simultaneously.
3. The difficulty of representatively sampling the universe of possible sentences in a language or discourses in a language is insurmountable.
4. The sentences or discourses in a language which actually occur are but a small portion (an infinitesimally small portion) of the ones which could occur, and they are non-repetitive due to the very nature of human experience.
5. The discrete point method of sampling the universe of possible phrases, or sentences, or discourses, is about as applicable to the fundamental problem of language testing as the method of listing examples of phrases, sentences, or discourses is to the basic problem of characterizing language proficiency - or psychologically real grammatical systems.
6. The solution to the grammarian's problem is to focus attention at a deeper level - not on the surface forms of utterances, but on the underlying capacity which generates not only a particular utterance, but utterances in general.
7. The solution to the tester's problem is similar - namely, to focus attention not on the sampling of phrases, sentences, or discourses per se, but rather on the assessment of the efficiency of the developing learner capacity which generates sequences of linguistic elements in the target language (i.e., the efficiency of the learner's psychologically real grammar that interprets and produces sequences of elements in the target language in particular correspondences to extralinguistic contexts).
8. Low correlations have sometimes been interpreted incorrectly as showing that tests with different labels are necessarily measures of what the labels name. There are, however, many other sources of low correlations. In fact, tests that are measures of exactly the same thing may correlate at low levels if one or both are unreliable, too hard or too easy, or simply not valid (e.g., if both are measures of nothing).
9. It cannot reasonably be argued that a low correlation between tests with different labels is due to a lack of validity for one of the tests and is also evidence that the tests are measuring different skills.
10. Observed high correlations between diverse language tests cannot be dismissed as mere indications of reliability - they must indicate that the proficiency factor underlying the diverse performances is validly tapped by both tests. Furthermore, such high correlations are not ambiguous in the way that low correlations are.
11. The expectation of low correlations between tests that require diverse language performances (e.g., listening as opposed to reading) is drawn from discrete point theorizing (especially the componentializing of language skills), but is strongly refuted when diverse language tests (e.g., cloze and dictation, sentence paraphrasing and essay writing, etc.) are observed to correlate at high levels.
12. To assume that high correlations between diverse tests are merely an indication that the tests are reliable is to treat different tests as if they were the same. If they are not in fact the same, and if they are treated as the same, what justification remains for treating any two tests as different tests (e.g., a phonology test compared against a vocabulary test)? To follow such reasoning to its logical conclusion is to obliterate the possibility of recognizing different skills, aspects, or components of skills.
13. Statistical procedures merit the position of slaves to educational purposes much the way hammers and nails merit the position of tools in relation to building shelves. If the tools are elevated to the status of procedures for defining the shape of the shelves or what sort of books they can hold, they are being misused.
14. Acceptable item statistics do not guarantee valid test items, neither do unacceptable item statistics prove that a given item is not valid. Tests must have independent and higher claims to validity before item statistics per se can be meaningfully interpreted.
15. Language tests may be referenced against the performance of native or non-native speakers.
16. Native and non-native performance on language tasks contrast in a number of ways including the frequency and severity of errors, type of errors (interference errors being characteristic only of non-native speech), and the subtlety and complexity of the organizational constraints that can be handled (humor, sarcasm, etc.).
17. Non-native performance is also apt to be skewed by the artificial contexts of much classroom experience.
18. If the foregoing generalizations were correct, we should expect natives to perform more poorly on some of the items aimed at assessing the knowledge that non-natives acquire in artificial settings.
19. Angoff and Sharon (1971) found that natives did more poorly than non-natives on 21% of the items on the Writing Ability section of the TOEFL (a test referenced against the performance of non-natives to start with).
20. If, on the other hand, native performance is set as the main criterion, language tests can easily be made more interpretable and necessarily would achieve higher validity (other things being equal).
21. Another advantage of referencing language tests against the performance of native speakers is to place a healthy pressure on what happens in classrooms - a pressure toward more realistic uses of language for communicative purposes.
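Key points 8-12 can be illustrated with the classical correction for attenuation, which shows how quickly unreliability alone drags down the observed correlation between two tests of the very same ability. The sketch below is in Python; the reliability figures are invented for illustration, and the formula (observed r equals true r times the square root of the product of the two reliabilities) is the standard psychometric relation, not anything specific to language testing.

```python
# Minimal sketch of the attenuation argument behind key points 8-12:
# two tests measuring THE SAME ability correlate well below 1.00 as
# soon as either test is unreliable. Reliabilities here are invented.

from math import sqrt

def observed_r(true_r, rel_x, rel_y):
    """Classical attenuation: observed r = true r * sqrt(rel_x * rel_y)."""
    return true_r * sqrt(rel_x * rel_y)

true_r = 1.00  # suppose both tests tap exactly the same proficiency
for rel_x, rel_y in [(0.90, 0.90), (0.70, 0.70), (0.50, 0.40)]:
    r = observed_r(true_r, rel_x, rel_y)
    print(f"reliabilities {rel_x:.2f}/{rel_y:.2f} -> observed r = {r:.2f}")

# reliabilities 0.90/0.90 -> observed r = 0.90
# reliabilities 0.70/0.70 -> observed r = 0.70
# reliabilities 0.50/0.40 -> observed r = 0.45
```

A low correlation is therefore ambiguous between distinct skills and mere unreliability, whereas a high correlation between two unreliable tests could only arise if a common proficiency factor were genuinely tapped by both.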
DISCUSSION QUESTIONS
1. Estimate the number of distinctive sounds there are in English (i.e., phonemes). Then, estimate the number of syllables that can be constructed from those sounds. Which number is greater? Suppose that there were no limit on the length or structure of a syllable (or say, a sequence of sounds). How many syllables would there be in English? Expressions like John and Mary and Bill and ... can be extended indefinitely. Therefore, how many possible sequences are there of the type? What other examples of recursive strings can you exemplify (i.e., strings whose length can always be increased by reapplying a principle already used in the construction of the string)? How many such strings are there in English? In any other language? Discuss the applicability of sampling theory to the problem of finding a representative set of such structures for a language test or subtest.
2. How could you demonstrate by logic that the number of possible sentences in a language must be smaller than the number of possible conversations? Or that the number of possible words must be smaller than the number of possible sentences? Or that in general (i.e., in all cases) the number of possible units of a lower order of structure must be smaller than the number of possible units of a higher order? Further, if it is impossible to sample representatively the phrases of a language, what does this imply with respect to the sentences? The discourses?
3. What unit of discourse is more highly constrained by rules of an expectancy system, a word or a phrase? A phrase or a sentence? A sentence or a dialogue? A chapter or an entire novel? Can you prove by logic that larger units of language are necessarily more highly constrained by cognitive factors than lower order units? Can you show by logic that some of the constraints on phrases simply do not exist for words and that some of the constraints on sentences do not exist for phrases, and so forth? If so, what implications does your reasoning hold for language tests? What, for instance, is the difference between a vocabulary item without any additional context as opposed to, say, a vocabulary item in the context of a sentence, or in the context of a paragraph, or in the context of an entire essay or speech? Which sort of test is more apt to mirror faithfully what native speakers do when they use words - perhaps even when they use the word in question in the item?
4. Consider the problem of trying to decide what percentage of items on a language test should be devoted to phonology as opposed to vocabulary or syntax. What proportion of normal speech is represented by a strict attention to phonology as opposed to vocabulary?
5. Consider the meaning of a word in a particular context. For instance, suppose someone runs up to you and tells you that your closest friend has just been in an automobile accident. Is the meaning associated with the friend's name in that context the same as the meaning in a context where you are told that this friend will be leaving for Australia within a month? In what ways are the two uses of the friend's name (a proper noun in these cases) similar? Different? What if the same person is standing nearby and you call him by name? Or suppose that he is not nearby and you mention his name to some other person. Or perhaps you refer to your friend by name in a letter addressed to himself, or to someone else. In what sense are all of these uses of the name the same, and in what sense are they different? If we extend the discussion then to other classes of words which are pragmatically more complex than proper nouns, or to grammatical forms that are more complex still, in what sense can any utterance be said to be a repetition of any other utterance? Discuss the implications for language teaching and language testing. If utterances and communicative events in general are non-repetitive, what does this imply for language tests? Be careful not to overlook the similarities between utterances on different occasions.
6. What crucial datum must be supplied to prove (or at least support) the notion that a test labeled a 'Vocabulary' test is in fact more a measure of vocabulary knowledge than of grammatical knowledge?
7. Why not interpret low correlations as proof that the intercorrelated tests are valid measures of what their labels imply?
8. Why not generally assume that high correlations are mere indications of test reliability? In what cases is such a claim justified? When is it not justified? What is the crucial factor that must be considered?
9. What evidence is there that syntactic and lexical knowledge may be more closely interrelated than was once thought? What does a learner have to know in order to select the best synonym for a given word when the choices offered and the given word are presented in isolation from any particular context?
10. Suppose that all of the items in a given test or subtest produce acceptable statistics. What additional criteria need to be met in order to prove that the test is a good test of whatever it purports to measure? Or, considered from a different angle, what criteria might obtain which would make the test unacceptable in spite of the statistics?
11. Suppose the statistics indicate that some of the items on a test are not valid, or possibly that none are acceptable. What additional criteria should be considered before the test is radically revised?
12. Give a cloze test to a group of non-native speakers and to a group of native speakers.
Compare the diversity of errors and the general degree of agreement on response choices. For instance, do natives tend to show greater or lesser diversity of responses on items that require words like to, for, if, and, but, however, not, and so forth than non-natives? What about blanks that require content words like nouns, verbs, and adjectives? What are the most obvious contrasts between native and non-native responses?
13. Analyze the errors of native speakers on an essay task (or any other pragmatic task) and compare them to those of a group of non-natives.
14. In what ways are tests influenced by classroom procedures, and conversely how do tests influence what happens in classrooms? Has the TOEFL, for instance, had a substantial influence on the teaching of EFL abroad? Or consider the influence of tests like the SAT, or the American College Tests, or the Comprehensive Tests of Basic Skills, or IQ tests in general.

SUGGESTED READINGS
1. Anne Anastasi, 'Reliability,' Chapter 5 in Psychological Testing, 4th ed., New York: Macmillan, 1976, 103-133.
2. Anne Anastasi, 'Validity: Basic Concepts,' Chapter 6 in Psychological Testing, 4th ed., New York: Macmillan, 1976, 134-161.
3. Lee J. Cronbach and P. E. Meehl, 'Construct Validity in Psychological Tests,' Psychological Bulletin 1955, 52, 281-302.
4. Robert L. Ebel, 'Must All Tests Be Valid?' in G. H. Bracht, Kenneth D. Hopkins, and Julian C. Stanley (eds.) Perspectives in Educational and Psychological Measurement. Englewood Cliffs, New Jersey: Prentice-Hall, 1972, 74-87.
5. Calvin R. Petersen and Francis A. Cartier, 'Some Theoretical Problems and Practical Solutions in Proficiency Test Validity,' in R. L. Jones and B. Spolsky (eds.) Testing Language Proficiency. Arlington, Va.: Center for Applied Linguistics, 1975, 105-118.

Discrete Point Tests
A. What they attempt to do
B. Theoretical problems in isolating pieces of a system
C. Examples of discrete point items
D. A proposed reconciliation with pragmatic testing theory

Here several of the goals of discrete point theory are considered. The theoretical difficulty of isolating the pieces of a system is considered along with the diagnostic aim of specific discrete point test items. It is concluded that the virtues of specific diagnosis are preserved in pragmatic tests without the theoretical drawbacks and artificiality of discrete item tests. After all, the elements of language only express their separate identities normally in full-fledged natural discourse.

A. What they attempt to do

Discrete point tests attempt to achieve a number of desirable goals. Perhaps the foremost among them is the diagnosis of learner difficulties and weaknesses. The idea is often put forth that if the teacher or other test interpreter is able to learn from the test results exactly what the learner's strengths and weaknesses are, he will be better able to prescribe remedies for problems and will avoid wasting time teaching the learner what is already known. Discrete point tests attempt to assess the learner's capabilities to handle particular phonological contrasts from the point of view of perception and production. They attempt to assess the learner's capabilities to produce and interpret stress patterns and intonations on longer segments of speech. Special subtests are aimed at knowledge of vocabulary and syntax. Separate tests for speaking, listening, reading, and writing may be devised.
Always it is correctly assumed that individuals will differ, some being better in certain skills and components of knowledge while others are better in other skills and components. Moreover, it is assumed that a given individual (or group) may show marked differences in, say, pronunciation skills as opposed to listening comprehension, or in reading and writing skills as opposed to listening and speaking skills.

A second goal implicit in the first is the prescription of teaching remedies for the weaknesses in learner skills as revealed by discrete point tests.
If it is possible to determine precisely what is the profile of a given learner with respect to the inventory of phonological contrasts that are possible in a given language, and with respect to each other skill, aspect or component of a skill as measured by some subtest which is part of a battery of discrete point tests, then it should be possible to improve course assignments, specific teaching objectives, and the like. For instance, if tests reveal substantial differences in speaking and listening skills as opposed to reading and writing skills, it might make sense to split students into two streams where in one stream learners are taught reading and writing skills while in the other they are taught listening and speaking skills. Other breakdowns might involve splitting instructional groups according to productive versus receptive skills, that is, by putting speaking and writing skills into one course curriculum (or a series of course curricula advancing from the beginning level upward), or according to whatever presumed components all learners must acquire. For instance, phonology might be taught in one class where the curriculum would concentrate on the teaching of pronunciation or listening discrimination, or both. Another class (or period of time) might be devoted to enhancing vocabulary knowledge. Another could be devoted to the teaching of grammatical skills (pattern drills and syntax). Or, if one were to be quite consistent with the spirit of discrete point theorizing, there should be separate classes for the teaching of vocabulary (and each of the other presumed components of language proficiency) for reading, writing, speaking, and listening. Correspondingly, there would be phonology for speaking, and phonology for listening, sound-symbol instruction for reading, and sound-symbol instruction for writing, and so on.

A third related goal for discrete point diagnostic testing would be to put discrete point teaching on an even firmer theoretical footing. Special materials might be devised to deal with precisely the points of difficulty encountered by learners in just the areas of skill that need attention. There could be pronunciation lessons focussing on specific phonological contrasts; vocabulary exercises focussing on the expansion of receptive or productive repertoires (or speaking or listening repertoires); syntax drills designed to teach certain patterns of structure for speaking, and others designed to teach certain patterns for listening, and others for reading and yet others for writing; and so on until all components and skills were exhausted.

These three goals, that is, diagnosing learner strengths and weaknesses, prescribing curricula aimed at particular skills, and developing specific teaching strategies to help learners overcome particular weaknesses, are among the laudable aims of discrete point testing. It should be noted, however, that the theoretical basis of discrete point teaching is no better than the empirical results of discrete point testing. The presumed components of grammar are no more real for practical purposes than they can be demonstrated to be by the meaningful and systematic results of discrete point tests aimed at differentiating those presumed components of grammar. Further, the ultimate effectiveness of the whole philosophy of discrete point linguistic analysis, teaching, and testing (not necessarily in any particular order) is to be judged in terms of how well the learners who are subjected to it are thereby enabled to communicate information effectively in the target language. In brief, the whole of discrete point methodology stands or falls on the basis of its practical results. The question is whether learners who are exposed to such a method (or family of methods) actually acquire the target language. The general impotence of such methods can be attested to by almost any student who has studied a foreign language in a classroom situation. Discrete point methods are notoriously ineffective. Their nearly complete failure is demonstrated by the paucity of fluent speakers of any target language who have acquired their fluency exclusively in a classroom situation. Unfortunately, since classroom situations are predominantly characterized by materials and methods that derive more or less directly from discrete point linguistic analysis, the verdict seems inescapable: discrete point methods don't work.

The next obvious question is why. How is it that methods which have so much authority, and just downright rational analytic appeal, fail so widely? Surely it is not for lack of dedication in the profession. It cannot be due to a lack of talented teachers and bright students, nor that the methods have not been given a fair try. Then, why?

B. Theoretical problems in isolating pieces of a system

Discrete point theories are predicated on the notion that it is possible to separate analytically the bits and pieces of language and then to teach and/or test those elements one at a time without reference to the contexts of usage from which those elements were excised. It is an undeniable fact, however, that phonemes do not exist in isolation. A child has to go to school to learn that he knows how to handle phonemic contrasts - to learn that his language has phonemic contrasts. It may be true that he unconsciously makes use of the phonemic contrast between see and say, for instance, but he must go to school to find out that he has such skills or that his language requires them. Normally, the phonemic contrasts of a language are no more consciously available to the language user than harmonic intervals are to a music lover, or than the peculiar properties of chemical elements are to a gourmet cook.
Just as the relations between harmonics are important to a music lover only in the context of a musical piece (and probably not at all in any other context), and just as the properties of chemical elements are of interest to the cook only in terms of the gustatory effects they produce in a roast or dish of stew, phonemic contrasts are principally of interest to the language user only in terms of their effects in communicative exchanges - in discourse.

Discrete point analysis necessarily breaks the elements of language apart and tries to teach them (or test them) separately with little or no attention to the way those elements interact in a larger context of communication. What makes it ineffective as a basis for teaching or testing languages is that crucial properties of language are lost when its elements are separated. The fact is that in any system where the parts interact to produce properties and qualities that do not exist in the parts separately, the whole is greater than the sum of its parts. If the parts cannot just be shuffled together in any old order - if they must rather be put together according to certain organizational constraints - those organizational constraints themselves become crucial properties of the system which simply cannot be found in the parts separately.

An example of a discrete point approach to the construction of a test of 'listening grammar' is offered by Clark (1972):

Basic to the growth of student facility in listening comprehension is the development of the ability to isolate and appropriately interpret important syntactical and morphological aspects of the spoken utterance such as tense, number, person, subject-object distinctions, declarative and imperative structures, attributions, and so forth. The student's knowledge of lexicon is not at issue here; and for that matter, a direct way of testing the aural identification of grammatical functions would be to use nonsense words incorporating the desired morphological elements or syntactic patterns. Given a sentence such as 'Le muglet a été candré par la friblonne,' [roughly translated from French, The muglet has been candered by the friblun, where muglet, cander, and friblun are nonsense words] the student might be tested on his ability to determine: 1) the time aspect of the utterance (past time), 2) the 'actor' and 'acted upon' ('friblonne' and 'muglet', respectively), and the action involved ('candre') [p. 53f].

First, it is assumed that listening skill is different from speaking skill, or reading skill, or writing skill. Further, that lexical knowledge as related to the listening skill is one thing while lexical knowledge as related to the reading skill is another, and further still that lexical knowledge is different from syntactic (or morphological) knowledge as each pertains to listening skill (or 'listening grammar'). On the basis of such assumptions, Clark proposes a very logical extension: in order to separate the supposedly separate skills for testing it is necessary to eliminate lexical knowledge from consideration by the use of nonsense words like muglet, cander, and friblun. He continues:

If such elements were being tested in the area of reading comprehension, it would be technically feasible to present printed nonsense sentences of this sort upon which the student would operate.
In a listening comprehension situation, however, the difficulty of retaining in memory the various strange words involved in the stimulus sentence would pose a listening comprehension problem independent of the student's ability to interpret the grammatical cues themselves. Instead of nonsense words (which would in any event be avoided by some teachers on pedagogical grounds), genuine foreign-language vocabulary is more suitably employed to convey the grammatical elements intended for aural testing [p. 54].

Thus, a logical extension of discrete point testing is offered for reading comprehension tests, but is considered inappropriate for listening tests. Let us suppose that such items as Clark is suggesting were used in reading comprehension tests to separate syntactic knowledge from lexical knowledge. In what ways would they differ from similar sentences that might occur in normal conversation, prose, or discourse? Compare The muglet has been candered by the friblun with The money has been squandered by the freeloader. Then consider the question whether the relationships that hold between the grammatical subject and its respective predicate in each case is the same. Add a third example: The pony has been tethered by the barnyard. It is entirely unclear in the nonsense example whether the relationship between the muglet and the friblun is similar to the relationship between the money and the freeloader or whether it is similar to the relationship between the pony and the barnyard. How could such syntactic relationships and the knowledge of them be tested by such items? One might insist that the difference between squandering something and tethering something is strictly a matter of lexical knowledge, but can one reasonably claim that the relationship between a subject and its predicate is strictly a lexical relationship? To do so would be to erase any vestige of the original distinction between syntax and vocabulary. The fact that the money is in some sense acted upon by the freeloader who does something with it, namely, squanders it, and that the pony is not similarly acted upon by the barnyard is all bound up in the syntax and in the lexical items of the respective sentences, not to mention their potential pragmatic relations to extralinguistic contexts and their semantic relations to other similar sentences. It is not even remotely possible to represent such intrinsically rich complexities with nonsense items of the sort Clark is proposing. What is the answer to questions like: Can fribluns be candered by muglets? Is candering something that can be done to fribluns? Can it be done to muglets? There are no answers to such questions, but there are clear and obvious answers to questions like: Can barnyards be tethered by ponies? Is tethering something that can be done to barnyards? Can it be done to ponies? Can freeloaders be squandered by money? Is squandering something that can be done to freeloaders? Can it be done to money? The fact that the latter questions have answers and that the former have none is proof that normal sentences have properties that are not present in the bones of those same sentences stripped of meaning. In fact, they have syntactic properties that are not present if the lexical items are not there to enrich the syntactic organization of the sentences. The syntax of utterances seems to be just as intricately involved in the expression of meaning as the lexicon is, and to propose testing syntactic and lexical knowledge separately is like proposing to test the speed of an automobile with the wheels first and the engine later.

It makes little difference to the difficulties that discrete point testing creates if we change the focal point of the argument from the sentence level to the syllable level or to the level of full-fledged discourse.
Syllables have properties in discourse that they do not have in isolation, and sentences have properties in discourse that they do not have in isolation, and discourse has properties in relation to everyday experience that it does not have when it is isolated from such experience. In fact, discourse cannot really be considered discourse at all if it is not systematically related to experience in ways that can be inferred by speakers of the language. With respect to syllables, consider the stress and length of a given syllable such as /rɛd/ as in He read the entire essay in one sitting and in the sentence He read it is what he did with it (as in response to What on earth did you say he did with it?). Can a learner be said to know a syllable on the basis of a discrete test item that requires him to distinguish it from other similar syllables? If the learner knew all the syllables of the language in that sense, would this be the same as knowing the language? For the word syllable in the preceding questions, substitute the words sound, word, syntactic pattern, but one must not substitute the words phrase, sentence, or conversation, because they certainly cannot be adequately tested by discrete item tests. In fact it is extremely doubtful that anything much above the level of the distinctive sounds (or phonemes) of a language can be tested one at a time as discrete point theorizing requires.

Furthermore, it is entirely unclear what should be considered an adequate discrete point test of knowledge of the sounds or sound system of a language. Should it include all possible pairs of sounds with similar distributions? Just such a pairing would create a very long test if it only required discrimination decisions about whether a heard pair of sounds was the same or different. Suppose a person could handle all of the items on the test. In what sense could it be said that he therefore knows the sound system of the tested language? The fact is that the sounds of a language are structured into sequences that make up syllables which are structured in complex ways into words and phrases which are themselves structured into sentences and paragraphs or higher level units of discourse, and the highest level of organization is rather obviously involved in the lowest level of linguistic unit production and interpretation. The very same sequence of sounds in one context will be taken for one syllable and in another context will be taken for another. The very same sound in one context may be interpreted as one word and in another as a completely different word (e.g., 'n, as in He's 'n ape, and in This 'n that 'n the other). A given sequence of words in one context may be taken to mean exactly the opposite of what they mean in another context (e.g., Sure you will, meaning either No, you won't or Yes, you will). All of the foregoing facts and many others that are not mentioned here make the problems of the discrete item writer not just difficult but insurmountable in principle.
There is no way that the normal facts of language can adequately be taught or tested by using test items or teaching materials that start out by destroying the very properties of language that most need to be grasped by learners. How can a person learn to map utterances pragmatically onto extralinguistic contexts in a language that he does not yet know (that is, to express and interpret information in words about experience) if he is forced to deal with words and utterances that are never related to extralinguistic experience in the required ways? The answer is that no one can learn a language on the basis of the principles advocated by discrete point theorists. This is not because it is very difficult to learn a language by experiencing bits and pieces of it in isolation from pragmatic contexts, it is because it is impossible to learn a language by experiencing bits and pieces of it in that way. For the same reason, discrete test items that aim to test the knowledge of language independent of the use of that knowledge in normal contexts of communication must also fail. No one has ever proposed that instead of running races at the Olympics contestants should be subjected to a battery of tests including the analysis of individual muscle potentials, general quickness, speed of bending the leg at the knee joint, and the like - rather the speed of runners is tested by having them run. Why should the case be so different for language testing? Instead of asking how well a particular language learner can handle the bits and pieces of presumed analytical components of grammar, why not ask how well the learner can use all of the components (whatever they are) in dealing with discourse?

In addition to strong logic, there is much empirical evidence to show that discrete point methods of teaching fail and that discrete point methods of testing are inefficient. On the other hand, there are methods of teaching (and learning languages) that work - for instance, methods of teaching where the pragmatic mapping of utterances onto extralinguistic contexts is made obvious to the learner. Similarly, methods of testing that require the examinee to perform such mapping of sequences of elements in the target language are quite efficient.

C. Examples of discrete point items

The purpose of this section is to examine some examples of discrete point items and to consider the degree to which they produce the kinds of information they are supposed to produce - namely, diagnostic information concerning the mastery of specific points of linguistic structure in a particular target language (and for some testers, learners from a particular background language). In spite of the fact that vast numbers of discrete point test categories are possible in theory, they always get pared down to manageable proportions even by the theorists who advocated the more proliferate test designs in the first place. For example, under the general heading of tests of phonology a goodly number of subheadings have been proposed including: subtests of phonemic contrasts, stress and intonation, subclassed still further into subsubtests of recognition and production, not to mention the distinctions between word stress versus sentence stress and so on. In actuality, no one has ever devised a test that makes use of all of the possible distinctions, nor is it likely that anyone ever will, since the possible distinctions can be multiplied ad infinitum by the same methods that produced the commonly employed distinctions.
This last fact, however, has empirical consequences in the demonstrable fact that no two discrete point testers (unless they have imitated each other) are apt to come up with tests that represent precisely the same categories (i.e., subtests, subsubtests, and the like). Therefore, the items used as examples here cannot represent all of the types of items that have been proposed. They do, however, represent commonly used types. First, we will consider tests of phonological skills, then vocabulary, then grammar (usually limited to a narrow definition of syntax - that is, having to do with sequential relations between words or phrases, or clauses).

1. Phonological items. Perhaps the most often recommended and widest used technique for assessing 'recognition' or 'auditory discrimination' is the minimal pair technique or some variation of it. Lado (1961), Harris (1969), Clark (1972), Heaton (1975), Allen and Davies (1977), and Valette (1977) all recommend some variant of the technique. For instance, Lado suggests reading pairs of words with minimal sound contrasts while the students write down 'same' or 'different' (abbreviated to 'S' or 'D') for each numbered pair. To test 'speakers of Spanish, Portuguese, Japanese, Finnish' and other language backgrounds who are learning English as a foreign or second language, Lado proposes items like the following:

1. sleep; slip
2. fist; fist
3. ship; sheep
4. heat; heat
5. jeep; gyp
6. leap; leap
7. rid; read
8. mill; mill
9. neat; knit
10. beat; bit (Lado, 1961, p. 53).

Another item type which is quite similar is offered by both Lado and Harris. The specific examples here are from Lado. The teacher (or examiner) reads the words (with identical stress and intonation) and asks the learner (or examinee) to indicate which words are the same. If all are the same, the examinee is to check A, B, and C on the answer sheet. If none is the same he is to check O.

1. cat; cat; cot
2. run; sun; run
3. last; last; last
4. beast; best; best
5. pair; fair; chair (Lado, 1961, p. 74).
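It is worth pausing over how little information such items yield. The sketch below (Python; the simulated 'learner' and the use of these particular word pairs as data are for illustration only) scores a Lado-style same/different test and shows two things: a pure guesser hovers around the 50% chance floor built into the two-choice format, and a single miss cannot, by itself, distinguish a phonemic problem from a lapse of attention or a stray mark on the answer sheet.

```python
# Minimal sketch: scoring same/different minimal-pair items of the type
# quoted above, plus the chance baseline of the two-choice format.
# The simulated guesser and the single-miss learner are invented.

import random

items = [("sleep", "slip"), ("fist", "fist"), ("ship", "sheep"),
         ("heat", "heat"), ("jeep", "gyp"), ("leap", "leap"),
         ("rid", "read"), ("mill", "mill"), ("neat", "knit"),
         ("beat", "bit")]
key = ["D" if a != b else "S" for a, b in items]  # scoring key

def score(responses):
    return sum(r == k for r, k in zip(responses, key))

random.seed(1)
guesses = [random.choice("SD") for _ in key]
print(f"pure guessing: {score(guesses)}/{len(key)}")   # hovers around 5/10

# A learner who misses only item 3 ('ship'/'sheep'): is the problem the
# vowel contrast, a lapse of attention, or a mismarked answer sheet?
learner = key[:]
learner[2] = "S"
print(f"one miss:      {score(learner)}/{len(key)}")   # 9/10
```

The score itself is computable enough; what it cannot do is answer the diagnostic questions raised in the next paragraph.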
Now, let us consider briefly the question of what diagnostic information such test items provide. Suppose a certain examinee misses item 2 in the second set of items given above. What can we deduce from this fact? Is it safe to say that he doesn't know /s/ or is it /r/? Or could it be he had a lapse of attention? Could he have misunderstood the item instructions or marked the wrong spot on the answer sheet? What teaching strategies could be recommended to remedy the problem? What does missing item 2 mean with respect to overall comprehension? Or suppose the learner misses item 4 in the second set given above. Ask the same questions. What about item 5 where three initial consonants are contrasted? The implication of the theory that highly focussed discrete point items are diagnostic by virtue of their being aimed at specific contrasts is not entirely transparent.

What about the adequacy of coverage of possible contrasts? Since it can be shown that the phonetic form of a particular phoneme is quite different when the phoneme occurs initially (after a pause or silence) rather than medially (between other sounds) or finally (before a pause or silence), an adequate recognition test for the sounds of English should presumably assess contrasts in all three positions. If the test were to assess only minimal contrasts, it should presumably test separately each vowel contrast and each consonantal contrast (ignoring the chameleonic phonemes such as /r/ and /l/ which have properties of vowels and consonants simultaneously, not to mention /w/ and /y/ which do not perfectly fit either category). It would have to be a very long test indeed. If there were only eleven vowels in English, the matrix of possible contrasts would be eleven times eleven, or 121, minus eleven (the diagonal pairs of the matrix which involve contrasts between each element and itself, or the null contrasts), or 110, divided by two (to compensate for the fact that the top half of the matrix is identical to the bottom half). Hence, the number of non-redundant pairs of vowels to be contrasted would be at least 55. If we add in the number of consonants that can occur in initial, medial, and final position, say, about twenty (to be on the conservative side), we must add another 190 pairs times the three positions, or 570 items, which together with the 55 vowel contrasts makes 625 items. Furthermore, this estimate is still low because it does not account for consonant clusters, diphthongs, nor for vocalic elements that can occur in initial or final positions.

Suppose the teacher decides to test only a sampling of the possible contrasts and develops a 100 item test. How will the data be used? Suppose there are twenty students in the class where the test is to be used. There would be 2,000 separate pieces of data to be used by the teacher. Suppose that each student missed a slightly different set of items on the test. How would this diagnostic information be used to develop different teaching strategies for each separate learner? Suppose that the teacher actually had the time and energy to sit down and go through the tests one at a time looking at each separate item for each separate learner. How would the item score for each separate learner be translated into an appropriate teaching strategy in each case? The problem we come back to is how to interpret a particular performance on a particular item on a highly focussed discrete point test. It is something like the problem of trying to determine the composition of sand in a particular sand box in relation to a certain beach by comparing the grains of sand in the box with the grains of sand on the beach - one at a time. Even if one were to set out to improve the degree of correspondence, how would one go about it and what criterion of success could be conceived?
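The counting argument above (55 vowel pairs plus 570 consonant pairs in three positions) is easy to verify directly. A minimal sketch in Python, using the conservative inventory sizes assumed in the text:

```python
# Minimal sketch of the counting argument above: non-redundant pairwise
# contrasts among n sounds number n*(n-1)/2. The inventory sizes (eleven
# vowels, twenty consonants, three positions) are the text's conservative
# assumptions, and the total still ignores clusters and diphthongs.

def pairs(n):
    return n * (n - 1) // 2

vowels, consonants, positions = 11, 20, 3
vowel_items = pairs(vowels)                       # 55
consonant_items = pairs(consonants) * positions   # 190 * 3 = 570
print(f"vowel contrasts:     {vowel_items}")
print(f"consonant contrasts: {consonant_items}")
print(f"total items:         {vowel_items + consonant_items}")  # 625
```

The quadratic growth of the pair count is the heart of the problem: every sound added to the inventory multiplies, rather than adds to, the number of items an exhaustive discrete point test would require.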
depend on someone's knowledge of the difference between sleep and slip? Is the knowledge associated with the meaning of the word sleep and the sort of states and the extralinguistic situations that the word is likely to be associated with less a matter of language proficiency than knowledge of the contrast between /iy/ and /I/? Is it possible to conceive of a context in which the sentence, Will he sleep? would be likely to be taken for the sentence, Will he slip? How often do you suppose slipping and sleeping would be expected to occur in the same contexts? Ask the same questions for each of the other example sentences.

Further, consider the difficulty of sentences such as the ones used in 2, where the first sentence sets up a frame that will not fit the second. If the learner assumes that the they in They beat him has the same referential meaning as the subsequent they in They bit him, the verb bit is unlikely. People may beat things or people or animals, but him seems likely to refer to a person or an animal. Take either option. Then when you hear, They bit him close on the heels of They beat him, what will you do with it? Does they refer to the same people and does him refer to the same person or animal? If so, how odd. People might beat a man or a dog, but would they then be likely to bite him? As a result of the usual expectancies that normal language users will generate in perceiving meaningful sequences of elements in their language, the second sentence is more difficult with the first as its antecedent. Hence, the kind of contextualization proposed by Lado to increase item validity may well decrease item validity. The function of the sort of context that is suggested for discrete point items of phonological contrasts is to mislead the better language learners into false expectancies instead of helping them (on the basis of normally correct expectancies set up by discourse constraints) to make subtle sound distinctions.

The items pictured in Figures 9-13 represent a different sort of attempt to contextualize discrete point contrasts in a listening mode. In Figure 9, for instance, both Lado (1961) and Harris (1969) have in mind testing the contrast between the words ship and sheep. Figure 10, from Lado (1961), is proposed as a basis for testing the distinction between watching and washing. Figure 11, also from Lado, is proposed as a basis for testing the contrasts between pin, pen, and pan - of course, we should note that the distinction between the first two (pin and pen) no longer exists in the most widely used varieties of American English. Figure 12 aims to test the initial consonant distinction between sheep and jeep and the vowel contrast between sheep and ship. Figure 13 offers the possibility of testing several contrasts by asking the examinee to point to the pen, pin, pan, picture, pitcher; the person who is watching the dishes (according to Lado, 1961, p. 59) and the person who is washing the dishes.

Figure 9. The ship/sheep contrast, Lado (1961, p. 57) and Harris (1969, p. 34).

Pertinent questions for the interpretation of errors on items of the type related to the pictures in Figures 9-11 are similar to the questions posed above in relation to similar items without pictures. If it were difficult to prepare an adequate test to cover the phonemic contrasts of English without pictures, it would surely be more difficult to try to do it with pictures.
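The arithmetic behind the coverage problem raised above is easy to verify. The following sketch (in Python, and of course no part of the original discussion) simply recomputes the counts given earlier; the inventory sizes are the text's own working assumptions:

    # Count the non-redundant minimal-pair items estimated in the text.
    # n sounds yield n * (n - 1) / 2 distinct pairs: the full n-by-n
    # matrix, minus the diagonal (null contrasts), halved because the
    # top half of the matrix duplicates the bottom half.
    def pairs(n):
        return n * (n - 1) // 2

    vowels = 11        # the assumed English vowel inventory
    consonants = 20    # a deliberately conservative consonant count
    positions = 3      # initial, medial, and final environments

    vowel_items = pairs(vowels)                      # 55
    consonant_items = pairs(consonants) * positions  # 190 * 3 = 570
    print(vowel_items + consonant_items)             # 625 items, at minimum

Even this minimum of 625 items ignores consonant clusters, diphthongs, and vocalic elements in initial or final position, so an honest coverage figure would be far higher still.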
Presumably the motivation for using pictures is to increase the meaningfulness of the test items - to contextualize them, just as in the case of the sentence frames discussed two paragraphs earlier. We saw that the sentence contexts actually are apt to create false expectancies which would distract from the purpose of the items. What about the pictures? Is it natural to say that the man in Figure 13 is watching the dishes? It would seem more likely that he might watch the woman who is washing the dishes. Or consider the man watching the window in Figure 10. Why is he doing that? Does he expect it to escape? To leave? To hatch? To move? To serve as an exit for a criminal who is about to try to get away from the law? If not for some such reason, it would seem more reasonable to say that the man in the one picture is staring at a window and the one in the other picture is washing a different window. If the man were watching the same window, how is it that he cannot see the man who is washing it? The context, which is proposed by Lado to make the contrasts meaningful, not only fails to represent normal uses of language accurately, but also fails to help the learner to make the distinctions in question. If the learner does not already know the difference between watching and washing, and if he was not confused before experiencing the test item, he may well be afterward. If the learner does not already know the meaning of the words ship, sheep, pin, pan, and pen, and if the sound contrasts are difficult for the learner to perceive, the pictures in conjunction with meaningless similar sounding forms merely serve as a slightly richer basis for confusion. Why should the learner who is already having difficulty with the distinction, say, between pin and pen have any less difficulty after being exposed to the pictures associated with words which he cannot distinguish? If he should become able to perceive the distinction on the basis of some teaching exercise related to the test item types, on what possible basis is it reasonable to expect the learner to associate correctly the (previously unfamiliar) word sheep, for instance, with the picture of the sheep and the word ship with the picture of the ship? The very form of the exercise (or test item) has placed the contrasting words in a context where all of the normal bases for the distinction in meaning have been deliberately obliterated. The learner is very much in the position of the child learning to spell to whom it is pointed out that the pairs of spellings their and there, pare and pair, son and sun represent different meanings and not to get confused about which is which. Such a method of presentation is almost certain to confuse the learner concerning which meaning goes with which spelling.

Figure 10. The watching/washing contrast, Lado (1961, p. 57).
Figure 11. The pin/pen/pan contrast, Lado (1961, p. 58).
Figure 12. The ship/jeep/sheep contrast, Lado (1961, p. 58).
Figure 13. 'Who is watching the dishes?' (Lado, 1961, p. 59).

2. Vocabulary items. It is usually suggested that knowledge of words should be referenced against the modality of processing - that is, the vocabulary one can comprehend when reading. Hence, it is often claimed that there must be separate vocabulary tests for each of the traditionally recognized four skills, at least for receptive and productive repertoires.
Above, especially in Chapter 3, we considered an alternative explanation for the relative availability of words in listening, speaking, reading, and writing. It was in fact suggested that it is probably the difficulty of the task and the load it places on memory and attention that creates the apparent differences in vocabulary across different processing tasks. Ask yourself the question whether you know or do not know a word you may have difficulty in thinking of at a particular juncture. Would you know it better if it were written? Less well if you heard it spoken? If you could understand its use by someone else, how does this relate to your ability or inability to use the same word appropriately? It would certainly appear that there is room for the view that a single lexicon may account for word knowledge (whatever it may be) across all four skills. It may be merely the accessibility of words that changes with the nature of the processing task rather than the words actually in the lexicon. In any event, discrete point theory requires tests of vocabulary and often insists that there must be separate tests for what are presumed to be different skill areas.

A frequently used type of vocabulary test is one aimed specifically at the so-called reading skill. For instance, Davies (1977) suggests the following vocabulary item:

Our tom cat has been missing ever since that day I upset his milk.
A. wild  B. drum  C. name  D. male

One might want to ask how useful the word tom in the sense given is for the students in the test population. Further, is it not possible that a wild tom cat became the pet in question and was then frightened off by the incident? What diagnostic information can one infer (since this is supposed to be a diagnostic type of item) from the fact that a particular student misses the item, selecting, say, choice C, name? Does it mean that he does not know the meaning of tom in the sense given? That he doesn't know the meaning of name? That he doesn't know any of the words used? That he doesn't understand the sentence? Or are not all of these possibilities viable as well as many other combinations of them? What is specific about the diagnostic information supposedly provided by such an item?

Another item suggested by Davies (1977) is of a slightly different type:

Cherry: Red Fruit Vegetable Blue Cabbage Sweet Stalk Tree Garden

The task of the examinee is to order the words offered in relation to their closeness in meaning to the given word cherry. Davies avows, 'it may be argued that these tests are testing intelligence, particularly example 2 [of the two examples given immediately above] which demands a very high degree of literacy, so high that it may be entirely intelligence that is being tested here' (p. 81). There are several untested presuppositions in Davies' remark. One of them is that we know better what we are talking about when we speak of 'intelligence' than when we speak of language skill. (On this topic see the Appendix, especially part D.) Another is that the words in the proffered set of terms printed next to the word cherry have some intrinsic order in relation to cherries. The difficulty with this item, as with all of the items of its type, is that the relationship between cherries and cabbage, or gardens, etc., has a great deal more to do with where one finds the cherries at the moment than with something intrinsic to the nature of the word cherry. At one moment the fact that a cherry is apt to be found on a cherry tree may be the most important defining property.
In a different context the fact that some of the cherries are red and therefore edible may carry more pragmatic cash value than the fact that it is a fruit. In yet another context sweetness may be the property of greatest interest. It is an open empirical question whether items of the sort in question can be scored in a sensible way and whether or not they will produce a high correlation with tests of reading ability.

Lado (1961) was among the first language testers to suggest vocabulary items like the first of Davies' examples given above. For instance, Lado suggested items in the following form:

Integrity
A. intelligence  B. uprightness  C. intrigue  D. weakness

Another alternative suggested by Lado (1961, p. 189) was:

The opposite of strong is
A. short  B. poor  C. weak  D. good

Similar items in fact can be found in books on language testing by Harris (1969), Clark (1972), Valette (1967, 1977), Heaton (1975) and in many other sources. In fact, they date back to the earliest forms of so-called 'intelligence' and also 'reading' tests (see Gunnarsson, 1978, and his references). Two nagging questions continue to plague the user of discrete point vocabulary tests. The first is whether such tests really measure (reliably and validly) something other than what is measured by tests that go by different names (e.g., grammar, or pronunciation, not to mention reading comprehension or IQ). The second is whether the kind of knowledge displayed in such tests could not better be demonstrated in tasks that more closely resemble normal uses of language.

3. Grammar items. Again, there is the problem of deciding what modality is the appropriate one, or how many different modalities must be used in order to test grammatical knowledge adequately (whatever the latter may be construed to be). Sample items follow with sources indicated (all of them were apparently intended for a written mode of presentation):

i. 'Does John speak French?' 'I don't know what ...'
A. does  B. speaks  C. he (Lado, 1961, p. 180).

ii. When ________?
A. plan  B. do  C. to go  D. you (Harris, 1969, p. 28).

iii. I want to ____ home now.
A. gone  B. went  C. go  D. going (Davies, 1977, p. 77).

Similar items can be found in Clark (1972), Heaton (1975), and Valette (1977). Essentially they concentrate on the ordering of words or phrases in a minimal context, or they require selection of the appropriate continuation at some point in the sentence. Usually no larger context is implied or otherwise indicated. In the Appendix we examine the correlation between a set of tests focussing on the formation of appropriate continuations in a given text, another set requiring the ordering of words, phrases, or clauses in similar texts, and a large battery of other tests. The results suggest that there is no reasonable basis for claiming that the so-called vocabulary (synonym matching) type of test items are measuring anything other than what the so-called grammar items (selecting the appropriate continuation, and ordering elements appropriately) are measuring. Further, these tests do not seem to be doing anything different from what standard dictation and cloze procedure can accomplish. Unless counter evidence can be produced to support the superstructure of discrete point test theory, it would appear to be in grave empirical difficulty.
D. A proposed reconciliation with pragmatic testing theory
From the arguments presented in this chapter and throughout this entire book - especially all of Part Two - one might be inclined to think that the 'elements' of language, whatever they may be, should never be considered at all. Or at least, one might be encouraged to read this recommendation between the lines. However, this would be a mistake. What, after all, does a pragmatic test measure? Does it not in fact measure the examinee's ability to make use of the sounds, syllables, words, phrases, intonations, clauses, etc. in the contexts of normal communication? Or at least in contexts that faithfully mirror normal uses of language? If the latter is so, then pragmatic tests are in fact doing what discrete point testers wanted done all along. Indeed, pragmatic tests are the only reasonable approach to testing language skills if we want to know how well the examinee can use the elements of the language in real-life communication contexts.

What pragmatic language tests accomplish is precisely what discrete point testers were hoping to do. The advantage that pragmatic tests offer to the classroom teacher and to the educator in general is that they are far easier to prepare than are tests of the discrete point type, and they are nearly certain to produce more meaningful and more readily interpretable results. We will see in Chapter 9 that the preparation and production of multiple choice tests is no simple task. We have already seen that the determination of how many items of certain types to include in discrete point tests poses intrinsically insoluble and pointless theoretical and practical mind-bogglers. For instance, how many vocabulary items should be included? Is tom as in tom cat worth including? What is the relative importance of vowel contrasts as compared against morphological contrasts (e.g., plural, possessive, tense marking, and the like)? Which grammatical points found in linguistic analyses should be found in language tests focussed on 'grammar'? What relative weights should be assigned to the various categories so determined? How much is enough to represent adequately the importance of determiners? Subject raising? Relativization? The list goes on and on and is most certainly not even close to being complete in the best analyses currently available.

The great virtue, the important insight of linguistic analysis, is in demonstrating that language consists of complicated sequences of elements, subsequences of sequences, and so forth. Further, linguistic research has helped us to see that the elements of language are to a degree analyzable. Discrete point theory tried to capitalize on this insight and pushed it to the proverbial wall. It is time now to reevaluate the results of the application. Recent research with pragmatic language tests suggests that the essential insights of discrete point theories can be more adequately expressed in pragmatic tests than in overly simplistic discrete point approaches which obliterate crucial properties of language in the process of taking it to pieces. The pieces should be observed, studied, taught, and tested (it would seem) in the natural habitat of discourse rather than in isolated sentences pulled out of the clear blue sky.
In Part Three we will consider ways in which the diagnostic information sought by discrete point theory in isolated items aimed at particular rules, words, sound contrasts and the like can much more sensibly be found in learner protocols related to the performance of pragmatic discourse processing tasks - where the focus is on communicating something to somebody rather than merely filling in some blank in some senseless (or nearly senseless) discrete item pulled from some strained test writer's brain. The reconciliation of discrete point theory with pragmatic testing is accomplished quite simply. All we have to do is acknowledge the fact that the elements of language are normally used in discourse for the purposes of communication - by the latter term we include all of the abstract, expressive, and poetic uses of language as well as the wonderfully mundane uses so familiar to all normal human beings.

KEY POINTS
1. Discrete point approaches to testing derive from discrete point approaches to teaching. They are mutually supportive.
2. Discrete point tests are supposed to provide diagnostic input to specific teaching remedies for specific weaknesses.
3. Both approaches stand or fall together. If discrete point tests cannot be shown to have substantial validity, discrete point teaching will necessarily be drawn into question.
4. Similarly, the validity of discrete point testing and all of its instructional applications would be drawn into question if it were shown that discrete point teaching does not work.
5. Discrete point teaching is a notorious failure. There is an almost complete scarcity of persons who have actually learned a foreign language on the basis of discrete point methods of teaching.
6. The premise of discrete point theories, that language can be taken to pieces and put back together in the curriculum, is apparently false.
7. Any discourse in any natural language is more than the mere sum of its analyzable parts. Crucial properties of language are lost when it is broken down into discrete phonemic contrasts, words, structures and the like.
8. Nonsense, of the sort recommended by some experts as a basis for discrete point test items, does not exhibit many of the pragmatic properties of normal sensible utterances in discourse contexts.
9. The trouble is that the lowest level units of discourse are involved in the production and interpretation of the highest level units. They cannot, therefore, be separated without obliterating the characteristic relationships between them.
10. No one can learn a language (or teach one) on the basis of the principles advocated by discrete point theorists.
11. Discrete point tests of posited components are often separated into the categories of phonological, lexical, and syntactic tests.
12. It can easily be shown that such tests, even though they are advocated as diagnostic tests, do not provide very specific diagnostic information at all.
13. Typically, discrete items in multiple choice format require highly artificial and unnatural distinctions among linguistic forms.
14. Further, when an attempt to contextualize the items is made, it usually falls flat because the contrast itself is an unlikely one in normal discourse (e.g., watching the baby versus washing the baby).
15. Discrete items offer a rich basis for confusion to any student who may already be having trouble with whatever distinction is required.
16. Pragmatic tests can be shown to do a better job of what discrete point testers were interested in accomplishing all along.
17. Pragmatic tests assess the learner's ability to use the 'elements' of language (whatever they may be) in the normal contexts of human discourse.
18. Moreover, pragmatic tests are superior diagnostic devices.

DISCUSSION QUESTIONS
1. Obtain a protocol (answer sheet and test booklet) from a discrete point test (sound contrasts, vocabulary, or structure). Analyze each item, trying to determine exactly what it is that the student does not know on each item answered incorrectly.
2. Repeat the procedure suggested in question 1, this time with any protocol from a pragmatic task for the same student. Which procedure yields more finely grained and more informative data concerning what the learner does and does not know? (For recommendations on particular pragmatic tests, see Part Three.)
3. Interpret the errors found in questions 1 and 2 with respect to specific teaching remedies. Which of the two procedures (or possibly, the several techniques) yields the most obvious or most useful extensions to therapeutic interventions? In other words, which test is most easily interpreted with respect to instructional procedures?
4. Is there any information yielded by discrete point testing procedures that is not also available in pragmatic testing procedures? Conversely, is there anything in the pragmatic procedures that is not available in the discrete point approaches? Consider the question of sound contrasts, word usages, structural manipulations, and communicative activities.
5. What is the necessary relationship between being able to make a particular sound contrast in a discrete item test and being able to make use of it in communication? How could we determine if a learner were not making use of a particular sound contrast in conversation? Would the discrete point item help us to make this determination? How? What about word usage? Structural manipulation? Rhythm? Intonation?
6. Take any discrete point test and analyze it for content coverage. How many of the possible sound contrasts does it test? Words? Structural manipulations? Repeat the procedure for a pragmatic task. Which procedure is more comprehensive? Which is apt to be more representative? Reliable? Valid? (See the Appendix on the latter two issues, also Chapter 3 above.)
7. Analyze the two tests from the point of view of the naturalness of what they require the learner to do with the language. Consider the implications, presuppositions, entailments, antecedents, and consequences of the statements or utterances used in the pragmatic context. For instance, ask what is implied by a certain form used and what it suggests which may not be stated overtly in the text. Do the same for the discrete point items.
8. Can the content of a pragmatic test be summarized? Developed? Expanded? Interpolated? Extrapolated? What about the content of items in a discrete point test? Which test has the richer forms, meaning-wise? Which forms are more explicit in meaning, more determinate? Which are more complex?

SUGGESTED READINGS
1. John L. D. Clark, Foreign Language Testing: Theory and Practice. Philadelphia: Center for Curriculum Development, 1972.
2. Robert Lado, Language Testing. London: Longman, 1961.
3. Rebecca Valette, Modern Language Testing. New York: Harcourt, 1977.

9 Multiple Choice Tests

A. Is there any other way to ask a question?
B. Discrete point and integrative multiple choice tests
C. About writing items
D. Item analysis and its interpretation
E. Minimal recommended steps for multiple choice test preparation
F. On the instructional value of multiple choice tests

The main purpose of this chapter is to clarify the nature of multiple choice tests - how they are constructed, the subjective decisions that go into their preparation, the minimal number of steps necessary before they can be reasonably used in classroom contexts, the incredible range and variety of tasks that they may embody, and finally, their general impracticality for day to day classroom application. It will be shown that multiple choice tests can be of the discrete point or integrative type or anywhere on the continuum in between the two extremes. Some of them may further meet the naturalness requirements for pragmatic language tests. Thus, this chapter provides a natural bridge between Part Two (contra discrete point testing) and Part Three (an exposition of pragmatic testing techniques).

A. Is there any other way to ask a question?
At a testing conference some years ago, it was reported that the following exchange took place between two of the participants. The speaker (probably John Upshur) was asked by a would-be discussant if multiple choice tests were really all that necessary. To which Upshur (according to Eugene Briere) quipped, 'Is there any other way to ask a question?' End of discussion. The would-be contender withdrew to the comfort and anonymity of his former sitting position.

When you think about it, conversations are laced with decision points where implicit choices are being constantly made. Questions imply a range of alternatives. Do you want to go get something to eat? Yes or no. How about a hamburger place, or would you rather have something a little more elegant? Which place did you have in mind? Are you speaking to me (or to someone else)? Questions just naturally seem to imply alternatives. Perhaps the alternatives are not usually so well defined as they are in multiple choice tests, and perhaps the implicit alternatives are not usually offered to confuse or trap the person in normal communication though they are explicitly intended for that purpose in multiple choice tests, but in both cases there is the fundamental similarity that alternatives (explicit or implicit) are offered. Pilate asked Jesus, 'What is truth?' Perhaps he meant, 'There is no answer to this question,' but at the same time he appeared to be interested in the possibility of a different view. Even abstract rhetorical questions may implicitly afford alternatives.

It would seem that multiple choice tests have a certain naturalness, albeit a strained one. They do in fact require people to make decisions that are at least similar in the sense defined above to decisions that people are often required to make in normal communication. But this, of course, is not the main argument in favor of their use. Indeed, the strain that multiple choice tests put on the flow of normal communicative interactions is often used as an argument against them. The favor that multiple choice tests enjoy among professional testers is due to their presumed 'objectivity', and concomitant reliability of scoring. Further, when large numbers of people are to be tested in short periods of time with few proctors and scorers, multiple choice tests are very economical in terms of the effort and expense they require.
The questions of validity posed in relation to language tests (or other types of tests) in general are still the same questions, and the validity requirements to be imposed on such tests should be no less stringent for multiple choice versions than for other test formats. It is an empirical question whether in fact multiple choice tests afford any advantage whatsoever over other types of tests. It is not the sort of question that can be decided by a vote of the American (or any other) Psychometric Association. It can only be decided by appropriate research (see the Appendix, also see Oller and Perkins, 1978).

The preparation and evaluation of specific multiple choice tests hinges on two things: the nature of the decision required by test items, and the nature of the alternatives offered to the examinee on each item. It is a certainty that no multiple choice test can be any better than the items that constitute it, nor can it be any more valid than the choices it offers examinees at requisite decision points. From this it follows that the multiple choice format can only be advantageous in terms of scoring and administrative convenience if we have a good multiple choice test in the first place. It will be demonstrated here that the preparation of sound multiple choice tests is sufficiently challenging and technically difficult to make them impracticable for most classroom needs. This will be accomplished by showing some of the pitfalls that commonly trap even the experts. The formidable technical problem of item analysis done by hand will be shown to all but completely eliminate multiple choice formats from consideration. Further, it will be argued that the multiple choice format is intrinsically inimical to the interests of instruction. What multiple choice formats gain in reliability and ease of administration, in other words, is more than used up in detrimental instructional effects and difficulty of preparation.

B. Discrete point and integrative multiple choice tests
In Chapter 8 above, we already examined a number of multiple choice items of a discrete point type. There were items aimed at phonological contrasts, vocabulary, and 'grammar' (in the rather narrow sense of surface morphology and syntax). There are, however, many item types that can easily be put into a multiple choice format, or which are usually found in such a format, but which are not discrete point items. For instance, what discrete elements of language are tested in a paraphrase recognition task such as the following? Match the given sentence with the alternative that most nearly says the same thing:

Before the turn of the century, the tallest buildings were rarely more than three storeys above ground (adapted from Heaton, 1975, p. 186).
A. After the turn of the century, buildings had more storeys above ground.
As we have noted before, the problem of what to say a test is a test of is principally an issue of test validity. It is an empirical question. What we can safely say on the basis of the item format alone is what the test requires the learner to do - or at least what it appears to require. Perhaps, therefore, it is best to call the item type a 'sentence ,paraphrase recognition' task. Thl!i>,J>y_na111ing the task rather than positing some abstract construct we avoid a priori validIty' commitments - that' is, we suspend judgement on the' vaildity' questions pending empirical investigatioIL Nevertheless, whatever we choose {oean the speciRc item type, it is clearly more at the integrative side of the continuum than at the discrete point end. There are many other types of multiple choice items that are integrative in nature. Consider the problem of selecting answers to questions based on a text. Such questions may focus on some detail of information given in the text, the general topic of the text, something implied by the text but not stated, the meaning of a particular word, phrase, or clause in the text, and so forth. For example, read the following text and then select the best answers to the questions that follow: Black Students in Urban Canada is an attempt to provide information to urban Canadians who engage in educational transactions with members of this ethnicity. Although the OISE conference did not attract educators from either west of Manitoba or from Newfoundland, it is felt that there is an adequate !ninimum of relevance such that concerned urban teachers from all parts of this nation may uncover something of profit (D'Oyley and Silverman, 1976, p. vi). I (1) This paragraph is probably A. an introduction to a Canadian novel. B. a recipe for transactional analysis. MULTIPLE CHOICE TESTS 235 C. a preface to a conference report. D: an epilog to ethnic studies. (2) T~e word ethnicity as used in the paragraph has to do WIth , A. sex. B. skin color. C. birthplace. D. all of the above. (3) The message of the paragraph is addressed to A. all educators. B. urban educators. C. urban Canadians involved in education. D. members of the ethnic group referred to. (4) The abbreviation OISE probably refers to the A. city in question. B. relevant province or state. C. journal that was published. D. sponsoring organization. (5) It is implied that the ethnic group in question lives in predominantly A. rural settings. B. suburban areas. C. urban settings. D. ghetto slums. (6) Persons attending the meetings referred to were apparently A. law enforcement officers. B. black students. C. educators. D. all of the above. The preceding item type is usually found in what is called a 'reading comprehension' test. Another way of referring to it is to say that it is a task that requires reading and answering questions -leaving open the question of what the test is a test of. It may, for instance, be a fairly good test of overall language proficiency. Or, it may be about as good a test oflistening comprehension as of reading comprehension. These possibililies-eanno.t be ruled out in advance on the basis of the superficial appearance of the test. Furthermore, it is certainly possible to change the nature of the ta,sk and make it into a listening and question answering problem. In fact, the only 10gicailiInits on the types of tests that !night be constructed in siInilar formats are whatever limitations exist on the creative imaginations of the test writer. 
They could be converted, for instance, to an open-ended format requiring spoken responses to spoken questions over a heard text. Not only is it possible to find many alternate varieties of multiple choice tests that are clearly integrative in nature, but it is quite possible to take just about any pragmatic testing technique and convert it to some kind of multiple choice format more or less resembling the original pragmatic technique. For example, consider a cloze test over the preceding text - or, say, the continuation of it. We might delete every fifth word and replace it with a field of alternatives as follows:

Black Students in Urban (1)_____ is an attempt to (2)_____ information to urban Canadians (3)_____ engage in educational transactions (4)_____ members of this ethnicity ...
(1) (A) Europe (B) America (C) New Guinea (D) Canada
(2) (A) find (B) provide (C) include (D) take
(3) (A) which (B) while (C) who (D) to
(4) (A) with (B) on (C) to (D) by

Bear in mind the fact that either a printed format (see Jonz, 1974, Porter, 1976, Hinofotis and Snow, 1977) or a spoken format would be possible (Scholz, Hendricks, Spurling, Johnson, and Vandenburg, in press). For instance, in a spoken format the text might be recorded as 'Black Students in Urban blank one is an attempt to blank two information to urban Canadians blank three engage in educational transactions blank four members of this ethnicity ....' The examinee might see only the alternatives for filling in the numbered blanks, e.g.,
(1) __ (A) Europe (B) America (C) New Guinea (D) Canada
(2) __ (A) find (B) provide (C) include (D) take
etcetera. To make the task a bit easier in the auditory mode, the recorded text might be repeated one or more times. For some exploratory work with such a procedure in an auditory mode see the Appendix, also see Scholz et al. (in press). For other suggestions for making the task simpler, see Chapters 10, 11, and 12 on factors that affect the difficulty of discourse processing tasks.

Once we have broached the possibility of using discourse as a basis for constructing multiple choice tasks, many variations on test item types can easily be conceived. For instance, instead of asking the examinee to select the appropriate continuation at a particular point in a text, he may be asked to select the best synonym or paraphrase for an indicated portion of text from a field of alternatives. Instead of focussing exclusively on words, it would be possible to use phrases or clauses or larger units of discourse as the basis for items. Instead of a synonym matching or paraphrase matching task, the examinee might be required to put words, phrases, or clauses in an appropriate order within a given context of discourse. Results from tasks of all these types are discussed in greater detail in the Appendix (also see the references given there). The important point is that any such tests are merely illustrative of a bare smattering of the possible types. The question concerning which of the possible procedures are best is a matter for empirical consideration. Present findings seem to indicate that the most promising multiple choice tasks are those that require the processing of fairly extended segments of discourse - say, 150 words of text or more. However, a note of caution should be sounded. The construction of multiple choice tests is generally a considerably more complicated matter than the mere selection of an appropriate segment of discourse.
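To make the mechanics of the conversion concrete, here is a minimal sketch in Python of how a fixed-ratio multiple choice cloze might be drafted. It is only an illustration, not a procedure from the testing literature: the deletion rate, the naive practice of drawing distractors from the passage itself, and the function name are all assumptions for the example.

    import random

    def draft_mc_cloze(text, every=5, n_distractors=3, seed=1):
        # Delete every nth word and pair each deletion with distractors
        # drawn (naively) from elsewhere in the same passage. A real test
        # writer would hand-edit every field of alternatives afterward.
        rng = random.Random(seed)
        words = text.split()
        mutilated, items = [], []
        for i, word in enumerate(words, start=1):
            if i % every == 0:
                pool = [w for w in set(words) if w != word]
                choices = rng.sample(pool, n_distractors) + [word]
                rng.shuffle(choices)
                items.append((len(items) + 1, word, choices))
                mutilated.append('(%d)_____' % len(items))
            else:
                mutilated.append(word)
        return ' '.join(mutilated), items

    passage = ('Black Students in Urban Canada is an attempt to provide '
               'information to urban Canadians who engage in educational '
               'transactions with members of this ethnicity.')
    blanks, items = draft_mc_cloze(passage)
    print(blanks)
    for number, answer, choices in items:
        print(number, choices, 'answer:', answer)

Distractors drawn blindly from the passage will often be either transparently wrong or accidentally acceptable, which is precisely why the editing and item analysis discussed in the sections that follow cannot be skipped.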
Although pragmatic tasks can with some premeditation and creativity be converted into a variety of multiple choice tests, the latter are scarcely as easy to use as the original pragmatic tasks themselves (see Part Three). In the next section we will consider some of the technical problems in writing items (especially alternatives).

C. About writing items
There are only a handful of principles that need to be grasped in writing good items, but there are a great many ways to violate any or all of them. The first problem in writing items is to decide what sort of items to write. The second problem is to write the items with suitable distractors in each set of alternatives. In both steps there are many pitfalls. Professionally prepared tests are usually based on explicit instructions concerning the format for items in each section of the test. Not only will the superficial lay-out of the items be described and exemplified, but usually the kinds of fair content for questions will also be more or less circumscribed, and the intended test population will be described so as to inform item writers concerning the appropriate range of difficulty of items to be included in each part of the test.¹

¹ John Bormuth (1970) has developed an extensive argument for deriving multiple choice items from curricula via explicit and rigorous linguistic transformations. The items in his methodology are directly tied to sentences uttered or written in the curriculum. The argument is provocative. However, it presupposes that the surface forms of the sentences in the curriculum are all that could be tested. Normal discourse processing, on the other hand, goes far beyond what is stated overtly in surface forms (see Frederiksen's recent articles and his references). Therefore, Bormuth's interesting proposal will not be considered further here. I am indebted to Ron Mackay (of Concordia University in Montreal) for calling Bormuth's argument to my attention.

Unfortunately, the questions of test validity are generally consigned to the statistician's department. They are rarely raised at the point of item writing. However, as a rule of thumb, all of the normal criteria for evaluating the validity of test content should be applied from the earliest stages of test construction. The first principle, therefore, would be to ask if the material to be included in items in the test is somehow related to the skill, construct, or curriculum that the test is supposed to assess or measure. Sad to say, many of the items actually included in locally prepared, teacher-made or multiple choice standardized tests are not always subjected to this primary evaluation. If a test fails this first evaluation, no matter how elegantly its items are constructed, it cannot be any better than any other ill-conceived test of whatever it is supposed to measure.

Assuming that the primary validity question has been properly considered, the next problem is to write the best possible items of the defined type(s). In some cases, it will not be necessary to write items from scratch, but rather to select appropriate materials and merely edit them or record them in some appropriate fashion. Let us assume that all test items are to be based on samples of realistic discourse. Arguments for this choice are given throughout the various chapters of this book. Other choices could be made - for instance, sentences in isolation could be used - but this would not change the principles directly related to the construction of items. It would merely change their fleshed-out realization in particular instances.

During the writing stage, each item must be evaluated for appropriateness of content. Does it ask for information that people would normally be expected to pay attention to in the discourse context in question? Is the decision that is required one that really seems to exercise the skill that the test as a whole is aimed at measuring? Is the correct choice really the best choice for someone who is good at the skill being measured (in this case, a good language user)? Are the distractors actually attractive traps for someone who is not so good at the skill in question? Are they well balanced in the sense of going together as a set? Do they avoid the inclusion of blatant (but extraneous) clues to the correct choice? In sum, is the item a well-conceived basis for a choice between clear alternatives?

In addition to insisting that the items of interest be anchored to realistic discourse contexts, on the basis of research findings presented elsewhere in this text (especially the Appendix), we will disregard many of the discrete point arguments of purity in item types. In other words, we will abandon the notion that vocabulary knowledge must be assessed as if it were independent of grammatical skill, or that reading items should not include a writing aspect, etc. All of the available empirical research seems to indicate that such distinctions are analytical niceties that have no fundamental factual counterparts in the variance actually produced by tests that are constructed on the basis of such distinctions. Therefore, the distinctions that we will make are principally in the types of tasks required of the learner - not in the hypothetical skills or constructs to be tested. For all of these reasons, we should also be clear about the fact that the construction of a multiple choice test is not apt to produce a test that is more valid than a test of a similar sort in some other format. The point in building a multiple choice test is to attain greater economy of administration and scoring. It is purely a question of practicality and has little or nothing to do with reliability and validity in the broader sense of these terms.

Most any discourse context can be dealt with in just about any processing mode. For instance, consider a breakfast conversation. Suppose that it involves what the various members of a family are going to do that day, in addition to the normal 'pass the salt and pepper' kind of conversation at breakfast. Nine-year-old Sarah spills the orange juice while Mr Kowolsky is scalding his mouth on boiling coffee and remonstrating that Mrs Kowolsky can't seem to cook a thing without getting it too hot to eat. Thirteen-year-old Samuel wants to know if he can have a dollar (make that five dollars) so he can see that latest James Bond movie, and his mother insists that he not forget the piano lesson at four, and to feed the cat ... It is possible to talk about such a context; to listen to talk about it; to read
It is possible to talk about a story, listen to a story, answer questions about a story, read a story, retell a story, write a story, and so forth. What kinds of limits should be set on the selection of materjals? Obviously, one would not ;want to select test material that would distract the test taker from the main job of selecting the appropriate choices of the multiple choice items presented. Therefore, supercharged topics about such things as rape, suicide, murder, and heinous crimes should probably be avoided along with esoteric topics oflimited interest such as highly technical crafts, hobbies, games, and the like (except, of course, in the very special case where the esoteric or super-charged topic is somehow central to the instructional goals to be assessed). Materials that state or imply moral, cultural, or racial judgements likely to offend test takers should also probably be avoided unless there is some specific reason for including them. Let us suppose that the task decided on is a reading and question answering type. Further, for whatever reasons, let us suppose that the following text is selected: Oliver Twist was born in a workhouse, and for a long time after his birth there was considerable doubt whether the child would live. He lay breathless for some time, rather unequally balanced between this world and the next. After a few struggles, however, he breathed, sneezed and uttered a loud cry. The pale face of a young woman lying on the bed was raised weakly from the pillow and in a faint voice she said, 'Let me "See the child and die.' 'Oh, y~u must not talk about dying yet,' said the doctor, as he rose from where he was sitting near the fire and advanced towards the bed. 'God bless her, no!' added the poor old pauper who was acting as nurse. The doctor placed the child in its mother's arms; she pressed her cold white lips on its forehead; passed her hands over her face; gazed wildly around, fell back - and died. 'It's all over,' said the doctor at last. 'Ah, poor dear, so it is!' said the old nurse. . 'She was a good-looking girl, too,' added the doctor: 'where did she come from?' r i I I I MULTIPLE CHOICE TESTS 241 'She was brought here last night,' replied the old woman. 'She was found lying in the street. She had walked some distance, for her shoes were worn to pieces; but nobody knows where she came from, or where she was going, nobody knows.' 'The old story,' said the doctor, shaking his head, as he leaned over the body, and raised the left hand; 'no weddingring, I see. Ah! Good night!' (Dickens, 1962, p. 1) In framing questions concerning such a text (or any other) the first thing to be considered is what the text says. What is it about? If it is a story, like this one, who is referred to in it? What are the important events? What is the connection between them? What is the relationship between the people, events, ,and states of affairs referred to? In other words, how do the surface forms in the text pragmatically map onto states of affairs (or facts, imaginings, etc.) which the text is about? The author of the text had to consider these questions (at least implicitly) the same as the reader, or anyone who would retell the story or,discuss it. Linguistically speaking, this is the problem of pragmatic mapping. Thus, a possible place to begin in constructing test items would be . with the topic. What is the text about? 
There are many ways of posing the question clearly, but putting it into a multiple choice format is a bit more complicated than merely asking the question. Here we are concerned with better and worse ways of forming such multiple choice questions. How should the question be put, and what alternatives should be offered as possible answers? Consider some of the ways that the question can be badly put: (1) The passage is about __' _ _ __ A. adoctor B. anurse C. an unmarried woman D. achild The trouble here is that the passage is in' fact about all of the foregoing, but is centered on none of them. If any were to be selected it would probably have to be the child, because we und~rstand from the first paragraph of the text that Oliver Twist is the child who is being born. Further, the attention of every person in the story is prim~ directed. to the· birth of this child. Even the mother is concerned merely to look at him before she dies. (2) A good title for this passage might be _ _ _ __ A. 'Too young to die.' B. 'A cross too heavy.' 242 LANGUAGE TESTS AT SCHOOL C. 'A child is born.' D. 'God bless her, no!' Perhaps the author has C in mind, but the basis for that choice is a bit obscure. Mter all, it isn't just any child; and neither is it some child of great enough moment to justify the generic sense of 'a child'. Now, consider a question that gets to the point: (3) The central fact talked about in the stor~ is _ _ _ __ A. the birth of Oliver Twist B. the death of an unwed mother C. an experience of an old doctor D. an old and common story Hence, the best choice must fit the facts well. It is essential that the correct answer be correct, and further that it be better than the other alternatives offered. Another common·problem in writing items arises when the writer selects facts that are in doubt on the basis of the given information and forces a choice between two or more possible alternatives. (4) When the author says that 'for a long time after his birth there was considerable doubt whether the child would live' he probably means that I A. the child was sickly for months or possibly years B. Oliver Twist did not immediately start breathing at birth C. the infant was born with a respiratory disease D. the child lay still without breathing for minutes after birth The trouble here is that the text does not give a sufficient basis for selecting between the alternatives. While it is possible that only B is correct, it is not impossible (on the basis of given information) that one of the other three choices is also correct. Therefore, the choice that is intended by the author to be the correct one (say, B) is not a very reasonable alternative. In fact, none of the alternatives is really a good choice in view of the indeterminacy of the facts. Hence,the facts ought to be clear on the basis of the text, or they should not be us~d as content for test items. Finally, once the factual content of the item is clear and after the correct alternative has been decided on, there is the matter of constructing suitable distractors, or incorrect alternatives. The distractors should not give away the correct choice or call undue attention to it. They should be similar in form and content to the correct choice and they should have a certain attractiveness about them. MULTIPLE CHOICE TESTS 243 For instance, consider the following rewrite of the alternatives offered for (3): A. the birth of Oliver Twist B. the death of the young unwed mother of Oliver Twist C. 
C. the experience of the old doctor who delivered Oliver Twist
D. a common story about birth and death among unwed mothers

There are several problems here. First, Oliver Twist is mentioned in all but one of the alternatives, thus drawing attention to him and giving a clue as to the correct choice. Second, the birth of Twist is mentioned or implied in all four alternatives, giving a second unmistakable clue as to the correct choice. Third, the choices are not well balanced - they become increasingly specific (pragmatically) in choices B and C, and then jump to a vague generality in choice D. Fourth, the correct choice, A, is the most plausible of the four even if one has not read the text.

There are several common ways item writers often draw attention to the correct choice in a field of alternatives without, of course, intending to. For one, as we have already seen, the item writer may be tempted to include the same information in several forms among the various alternatives. This highlights that information. Another way of highlighting is to include the opposite of the correct response. For instance, as alternatives to the question about a possible title for the text, consider the following:

A. 'The death of Oliver Twist.'
B. 'The birth of Oliver Twist.'
C. 'The same old story.'
D. 'God bless her, no!'

The inclusion of choice A calls attention to choice B and tends to eliminate the other alternatives immediately. The tendency to include the opposite of the correct alternative is very common, especially when the focus is on a word or phrase meaning:

(5) In the opening paragraph, the phrase 'unequally balanced between this world and the next' refers to the fact that Oliver appears to be _____
A. more alive than dead
B. more dead than alive
C. about to lose his balance
D. in an unpleasant mental state

To the test-wise examinee (or any moderately clever person), A and B are apt to seem more attractive than C or D even if the examinee has not read the original text about Twist.

Yet another way of cluing the test taker as to the appropriate choice is to make it the longest and most complexly worded alternative or the shortest and most succinct. We saw an example of the latter above with reference to item (3) where the correct choice was obviously the shortest and the clearest one of the bunch. Here is another case. The only difference is that now the longest alternative is the correct choice:

(6) The doctor probably tells the young mother not to talk about dying because _____
A. he doesn't think she will die
B. she is not at all ill
C. he wants to encourage her and hopes that she will not die
D. she is delirious

The tendency is to include more information in the correct alternative in order to make absolutely certain that it is in fact the best choice. Another motive is to make the distractors short to save time in writing the items.

Another common problem in writing distractors is to include alternatives that are ridiculous and often (perhaps only to the test writer) hilarious. After writing forty or fifty alternatives there is a certain tendency for the test writer to become a little giddy. It is difficult to think of distractors without occasionally coming up with a real humdinger. After one or two, the stage is set for a hilarious test, but hilarity is not the point of the testing and it may be deleterious to the validity of the test qua test. For instance, consider the following item where the task is to select the best paraphrase for the given sentence:

(7) The pale face of a young woman lying on the bed was raised weakly from the pillow and she spoke in a faint voice.
A. A fainting face on a pillow rose up from the bed and spoke softly to the young woman.
B. The pale face and the woman were lying faintly on the bed when she spoke.
C. Weakly from the pillow the pale face rose up and faintly spoke to the woman.
D. The woman who was very pale and weak lifted herself from the pillow and spoke.

Alternative B is distracting in more ways than one. Choice C continues the metaphor created, and neither is apt to be a very good distractor except in a hilarious diversionary sense. Without reading the given sentence or the story, choice D is the only sane alternative.

In sum, the common foul-ups in multiple choice item writing include the following: (1) Selecting inappropriate content for the item. (2) Failure to include the correct answer in the field of alternatives. (3) Including two or more plausible choices among the alternatives. (4) Asking the test taker to guess facts that are not stated or implied. (5) Leaving unintentional clues about the correct choice by making it either the longest or shortest, or by including its opposite, or by repeatedly referring to the information in the correct choice in other choices, or by including ridiculous alternatives. (6) Writing distractors that don't fit together with the correct choice - i.e., that are too general or too specific, too abstract or too concrete, too simple or too complex. These are only a few of the more common problems. Without doubt there are many other pitfalls to be avoided.

D. Item analysis and its interpretation
Sensible item analysis involves the careful subjective interpretation of some objective facts about the way examinees perform on multiple choice items. Insofar as all tests involve determinate and quantifiable choices (i.e., correct and incorrect responses, or subjectively determined better and worse responses), item analysis at base is a very general procedure. However, we will consider it here specifically with reference to multiple choice items and the very conveniently quantifiable data that they yield. In particular, we will discuss the statistics that usually go by the names of item facility and item discrimination. Finally, we will discuss the interpretation of response frequency distributions. We will be concerned with the meaning of the statistics, the assumptions on which they depend in order to be useful, and their computation. It will be shown that item analysis is generally so tedious to perform by hand as to render it largely impracticable for classroom use. Nevertheless, it will be argued that item analysis is an important and necessary step in the preparation of good multiple choice tests. Because of this latter fact, it is suggested that every classroom teacher and educator who uses multiple choice test data should know something of item analysis - how it is done, and what it means.

(i) Item facility. One of the basic item statistics is item facility (IF). It has to do with how easy (or difficult) an item is from the viewpoint of the group of students or examinees taking the test of which that item is a part.
The reason for concern with IF is very simple - a test item that is too easy (say, an item that every student answers correctly) or a test item that is too difficult (one, say, that every student answers incorrectly) can tell us nothing about the differences in ability within the test population. There may be occasions when a teacher in a classroom situation wants all of the students to answer an item (or all the items on a test) perfectly. Indeed, such a goal seems tantamount to the very idea of what teaching is about. Nevertheless, in school-wide exams, or in tests that are intended to reveal differences among the students who are better and worse performers on whatever is being tested, there is nothing gained by including test items that every student answers correctly or that every student answers incorrectly.

The computation of IF is like the computation of a mean score for a test, only the test in this case is a single item. Thus, an IF value can be computed for each item on any test. It is in each case like a miniature test score. The only difference between an IF value and a part score or total score on a test is that the IF value is based on exactly one item. It is the mean score of all the examinees tested on that particular item. Usually it is expressed as a percentage or as a decimal indicating the number of students who answered the item correctly:

IF = the number of students who answered the item correctly, divided by the total number of students

This formula will produce a decimal value for IF. To convert it to a percentage, we multiply the result by 100. Thus, IF is the proportion of students who answer the item in question correctly. Some authors use the term 'item difficulty', but this is not what the proportion of students answering correctly really expresses. The IF increases as the item gets easier and decreases as it gets more difficult. Hence, it really is an index of facility. To convert it to a difficulty measure we would have to subtract the IF from the maximum possible score on the item - i.e., 1.00 if we are thinking in terms of decimal values, and 100% if we are thinking in terms of percentage values. The proportion answering incorrectly should be referred to as the item difficulty. We will not use the latter notion, however, because it is completely redundant once we have the IF value.

By pure logic (or mathematics, if you prefer), we can see that the IF of any item has to fall between zero and one, or between 0% and 100%. It is not possible for more than 100% of the examinees to answer an item correctly, nor for fewer than 0% to do so. The worst any group can do on an item is for all of them to answer it incorrectly (IF = .00 = 0%). The best they can do is for all of them to answer it correctly (IF = 1.00 = 100%). Thus, IF necessarily falls somewhere between 0 and 1. We may say that IF ranges from 0 to 1. For reasons given above, however, an item that everyone answers correctly or incorrectly tells us nothing about the variance among examinees on whatever the test measures. Therefore, items falling somewhere between about .15 and .85 are usually preferred. There is nothing absolute about these values, but professional testers always set some such limits and throw away or rewrite items that are judged to be too easy or too difficult. The point of the test items is almost always to yield as much variance among examinees as possible.
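To make the computation concrete, the following is a minimal sketch in Python. The 0/1 scoring of responses and the sample data are invented for illustration; the .15 to .85 screen is the rule of thumb just mentioned.

    # A minimal sketch of the IF computation, assuming responses are
    # already scored 1 (correct) or 0 (incorrect). The data are invented.

    def item_facility(item_scores):
        """IF = number answering correctly / number tested."""
        return sum(item_scores) / len(item_scores)

    # Each row is one examinee; each column is one item.
    scores = [
        [1, 1, 0],
        [1, 0, 0],
        [1, 1, 1],
        [0, 1, 0],
    ]

    for i, item in enumerate(zip(*scores), start=1):
        if_value = item_facility(item)
        # Items outside the preferred .15-.85 band would normally be
        # rewritten or discarded.
        flag = "" if 0.15 <= if_value <= 0.85 else "  <- too easy or too difficult"
        print(f"Item {i}: IF = {if_value:.2f}{flag}")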
Items that are too easy or too difficult yield very little variance - in fact, the amount of meaningful variance must decrease as the item approaches an IF of 100% or 0%. The most desirable IF values, therefore, are those falling toward the middle of the range of possible values. IF values falling in the middle of the range guarantee some variance in scores among the examinees. However, merely obtaining variance is not enough. Meaningful variance is required. That is, the variance must be reliable and it must be valid. It must faithfully reflect variability among tested subjects on the skill or knowledge that the test purportedly measures. This is where another statistic is required for item evaluation.

(ii) Item discrimination. The fundamental issue in all testing and measurement is to discriminate between larger and smaller quantities of something, better and worse performances, success and failure, more or less of whatever one wants to test or measure. Even when the objective is to demonstrate mastery, as in a classroom setting where it may be expected that everyone will succeed, the test cannot be a measure of mastery at all unless it provides at least an opportunity for failure or for the demonstration of something less than mastery. Or to take another illustration, consider the case of engineers who 'test' the strength of a railroad trestle by running a loaded freight train over it. They don't expect the bridge to collapse. Nonetheless, the test discriminates between the criterion of success (the bridge holding up) and failure (the bridge collapsing). Thus, any valid test must discriminate between degrees of whatever it is supposed to measure. Even if only two degrees are distinguished - as in the case of mastery versus something less (Valette, 1977) - discrimination between those two degrees is still the principal issue.

In school testing where multiple choice tests are employed, it is necessary to raise the question whether the variance produced by a test item actually differentiates the better and worse performers, or the more proficient examinees as against the less proficient ones. What is required is an index of the validity of each item in relation to some measure of whatever the item is supposed to be a measure. Clearly, if different test items are of the same type and are supposed to measure the same thing, they should produce similar variances (see Chapter 3 above for the definition of variance, and correlation). This is the same as saying that the items should be correlated. That is, the people who tend to answer one of the items correctly should also tend to answer the other correctly, and the people who tend to answer the one item incorrectly should also tend to answer the other incorrectly. If this were so for all of the items of a given type we would take the degree of their correlation as an index of their reliability - or in some terminologies their internal consistency. But what if the items could be shown to correlate with some other criterion? What if it could be shown that a particular item, for instance, or a batch of items were correlated with some other measure of whatever the items purport to measure? In the latter case, the correlation would have to be taken as an index of validity - not mere internal consistency of the items.

What criterion is always available? Suppose we think of a test aimed at assessing reading comprehension. Let's say the test consists of 100 items. What criterion could be used as an index of reading comprehension against which the validity of each individual item could be assessed? In effect, for each subject who takes the test there will be a score on the entire test and a score on each item of the test. Presumably, if the subject does not answer certain items they will be scored as incorrect. Now, which would be expected to be a better measure of the subject's true reading comprehension ability, the total score or a particular item score? Obviously, since the total score is a composite of 100 items, it should be assumed to be a better (more reliable and more valid) index of reading comprehension than any single item on the test. Hence, since the total score is easily obtainable and always available on any multiple choice test, it is the usual criterion for assessing individual item reliabilities. Other criteria could be used, however. For instance, the items on one test could easily be assessed against the scores on some different test or tests purporting to measure the same thing. In the latter instance, the other test or tests would be used as bases for evaluating the validity of the items on the first test.

In brief, the question of whether an individual test item discriminates between examinees on some dimension of interest is a matter of both reliability and validity. We cannot read an index of item discrimination as anything more than an index of reliability, however, unless the criterion against which the item is correlated has some independent claims to validity. In the latter case, the index of item discrimination becomes an index of validity over and above the mere question of reliability. The usual criterion selected for determining item discrimination is the total test score. It is simply assumed that the entire test is apt to be a better measure of whatever the test purports to measure than any single test item by itself. This assumption is only as good as the validity of the total test score. If the test as a whole does not measure what it purports to measure, then high item discrimination values merely indicate that the test is reliably measuring something - who knows what. If the test on the whole is a valid measure of reading comprehension, on the other hand, the strength of each item discrimination value may be taken as a measure of the validity of that item. Or, to put the matter more precisely, the degree of validity of the test as a whole establishes the limitations on the interpretation of item discrimination values. As far as human beings are concerned, a test is never perfectly valid, only more or less valid within limits that can be determined only by inferential methods.

To return to the example of the 100 item reading comprehension test, let us consider how an item discrimination index could be computed. The usual method is to select the total test score as the criterion against which individual items on the test will be assessed. The problem then is to compute or estimate the strength of the correlation between each individual item on the test and the test as a whole. More specifically, we want to know the strength of the correlation between the scores on each item in relation to the scores on all the items. Since 100 correlations would be a bit tedious to compute, especially when each one would require the manipulation of at least twice as many scores as there are examinees (that is, all the total scores plus all the scores on the item in question), a simpler method would be desirable if we were to do the job by hand. With the present availability of computers, no one would be apt to do the procedure by hand any more, but just for the sake of clarity the Flanagan (1939) technique of estimating the correlation between the scores on each item and the score on the total test will be presented in a step by step fashion.

Prior to computing anything, the test of course has to be administered to a group of examinees. In order to do a good job of estimating the discrimination values for each test item, the selected test population (the group tested) should be representative of the people for whom the test is eventually intended. Further, it should involve a large enough number of subjects to ensure a good sampling of the true variability in the population as a whole. It would not make much sense to go to all the trouble of computing item discrimination indices on a 100 item test with a sample of subjects of less than, say, 25. Probably a sample of 50 to 100 subjects, however, would provide meaningful (reliable and valid) data on the basis of which to assess the validities of individual test items in relation to total test scores.

Once the test is administered and all the data are in hand, the first step is to score the tests and place them in order from the highest score to the lowest. If 100 subjects were tested, we would have 100 test booklets ranking from the highest score to the lowest. If scores are tied, it does not matter what order we place the booklets in for those particular cases. However, all of the 98s must rank ahead of all of the 97s, and so forth.

The next step (still following Flanagan's method) is to count off from the top down to the score that falls at the 72½ percentile. In the case of our data sample, this means that we count down to the student that falls at the 28th position down from the top of the stack of papers. We then designate that stack of papers that we have just counted off as the High Scorers. This group will contain approximately 27½% of all the students who took the test. In fact it contains the 27½% of the students who obtained the highest scores on the test.

Then, in similar fashion, we count up from the bottom of the booklets remaining in the original stack to position number 28 to obtain the corresponding group that will be designated Low Scorers. The Low Scorers will contain as near as we can get to exactly 27½% of the people who achieved scores ranking at the bottom of the stack. We now have distinguished between the 27½% (rounded off in this case to 28%) of the students who got the highest scores and the 27½% who got the lowest scores on the test.
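The ranking-and-counting procedure just described reduces to a few lines of code. The following is a minimal sketch; the record format is an assumption of the example, and ties are left in arbitrary order, as the text allows.

    # A minimal sketch of forming Flanagan's High and Low Scorer groups.
    # Each record is assumed to be (examinee_id, item_scores, total_score),
    # with item_scores a list of 0/1 values.

    def split_groups(records, proportion=0.275):
        """Rank by total score, then peel off the top and bottom 27.5%."""
        ranked = sorted(records, key=lambda r: r[2], reverse=True)
        n = round(len(ranked) * proportion)  # e.g. 28 booklets out of 100
        return ranked[:n], ranked[-n:]       # (High Scorers, Low Scorers)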
From what we already know of correlation, if scores on an individual item are correlated with the total score, it follows that for any item more of the High Scorers should get it right than of the Low Scorers. That is, the students who are good readers should tend to get an item right more often than the students who are not so good at reading. We would be disturbed if we found an item that good readers (High Scorers) tended to miss more frequently than weak readers (Low Scorers).

Thus, for each item we count the number of persons in the High Scorers group who answered it correctly and compare this with the number of persons in the Low Scorers group who answered it correctly. What we want is an index of the degree to which each item tends to differentiate High and Low Scorers the same as the total score does - i.e., an estimate of the correlation between the item scores and the total score. For each item, the following formula will yield such an index:

ID = the number of High Scorers who answered the item correctly, minus the number of Low Scorers who answered the item correctly, divided by 27½% of the total number of students tested

Flanagan showed that this method provides an optimum estimate of the correlation between the item in question and the total test. Thus, as in the case of product-moment correlation (see Chapter 3 above), ID can vary from +1 to -1. Further, it can be interpreted as an estimate of the computable correlation between the item and the total score. Flanagan has demonstrated, in fact, that the method of comparing the top 27½% against the bottom 27½% produces the best estimate of the correlation that can be obtained by such a method (better, for example, than comparing the top 50% against the bottom 50%, or the top third against the bottom third, and so on).

A specific example of some dummy data (i.e., made up data) for one of the above test items will show better how ID is computed in actual practice. Suppose we assume that item (3) above, based on the text about Oliver Twist, is the item of interest. Further, that it is one of 100 items constituting the reading comprehension test posited earlier. Suppose we have already administered the test to 100 examinees and we have scored and ranked them. After determining the High Scorers and the Low Scorers by the method described above, we must then determine how many in each group answered the item correctly. We begin by examining the answers to the item given by students in the High Scorers group. We look at each test booklet to find out whether the student in question got the item right or wrong. If he got it right, we add one to the number of students in the High Scorers group answering the item correctly. If he got it wrong, we disregard his score. Suppose that we find 28 out of 28 students in the High Scorers group answered the item correctly. Then, we repeat the counting procedure for the Low Scorers. Suppose that 0 out of 28 students in the Low Scorers group answered the item correctly. The ID for item (3) is equal to 28 minus 0, divided by 28, or +1. That is, in this hypothetical case, the item discriminated perfectly between the better and not-so-good readers. We would be inclined to conclude that the item is a good one.

Take another example. Consider the following dummy data on item (5) above (also about the Oliver Twist text). Suppose that 14 of the people in the High Scorers group and 14 of the people in the Low Scorers group answered the item correctly (as keyed, that is, assuming the 'correct' answer really is correct). The ID would be 14 minus 14, divided by 28, or 0/28 = 0. From this we would conclude that the item is not producing any meaningful variance at all in relation to the performance on the entire test.

Take one further example. Consider item (4) above on the Twist text. Let us suppose that all of the better readers selected the wrong answer - choice A. Further, that all of the poorer readers selected the answer keyed by the examiner as the correct choice - say, choice D. We would have an ID equal to 0 minus 28, divided by 28, or -1. From this we would be inclined to conclude that the item is no good. Indeed, it would be fair to say that the item is producing exactly the wrong kind of variance. It is tending to place the low scorers on the item into the High Scorers group for the total score, and the high scorers on the item are actually ending up in the Low Scorers group for the overall test.

From all of the foregoing discussion about ID, it should be obvious that high positive ID values are desirable, whereas low or negative values are undesirable. Clearly, the items on a test should be correlated with the test as a whole. The stronger the correlations, the more reliable the test; and to the extent that the test as a whole is valid, the stronger the correlations of items with total, the more valid the items must be. Usually, professional testers set a value of .25 or .35 as a lower limit on acceptable IDs. If an item falls below the arbitrary cut-off point set, it is either rewritten or culled from the total set of items on the test.
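Flanagan's formula, applied to the three dummy examples just worked through, can be sketched as follows; the figures are the made-up data from the text.

    # Flanagan's ID formula applied to the three dummy examples above.

    def item_discrimination(high_correct, low_correct, group_size):
        """ID = (High correct - Low correct) / (27.5% of those tested)."""
        return (high_correct - low_correct) / group_size

    print(item_discrimination(28, 0, 28))   # item (3): +1.0, a good item
    print(item_discrimination(14, 14, 28))  # item (5):  0.0, no meaningful variance
    print(item_discrimination(0, 28, 28))   # item (4): -1.0, the wrong kind of variance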
(iii) Response frequency distributions. In addition to finding out how hard or easy an item is, and besides knowing whether it is correlated with the composite of item scores in the entire test, the test author often needs to know how each and all of the distractors performed in a given test administration. A technique for determining whether a certain distractor is distracting any of the students or not is simply to go through all of the test booklets (or answer sheets) and see how many of the students selected the alternative in question. A more informative technique, however, is to see what the distribution of responses was for the High Scorers versus the Low Scorers, as well as for the group falling in between, call them the Mid Scorers. In order to accomplish this, a response frequency distribution can be set up as shown in Table 3 immediately below:

TABLE 3
Response Frequency Distribution Example One. Item (3)

                            A*    B    C    D    Omit
High Scorers (top 27½%)     28    0    0    0    0
Mid Scorers (mid 45%)       15   10   10    9    0
Low Scorers (low 27½%)       0    7    7    7    7

The table is based on hypothetical data for item (3) based on the Oliver Twist text above. It shows that 28 of the High Scorers marked the correct choice, namely A, and none of them marked B, C, or D, and none of them failed to mark the item. It shows further that the distribution of scores for the Mid group favored the correct choice A, with B, C, and D functioning about equally well as distractors. No one in the Mid group failed to mark the item. Finally, reading across the last row of data in the chart, we see that no one in the Low group marked the correct choice A, and equal numbers marked B, C, and D. Also, 7 people in the Low group failed to mark the item at all.

IF and ID are directly computable from such a response frequency distribution. We get IF by adding the figures in the column headed by the letter of the correct alternative, in this case A. Here, the IF is 28 plus 15 plus 0, or 43, divided by 100 (the total number of subjects who took the exam), which equals .43. The ID is 28 (the number of persons answering correctly in the High group) minus 0 (the number answering correctly in the Low group), which equals 28, divided by 27½% of all the subjects tested, or 28 divided by 28, which equals 1. Thus, the IF is .43 and the ID is 1. We would be inclined to consider this item a good one on the basis of such statistics. Further, we can see that all of the distractors in the item were working quite well. For instance, distractors B and C each pulled exactly 17 responses, and D drew 16. Thus, there would appear to be no dead wood among the distractors.
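The tabulation, and the derivation of IF and ID from it, can also be sketched in a few lines of code; the figures below simply reproduce the hypothetical Table 3 data for item (3), keyed A.

    # A minimal sketch of a response frequency distribution, reproducing
    # the hypothetical Table 3 data (item (3), key = 'A').

    from collections import Counter

    table = {
        "High (top 27.5%)": Counter(A=28),
        "Mid  (mid 45%)":   Counter(A=15, B=10, C=10, D=9),
        "Low  (low 27.5%)": Counter(B=7, C=7, D=7, Omit=7),
    }

    for group, counts in table.items():
        row = "  ".join(f"{alt}:{counts[alt]:2d}" for alt in "ABCD")
        print(f"{group:18s} {row}  Omit:{counts['Omit']}")

    # IF and ID fall out directly, as the text notes.
    n_tested = 100
    if_value = sum(c["A"] for c in table.values()) / n_tested       # .43
    id_value = (table["High (top 27.5%)"]["A"]
                - table["Low (low 27.5%)"]["A"]) / 28               # +1.0
    print(if_value, id_value)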
To see better what the response frequency distribution can tell us about specific distractors, let's consider another hypothetical example. Consider the data presented on item (4) in Table 4.

TABLE 4
Response Frequency Distribution Example Two. Item (4)

                            A    B    C    D*   Omit
High Scorers (top 27½%)     28   0    0    0    0
Mid Scorers (mid 45%)       15   15   0    14   0
Low Scorers (low 27½%)      0    0    0    28   0

Reading across row one, we see that all of the High group missed the item by selecting the same wrong choice, namely A. If we look back at the item we can find a likely explanation for this. The phrase 'for a long time after his birth' does seem to imply alternative A, which suggests that the child was sickly for 'months or possibly years'. Therefore, distractor A should probably be rewritten. Similarly, distractor A drew off at least 15 of the Mid group as well. Choice C, on the other hand, was completely useless. It would probably have changed nothing if that choice had not been among the field of alternatives. Finally, since only the low scorers answered the item correctly, it should undoubtedly be completely reworked or discarded.

E. Minimal recommended steps for multiple choice test preparation

By now the reader probably does not require much further convincing that multiple choice test preparation is no simple matter. Thus, all we will do here is state in summary form the steps considered necessary to the preparation of a good multiple choice test.
(1) Obtain a clear notion of what it is that needs to be tested.
(2) Select appropriate item content and devise an appropriate item format.
(3) Write the test items.
(4) Get some qualified person to read the test items for editorial difficulties of vagueness, ambiguity, and possible lack of clarity (this step can save much wasted energy on later steps).
(5) Rewrite any weak items or otherwise revise the test format to achieve maximum clarity concerning what is required of the examinee.
(6) Pretest the items on some suitable sample of subjects other than the specifically targeted group.
(7) Run an item analysis over the data from the pretesting.
(8) Discard or rewrite items that prove to be too easy or too difficult, or low in discriminatory power. Rewrite or discard non-functional alternatives based on response frequency distributions (a sketch of this screening step follows the list).
(9) Possibly recycle through steps (6) to (8) until a sufficient number of good items has been attained.
(10) Assess the validity of the finished product via some one or more of the techniques discussed in Chapter 3 above, and elsewhere in this book.
(11) Apply the finished test to the target population. Treat the data acquired on this step in the same way as the data acquired on step (6) by recycling through steps (7) to (10) until optimum levels of reliability and validity are consistently attained.
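As promised in step (8), here is a minimal sketch of that screening step, applying the rule-of-thumb limits mentioned earlier (IF between about .15 and .85, ID of at least .25); the item names and figures are the dummy values from the preceding examples.

    # Screening pretested items by the limits suggested in the text.
    # Input: (item, IF, ID) triples, e.g. from the computations above.

    def screen_items(analysis, if_band=(0.15, 0.85), min_id=0.25):
        keep, rework = [], []
        for name, if_val, id_val in analysis:
            ok = if_band[0] <= if_val <= if_band[1] and id_val >= min_id
            (keep if ok else rework).append(name)
        return keep, rework

    keep, rework = screen_items([("item (3)", 0.43, 1.0),
                                 ("item (4)", 0.42, -1.0)])
    print("keep:", keep)      # ['item (3)']
    print("rework:", rework)  # ['item (4)'] - rewrite or discard (step 8)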
In view of the complexity of the tasks involved in the construction of multiple choice tests, it would seem inadvisable for teachers with normal instructional loads to be expected to construct and use such tests for normal classroom purposes. Furthermore, it is argued that such tests have certain instructional drawbacks.

F. On the instructional value of multiple choice tests

While multiple choice tests have rather obvious advantages in terms of administrative and scoring convenience, anyone who wants to make such tests part of the daily instructional routine must be willing to pay a high price in test preparation and possibly genuine instructional damage. It is the purpose of the multiple choices offered in any field of alternatives to trick the unwary, ill-informed, or less skillful learner. Oddly, nowhere else in the curriculum is it common procedure for educators to recommend deliberate confusion of the learner - why should it be any different when it comes to testing? It is paradoxical that all of the popular views of how learning can be maximized seem to go nose to nose with both fists flying against the very essence of multiple choice test theory. If the test succeeds in discriminating among the stronger and weaker students, it does so by decoying the weaker learners into misconceptions, half-truths, and Janus-faced traps.

Dean H. Obrecht once told a little anecdote that very neatly illustrates the instructional dilemma posed by multiple choice test items. Obrecht was teaching acoustic phonetics at the University of Rochester when a certain student of Germanic extraction pointed out the illogicality of the term 'spectrogram' as distinct from 'spectrograph'. The student observed that a spectrogram might be like a telegram, i.e., something produced by the corresponding -graph, to wit a telegraph or a spectrograph. On the other hand, the student noted, the 'spectrograph' might be like a photograph, for which there is no corresponding photogram. 'Now which,' asked the bemused student, 'is the machine and which is the record that it produces?' Henceforth, Dr. Obrecht often complained that he could not be sure whether it was the machine that was the spectrograph, or the piece of paper.

What then is the proper use of multiple choice testing? Perhaps it should be thoroughly re-evaluated as a procedure for educational applications. Clearly, it has limited application in classroom testing. The tests are difficult to prepare. Their analysis is tedious, technically formidable, and fraught with pitfalls. Most importantly, the design of distractors to trick the learners into confusing dilemmas is counterproductive. It runs contrary to the very idea of education. Is this necessary?

In the overall perspective, multiple choice tests afford two principal advantages: ease of administration and scoring. They cost a great deal, on the other hand, in terms of preparation and counterproductive instructional effects. Much research is needed to determine whether the possibly contrary effects on learning can be neutralized or even eliminated if the preparation of items is guided by certain statable principles - for instance, what if all of the alternatives were set up so that only factually incorrect distractors were used? It might be that some types of multiple choice items (perhaps the sort used in certain approaches to programmed instruction) are even instructionally beneficial. But at this point, the instructional use of multiple choice formats is not recommended.
KEY POINTS
1. There is a certain strained unnaturalness about multiple choice test formats, inasmuch as nowhere in ordinary discourse does there seem to be any such way of asking a question.
2. However, the main argument in favor of using multiple choice tests is their supposed 'objectivity' and their ease of administration and scoring.
3. In fact, multiple choice tests may not be any more reliable or valid than similar tests in different formats - indeed, in some cases, it is known that the open-ended formats tend to produce a greater amount of reliable and valid test variance, e.g., ordinary cloze procedure versus multiple choice variants (see Chapter 12, and the Appendix).
4. Multiple choice tests may be discrete point, integrative, or pragmatic - there is nothing intrinsically discrete point about a multiple choice format.
5. Pragmatic tasks, with a little imagination and a lot of work, can be converted into multiple choice tests; however, the validity of the latter tests must be assessed in all of the usual ways.
6. If one is going to construct a multiple choice test for language assessment, it is recommended that the test author begin with a discourse context as the basis for test items.
7. Items must be evaluated for content, clarity, and balance among the alternatives they offer as choices.
8. Each set of alternatives should be evaluated for clarity, balance, extraneous clues, and determinacy of the correct choice.
9. Texts, i.e., any discourse based set of materials, that discuss highly technical, esoteric, super-charged, or otherwise distracting content should probably be avoided in most instances.
10. Among the common pitfalls in item writing are selecting inappropriate content; failure to include a thoroughly correct alternative; including more than one plausible alternative; asking test takers to guess facts not stated or implied; leaving unintentional clues as to the correct choice among the field of alternatives; making the correct choice the longest or shortest; including the opposite of the correct choice among the alternatives; repeatedly referring to information in the correct choice in other alternatives; and using ridiculous or hilarious alternatives.
11. Item analysis usually involves the examination of item facility indices, item discrimination indices, and response frequency distributions.
12. Item facility is simply the proportion of students answering the item correctly (according to the way it was keyed by the test item writer).
13. Item discrimination by Flanagan's method is the number of students in the top 27½% of the distribution (based on the total test scores) answering the item correctly, minus the number of students in the bottom 27½% answering it correctly, all divided by the number corresponding to 27½% of the distribution.
14. Item discrimination is an estimate of the correlation between scores on a given item considered as a miniature test, and scores on the entire test. It can also be construed in a more general sense as the correlation between the item in question and any criterion measure considered to have independent claims to validity.
15. Thus, ID is always a measure of reliability and may also be taken as a measure of validity in the event that the total test score (or other criterion) has independent claims to validity.
16. Response frequency distributions display alternatives against groups of respondents (e.g., high, mid, and low).
They are helpful in eliminating non-functional alternatives, or misleading alternatives that are trapping the better students.
17. Among the minimal steps for preparing multiple choice tests are the following: (1) clarify what is to be tested; (2) decide on the type of test to be used; (3) write the items; (4) have another person read and critique the items for clarity; (5) rewrite weak items; (6) pretest the items; (7) item analyze the pretest results; (8) rewrite bad items; (9) recycle steps (6) through (8) as necessary; (10) assess the validity of the product; (11) use the test and continue recycling of steps (7) through (10) until sufficient reliability and validity are attained.
18. Due to the complexity of the preparation of multiple choice tests, and to their lack of instructional value, they are not recommended for classroom applications.
19.