The meanings and consequences of educational assessments
Dylan Wiliam
Critical Quarterly, 42(1), pp 105-127 (2000)

Introduction and overview

The reasons for carrying out educational assessments can be grouped under three broad headings:

formative     supporting learning
summative     certifying individuals
evaluative    holding educational institutions accountable

In this lecture, I want to argue that current policies with regard to the use of educational assessments, particularly those used in schools, have taken a wrong turn. They have started from the idea that the primary purpose of educational assessment is selecting and certifying the achievement of individuals (ie summative assessment)—and have tried to make assessments originally designed for this purpose also provide information with which educational institutions can be made accountable (evaluative assessment). Educational assessment has thus become divorced from learning, and the huge contribution that assessment can make to learning (ie formative assessment) has been largely lost. Furthermore, as a result of this separation, formal assessment has focused just on the outcomes of learning, and because of the limited amount of time that can be justified for assessments that do not contribute to learning, has assessed only a narrow part of those outcomes. The predictability of these assessments allows teachers and learners to focus on only what is assessed, and the high stakes attached to the results create an incentive to do so. This creates a vicious spiral in which only those aspects of learning that are easily measured are regarded as important, and even these narrow outcomes are not achieved as easily as they could be, or by as many learners, were assessment regarded as an integral part of teaching.

In place of this vicious spiral, I propose that developing a system that integrates summative and formative assessment will improve both the quality of learning and the quality of the assessment. A separate system, relying on ‘light sampling’ of the performance of schools, would provide stable and robust information for the purposes of accountability and policy-formation. I begin with a survey of current practices in national assessment, focusing in turn on the two key issues of reliability and validity.

Reliability

All measurements, whether physical, psychological or educational, are unreliable. If we wanted to find out how accurate an instrument for measuring length was, one way to do that would be to take lots of measurements of the same length. These measurements would not all be the same, but we would expect them to cluster around a particular value, which we would regard as the best estimate of the actual length of whatever we were measuring. However, in doing this, we are assuming that the object we are measuring isn’t changing length between our measurements. In other words, we assume that the differences between our measurements are ‘errors of measurement’. This is a circular argument—we infer physical laws by ignoring variation in measurements that we assume to be irrelevant, because of our physical law (Kyburg, 1992). If our measurements cluster tightly together, we conclude that we have a reliable instrument, and if they are widely scattered, then we conclude that our instrument is unreliable. In the same way, if we wanted to find out the reliability of an educational test, we would test a group of students with the same (or a similar) test many times.
For each candidate, the marks obtained would cluster around a particular score, which we call the ‘true score’ for that candidate. This does not, of course, mean that we think that the individual being tested has a true ability or anything like that—the idea of a true score is just the long-run average of all the different scores if we test the same individual lots of times. This is the same as the circular argument we have to use in physical measurement to ‘pull ourselves up by our own bootstraps’. Now if the test is a good test, the values will cluster together tightly—the score that the candidate gets on any one occasion will be close, if not identical, to the score obtained on a second occasion. A bad test, on the other hand, will produce values that vary widely, so that, to a very real extent, the mark obtained on one occasion is not a reliable guide to what the candidate would achieve on a subsequent occasion.

The reliability of a test is defined as a kind of ‘signal to noise’ ratio. The mark of any individual on a particular test is assumed to be made up of a signal (the true score) and some noise (the error), and of course, we want the ratio of signal to noise to be as large as possible. This can be achieved in two ways. The best way would be to reduce the ‘noise’, by reducing the error in the test scores. However, we can also improve the signal to noise ratio by increasing the strength of the signal. In the context of educational assessment, ‘increasing the signal’ entails making the differences between individuals as large as possible (in the same way that in communications engineering, say, increasing the signal would correspond to maximising the potential difference between presence and absence of signal in a wire). We want some students getting very high scores and others getting very low scores, so that the differences in test scores caused by the unreliability of the test are small compared to differences in the ‘true’ scores. This means that our test must discriminate between stronger and weaker candidates. To do this, we must select for our test items which are answered correctly only by those candidates who get high scores on the test overall. Other items, for which the success-rate of strong candidates is comparable to that for weaker candidates, are dropped because they do not discriminate.

This has two consequences. The first is that choosing only what are regarded as ‘good’ items in this way invalidates all the statistical theories that are used to interpret test results, as was noted by Jane Loevinger over thirty years ago:

Here is an enormous discrepancy. In one building are the test and subject matter experts doing the best they can to make the best possible tests, while in a building across the street, the psychometric theoreticians construct test theories on the assumption that items are chosen by random sampling. (Loevinger, 1965 p147)

The second consequence is that our attempts to make reliable tests guarantee the production of tests that maximise differences between individuals, and minimise differences between their experiences. This is why, for example, many studies of school effectiveness have found that schools have comparatively little effect on educational achievement. In these studies, educational achievement has been assessed with a test that was designed to maximise differences between individuals, irrespective of the differences in their experiences. Such tests are bound to minimise the differences between schools.
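This ‘signal to noise’ view can be made concrete with a minimal sketch in Python (the numbers below are purely illustrative assumptions). Under the classical model an observed mark is a true score plus an error, and the reliability coefficient is the ratio of true-score variance to observed-score variance; holding the error fixed and spreading the candidates out raises the coefficient, which is why test constructors favour items that discriminate.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimated_reliability(true_sd, error_sd, n_students=100_000):
    """Classical test theory: observed = true + error.
    Reliability is the ratio of true-score variance to observed-score
    variance (the 'signal to noise' ratio described in the text)."""
    true = rng.normal(50, true_sd, n_students)    # the 'signal'
    error = rng.normal(0, error_sd, n_students)   # the 'noise'
    observed = true + error
    return np.var(true) / np.var(observed)

# Same amount of error in every case; only the spread of true scores changes.
for true_sd in (5, 10, 15):
    r = estimated_reliability(true_sd, error_sd=6)
    print(f"true-score SD {true_sd:2d}, error SD 6 -> reliability ~ {r:.2f}")
```

With a true-score spread of 15 marks and an error of 6 marks, the coefficient comes out at roughly 0.86, close to the figure of around 85% quoted below for educational tests.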
The elevation of reliability as the primary criterion of quality is itself a value-laden assumption which has its roots in particular applications of educational assessment. As Cleo Cherryholmes remarks, “Constructs, measurements, discourses and practices are objects of history” (Cherryholmes, 1989 p115). The requirement for reliability, originally intended to ensure that the meaning of an assessment result was stable, turns out to create, and reify, constructs—constructs that are generally assumed to be assessed in a value-free way.

The reliability of educational assessments

Educational assessments are unreliable for a number of reasons. The individuals being tested are not consistent in their performance—people have ‘good days’ and ‘bad days’—and, apart from multiple-choice tests, there is also some inconsistency in the ways that assessments are marked. But the most significant cause of unreliability is the actual choice of items for a particular test. If we have an annual assessment like the national curriculum tests or GCSE examinations, then, because the papers are not kept secret, new versions have to be prepared each year. The question then is: are the tests interchangeable? The two tests might be assessing broadly the same thing, but one of the two tests might suit a particular candidate better than the other. We therefore have a situation where the scores that candidates get depend on how lucky they are—reputedly the method that Napoleon used to choose his generals, and look where that got him! In fact, as Robert Linn and his colleagues have shown, the unreliability caused by the variation in tasks is actually greater than that caused by disagreements amongst markers (Linn & Baker, 1996). What this means is that, given a choice between a three-hour exam where each paper is marked by two markers, and six hours of exams where each paper is marked only once, the latter produces greater reliability.

So, given these kinds of unreliability, how accurate are the results of educational tests? The reliability of specialist psychological tests can be as high as 95%, while the reliability of examinations and tests used in education is typically of the order of 85%. Unfortunately, what this actually means in practice is not at all straightforward, so I will illustrate the consequences of the unreliability of typical educational tests. An educational test will generally be pitched so as to have an average mark around 50%, with the marks of the candidates ranging from about 20% to 80%. Let us consider what this means for a class of 30 children. For half the children in the class, the mark that they get on any one occasion will be within 4% of their ‘true score’—that is, the long-run average of what they would get over many testing occasions. This is quite reassuring, but the corollary of this is that for the other half of the students in the class, the mark they actually get is ‘wrong’ by more than 4 marks. And for one student in the class, the mark they get will be wrong by more than 12 marks. The student probably wouldn’t mind if they get 12 marks more than they should have got, but it is equally likely that this error is the other way, and they get 12 marks less than they should have done. Of course, the student won’t know this because they don’t know what their true score is. Now does this matter?
Well, although a score 12 marks below what one should have got is very unfortunate for the individual concerned, if only one person in a class of thirty is seriously disadvantaged by an assessment process, we might consider this a price worth paying. However, it turns out that serious disadvantage in educational assessment is not that uncommon. The government has never published reliability statistics for any of its statutory national curriculum assessments, nor has it required the examination boards to publish statistics on the reliability of GCSE and A-level examinations. Indeed, one of the most remarkable features of the examination system of England and Wales is that relatively few reliability studies have been conducted. Those that have been carried out have found the reliability of educational assessments to be around 85%.

The fact that educational assessments are unreliable is accepted to an extent in this country. We are suspicious of percentage scores, and prefer, instead, to report grades in the case of school examinations and classes in the case of undergraduate degrees. And in a way this is very sensible, because although, on balance, it is likely that someone who got 65% on a test has a higher ‘true score’ than someone who got 64%, it could easily be the other way round, and even if the first person does have a higher true score, they are unlikely to be that much better. In response to the danger of claiming ‘spurious precision’ for scores, the tendency in the UK has therefore been to report not scores but grades (in school examinations) or classes (in university examinations). However, the result of reporting scores as grades is that it is too often assumed that the grades or classes are ‘right’. The grades or classes reported are likely to be ‘right’ when the score that an individual receives is right in the middle of the range for a particular grade or class, but when someone is close to the borderline between two grades, only a small error in their score will tip them over into a different grade. For example, if a university decides that candidates need a particular pattern of A-level grades—say three Bs—to benefit from a programme, then a student who gets two Bs and a C may well be rejected. If the cut-off for a grade B was 60%, and that for a grade A was 70%, then a student with marks of 60, 60 and 60 would get three Bs and would get in, but a student getting marks of 59, 69 and 69 would get two Bs and a C, and would probably be rejected. Had the admissions tutor been told the actual scores, it would have been clear that the candidate had only just missed the threshold for a B in the first subject, and, given the other scores, the tutor may well have admitted her. If scores and percentages are prone to spurious precision, then grades are prone to spurious accuracy.

So, how accurate are reported grades? Well, because of the lack of published statistics on educational assessments, we have to make some assumptions, but assuming that examinations have a reliability of 85%, and assuming that we use these examinations to allocate students to one of eight grades (as is the case in GCSE), then only about 60% of the candidates would get the ‘right’ grade. Of the remaining 40%, half would get a higher grade than they should, and half would get a lower grade than they should, and for a small number of candidates, their reported grade will be out by two grades.
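The figures quoted above can be checked with a short simulation. This sketch assumes, purely for illustration, a reliability of 0.85 and marks spread with a standard deviation of about 15, which implies a standard error of measurement of a little under 6 marks.

```python
import numpy as np

rng = np.random.default_rng(1)

reliability = 0.85    # typical figure for educational tests (see text)
observed_sd = 15.0    # assumed spread of marks (roughly 20% to 80%)
error_sd = observed_sd * np.sqrt(1 - reliability)   # standard error of measurement, ~5.8 marks

errors = rng.normal(0, error_sd, 1_000_000)          # measurement errors for many 'students'

print(f"standard error of measurement ~ {error_sd:.1f} marks")
print(f"proportion within 4 marks of their true score: {np.mean(np.abs(errors) <= 4):.2f}")
print(f"proportion more than 12 marks out:             {np.mean(np.abs(errors) > 12):.3f}")
# About half the students are within 4 marks of their true score, and a few
# per cent (roughly one student in a class of 30) are out by more than 12 marks.
```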
For school tests and examinations, the government’s response to this has been to abandon the attempt to distinguish between eight different levels of performance in one examination, and instead have different examinations for students of different levels of achievement. In mathematics at GCSE, for example, there are three tiers of examination, each of which gives access to only four or five grades. With tiering, the number of candidates getting the ‘wrong’ grade will be reduced, but only at the cost of restricting the grades available to candidates. For example, candidates who take the least demanding tier cannot get a good grade (ie grade ‘C’ or higher) no matter how well they do. This places a great deal of pressure on teachers to make the right entry choice, and produces considerable alienation amongst the students whose potential achievement is restricted in this way. The important point here is that these difficulties arise because of a fundamental limitation in the accuracy we can expect of traditional timed examinations. The introduction of ‘tiering’ does increase the proportion of candidates correctly classified, but only at the cost of mis-classifying others, and more importantly, alienating a far greater number of students, who, because of this alienation, may well give up and fail to achieve the grades of which they are capable—a cure that is probably worse than the disease.

This debate about the costs and benefits of tiering has been conducted largely within the professional and academic community. Given the importance attached by the public to these results, the absence of any public concern about the lack of information about the reliability of educational assessments is rather puzzling. In this country, opinion pollsters routinely publish margins of error for their poll results, and in the United States, it is expected that any user of test information will know the limits of the test result they are using. And yet, in this country, we treat test results as perfectly accurate. Why is there no measurement error in the UK?

One perspective on this is provided by the work of J. L. Austin, who, in his 1955 William James lectures, discussed two different kinds of ‘speech acts’—illocutionary and perlocutionary (Austin, 1962). Illocutionary speech acts are those that by their mere utterance actually do what they say. In contrast, perlocutionary speech acts are speech acts about what has been, is or will be. For example, the verdict of a jury in a trial is an illocutionary speech act—it does what it says, since the defendant becomes innocent or guilty simply by virtue of the announcement of the verdict. Once a jury has declared someone guilty, they are guilty, whether or not they really committed the act of which they are accused, until that verdict is set aside by another (illocutionary) speech act. Another example of an illocutionary speech act is the wedding ceremony, where the speech act of one person (the person conducting the ceremony saying “I now pronounce you husband and wife”) actually does what it says, creating the ‘social fact’ of the marriage (Searle, 1995). The idea of the accuracy of an assessment derives from a view of assessment results as perlocutionary speech acts. If we claim to be describing someone’s performance now, in the past, or predicting their performance in the future, then it makes sense to ask how accurate that description is, which often raises questions of objectivity and subjectivity.
However, while it may make sense to question the authority of a maker of a speech act to create social facts, it does not make any sense to question the accuracy of those speech acts. This point is well illustrated by the story of the journalist asking an American baseball umpire whether his judgements were subjective or objective:

Interviewer: Did you call them the way you saw them, or did you call them the way they were?
Umpire: The way I called them was the way they were.

Rightly or wrongly, in the United Kingdom, at the moment, the pronouncements of the government’s testing agencies are treated as illocutionary speech acts, creating the social fact of an individual’s success or failure. The grade you get is the grade you get, and arguing about the likely effect of measurement error will do you no more good than claiming to an umpire that you weren’t out—he’ll just tell you to look in the newspaper tomorrow... Such an authoritarian stance is tenable in a stable social order, but at a time when the authority of professionals is (in my view rightly) open to challenge, it is a dangerous tactic. The reliability of our national assessments is simply not good enough to warrant the trust that is placed in them. And one day, people are going to find this out.

What is perhaps even more surprising is that reliability (ie how accurately we are measuring something) has been a priority in the development of our national assessment systems and has arguably taken precedence over questions of validity (ie whether we are measuring the right thing). This is perhaps best summed up by the story of the drunk looking for his keys at night under a streetlamp. When asked, “Is this where you dropped them?”, he replies, “No, but this is where the light is”.

Validity

Instead of asking how accurately we are measuring something, a concern for validity asks what, exactly, we are measuring—specifically, what do the results of our assessments mean? The traditional definition of validity—and one that dates back at least sixty years—is that an assessment is valid if it assesses what it purports to assess. This definition is still common in many texts on assessment in this country even though it is unsatisfactory in many respects. In the first place, validity cannot be a property of an assessment. If we have a science test that happens to be written in a way that requires a high level of reading skill to discover what the questions are asking, then this test may well be a good science test for fluent readers, but it will not be a good test for poor readers. In other words, the validity of a test can change according to who takes the test. In the second place, an assessment does not purport anything—it tests simply what it tests. The purporting is done by those who claim that a particular test result tells us something beyond just the result of that test. This is the fundamental issue in educational assessment—how we can move from a candidate’s score on a particular assessment to making more general claims. This is why it has become increasingly accepted over the last thirty years that validity is not a property of a test at all, but a property of the conclusions that we draw on the basis of test results. This marks a huge shift, because it transfers some, if not the majority, of the responsibility for establishing validity from those who make tests to those who draw specific conclusions about the meaning of test results.
In the words of Lee Cronbach, “One validates, not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971 p447, emphasis in original). For example, with the traditional view of validity, the responsibility for the validation of A-level examinations would fall on the examination boards. It would be up to them to show that the exams did actually assess ‘Physics’ or ‘English Literature’. This would involve showing that the examinations did assess the syllabuses published by the examination board. However, we also use examination results in other ways. Rather than interpreting examination results to tell us how well students have done in the past, we also use them to attempt to predict how well they will do in the future. Universities want to select students who will do well at university, and of course we can’t really find this out until the students have actually been to university. What we can do, however, is to look for something that correlates well with the outcomes of university education, but which can be assessed before students go to university. For most students, the measure that is used is A-level, but with this new conception of validity, universities who wish to use A-level scores for deciding which students they admit must provide evidence that the use of A-levels in this way is warranted. Within this view, validity subsumes reliability, because any conclusions we might want to draw from a set of test results are unlikely to be justifiable if the same test administered to the same candidate could generate a completely different result tomorrow.

The use of tests to predict future performance is quite widespread. In some local authorities, scores on 11+ tests are used to predict the capability of learners to “benefit from a grammar school education”, GCSE scores are used to select which students should go on to do A-level, and A-level scores are used to predict who should go on to university. The extent to which an assessment can be used to predict future performance is usually expressed by a correlation—a good predictor is one where students getting high scores on the predictor go on to get good scores at the next level. Again, studies of how good these predictors are remain few and far between, but correlations around 70% are typical. What this means in practice depends on how selective we are being, but if we are selecting around one-third of the individuals applying, then with a typical selection test, we would only be making the right decision for around three-quarters of the people. For the other quarter of the population, we would either be taking those who we shouldn’t, or not taking those who we should. Given the inaccuracy of this selection process, we ought to think very hard about whether we need to select at all.

The same issues apply to the idea of ‘targeting’ particular students, which has become very popular in many secondary schools in England and Wales. Because the results of IQ tests taken at the age of 11 are correlated with GCSE scores at the age of 16, schools believe they can identify the particular students who they can expect to achieve the government’s key performance indicator for secondary schools—five ‘good’ grades at GCSE. When a large proportion of these ‘targeted’ students achieve the grades expected, the school believes that its targeting is working. However, since the correlation between IQ at 11 and GCSE scores at age 16 is only around 70%, a large number of students who could have achieved those grades, had they received the extra support given to ‘targeted’ students, do not. When predictor and outcome variables have a correlation of only 70%, the important point is that it is still all to play for. The technology of selection and targeting is not reliable or valid enough for the claims that are made for them. We simply do not know who has the potential to do well, and therefore, what we need are inclusive systems of education that allow all students to be targeted.
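The selection arithmetic can be illustrated with a simulation that assumes, purely for illustration, that the predictor and the outcome are jointly normal with a correlation of 0.7. Selecting the top third on the predictor, we can count how often the decision agrees with what the outcome would later justify.

```python
import numpy as np

rng = np.random.default_rng(2)

r = 0.7                       # assumed predictor-outcome correlation (e.g. selection test vs later result)
n = 1_000_000
cov = [[1, r], [r, 1]]
predictor, outcome = rng.multivariate_normal([0, 0], cov, size=n).T

selected = predictor >= np.quantile(predictor, 2 / 3)    # take the top third on the predictor
deserving = outcome >= np.quantile(outcome, 2 / 3)       # those who 'should' have been taken

# A decision is 'right' if someone is either taken and deserving, or neither.
correct = np.mean(selected == deserving)
print(f"decisions that turn out to be 'right': {correct:.2f}")   # about three-quarters
```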
Now the analysis so far has depended on traditional ideas of reliability and validity—ones that are, to all intents and purposes, directly borrowed from mainstream psychological testing. With educational assessments, however, we cannot ignore the fact that these assessments are carried out by, on and for real people, who change what they do as a result of the assessment. It is for this reason that Samuel Messick proposed in 1980 that a consideration of the consequences of the use of educational assessments should be part of the process of validation.

For example, most science teachers agree that practical skills are an important part of the content of a science curriculum. An assessment of ‘science’, therefore, ought to assess practical skills as well as more traditional forms of scientific knowledge and capability. However, testing practical skills is expensive, and those concerned with the efficiency of the assessments point out that the results of the practical and written tests correlate very highly, so there’s no need to carry on with the expensive practical testing. The same sorts of arguments have dominated the debate in the United States between multiple-choice and constructed-response tests. What then happens is that the practical aspects of science are dropped from the assessment. The consequence of this, for the domain of school science, is to send the message that practical science isn’t as important as the written aspects. The social consequence of this is that teachers, understandably anxious to get their students the best possible results in the assessment, not least because of its influence over the students’ future career prospects, place less emphasis on the practical aspects of science. Because teachers are no longer teaching practical science hand-in-hand with other aspects of science, the correlation between students’ performance in practical aspects of science and the written aspects weakens, so that it is no longer possible to tell anything about a student’s practical competence from the score on the science assessment.

This is an example of what has become known as Goodhart’s law, named after Charles Goodhart, a former chief economist at the Bank of England, who showed that performance indicators lose their usefulness when used as objects of policy. The example he used was that of the relationship between inflation and money supply. Economists had noticed that increases in the rate of inflation seemed to coincide with increases in money supply, although neither had any discernible relationship with the growth of the economy. Since no-one knew how to control inflation directly, controlling money supply seemed to offer a useful policy tool for controlling inflation, without any adverse effect on growth. And the result was the biggest slump in the economy since the 1930s.
As Peter Kellner comments, “The very act of making money supply the main policy target changed the relationship between money supply and the rest of the economy” (Kellner, 1997). Similar problems have beset attempts to provide performance indicators in the Health Service, in the privatised railway companies and a host of other public services. A variety of indicators are selected for their ability to represent the quality of the service, but when used as the sole index of quality, the manipulability of these indicators destroys the relationship between the indicator and the indicated. A particularly striking example of this is provided by one state in the US, which found that after steady year-on-year rises in state-wide test scores, the gains began to level off. They changed the test they used, and found that, while scores were low to begin with, subsequent years showed substantial and steady rises. However, when, five years later, they administered the original test, performance was way below the levels that had been reached by their predecessors five years earlier. By directing attention more and more onto particular indicators of performance they had managed to increase the scores on the indicator, but the score on the indicated was relatively unaffected.

Now in the past, I have argued that if schools are to be held accountable via measures of performance, then the ‘value-added’ by the school—that is the amount of progress made by a student at the school—is a better measure than the achievement of students on leaving, which, as often as not, tells us more about what they knew when they started at the school. But the manipulable nature of the assessments that we are using means that both raw scores and value-added analyses are likely to be almost meaningless. The government is already finding that while it is making steady progress towards its targets for 11-year-olds, this has not been matched by progress towards its targets for 14-year-olds. Now part of this is no doubt due to the fact that extra money has been provided for ‘catch-up’ classes for 11-year-olds, but the reason this has helped is really only because the tests for 11-year-olds are easier to coach students for than those at age 14, combined with the fact that many secondary schools see the tests for 14-year-olds as irrelevant compared to the GCSE.

Some authors have argued that the social consequences of test use, although important, are beyond the concerns of validity as such. However, others, notably the late Samuel Messick, have argued that where the use of assessments changes what people do, any enquiry into the quality of assessments that ignores the social consequences is impoverished. He has proposed that an argument about the validity of an assessment, for a particular use, requires the simultaneous consideration of four strands of evidence:

(a) evidence that the scores have a plausible meaning in terms of the domain being assessed;
(b) evidence that the predictions that will be made from the results are justifiable;
(c) an evaluation of the value implications inherent in adopting the assessment; and
(d) an evaluation of the social consequences of using the assessment.

These four strands can be presented as the result of crossing two facets as shown in figure 1. To illustrate this, it is instructive to consider the argument that took place in 1991 about the relative weighting of the three assessment components that were to be combined in order to produce an overall level for a student’s achievement in English at age 14.
The test developers had proposed that the three components (Speaking and Listening, Reading, and Writing) should be equally weighted, while the government’s advisers wanted to use the ratio 30:35:35. Within the classical validity framework, this would appear to be a technical debate about which weighting scheme would provide the best description of a candidate’s performance in English, or which one best predicted future success in the subject. In fact, on a sample of 2000 students for whom the component scores were available, only two changed level from one weighting scheme to the other! The heat of the debate that was occasioned by this issue is therefore hard to understand. However, within Messick’s framework, we can see that while there is little to choose between the two weighting schemes in terms of the meanings of the results, they differ markedly in their consequences. Giving less weight in the mark scheme to Speaking and Listening than to Reading or Writing sends the message that Speaking and Listening is less important than Reading and Writing. In other words, control of the mark scheme allows one to send messages about the values associated with an assessment. The presumed social consequence of this is then to persuade teachers to place greater emphasis (and therefore, presumably, more teaching time) on Reading and Writing than on Speaking and Listening.

                  within-domain                                   beyond-domain
meanings          construct validity (content considerations)     construct validity (predictive and concurrent validity)
consequences      value implications                              social consequences

Figure 1: facets of validity argument (after Messick, 1980)

Messick’s model provides an integrated view of validity as “an overall evaluative judgement of the adequacy and appropriateness of inferences and actions based on test scores” (Messick, 1988 p42), which takes into account the essentially social nature of educational assessments. Put simply, a test is valid to the extent that you are happy for teachers to teach towards the test, because, were this the case:

• the only way to increase the student’s score on the test would be to increase the score on the whole of whatever the test is meant to be testing;
• increasing the student’s score on the test would indicate an improvement in whatever the test is used to predict;
• the value implications of the test—ie what messages it sends about what is important—would be appropriate; and
• the social consequences of the likely uses of the test would be acceptable.

To sum up, the trouble with the prevalent approach to educational assessment in this country is that we have divorced the certification of achievement and capability from the learning process. Because the assessments we use have no educational value, we feel unable to justify spending a lot of time on them, so that we typically assess the outcome of several thousand hours of learning with assessments that last only a few hours. Giving so little time to the assessment means that we can assess only a limited proportion of what has been taught, and conducting the assessments in ‘standardised conditions’ means that teachers can easily guess which parts of the curriculum are going to be assessed. Because of the importance attached to these outcomes, teachers and students are pressured into focusing on only those aspects of the curriculum that will be assessed.
This process is well summed up by Charles Handy’s rendering of what has come to be known as the Macnamara Fallacy, named after the US Secretary of Defense:

The Macnamara Fallacy: The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't easily be measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide. (Handy, 1994 p219)

We started out with the aim of making the important measurable, and ended up making only the measurable important.

A different starting point

Over the last ten years, there have been three major reviews of the contribution that informal classroom assessment, conducted as part of teachers’ day-to-day activities, can make to raising standards of achievement (Black & Wiliam, 1998; Crooks, 1988; Natriello, 1987). The message from these studies, between them covering over 500 research studies, is clear. Improving the quality of teachers’ day-to-day assessment practices has a substantial effect on the achievement of students—big enough to take an average country in the international ‘league tables’ of student achievement, such as New Zealand, England or the United States, up into the top 5.

The essential feature of effective classroom assessment is not merely that it is diagnostic (ie tells learners where they are going wrong) but also formative (ie tells them what to do in order to improve). Strictly speaking, of course, there is no such thing as a formative assessment. The distinction between formative and summative applies not to the assessment itself, but to the use to which the information arising from the assessment is put. The same assessment can serve both formative and summative functions, although in general, the assessment will have been designed so as to emphasise one of the functions. The defining feature of a formative assessment is that the information fed back to the learner must be used by the learner in improving performance. If, for example, the teacher gives feedback to the student indicating what needs to be done next, that is not formative unless the learner can understand and act on that information. An essential pre-requisite for assessment to serve a formative function is therefore that the learner comes to understand the goals towards which she is aiming (Sadler, 1989). If the teacher tells the student that she needs to “be more systematic” in her mathematical investigations, that is not feedback unless the learner understands what “being systematic” means—otherwise this is no more helpful than telling an unsuccessful comedian to “be funnier”. The difficulty with this is that if the learner understood what “being systematic” meant, she would probably have been able to be more systematic in the first place. The teacher believes the advice she is giving is helpful, but that is because the teacher already knows what it means to be systematic. Of course, to be practicable, such assessments must be built into the teacher’s day-to-day practice, but once achieved, a formative assessment system would generate a wealth of data on the achievements of the individual student.
Some of these—particularly where a teacher has probed a particular aspect very deeply—would be of more use in determining a student’s future learning needs than in establishing an overall level of achievement, so it would be unwise to use any mechanistic formula in order to move from the fine-grained record to an overall level of achievement for the year. Instead, the teachers would re-interpret all the available data in order to come up with a grade for the student. The immediate reaction of many to this proposal is that such a system cannot be used for high-stakes assessment, such as school-leaving and university entrance examinations, because teachers’ judgements of their students cannot be objective. In a sense this is true. Teachers’ judgements of their own students will not be objective, but then on the other hand neither will any other kind of assessment. Any assessment just tells us what the student achieved on that assessment. In this sense, a test tests only what a test tests. However, we are hardly ever interested in the result of the test per se. We are generally interested in the results of a test as a sample of something wider, and so we interpret each test result in terms of something else.

Norm-referenced and cohort-referenced assessments

For most of the history of educational testing, the way we have made sense of test results has been by comparing the performance of an individual with that of a group of students, and this has been done in one of two ways. The first is to compare the performance of an individual with the others who took the test at the same time. This is often called a norm-referenced test (more precisely a norm-referenced interpretation of a test result), but in fact it is better termed a cohort-referenced test, since the only comparison is with the cohort of students who took the test at the same time. If we want to select thirty people for a particular university course, and we have a test that correlates highly with the outcomes of the course, then it might be appropriate to admit just the thirty students who get the highest score on the test. In this case, each candidate is compared only with the other candidates taking the test at the same time, so that sabotaging someone else’s chances improves your own. Such a test is truly competitive. Frequently, however, the inferences that are sought are not restricted to just a single cohort, and it becomes necessary to compare the performance of candidates in a given year with those who took the same assessment previously. The standard way to do this is to compare every candidate that takes a test with the performance of some well-defined group of individuals. For example, until recently, the performance of every single student who took the American Scholastic Aptitude Test (SAT) was compared with a group of college-bound young men from the east coast of the United States who took the test in 1941.

Both norm- and cohort-referenced assessment are akin to the process of ‘benchmarking’ in business, which is tantamount to saying “we have no idea what level of performance we need here, so let’s just see how we’re doing compared with everybody else”. All that is required for this is that you can put the candidates in rank order. The trouble with this approach is that you can very easily put people in rank order without having any idea of what you are putting them in rank order of.
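As a simplified illustration of the difference between the two kinds of comparison, the following sketch (with invented numbers) interprets the same raw score in two ways: against the cohort who took the test at the same time, and against a fixed historical norm group of the kind the SAT once used.

```python
import numpy as np

def percentile_rank(score, reference_scores):
    """Percentage of the reference group scoring at or below the given score."""
    reference_scores = np.asarray(reference_scores)
    return 100 * np.mean(reference_scores <= score)

rng = np.random.default_rng(3)
norm_group = rng.normal(50, 10, 5_000)          # fixed, historical reference group (illustrative)
this_years_cohort = rng.normal(55, 10, 5_000)   # this year's candidates, who happen to score higher

raw_score = 60
print(f"cohort-referenced: percentile rank {percentile_rank(raw_score, this_years_cohort):.0f} in this year's cohort")
print(f"norm-referenced:   percentile rank {percentile_rank(raw_score, norm_group):.0f} in the fixed norm group")
# The same raw score carries a different meaning depending on the group
# against which it is interpreted.
```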
It was this desire for some clarity about what actually was being assessed, particularly for teaching purposes, that led to the development of criterion-referenced assessment in the 1960s and 1970s.

Criterion-referenced assessments

The essence of criterion-referenced assessment is that the domain to which inferences are to be made is specified with great precision. In particular, it was hoped that performance domains could be specified so precisely that items for assessing the domain could be generated automatically and uncontroversially (Popham, 1980). However, as Angoff (1974) has pointed out, any criterion-referenced assessment is underpinned by a set of norm-referenced assumptions, because the assessments are used in social settings. In measurement terms, the criterion ‘can high jump two metres’ is no more interesting than ‘can high jump ten metres’ or ‘can high jump one metre’. It is only by reference to a particular population (in this case human beings) that the first has some interest, while the latter two do not.

The need for interpretation of criteria is clearly illustrated in the UK car driving test, which requires, among other things, that the driver “Can cause the car to face in the opposite direction by means of the forward and reverse gears”. This is commonly referred to as the ‘three-point turn’, but it is also likely that a five-point turn would be acceptable. Even a seven-point turn might well be regarded as acceptable, but only if the road in which the turn was attempted were quite narrow. A forty-three-point turn, while clearly satisfying the literal requirements of the criterion, would almost certainly not be regarded as acceptable. The criterion is there to distinguish between acceptable and unacceptable levels of performance, and we therefore have to use norms, however implicitly, to determine appropriate interpretations. Another competence required by the driving test is that the candidate can reverse the car around a corner without mounting the kerb or moving too far into the road, but how far is too far? In practice, the criterion is interpreted with respect to the target population; a tolerance of six inches would result in nobody passing the test, and a tolerance of six feet would result in almost everybody succeeding, thus robbing the criterion of its power to discriminate between acceptable and unacceptable levels of performance.

In any particular usage, a criterion is interpreted with respect to a target population, and this interpretation relies on the exercise of judgement that is beyond the criterion itself. In particular, it is a fundamental error to imagine that the words laid down in the criterion will be interpreted by novices in the same way as they are interpreted by experts. For example, the national curriculum for English in England and Wales specifies that average 14-year-olds should be able to show “sensitivity to others” in discussion. The way that this is presented suggests that “sensitivity to others” is a prior condition for competence, but in reality, it is a post hoc description of competent behaviour. If a student does not already understand what kind of behaviour is required in group discussions, it is highly unlikely that being told to be ‘sensitive to others’ will help. What are generally described as ‘criteria’ are therefore not criteria at all, since they have no objective meaning independent of the context in which they are used.
This point was recognised forty years ago by Michael Polanyi, who suggested that intellectual abstractions about quality were better described as ‘maxims’: “Maxims cannot be understood, still less applied by anyone not already possessing a good practical knowledge of the art. They derive their interest from our appreciation of the art and cannot themselves either replace or establish that appreciation” (Polanyi, 1958 p50). The same points have been made by Robert Pirsig, who also argues that such maxims are post hoc descriptions of quality rather than constituents of it:

Quality doesn’t have to be defined. You understand it without definition. Quality is a direct experience independent of and prior to intellectual abstractions. (Pirsig, 1991 p64)

How we make sense of assessment results therefore in most cases depends neither on a comparison with a norm-group, nor on the existence of unambiguous criteria. Instead, for the vast majority of assessments, what appears to be going on is that an individual result is interpreted in terms of the collective judgement of a community of examiners. I want to illustrate this by describing the practices of teachers who have been involved in ‘high-stakes’ assessment of English Language for the national school-leaving examination in England and Wales.

Construct-referenced assessment

In this innovative system, students developed portfolios of their work, which were assessed by their teachers. In order to safeguard standards, teachers were trained to use the appropriate standards for marking by the use of ‘agreement trials’. Typically, a teacher is given a piece of work to assess, and when she has made an assessment, feedback is given by an ‘expert’ as to whether the assessment agrees with the expert assessment. The process of marking different pieces of work continues until the teacher demonstrates that she has converged on the correct marking standard, at which point she is ‘accredited’ as an assessor for some fixed period of time. However, even though these teachers have shown that they can apply the required standard of assessment consistently, their marking will still be double-checked by an external assessor, who has the power to amend the grades being awarded. The innovative feature of such assessment is that no attempt is made to prescribe learning outcomes. In so far as the standard is defined at all, it is defined simply as the consensus of the teachers making the assessments. The assessment is not objective, in the sense that there are no objective criteria for a student to satisfy, but the experience in England is that it can be made reliable. To put it crudely, it is not necessary for the examiners to know what they are doing, only that they do it right.

In looking at what is really going on when people arrive at consensual judgements, Tom Christie and Gerry Forrest (Christie & Forrest, 1981) argued that the judgements of examiners were ‘limen-referenced’, suggesting that the examiners had a notion of a threshold standard that was required to receive a particular grade. Subsequently, Royce Sadler suggested that these judgements were ‘standards-referenced’ (Sadler, 1987), indicating that they were arrived at by the assessors sharing a common standard for assessment. Both these ideas capture aspects of the process. The notion of a threshold is very familiar to experienced examiners, who know that the crucial distinctions need to be made at the borderlines. However, there are two ways in which assessors might come to the judgement.
The first is through the examiners having in their minds a clear notion of the relevant threshold. The second is that the assessors may come to their judgement not by looking at thresholds, but by having an idea of (say) a ‘typical’ D and a ‘typical’ C and seeing which is closer to the piece of work being assessed. The other difficulty with the notions of thresholds and standards is that they appear to suggest that the ‘standards’ in question lie along a single scale. This may well be true for simple assessments, but for the kinds of assessments that are used in ‘high-stakes’ assessments, there are many different routes to the same grade. In this sense, a particular grade level in an assessment appears to be more like the idea of a syndrome in medicine. In medicine, a syndrome is a collection of symptoms that often occur together. Generally, a patient exhibiting one or two of the signs associated with the syndrome would not be regarded as having the syndrome, but with three or four, might be. However, it is possible that some of the characteristics may, in a particular individual, be strong enough for the syndrome to be established on the basis of one or two symptoms. In the same way, whether a piece of work is a grade C or a grade D at A-level, or whether a thesis does or does not merit a PhD, involves balancing many different factors, but the important point is that the absence of some, or even the majority, of the relevant factors does not mean that the piece of work is not worth the award being considered. In order to encompass all these ideas, I have proposed that these assessments are in fact ‘construct-referenced’ because they rely on the existence of a shared construct of quality—shared between the community of assessors. The touchstone for distinguishing between criterion- and construct-referenced assessment is the relationship between the written descriptions (if they exist at all) and the domains. Where written statements collectively define the level of performance required (or more precisely where they define the justifiable inferences), then the assessment is criterion-referenced. However, where such statements merely exemplify the kinds of inferences that are warranted, then the assessment is, to an extent at least, construct-referenced.

The big question about any such system is, of course, how reliable it is. Although a few studies of the reliability of teachers’ assessment of English portfolios have been conducted, none has been published. One such study found that the teachers’ grades agreed with the grades given by experts for 70% of the portfolios examined. This was considered unacceptably low, and so the results weren’t published. However, the figure of 70% was the consistency with which students were given the correct grade, which, as we have seen, is very different from traditional measures of reliability. Table 1 shows how the proportion of students who would be correctly classified on an eight-grade scale, such as that used in GCSE, varies with the reliability of the marking.

Reliability        .60    .70    .80    .90    .95    .99
Grading accuracy   40%    45%    52%    65%    75%    90%

Table 1: impact of reliability of marking on accuracy of grading for an 8-grade scale

As can be seen, a grading accuracy of 70% is achieved only when the reliability of the marking is well over 0.90—a figure that has never been achieved in GCSE. Teachers’ own assessments of their students’ portfolios are, it seems, more reliable than traditional timed written examinations.
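Figures of the kind shown in Table 1 can be generated by a simulation along the following lines, assuming normally distributed scores and eight equal-frequency grade bands; the exact percentages depend on where the grade boundaries are placed, so this is an illustration rather than a reconstruction of the original calculation.

```python
import numpy as np

rng = np.random.default_rng(4)

def grading_accuracy(reliability, n_grades=8, n_students=200_000):
    """Proportion of students whose observed grade matches their 'true' grade."""
    true = rng.normal(0, 1, n_students)
    error_sd = np.sqrt(1 / reliability - 1)   # chosen so that var(true)/var(observed) = reliability
    observed = true + rng.normal(0, error_sd, n_students)

    def to_grade(scores):
        # Eight equal-frequency grade bands, defined by the septiles of the distribution.
        bounds = np.quantile(scores, np.linspace(0, 1, n_grades + 1)[1:-1])
        return np.searchsorted(bounds, scores)

    return np.mean(to_grade(true) == to_grade(observed))

# 0.85 is included because the text quotes roughly 60% correct grading at that reliability.
for rel in (0.60, 0.70, 0.80, 0.85, 0.90, 0.95, 0.99):
    print(f"marking reliability {rel:.2f} -> about {grading_accuracy(rel):.0%} graded 'correctly'")
```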
The other concern that is raised by the replacement of examinations with coursework is that of authentication. Without formal written examinations, there is always a question mark over whose work is being assessed. In GCSE, for example, despite its name, coursework is done largely outside lesson time, so that there are real concerns that what is really being assessed is access to learning resources such as encyclopaedias and computers at home, or even what the student’s parents know. However, the consigning of coursework to time outside lessons is symptomatic of a view of assessment as divorced from day-to-day classroom work. In an integrated assessment system, there would be no question about the authentication of the work, because it would have been done by the students in the class. Coursework would be coursework—the vehicle for learning rather than an addition to the load.

Accountability

The system I have outlined above would, I believe, allow the benefits of formative assessment to be achieved, while producing robust assessments of the achievements and capabilities of students. However, it does not address the issue of the narrowing of the curriculum caused by ‘teaching to the test’. Even coursework-based assessments will be compromised if the results of individual students are used for the purpose of holding educational institutions accountable. To avoid this, if a measure of the effectiveness of schools is wanted, it could be provided by using a large number of tasks that cover the entire curriculum, with each student randomly assigned to take one of these tasks. The task would not provide an accurate measure of that student’s achievement, because the score achieved would depend on whether the particular task suited the individual. But for every student at a school who was lucky in the task they were assigned, there would be one who was unlucky, and the average achievement across the tasks would be a very reliable measure of the average achievement in the school. Furthermore, the breadth of the tasks would mean that it would be impossible to teach towards the test. Or more precisely, the only effective way to teach towards the test would be to raise the standard of all the students on all the tasks, which, provided the tasks are a broad representation of the desired curriculum, would be exactly what was wanted. The government and policy makers would have undistorted information about the real levels of achievement in our schools, and users of assessment results would have accurate indications of the real levels of achievement of individual students, to guide decisions about future education and employment.

Summary

Current policies on testing and assessment start from the idea that the main purpose of educational assessment is selecting and certifying the achievement of individuals at the end of the ‘key stages’ of schooling (ie at ages 7, 11, 14, 16 and 18), and then have tried to make these tests also provide information to parents about the quality of schools. Because the tests in use have little educational value, they have to be short, and thus are unreliable, and test only a narrow range of the skills needed for life in the 21st century. Furthermore, because the tests are narrow, and test only what can easily be tested, schools have found it possible to guess what topics are going to come up in the tests, and can increase their test results by ignoring important topics that do not get tested.
Focusing on the test results has meant that the contribution that day-to-day assessments can make to learning has been ignored. However, research collected all over the world shows that if schools used assessment during teaching, to find out what students have learned, and what they need to do next, on a daily basis, the achievement of British students would be in the top five in the world, after Singapore, Japan, Taiwan and South Korea. We have created a vicious spiral in which only those aspects of learning that are easily measured are regarded as important, and even these narrow outcomes are not achieved as easily as they could be, or by as many learners, were assessment regarded as an integral part of teaching.

In place of this vicious spiral, I have proposed that the test results for individual students should be derived from teacher assessments, rigorously moderated by external assessors. A separate system, relying on ‘light sampling’ of the performance of schools, would provide stable and robust information for the purposes of accountability and policy-formation.

Concluding note

There is a widespread view that the purpose of educational research is to lead practice—that educational researchers should experiment to find what works best, and encourage teachers to take this up. In my own work, however, I am acutely aware that I have not been leading but following behind the work of teachers and other assessors, trying to make sense of what they do, and, I hope, providing a language with which we can talk about what is going on with a view to sharing it more widely. I think it is appropriate, therefore, to conclude this lecture with the words of a teacher:

—Peter Silcock

References

Angoff, W. H. (1974). Criterion-referencing, norm-referencing and the SAT. College Board Review, 92(Summer), 2-5, 21.
Austin, J. L. (1962). How to do things with words: the William James Lectures delivered at Harvard University in 1955. Oxford, UK: Clarendon Press.
Black, P. J. & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy and Practice, 5(1), 7-73.
Cherryholmes, C. H. (1989). Power and criticism: poststructural investigations in education. New York, NY: Teachers College Press.
Christie, T. & Forrest, G. M. (1981). Defining public examination standards. London, UK: Macmillan.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.) Educational measurement (pp. 443-507). Washington, DC: American Council on Education.
Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58(4), 438-481.
Handy, C. (1994). The empty raincoat. London, UK: Hutchinson.
Kellner, P. (1997). Hit-and-miss affair. Times Educational Supplement, 23.
Kyburg, H. E. (1992). Measuring errors of measurement. In C. W. Savage & P. Ehrlich (Eds.), Philosophical and foundational issues in measurement theory (pp. 75-91). Hillsdale, NJ: Lawrence Erlbaum Associates.
Linn, R. L. & Baker, E. L. (1996). Can performance-based student assessment be psychometrically sound? In J. B. Baron & D. P. Wolf (Eds.), Performance-based assessment—challenges and possibilities: 95th yearbook of the National Society for the Study of Education part 1 (pp. 84-103). Chicago, IL: National Society for the Study of Education.
Loevinger, J. (1965). Person and population as psychometric concepts. Psychological Review, 72(2), 143-155.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35(11), 1012-1027.
Messick, S. (1988). The once and future issues of validity: assessing the meaning and consequences of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 33-45). Hillsdale, NJ: Lawrence Erlbaum Associates.
Natriello, G. (1987). The impact of evaluation processes on students. Educational Psychologist, 22(2), 155-175.
Pirsig, R. M. (1991). Lila: an inquiry into morals. New York, NY: Bantam.
Polanyi, M. (1958). Personal knowledge. London, UK: Routledge & Kegan Paul.
Popham, W. J. (1980). Domain specification strategies. In R. A. Berk (Ed.) Criterion-referenced measurement: the state of the art (pp. 15-31). Baltimore, MD: Johns Hopkins University Press.
Sadler, D. R. (1987). Specifying and promulgating achievement standards. Oxford Review of Education, 13, 191-209.
Sadler, D. R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18, 145-165.
Searle, J. R. (1995). The construction of social reality. London, UK: Allen Lane, The Penguin Press.