
Psychometric Principles in Student Assessment

2003, International Handbook of Educational Evaluation


Psychometric Principles in Student Assessment

Robert J. Mislevy, University of Maryland
Mark R. Wilson, University of California at Berkeley
Kadriye Ercikan, University of British Columbia
Naomi Chudowsky, National Research Council

September, 2001

To appear in D. Stufflebeam & T. Kellaghan (Eds.), International Handbook of Educational Evaluation. Dordrecht, the Netherlands: Kluwer Academic Press.

ACKNOWLEDGEMENTS

This work draws in part on the authors' work on the National Research Council's Committee on the Foundations of Assessment. We are grateful for the suggestions of section editors George Madaus and Marguerite Clarke on an earlier version. The first author received support under the Educational Research and Development Centers Program, PR/Award Number R305B60002, as administered by the Office of Educational Research and Improvement, U.S. Department of Education. The second author received support from the National Science Foundation under grant No. ESI-9910154. The findings and opinions expressed in this report do not reflect the positions or policies of the National Research Council, the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, the National Science Foundation, or the U.S. Department of Education.

Abstract

In educational assessment, we observe what students say, do, or make in a few particular circumstances, and attempt to infer what they know, can do, or have accomplished more generally. Some links in the chain of inference depend on statistical models and probability-based reasoning, and it is with these links that terms such as validity, reliability, and comparability are typically associated—"psychometric principles," as it were. Familiar formulas and procedures from test theory provide working definitions and practical tools for addressing these more broadly applicable qualities of the chains of argument from observations to inferences about students, as they apply to familiar methods of gathering and using assessment data. This presentation has four objectives. It offers a framework for the evidentiary arguments that ground assessments, examines where psychometric principles fit in this framework, shows how familiar formulas apply these ideas to familiar forms of assessment, and looks ahead to extending the same principles to new kinds of assessments.

[V]alidity, reliability, comparability, and fairness are not just measurement issues, but social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made. (Messick, 1994, p. 2)

Overview

What are psychometric principles? Why are they important? How do we attain them? We address these questions from the perspective of assessment as evidentiary reasoning; that is, how we draw inferences about what students know, can do, or understand as more broadly construed, from the handful of particular things they say, do, or make in an assessment setting. Messick (1989), Kane (1992), and Cronbach and Meehl (1955) show the deep insights that can be gained from examining validity from such a perspective. We aim to extend the approach to additional psychometric principles and bring out connections with assessment design and probability-based reasoning. Seen through this lens, validity, reliability, comparability, and fairness (as in the quote from Messick, above) are properties of an argument—not formulas, models, or statistics per se.
We'll do two things, then, before we even introduce statistical models. We'll look more closely at the nature of evidentiary arguments in assessment, paying special attention to the role of standardization. And we'll describe a framework that structures the evidentiary argument in a given assessment, based on an evidence-centered design framework (Mislevy, Steinberg, & Almond, in press). In this way we may come to appreciate psychometric principles without tripping over psychometric details. Of course in practice we do use models, formulas, and statistics to examine the degree to which an assessment argument possesses the salutary characteristics of validity, reliability, comparability, and fairness. So this presentation does have to consider how these principles are addressed when one uses particular measurement models to draw particular inferences, with particular data, for particular purposes. To this end we describe the role of probability-based reasoning in the evidentiary argument, using classical test theory to illustrate ideas. We then survey some widely used psychometric models, such as item response theory and generalizability analysis, focusing on how each is used to address psychometric principles in different circumstances. We can't provide a guidebook for using all this machinery, but we will point out some useful references along the way for the reader who needs to do so.

This is a long road, and it may seem to wander at times. We'll start by looking at examples from an actual assessment, so the reader will have an idea of where we want to go, for thinking about assessment in general, and psychometric principles in particular.

An Introductory Example

The assessment design framework provides a way of thinking about psychometrics that relates what we observe to what we infer. The models of the evidence-centered design framework are illustrated in Figure 1. The student model, at the far left, concerns what we want to say about what a student knows or can do—aspects of their knowledge or skill. Following a tradition in psychometrics, we label this "θ" (theta). This label may stand for something rather simple, like a single category of knowledge such as vocabulary usage, or something much more complex, like a set of variables that concern which strategies a student can bring to bear on mixed-number subtraction problems and under what conditions she uses which ones. The task model, at the far right, concerns the situations we can set up in the world, in which we will observe the student say or do something that gives us clues about the knowledge or skill we've built into the student model. Between the student model and the task model are the scoring model and the measurement model, through which we reason from what we observe in performances to what we infer about a student.

Let's illustrate these models with a recent example—an assessment system built for a middle school science curriculum, "Issues, Evidence and You" (IEY; SEPUP, 1995). Figure 2 describes variables in the student model upon which both the IEY curriculum and its assessment system, called the BEAR Assessment System (Wilson & Sloane, 2000), are built. The student model consists of four variables, at least one of which is the target of every instructional activity and assessment in the curriculum. The four variables are seen as four dimensions on which students will make progress during the curriculum.
The dimensions are correlated (positively, we expect), because they all relate to "science", but are quite distinct educationally. The psychometric tradition would use a diagram like Figure 3 to illustrate this situation. Each of the variables is represented as a circle—this is intended to indicate that they are unobservable or "latent" variables. They are connected by curving lines—this is intended to indicate that they are not necessarily causally related to one another (at least as far as we are modeling that relationship), but they are associated (usually we use a correlation coefficient to express that association).

================= Insert Figures 1-3 here =================

The student model represents what we wish to measure in students. These are constructs—variables that are inherently unobservable, but which we propose as a useful way to organize our thinking about students. They describe aspects of their skill or knowledge for the purposes of, say, comparing programs, evaluating progress, or planning instruction. We use them to accumulate evidence from what we can actually observe students say and do.

Now look at the right-hand side of Figure 1. This is the task model. This is how we describe the situations we construct in which students will actually perform. Particular situations are generically called "items" or "tasks". In the case of IEY, the items are embedded in the instructional curriculum, so much so that the students would not necessarily know that they were being assessed unless the teacher tells them. An example task is shown in Figure 4. It was designed to prompt student responses that relate to the "Evidence and Tradeoffs" variable defined in Figure 2. Note that this variable is a somewhat unusual one in a science curriculum—the IEY developers think of it as representing the sorts of cognitive skills one would need to evaluate the importance of, say, an environmental impact statement—something that a citizen might need to do that is directly related to science's role in the world. An example of a student response to this task is shown in Figure 5.

How do we extract from this particular response some evidence about the unobservable student-model variable we have labeled Evidence and Tradeoffs? What we need is in the second model from the right in Figure 1—the scoring model. This is a procedure that allows one to focus on aspects of the student response and assign them to categories, in this case ordered categories that suggest higher levels of proficiency along the underlying latent variable. A scoring model can take the form of what is called a "rubric" in the jargon of assessment, and in IEY does take that form (although it is called a "scoring guide"). The rubric for the Evidence and Tradeoffs variable is shown in Figure 6. It enables a teacher or a student to recognize and evaluate two distinct aspects of responses to the questions related to the Evidence and Tradeoffs variable. In addition to the rubric, scorers have exemplars of student work available to them, complete with adjudicated scores and explanations of the scores. They also have a method (called "assessment moderation") for training people to use the rubric. All these elements together constitute the scoring model. So, what we put into the scoring model is a student's performance; what we get out is one or more scores for each task, and thus a set of scores for a set of tasks.
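To make the flow from performance to scores concrete, the short Python sketch below shows what a scoring model does computationally: it maps a work product for one task onto ordered category scores for the aspects being evaluated. The aspect names, the 0-4 levels, and the toy rule are hypothetical placeholders only; the actual IEY scoring guide in Figure 6, the exemplars, and the moderation process are far richer.

```python
# A minimal, hypothetical sketch of a scoring model: map one task's work
# product onto ordered category scores for the aspects being evaluated.
# Aspect names and 0-4 levels are placeholders, not the real IEY guide.

from typing import Dict

def score_evidence_and_tradeoffs(response: str) -> Dict[str, int]:
    """Return ordered scores (0-4) for two hypothetical aspects of a response."""
    cites_evidence = "because" in response.lower()       # crude stand-in for a rater's judgment
    weighs_tradeoffs = "however" in response.lower() or "but" in response.lower()
    return {
        "using_evidence": 3 if cites_evidence else 1,
        "weighing_tradeoffs": 3 if weighs_tradeoffs else 1,
    }

# One student's scores on a set of tasks: the output the measurement model will use.
responses = {
    "task_1": "I chose option A because the data show fewer risks; however, it costs more.",
    "task_2": "Option B is best.",
}
score_record = {task: score_evidence_and_tradeoffs(text) for task, text in responses.items()}
print(score_record)
```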
================= Insert Figures 4-6 here =================

What now remains? We need to connect the student model on the left-hand side of Figure 1 with the scores that have come out of the scoring model—in what way, and with what value, should these nuggets of evidence affect our beliefs about the student's knowledge? For this we have another model, which we will call the measurement model. This single component is commonly known as a psychometric model. Now this is somewhat of a paradox, as we have just explained that the framework for psychometrics actually involves more than just this one model. The measurement model has indeed traditionally been the focus of psychometrics, but it is not sufficient to understand psychometric principles. The complete set of elements, the full evidentiary argument, must be addressed.

Figure 7 shows the relationships in the measurement model for the sample IEY task. Here the student model (first shown in Figure 3) has been augmented with a set of boxes. The boxes are intended to indicate that they are observable rather than latent, and these are in fact the scores from the scoring model for this task. They are connected to the Evidence and Tradeoffs student-model variable with straight lines, meant to indicate a causal (though probabilistic) relationship between the variable and the observed scores, and the causality is posited to run from the student-model variables to the scores. Said another way, what the student knows and can do, as represented by the variables of the student model, determines how likely it is that the students will make right answers rather than wrong ones, carry out sound inquiry rather than founder, and so on, in each particular task they encounter. In this example, both observable variables are posited to depend on the same aspect of knowledge, namely Evidence and Tradeoffs. A different task could have more or fewer observables, and each would depend on one or more student-model variables, all in accordance with what knowledge and skill the task is designed to evoke.

================= Insert Figure 7 here =================

It is important for us to say that the student model in this example (indeed in most psychometric applications) is not proposed as a realistic explanation of the thinking that takes place when a student works through a problem. It is a piece of machinery we use to accumulate information across tasks, in a language and at a level of detail we think suits the purpose of the assessment (for a more complete perspective on this, see Pirolli & Wilson, 1998). Without question, it is selective and simplified. But it ought to be consistent with what we know about how students acquire and use knowledge, and it ought to be consistent with what we see students say and do. This is where psychometric principles come in.

What do psychometric principles mean in IEY? Validity concerns whether the tasks actually do give sound evidence about the knowledge and skills the student-model variables are supposed to measure, namely, the IEY progress variables defined in Figure 2. Or are there plausible alternative explanations for good or poor performance? Reliability concerns how much we learn about the students, in terms of these variables, from the performances we observe. Comparability concerns whether what we say about students, based on estimates of their student-model variables, has a consistent meaning even if students have taken different tasks, or been assessed at different times or under different conditions.
Fairness asks whether we have been responsible in checking important facts about students and examining characteristics of task model variables that would invalidate the inferences that test scores would ordinarily suggest.

Psychometric Principles and Evidentiary Arguments

We have seen through a quick example how assessment can be viewed as evidentiary argument, and that psychometric principles can be viewed as desirable properties of those arguments. Let's go back to the beginning and develop this line of reasoning more carefully.

Educational assessment as evidentiary argument

Inference is reasoning from what we know and what we observe to explanations, conclusions, or predictions. Rarely do we have the luxury of reasoning with certainty; the information we work with is typically incomplete, inconclusive, and amenable to more than one explanation. The very first question in an evidentiary problem is, "evidence about what?" Data become evidence in some analytic problem only when we have established their relevance to some conjecture we are considering. And this task of establishing the relevance of data and its weight as evidence depends on the chain of reasoning we construct from the evidence to those conjectures.

Both conjectures, and an understanding of what constitutes evidence about them, arise from the concepts and relationships of the field under consideration. We'll use the term "substantive" to refer to these content- or theory-based aspects of reasoning within a domain, in contrast to structural aspects such as logical structures and statistical models. In medicine, for example, physicians frame diagnostic hypotheses in terms of what they know about the nature of diseases, and the signs and symptoms that result from various disease states. The data are patients' symptoms and physical test results, from which physicians reason back to likely disease states. In history, hypotheses concern what happened and why. Letters, documents, and artifacts are the historian's data, which she must fit into a larger picture of what is known and what is supposed.

Philosopher Stephen Toulmin (1958) provided terminology for talking about how we use substantive theories and accumulated experience (say, about algebra and how kids learn it) to reason from particular data (Joe's solutions) to a particular claim (what Joe understands about algebra). Figure 8 outlines the structure of a simple argument. The claim is a proposition we wish to support with data. The arrow represents inference, which is justified by a warrant, a generalization that licenses the inference from the particular data to the particular claim. Theory and experience provide backing for the warrant. In any particular case we reason back through the warrant, so we may need to qualify our conclusions because there are alternative explanations for the data.

[[Figure 8—basic Toulmin diagram]]

In practice, of course, an argument and its constituent claims, data, warrants, backing, and alternative explanations will be more complex than Figure 8. An argument usually consists of many propositions and data elements, involves chains of reasoning, and often contains dependencies among claims and various pieces of data. This is the case in assessment.
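As a schematic aid only (not part of the chapter), the Toulmin elements just described can be written down as a simple data structure; the claim, data, warrant, and backing below are hypothetical, echoing the Joe-and-algebra example in the text.

```python
# A minimal sketch of the Toulmin argument structure in Figure 8.
# All strings are hypothetical illustrations, not content from the chapter.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ToulminArgument:
    claim: str                       # what we want to assert about the student
    data: List[str]                  # the particular things observed
    warrant: str                     # generalization licensing the step from data to claim
    backing: str                     # theory and experience supporting the warrant
    alternative_explanations: List[str] = field(default_factory=list)

joe = ToulminArgument(
    claim="Joe understands how to solve linear equations",
    data=["Joe solved 7 of 8 equation-solving tasks correctly"],
    warrant="Students who understand the procedure tend to solve such tasks correctly",
    backing="Theory and accumulated experience about algebra and how students learn it",
    alternative_explanations=[
        "Joe had seen these particular items before",
        "The items were too easy to distinguish understanding from guessing",
    ],
)
```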
In educational assessments, the data are the particular things students say, do, or create in a handful of particular situations—written essays, correct and incorrect marks on answer sheets, presentations of projects, or explanations of their problem solutions. Usually our interest lies not so much in these particulars, but in the clues they hold about what students understand more generally. We can only connect the two through a chain of inferences. Some links depend on our beliefs about the nature of knowledge and learning. What is important for students to know, and how do they display that knowledge? Other links depend on things we know about students from other sources. Do they have enough experience with a computer to use it as a tool to solve an interactive physics problem, or will it be so unfamiliar as to hinder their work? Some links use probabilistic models to communicate uncertainty, because we can administer only a few tasks or because we use evaluations from raters who don't always agree. Details differ, but a chain of reasoning must underlie an assessment of any kind, from classroom quizzes and standardized achievement tests, to coached practice systems and computerized tutoring programs, to the informal conversations students have with teachers as they work through experiments.

The case for standardization

Evidence rarely comes without a price. An obvious factor in the total cost of an evidentiary argument is the expense of gathering the data, but figuring out what data to gather and how to make sense of it can also be dear. In legal cases, these latter tasks are usually carried out after the fact. Because each case is unique, at least parts of the argument must be uniquely fashioned. Marshalling evidence and constructing arguments in the O.J. Simpson case took more than a year and cost millions of dollars, to prosecution and defense alike.

If we foresee that the same kinds of data will be required for similar purposes on many occasions, we can achieve efficiencies by developing standard procedures both for gathering the data and for reasoning from it (Schum, 1994, p. 137). A well-designed protocol for gathering data addresses important issues in its interpretation, such as making sure the right kinds and right amounts of data are obtained, and heading off likely or pernicious alternative explanations. Following standard procedures for gathering biological materials from crime scenes, for example, helps investigators avoid contaminating a sample, and allows them to keep track of everything that happens to it from collection to testing. What's more, merely confirming that they've followed the protocols immediately communicates to others that these important issues have been recognized and dealt with responsibly.

A major way that large-scale assessment is made practicable in education is by thinking these issues through up front: laying out the argument for what data to gather and why, from each of the many students who will be assessed. The details of the data will vary from one student to another, and so will the claims. But the same kind of data will be gathered for each student, the same kind of claim will be made, and, most importantly, the same argument structure will be used in each instance. This strategy offers great efficiencies, but it admits the possibility of cases that do not accord with the common argument.
Therefore, establishing the credentials of the argument in an assessment that is used with many students entails the two distinct responsibilities listed below. We shall see that investigating them and characterizing the degree to which they hold can be described in terms of psychometric principles.

• Establishing the credentials of the evidence in the common argument. This is where efficiency is gained. To the extent that the same argument structure holds for all the students it will be used with, the specialization to any particular student inherits the backing that has been marshaled for the general form. We will discuss below how the common argument is framed. Both rational analyses and large-scale statistical analyses can be used to test its fidelity at this macro level. These tasks can be arduous, and they can never really be considered complete, because we could always refine the argument or test additional alternative hypotheses (Messick, 1989). The point is, though, that this effort does not increase in proportion to the number of examinees who are assessed.

• Detecting individuals for whom the common argument does not hold. Inevitably, the theories, the generalizations, and the empirical grounding for the common argument will not hold for some students. The usual data arrive, but the usual inference does not follow—even if the common argument does support validity and reliability in the main. These instances call for additional data or different arguments, often on a more expensive case-by-case basis. An assessment system that is both efficient and conscientious will minimize the frequency with which these situations occur, but routinely and inexpensively draw attention to them when they do.

It is worth emphasizing that the standardization we are discussing here concerns the structure of the argument, not necessarily the form of the data. Some may think that this form of standardization is only possible with so-called objective item forms such as multiple-choice items. Few large-scale assessments are more open-ended than the Advanced Placement Studio Art portfolio assessment (Myford & Mislevy, 1995); students have an almost unfettered choice of media, themes, and styles. But the AP program provides a great deal of information about the qualities students need to display in their work, what they need to assemble as work products, and how raters will evaluate them. This structure allows for a common argument, heads off alternative explanations about unclear evaluation standards in the hundreds of AP Studio Art classrooms across the country, and, most happily, helps the students come to understand the nature of good work in the field (Wolf, Glenn, & Gardner, 1991).

Psychometric principles as properties of arguments

Seeing assessment as argument from limited evidence is a starting point for understanding psychometric principles.

Validity

Validity is paramount among psychometric principles, for validity speaks directly to the extent to which a claim about a student, based on assessment data from that student, is justified (Cronbach, 1989; Messick, 1989). Establishing validity entails making the warrant explicit, examining the network of beliefs and theories on which it relies, and testing its strength and credibility through various sources of backing.
It requires determining conditions that weaken the warrant, exploring alternative explanations for good or poor performance, and feeding them back into the system to reduce inferential errors.

In the introductory example we saw that assessment is meant to get evidence about students' status with respect to a construct, some particular aspect(s) of knowledge, skill, or ability—in that case, the IEY variables. Cronbach and Meehl (1955) said "construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not operationally defined"—that is, when there is a claim about a person based on observations, not merely a statement about those particular observations in and of themselves. Earlier work on validity distinguished a number of varieties of validity, such as content validity, predictive validity, and convergent and divergent validity, and we will say a bit more about these later. But the current view, as the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) assert, is that validity is a unitary concept. Ostensibly different kinds of validity are better viewed as merely different lines of argument and different kinds of evidence for a single kind of validity. If you insist on a label for it, it would have to be construct validity.

Embretson (1983) distinguishes between validity arguments that concern why data gathered in a certain way ought to provide evidence about the targeted skill and knowledge, and those that investigate relationships of resulting scores with other variables to support the case. These are, respectively, arguments about "construct representation" and arguments from "nomothetic span." Writing in 1983, Embretson noted that validation studies relied mainly on nomothetic arguments, using scores from assessments in their final form or close to it. The construction of those tests, however, was guided mainly by specifications for item format and content, rather than by theoretical arguments or empirical studies regarding construct representation. The "cognitive revolution" in the latter third of the twentieth century provided both scientific respectability and practical tools for designing construct meaning into tests from the beginning (Embretson, 1983). The value of both lines of argument is appreciated today, with validation procedures based on nomothetic span tending to be more mature and those based on construct representation still evolving.

Reliability

Reliability concerns the adequacy of the data to support a claim, presuming the appropriateness of the warrant and the satisfactory elimination of alternative hypotheses. Even if the reasoning is sound, there may not be enough information in the data to support the claim. Later we will see how reliability is expressed quantitatively when probability-based measurement models are employed. We can mention now, though, that the procedures by which data are gathered can involve multiple steps or features that each affect the evidentiary value of the data. Depending on Jim's rating of Sue's essay rather than evaluating it ourselves adds a step of reasoning to the chain, introducing the need to establish an additional warrant, examine alternative explanations, and assess the value of the resulting data. How can we gauge the adequacy of evidence?
Brennan (2000/in press) writes that the idea of repeating the measurement process has played a central role in characterizing an assessment's reliability since the work of Spearman (1904)—much as it does in the physical sciences. If you weigh a stone ten times and get a slightly different answer each time, the variation among the measurements is a good index of the uncertainty associated with that measurement procedure. It is less straightforward to know just what repeating the measurement procedure means, though, if the procedure has several steps that could each be done differently (different occasions, different tasks, different raters), or if some of the steps can't be repeated at all (if a person learns something by working through a task, a second attempt isn't measuring the same level of knowledge). We will see that the history of reliability is one of figuring out how to characterize the value of evidence in increasingly wider ranges of assessment situations.

Comparability

Comparability concerns the common occurrence that the specifics of data collection differ for different students, or for the same students at different times. Differing conditions raise alternative hypotheses when we need to compare students with one another or against common standards, or when we want to track students' progress over time. Are there systematic differences in the conclusions we would draw when we observe responses to Test Form A as opposed to Test Form B, for example? Or from a computerized adaptive test instead of the paper-and-pencil version? Or if we use a rating based on one judge, as opposed to the average of two, or the consensus of three? We must then extend the warrant to deal with these variations, and we must include them as alternative explanations of differences in students' scores.

Comparability overlaps with reliability, as both raise questions of how evidence obtained through one application of a data-gathering procedure might differ from evidence obtained through another application. The issue is reliability when we consider the two measures interchangeable—which is used is a matter of indifference to the examinee and assessor alike. Although we expect the results to differ somewhat, we don't know whether one is more accurate than the other, whether one is biased toward higher values, or whether they will illuminate different aspects of knowledge. The same evidentiary argument holds for both measures, and the obtained differences are what constitute classical measurement error. The issue is comparability when we expect systematic differences of any of these types, but wish to compare results obtained from the two distinct processes nevertheless. A more complex evidentiary argument is required. It must address the way that observations from the two processes bear different relationships to the construct we want to measure, and it must indicate how to take these differences into account in our inferences.

Fairness

Fairness is a term that encompasses more territory than we can address in this presentation. Many of its senses concern social, political, and educational perspectives on the uses to which assessment results are put (Willingham & Cole, 1997)—legitimate questions all, which would exist even if the chain of reasoning from observations to constructs contained no uncertainty whatsoever.
Like Wiley (1991), we focus our attention here on construct meaning rather than use or consequences, and consider aspects of fairness that bear directly on this portion of the evidentiary argument. Fairness in this sense concerns alternative explanations of assessment performances in light of other characteristics of students that we could and should take into account. Ideally, the same warrant backs inferences about many students, reasoning from their particular data to a claim about what each individually knows or can do. This is never quite truly the case in practice, for factors such as language background, instructional background, and familiarity with representations surely influence performance. When the same argument is to be applied with many students, considerations of fairness require us to examine the impact of such factors on performance and identify the ranges of their values beyond which the common warrant can no longer be justified. Drawing the usual inference from the usual data for a student who lies outside this range leads to inferential errors. If they are errors we should have foreseen and avoided, they are unfair. Ways of avoiding such errors are using additional knowledge about students to condition our interpretation of what we observe under the same procedures, and gathering data from different students in different ways, such as providing accommodations or allowing students to choose among ways of providing data (and accepting the responsibility as assessors to establish the comparability of data so obtained!).

A Framework for Assessment Design

This section lays out a schema for the evidentiary argument that underlies educational assessments, incorporating both its substantive and statistical aspects. It is based on the 'evidence-centered' framework for assessment design illustrated in Mislevy, Steinberg, and Almond (in press) and Mislevy, Steinberg, Breyer, Almond, and Johnson (1999; in press). We'll use it presently to examine psychometric principles from a more technical perspective. The framework formalizes another quotation from Messick:

A construct-centered approach [to assessment design] would begin by asking what complex of knowledge, skills, or other attributes should be assessed, presumably because they are tied to explicit or implicit objectives of instruction or are otherwise valued by society. Next, what behaviors or performances should reveal those constructs, and what tasks or situations should elicit those behaviors? Thus, the nature of the construct guides the selection or construction of relevant tasks as well as the rational development of construct-based scoring criteria and rubrics. (Messick, 1994, p. 16)

Figure 1, presented back in the introductory example, depicts elements and relationships that must be present, at least implicitly, and coordinated, at least functionally, in any assessment that has evolved to effectively serve some inferential function. Making this structure explicit helps an evaluator understand how to first gather, then reason from, data that bear on what students know and can do. In brief, the student model specifies the variables in terms of which we wish to characterize students. Task models are schemas for ways to get data that provide evidence about them.
Two components of the evidence model provide the links in the chain of reasoning from students' work to their knowledge and skill: the scoring component contains procedures for extracting the salient features of students' performances in task situations—i.e., observable variables—and the measurement component contains machinery for updating beliefs about student-model variables in light of this information. These models are discussed in more detail below. Taken together, they make explicit the evidentiary grounding of an assessment, and they guide the choice and construction of particular tasks, rubrics, statistical models, and so on. An operational assessment will generally have one student model, which may contain many variables, but may use several task and evidence models to provide data of different forms or with different rationales.

The Student Model: What complex of knowledge, skills, or other attributes should be assessed?

The values of student-model variables represent selected aspects of the infinite configurations of skill and knowledge real students have, based on a theory or a set of beliefs about skill and knowledge in the domain. These variables are the vehicle through which we determine student progress, make decisions, or plan instruction for students. The number and nature of the student-model variables depend on the purpose of an assessment. A single variable characterizing overall proficiency in algebra might suffice in an assessment meant only to support a pass/fail decision; a coached practice system to help students develop the same proficiency might require a finer-grained student model, to monitor how a student is doing on particular aspects of skill and knowledge for which we can offer feedback. When the purpose is program evaluation, the student-model variables should reflect hypothesized ways in which a program may enjoy more or less success, or promote students' learning in some ways as opposed to others.

In the standard argument, then, a claim about what a student knows, can do, or has accomplished is expressed in terms of values of student-model variables. Substantive concerns about the desired outcomes of instruction, say, or the focus of a program evaluation, will suggest what the student-model variables might be, and give substantive meaning to their values. The student model provides a language for expressing claims about students, restricted and simplified to be sure, but one that is amenable to probability-based reasoning for drawing inferences and characterizing beliefs. A following section will explain how we can express what we know about a given student's values for these variables in terms of a probability distribution, which can be updated as new evidence arrives.

Task Models: What tasks or situations should elicit those behaviors?

A task model provides a framework for constructing and describing the situations in which examinees act. We use the term "task" in the sense proposed by Haertel and Wiley (1993), to refer to a "goal-directed human activity to be pursued in a specified manner, context, or circumstance." A task can thus be an open-ended problem in a computerized simulation, a long-term project such as a term paper, a language-proficiency interview, or a familiar multiple-choice or short-answer question.
A task model specifies the environment in which the student will say, do, or produce something; for example, characteristics of stimulus material, instructions, help, tools, and so on. It also specifies the work product, or the form in which what the student says, does, or produces will be captured. But again it is substantive theory and experience that determine the kinds of situations that can evoke behaviors that provide clues about the targeted knowledge and skill, and the forms in which those clues can be expressed and captured.

To create a particular task, an assessment designer explicitly or implicitly assigns specific values to task model variables, provides materials that suit the specifications there given, and sets the conditions that are required to interact with the student. A task thus describes particular circumstances meant to provide the examinee an opportunity to act in ways that produce evidence about what they know or can do more generally. For a particular task, the values of its task model variables constitute data for the evidentiary argument, characterizing the situation in which the student is saying, doing, or making something.

It is useful to distinguish task models from the scoring models discussed in the next section, as the latter concern what to attend to in the resulting performance and how to evaluate what we see. Distinct and possibly quite different evaluation rules could be applied to the same work product from a given task. Distinct and possibly quite different student models, designed to serve different purposes or derived from different conceptions of proficiency, could be informed by performances on the same tasks. The substantive arguments for the evidentiary value of behavior in the task situation will overlap in these cases, but the specifics of the claims, and thus the specifics of the statistical links in the chain of reasoning, will differ.

Evidence Models: What behaviors or performances should reveal the student constructs, and what is the connection?

An evidence model lays out the part of the evidentiary argument that concerns reasoning from the observations in a given task situation to revising beliefs about student-model variables. Figure 1 shows there are two parts to the evidence model. The scoring component contains "evidence rules" for extracting the salient features of whatever the student says, does, or creates in the task situation—i.e., the "work product" that is represented by the jumble of shapes in the rectangle at the far right of the evidence model. A work product is a unique human production, perhaps as simple as a response to a multiple-choice item, or as complex as repeated cycles of treating and evaluating patients in a medical simulation. The squares coming out of the work product represent "observable variables," or evaluative summaries of what the assessment designer has determined are the key aspects of the performance (as captured in one or more work products) to serve the assessment's purpose.

Different aspects could be captured for different purposes. For example, a short impromptu speech contains information about a student's subject matter knowledge, presentation capabilities, and English language proficiency; any of these, or any combination, could be the basis of one or more observable variables. As a facet of fairness, however, the student should be informed of which aspects of her performance are being evaluated, and by what criteria.
For students, failing to understand how their work will be scored is an alternative hypothesis for poor performance that we can and should avoid.

Scoring rules map unique performances into a common interpretative framework, thus laying out what is important in a performance. These rules can be as simple as determining whether the response to a multiple-choice item is correct, or as complex as an expert's holistic evaluation of multiple aspects of an unconstrained patient-management solution. They can be automated, demand human judgment, or require both in combination. Values of the observable variables describe properties of the particular things a student says, does, or makes. As such, they constitute data about what the student knows, can do, or has accomplished as more generally construed in the standard argument.

It is important to note that substantive concerns drive the definition of observable variables. Statistical analyses can be used to refine definitions, compare alternatives, or improve data-gathering procedures, again looking for patterns that call a scoring rule into question. But it is the conception of what to observe that concerns validity directly, and raises questions of alternative explanations that bear on comparability and fairness.

The measurement component of the evidence model tells how the observable variables depend, in probabilistic terms, on student-model variables, another essential link in the evidentiary argument. This is the foundation for the reasoning that is needed to synthesize evidence across multiple tasks or from different performances. Figure 1 shows how the observables are modeled as depending on some subset of the student-model variables. The familiar models from test theory that we discuss in a following section, including classical test theory and item response theory, are examples. We can adapt these ideas to suit the nature of the student model and observable variables in any given application (Almond & Mislevy, 1999). Again, substantive considerations must underlie why these posited relationships should hold; the measurement model formalizes the patterns they imply.

It is a defining characteristic of psychometrics to model observable variables as probabilistic functions of unobservable student variables. The measurement model is almost always a probability model. The probability-based framework may extend to the scoring model as well, as when judgments are required to ascertain the values of observable variables from complex performances. Questions of accuracy, agreement, leniency, and optimal design arise, and can be addressed with a measurement model that addresses the rating link as well as the synthesis link in the chain of reasoning. The generalizability and rater models discussed below are examples of this.

Psychometric Principles and Probability-Based Reasoning

The role of probability-based reasoning in assessment

This section looks more closely at what is perhaps the most distinctive characteristic of psychometrics, namely, the use of statistical models. Measurement models are a particular form of reasoning from evidence; they provide explicit, formal rules for how to integrate the many pieces of information that may be relevant to a particular inference about what students know and can do.
Statistical modeling, and probability-based reasoning more generally, is an approach to solving the problem of "reverse reasoning" through a warrant, from particular data to a particular claim. Just how can we reason from data to claim for a particular student, using a measurement model established for general circumstances—a warrant that is usually far less than certain, typically qualified, and perhaps dependent on side conditions we don't know are satisfied? How can we synthesize the evidentiary value of multiple observations, perhaps from different sources, often in conflict? The essential idea is to approximate the important substantive relationships in some real-world problem in terms of relationships among variables in a probability model. A simplified picture of the real-world situation results. A useful model does not explain all the details of actual data, but it does capture the significant patterns among them. What is important is that in the space of the model, the machinery of probability-based reasoning indicates exactly how reverse reasoning is to be carried out (specifically, through Bayes theorem), and how different kinds and amounts of data should affect our beliefs. The trick is to build a probability model that both captures the important real-world patterns and suits the purposes of the problem at hand.

Measurement models concern the relationships between students' knowledge and their behavior. A student is modeled in terms of variables (θ) that represent the facets of skill or knowledge that suit the purpose of the assessment, and the data (X) are values of variables that characterize aspects of the observable behavior. We posit that the student-model variables account for the observable variables in the following sense: We don't know exactly what any student will do on a particular task, but for people with any given value of θ, there is a probability distribution of possible values of X, say p(X|θ). This is a mathematical expression of what we might expect to see in data, given any possible values of the student-model variables. The way is open for reverse reasoning, from observed Xs to likely θs, as long as different values of θ produce different probability distributions for X. We don't know the values of the student-model variables in practice; we observe "noisy" data presumed to have been determined by them, and through the probability model reason back to what their values are likely to be.

Choosing to manage information and uncertainty with probability-based reasoning, with its numerical expressions of belief in terms of probability distributions, does not constrain one to any particular forms of evidence or psychological frameworks. That is, it says nothing about the number or nature of the elements of X, or about the character of the performances, or about the conditions under which performances are produced. And it says nothing about the number or nature of the elements of θ, such as whether they are number values in a differential psychology model, production-rule mastery in a cognitive model, or tendencies to use resources effectively in a situative model. In particular, using probability-based reasoning does not commit us to long tests, discrete tasks, or large samples of students.
For example, probability-based models have been found useful in modeling patterns of judges' ratings in the previously mentioned Advanced Placement Studio Art portfolio assessment (Myford & Mislevy, 1995), about as open-ended as large-scale, high-stakes educational assessments get, and in modeling individual students' use of production rules in a tutoring system for solving physics problems (Martin & vanLehn, 1995).

Now a measurement model in any case is not intended to account for every detail of data; it is only meant to approximate the important patterns. The statistical concept of conditional independence formalizes the working assumption that if the values of the student-model variables were known, there would be no further information in the details. The fact that every detail of a student's responses could in principle contain information about what a student knows or how she is thinking underscores the constructive and purposive nature of modeling. We use a model at a given grain size or with certain kinds of variables not because we think that is somehow "true", but rather because it adequately expresses the patterns in the data in light of the purpose of the assessment. Adequacy in a given application depends on validity, reliability, comparability, and fairness in ways we shall discuss further, but characterized in ways and demanded in degrees that depend on that application: the purpose of the assessment, the resources that are available, the constraints that must be accommodated. We might model the same troubleshooting performances in terms of individual problem steps for an intelligent tutoring system, in terms of general areas of strength and weakness for a diagnostic assessment, and simply in terms of overall success rate for a pass/fail certification test.

Never fully believing the statistical model we are reasoning through, we bear the responsibility of assessing model fit, in terms of both persons and items. We must examine the ways and the extent to which the real data depart from the patterns implied by the model, calling our attention to failures of conditional independence—places where our simplifying assumptions miss relationships that are surely systematic, and possibly important, in the data. Finding substantial misfit causes us to re-examine the arguments that tell us what to observe and how to evaluate it.

Probability-based reasoning in classical test theory

This section illustrates the ideas from the preceding discussion in the context of Classical Test Theory (CTT). In CTT, the student model is represented as a single continuous unobservable variable, the true score θ. The measurement model simply tells us to think of an observed score X as the true score plus an error term. If a CTT measurement model were used in the BEAR example, it would take the sum of a student's scores on a set of assessment tasks as the observed score. Figure 9 pictures the situation, in a case that concerns Sue's (unobservable) true score and her three observed scores on parallel forms of the same test; that is, the forms are equivalent measures of the same construct, and have the same means and variances. The probability distribution p(θ) expresses our belief about Sue's θ before we observe her test scores, the Xs. The conditional distributions p(Xj|θ) indicate the probabilities of observing different values of Xj if θ took any given particular value.
Modeling the distribution of each Xj to depend on θ but not on the other Xs is an instance of conditional independence; more formally, we write

p(X1, X2, X3 | θ) = p(X1 | θ) p(X2 | θ) p(X3 | θ).

Under CTT we may obtain a form for the p(Xj|θ)s by proposing that

Xj = θ + Ej,   (1)

where Ej is an "error" term, normally distributed with a mean of zero and a variance of σ_E². Thus Xj, given θ, is normally distributed with mean θ and variance σ_E². This statistical structure quantifies the patterns that the substantive arguments express qualitatively, in a way that tells us exactly how to carry out reverse reasoning for particular cases. If p(θ) expresses belief about Sue's θ prior to observing her responses, belief posterior to learning them is denoted as p(θ|x1,x2,x3) and is calculated by Bayes theorem as

p(θ|x1,x2,x3) ∝ p(θ) p(x1|θ) p(x2|θ) p(x3|θ).

(The lower-case x's here denote particular values of X's.)

[[Figure 9—CTT DAG—about here]]

Figure 10 gives the numerical details for a hypothetical example, calculated with a variation of an important early result called Kelley's formula for estimating true scores (Kelley, 1927). Suppose that from a large number of students like Sue, we've estimated that the measurement error variance is σ_E² = 25, and that for the population of students, θ follows a normal distribution with mean 50 and standard deviation 10. We now observe Sue's three scores, which take the values 70, 75, and 85. We see that the posterior distribution for Sue's θ is a normal distribution with mean 74.6 and standard deviation 2.8.

[[Figure 10—CTT numerical example—about here]]

The additional backing that was used to bring the probability model into the evidentiary argument was an analysis of data from students like Sue. Spearman's (1904) seminal insight was that if their structure is set up in the right way, it is possible to estimate the quantitative features of relationships like this, among both variables that could be observed and others which by their nature never can be. The index of measurement accuracy in CTT is the reliability coefficient ρ, the proportion of the variance in observed scores in a population of interest that is attributable to true scores; the total variance is composed of true-score variance and noise. It is defined as follows:

ρ = σ_θ² / (σ_θ² + σ_E²),   (2)

where σ_θ² is the variance of true scores in the population of examinees and σ_E² is the variance of the error components—neither of which is directly observable! With a bit of algebra, though, Spearman demonstrated that if Equation 1 holds, correlations among pairs of Xs will approximate ρ. We may then estimate the contributions of true score and error, σ_θ² and σ_E², as proportions ρ and (1−ρ), respectively, of the observed-score variance. The intuitively plausible notion is that correlations among exchangeable measures of the same construct tell us how much to trust comparisons among examinees from a single measurement. (As an index of measurement accuracy, however, ρ suffers from its dependence on the variation among examinees' true scores as well as on the measurement error variance of the test. For a group of examinees with no true-score variance, the reliability coefficient is zero no matter how much evidence a test provides about each of them. We'll see how item response theory extends the idea of measurement accuracy.) What's more, patterns among the observables can be so contrary to those the model would predict that we suspect the model isn't right.
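For readers who want to check the arithmetic, the sketch below reproduces the numbers quoted above using standard normal-normal Bayesian updating (a variant of Kelley's formula). The prior, error variance, and scores are those given in the text; the layout of Figure 10 itself is not reproduced here.

```python
import math

# Reproduces the numbers quoted in the text for Sue, using standard
# normal-normal Bayesian updating (a variant of Kelley's formula).
# Prior: theta ~ N(50, 10^2); parallel forms: X_j | theta ~ N(theta, 25).

prior_mean, prior_var = 50.0, 10.0 ** 2      # population distribution of theta
error_var = 25.0                              # sigma_E^2 for each parallel form
scores = [70.0, 75.0, 85.0]                   # Sue's three observed scores

# Posterior precision = prior precision + one error precision per observation.
post_precision = 1.0 / prior_var + len(scores) / error_var
post_var = 1.0 / post_precision
post_mean = post_var * (prior_mean / prior_var + sum(scores) / error_var)

print(round(post_mean, 1), round(math.sqrt(post_var), 1))   # 74.6 2.8

# Equation 2: with sigma_theta^2 = 100 and sigma_E^2 = 25, the reliability of a
# single form in this population would be rho = 100 / (100 + 25) = 0.8.
rho = prior_var / (prior_var + error_var)
print(rho)                                                    # 0.8
```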
Sue's values of 70, 75, and 85 are not identical, but neither are they surprising as a set of scores (Figure 10 calculates a chi-squared index of fit for Sue). Some students have higher scores than Sue, some have lower scores, but the amount of variation within a typical student's set of scores is in this neighborhood. But Richard's three scores of 70, 75, and 10 are surprising. His high fit statistic (a chi-square of 105 with 2 degrees of freedom) says his pattern is very unlikely from parallel tests with an error variance of 25 (less than one in a billion). Richard's responses are so discordant with a statistical model that expresses patterns under the standard argument that we suspect the standard argument does not apply. We must go beyond the standard argument to understand what has happened, to facts the standard data do not convey. Our first clue is that his third score is particularly different from both the other two and from the prior distribution.

Classical test theory's simple model for examinee characteristics suffices when one is interested in just a single aspect of student achievement, tests are only considered as a whole, and all students take tests that are identical or practically so. But the assumptions of CTT have generated a vast armamentarium of concepts and tools that help the practitioner examine the extent to which psychometric principles are being attained in situations where those assumptions are adequate. These tools include reliability indices that can be calculated from multiple items in a single test, formulas for errors of measurement for individual students, strategies for selecting optimal composites of tests, formulas for approximating how long a test should be to reach a required accuracy, and methods for equating tests. The practitioner working in situations that CTT encompasses will find a wealth of useful formulas and techniques in Gulliksen's (1950/1987) classic text.

The advantages of using probability-based reasoning in assessment

Because of its history, the very term "psychometrics" connotes a fusion of the inferential logic underlying Spearman's reasoning with his psychology (trait psychology, in particular with intelligence as an inherited and stable characteristic) and his data-gathering methods (many short, 'objectively scored,' largely decontextualized tasks). The connection, while historically grounded, is logically spurious, however. For the kinds of problems that CTT grew to solve are not just Spearman's problems, but ones that ought to concern anybody who is responsible for making decisions about students, evaluating the effects of instruction, or spending scarce educational resources—whether or not Spearman's psychology or methodology is relevant to the problem at hand. And indeed, the course of development of test theory over the past century has been to continually extend the range of problems to which this inferential approach can be applied—to claims cast in terms of behavioral, cognitive, or situative psychology; to data that may be embedded in context, require sophisticated evaluations, or address multiple interrelated aspects of complex activities. We will look at some of these developments in the next section. But it is at a higher level of abstraction that psychometric principles are best understood, even though it is with particular models and indices that they are investigated in practice.
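The fit statistics quoted above for Sue and Richard can be reproduced with a simple person-fit index. The particular formula below (squared deviations of a student's scores from her own mean, divided by the error variance, with degrees of freedom one less than the number of forms) is an assumption about what Figure 10 computes, chosen because it matches the quoted values.

```python
# A sketch of a person-fit chi-square consistent with the numbers quoted above.
# Assumption: compare each parallel-form score with the student's own mean,
# scaled by the error variance (df = number of forms - 1). The exact formula
# used in Figure 10 may differ.

def person_fit_chisq(scores, error_var=25.0):
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / error_var

print(round(person_fit_chisq([70, 75, 85]), 1))   # Sue: about 4.7 on 2 df, unremarkable
print(round(person_fit_chisq([70, 75, 10]), 1))   # Richard: about 104.7 on 2 df, wildly discrepant
```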
When it comes to examining psychometric properties, embedding the assessment argument in a probability model offers the following advantages:

1. Using the calculus of probability-based reasoning, once we ascertain the values of the variables in the data, we can express our beliefs about the likely values of the student-model variables in terms of probability distributions—given that the model is both generally credible (#3 below) and applicable to the case at hand (#4 below).

2. The machinery of probability-based reasoning is rich enough to handle many recurring challenges in assessment, such as synthesizing information across multiple tasks, characterizing the evidentiary importance of elements or assemblages of data, assessing comparability across different bodies of evidence, and exploring the implications of judgment, including different numbers and configurations of raters.

3. Global model-criticism techniques allow us not only to fit models to data, but also to determine where and how the data do not accord well with the models. Substantive considerations suggest the structure of the evidentiary argument; statistical analyses of ensuing data through the lens of a mathematical model help us assess whether the argument matches up with what we actually see in the world. For instance, detecting an unexpected interaction between performance on an item and students' cultural backgrounds alerts us to an alternate explanation of poor performance. We are then moved to improve the data-gathering methods, constrain the range of use, or rethink the substantive argument.

4. Local model-criticism techniques allow us to monitor the operation of the reverse-reasoning step for individual students even after the argument, data-collection methods, and statistical model are up and running. Patterns of observations that are unlikely under the common argument can be flagged (e.g., Richard's high chi-square value), thus avoiding certain unsupportable inferences and drawing attention to those cases that call for additional exploration.

Implications for psychometric principles

Validity

Some of the historical "flavors" of validity are statistical in nature. Predictive validity is the degree to which scores in selection tests correlate with future performance. Convergent validity looks for high correlations of a test's scores with other sources of evidence about the targeted knowledge and skills, while divergent validity looks for low correlations with evidence about irrelevant factors (Campbell & Fiske, 1959). Concurrent validity examines correlations with other tests presumed to provide evidence about the same or similar knowledge and skills. The idea is that substantive considerations that justify an assessment's conception and construction can be put to empirical tests. In each of the cases mentioned above, relationships are posited among observable phenomena that would hold if the substantive argument were correct, and we see whether in fact they do; that is, we explore the nomothetic net. These are all potential sources of backing for arguments for interpreting and using test results, and they are at the same time explorations of plausible alternative explanations.

Consider, for example, assessments meant to support decisions about whether a student has attained some criterion of performance (Ercikan & Julian, 2001; Hambleton & Slater, 1997).
These decisions, typically reported as proficiency or performance levels, involve classifying examinee performance into a set of proficiency levels; such reports are increasingly considered useful for communicating assessment results to students, parents, and the public, as well as for program evaluation. Rarely do the tasks on such a test exhaust the full range of performances and situations users are interested in. Examining the validity of a proficiency test from this nomothetic-net perspective would involve seeing whether students who do well on that test also perform well in more extensive assessments, obtain high ratings from teachers or employers, or succeed in subsequent training or job performance.

Statistical analyses of these kinds have always been important after the fact, as significance-focused validity studies informed, constrained, and evaluated the use of a test—but they rarely prompted more than minor modifications to its contents. Rather, Embretson (1998) notes, substantive considerations have traditionally driven assessment construction. Neither of the two meaning-focused lines of justification that were considered forms of validity used probability-based reasoning. They were content validity, which concerns the nature and mix of items in a test, and face validity, which is what a test appears to be measuring on the surface, especially to non-technical audiences. We will see in our discussion of item response theory how statistical machinery is increasingly being used in the exploration of construct representation as well as in after-the-fact validity studies.

Reliability

Reliability, historically, was used to quantify the amount of variation in test scores that reflected 'true' differences among students, as opposed to noise (Equation 2). The correlations between parallel test forms used in classical test theory are one way to estimate reliability in this sense. Internal consistency among test items, as gauged by the KR-20 formula (Kuder & Richardson, 1937) or Cronbach's (1951) alpha coefficient, is another. A contemporary view sees reliability as the evidentiary value that a given realized or prospective body of data would provide for a claim—more specifically, the amount of information for revising belief about an inference involving student-model variables, be it an estimate for a given student, a comparison among students, or a determination of whether a student has attained some criterion of performance. A wide variety of specific indices or parameters can be used to characterize evidentiary value. Carrying out a measurement procedure two or more times with supposedly equivalent alternative tasks and raters will not only ground an estimate of its accuracy, as in Spearman's original procedures, but also demonstrate convincingly that there is some uncertainty to deal with in the first place (Brennan, 2000/in press). The KR-20 and Cronbach's alpha apply the idea of replication to tests that consist of multiple items, by treating subsets of the items as repeated measures. These CTT indices of reliability appropriately characterize the amount of evidence for comparing students in a particular population with one another, but not necessarily for comparing them against a fixed standard, or for comparisons in other populations, or for purposes of evaluating schools or instructional programs. In this sense CTT indices of reliability are tied to particular populations and inferences.
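As a concrete illustration of the internal-consistency idea, the sketch below computes Cronbach's alpha from a small matrix of item scores. The matrix is invented purely for illustration; with dichotomous (0/1) items the same calculation yields the KR-20.

```python
# A small sketch of internal-consistency reliability (Cronbach's alpha).
# The item-score matrix is made up purely for illustration.
import numpy as np

def cronbach_alpha(scores):
    """scores: rows are examinees, columns are items.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item across examinees
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 6 examinees x 4 dichotomous items
x = [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 1, 1, 1],
     [0, 0, 0, 0],
     [1, 1, 1, 0]]
print(round(cronbach_alpha(x), 2))                # about 0.78 for this made-up matrix
```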
Since reasoning about reliability takes place in the realm of the measurement model (assuming that it is both correct and appropriate), it is possible to approximate the evidentiary value not only of the data in hand, but also of similar data gathered in somewhat different ways. Under CTT, the Spearman-Brown formula (Brown, 1910; Spearman, 1910) can be used to approximate the reliability coefficient that would result from doubling the length of a test:

ρ_double = 2ρ / (1 + ρ).     (3)

That is, if ρ is the reliability of the original test, then ρ_double is the reliability of an otherwise comparable test with twice as many items. Empirical checks have shown that these predictions can hold up quite well—but not if the additional items differ as to their content or difficulty, or if the new test is long enough to fatigue students. In these cases, the real-world counterparts of the modeled relationships are stretched so far that the results of reasoning through the model fail. Extending this thinking to a wider range of inferences, generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) permits predictions for the accuracy of similar tests with different numbers and configurations of raters, items, and so on. And once the parameters of tasks have been estimated under an item response theory (IRT) model, one can even assemble tests item by item for individual examinees on the fly, to maximize the accuracy with which each is assessed. (Later we'll point to some "how-to" references for g-theory and IRT.)

Typical measures of accuracy used in CTT are not sufficient for examining the accuracy of the criterion-of-performance decisions discussed above. In the CTT framework, classification accuracy is defined as the extent to which classifications of students based on their observed test scores agree with those based on their true scores (Traub & Rowley, 1980). One of the two commonly used measures of classification accuracy is a simple measure of agreement, p0, defined as

p0 = Σ_{l=1}^{L} pll,

where pll represents the proportion of examinees who are classified into the same proficiency level l (l = 1, …, L) according to their true score and their observed score. The second is Cohen's κ coefficient (Cohen, 1960). This statistic is similar to the proportion agreement p0, except that it is corrected for the agreement which is due to chance. The coefficient is defined as

κ = (p0 − pc) / (1 − pc),

where

pc = Σ_{l=1}^{L} pl. p.l,

with pl. and p.l denoting the marginal proportions of examinees classified into level l according to true score and observed score, respectively.

The accuracy of classifications based on test scores is critically dependent on measurement accuracy at the cut-score points (Ercikan & Julian, 2001; Hambleton & Slater, 1997). Even though higher measurement accuracy tends to imply higher classification accuracy, higher reliability, such as that indicated by KR-20 or coefficient alpha, does not necessarily imply higher classification accuracy. These indices give an overall indication of the measurement accuracy the test provides for all examinees; however, they do not convey the measurement accuracy at the cut-scores. Therefore, they are not sufficient indicators of the accuracy of classification decisions based on test performance. Moreover, measurement accuracy is expected to vary across score ranges, resulting in variation in classification accuracy. This points to a serious limitation in the interpretability of single indices that are intended to represent the classification accuracy of a test given a set of cut-scores.
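The sketch below illustrates two of these quantities: the Spearman-Brown projection of Equation 3, and the agreement indices p0 and κ computed from a table that cross-classifies examinees by true-score level and observed-score level. The three-level table is hypothetical, invented only to show the arithmetic.

```python
# Sketch: the Spearman-Brown projection of Equation 3, and the agreement indices p0 and
# kappa from a hypothetical table cross-classifying examinees by true-score level (rows)
# and observed-score level (columns).
import numpy as np

def spearman_brown_double(rho):
    """Projected reliability of an otherwise comparable test with twice as many items."""
    return 2 * rho / (1 + rho)

def classification_agreement(table):
    """Returns (p0, kappa) from a square table of joint classification proportions."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                                   # normalize to proportions
    p0 = np.trace(p)                                  # same level by both classifications
    pc = (p.sum(axis=1) * p.sum(axis=0)).sum()        # agreement expected by chance
    return p0, (p0 - pc) / (1 - pc)

print(spearman_brown_double(0.80))                    # 0.89: doubling a test of reliability .80

tbl = [[0.30, 0.05, 0.00],                            # invented three-level example
       [0.05, 0.25, 0.05],
       [0.00, 0.05, 0.25]]
print(classification_agreement(tbl))                  # p0 = 0.80, kappa is about 0.70
```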
Ercikan and Julian's (2001) study found that classification accuracy can be dramatically different for examinees at different ability levels. Their results demonstrated that comparing classification accuracy across tests could be deceptive, since classification accuracy may be higher for one test in certain score ranges and lower in others. In light of these limitations in the interpretability of classification accuracy across different score ranges, these authors recommend that classification accuracy be reported separately for different score ranges.

Comparability

Comparability, it will be recalled, concerns the equivalence of inference when different bodies of data are gathered to compare students, or to assess change for the same students at different points in time. Within a statistical framework, we can build models that address quantitative aspects of questions such as these: Do the different bodies of data have such different properties as evidence as to jeopardize the inferences? Are conclusions about students' knowledge biased in one direction or another when different data are gathered? Is there more or less weight for various claims under the different alternatives?

A time-honored way of establishing comparability has been creating parallel test forms. A common rationale is developed to create collections of tasks which, taken together, can be argued to provide data about the same targeted skills and knowledge—differing, it is hoped, only in incidentals that do not accumulate. One might, for example, define a knowledge-by-skills matrix, write items in each cell, and construct tests by selecting the same numbers of tasks from each cell for every test form. The same substantive backing thus grounds all the forms. But it would be premature to presume that equal scores from these tests constitute equivalent evidence about students' knowledge. Despite care in their construction, possible differences between the tests as to difficulty or amount of information must be considered as an alternative explanation for differing performances among students. Empirical studies and statistical analyses enter the picture at this point, in the form of equating studies (Petersen, Kolen, & Hoover, 1989). Finding that similar groups of students systematically perform better on Form A than on Form B confirms the alternative explanation. Adjusting scores for Form B upward to match the resulting distributions addresses this concern, refines the chain of reasoning to take form differences into account when drawing claims about students, and enters the compendium of backing for the assessment system as a whole. Below we shall see that IRT extends comparability arguments to test forms that differ in difficulty and accuracy, if they can satisfy the requirements of a more ambitious statistical model.

Fairness

The meaning-focused sense of fairness we have chosen to highlight concerns a claim that would follow from the common argument, but would be called into question by an alternative explanation sparked by other information we could and should have taken into account. When we extend the discussion of fairness to statistical models, we find macro-level and micro-level strategies to address this concern. Macro-level strategies of fairness fall within the broad category of what the assessment literature calls validity studies, and are investigations in the nomothetic net.
They address broad patterns in test data, at the level of arguments or alternative explanations in the common arguments that are used with many students. Suppose the plan for the assessment is to use data (say, essay responses) to back a claim about a student's knowledge (e.g., the student can back up opinions with examples) through an argument (e.g., students in a pretest who are known to be able to do this in their first language are observed to do so in essays that call for it), without regard to a background factor (such as a student's first language). The idea is to gather, from a group of students, data that include their test performances but also include information about their first language and higher-quality validation data about the claim (e.g., interviews in the students' native languages). The empirical question is whether inferences from the usual data to the claim (independently evidenced by the validity data) differ systematically with first language. In particular, are there students who can back arguments with specifics in their native language, but fail to do so on the essay test because of language difficulties? If so, the door is open to distorted inferences about argumentation skills for limited-English speakers, if one proceeds from the usual data through the usual argument, disregarding language proficiency.

What can we do when the answer is "yes"? Possibilities include improving the data collected for all students, taking their first language into account when reasoning from data to claim (recognizing that language difficulties can account for poor performance even when the skill of interest is present), and pre-identifying students whose limited language proficiencies are likely to lead to flawed inferences about the targeted knowledge. In this last instance, additional or different data could be used for these students, such as an interview or an essay in their primary language.

These issues are particularly important in assessments used for making consequential proficiency-based decisions, in ways related to the points we raised concerning the validity of such tests. Unfair decisions are rendered if (a) alternative valid means of gathering data for evaluating proficiency yield results that differ systematically from the standard assessment, and (b) the reason can be traced to requirements for knowledge or skills (e.g., proficiency with the English language) that are not central to the knowledge or skill that is at issue (e.g., constructing and backing an argument). The same kinds of investigations can be carried out with individual tasks as well as with assessments as a whole. One variation on this theme can be used with assessments that are composed of several tasks, to determine whether individual tasks interact with first language in atypical ways. These are called studies of DIF, or differential item functioning (Holland & Wainer, 1993).

Statistical tools can also be used to implement micro-level strategies to call attention to cases in which a routine application of the standard argument could produce a distorted and possibly unfair inference. The common argument provides a warrant to reason from data to claim, with attendant caveats for unfairness associated with factors (such as first language) that have been dealt with at the macro level. But the argument may not hold for some individual students for other reasons, which have not been dealt with at the macro level and perhaps could not have been anticipated at all.
Measurement models characterize patterns in students' data that are typical if the general argument holds. Patterns that are unlikely can signal that the argument may not apply with a given student on a given assessment occasion. Under IRT, for example, 'student misfit' indices take high values for students who miss items that are generally easy while correctly answering ones that are generally hard (Levine & Drasgow, 1982).

Some Other Widely-Used Measurement Models

The tools of classical test theory have been continually extended and refined since Spearman's time, through the extensive toolkit of Gulliksen (1950/1987), to the sophisticated theoretical framework laid out in Lord and Novick (1968). Lord and Novick aptly titled their volume Statistical theories of mental test scores, underscoring their focus on the probabilistic reasoning aspects in the measurement-model links of the assessment argument—not the purposes, not the substantive aspects, not the evaluation rules that produce the data. Models that extend the same fundamental reasoning for this portion of assessment arguments to wider varieties of data and student models include generalizability theory, item response theory, latent class models, and multivariate models. Each of these extensions offers more options for characterizing students and collecting data in a way that can be embedded in a probability model. The models do not concern themselves directly with substantive aspects of an assessment argument, but substantive considerations often have much to say about how one should think about students' knowledge, and what observations should contain evidence about it. The more measurement models that are available and the more kinds of data that can be handled, the better assessors can match rigorous models with the patterns their theories and their needs concern. This bolsters evidentiary arguments (validity), extends quantitative indices of accuracy to more situations (reliability), enables more flexibility in observational settings (comparability), and enhances the prospects of detecting students whose data are at odds with the standard argument (fairness).

Generalizability Theory

Generalizability Theory (g-theory) extends classical test theory by allowing us to examine how different aspects of the observational setting affect the evidentiary value of test scores. As in CTT, the student is characterized by overall proficiency in some domain of tasks. However, the measurement model can now include parameters that correspond to "facets" of the observational situation such as features of tasks (i.e., task-model variables), numbers and designs of raters, and qualities of performance that will be evaluated. An observed score of a student in a generalizability study of an assessment consisting of different item types and judgmental scores is an elaboration of the basic CTT equation:

Xijk = θi + τj + ςk + Eijk,

where the observed score Xijk is from Examinee i, on Item-Type j, as evaluated by Rater k; θi is the true score of Examinee i; and τj and ςk are, respectively, effects attributable to Item-Type j and Rater k. Researchers carry out a generalizability study, or g-study, to estimate the amount of variation associated with different facets. The accuracy of estimation of scores for a given configuration of tasks can be calculated from these variance components, the numbers of items and raters, and the design in which data are collected.
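The sketch below illustrates the kind of bookkeeping a g-study involves for a fully crossed persons × item-types × raters design under the main-effects model just given. The data are simulated, the generating variance components (100, 16, 4, and 25) are invented for illustration, and the final function projects a coefficient for a prospective design in which each examinee's score averages over a chosen number of item types and raters.

```python
# Sketch of a g-study for a fully crossed persons x item-types x raters design,
# under the main-effects model in the text (no interaction terms). Data are simulated;
# the generating variance components are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 200, 4, 3                                    # persons, item types, raters
var_p, var_i, var_r, var_e = 100.0, 16.0, 4.0, 25.0
x = (rng.normal(50, np.sqrt(var_p), (I, 1, 1))         # person (true-score) effects
     + rng.normal(0, np.sqrt(var_i), (1, J, 1))        # item-type effects
     + rng.normal(0, np.sqrt(var_r), (1, 1, K))        # rater effects
     + rng.normal(0, np.sqrt(var_e), (I, J, K)))       # residual error

grand = x.mean()
ss_p = J * K * ((x.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_i = I * K * ((x.mean(axis=(0, 2)) - grand) ** 2).sum()
ss_r = I * J * ((x.mean(axis=(0, 1)) - grand) ** 2).sum()
ss_e = ((x - grand) ** 2).sum() - ss_p - ss_i - ss_r
ms_p, ms_i, ms_r = ss_p / (I - 1), ss_i / (J - 1), ss_r / (K - 1)
ms_e = ss_e / (I * J * K - I - J - K + 2)

est_e = ms_e                                           # error variance component
est_p = (ms_p - ms_e) / (J * K)                        # person (true-score) component
est_i = (ms_i - ms_e) / (I * K)                        # item-type component
est_r = (ms_r - ms_e) / (I * J)                        # rater component
print(est_p, est_i, est_r, est_e)                      # rough recoveries of 100, 16, 4, 25
                                                       # (the item and rater components are
                                                       # noisy with so few levels)

def gen_coefficient(n_items, n_raters):
    """Projected generalizability coefficient when a score is the average over n_items
    randomly selected item types, each rated by the same n_raters raters."""
    error_part = est_i / n_items + est_r / n_raters + est_e / (n_items * n_raters)
    return est_p / (est_p + error_part)

print(gen_coefficient(1, 2))                           # one item type, two raters
```

For one randomly selected item type scored by the average of two raters, this projection reduces to the generalizability coefficient defined next.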
A "generalizability coefficient" is an extension of the CTT reliability coefficient: it is the proportion of true variance among students for the condition one wishes to measure, divided by the variance among the observed scores that would be obtained over repeated applications of the specified measurement procedure (how many observations, fixed or randomly selected; how many raters rating each observation; different or the same raters for different items; etc.). If, in the example above, we wanted to estimate θ using one randomly selected item, scored as the average of the ratings from two randomly selected raters, the coefficient of generalizability, denoted here as α, would be calculated as follows:

α = σ_θ² / [σ_θ² + σ_τ² + (σ_ς² + σ_E²)/2],

where σ_θ², σ_τ², σ_ς², and σ_E² are variance components for examinees, item types, raters, and error, respectively. The information resulting from a generalizability study can thus guide decisions about how to design procedures for making observations; for example, what design for assigning raters to performances, how many tasks and raters, and whether to average across raters, tasks, etc. In the BEAR Assessment System, a g-study could be carried out to see which type of assessment, embedded tasks or link items, resulted in more reliable scores. It could also be used to examine whether teachers were as consistent as external raters.

G-theory offers two important practical advantages over CTT: First, generalizability models allow us to characterize how the particulars of the evaluation rules and task-model variables affect the value of the evidence we gain about the student for various inferences. Second, this information is expressed in terms that allow us to project these evidentiary-value considerations to designs we have not actually used, but which could be constructed from elements similar to the ones we have observed. G-theory thus provides far-reaching extensions of the Spearman-Brown formula (Equation 3), for exploring issues of reliability and comparability in a broader array of data-collection designs than CTT can address. Generalizability theory was developed by Professor Lee Cronbach and his colleagues, and their monograph The dependability of behavioral measurements (Cronbach et al., 1972) remains a valuable source of information and insight. More recent sources such as Shavelson and Webb (1991) and Brennan (1983) provide the practitioner with friendlier notation and examples to build on.

Item Response Theory (IRT)

Classical test theory and generalizability theory share a serious shortcoming: measures of examinees are confounded with the characteristics of test items. It is hard to compare examinees who have taken tests that differ by even as much as a single item, or to compare items that have been administered to different groups of examinees. Item response theory (IRT) was developed to address this shortcoming. In addition, IRT can be used to make predictions about test properties using item properties and to manipulate parts of tests to achieve targeted measurement properties. Hambleton (1993) gives a readable introduction to IRT, while van der Linden and Hambleton (1997) provide a comprehensive though technical compendium of current IRT models. IRT further extends probability-based reasoning for addressing psychometric principles, and it sets the stage for further developments. We'll start with a brief overview of the key ideas.
At first, the student model under IRT seems to be the same as it is under CTT and g-theory, namely, a single variable measuring students' overall proficiency in some domain of tasks. Again the statistical model does not address the nature of that proficiency. The structure of the probability-based portion of the argument is the same as shown in Figure 9: conditional independence among observations given an underlying, inherently unobservable, proficiency variable θ. But now the observations are responses to individual tasks. For Item j, the IRT model expresses the probability of a given response xj as a function of θ and parameters βj that characterize Item j (such as its difficulty):

f(xj; θ, βj).     (4)

Under the Rasch (1960/1980) model for dichotomous (right/wrong) items, for example, the probability of a correct response takes the following form:

Prob(Xij = 1 | θi, βj) = f(1; θi, βj) = Ψ(θi − βj),     (5)

where Xij is the response of Student i to Item j, 1 if right and 0 if wrong; θi is the proficiency parameter of Student i; βj is the difficulty parameter of Item j; and Ψ(⋅) is the logistic function, Ψ(x) = exp(x)/[1+exp(x)]. The probability of an incorrect response is then

Prob(Xij = 0 | θi, βj) = f(0; θi, βj) = 1 − Ψ(θi − βj).     (6)

Taken together, Equations 5 and 6 specify a particular form for the item response function, Equation 4. Figure 11 depicts Rasch item response curves for two items, Item 1 an easy one with β1 = −1, and Item 2 a hard one with β2 = 2. It shows the probability of a correct response to each of the items for different values of θ. For both items, the probability of a correct response increases toward one as θ increases. Conditional independence means that for a given value of θ, the probability of Student i making responses xi1 and xi2 to the two items is the product of terms like Equations 5 and 6:

Prob(Xi1 = xi1, Xi2 = xi2 | θi, β1, β2) = Prob(Xi1 = xi1 | θi, β1) Prob(Xi2 = xi2 | θi, β2).     (7)

[[Figure 11—two item response curves]]

All this is reasoning from model and given parameters, to probabilities of not-yet-observed responses; as such, it is part of the warrant in the assessment argument, to be backed by empirical estimates and model criticism. In applications we need to reason in the reverse direction. Item parameters will have been estimated and responses observed, and we need to reason from an examinee's xs to the value of θ. Equation 7 is then calculated as a function of θ with xi1 and xi2 fixed at their observed values; this is the likelihood function. Figure 12 shows the likelihood function that corresponds to Xi1=0 and Xi2=1. One can estimate θ by the point at which the likelihood attains its maximum (0.5 in this example, midway between the two item difficulties), or use Bayes theorem to combine the likelihood function with a prior distribution for θ, p(θ), to obtain the posterior distribution p(θ|xi1,xi2).

[[Figure 12—a likelihood function]]

The amount of information about θ available from Item j, Ij(θ), can be calculated as a function of θ, βj, and the functional form of f (see the references mentioned above for formulas for particular IRT models). Under IRT, the amount of information for measuring proficiency at each point along the scale is simply the sum of these item-by-item information functions. The square root of the reciprocal of this value is the standard error of estimation, or the standard deviation of estimates of θ around its true value. Figure 13 is the test information curve that corresponds to the two items in the preceding example.
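A small sketch makes these pieces concrete for the two-item example. It evaluates the Rasch response probabilities of Equation 5, the likelihood of Equation 7 for the pattern xi1 = 0, xi2 = 1, a crude grid-based maximum of that likelihood, and the standard error implied by the test information; for the Rasch model the item information at θ is simply P(θ)[1 − P(θ)]. The code is only illustrative and uses nothing beyond the item difficulties given in the text.

```python
# Sketch of the two-item Rasch example: Equation 5 probabilities, the Equation 7 likelihood
# for the pattern x1 = 0, x2 = 1, a crude grid maximum of that likelihood, and the standard
# error implied by the test information. For the Rasch model, item information at theta is
# P(theta) * (1 - P(theta)).
import numpy as np

betas = np.array([-1.0, 2.0])                     # the item difficulties given in the text

def p_correct(theta, beta):
    """Psi(theta - beta), the Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def likelihood(theta, responses):
    p = p_correct(theta, betas)
    return np.prod(np.where(np.array(responses) == 1, p, 1 - p))

def test_information(theta):
    p = p_correct(theta, betas)
    return (p * (1 - p)).sum()

grid = np.linspace(-4, 4, 8001)
like = np.array([likelihood(t, [0, 1]) for t in grid])
theta_hat = grid[like.argmax()]                   # maximum-likelihood estimate of theta
se = 1.0 / np.sqrt(test_information(theta_hat))   # standard error of estimation
print(theta_hat, se)                              # about 0.5 and 1.8
```

With only two items the standard error is large, which is exactly the kind of statement about evidentiary value that the information function supports.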
It is of particular importance in IRT that once item parameters have been estimated ("calibrating" them), estimating individual students' θs and calculating the accuracy of those estimates can be accomplished for any subset of items. Easy items can be administered to fourth graders and harder ones to fifth graders, for example, but all scores arrive on the same θ scale. Different test forms can be given as pretests and posttests, and differences of difficulty and accuracy are taken into account.

[[Figure 13—an information function]]

IRT helps assessors achieve psychometric quality in several ways.

Concerning validity: The statistical framework indicates the patterns of observable responses that would occur in data if it were actually the case that a single underlying proficiency did account for all the systematic variation among students and items. All the tools of model criticism from centuries of probability-based reasoning can be brought to bear to assess how well an IRT model fits a given data set, and where it breaks down, now item by item, student by student. The IRT model does not address the substance of the tasks, but by highlighting tasks that are operating differently from others, or proving harder or easier than expected, it helps test designers improve their work.

Concerning reliability: Once item parameters have been estimated, a researcher can gauge the precision of measurement that would result from different configurations of tasks. Precision of estimation can be gauged uniquely for any matchup between a person and a set of items. We are no longer bound to measures of reliability that are tied to specific populations and fixed test forms.

Concerning comparability: IRT offers strategies beyond the reach of CTT and g-theory for assembling tests that "measure the same thing." These strategies capitalize on the above-mentioned capability to predetermine the precision of estimation from different sets of items at different levels of θ. Tests that provide optimal measurement for mastery decisions can be designed, for example, or tests that provide targeted amounts of precision at specified levels of proficiency (van der Linden, 1998). Large content domains can be covered in educational surveys by giving each student only a sample of the tasks, yet using IRT to map all performances onto the same scale. The National Assessment of Educational Progress, for example, has made good use of the efficiencies of this item sampling in conjunction with reporting based on IRT (Messick, Beaton, & Lord, 1983). Tests can even be assembled on the fly in light of a student's previous responses as assessment proceeds, a technique called adaptive testing (see Wainer et al., 2000, for practical advice on constructing computerized adaptive tests).

Concerning fairness: An approach called "differential item functioning" (DIF) analysis, based on IRT and related methods, has enabled both researchers and large-scale assessors to routinely and rigorously test for a particular kind of unfairness (e.g., Holland & Thayer, 1988; Lord, 1980). The idea is that a test score, such as a number-correct score or an IRT θ estimate, is a summary over a large number of item responses, and a comparison of students at the level of scores implies that they are similarly comparable across the domain being assessed.
But what if some of the items are systematically harder for students from a group defined by cultural or educational background, for reasons that are not related to the knowledge or skill that is meant to be measured? This is DIF, and it can be formally represented in a model containing interaction terms for items by groups by overall proficiency—an interaction whose presence can threaten score meaning and distort comparisons across groups. A finding of significant DIF can imply that the observation framework needs to be modified, or, if the DIF is common to many items, that the construct-representation argument is oversimplified. DIF methods have been used in examining differential response patterns for gender and ethnic groups for the last two decades and for language groups more recently. They are now being used to investigate whether different groups of examinees of approximately the same ability appear to be using differing cognitive processes to respond to test items. Such uses include examining whether differential difficulty levels are due to differential cognitive processes, language differences (Ercikan, 1998), solution strategies and instructional methods (Lane, Wang, & Magone, 1996), and skills required by the test that are not uniformly distributed across examinees (O'Neil & McPeek, 1993).

Extensions of IRT

We have just seen how IRT extends statistical modeling beyond the constraints of classical test theory and generalizability theory. The simple elements in the basic equation of IRT (Equation 4) can be elaborated in several ways, each time expanding the range of assessment situations to which probability-based reasoning can be applied in the pursuit of psychometric principles.

Multiple-category responses. Whereas IRT was originally developed for dichotomous (right/wrong) test items, researchers have extended the machinery to observations that are coded in multiple categories. This is particularly useful for performance assessment tasks that are evaluated by raters on, say, 0-5 scales. Samejima (1969) carried out pioneering work in this regard. Thissen and Steinberg (1986) explain the mathematics of the extension and provide a useful taxonomy of multiple-category IRT models, and Wright and Masters (1982) offer a readable introduction to their use.

Rater models. The preceding paragraph mentioned that multiple-category IRT models are useful in performance assessments with judgmental rating scales. But judges themselves are sources of uncertainty, as even knowledgeable and well-meaning raters rarely agree perfectly. Generalizability theory, discussed earlier, incorporates the overall impact of rater variation on scores. Adding terms for individual raters into the IRT framework goes further, so that we can adjust for their particular effects, offer training when it is warranted, and identify questionable ratings with greater sensitivity. Recent work along these lines is illustrated by Patz and Junker (1999) and Linacre (1989).

Conditional dependence. Standard IRT assumes that responses to different items are independent once we know the item parameters and the examinee's θ. This is not strictly true when several items concern the same stimulus, as in paragraph comprehension tests. Knowledge of the content tends to improve performance on all items in the set, while misunderstandings tend to depress them all, in ways that don't affect items from other sets. Ignoring these dependencies leads one to overestimate the information in the responses.
The problem is more pronounced in complex tasks when responses to one subtask depend on results from an earlier subtask, or when multiple ratings of different aspects of the same performance are obtained. Wainer and his colleagues (e.g., Wainer & Kiely, 1987; Bradlow, Wainer, & Wang, 1999) have studied conditional dependence in the context of IRT. This line of work is particularly important for tasks in which several aspects of the same complex performance must be evaluated (Yen, 1993).

Multiple attribute models. Standard IRT posits a single proficiency to "explain" performance on all the items in a domain. One can extend the model to situations in which multiple aspects of knowledge and skill are required in different mixes in different items. One stream of research on multivariate IRT follows the tradition of factor analysis, using analogous models and focusing on estimating structures from tests more or less as they come to the analyst from the test developers (e.g., Reckase, 1985). Another stream starts from multivariate conceptions of knowledge, and constructs tasks that contain evidence of that knowledge in theory-driven ways (e.g., Adams, Wilson, & Wang, 1997). As such, this extension fits in neatly with the task-construction extensions discussed in the following paragraph. Either way, having a richer syntax to describe examinees within the probability-based argument supports more nuanced discussions of knowledge and the ways it is revealed in task performances.

Incorporating item features into the model. Embretson (1983) not only argued for paying greater attention to construct representation in test design, she also proposed how to do it: incorporate task-model variables into the statistical model, and make explicit the ways that features of tasks impact examinees' performance. A signal article in this regard was Fischer's (1973) linear logistic test model, or LLTM. The LLTM is a simple extension of the Rasch model shown above in Equation 5, with the further requirement that each item difficulty parameter βj is the sum of effects that depend on the features of that particular item:

βj = Σ_{k=1}^{m} qjk ηk,

where ηk is the contribution to item difficulty from Feature k, and qjk is the extent to which Feature k is represented in Item j. Some of the substantive considerations that drive task design can thus be embedded in the statistical model, and the tools of probability-based reasoning are available to examine how well they hold up in practice (validity), how they affect measurement precision (reliability), how they can be varied while maintaining a focus on targeted knowledge (comparability), and whether some items prove hard or easy for unintended reasons (fairness). Embretson (1998) walks through a detailed example of test design, psychometric modeling, and construct validation from this point of view. Additional contributions along these lines can be found in the work of Tatsuoka (1990), Falmagne and Doignon (1988), Pirolli and Wilson (1998), and DiBello, Stout, and Roussos (1995).
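A brief sketch shows the bookkeeping the LLTM entails: a feature matrix and a vector of feature effects combine to give item difficulties, which then feed the Rasch response function of Equation 5. The feature matrix and effect values below are invented solely for illustration and are not drawn from any of the studies cited.

```python
# Sketch of the LLTM bookkeeping: a feature matrix Q and feature effects eta combine to
# give item difficulties, beta_j = sum_k q_jk * eta_k, which feed the Rasch model of
# Equation 5. The Q matrix and eta values are invented solely for illustration.
import numpy as np

Q = np.array([[1, 0],            # rows: items; columns: task features
              [2, 0],            # (e.g., number of rule applications, abstract content)
              [1, 1],
              [2, 1]])
eta = np.array([0.4, 0.9])       # hypothetical difficulty contribution of each feature

beta = Q @ eta                   # item difficulties implied by the LLTM
print(beta)                      # [0.4, 0.8, 1.3, 1.7]

def p_correct(theta, beta):
    """Rasch probability of a correct response, as in Equation 5."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

print(p_correct(1.0, beta))      # predicted performance of a theta = 1.0 examinee on each item
```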
Progress on other fronts

The steady extension of probability-based tools to wider ranges of assessment uses has not been limited to IRT. In this section we will mention some other important lines of development, and point to work that is bringing these many lines of progress into the same methodological framework.

Latent class models. Research on learning suggests that knowledge and skills in some domains could be characterized as discrete states (e.g., the "production rule" models John Anderson uses in his intelligent tutoring systems; Anderson, Boyle, & Corbett, 1990). Latent class models characterize an examinee as a member of one of a number of classes, rather than as a position on a continuous scale (Lazarsfeld, 1950; Dayton, 1999; Haertel, 1989). The classes themselves can be considered ordered or unordered. The key idea is that students in different classes have different probabilities of responding in designated ways in assessment settings, depending on their values of the knowledge and skill variables that define the classes. When this is what theory suggests and purpose requires, using a latent class model offers the possibility of a more valid interpretation of assessment data. The probability-based framework of latent class modeling again enables us to rigorously test this hypothesis, and to characterize the accuracy with which observed responses identify students with classes. Reliability in latent class models is therefore expressed in terms of correct classification rates.

Models for other kinds of data. All of the machinery of IRT, including the extensions to multivariate student models, raters, and task features, can be applied to data other than just dichotomous and multiple-category observations. Less research and fewer applications appear in the literature, but the ideas can be found for counts (Rasch, 1960/1980), continuous variables (Samejima, 1973), and behavior observations such as incidence and duration (Rogosa & Ghandour, 1991).

Models that address interrelationships among variables. The developments in measurement models we have discussed encompass wider ranges of student models, observations, and task features, all increasing the fidelity of probability-based models to real-world situations. This contributes to improved construct representation. Progress on methods to study nomothetic span has taken place as well. Important examples include structural equation models and hierarchical models. Structural equation models (e.g., Jöreskog & Sörbom, 1979) incorporate theoretical relationships among variables and simultaneously take measurement error into account, so that complex hypotheses can be posed and tested coherently. Hierarchical models (e.g., Bryk & Raudenbush, 1992) incorporate the ways that students are clustered in classrooms, classrooms within schools, and schools within higher levels of organization, to better sort out the within- and across-level effects that correspond to a wide variety of instructional, organizational, and policy issues, and growth and change. Clearer specifications and coherent statistical models of the relationships among variables help researchers frame and critique "nomothetic net" validity arguments.

Progress in statistical methodology. One kind of scientific breakthrough is to recognize situations previously handled with different models, different theories, or different methods as special cases of a single approach. The previously mentioned models and accompanying computer programs for structural equation and hierarchical analyses qualify. Both have significantly advanced statistical investigations in the social sciences, validity studies among them, and made formerly esoteric analyses more widely available.
Developments taking place today in statistical computing are beginning to revolutionize psychometric analysis in a similar way. Those developments comprise resampling-based estimation, full Bayesian analysis, and modular construction of statistical models (Gelman, Carlin, Stern, & Rubin, 1995). The idea is this. The difficulty of managing evidence leads most substantive researchers to work within known and manageable families of analytic models; that is, ones with known properties, available procedures, and familiar exemplars. All of the psychometric models discussed above followed their own paths of evolution, each over the years generating its own language, its own computer programs, its own community of practitioners. Modern computing approaches, such as Markov chain Monte Carlo estimation, provide a general way to construct and fit such models with more flexibility, and to see them all as variations on a common theme. In the same conceptual framework and with the same estimation approach, we can carry out probability-based reasoning with all of the models we have discussed. Moreover, we can mix and match components of these models, and create new ones, to produce models that correspond to assessment designs motivated by theory and purpose. This approach stands in contrast to the compromises in theory and methods that result when we have to gather data to meet the constraints of specific models and specialized computer programs. The freeware computer program BUGS (Spiegelhalter et al., 1995) exemplifies this building-block approach.

These developments are softening the boundaries between researchers who study psychometric modeling and those who address the substantive aspects of assessment. A more thoughtful integration of substantive and statistical lines of evidentiary arguments in assessment will further the understanding and the attainment of psychometric principles.

Conclusion

These are days of rapid change in assessment.vi Advances in cognitive psychology deepen our understanding of how students gain and use knowledge (National Research Council, 1999). Advances in technology make it possible to capture more complex performances in assessment settings, by including, for example, simulation, interactivity, collaboration, and constructed response (Bennett, 2001). Yet as forms of assessment evolve, two themes endure: the importance of psychometric principles as guarantors of social values, and their realization through sound evidentiary arguments. We have seen that the quality of assessment depends on the quality of the evidentiary argument, and how substance, statistics, and purpose must be woven together throughout the argument. A conceptual framework such as the assessment design models of Figure 1 helps experts from different fields integrate their diverse work to achieve this end (Mislevy, Steinberg, Almond, Haertel, & Penuel, in press). Questions will persist, as to 'How do we synthesize evidence from disparate sources?', 'How much evidence do we have?', 'Does it tell us what we think it does?', and 'Are the inferences appropriate for each student?' The perspectives and the methodologies that underlie psychometric principles—validity, reliability, comparability, and fairness—provide formal tools to address these questions, in whatever specific forms they arise.

Notes
i We are indebted to Prof. David Schum for our understanding of evidentiary reasoning, such as it is. The first part of this section draws on Schum (1987, 1994) and Kadane and Schum (1996).

ii p(Xj|θ) is the probability density function for the random variable Xj, given that θ is fixed at a specified value.

iii Strictly speaking, CTT does not address the full distributions of true and observed scores, only means, variances, and covariances. But we want to illustrate probability-based reasoning and review CTT at the same time. Assuming normality for θ and E is the easiest way to do this, since the first two moments are sufficient for normal distributions.

iv In statistical terms, if the parameters are identified. Conditional independence is key, because CI relationships enable us to make multiple observations that are assumed to depend on the same unobserved variables in ways we can model. This generalizes the concept of replication that grounds reliability analysis.

v See Greeno, Collins, and Resnick (1996) for an overview of these three perspectives on learning and knowing, and discussion of their implications for instruction and assessment.

vi Knowing what students know (National Research Council, 2001), a report by the Committee on the Foundations of Assessment, surveys these developments.

References

Adams, R., Wilson, M.R., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23.

Almond, R. G., & Mislevy, R. J. (1999). Graphical models and computerized adaptive testing. Applied Psychological Measurement.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Anderson, J.R., Boyle, C.F., & Corbett, A.T. (1990). Cognitive modelling and intelligent tutoring. Artificial Intelligence, 42, 7-49.

Bennett, R. E. (2001). How the internet will help large-scale assessment reinvent itself. Education Policy Analysis Archives, 9(5).

Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.

Brennan, R. L. (1983). The elements of generalizability theory. Iowa City, IA: American College Testing Program.

Brennan, R. L. (2000/in press). An essay on the history and future of reliability from the perspective of replications. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, April 2000. To appear in the Journal of Educational Measurement.

Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.

Bryk, A. S., & Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park: Sage Publications.

Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Cohen, J. A. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Cronbach, L.J. (1989). Construct validation after thirty years. In R.L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147-171). Urbana, IL: University of Illinois Press.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972).
The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Dayton, C. M. (1999). Latent class scaling analysis. Thousand Oaks, CA: Sage Publications.

DiBello, L.V., Stout, W.F., & Roussos, L.A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361-389). Hillsdale, NJ: Erlbaum.

Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.

Embretson, S. E. (1998). A cognitive design systems approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380-396.

Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational Research, 29, 543-553.

Ercikan, K., & Julian, M. (2001, in press). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education.

Falmagne, J.-C., & Doignon, J.-P. (1988). A class of stochastic procedures for the assessment of knowledge. British Journal of Mathematical and Statistical Psychology, 41, 1-23.

Fischer, G.H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Gelman, A., Carlin, J., Stern, H., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman and Hall.

Greeno, J. G., Collins, A. M., & Resnick, L. B. (1996). Cognition and learning. In D. C. Berliner & R. C. Calfee (Eds.), Handbook of educational psychology (pp. 15-146). New York: Macmillan.

Gulliksen, H. (1950/1987). Theory of mental tests. New York: John Wiley/Hillsdale, NJ: Lawrence Erlbaum.

Haertel, E.H. (1989). Using restricted latent class models to map the skill structure of achievement test items. Journal of Educational Measurement, 26, 301-321.

Haertel, E.H., & Wiley, D.E. (1993). Representations of ability structures: Implications for testing. In N. Frederiksen, R.J. Mislevy, & I.I. Bejar (Eds.), Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum.

Hambleton, R. K. (1993). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). Phoenix, AZ: American Council on Education/Oryx Press.

Hambleton, R. K., & Slater, S. C. (1997). Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Applied Measurement in Education, 10, 19-39.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedures. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.

Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

Jöreskog, K. G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.

Kadane, J.B., & Schum, D.A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.

Kane, M.T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.

Kelley, T.L. (1927). Interpretation of educational measurements. New York: World Book.

Kuder, G.F., & Richardson, M.W. (1937).
The theory of estimation of test reliability. Psychometrika, 2, 151-160.

Lane, S., Wang, N., & Magone, M. (1996). Gender-related differential item functioning on a middle-school mathematics performance assessment. Educational Measurement: Issues and Practice, 15(4), 21-27, 31.

Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. R. Lazarsfeld, S. A. Star, & J. A. Clausen (Eds.), Measurement and prediction (pp. 362-412). Princeton, NJ: Princeton University Press.

Levine, M., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42-56.

Linacre, J. M. (1989). Many faceted Rasch measurement. Doctoral dissertation, University of Chicago.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.

Messick, S., Beaton, A.E., & Lord, F.M. (1983). National Assessment of Educational Progress reconsidered: A new design for a new era. NAEP Report 83-1. Princeton, NJ: National Assessment for Educational Progress.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (in press). On the roles of task model variables in assessment design. To appear in S. Irvine & P. Kyllonen (Eds.), Generating items for cognitive tests: Theory and practice. Hillsdale, NJ: Erlbaum.

Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G., & Penuel, W. (in press). Leverage points for improving educational assessment. In B. Means & G. Haertel (Eds.), Evaluating the effects of technology in education. Hillsdale, NJ: Erlbaum.

Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (1999). A cognitive task analysis, with implications for designing a simulation-based assessment system. Computers in Human Behavior, 15, 335-374.

Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (in press). Making sense of data from complex assessment. Applied Measurement in Education.

Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.

National Research Council (1999). How people learn: Brain, mind, experience, and school. Committee on Developments in the Science of Learning. Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.). Washington, DC: National Academy Press.

National Research Council (2001). Knowing what students know: The science and design of educational assessment. Committee on the Foundations of Assessment. Pellegrino, J., Chudowsky, N., & Glaser, R. (Eds.). Washington, DC: National Academy Press.

O'Neil, K. A., & McPeek, W. M. (1993). In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Erlbaum.

Patz, R. J., & Junker, B. W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24(4), 342-366.
Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: American Council on Education/Macmillan.

Pirolli, P., & Wilson, M. (1998). A theory of the measurement of knowledge content, access, and learning. Psychological Review, 105(1), 58-82.

Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research/Chicago: University of Chicago Press (reprint).

Reckase, M. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401-412.

Rogosa, D.R., & Ghandour, G.A. (1991). Statistical models for behavioral observations (with discussion). Journal of Educational Statistics, 16, 157-252.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34 (No. 4, Part 2).

Samejima, F. (1973). Homogeneous case of the continuous response level. Psychometrika, 38, 203-219.

Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.

Schum, D.A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.

SEPUP (1995). Issues, evidence, and you: Teacher's guide. Berkeley: Lawrence Hall of Science.

Shavelson, R. J., & Webb, N. W. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage Publications.

Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.

Spearman, C. (1910). Correlation calculated with faulty data. British Journal of Psychology, 3, 271-295.

Spiegelhalter, D.J., Thomas, A., Best, N.G., & Gilks, W.R. (1995). BUGS: Bayesian inference using Gibbs sampling, Version 0.50. Cambridge: MRC Biostatistics Unit.

Tatsuoka, K.K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M.G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453-488). Hillsdale, NJ: Erlbaum.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.

Toulmin, S. (1958). The uses of argument. Cambridge, England: Cambridge University Press.

Traub, R. E., & Rowley, G. L. (1980). Reliability of test scores and decisions. Applied Psychological Measurement, 4, 517-545.

van der Linden, W. J. (1998). Optimal test assembly. Applied Psychological Measurement, 22, 195-202.

van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer.

Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (second edition). Hillsdale, NJ: Lawrence Erlbaum Associates.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 195-201.

Wiley, D.E. (1991). Test validity and invalidity reconsidered. In R.E. Snow & D.E. Wiley (Eds.), Improving inquiry in social science (pp. 75-107). Hillsdale, NJ: Erlbaum.

Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.

Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.

Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991).
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.

[Figure 1 graphic: the Student Model, Evidence Model(s) (comprising a Scoring Model and a Measurement Model), and Task Model(s), with student-model variables θ1 through θ5 linked to observable variables X1 and X2 from a series of tasks.]
Figure 1 General Form of the Assessment Design Models

Understanding Concepts (U)--Understanding scientific concepts (such as properties and interactions of materials, energy, or thresholds) in order to apply the relevant scientific concepts to the solution of problems. This variable is the IEY version of the traditional “science content”, although this content is not just “factoids”.
Designing and Conducting Investigations (I)--Designing a scientific experiment, carrying through a complete scientific investigation, performing laboratory procedures to collect data, recording and organizing data, and analyzing and interpreting results of an experiment. This variable is the IEY version of the traditional “science process”.
Evidence and Tradeoffs (E)--Identifying objective scientific evidence as well as evaluating the advantages and disadvantages of different possible solutions to a problem based on the available evidence.
Communicating Scientific Information (C)--Organizing and presenting results in a way that is free of technical errors and effectively communicates with the chosen audience.
Figure 2 The Variables in the Student Model for the BEAR “Issues, Evidence, and You” Example

[Figure 3 graphic: the four student-model variables θE, θU, θI, and θC.]
Figure 3 Graphical Representation of the BEAR Student Model

2. You are a public health official who works in the Water Department. Your supervisor has asked you to respond to the public's concern about water chlorination at the next City Council meeting. Prepare a written response explaining the issues raised in the newspaper articles. Be sure to discuss the advantages and disadvantages of chlorinating drinking water in your response, and then explain your recommendation about whether the water should be chlorinated.
Figure 4 An Example of a Task Directive from the BEAR Assessment

“As an edjucated employee of the Grizzelyville water company, I am well aware of the controversy surrounding the topic of the chlorination of our drinking water. I have read the two articals regarding the pro’s and cons of chlorinated water. I have made an informed decision based on the evidence presented the articals entitled “The Peru Story” and “700 Extra People May bet Cancer in the US.” It is my recommendation that our towns water be chlorin treated. The risks of infecting our citizens with a bacterial diseease such as cholera would be inevitable if we drink nontreated water. Our town should learn from the country of Peru. The artical “The Peru Story” reads thousands of inocent people die of cholera epidemic. In just months 3,500 people were killed and more infected with the diease. On the other hand if we do in fact chlorine treat our drinking water a risk is posed. An increase in bladder and rectal cancer is directly related to drinking chlorinated water. Specifically 700 more people in the US may get cancer.
However, the cholera risk far outweighs the cancer risk for 2 very important reasons. Many more people will be effected by cholera where as the chance of one of our citizens getting cancer due to the water would be very minimal. Also cholera is a spreading diease where as cancer is not. If our town was infected with cholera we could pass it on to millions of others. And so, after careful consideration it is my opion that the citizens of Grizzelyville drink chlorine treated water.”
Figure 5 An Example of a Student Response from the BEAR Assessment

Using Evidence: Response uses objective reason(s) based on relevant evidence to support choice.
Score 4: Response accomplishes Level 3 AND goes beyond in some significant way, such as questioning or justifying the source, validity, and/or quantity of evidence.
Score 3: Response provides major objective reasons AND supports each with relevant & accurate evidence.
Score 2: Response provides some objective reasons AND some supporting evidence, BUT at least one reason is missing and/or part of the evidence is incomplete.
Score 1: Response provides only subjective reasons (opinions) for choice and/or uses inaccurate or irrelevant evidence from the activity.
Score 0: No response; illegible response; response offers no reasons AND no evidence to support choice made.
Score X: Student had no opportunity to respond.

Using Evidence to Make Tradeoffs: Response recognizes multiple perspectives of issue and explains each perspective using objective reasons, supported by evidence, in order to make choice.
Score 4: Response accomplishes Level 3 AND goes beyond in some significant way, such as suggesting additional evidence beyond the activity that would further influence choices in specific ways, OR questioning the source, validity, and/or quantity of evidence & explaining how it influences choice.
Score 3: Response discusses at least two perspectives of issue AND provides objective reasons, supported by relevant & accurate evidence, for each perspective.
Score 2: Response states at least one perspective of issue AND provides some objective reasons using some relevant evidence BUT reasons are incomplete and/or part of the evidence is missing; OR only one complete & accurate perspective has been provided.
Score 1: Response states at least one perspective of issue BUT only provides subjective reasons and/or uses inaccurate or irrelevant evidence.
Score 0: No response; illegible response; response lacks reasons AND offers no evidence to support decision made.
Score X: Student had no opportunity to respond.
Figure 6 The Scoring Model for Evaluating Two Observable Variables from Task Responses in the BEAR Assessment

[Figure 7 graphic: the two observable variables, Using Evidence and Using Evidence to Make Tradeoffs, connected to the student-model variable θE, shown alongside θU, θI, and θC.]
Figure 7 Graphical Representation of the Measurement Model for the BEAR Sample Task Linked to the BEAR Student Model

Claim: Sue can use specifics to illustrate a description of a fictional character.
Data (supports the claim): Sue’s essay uses three incidents to illustrate Hamlet’s indecisiveness.
Warrant (since): Students who know how to use writing techniques will do so in an assignment that calls for them.
Backing (on account of): The past three terms, students’ understandings of the use of techniques in in-depth interviews have corresponded with their performances in their essays.
Alternative hypothesis (unless): The student has not actually produced the work.
Rebuttal data (supports the alternative hypothesis): Sue’s essay is very similar to the character description in the Cliff Notes guide to Hamlet.
Figure 8 A Toulmin Diagram for a Simple Assessment Situation
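The structure in Figure 8 generalizes beyond this one example. The short sketch below is not part of the original figure; it is a minimal illustration, assuming a simple Python record type whose class and field names are our own, of how the elements of such an evidentiary argument might be recorded so that the warrant, backing, and potential rebuttals for each claim are explicit.

from dataclasses import dataclass

@dataclass
class ToulminArgument:
    # Elements of a Toulmin-style evidentiary argument, as laid out in Figure 8.
    claim: str
    data: str
    warrant: str         # licenses the inference from the data to the claim
    backing: str         # grounds for believing the warrant
    alternative: str     # an "unless" condition that would undercut the claim
    rebuttal_data: str   # evidence bearing on the alternative explanation

sue = ToulminArgument(
    claim="Sue can use specifics to illustrate a description of a fictional character.",
    data="Sue's essay uses three incidents to illustrate Hamlet's indecisiveness.",
    warrant="Students who know how to use writing techniques will do so in an "
            "assignment that calls for them.",
    backing="The past three terms, students' understandings of the use of techniques in "
            "in-depth interviews have corresponded with their performances in their essays.",
    alternative="The student has not actually produced the work.",
    rebuttal_data="Sue's essay is very similar to the character description in the "
                  "Cliff Notes guide to Hamlet.",
)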
[Figure 9 graphic: the student variable θ with prior distribution p(θ), linked to the observed scores X1, X2, and X3 through the conditional distributions p(X1|θ), p(X2|θ), and p(X3|θ).]
Figure 9 Statistical Representation for Classical Test Theory

Theorem. Let $N(\mu, \sigma)$ denote the normal (Gaussian) distribution with mean $\mu$ and standard deviation $\sigma$. If the prior distribution of $\theta$ is $N(\mu_0, \sigma_0)$ and $X$ given $\theta$ is $N(\theta, \sigma_E)$, then the distribution of $\theta$ posterior to observing $X$ is $N(\mu_{post}, \sigma_{post})$, where $\sigma_{post} = (\sigma_0^{-2} + \sigma_E^{-2})^{-1/2}$ and $\mu_{post} = (\sigma_0^{-2}\mu_0 + \sigma_E^{-2}X)/(\sigma_0^{-2} + \sigma_E^{-2})$.

Calculating the posterior distribution for Sue. Beginning with an initial distribution of $N(50, 10)$, we can compute the posterior distribution for Sue's $\theta$ after seeing three independent responses by applying the theorem three times, in each case with the posterior distribution from one step becoming the prior distribution for the next step.
a) Prior distribution: $\theta \sim N(50, 10)$.
b) After the first response: Given $\theta \sim N(50, 10)$ and $X_1 \sim N(\theta, 5)$, observing $X_1 = 70$ yields the posterior $N(66.0, 4.5)$.
c) After the second response: Given $\theta \sim N(66.0, 4.5)$ and $X_2 \sim N(\theta, 5)$, observing $X_2 = 75$ yields the posterior $N(70.0, 3.3)$.
d) After the third response: Given $\theta \sim N(70.0, 3.3)$ and $X_3 \sim N(\theta, 5)$, observing $X_3 = 85$ yields the posterior $N(74.6, 2.8)$.

Calculating a fit index for Sue. Suppose each of Sue's scores came from a $N(\theta, 5)$ distribution. Using the posterior mean estimated from Sue's scores, we can calculate how unusual her response vector is under this measurement model with a chi-square index of fit: $[(70 - 74.6)/5]^2 + [(75 - 74.6)/5]^2 + [(85 - 74.6)/5]^2 = .85 + .01 + 4.31 = 5.17$. Checking against the chi-square distribution with two degrees of freedom, we see that about 8 percent of the values are higher than this, so this vector is not that unusual.
Figure 10 A Numerical Example Using Classical Test Theory

[Figure 11 graphic: item response curves giving the probability of a correct response as a function of θ (from -5 to 5) for Item 1 (β1 = -1) and Item 2 (β2 = 2).]
Figure 11 Two Item Response Curves

[Figure 12 graphic: the likelihood function over θ induced by observing Xi1 = 0 and Xi2 = 1, with the maximum likelihood estimate marked at approximately .75.]
Figure 12 The IRT Likelihood Function Induced by Observing Xi1=0 and Xi2=1

[Figure 13 graphic: the test information function for the two-item example, together with the information contributed by Item 1 and by Item 2, plotted over θ.]
Figure 13 An IRT Test Information Curve for the Two-Item Example
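The calculations in Figure 10 lend themselves to a direct check. The short program below is not part of the original chapter; it is a sketch, with illustrative function and variable names, that carries out the normal-normal updating and the chi-square fit index exactly as stated in the figure, assuming the error standard deviation of 5 used in the example.

from math import exp

def normal_posterior(mu0, sigma0, x, sigma_e):
    # Normal prior N(mu0, sigma0) and normal likelihood X ~ N(theta, sigma_e):
    # the posterior for theta is again normal (the theorem in Figure 10).
    precision = sigma0 ** -2 + sigma_e ** -2
    mu_post = (sigma0 ** -2 * mu0 + sigma_e ** -2 * x) / precision
    return mu_post, precision ** -0.5

# Sue's three scores, each modeled as N(theta, 5); the prior for theta is N(50, 10).
mu, sigma = 50.0, 10.0
scores = (70, 75, 85)
for x in scores:
    mu, sigma = normal_posterior(mu, sigma, x, 5.0)
    print(f"after X = {x}: posterior N({mu:.1f}, {sigma:.1f})")
# Prints N(66.0, 4.5), N(70.0, 3.3), and N(74.6, 2.8), matching Figure 10.

# Chi-square index of fit, evaluated at the final posterior mean.
chi_square = sum(((x - mu) / 5.0) ** 2 for x in scores)
upper_tail = exp(-chi_square / 2)  # upper-tail probability for chi-square with 2 df
print(f"fit index = {chi_square:.2f}, upper-tail probability = {upper_tail:.2f}")

Because the three responses are modeled as independent given θ, updating one score at a time, as in Figure 10, gives the same final posterior as updating on all three at once.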
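Figures 11 through 13 can likewise be generated from the item difficulties once a response function is fixed. The sketch below is not from the original chapter; it assumes a Rasch (one-parameter logistic) item response function, which is consistent with the difficulty-only labels in Figure 11 but is our assumption, and its names are illustrative. Under that assumption the likelihood for the pattern Xi1 = 0, Xi2 = 1 peaks midway between the two difficulties, so the exact location marked in Figure 12 may reflect a different scaling; the code should be read as illustrating the construction rather than reproducing the plots exactly.

import numpy as np

def p_correct(theta, beta):
    # Rasch item response function: probability of a correct response for a
    # person at proficiency theta on an item of difficulty beta.
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

betas = (-1.0, 2.0)              # Item 1 and Item 2 difficulties, as in Figure 11
theta = np.linspace(-5, 5, 401)  # the range of theta shown in the figures

# Item response curves (Figure 11)
curves = [p_correct(theta, b) for b in betas]

# Likelihood induced by observing X1 = 0 (Item 1 wrong) and X2 = 1 (Item 2 right), as in Figure 12
likelihood = (1 - curves[0]) * curves[1]
theta_hat = theta[np.argmax(likelihood)]
print(f"likelihood is maximized near theta = {theta_hat:.2f}")

# Item and test information (Figure 13): for the Rasch model I(theta) = P(1 - P),
# and the test information is the sum of the item information functions.
item_info = [p * (1 - p) for p in curves]
test_info = sum(item_info)
print(f"test information at theta_hat = {test_info[np.argmax(likelihood)]:.2f}")

Plotting curves, likelihood, and test_info against theta reproduces the general shapes sketched in Figures 11 through 13.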