
Education Policy Analysis Archives
Volume 9 Number 6, February 22, 2001
ISSN 1068-2341

A peer-reviewed scholarly journal. Editor: Gene V Glass, College of Education, Arizona State University. Copyright 2001, the EDUCATION POLICY ANALYSIS ARCHIVES. Permission is hereby granted to copy any article if EPAA is credited and copies are not sold. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Teacher Test Accountability: From Alabama to Massachusetts

Larry H. Ludlow
Boston College

Abstract

Given the high stakes of teacher testing, there is no doubt that every teacher test should meet the industry guidelines set forth in the Standards for Educational and Psychological Testing. Unfortunately, however, there is no public or private business or governmental agency that serves to certify or in any other formal way declare that any teacher test does, in fact, meet the psychometric recommendations stipulated in the Standards. Consequently, there are no legislated penalties for faulty products (tests), nor are there opportunities for test takers simply to raise questions about a test and to have their questions taken seriously by an impartial panel. The purpose of this article is to highlight some of the psychometric results reported by National Evaluation Systems (NES) in their 1999 Massachusetts Educator Certification Test (MECT) Technical Report and, more specifically, to identify those technical characteristics of the MECT that are inconsistent with the Standards. A second purpose of this article is to call for the establishment of a standing test auditing organization with investigation and sanctioning power. The significance of the present analysis is twofold: a) psychometric results for the MECT are similar in nature to psychometric results presented as evidence of test development flaws in an Alabama class-action lawsuit dealing with teacher certification (an NES-designed testing system); and b) there was no impartial enforcement agency to whom complaints about the Alabama tests could be brought, other than the court, nor is there any such agency to whom complaints about the Massachusetts tests can be brought. I begin by reviewing NES's role in Allen v. Alabama State Board of Education, 81-697-N. Next I explain the purpose and interpretation of standard item analysis procedures and statistics. Finally, I present results taken directly from the 1999 MECT Technical Report and compare them to procedures, results, and consequences of procedures followed by NES in Alabama.

Teacher Test Accountability: From Alabama to Massachusetts

From its inception and continuing through present administrations, the Massachusetts Educator Certification Test (MECT) has attracted considerable public attention, both regionally and around the world (Cochran-Smith & Dudley-Marling, in press). This attention is due in part to two disturbing facts: 1) educators seeking certification in Massachusetts have generally performed poorly on the test, and 2) in many instances politicians have used these test results to assert, among other things, that candidates who failed are "idiots" (Pressley, 1998). The purpose of the MECT is "to ensure that each certified educator has the knowledge and some of the skills essential to teach in Massachusetts public schools" (National Evaluation Systems, 1999, p. 22).
The Massachusetts Board of Education has raised the stakes on the MECT by enacting plans to sanction institutions of higher education (IHEs) with less than an 80% pass rate for their teacher candidates (Massachusetts Department of Education, 2000). One consequence of this proposal is that most IHEs are considering requirements that the MECT be passed before students are admitted to their teacher education programs. In addition, Title II (Section 207) of the Higher Education Act of 1998 requires the compilation of state "report cards" for teacher education programs, which must include performance on certification examinations (U.S. Department of Education, 2000). What all of this means is that poor performance on the MECT could prevent federal funding for professional development programs, limit federal financial aid to students, allow some IHEs to be labeled publicly "low performing", and prove damaging at the state level when states are inevitably compared to one another upon release of the Title II report cards in October 2001. Given the personal, institutional, and national ramifications of the test results, there is no question that the MECT should be expected to meet the industry benchmarks for good test development practice as set forth in the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999). At this time, however, there is no public or private business or governmental agency, either within the Commonwealth of Massachusetts or nationally, that can certify or in any other formal way declare that the MECT does (or does not), in fact, meet the psychometric recommendations stipulated in the Standards. The National Board on Educational Testing and Public Policy (NBETPP) serves as an "independent organization that monitors testing in the US," but even it does not function as a regulatory agency (NBETPP, 2000). In addition to the absence of a national regulatory agency, many state departments of education do not have the professionally trained staff to answer directly technical psychometric questions. Nor do they usually have the expertise on staff to confront a testing company with which they have contracted and demand a sufficient response to a technical question raised by outside psychometricians. Furthermore, even when a database with the candidates' item-level responses is available for internal analysis, a state department of education does not typically conduct rigorous disconfirming analyses, e.g., evidence of adverse impact. Thus, most state departments are largely dependent on whatever information testing companies decide to release. The public is then left with an inadequate accountability process. One purpose of this article is to highlight some of the psychometric results reported by National Evaluation Systems in their 1999 MECT Technical Report (NES, 1999). Specifically, this article identifies technical characteristics of the MECT that are inconsistent with the Standards. A second purpose of this article is to voice one more call for the establishment of a standing test auditing organization with powers to investigate and sanction (National Commission on Testing and Public Policy, 1990; Haney, Madaus & Lyons, 1993). The significance of the present analysis is twofold. First, psychometric results reported by NES for the MECT are similar in nature to psychometric results entered as evidence of test development flaws in an Alabama class-action lawsuit dealing with teacher certification (Allen v. Alabama State Board of Education, 81-697-N).
That suit was brought by several African-American teachers who charged, among other things, that "the State of Alabama's teacher certification tests impermissibly discriminate[d] against black persons seeking teacher certification;" the tests "[were] culturally biased;" and the tests "[had] no relationship to job performance" (Allen, 1985, p. 1048). Second, there was no impartial enforcement agency to whom complaints about the Alabama tests could be brought, other than the court, nor is there any such agency to whom complaints about the Massachusetts tests can be brought. These two points are linked in an interesting and troubling way--NES, the Massachusetts Educator Certification Tests contractor, was also the contractor for the Alabama Initial Teacher Certification Testing Program (AITCTP). Some of the criticism of debates about teacher testing, teacher standards, teacher quality, and accountability suggests that arguments are, in part, ideologically rather than empirically based (Cochran-Smith, in press). This may or may not be the case. This article, however, takes the stance that regardless of one's political ideology or philosophy about testing, the MECT is technically flawed. Furthermore, because of the lack of an enforceable accountability process, the public is powerless in its efforts to question the quality or challenge the use of this state-administered set of teacher certification examinations. In this article I argue that the consequences of high-stakes teacher certification examinations are too great to leave questions about technical quality solely in the hands of state agency personnel, who are often ill-prepared and under-resourced, or in the hands of test contractors, who may face obvious conflicts-of-interest in any aggressive analyses of their own tests. In the sections that follow, I begin by reviewing NES's role in Allen v Alabama. Then I explain the purpose and interpretation of standard item analysis procedures and statistics. Finally I compare results taken directly from the 1999 MECT Technical Report with statistical results entered as evidence of test development flaws in Allen v Alabama.

NES and the AITCTP

Allen, et al. v. Alabama State Board of Education, et al.

In January 1980, National Evaluation Systems was awarded a contract on a non-competitive basis for the development of the Alabama Initial Teacher Certification Testing Program (AITCTP). Item writing for these tests began in the Spring of 1981, and the first administration of the tests took place on June 6, 1981. Allen v Alabama was brought just six months later, on December 15, 1981. The Allen complaint challenged the Alabama State Board of Education's requirement that applicants for state teacher certification pass certain standardized tests administered under the AITCTP. On October 14, 1983, class certification (Note 1) was granted, and the first trial was set for April 22, 1985. Subsequent to a pre-trial hearing on December 19, 1984 and "after substantial discovery was done" (Note 2), an out-of-court settlement was reached on April 4, 1985. A Consent Decree was presented to the U.S. District Court April 8, 1985 (Note 3). The Attorney General for the State of Alabama immediately "publicly attacked the settlement" (Allen, 1985, p. 1050), claiming that it was illegal. Nonetheless, the consent decree was accepted by the court October 25, 1985 (Allen, Oct. 25, 1985). A succession of challenges and appeals on the legality and enforceable status of the settlement resulted (Note 4).
For example, on February 5, 1986, the district court vacated its October 25th order approving the consent decree (Allen, Feb. 5, 1986, p. 76). While the plaintiffs' appeal of the February 5th decision was pending at the 11th Circuit Court of Appeals, trial began in district court on May 5, 1986. The AITCTP consisted of an English language proficiency examination, a basic professional studies examination, and 45 content-area examinations. The purpose of the examinations was to measure "specific competencies which are considered necessary to successfully teach in the Alabama schools" (Allen, Defendants' Pre-Trial Memorandum, 1986, p. 21). A pool of 120 items for each exam was generated--100 of which were scorable and mostly remained unchanged across the first eight administrations. Extensive revisions were incorporated into most of the tests at the ninth administration. By the start of the May 1986 trial the tests had been administered 15 times in all. A team of technical experts (Note 5) for the plaintiffs was hired in November 1983 (prior to the ninth administration of the exams) to examine test development, administration, and implementation procedures. The team was initially unsure about the form of the sophisticated statistical analyses they assumed would have to be conducted to test for the presence of "bias" and "discrimination", the bases of the case. That is, the methodology for investigating what was then called "bias" and is now called "differential item functioning" was far from well established at that time (Baldus & Cole, 1980). Nevertheless, when the plaintiffs' team received the student-level item response data from the defendants, their first steps were to perform an "item analysis." Such an analysis produces various item statistics and test reliability estimates. These initial analyses produced negative point-biserial correlations. Although point-biserial correlations are explained in detail below, suffice it to say at this point that it was a surprise to find negative point-biserial correlations between the responses that examinees provided on individual items and their total test scores. Such correlations are not an intended outcome from a well-designed testing program. These statistical results prompted a detailed inspection of the content, format, and answers for all the individual items on the AITCTP tests. Content analyses yielded discrepancies between the keyed correct responses in the NES test documents and the keyed correct responses in the NES-supplied machine-scorable answer keys (i.e., miskeyed items were on the answer keys). This finding led to an inspection of the original NES in-house analyses, which revealed that negative point-biserials for scorable items existed in their own records from the beginning of the testing program and continuing throughout the eighth administration without correction. What this meant for the plaintiffs was that NES had item analysis results in their own possession which indicated that there were mis-keyed items. Nonetheless, they implemented no significant changes in the exams until they were faced with a lawsuit and plaintiffs' hiring of the testing experts to do their own analyses. The defendants argued that it was normal for some problems to go undetected or uncorrected in a large-scale testing program because the overall effect is trivial for the final outcome.
The problem with that argument was that many candidates were denied credit for test items on which they should have received credit, and some of those candidates failed the exam by only one point. In fact, as the plaintiffs argued, as many as 355 candidates over eight administrations of the basic professional skills exam alone should have passed but were denied that opportunity simply because of faulty items that remained on the tests (Millman, 1986, p. 285). It should be noted here that these were items that even one of the state's expert witnesses for the defense admitted were faulty (Millman, 1986, p. 280). Establishing that there were flawed items with negative point-biserial correlations was critical to the plaintiffs' case. The plaintiffs presented as evidence page after page of so-called "failure tables" (Note 6) with the names of candidates for each test whose answers were mis-scored on these faulty items. Based upon these failure tables, any argument from defendants that the mis-keyed items did not change the career expectations for some candidates would most likely have failed. In the face of this evidence, the defendants argued at trial that "...the real disagreement is between two different testing philosophies. One of these philosophies would require virtual perfection under its proponents' rigid definition of that word. The other looks at testing as a constantly developing art in which professional judgment ultimately determines what is appropriate in a particular case" (Allen, Defendant's Pre-trial Memorandum, 1986, pp. 121-22). Plaintiffs counter-argued: "This case…is not a philosophical case at all. This case is a case on professional competence….this was an incompetent job, unprofessional, and as I said before, sloppy and shoddy, and in the case of the miskeyed items, unethical" (Madaus, 1986, p. 185). Judge Thompson, in the subsequent Richardson decision, which also involved the AITCTP, specifically agreed with plaintiffs on this point (Richardson, 1989, pp. 821, 823, 825). Excellent reviews of the diametrically opposed plaintiff and defendant positions may be found in Walden & Deaton (1988) and Madaus (1990). At the same time that this case was proceeding, the plaintiffs' appeal to reverse the vacating of the original settlement was granted prior to a decision in this trial (Allen, Feb. 5, 1986, p. 75). The U.S. Court of Appeals decided the district court should have enforced the consent decree (Allen, April 22, 1987)—which the district court so ordered on May 14, 1987 (Allen, May 14, 1987). Although the decision to uphold the original settlement was a positive ruling for the plaintiffs, it also was somewhat counter-productive for them because it was unexpectedly beneficial to NES at this stage in the proceedings. That is because the evidence presented above in Allen v Alabama was critical of the state and NES (NES was explicitly referred to in the court documents). Thus, NES's best hope for avoiding a written opinion critical of their test development procedures was if plaintiffs' appeal were to be upheld and the original settlement enforced, as it was. Then there would be no evidentiary record, no court ruling, and no legal opinion that would reflect badly upon the NES procedures.
Richardson v Lamar County Board of Education (87-T-568-N) commenced, however, and the actions of NES and the Alabama State Board of Education were openly discussed and critiqued in the court's opinion of November 30, 1989 (though NES was not mentioned by name in the Richardson, 1989 decision).

Richardson v Lamar County Board of Education, et al.

Like Allen v Alabama, Richardson v Lamar County also addressed issues of the "racially disparate impact" of the AITCTP (Richardson, 1989, p. 808). The Honorable Myron H. Thompson again presided, and testimony from Allen v Alabama was admitted as evidence (Richardson, 1989). Although the defendants denied in the Allen v Alabama consent decree that the AITCTP tests were psychometrically invalid, and even though no decision was reached in the abbreviated Allen v Alabama trial, the State Board of Education did not attempt to defend the validity of the tests in Richardson v Lamar and, "in fact, it conceded at trial that plaintiff need not relitigate the issue of test validity" (Richardson v Alabama State Board of Education, 1991, pp. 1240, 1246). Judge Thompson's position on the test development process of NES was clearly stated: "In order to fully appreciate the invalidity of the two challenged examinations, one must understand just how bankrupt the overall methodology used by the State Board and the test developer was" (Richardson, 1989, p. 825, n. 37). While sensitive to the fact that "close scrutiny of any testing program of this magnitude will inevitably reveal numerous errors," the court concluded that these errors were not "of equal footing" and "the error rate per examination was simply too high" (Richardson, 1989, pp. 822-24). Thus, none of the examinations that comprised the certification test possessed content validity because of five major errors by the test developer, and the test developer had made six major errors in establishing cut scores (Richardson, 1989, pp. 821-25).

Case Outcomes in Alabama

The Allen v Alabama consent decree required Alabama to pay $500,000 in liquidated damages and issue permanent teaching certificates to a large portion of the plaintiff class (Allen, Consent Decree, Oct. 25, 1985, pp. 9-11). The decree also provided for a new teacher certification process. However, no new test was developed or implemented, and the Alabama State Board of Education suspended the teacher certification testing program on July 12, 1988. In 1995 the Alabama State Legislature enacted a law requiring that teacher candidates pass an examination as a condition for graduation. Subsequently, another trial was held February 23, 1996 to decide the state's motions to modify or vacate the 1985 consent decree (Allen, 1997, p. 1414). Those motions were denied on September 8, 1997 (Allen, Sept. 8, 1997). Given the rigorous test development and monitoring conditions of the Amended Consent Decree, it was estimated by the court that the State of Alabama would not gain complete control of its teacher testing program "until the year 2015" (Allen, Jan. 5, 2000, p. 23). Only recently has a testing company stepped forward with a proposal for a new Alabama teacher certification test (Rawls, 2000). Plaintiff Richardson was awarded re-employment, backpay, and various other employment benefits (Richardson, 1989, pp. 825-26). Defendants (the State of Alabama and its agencies) in both cases were ordered to pay court costs and attorney fees (Richardson, 1989, pp. 825-26).
However, even though NES was responsible for the development of the tests, NES was not named as one of the defendants in these cases and was not held liable for any damages (Note 7).

Psychometric and Statistical Background

At this point it is appropriate to discuss some of the psychometric concepts and statistics that are fundamental to any question about test quality. The purpose of this discussion is to illustrate that excruciatingly complex analyses are not necessarily required in order to reveal flaws in a test or individual test items. The first steps in test development simply involve common sense practice combined with sound statistical interpretations. If those first steps are flawed, then no complex psychometric analysis will provide a remedy for the mistakes. One of the simplest statistics reported in the reliability analysis of a test like the MECT is the "item-test point-biserial correlation." This statistic goes by other names such as the "item-total correlation" and the "item discrimination index." It is called the point-biserial correlation specifically because it represents the relationship between a truly dichotomous variable (i.e., an item scored as either right or wrong) and a continuous variable (i.e., the total test score for a person). A total test score, here, is the simple sum of the number of correctly answered items on a test. The biserial correlation has a long history of statistical use (Pearson, 1909). One of its earliest measurement uses was as an item-level index of validity (Thorndike, et al., 1929, p. 129). The "point"-biserial correlation appeared specifically for individual dichotomous items in an item analysis because of concerns over the assumptions implicit in the more general biserial correlation (Richardson & Stalnaker, 1933). It was again used as a validity index. It subsequently came to acquire diagnostic value and was re-labeled as a discrimination index (Guilford, 1936, p. 426). The purpose of this statistic is to determine the extent to which an individual item contributes useful information to a total test score. Useful information may be defined as the extent to which variation in the total test scores has spread examinees across a continuum of low scoring persons to high scoring persons. In the present situation, this refers to the extent to which well qualified candidates can be distinguished from less capable candidates. Generally, the greater the variation in the test scores, the greater the magnitude of a reliability estimate. Reliability may be defined many ways through the body of definitions and assumptions known as Classical Test Theory or CTT (Lord & Novick, 1968). According to CTT, an examinee's observed score (X) is assumed to consist of two independent components, a true score component (T) and an error component (E). One relevant definition of reliability may be expressed as the ratio of true-score variance to observed-score variance. Thus, the closer the ratio is to 1.0, the greater the proportion of observed-score variance that is attributed to true-score variance. The KR-20 reliability estimate is often reported for achievement tests (Kuder & Richardson, 1937, Eq. 20, p. 158). Although reliability as defined above is necessarily positive, the KR-20 can be negative under certain extraordinary conditions (Dressel, 1940) but typically ranges from 0 to +1. Nevertheless, the higher the value, the more "internally consistent" the items on a test.
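As a compact reference for these definitions, the observed-score decomposition, the reliability ratio, and the KR-20 estimate can be written as follows. These are the standard classical test theory expressions, stated here for the reader's convenience rather than quoted from the NES report.

```latex
% Observed score as true score plus independent error, the resulting variance
% decomposition, and reliability defined as a ratio of variances:
X = T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}

% KR-20 for a test of k dichotomous items, where p_i is the proportion of
% examinees answering item i correctly and q_i = 1 - p_i:
\mathrm{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right)
```

On this definition, a reliability of .70 means that 30% of observed-score variance is error variance, which is the reading applied in the Reliability discussion later in this article.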
The magnitude of the KR-20, however, is affected by the direction and magnitude of the point-biserial correlations. Specifically, total test score reliability is decreased by the inclusion of items with near-zero point-biserial correlations and is worsened further by the inclusion of items with negative point-biserial correlations. This is because each additional faulty item increases the error variance in the scores at a faster rate than the increase in true-score variance. Technically, the point-biserial correlation represents the magnitude and direction of the relationship between the set of incorrect (scored as "0") and correct (scored as "1") responses to an individual item and the set of total test scores for a given group of examinees. In other words, it is a variation of the common Pearson product-moment correlation (Lord & Novick, 1968, p. 341). It can range in magnitude from zero to 1.0. An estimate near zero is a poorly discriminating item that contributes no useful information. An estimate of +1 would indicate a perfectly discriminating item in the sense that no other items are necessary on the test for differentiating between high scoring and low scoring persons. A value of 1.0 is never attained in practice nor is it sought (Loevinger, 1954). Negative estimates are addressed below. Ideally the test item point-biserial correlation should be moderately positive. Although various authors differ on what precisely constitutes "moderately positive", a long-standing general rule of thumb among experts is that a correlation of .20 is the minimum to be considered satisfactory (Nunnally, 1967, p. 242; Donlon, 1984, p. 48) (Note 8). There is, however, no disagreement among psychometricians on the direction of the relationship—it has to be positive. The direction of the correlation is critical. A positive correlation means that examinees who got an item right also tended to score above the mean total test score and those who got the item wrong tended to score below the mean total test score. This is intuitively reasonable and is an intended psychometric outcome. Such an item is accepted as a good "discriminator" because it differentiates between high and low scoring examinees. This is one of the fundamental objectives of classical test theory, the theory underlying the development and use of the MECT. A negative point-biserial correlation, however, occurs when examinees who got an item correct tended to score below the mean total test score while those who got the item wrong tended to score above the mean total test score. This situation is contrary to all standard test practice and is not an intended psychometric outcome (Angoff, 1971, p. 27). A negative point-biserial correlation for an item can occur because of a variety of problems (Crocker & Algina, 1986). These include:
1. chance response patterns due to a very small sample of people having been tested,
2. no correct answers to an item,
3. multiple correct answers to an item,
4. the item was written in such a way that "high ability" persons read more into the item than was intended and thus chose an unintended distracter while the "low ability" people were not distracted by a subtlety in the item and answered it as intended,
5. the item had nothing to do with the topic being tested, or
6. the item was mis-keyed, that is, a wrong answer was mistakenly keyed as the correct one on the scoring key.
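To make these statistics concrete, the following is a minimal item-analysis sketch in Python. It is not code from NES or the Technical Report: the 0/1 response matrix is invented for illustration, and it also computes the "corrected" point-biserial and KR-20 estimate discussed in the paragraphs that follow.

```python
# A minimal item-analysis sketch with invented data (6 examinees x 6 items);
# rows are examinees, columns are items, 1 = correct, 0 = incorrect.
# The sixth item is deliberately "mis-keyed" so that low scorers get it right.
import numpy as np

responses = np.array([
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
])
n_persons, n_items = responses.shape
totals = responses.sum(axis=1)            # total test score per examinee

for j in range(n_items):
    item = responses[:, j]
    # Uncorrected point-biserial: the item is part of the total it is correlated with.
    r_pb = np.corrcoef(item, totals)[0, 1]
    # Corrected point-biserial: the item's own contribution is removed from the total.
    r_corrected = np.corrcoef(item, totals - item)[0, 1]
    print(f"item {j + 1}: uncorrected r = {r_pb:+.2f}, corrected r = {r_corrected:+.2f}")

# KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)
p = responses.mean(axis=0)                # proportion answering each item correctly
kr20 = (n_items / (n_items - 1)) * (1 - (p * (1 - p)).sum() / totals.var())
print(f"KR-20 = {kr20:.2f}")              # low here, dragged down by the mis-keyed item
```

With these invented data the sixth item correlates negatively with the total score, because lower-scoring examinees tend to be the ones credited with it; that is exactly the pattern, described above, that should trigger review of an item.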
When an item yields a negative point-biserial correlation, the test developer is obligated to remove the item from the test so that it does not enter into the total test score calculations. In fact, the typical commercial testing situation is one where the test contractor administers the test in at least one field trial, discovers problematic items, either fixes the problems or discards the items entirely, and then readministers the test prior to making the test fully operational. The presence of a flawed item on a high-stakes examination can never be defended psychometrically. One additional point must be made. The point-biserial correlation can be computed two ways. The first way is to correlate the set of 0/1 (incorrect/correct) responses with the total scores as described above. In this way of computing the statistic, the item for which the correlation is being computed contributes variance to the total score, hence the correlation is necessarily magnified. That is, the statistical estimate of the extent to which an item is internally consistent with the other items "tends to be inflated" (Guilford, 1954, p. 439). The second way in which the correlation may be computed is to compute it between the 0/1 responses on an item and the total scores for everyone but with the responses to that particular item removed from the total score (Henrysson, 1963). This is called the "corrected point-biserial correlation." It is a more accurate estimate of the extent to which an individual item is correlated to all the other items. It is easily calculated and reported by most statistical software packages used to perform reliability analyses (e.g., SPSS's Reliability procedure). Various concerns have been raised over the interpretation of the point-biserial correlation because the magnitude of the coefficient is affected by the difficulty of the item. The fact is, however, that all the various discrimination indices are highly positively correlated (Nunnally, 1967; Crocker & Algina, 1986). Furthermore, even though the magnitude of the point-biserial correlation tends to be less than the biserial correlation, all writers agree on the interpretation of negative discriminations. "No test item, regardless of its intended purpose, is useful if it yields a negative discrimination index" (Ebel & Frisbie, 1991, p. 237). Such an item "lowers test reliability and, no doubt, validity as well" (Hopkins, 1998, p. 261). Furthermore, "on subsequent versions of the test, these items [with negative point-biserial correlations] should be revised or eliminated" (Hopkins, 1998, p. 259).

NES AND THE MECT

The 1999 MECT Technical Report

In July 1999 NES released their five-volume Technical Report on the Massachusetts Educator Certification Tests. Volume I describes the test design, item development description, and psychometric results. Volume II describes the subject matter knowledge and test objectives. Volume III consists of "correlation matrices by test field." Volume IV consists of various content validation materials and reports. Volume V consists of pilot material, bias review material, and qualifying score material. The report was immediately hailed by Massachusetts Commissioner of Education David P. Driscoll: "I have said all along that I stand by the reliability and validity of the tests, and this report supports it" (Massachusetts Department of Education, 1999).
Field Trial

Technical Report Volume I contains the psychometric results for the first four administrations of the MECT (April, July, and October 1998, and January 1999). It does not, however, contain any results from a full-scale field trial, nor are any "pilot" test results reported (Note 9). There is no information on how many different items were tested, where the items came from, how many items were revised or rejected, what the revisions were to any revised items, or what the psychometric item-level results were. In fact, there is no field trial evidence in support of the initial inclusion of any of the individual items on the operational exams because there was no field trial. Interestingly, the Department of Education released a brochure in January 1998 stating that the first two test administrations would not count for certification—implying that the tests would serve as a field trial. Chairman of the Board of Education John Silber, however, declared in March 1998 that the public had been misinformed and that the first two tests would indeed count for certification. This policy reversal was unfortunate because of the confusion and anxiety it created among the first group of examinees and because it prevented the gathering of statistical results that could have improved the quality of the test. NES had considered a field trial of their teacher test in Alabama but did not conduct one and assumedly came to regret that decision. In Allen v Alabama they argued, "As the evidence will show, there was no need to conduct a separate large-scale field tryout in this case, since the first test administration served that purpose" (Allen, Defendants' Pre-Trial Memorandum, 1986, p. 113). That decision was unwise because it directly affected the implementation and validity of their procedures. For example, "The court has no doubt that, after the results from the first administration of those 35 examinations were tallied, the test developer knew that its cut-score procedures had failed" (Richardson, 1989, p. 823). In fact, the original settlement in Allen v Alabama stipulated that in any new operational examination, the items "shall be field tested using a large scale field test" (Allen, Consent Decree, Oct. 25, 1985, p. 3). The first two administrations of the MECT would have served an important purpose as a full-scale field trial for the new tests, thus avoiding the mistake made in Alabama. However, that opportunity to detect and correct problems in administration, scoring, and interpretation was lost. The impact of the lack of a field trial is further magnified when it is noted that the time period between when NES was awarded the Massachusetts contract (October 1997) and when the first tests were administered (April 1998) was even shorter than the time period NES had to develop the tests in Alabama—a time frame that the court referred to as "quite short" (Richardson, 1989, p. 817). Furthermore, even though NES may have drawn many of the MECT items from existing test item banks, items written and used elsewhere still must be field tested on each new population of teacher candidates.

Point-biserial correlations

In the NES Technical Report Volume I, Chapter 8, p. 140, there is a description of when an item is flagged for further scrutiny. One of the conditions is when an item displays an "item-to-test point-biserial correlation less than 0.10 (if the percent of examinees who selected the correct response is less than 50)".
After such an item is found, "The accuracy of each flagged item is reverified before examinees are scored." The Technical Report, however, does not report or provide the percent of persons who selected the correct response on each item. Nor is there an explanation of what the reverification process consisted of, nor of how many items were flagged, nor what was subsequently modified on flagged items. Thus, there is no way to determine the extent to which NES actually followed its own stated guidelines and procedures in the development of the MECT. The relevance of what NES states as their review procedures and what they actually performed is that in Alabama, under the topic of content validity, it was argued by the defense that items rated as "content invalid" were revised by NES and that these "revisions were approved by Alabama panelists before they appeared on a test." The court, however, found that "no such process occurred" (Richardson, 1989, p. 822). The following table summarizes the point-biserial estimates reported for the MECT. Note that these are not the results prior to NES conducting the item review process. These are the results for the "scorable items" after the NES review.

Table 1
Problematic Point-Biserial Correlations from the 1999 MECT Technical Report

By administration date:

Date     Number tested   N of M/C items   Items with point-biserials <= .20             % of total items
                                           <.00   .00-.05   .06-.10   .11-.15   .16-.20
Apr-98    4,891            315              1       7        15        24        46     29.5%
Jul-98    5,716            443              0       2        14        17        39     16.3%
Oct-98    5,286            379              2       5        10        15        32     16.9%
Jan-99    9,471            507              1       4        14        35        49     20.3%
Total    25,364          1,644              4      18        53        91       166     332/1,644 = 20.2%

By test:

Test              Number tested   N of M/C items   <.00   .00-.05   .06-.10   .11-.15   .16-.20   % of total items
Writing            9,750             92              0       0         0         1         1        2.2%
Reading            9,455            144              0       0         1         1         6        5.6%
Early Childhood      936            256              0       3        18        30        46       37.9%
Elementary         3,125            256              0       2         0         3        27       12.5%
Social Studies       259            128              1       0         1         6        14       17.2%
History              108             64              0       0         2         6         5       20.3%
English              695            256              0       3        11        12        29       21.5%
Mathematics          345            192              1       0         4         4         7        8.3%
Special Needs        691            256              2      10        16        28        31       34.0%
Total                             1,644              4      18        53        91       166

Source: Massachusetts Educator Certification Tests: Technical Report, 1999

A number of observations may be made from the information in this table. First, of the 1,644 total number of items administered over the first four dates, 332 items (20.19%) had point-biserial correlations that are lower than the industry minimum standard criterion of .20. That is a huge percent of poorly performing items for a high-stakes examination. Second, while there are relatively few suspect items on the Reading and Writing tests, there are large numbers of items with poor statistics on many of the subject matter tests. The Early Childhood, English, and Special Needs tests, in particular, consisted of extraordinarily large percentages of poorly performing items (37.9%, 21.5%, and 34%, respectively). Overall, of the 332 items with low point-biserials, 322 (97%) occurred on the subject matter tests. On the face of it, the results for the subject matter tests are terrible. There is, unfortunately, no authoritative source in the literature (including the Standards) that tells us unequivocally whether or not this overall 20.19% of poorly performing items on a licensure examination with high-stakes consequences is acceptable, not acceptable, or even terrible. Given the steps that NES claims were followed in selecting items from existing item banks and in writing new items, there simply should not be this many technically poor items on these tests.
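As a small illustration of the flagging rule quoted above and of the Table 1 totals, the following sketch applies the stated condition and recomputes the overall percentage. The function name and the example values (42% correct, point-biserial of .04) are hypothetical; only the thresholds and the counts come from the quoted rule and the reported table.

```python
# Sketch of the flagging condition quoted from the Technical Report:
# flag an item whose point-biserial is below 0.10 when fewer than 50% of
# examinees answer it correctly.
def flag_item(point_biserial: float, percent_correct: float) -> bool:
    return point_biserial < 0.10 and percent_correct < 50.0

print(flag_item(0.04, 42.0))   # True: such an item should be reverified before scoring

# Check of the Table 1 totals: items with point-biserials at or below .20,
# summed over the five reported ranges, as a share of all scorable items.
low = 4 + 18 + 53 + 91 + 166   # <.00, .00-.05, .06-.10, .11-.15, .16-.20
print(f"{low} of 1,644 items = {100 * low / 1644:.1f}%")   # about 20.2%
```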
Reliability

In Volume I, Chapter 9, p. 188 of the Technical Report, the following statement appears: "It is further generally agreed that reliability estimates lower than .70 may call for the exercise of considerable caution." The practical significance of this statement lies in the fact that when reliability is less than .70, it means that at least 30% of the variance in an examinee's test score is attributable to something other than the subject matter that is being tested. In other words, an examinee's test score consists of less than 70% true-score variance and more than 30% error variance. This ratio of true-score variance to error variance is not desirable in high-stakes examinations (Haney, et al., 1999). Nearly 40 years ago, Nunnally went so far as to describe as "frightening" the extent to which measurement error is present in high-stakes examinations even with reliability estimates of .90 (1967, p. 226). NES, however, suggests that their reported item statistics and reliability estimates should not greatly influence one's judgment about the overall quality of the tests because the multiple-choice items make up only part of the exam format (NES, 1999, p. 189). The problem with that argument, as noted by Judge Thompson in Richardson (1989, pp. 824-25), is that small errors do accumulate and can invalidate the use for which the test was developed. This issue of simply dismissing troubling statistics as inconsequential is particularly ironic when the MECT has been described by the non-profit Education Trust as "the best [teacher test] in the country" (Daley, Vigue & Zernike, 1999). The Special Needs test deserves closer attention because it had problems at each reported administration.
1. The sample sizes for the tests were 131, 206, 154, and 200, respectively. Based on NES's own criteria (NES, 1999, p. 187), these sample sizes are sufficient for the generation of statistical estimates that would be relatively unaffected by sampling error.
2. The KR-20 reliability coefficients for the four administrations were .67, .76, .76, and .74, respectively. These are minimally tolerable for the last three administrations. The reliability is not acceptable, however, for the first administration. This means that people were denied certification in Special Needs based on their performance on a test that was deficient even by NES's own guidelines.
3. For the April 1998 administration eleven Special Needs items had point-biserials of .10 or less (again, one of NES's stated criteria for "flagging" an item). For the July 1998 administration it was five items, for October 1998 it was four items, and for January 1999 it was eight items. In fact, in two of the administrations there was an item with a negative point-biserial. (Given the previous discussion about the way the point-biserials were likely to have been calculated (uncorrected), the frequency of negative point-biserials would likely increase if the corrected coefficients had been reported.)
Given that there is no specific information about flagging, deleting or replacing items, it is possible that these same faulty items were, and continue to be, carried over from one administration to the next.

The Linkage between Alabama and Massachusetts: A modus operandi

At this point the reasonable reader might ask why I am expending so much effort upon what appears to be a relatively minor problem—some items had negative point-biserial correlations.
NES, for example, would likely call this analysis "item-bashing", as this type of analysis was referred to in Alabama. The significance of these findings lies in the apparent connection between NES's work in Alabama and their present work on the MECT in Massachusetts. In Alabama, defendants claimed that "[b]efore any item was allowed to contribute to a candidate's score, and before the final 100 scorable items were selected, the item statistics for all the items of the test were reviewed and any items identified as questionable were checked for content and a decision was made about each such item" (Allen, Defendants' Pre-Trial Memorandum, 1986, pp. 113-14). In fact, in Alabama there were negative point-biserial correlations in the original reliability reports generated by NES (their own documents reported negative point-biserial correlations as large as -0.70), and those negative point-biserial correlations for the same scorable items remained after multiple administrations of the examinations. Simply taking out the worst 20 items in each test did not remove all the faulty items since each exam had to have 100 scorable items. As seen above in Table 1, the MECT has statistically flawed items on many tests, these items have been there since the first administration, and they may be the same items still being used in current administrations. In Alabama, the negative point-biserial correlations led to the discovery of items for which there was no correct answer. Also discovered were items for which there were multiple correct answers, and there were items for objectives that had been rated as "not job related." Additionally, items were found to have been mis-keyed on the item analysis scoring forms. Furthermore, those flawed items existed unchanged for the first eight administrations of the tests. They were not revised, deleted, or changed to "experimental" non-scorable status until the ninth administration--one month after the plaintiffs' team agreed to take the case. Defendants argued that "problems with the testing instrument—such as mis-keyed answers" were simply one component of many that is taken into account by the "error of measurement" (Allen, Defendants' Pre-Trial Memorandum, 1986, pp. 108-113). (Note 10) As noted earlier, poor item statistics may result for many reasons. Of those reasons the only acceptable one is that they may be due to sampling error (chance). That explanation is unlikely with respect to the MECT, however, because the sample sizes are sufficiently large, and the pattern of faulty item statistics persists over time. The extent to which flawed items may exist in the Massachusetts tests can only be determined by release of the student-level item response data and the content of the actual items, something that has not been done to date. Furthermore, such a release of additional technical information, or item response data, or item content is highly unlikely. (Note 11) In Alabama, the statistical results and in-house documents were not produced by NES until the plaintiffs seriously discussed contempt of court actions against NES personnel. Consequently, there is little reason to expect that NES will voluntarily release MECT data or results not explicitly covered in their original confidential contract. In Alabama there were no independent testing experts appointed or contracted to monitor the test developer's work. This fact led the court to conclude that "The developer's work product was accepted by the state largely on the basis of faith" (Richardson, 1989, p. 817).
In Massachusetts the original MECT contract called for the contractor to recommend a technical review committee of nationally recognized experts who were external to their organization (MDOE, 1997, Task 2.14.i, p. 11). The committee was to review the test items, test administration, and scoring procedures for validity and reliability and was to report its findings to the Department of Education. NES did not form such an independent technical advisory committee for the MECT, nor has a formal independent review of the MECT been undertaken by anyone else. It is not in the short-term business interests of a testing company to conduct disconfirming studies on the technical quality of their commercial product. The MECT is, of course, a product that NES markets as an example of what they can build for other states that might be interested in certification examinations. It is, however, in the best interests of a state for such studies to be conducted. For example, the Commonwealth of Massachusetts has a statutory responsibility to "protect the health, safety and welfare of citizens" who seek services from licensed professionals (NES, 1999, p. 16). In the present situation "citizens" are defined by the Board of Education as "the children in our schools" (MDOE, Special Meeting Minutes, 1998). What has apparently been lost in all of this is the fact that prospective educators are "citizens" and deserve protection too--protection from a faulty product that can damage the profession of teaching and can alter drastically the career paths of individuals. Educators and the public at large deserve the highest quality certification examinations that the industry is capable of providing. There is ample evidence that the MECT may not be such an examination.

Conclusion

A technical review of the psychometric characteristics of the MECT has been called for in this journal (Haney et al., 1999; Wainer, 1999). The year 2000 and 2001 budgets passed by the Legislature of the Commonwealth also called for such an independent audit of the MECT. Those budget provisions, however, were vetoed by Governor Cellucci, and the legislature failed to override the vetoes. Until an independent review committee with full investigative authority is convened by the Commonwealth, the only technical material publicly available for independent analysis is the 1999 MECT Technical Report generated by NES (NES, 1999). (Note 12) One of the important points made by Haney et al. (1999) was that the Massachusetts Department of Education is not the appropriate agency for conducting such a review. Part of my point here is that the only review of the MECT the Commonwealth may ever see is the one prepared by NES of its own test. Such a review clearly raises a concern over conflict-of-interest (Madaus, 1990; Downing & Haladyna, 1996). Given the national interest in "higher standards" for achievement and assessment, it must be recognized that there are no "gold" standards by which a testing program such as the MECT can be evaluated (Haney & Madaus, 1990; Haney, 1996). This is ironic given how technically sophisticated the testing profession has become. Consequently, without "gold" standards to define test development practice, there are no legislated penalties for faulty products (tests) and there is no enforced protection for the public. Testing companies may lose business if the details of shoddy practice are made known, and the public may appeal to the judicial system for damages.
But the opportunity for a test taker simply to raise a question about a test that can shape his or her career and to have that question taken seriously by an impartial panel should be the right of every test-taking citizen. (Note 13) Contrary to former Chairman John Silber's statement to the Massachusetts Board of Education, “there is nothing wrong with this test” (Minutes of the Board, Nov. 11, 1998) and the statement by the chief of staff for the MDOE, Alan Safran, “[the test]does not show who will become a great teacher, but it does reliably and validly rule out those who would not” (Associated Press, 1998), there is ample evidence that there may be significant psychometric problems with the MECT. These problems, in turn, have significant practical ramifications for certification candidates and the institutions responsible for their training. Is the MECT sound enough to support assertions that the candidates are “idiots”? No. Is there evidence that poor performance may, in part, reflect a flawed test containing defective items? Yes. Should the Massachusetts Commissioner of Education independently follow through on the twice-rejected Senate bill to "select a panel of three experts from out-of-state from a list of nationally qualified experts in educational and employment testing, provided by the National Research Council of the National Academy of Sciences, to perform a study of the validity and reliability of the Massachusetts educator certification test as used in the certification of new teachers and as used in the elimination of certification approval of teacher preparation programs and institutions to endorse candidates for teacher certification?" (Massachusetts, 1999, Section 326. (S191K)). Absolutely. Should such a panel serve as a blueprint for the formation of a standing national organization for test review and consumer protection? Yes. As we enter the 21st century, high stakes tests are becoming increasingly powerful determinants of students' and teachers' lives and life chances. Title II of the 1998 Higher Education Act, in particular, has encouraged a kind of de facto national program of teacher testing. Given the extraordinarily high stakes of these tests, the personal and institutional consequences of poorly designed teacher tests have become too great simply to allow test developers to serve as their own (and lone) quality control and their own (and often non-existent) dispute resolution boards. Now is the time for the community of professional educators and psychometricians to take a stand and demand that test developers be held accountable for their products in the test marketplace. What this would require at the very least are (1) a mechanism for an independent external audit of the technical characteristics of any test used for high stakes decisions, and (2) a mechanism for the resolution of disputed scores, results, and cases. Only then will taxpayers, educators, and test candidates have confidence that teacher tests are actually providing the information intended by legislative actions to raise educational standards and enhance teacher quality. Title II legislation certainly did not cause the high stakes test Juggernaut that is rolling through all aspects of educational reform in the U.S. and elsewhere. With mandatory teacher test reporting now tied to federal funding, however, Title II legislation certainly has added to the size, weight, and power of the test Juggernaut and strengthened its hold on reform. 
For this reason, federal policy makers are now responsible for providing legislative assurances that the public will be protected from the shoddy craftsmanship of some tests and some testing companies and that there will be remedies in place to right the mistakes that result from negligence. This article ends with a call to action. Policy makers must now incorporate, into the federal legislation that requires state teacher test reporting, new concomitant requirements for the establishment of independent audits and dispute resolution boards.

Notes

I wish to thank Marilyn Cochran-Smith, Walt Haney, Joseph Herlihy, Craig Kowalski, George Madaus, and Diana Pullin for their advice and editorial comments.

1. The class consisted of "all black persons who have been or will be denied any level teaching certificate because of their failure to pass the tests by the Alabama Initial Teacher Certification Testing Program." (Order On Pretrial Hearing, 1984).
2. This specific wording does not appear until the Amended Consent Decree of Jan. 5, 2000.
3. Among other things, conditions were set on the development of new tests, an independent monitoring and oversight panel was established, grade point averages were ordered to be considered in the certification process, and defendants would pay compensatory damages to the plaintiffs and plaintiffs' attorneys' fees and costs (Consent Decree, 1985).
4. That decision has been upheld numerous times since. The latest Amended Consent Decree was approved on January 5, 2000 (Allen, Jan. 5, 2000).
5. George Madaus, Joseph Pedulla, John Poggio, Lloyd Bond, Ayres D'Costa, Larry Ludlow.
6. "Failure tables" consisted of an applicant's name, their raw scores on the exams, the exam cut-scores, their actual responses to suspect items, and their recomputed raw scores if they should have been credited with a correct response to a suspect item. Examinees were identified in court who had failed an examination by one point (i.e., missed the cut-score by one item) but had actually responded correctly to a miskeyed item. For example, on the fifth administration of the Elementary Education exam there were six people who should have been scored correct on scorable item #43 (the so-called "carrot" item) but were not. Their total scores were 72. The cut-score was 73. These individuals should have passed the examination. There was even a candidate who took an exam multiple times and failed but who should have passed on each occasion.
7. The standard contract for test development will include some specification of indemnification. In the case of a state agency like the MDOE, the Request For Responses will typically specify protection for the state, holding the contractor responsible for damages (MDOE, 1997, V. (G), 1, p. 17). Contractors, understandably, are reluctant to enter into such an agreement and have been successful in striking this language from the contract.
8. The rationale is that .20 is the minimum correlation required to achieve statistical significance at alpha=.05 for a sample size of 100. This is because .20 is twice the standard error (based on a sample of 100) needed to differ significantly from a correlation of zero.
9. The difference between piloting test items, as NES did, and conducting a field trial is that the field trial simulates the actual operational test-taking conditions. Its value is that problems can be detected that are otherwise difficult to uncover.
For example, non-standardized testing conditions created numerous sources of measurement error on the first administration of the MECT (Haney et al., 1999).
10. This interpretation of measurement error goes considerably beyond conventional practice, where "Errors of measurement are generally viewed as random and unpredictable" (Standards, 1999, p. 26). A miskeyed answer key is not a random error. It is a mistake and its effect is felt greatest by those near the cut-score. Although false-positive passes may benefit from the mistake, it is the false-negative fails who suffer and, as a consequence, seek a legal remedy.
11. To date the MDOE has routinely ignored questions requesting technical information, e.g., how many items originally came from item banks, who developed the item banks, how many items have been replaced, what are the reliabilities of new items, what are the technical characteristics of the present tests, will the Technical Report be updated, what "disparate impact" analyses have been conducted?
12. From the start of testing to the present time individual IHEs have not been able to initiate any systematic analysis of their own student summary scores, let alone any statewide reliability and validity analyses. The primary reason for this paucity of within- and across-institution analysis is that NES only provides IHEs with student summary scores printed on paper—no electronic medium is provided for accessing and using one's own institutional data. Thus, each IHE faces the formidable task of hand-entering each set of scores for each student for each test date. This results in a unique and incompatible database for each of the Commonwealth's IHEs.
13. I assert that the right to question any aspect of a high-stakes examination should take precedence over the waiver required when one takes the MECT: "I waive rights to all further claims, specifically including, but not limited to, claims for negligence arising out of any acts or omissions of the Massachusetts Department of Education and the Contractor for the Massachusetts Educator Certification Tests (including their respective employees, agents, and contractors)" (MDOE, 2001, p. 28).

References

Angoff, W. (ed.). (1971). The College Board Admissions Testing Program: A Technical Report on Research and Development Activities Relating to the Scholastic Aptitude Test and Achievement Tests. NY: College Entrance Examination Board.
Allen v. Alabama State Board of Education, 612 F. Supp. 1046 (M.D. Ala. 1985).
Allen v. Alabama State Board of Education, 636 F. Supp. 64 (M.D. Ala. Feb. 5, 1986).
Allen v. Alabama State Board of Education, 816 F. 2d 575 (11th Cir. April 22, 1987).
Allen v. Alabama State Board of Education, 976 F. Supp. 1410 (M.D. Ala. Sept. 8, 1997).
Allen v. Alabama State Board of Education, 190 F.R.D. 602 (M.D. Ala. Jan. 5, 2000).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, D.C.: American Educational Research Association.
Associated Press Archives (October 4, 1998). State Administers Teacher Certification Test Amid Ongoing Complaints.
Baldus, D.C. & Cole, J.W.L. (1980). Statistical Proof of Discrimination. NY: McGraw-Hill.
Cochran-Smith, M. (in press). The outcomes question in teacher education. Teaching and Teacher Education.
Cochran-Smith, M. & Dudley-Marling, C. (in press). The flunk heard round the world. Teaching Education.
Consent Decree, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. Oct. 25, 1985).

Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. NY: Holt, Rinehart and Winston.

Daley, B. (1999). “Teacher exam authors put to the test.” Boston Globe, 10/7/98, B3.

Daley, B., Vigue, D.I. & Zernike, K. (1999). “Survey says Massachusetts Teacher Test is best in US.” Boston Globe, 6/22/99, B02.

Defendant's Pre-trial Memorandum, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. May 1, 1986).

Donlon, T. (Ed.). (1984). The College Board Technical Handbook for the Scholastic Aptitude Test and Achievement Tests. NY: College Entrance Examination Board.

Downing, S. & Haladyna, A. (1996). A model for evaluating high stakes testing programs: Why the fox should not guard the chicken coop. Educational Measurement: Issues and Practice, 15(1), 5-12.

Dressel, P.L. (1940). Some remarks on the Kuder-Richardson reliability coefficient. Psychometrika, 5, 305-310.

Ebel, R.L. & Frisbie, D.A. (1991). Essentials of Educational Measurement (5th ed.). NJ: Prentice Hall.

Guilford, J.P. (1936). Psychometric Methods (1st ed.). NY: McGraw-Hill.

Guilford, J.P. (1954). Psychometric Methods (2nd ed.). NY: McGraw-Hill.

Haney, W., & Madaus, G.F. (1990). Evolution of ethical and technical standards. In R.K. Hambleton & J.N. Zaal (Eds.), Advances in Educational and Psychological Testing (pp. 395-425).

Haney, W.M., Madaus, G.F. & Lyons, R. (1993). The Fractured Marketplace for Standardized Testing. Boston: Kluwer.

Haney, W. (1996). Standards, Schmandards: The need for bringing test standards to bear on assessment practice. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.

Haney, W., Fowler, C., Wheelock, A., Bebell, D. & Malec, N. (1999). Less truth than error?: An independent study of the Massachusetts Teacher Tests. Education Policy Analysis Archives, 7(4). Available online at http://epaa.asu.edu/epaa/v7n4/.

Henrysson, S. (1963). Correction for item-total correlations in item analysis. Psychometrika, 28, 211-218.

Hopkins, K.D. (1998). Educational and Psychological Measurement and Evaluation (8th ed.). Boston: Allyn and Bacon.

Kuder, G.F. & Richardson, M.W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.

Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51, 493-504.

Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Madaus, G. (May 19-20, 1986). Testimony in Allen v. Alabama (81-697-N).

Madaus, G. (1990). Legal and professional issues in teacher certification testing: A psychometric snark hunt. In J.V. Mitchell, S. Wise, & B. Plake (Eds.), Assessment of teaching: Purposes, practices, and implications for the profession (pp. 209-260). Hillsdale, NJ: Lawrence Erlbaum Associates.

Massachusetts. (1999). FY 2000-2001 Budget.

Massachusetts Department of Education. (February 24, 1997). Massachusetts Teacher Certification Tests of Communication and Literacy Skills and Subject Matter Knowledge: Request for Responses (RFR).

Massachusetts Department of Education. (July 1, 1998). Board of Education Special Meeting Minutes. http://www.doe.mass.edu/boe/minutes/98/min070198.html.

Massachusetts Department of Education. (July 27, 1999). Department of Education Press Release. http://www.doe.mass.edu/news/archive99/pr072799.html.

Massachusetts Department of Education. (November 28, 2000). Board of Education Regular Meeting Minutes. http://www.doe.mass.edu/boe/minutes/00/1128reg.pdf.
Massachusetts Department of Education. (February 16, 2001). Massachusetts Educator Certification Tests: Registration Bulletin. http://www.doe.mass.edu/teachertest/bulletin00/00bulletin.pdf.

Melnick, S. & Pullin, D. (1999, April). Teacher education & testing in Massachusetts: The issues, the facts, and conclusions for institutions of higher education. Boston: Association of Independent Colleges and Universities of Massachusetts.

Millman, J. (June 17, 1986). Testimony in Allen v. Alabama (81-697-N).

National Board on Educational Testing & Public Policy. (2000). Policy statement. Chestnut Hill, MA: Lynch School of Education, Boston College.

National Commission on Testing and Public Policy. (1990). From Gatekeeper to Gateway: Transforming Testing in America. Chestnut Hill, MA: Lynch School of Education, Boston College.

National Evaluation Systems. (1999). Massachusetts Educator Certification Tests Technical Report. Amherst, MA: National Evaluation Systems.

Nunnally, J. (1967). Psychometric Theory. NY: McGraw-Hill.

Order On Pretrial Hearing, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. Dec. 19, 1984).

Pearson, K. (1909). On a new method of determining correlation between a measured character A and a character B, of which only the percentage of cases wherein B exceeds or falls short of a given intensity is recorded for each grade of A. Biometrika, Vol. VII.

Pressley, D.S. (1998). “Dumb struck: Finneran slams 'idiots' who failed teacher tests.” Boston Herald, 6/26/98, pp. 1, 28.

Rawls, P. (2000). “ACT may design test for Alabama's future teachers.” The Associated Press, 7/11/00.

Richardson v. Lamar County Board of Education, 729 F. Supp. 806 (M.D. Ala. 1989), aff'd, 935 F. 2d 1240 (11th Cir. 1991).

Richardson, M.W. & Stalnaker, J.M. (1933). A note on the use of bi-serial r in test research. Journal of General Psychology, 8, 463-465.

Thorndike, E.L., Bregman, E.O., Cobb, M.V., Woodyard, E., et al. (1929). The Measurement of Intelligence. NY: Teachers College, Columbia University.

U.S. Department of Education, National Center for Education Statistics. (2000). Reference and Reporting Guide for Preparing State and Institutional Reports on the Quality of Teacher Preparation: Title II, Higher Education Act (NCES 2000-089). Washington, DC.

Wainer, H. (1999). Some comments on the Ad Hoc Committee's critique of the Massachusetts Teacher Tests. Education Policy Analysis Archives, 7(5). Available online at http://epaa.asu.edu/epaa/v7n5.html.

Walden, J.C. & Deaton, W.L. (1988). Alabama's teacher certification test fails. 42 Ed. Law Rep. 1.

About the Author

Larry H. Ludlow
Associate Professor
Boston College
Lynch School of Education
Educational Research, Measurement, and Evaluation Department
Email: [email protected]

Larry Ludlow is an Associate Professor in the Lynch School of Education at Boston College. He teaches courses in research methods, statistics, and psychometrics. His research interests include teacher testing, faculty evaluations, applied psychometrics, and the history of statistics.

Copyright 2001 by the Education Policy Analysis Archives. The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu. General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, [email protected], or reach him at College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644).