
Evaluating Teaching and Teachers

Chapter 20 from APA Handbook of Testing and Assessment in Psychology: Vol. 3. Testing and Assessment in School Psychology and Education, K. F. Geisinger (Editor-in-Chief)

CHAPTER 20

EVALUATING TEACHING AND TEACHERS

Drew H. Gitomer and Courtney A. Bell

The authors would like to thank Andrew Croft, Daniel Eignor, Laura Goe, Heather Hill, Daniel McCaffrey, and Joan Snowden for their careful review of the manuscript. A special thank you to Andrew Croft and Evelyn Fisch for their assistance in preparing the manuscript. Each of the authors contributed equally to the preparation of this chapter.

DOI: 10.1037/14049-020. APA Handbook of Testing and Assessment in Psychology: Vol. 3. Testing and Assessment in School Psychology and Education, K. F. Geisinger (Editor-in-Chief). Copyright © 2013 by the American Psychological Association. All rights reserved.

Almost everything related to the assessment and evaluation of teaching in the United States is undergoing restructuring. Purposes and uses, data sources, analytic methods, assessment contexts, and policy are all being developed, refined, and reconsidered within a cauldron of research, development, and policy activity. For example, the District of Columbia made headlines when it announced the firing of 241 teachers based, in part, on poor performance results from their new evaluation system, IMPACT (Turque, 2010). The Bill and Melinda Gates Foundation has funded the Measures of Effective Teaching (MET) study, a $45 million study designed to test the ways in which a range of measures, including scores on observation protocols, student engagement data, and value-added test scores, might be combined into a single teaching evaluation metric (Bill and Melinda Gates Foundation, 2011a). The Foundation is also spending $290 million in four communities in intensive partnerships to reform how teachers are recruited, developed, rewarded, and retained (Bill and Melinda Gates Foundation, 2011b). In addition to pressure from districts and private funders, unions have also pressed for revised standards of teacher evaluation (e.g., American Federation of Teachers [AFT], 2010). Perhaps the most consequential contemporary effort is the federally funded Race to the Top Fund, which encourages states, in order to achieve funding, to implement teacher evaluation systems based on multiple measures with a significant component based on students' academic growth (U.S. Department of Education, 2010). These and other recent research and policy developments are changing the way the assessment of teaching is understood. The goal of this chapter is to provide an overview and structure to facilitate readers' understanding of the emerging landscape and attendant assessment issues.

As well described in a number of recent reports, current evaluation processes suffer from a number of problems (Toch & Rothman, 2008; Weisberg, Sexton, Mulhern, & Keeling, 2009). For example, The New Teacher Project surveyed evaluation practices in several districts, large and small, and found that teachers were almost all rated highly. In systems that used binary ratings (i.e., satisfactory or unsatisfactory), almost 99% of teachers were rated satisfactory. To complicate matters, the same administrators who gave all teachers high marks also recognized that staff members varied greatly in performance and that some were actually poor teachers. In addition to an inability to sort teachers, current processes generally do not give teachers useful information to improve their practice, and policy makers do not find the evaluation process credible (Weisberg et al., 2009).

Measures of teaching should be seen from a validity perspective, and thus it is critical to begin with the purpose and use of the assessment. As Messick (1989) argued, validity is not an inherent property of an instrument; rather, it is an evaluation of the inferences and actions made in light of a set of intended purposes. Given the extraordinary and unprecedented focus on evaluating teacher quality, this chapter is focused on measures being used to make inferences about the quality of practicing teachers and, to a lesser degree, the inferences made about prospective teachers who are undergoing professional preparation. These measures are examined through the perspective of modern validity frameworks used to consider the quality of assessments more generally. Building on M. T. Kane's (2006) thinking, the strength of the validity evidence is considered while paying careful attention to the purposes of various teaching evaluation instruments.

In considering the validity of inferences made about teacher quality, the focus is on three issues that may be at the forefront of discussions about teacher evaluation for the foreseeable future. The first issue concerns the validity argument for particular instruments. Guided by M. T. Kane (2006), Messick (1989), and the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999), the respective validity arguments for a range of measures being used to evaluate teachers are summarized briefly. The second issue concerns an often underresearched aspect of any validity argument—causal attribution of scores to particular teachers. Observing teaching or assessing student learning provides a set of observables that produce a score or set of scores. Policy makers and many researchers are seeking ways to establish a causal relationship that attributes these scores to an individual teacher. Yet the nature of teaching and the context in which it occurs raises questions about the extent to which causal attribution can be made. To date, issues of causal attribution have not been adequately dealt with across instruments and processes used to measure teaching. The final issue concerns the consideration of multiple measures in an evaluation. Although the most recent Standards for Educational and Psychological Testing (AERA et al., 1999) discussed the use of composite measures in the context of psychological assessments within clinical contexts, current validity research does not address how scores from multiple measures might be combined or considered jointly in the evaluation of teachers. The validity argument for inferences based on multiple measures introduces an additional layer of complexity because support is needed for the composite inference and not simply for inferences based on individual measures. As almost all current teacher evaluation schemes are contemplating some use of multiple measures, more specific guidance is needed.

TEACHER OR TEACHING QUALITY?

The current policy press is to develop measures that allow for inferences about teacher effectiveness. Using particular measures, the goal is to be able to make some type of claim about the qualities of a teacher. Yet, to varying degrees, the measures we examine do not tell us only about the teacher. A broad range of contextual factors also contributes to the evidence of teaching quality, which is more directly observable. To illustrate why context affects the validity of what inferences can be made from the observation of a performance, consider a scenario from medicine. Assume that under the same conditions, two surgeons would operate using the same processes and their respective patients would have the same outcomes. But, as described in the following example, such simplifying assumptions that conditions are invariant often do not hold.

Imagine Two Surgeons

We would like to evaluate them on the quality of their surgical skills using multiple measures. We will use the size of the scar, the rate of infection, the quality of pain management, and patient satisfaction as our measures of the quality of their surgical skills. One is in Miami, Florida, the other in Moshe, Tanzania. Both must remove a benign tumor from a 53-year-old man's abdomen. The surgeon in Miami has a #10 blade steel scalpel that is designed for cutting muscle and skin. The surgeon in Moshe has a well-sharpened utility knife that is used for a range of surgical purposes. The excision in Miami will occur in a sterile operating room with no external windows, fans and filters to circulate and clean the air, an anesthesiologist, and a surgical nurse. The excision in Moshe will occur in a clean operating room washed with well water and bleach, windows opened a crack to allow the breeze to circulate the stiflingly hot air, no fans or filters, and a nurse borrowed from the pediatrics unit because she was the only available help.

It is possible that neither patient will get an infection and both will be satisfied with the care they received. But it is also possible, perhaps likely, that the patient in Miami will have a smaller scar than the Moshe patient, due to the knife used, and that the Miami patient will have better pain management than the Moshe patient because of access to an anesthesiologist. So even in one surgery, one would expect the Miami surgeon to carry out a more effective surgery than the Moshe surgeon. And over a number of years, as these surgeons do 100 similar surgeries, it becomes increasingly likely that the Moshe surgeon will have poorer surgical outcomes than the Miami surgeon. But has the quality of each surgeon's respective skills really been judged? The quality of medical care the two patients have received has been evaluated. Are surgical skill and medical care the same thing? Perhaps all that has really been learned is that if someone had a tumor, he or she would like it removed in Miami, not Moshe.

The point is that even in medicine, with its more objective outcomes of scar size and infection rate, it is not always so obvious to attribute surgical outcomes to the surgeon alone. There are many factors beyond the surgeon's control that can contribute to her success. Of course, the best conditions in the world will not, over time, make an incompetent surgeon appear to be expert.

Now, imagine walking into a classroom and observing a lesson in order to make judgments about a teacher. How much of what is seen is under the sole control of the teacher, and how much might be attributable to contextual factors that influence what the teacher does and how well students learn? For example, although one can judge the quality of the content being taught, that content is frequently influenced by district-imposed curricula and texts. Social interactions that occur among students are certainly a function of the teacher's establishment of a classroom climate, but students also bring a set of interpersonal dynamics into the classroom. Teachers may design homework assignments or assessments, but others may be compelled to use assessments and assignments developed by the school district. How do parental actions differentially support the intentions of teachers? The point is that it may be impossible to disentangle the individual teacher from all of the classroom interactions and outside variables that influence student outcomes (Braun, 2005a). Outside the classroom, there are additional contextual effects (e.g., interactions within schools and the larger community) that are difficult to isolate (e.g., Pacheco, 2008). At a minimum, if we are to ascribe causal attribution for student learning to teachers, we must attempt to understand these complexities and use analytic processes and methods that can educate stakeholders about the quality and limitations of those causal attributions.

The Purposes of Evaluating Teaching

For a range of reasons, there has been a push for improved teacher evaluation models. The push is strong, in part, because it comes from different constituencies with varying purposes for evaluating teaching. These purposes include educational accountability, strategic management of human capital, professional development of teachers, and the evaluation of instructional policies. The confluence of underlying constituencies and a wide range of purposes has led to intense research and development activity around teacher effectiveness measures.

The first and perhaps most broadly agreed-on purpose for teaching evaluation is public accountability. The time period during which this chapter is being written is an era of pervasive emphasis on educational accountability. Concerns about persistent achievement gaps between Black and White, poor and rich, and English language speakers and English language learners, coupled with concerns about U.S. academic performance relative to other countries (Gonzales et al., 2008; Programme for International Student Assessment [PISA], 2006), have led policy makers to implement unprecedented policies that focus on achievement and other measurable outcomes. Nowhere is this press for a public accounting on measurable outcomes stronger than in the K–12 accountability policy of the No Child Left Behind revision of the Elementary and Secondary Education Act in 2002 (No Child Left Behind Act, 2002). Supported by a growing body of research that identifies teachers as the major school-related determinant of student success (Nye, Konstantopoulos, & Hedges, 2004; Raudenbush, Martinez, & Spybrook, 2007), perhaps it was only a matter of time before the public accounting of student performance gave way to a public accounting of teacher performance. The purpose of teaching evaluation in this way of thinking is to document publicly measurable outcomes that drive decision making and ensure that the public's financial investment in teachers is maximized. It is important to recognize that out-of-school factors continue to be most predictive of student outcomes; but for the variance that can be accounted for by schools, teachers are the greatest source of variation in student test score gains (Nye et al., 2004; Raudenbush, 2004). Estimates of the size of teachers' contribution vary with the underlying analytic model employed (Kyriakides & Creemers, 2008; Luyten, 2003; Rowan, Correnti, & Miller, 2002).

Earlier efforts to account publicly for teaching quality have not been particularly useful or insightful. Characteristics valued in existing compensation systems, such as postbaccalaureate educational course-taking, credit and degree attainment, and years on the job, have modest associations with student achievement (e.g., Clotfelter, Ladd, & Vigdor, 2005; Harris & Sass, 2006; T. J. Kane, Rockoff, & Staiger, 2006).1 In addition, widely used surface markers of professional preparation, such as certification status and coursework, only weakly predict actual teaching quality (Goe, 2007; Wayne & Youngs, 2003).

Footnote 1. The relationship of student achievement growth and teacher experience does increase for the first several years of teaching, but it levels off after only a few years (e.g., Nye, Konstantopoulos, & Hedges, 2004).

Stakeholders have grown increasingly frustrated with the lack of an apparent relationship between student achievement and measures used to evaluate teachers (e.g., Weisberg et al., 2009). This has led to a perspective with a far more empirical view of what defines effective teaching. Largely emanating from individuals who are not representative of the traditional educational research and measurement communities, another goal of teaching evaluation has become prominent—the strategic management of human capital (Odden & Kelley, 2002). This view rests on basic economic approaches to managing the supply of teachers by incentives and disincentives for individuals with specific characteristics. The logic suggests that if the supply of "effective" teachers can be increased by replacing "ineffective" teachers, overall achievement would increase and the achievement gap would decrease (Gordon, Kane, & Staiger, 2006). In this view, the evaluation of teaching is the foundation for managing people via retention, firing, placement, and compensation policies (Heneman, Milanowski, Kimball, & Odden, 2006).

A parsimonious definition of teaching quality guides the measurement approach of human capital management. This is characterized in the following remark by Hanushek (2002): "I use a simple definition of teacher quality: good teachers are ones who get large gains in student achievement for their classes; bad teachers are just the opposite" (p. 3). Hanushek adopted this definition because it is empirically based, straightforward, and, in his and others' views, tractable. Most of all, such a definition avoids defining quality by the execution of particular teaching processes or the possession of specific teacher characteristics, factors that have had modest, if any, relationships to valued outcomes (e.g., Cochran-Smith & Zeichner, 2005). Although recent approaches to the strategic management of human capital have raised the stakes substantially for how teacher evaluations are used, most policies broaden teacher evaluation to include other factors besides student achievement growth.
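To make the gain-based logic behind such definitions concrete, the sketch below computes a bare-bones "adjusted gain" for each teacher's class from prior- and current-year test scores. This is only an illustration of the general idea, not any state's or district's operational value-added model; real models add student and classroom covariates, multiple years of data, and shrinkage estimation, and all data, names, and numbers here are hypothetical.

```python
# Illustrative sketch only: average adjusted gain per teacher.
# Operational value-added models are far more elaborate; everything here is hypothetical.
from statistics import mean

# (prior_year_score, current_year_score) for each student, grouped by teacher
classes = {
    "Teacher A": [(410, 452), (455, 470), (388, 430), (500, 505)],
    "Teacher B": [(430, 441), (470, 468), (395, 402), (520, 531)],
}

# Typical gain across all students, used as a simple benchmark.
all_students = [s for roster in classes.values() for s in roster]
overall_gain = mean(post - pre for pre, post in all_students)

def adjusted_gain(roster):
    """Average amount by which a class's gains exceed the overall benchmark gain."""
    return mean((post - pre) - overall_gain for pre, post in roster)

for teacher, roster in classes.items():
    print(f"{teacher}: adjusted gain = {adjusted_gain(roster):+.1f} points")
```

In Hanushek's terms, teachers whose classes repeatedly post gains above the benchmark would be labeled more effective and those below it less effective; the attribution problems buried in that move are taken up throughout this chapter.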
Nevertheless, student achievement growth estimates typically are a dominant factor in making determinations of effectiveness.

In addition to strategic management of human capital, teacher evaluation has been viewed as a means for improving individual and organizational capacity. There have been longstanding concerns that the professional development of teachers, beginning even in preservice, is disconnected from the particular needs of individual teachers and inconsistent with understandings of how teachers learn (Borko, 2004) and the supports they need to teach well (Johnson et al., 2001; Johnson, Kardos, Kauffman, Liu, & Donaldson, 2004; Kardos & Johnson, 2007). There is also increasing research that documents how organizational variables—the alignment of curriculum, the presence of professional learning communities and effective leadership, and the quality of reform implementation—are related to the nature and quality of teaching (Honig, Copland, Rainey, Lorton, & Newton, 2010). With capacity building as a goal, teaching evaluation can be a tool that diagnoses the practices most in need of improvement. The goal of teaching evaluation from this perspective is to improve what teachers and schools know and are able to do around instruction. The measures used toward this end vary dramatically, from low-inference checklists of desired behaviors to high-inference holistic rubrics of underlying teaching quality values to school-level measures of teaching contexts (Hirsch & Sioberg, 2010; Kennedy, 2010).

Finally, researchers and evaluators use teaching evaluation to assess whether and how various education policies are working. Deriving from both measurement and evaluation perspectives, teaching evaluation has been used to investigate the degree to which school and curricular reforms and their implementation influence instruction (e.g., Rowan, Camburn, & Correnti, 2004; Rowan & Correnti, 2009; Rowan, Jacob, & Correnti, 2009), the impacts of professional development (Desimone, Porter, Garet, Suk Yoon, & Birman, 2002; Malmberg, Hagger, Burn, Mutton, & Colls, 2010), and how particular policies such as academic tracking (Oakes, 1987) and school and class size (Molnar et al., 1999) relate to teacher practice. Often, the types of measures used for this purpose are logs or other surveys that ask teachers to report on the frequency of important activities or practices. By evaluating teaching, researchers and evaluators can assess the degree to which policies intended to shape teaching and learning are working as intended.

In this chapter, the classes of measures that can be used to support the evaluation of teaching for one or more of these purposes are described: educational accountability, strategic management of human capital, professional development of teachers, and the evaluation of instructional policies. The next section looks briefly at the history of assessing teaching quality and considers the ways in which these multiple purposes have played out in recent history.

A Selective History of Assessing Teaching Quality

Only a few years ago, S. Wilson (2008) characterized the U.S. national system of assessing teacher quality as "undertheorized, conceptually incoherent, technically unsophisticated, and uneven" (p. 24). Although Wilson focused on the system of assessments used to characterize teacher quality, the same characterization can be leveled at the constituent measures and practices that make up what she referred to as a "carnival of assessment" (p. 14). Three dominant assessment purposes at the "carnival" are described here, each of which renders judgments about preservice, in-service, and master teaching, respectively. Across the three purposes, there are both strengths and weaknesses that lay the foundation for understanding current research and development activities.

By far, the most common purpose of formal assessment in teaching occurs for beginning licensure, in which the state ensures that candidates have sufficient knowledge, typically of content and basic skills, so that the state can warrant that the individual will "do no harm."2 These tests have almost always been, and continue to be, standardized assessments that require teacher candidates to meet a particular state-established passing standard to be awarded a license (S. M. Wilson & Youngs, 2005). Tests have differed in terms of the proportion of multiple-choice versus constructed-response questions, whether they are paper-and-pencil or computer-delivered, whether they are linear or adaptive, and the methodology by which passing standards are set. State licensure requirements vary in terms of the number and types of tests required and are not guided by a coherent theory of teaching and learning. Tests are designed most often by testing companies in collaboration with states. Although this system results in adequate levels of standardization within a state, the tests are criticized as being disconnected from teaching practice and based on incomplete views of teaching (e.g., Klein & Stecher, 1991).

Footnote 2. That is, in the legal context of licensure, failure to demonstrate sufficient knowledge or skill on an assessment would present some probability of causing harm in the workplace (M. T. Kane, 1982).

Such tests are designed to support inferences about the knowledge of prospective teachers. They publicly account for what teachers know prior to entering a classroom. They deliberately have not been designed to encourage inferences about the quality of teaching, although the implicit assumption on the part of many is that higher scores are associated with higher levels of teaching proficiency, however defined. When researchers have investigated this assumption, the evidence of any relationship has been weak, at best (Buddin & Zamarro, 2009; Goldhaber & Hansen, 2010; National Research Council, 2008).

A number of states have more recently adopted the view that, to attain a full license, there ought to be some direct evidence of teaching. Treating the initial license as provisional, they adopted measures that involved more direct evidence of teaching. Almost always grounded in professional standards for teachers, assessments included live classroom observations, interviews, and teacher-developed portfolios that contain artifacts of classroom practice such as planning documents, assignments, and videotapes (California Commission on Teacher Credentialing, 2008; Connecticut State Department of Education, Bureau of Program and Teacher Evaluation, 2001; Educational Testing Service [ETS], 2001). These assessments are intended to provide both public accountability and formative information about how to improve individuals' capacity while continuing to adhere to the "do no harm" principle. Pass rates, particularly given multiple opportunities to complete the assessment as is characteristic of these systems, are very high (more than 95%; e.g., Connecticut State Department of Education, Bureau of Program and Teacher Evaluation, 2001; Ohio Department of Education, 2006). Taken together, the licensure testing processes serve the function of preventing a relatively small proportion of individuals from becoming teachers, but they do not support inferences about the quality of teachers or teaching beyond minimal levels of competence. Furthermore, because these instruments are disconnected from practice—either by not being able to sort teaching or by not being close enough to practice to provide information about what a teacher is and is not able to do beyond minimal levels—this group of assessment practices provides modest accountability and capacity-building information.

In addition to supporting judgments about individual teacher candidates, beginning teacher assessment is also influenced by teacher education program accreditation. Almost all states use some combination of teacher testing and program accreditation to regulate and hold programs accountable for the quality of teachers entering classrooms (S. M. Wilson & Youngs, 2005). Accreditation is governed by state and regional accrediting agencies as well as by two national organizations: the National Council for Accreditation of Teacher Education (NCATE) and the Teacher Education Accreditation Council (TEAC).3 Accreditation requirements vary but generally include a site visit and a paper review of program offerings, program coherence, and the alignment of program standards with national organizations' subject matter teaching standards. In some accreditation processes, programs must provide evidence that graduates can teach competently and have acquired relevant subject matter knowledge and teaching experiences. That evidence can come from whatever assessments the program uses, and there are few, if any, common assessments. These processes require much of teacher education programs (e.g., Barnette & Gorham, 1999; Kornfeld, Grady, Marker, & Ruddell, 2007; Samaras et al., 1999) and may produce changes in program structure and processes (Bell & Youngs, 2011); however, there is no research that documents the effects of accreditation on preservice teacher learning or K–12 pupil learning.

Footnote 3. As of October 2010, NCATE and TEAC announced the merger of the two organizations into a unified body, the Council for the Accreditation of Educator Preparation (CAEP).

A second dominant assessment purpose occurs once teachers are hired and teaching in classrooms. States and districts typically set policies concerning the frequency of evaluation and its general processes. In states with collective bargaining, evaluation is often negotiated by the administration and the union. Despite the variety of agencies that have responsibility for the content of annual evaluations, evaluations are remarkably similar (Weisberg et al., 2009). They are administered by a range of stakeholders—coaches, principals, central office staff, and peers—and use a wide range of instruments, each with its own idiosyncratic view of quality teaching.4 Although evaluations apply to all teachers, the systematic and consistent application of evaluative judgments is rare (e.g., Howard & Gullickson, 2010). Whereas traditional assessment practices for preservice teachers have had standards but are disconnected from teaching practice, in-service assessment practices have been connected to practice but lack rigorous standards. This has led in-service teaching evaluation to be viewed as a bankrupt and uninformative enterprise (Toch & Rothman, 2008; Weisberg et al., 2009). Evaluations are often viewed as bureaucratic functions that provide little or no useful information to teachers, administrators, institutions, or the public.

Footnote 4. Annual teaching evaluations are generally idiosyncratic within and across districts; however, there are examples of more coherent district-level practices in places like Cincinnati and Toledo. Increasingly, as a part of the Teacher Incentive Fund (TIF) grants, districts are experimenting with pilot programs that have higher technical quality.

Howard and Gullickson (2010) have made the case that teacher evaluation efforts should meet the Personnel Evaluation Standards (Gullickson, 2008), which include the following: propriety standards, addressing legal and ethical issues; utility standards, addressing how evaluation reports will be used and by whom; feasibility standards, addressing the practicality and feasibility of evaluation systems; and accuracy standards, addressing the validity and credibility of the evaluative inferences. These standards can be equally applied to particular measures within an evaluative system. It is fair to say that there is a substantial chasm between the values expressed in these standards and the state of teacher evaluation practice for preservice and in-service teachers. It is rare to find an evaluation system in which there is any information collected as to the validity or reliability of judgments. Often principals and other administrators are reluctant to give anything but acceptable ratings because of the ensuing responsibilities to continue to monitor and support individuals determined to be in need of improvement. It is extremely rare that teachers—tenured or not—are removed from their jobs simply because of poor instructional performance. Routinely, the propriety and accuracy of the evaluation is challenged at great cost to the school system (Pullin, 2010). Current policies are attempting to transform this historic state of affairs by largely defining teaching effectiveness as the extent to which student test scores improve on the basis of year-to-year comparisons. The methods that are used necessarily force a distribution of individual teachers and are explicitly tied to student outcomes. The details of these methods and the challenges they present are discussed in subsequent sections of this chapter.

One other major teacher evaluation purpose that was first implemented in the mid-1990s is the National Board for Professional Teaching Standards (NBPTS) certification system. Growing out of the report A Nation Prepared: Teachers for the 21st Century (Carnegie Forum on Education and the Economy, 1986), a system of assessments was designed to recognize and support highly accomplished teachers. All NBPTS-certified teachers are expected to demonstrate accomplished teaching that aligns with five core propositions about what teachers should know and be able to do as well as with subject- and age range–specific standards that detail the characteristics of highly accomplished teachers (NBPTS, 2010). The architecture of the NBPTS system is described by Pearlman (2008) and was used to guide the development of 25 separate certificates, each addressing a unique combination of subject area and age range of students. For all certificates, teachers participate in a year-long assessment process that contains two major components. The first requires teachers to develop a portfolio that is designed to provide a window into practice. Portfolio entries require teachers to write about particular aspects of their practice as well as to include artifacts that provide evidence of this practice. Artifacts can include videos and samples of student work. In all cases, teachers are able to choose the lesson(s) they want to showcase, given the general constraints of the portfolio entry. Examples of a portfolio entry include videos of the teacher leading a whole-class discussion or teaching an important concept. The second major component of NBPTS certification is the assessment center activities. Candidates go to a testing center and, under standardized testing conditions, respond to six constructed-response prompts about important content and content pedagogy questions within their certificate area. To achieve certification, candidates need to attain a total score across all assessment tasks that exceeds a designated passing standard.

On balance, research suggests that the NBPTS system is able to identify teachers who are better able to support student achievement—as measured by standardized test scores—than are teachers who attempt certification but do not pass the assessment, but the differences are quite modest (National Research Council, 2008). The states in which teachers have been most likely to participate in the NBPTS system are those that have provided monetary rewards or salary supplements for certification. This has led to NBPTS being very active in a relatively small number of states, with only limited participation in other states. In contrast to assessment policies that shape preservice and in-service teaching, NBPTS takes a nuanced and professional view of teaching via a standardized assessment system that is tied to teaching practice. However, it is voluntary, touches relatively few teachers in most states, and is expensive.

Although this discussion does not cover all teacher evaluation practices, it does provide a synopsis of the most common formal assessment and evaluation systems for teachers. Taken together, the evidence suggests that even the most common assessment practices have had a modest impact on the structures and capacity of the system to improve educational performance. Looking across the practices, there is no common view of quality teaching, and sound measurement principles are missing from at least some core practices of in-service evaluation. These findings, along with political reluctance to make evaluative judgments (e.g., Weisberg et al., 2009), have led many researchers and policy makers to conclude that the measures that make up the field's most common assessments will be unable to satisfy the ambitious purposes of accountability, human resource management, and instructional improvement that are driving current policy demands around evaluation. Thus, the chapter next reviews measures the field is developing and implementing to support purposes ranging from accountability to capacity building. The primary features of different classes of measures, the nature of inferences they potentially can support, and current validation approaches and challenges are described.

Conceptualizing Measures of Teaching Quality

Teaching quality is defined in many different ways and operationalized by the particular sets of measures used to characterize quality. Every measure brings with it, either explicitly or implicitly, a particular perspective as to which aspects and qualities of teaching should receive attention and how evidence ought to be valued. For example, there have been heated political arguments about whether dispositions toward teaching ought to be assessed as part of teacher education (Hines, 2007; Wasley, 2006). Although there is general agreement that the impact on students ought to be a critical evaluative consideration, the indicators of impact are not agreed upon. Some are satisfied with a focus on subject areas that are tested in schools. Others want both to broaden the academic focus and to emphasize outcomes that are associated with mature participation in a democratic society (Koretz, 2008; Ravitch, 2010).

Although reasonable people disagree about what distinguishes high-quality teaching, it is important to identify clearly the constructs that comprise teaching quality and how those constructs may be understood relative to the measures used in evaluation systems. Figure 20.1 describes a model we have developed that presumes that teaching quality is interactional and constructive. Within specific teaching and learning contexts, teachers and students construct a set of interactions that is defined as teaching quality. Six broad constructs make up the domain of teaching quality: teachers' knowledge, practices, and beliefs, and students' knowledge, practices, and beliefs. The domain of teaching quality, and by extension the constructs themselves, are intertwined with critical contextual features, such as the curriculum, school leadership, district policies, and so on. Therefore, by definition, specific instruments measure both context and construct. As can be seen in the figure, instruments may detect multiple constructs or a single teaching quality construct. For example, observation protocols allow the observer to gather evidence on both teacher and student practices, whereas assessments of content knowledge for teaching measure only teachers' knowledge. Finally, the figure suggests that any one measure (e.g., a knowledge test or value-added models [VAM]) does not capture the whole domain of teaching quality.

[FIGURE 20.1. The contextual factors, constructs, and measures associated with teaching quality. The figure shows the target domain of teaching quality embedded within contextual factors (curriculum, school leadership, policy, community, students and colleagues, resources). Teacher constructs (teacher knowledge, teacher practices, teacher beliefs) and student constructs (student knowledge, student practices, student beliefs) are linked to measures, including content knowledge for teaching tests, knowledge of teaching tests, belief instruments, observation measures, artifact measures, teacher portfolios, growth models, value-added methods, graduation rates, course-taking patterns, and student portfolios.]

In many fields, it is reasonable to expect that particular classes of measures are associated with particular stages of educational or professional development. For teaching, that has been partially true, particularly with content knowledge measures being used as a requirement to attain a teaching license. By and large, however, the measures reviewed here are being considered for use throughout the professional span during which teachers are assessed. At the time of the writing of this chapter, how the measures actually are used to meet particular assessment purposes remains to be seen. Nevertheless, because of the lack of any inherent relationship between category of measure and particular use, the remainder of this chapter is organized by construct focus rather than by assessment purpose.

MEASURES OF TEACHING QUALITY

In this section, an overview of measures that have been developed to support inferences about constructs associated with teaching quality is presented. For each set of measures, their core design characteristics, the range of uses to which they have been put, and the status of evidence to support a validity argument for the evaluation of teaching are reviewed.

Teacher Knowledge

Knowledge of content. Knowledge of content has been a mainstay of the teacher licensure process since the 1930s, with the advent of the National Teacher Examinations (Pilley, 1941). With the requirement for highly qualified teachers in the reauthorization of the Elementary and Secondary Education Act (No Child Left Behind Act, 2002), all states now require teachers to demonstrate some level of content knowledge about the subjects they are licensed to teach. These assessments typically consist of multiple-choice questions that sample content derived from extant disciplinary and teaching standards and then confirmed through surveys of practitioners and teacher educators. Individual states set passing scores for candidates based on standard-setting processes (Livingston & Zieky, 1982) that are used to define a minimum-level "do no harm" threshold. Some states also require tests that assess knowledge of pedagogy and content-specific pedagogy. Although some of these tests may include constructed-response formats, the basic approach to design and validation support is similar for both content and pedagogical tests.5

Footnote 5. Although all states require demonstrations of content knowledge, some also require candidates to pass assessments of basic knowledge of reading, writing, and mathematics. We do not include these tests in our analysis because these instruments test knowledge and skills that are equally germane for any college student, not just teacher candidates.

The validity argument for these kinds of tests has long been a source of debate. M. T. Kane (1982) discussed two possible interpretations: one concerned with the ability of the licensure test to predict future professional performance and the other with evaluating current competence on a set of skills and knowledge deemed necessary but not sufficient for professional practice. M. T. Kane (1982) argued that the latter interpretation was appropriate for licensure tests, as any single instrument would be insufficient to capture the set of complex and coordinated skills, understandings, and experiences necessary for professional competence. In endorsing the much more limited competence interpretation, M. T. Kane (1982) argued that establishing content validity was the critical task for a licensure test validity argument. Evidence is expected to demonstrate the adequacy of content needed for minimal job performance, both in terms of content representation and expectations for meeting the passing standard. Processes that include job analysis and standard-setting studies typically are used to provide such evidence. The adequacy of scores is typically supported through standard psychometric analyses that include test form equating, reliability, scaling, differential item functioning (DIF), and group performance studies. Other scholars have agreed that it is both inappropriate and infeasible to expect a broader validity argument (e.g., Jaeger, 1999; Popham, 1992).

Even under this relatively constrained set of requirements, the status of validity evidence in practice is uneven. In its 2001 report, the National Research Council reviewed the validity evidence of the two primary organizations that design, develop, and administer these assessments. ETS6 was viewed as having evidence to support the content validity argument, although some assessments were relying on studies that were dated. National Evaluation Systems (NES)7 tests were typically unavailable, and so the study panel concluded that, for a very substantial amount of teacher licensure testing, the available evidence did not satisfy even the most basic requirements of available information articulated in the Standards for Educational and Psychological Testing (AERA et al., 1999).

Footnote 6. The authors of this chapter were both employees of ETS as this chapter was written. The statements included here are a description of the conclusions of the National Research Council (2001) study report Testing Teacher Candidates: The Role of Licensure Tests in Improving Teacher Quality. We believe our statements are a fair representation of the study findings.

Footnote 7. National Evaluation Systems was acquired by Pearson Education in 2006 and is now known as Evaluation Systems of Pearson.

M. T. Kane's (1982) position that content validation is by itself sufficient to establish the validity of licensure assessments has been argued against strongly by other experts (e.g., Haertel, 1991; Haney, Madaus, & Kreitzer, 1987; Moss & Schutz, 1999; Pullin, 1999). A number of researchers and policy makers, including the authors of the National Research Council (2001) study, have argued that these assessments ought to be evaluated using the predictive criterion interpretation, including a demonstration of a relationship between scores on the licensure test and other measures associated with teaching. To that end, researchers have conducted studies relating scores on teacher licensure assessments to student gains in achievement by studying practicing teachers who varied in their licensure test scores, including those who would not have met the passing standard in one state even though they scored sufficiently high to teach in another. There is some evidence of a quite modest relationship between licensure test scores and student outcomes (e.g., Goldhaber & Hansen, 2010). Gitomer and Qi (2010), however, observed that the licensure tests were successful in identifying individuals who performed so substantially below the passing standard that such individuals would not ever have become practicing teachers in any locale and, thus, would not have been part of the distribution studied by Goldhaber and Hansen. Because these individuals do not attain a license to teach, any studies examining the relationships between test scores and other outcomes are attenuated.

In part because content knowledge tests have been used so widely, there is a large body of evidence demonstrating disparate impact for minority candidates. Scores and passing rates are significantly lower for African American candidates than for White candidates (Gitomer, 2007; Gitomer, Latham, & Ziomek, 1999; Gitomer & Qi, 2010), which raises questions about the validity of the assessments and whether bias is associated with them. Although test developers attempt to ensure fairness through their test development and analysis processes (e.g., DIF analyses), it is imperative that research continue not only to examine issues of bias but also to pursue strategies to mitigate the unequal educational opportunities that many minority candidates have experienced (e.g., National Research Council, 2001, 2008).

Knowledge of content for teaching. Teaching involves much more than simple mastery of content knowledge. Shulman (1986) argued persuasively that teachers also needed to master a body of knowledge he identified as pedagogical content knowledge (PCK). Shulman argued that PCK involves pedagogical strategies and representations that make content understandable to others and also involves teachers' grasping what makes particular content challenging for students to understand, what kinds of conceptions and misconceptions students might have, and how different students might interact with the content in different ways. Building on Shulman's (1986) ideas, Hill, Ball, and colleagues focused on mathematics and developed theory and assessments of what they called mathematical knowledge for teaching (MKT; Ball, Thames, & Phelps, 2008). MKT attempts to specify the knowledge of mathematical content that is used in practice, differentiating the advanced subject knowledge that one might learn as a student majoring in a college discipline from the particular forms of knowledge that teachers need to help their students learn concepts in K–12 education.

Content knowledge for teaching (CKT) is the more general term applied to this type of knowledge as it is used across different subject matter domains (e.g., science, social studies). CKT incorporates what Shulman called PCK and further specifies both the content knowledge and the PCK that teachers need to know in particular subject areas. The argument that accompanies CKT suggests that teachers must know mathematics differently than someone who uses math in her daily life but is not charged with teaching children math. For example, a common task of teaching requires that teachers assess the degree to which certain problems allow students to practice a particular math objective. In Figure 20.2, the teacher must be able to recognize whether a proportion can be used to solve each word problem. Although adults may use proportions in their professional or personal lives, teachers must be able to look at problems and determine whether a problem can be solved in a specific way that meets a learning objective.

FIGURE 20.2. A sample question from MET Mathematics 6–8.

Mr. Sucevic is working with his students on understanding the use of proportional relationships in solving problems. He wants to select some problems from a mathematics workbook with which his students can practice. For each of the following problems, indicate whether or not it would be answered by setting up and solving a proportional relationship.

A) Cynthia is making cupcakes from a recipe that requires 4 eggs and 3 cups of milk. If she has only 2 eggs to make the cupcakes, how many cups of milk must she use?
B) John and Robert are each reading their books at the same rate. When John is on page 20, Robert is on page 15. What page will John be on when Robert is on page 60?
C) Julie and Karen are riding their bikes at the same rate. Julie rides 12 miles in 30 minutes. How many miles would Karen ride in 35 minutes?
D) Rashida puts some money into an account that earns the same rate each month. She does not remove or add any money to the account. After 6 months, the balance in the account is $1,093.44. What is the balance in the account after 12 months?
E) A square with area 16 square units can be inscribed in a circle with area 8π square units. How many square units are in the area of a square inscribed in a circle that has area 24π square units?

(For each problem, the response options are "Would Be Answered by Setting Up and Solving a Proportional Relationship" and "Would Not Be Answered by Setting Up and Solving a Proportional Relationship.")
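To illustrate the kind of reasoning the item in Figure 20.2 calls for (our gloss, not the instrument's official scoring rationale), consider problems A and D. Problem A can be set up as a proportion because the eggs-to-milk ratio is fixed, whereas problem D cannot, because a balance earning the same rate each month grows multiplicatively rather than in direct proportion to elapsed time:

```latex
% Problem A: the eggs-to-milk ratio is fixed, so a proportion applies.
\[
  \frac{4~\text{eggs}}{3~\text{cups}} \;=\; \frac{2~\text{eggs}}{x~\text{cups}}
  \quad\Longrightarrow\quad
  x = \frac{2 \times 3}{4} = 1.5~\text{cups}.
\]
% Problem D: with a fixed monthly rate r, the balance grows exponentially,
% B(t) = B(0)(1 + r)^t, so the balance is not proportional to elapsed time and
% a proportion such as 1093.44/6 = x/12 would give the wrong answer.
\[
  B(12) \;=\; B(6)\,(1 + r)^{6} \;\neq\; 2\,B(6) \quad \text{in general}.
\]
```

Recognizing which items fit a given learning objective in this way is precisely the sort of work that CKT items are designed to probe.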
Ball et al. (2008) highlighted six forms of CKT that fall into two categories—content knowledge and PCK. Each of the main categories has three subcategories. Content knowledge is composed of common content knowledge, specialized content knowledge, and horizon content knowledge. Specialized content knowledge is knowledge that enables work with students around content, but not knowledge that other professionals using the same content (e.g., mathematics) in their jobs might find important. For example, a teacher needs not only to carry out a mathematical operation (e.g., dividing fractions) but also to understand why the operation works, so that different student solutions can be understood as being mathematically reasonable or not. Specialized content knowledge contrasts with common content knowledge, which is knowledge held by individuals who use that content in their work and personal lives. Horizon content knowledge involves an understanding of how different content is interrelated across curricular topics both within and across school years. PCK, the second organizing category of knowledge, is composed of knowledge of content and students, knowledge of content and teaching, and knowledge of content and curriculum. Knowledge of content and students combines content knowledge and knowledge of how students interact with and learn the subject. It includes, for example, knowledge of what aspects of a subject students are likely to find difficult, errors students might make, and difficulties students might encounter in understanding a subject. Knowledge of content and teaching includes knowledge of the best examples to use, how to link subject-specific tasks, and ways of responding to students' ideas and confusion that will develop their understanding of the subject. Finally, knowledge of content and curriculum focuses on knowledge of how to sequence and organize a subject and of the material programs that can be used to support students' developing understanding of the subject.

Hill, Schilling, and Ball (2004) described the developmental processes for constructing items of these types and also provided information about the psychometric quality and structure of assessment forms built with these items. Schilling and Hill (2007) have laid out a validity argument for these kinds of assessments and have conducted a research program to marshal evidence to evaluate the argument. To date, these assessments have been used in the context of research, particularly in the context of examining the impact of professional development and curricular interventions. They have not been used as part of licensure or other high-stakes testing programs. Thus, the validity argument pertains to use as a research tool. In one study, Hill, Dean, and Goffney (2007) conducted cognitive interviews of problem solutions by teachers, nonteachers, and mathematicians. Although they found that mathematical knowledge itself was critically important to solving the problems, they observed important differences that were not simply attributable to content knowledge. Mathematicians, for example, sometimes had difficulty interpreting nonstandard solutions, the kinds of solutions that students often generate. Although mathematicians could reason their way through problems, it was teachers who could call on their prior experiences with students to reason through other problems. Krauss, Baumert, and Blum (2008) developed another measure of PCK and also found strong but not uniform relationships with content knowledge—in some cases, teachers brought unique understandings that allowed them to solve problems more effectively than others who had far stronger mathematical content knowledge. Other studies have found modest relationships between CKT measures and student achievement gains (Hill, Rowan, & Ball, 2005) and relationships with judgments of classroom instruction through observation (Hill et al., 2008). The lack of studies that address questions of impact on subgroups of teachers (e.g., special education teachers, teachers of color, or teachers of English language learners) likely is due to the purposes and scope of the existing research studies. The studies to date have typically relied on relatively small samples. Studies currently being conducted will yield data based on far larger samples and broader sets of measures of teacher quality (e.g., Bill and Melinda Gates Foundation, 2010). To date, there has been only limited work in other content domains (e.g., Phelps, 2009; Phelps & Schilling, 2004), but assessments are being developed and tested in English language arts. Most validity work has been done on MKT, not the more general CKT. Given the use of MKT as a research tool, there is a relatively strong validity argument. However, the validity argument for MKT for other uses (e.g., teacher preparation program evaluation) is modest but growing. The validity argument for assessments of CKT, used both in research and for personnel decisions, is nascent but also growing.

Teacher Practices

Observations. Scholarship on observation protocols goes back to the turn of the 20th century (Kennedy, 2010). The actual practice of individuals with authority using some type of protocol to make evaluative decisions about a teacher likely dates back even further. Kennedy (2010) suggested that for more than half of the 20th century the protocols in use were general, poorly defined, idiosyncratic, heavily subjective, and often focused on teachers' personal characteristics rather than on teaching. Given the view of teaching as involving socially and culturally situated interactions between teachers and students to support the construction of knowledge, instruments that are unable to detect these types of interactions are not reviewed here. This means that the instruments from the productive history of process–product research in the 1970s and 1980s that used observation protocols to assess teaching quality are not included (for a review of this research, see Brophy & Good, 1986). Instead, the focus is on the relatively new and small number of instruments and associated research that have been developed and used over roughly the past 25 to 30 years. These instruments are designed to measure whole-class instruction (e.g., not tutoring situations) and adopt the view that teaching and learning occur through interactions that support the construction of knowledge.

The observation protocols currently in use generally adhere to the following description: The protocol begins with an observer developing a record of evidence from the classroom for some defined segment of time, typically without making any evaluative judgments. At the end of the segment, observers use a set of scoring criteria or rubric that typically includes a set of Likert scales to make both low- and high-inference judgments based on the record of evidence. Those judgments result in numerical scores. Although some of the protocols have been used to evaluate thousands of teachers (e.g., Charlotte Danielson's Framework for Teaching has been the most widely used), the protocols have rarely been used for summative consequential decisions, although this is changing rapidly. Despite the fact that many districts are considering or have already begun using these observation protocols for consequential decisions, there is still much that is not known about the strength of the validity argument for these protocols as a group as well as the strength of the validity argument for individual protocols. Although there are exceptions, the instruments have been used in both live and video-based observation settings. Bell et al. (in press) have recently used an argument approach to evaluate the validity of one observation protocol.

Protocols tend to fall into two broad categories—protocols for use across subject areas and those intended for use in specific subject areas (Baker, Gersten, Haager, & Dingle, 2006; Danielson, 1996; Grossman et al., 2010; Hill et al., 2008; Horizon Research, 2000; Pianta, La Paro, & Hamre, 2007; Taylor, Pearson, Peterson, & Rodriguez, 2003). There are subject-specific protocols in mathematics, science, and English language arts, but none are evident for social studies classrooms (e.g., history, government, geography). There are more protocols for use at the elementary grades than at the secondary grades. Many of the subject-specific protocols have been studied in K–3, K–5, or K–8 classrooms, so it is unclear whether or how the protocols might function differently in high school classrooms. These protocols reflect a particular perspective on teaching quality—some privilege a belief in constructivist perspectives on teaching, and others are more agnostic to the particular teaching methods used. Observation protocols are generally developed and vetted within a community of practice that has a corresponding set of teaching standards (Danielson & McGreal, 2000; Gersten, Baker, Haager, & Graves, 2005; La Paro, Pianta, & Stuhlman, 2004; Piburn & Sawada, 2000). Because much of the research on these protocols has happened in the context of university-based research projects, the raters themselves are often graduate students or faculty members. With this rater group, trainers are able to teach raters to see teaching and learning through the lens of their respective protocol at acceptable levels of interrater agreement (e.g., Hill et al., 2008). Initial qualification of raters typically requires agreement with master codes at some prespecified level (e.g., 80% exact match on a 4-point scale).

Among both researchers and practitioners, the best methods and standards for judging rater agreement on holistic observation protocols are evolving. The simplest and most common way of judging agreement is to calculate the proportion of scores on which raters agree. For many protocols, agreement requires an exact match in scores (e.g., Danielson & McGreal, 2000). But for others with larger scales, raters are deemed to agree if their scores do not differ by more than 1 score point (e.g., Pianta et al., 2007). Such models do not take into account the overall variation in scores assigned—raters may appear to agree by virtue of not using more than a very narrow range of the scale. More sophisticated analyses make use of rater agreement metrics that take into account the distribution of scores, including Cohen's kappa,8 intraclass correlations, and variance component decomposition. Emerging models attempt to understand a range of factors that might affect rater quality and agreement. For example, in addition to rater main effects, Raudenbush, Martinez, Bloom, Zhu, and Lin (2010) consider how rater judgments can interact with the classrooms, days, and lesson segments observed. To the extent that these variance components (or facets, if g-study approaches are used; see Volume 1, Chapter 3, this handbook) can be estimated, it may be possible to develop observation scores that adjust for such rater effects. When using these models, preliminary findings suggest there are substantial training challenges in obtaining high levels of agreement, particularly with higher inference instruments (e.g., Gitomer & Bell, 2012; McCaffrey, 2011). As observation protocols are incorporated into evaluation systems, those systems will need to ensure not only that raters are certified but also that effective monitoring and continuing calibration processes are in place. In general, there is little or no information provided about whether or how raters are calibrated over time (Bell, Little, & Croft, 2009).

Footnote 8. It is important to note that kappa can be sensitive to skewed or uneven distributions and, therefore, may be of limited value depending on the particular score distributions on a given instrument (e.g., Byrt, Bishop, & Carlin, 1993).
In contrast to observation protocols that were largely designed as professional development tools, the design and development of instructional collections and artifact protocols gave more attention to psychometric quality from the outset. Even so, research remains highly uneven—a moderate number of studies with very small numbers of teachers and a handful of studies with large numbers of teachers. Claims about such protocols as a group should therefore be taken as preliminary. Instructional collections are evidence collection and scoring protocols that typically involve one or more holistic judgments about a range of evidence that often addresses the multiple constructs that comprise the teaching quality construct in Figure 20.1. Instructional collections draw inferences from evidence that can include lesson plans, assignments, assessments, student work samples, videos of classroom interactions, reflective writings, interviews, observations, notes from parents, evidence of community involvement, and awards or recognitions. These protocols identify what types of evidence the teacher is expected to submit within broad guidelines; the teacher is able to choose the specific materials upon which the judgment is based. Often the teacher provides both an explicit rationale for the selection of evidence in the collection and a reflective analysis to help the reader or evaluator of the collection make sense of the evidence. Artifact protocols can be thought of as a particular type of instructional collection that is much narrower. The protocols most widely researched are designed to measure the intellectual rigor and quality of the assignments teachers give students as well as the student work that is produced in response to those assignments (e.g., Borko, Stecher, & Kuffner, 2007; Newmann, Bryk, & Nagaoka, 2001). These protocols are designed to be independent of the academic difficulty of a particular course of study. For example, an advanced physics assignment would receive low scores if students were simply asked to provide definitions. The judgments made focus on an important but limited part of the teaching quality domain, focusing almost exclusively on teacher and PS N particularly with higher inference instruments (e.g., Gitomer & Bell, 2012; McCaffrey, 2011). As observation systems are included in evaluation systems, systems will need to ensure not only that raters are certified but also that effective monitoring and continuing calibration processes are in place. In general, there is little or no information provided about whether or how raters are calibrated over time (Bell, Little, & Croft, 2009). A research literature is now beginning to amass around these observation protocols. Research is being conducted examining the extent to which empirical results support the underlying structure of the instruments (e.g., La Paro et al., 2004) and changes in practice as the result of teacher education (Malmberg et al., 2010) and professional development (Pianta, Mashburn, Downer, Hamre, & Justice, 2008; Raver et al., 2008). A number of studies are now being reported that examine the relationship of observation scores to student achievement gains (Bell et al., in press; Bill and Melinda Gates Foundation, 2011b; Grossman et al., 2010; Hamre et al., in press; Hill, Umland, & Kapitula, 2011; Milanowski, 2004). Thus, over the next 5 to 10 years, a very strong body of research is likely to emerge that will provide information about the validity and potential of classroom observation tools. 
It is important to understand that these protocols are designed to evaluate the quality of classroom practice. As described in Figure 20.1, factors such as curriculum, school policy, and environment as well as the characteristics of the students in the classroom are being detected by these observation protocols. Thus, causal claims about the teacher require another inferential step and are not transparent. Furthermore, given the high-stakes uses to which these instruments are being applied, the state of the current validity argument is weak.

committees were consulted extensively. As a class of protocols, there has been significant attention to raters and score quality. Although there have been graduate students, principals, and other education professionals trained to rate, raters have overwhelmingly been teachers with experience at the grade level and subject area being assessed (Aschbacher, 1999; Boston & Wolf, 2006; Matsumura et al., 2006; Matsumura, Garnier, Slater, & Boston, 2008; Newmann et al., 2001). Training on both instructional collection and artifact protocols is usually intensive (e.g., 3 to 5 days for artifacts and sometimes longer for instructional collections) and makes use of benchmark and training papers. For almost all protocols, raters are required to pass a certification test before scoring. Although the quality of the training as judged by interrater agreement varies across protocols and studies, the literature suggests it is possible to train raters to acceptable levels of agreement (more than 70%) with significant effort (Borko et al., 2007; Gitomer, 2008b; Ingvarson & Hattie, 2008; Matsumura et al., 2006; M. Wilson, Hallam, Pecheone, & Moss, 2006). As with observations, score accuracy is often a challenge to the validity of interpretations of evidence for instructional collections. Accuracy problems, most often in the form of rater drift and bias, have been addressed by putting in place procedures for bias training (e.g., Ingvarson & Hattie, 2008) and retraining raters, rescoring, and, in some cases, modeling rater severity using Rasch models (Gitomer, 2008b; Kellor, 2002; National Research Council, 2008; Shkolnik et al., 2007; Wenzel, Nagaoka, Morris, Billings, & Fendt, 2002). Because there is such a wide range of practices to account for rater agreement across the instruments and purposes of those instruments, it is difficult to generalize about the quality of scores except to say it is uneven. Instructional collections and artifact protocols examine evidence that is often produced as a regular part of teaching and learning. Perhaps in part because of this closeness to practice, instructional collections have high levels of face validity, and for at least some protocols, teachers report that

student practices, with much less teacher description and analysis called for than with other instructional collections. The protocols circumscribe what types of assignments are assessed, often asking for a mix of four to six typical and challenging assignments that produce written student work. Often researchers sample assignments across the school year and allow for some teacher choice in which assignment is assessed.
Both artifact and instructional collection instruments have been used for various purposes, ranging from formative feedback for the improvement of teaching practice to licensure and high-stakes advanced certification decisions. For example, the Scoop Notebook is an instructional collection protocol that has been used to improve professional practice (Borko et al., 2007; Borko, Stecher, Alonzo, Moncure, & McClam, 2005). The portfolio protocol for NBPTS certification is used as a part of a voluntary high-stakes assessment for advanced certification status (e.g., Cantrell, Fullerton, Kane, & Staiger, 2008; National Research Council, 2008; Szpara & Wylie, 2007). Related protocols have been used as part of licensure (e.g., California Commission on Teacher Credentialing, 2008; Connecticut State Department of Education, Bureau of Program and Teacher Evaluation, 2001), and three artifact protocols documented in the research literature have been used for the improvement of practice, judgments about school quality, and the evaluation of school reform models (Junker et al., 2006; Koh & Luke, 2009; Matsumura & Pascal, 2003; Mitchell et al., 2005; Newmann et al., 2001). These protocols vary in the degree to which they require the teacher to submit evidence that is naturalistic (i.e., already exists as a regular part of teaching practice) or evidence that is created specifically for inclusion in the assessment (e.g., written reflections or interviews). All of the protocols reviewed have been developed to reflect a community's view of quality teaching. In the high-stakes assessments (e.g., the now-redesigned Connecticut's Beginning Educator Support and Training [BEST] portfolio assessment9 and the NBPTS observation protocol), stakeholder

[Footnote 9: BEST has been redesigned and, as of the 2009–2010 school year, is now known as the Teacher Education and Mentoring (TEAM) Program. This paper considers BEST as it existed before the redesign.]

Validity challenges to the measurement of teacher practices. Across these different measures of teacher and practice, valid inferences about teaching quality will depend in large part on the ability to address the following issues. First, claims about teacher effectiveness must take into account contextual factors that individuals do not control. For example, as teachers are required to focus on test preparation activities, an increasingly common practice in recent years (Stecher, Vernez, & Steinberg, 2010), qualities of instruction valued by particular protocols may become less visible. Teachers who work within certain curricula may be judged to be more effective, not necessarily because of their own abilities, but because they are working with a curriculum that supports practices valued by particular measurement instruments (e.g., Cohen, 2010). Causal claims based on any single instrument may be inappropriate and can be better justified by considering evidence from multiple measures. Second, issues of bias and fairness need to be examined and addressed. As with other assessment measures, there must be vigilance to ensure that measures do not, for construct-irrelevant reasons, privilege teachers with particular backgrounds.
Aside from the NBPTS and Connecticut’s previous BEST instructional collection research, there is very little research to suggest the field understands the bias and fairness implications of specific protocols. This is understandable given the more formative uses of many of the instruments; however, as stakes are attached, this will not be an acceptable state of affairs for either legal or ethical reasons. Finally, implementation of these protocols is critical to the validity of the instrument for specific uses. Even if there is validity evidence for a particular measure, such evidence is dependent on implementing the protocols in particular ways, for example, with well-trained and calibrated raters. Using a protocol that has yielded valid inferences in one context with a specific set of processes in place does not guarantee that inferences made in a similar context with different implementation processes will yield those valid inferences. States and districts will have to monitor implementation specifics closely, given the budgetary and human capital constraints under which they will operate. PS N preparing an instructional collection improves their practice (e.g., Moss et al., 2004; Tucker, Stronge, Gareis, & Beers, 2003). Across protocols, however, teachers often feel they are burdensome. Evidence is modest and mixed on the relationship to teaching practice and student achievement, depending on the instrument under investigation (e.g., National Research Council, 2008; M. Wilson et al., 2006). Instruments that focus on evaluating the products of classroom interactions rather than the teacher’s commentary on those products in collections seem to have stronger evidence for a relationship to student learning (e.g., Cantrell et al., 2008; Matsumura et al., 2006). Consistent with this trend, there is a somewhat stronger, more moderate relationship between scores on artifact protocols and student achievement (Matsumura & Pascal, 2003; Matsumura et al., 2006; Mitchell et al., 2005; Newmann et al., 2001). This relationship may be due to the fact that artifact protocols are, by definition, more narrowly connected to teaching practice. If these instruments are to become more widely used in teacher evaluation, there will need to be a stronger understanding of teacher choice in selecting assignments and teacher-provided description and reflection. There will also have to be a stronger understanding of the role of grade-level, school, and district curricular decisions that could prove thorny when attributing scores to individual teachers. Teacher Beliefs This category represents a mix of various kinds of measures that have been used to target different constructs about teaching. They include measures that range from personal characteristics and teacher beliefs to abilities to make judgments on others’ teaching, typically through some type of survey or questionnaire methodology. Collectively, this body of work has tried to identify proxy measures of beliefs, attitudes, and understandings that could predict who would become a good teacher and that could provide guidance for individuals and systems as to whether individuals were suited to the profession of teaching, generally, and to particular teaching specialties and job placements, more specifically. 17 APA-HTA_V3-12-0603-020.indd 17 04/10/12 7:24 PM Gitomer and Bell CH O LO G IC A L A SS O CI A TI O N believes that teachers in general can determine student outcomes. 
Almost 50 years ago, Getzels and Jackson (1963) reviewed the extant literature linking personality characteristics to teaching quality. Finding relationships somewhat elusive, they highlighted three substantial obstacles that remain relevant in the 21st century. First, they raised the problem of defining personality. Although personality theory has certainly evolved substantially over the last half-century, the identification of personality characteristics that are theoretically and empirically important to teaching is still underspecified. Second, they argued that instrumentation and theory to measure personality was relatively weak. The reliance on correlations of measures without strong theories that link personality constructs to practice continues to persist (e.g., Fang, 1996). The third fundamental challenge is the limitation of the criterion measures—what are the measures of teacher quality that personality measures are referenced against? Typical criterion measures that Getzels and Jackson (1963) reviewed included principal ratings, teacher self-reports, and experience. As can be seen throughout this review, although great effort has been and is being made in defining quality of teaching, the issues are hardly resolved. Reviewing a large body of research, their conclusions were humbling:

The regrettable fact is that many of the studies have not produced significant results. Many others have produced only pedestrian findings. For example, it is said after the usual inventory tabulation that good teachers are friendly, cheerful, sympathetic, and morally virtuous rather than cruel, depressed, unsympathetic, and morally depraved. But when this has been said, not very much that is especially useful has been revealed . . . . What is needed is not research leading to the self-evident but to the discovery of specific and distinctive features of teacher personality and of the effective teacher. (Getzels & Jackson, 1963, p. 574)

In the ensuing years, efforts have been undertaken to make progress beyond this earlier state of affairs. A large body of work has focused on teacher efficacy—that is, the extent to which an individual

This work highlights the continuing challenges in clarifying the personality constructs of interest. Ashton and Webb (1986), Gibson and Dembo (1984), and Woolfolk and Hoy (1990) all made the distinction between beliefs about what teachers in general can do to affect student outcomes (teacher efficacy) and what he or she as an individual could do to affect student outcomes (personal efficacy). Guskey and Passaro (1994) rejected this distinction as an artifact of instrument design and instead argued that two factors of efficacy—internal and external locus of control—reflected the extent to which teachers viewed themselves as having the ability to influence student learning. This work builds on the finding of Armor et al. (1976), who did find a modest relationship between student achievement gains and a composite measure of teacher beliefs based on the following statements:

1. When it comes right down to it, a teacher really can't do much because most of a student's motivation and performance depends on his or her home environment.
2. If I try really hard, I can get through to even the most difficult or unmotivated student.

Students who showed the greatest gains had teachers who disagreed with the first statement and agreed with the second.
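To illustrate how a composite of this kind is typically formed and related to achievement, the sketch below (in Python; all values are fabricated for illustration and do not reproduce the Armor et al. data) reverse-scores the first statement, averages the two items into a composite, and correlates the composite with class-average achievement gains:

# Hypothetical illustration of a two-item teacher-belief composite;
# the first statement expresses low efficacy, so it is reverse-scored.
item1 = [5, 4, 2, 3, 1, 4, 2, 5, 3, 2]   # agreement with statement 1 (1-5 scale)
item2 = [2, 2, 4, 3, 5, 1, 4, 2, 3, 5]   # agreement with statement 2 (1-5 scale)
gains = [-0.2, -0.1, 0.3, 0.1, 0.5, -0.3, 0.2, -0.2, 0.0, 0.4]  # class-average gains

composite = [((6 - a) + b) / 2 for a, b in zip(item1, item2)]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(f"correlation between belief composite and gains: {pearson_r(composite, gains):.2f}")

The scoring logic, not the particular numbers, is the point: items keyed in opposite directions must be aligned before a composite can be interpreted or correlated with an outcome.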
The field continues to be characterized by, at best, modest correlations between measures of personality, dispositions, and beliefs and academic outcome measures. This, however, has not stopped the search for such measures. Metzger and Wu (2008) reviewed the available evidence for a widely used commercially available product, Gallup’s Teacher Perceiver Interview (TPI). They attributed the modest findings to possibilities that teachers’ responses in these self-report instruments may not be accurate reflections of their operating belief systems and that the manifestation of characteristics may be far more context bound than general instruments acknowledge. They concluded, as others have, that the constructs being examined are “slippery” (Metzger & Wu, 2008, p. 934) and multifaceted, making it very difficult to detect relationships. The validity argument for this group of measures is weak. 18 APA-HTA_V3-12-0603-020.indd 18 04/10/12 7:24 PM Evaluating Teaching and Teachers U N CO RR EC TE D PR O O FS © A M ER IC A TI O A CI SS O A L A IC G LO O CH Y PS N Both teachers and students contribute to teaching quality. The measures used to assess teaching quality through the assessment of student beliefs and practices may be considered. As Figure 20.1 indicates, some of the instruments being used to assess teacher beliefs and practices also assess student beliefs and practices. For example, on the holistic observation protocol called the Classroom Assessment Scoring System (CLASS) developed by Pianta, Hamre, Haynes, Mintz, and La Paro (2007), raters are trained to observe both teacher practices and student practices. Secondary classrooms that, for example, receive high scores on the quality of feedback dimension of CLASS would have students engaging in back-and-forth exchanges with the teacher, demonstrating persistence, and explaining their thinking in addition to all of the teacher’s actions specified in that dimension. This focus on both teacher and student practices is common across the observation protocols reviewed in this section. Many instruments are designed to measure student beliefs and practices on a wide range of topics from intelligence to self-efficacy to critical thinking (e.g., Dweck, 2002; Stein, Haynes, Redding, Ennis, & Cecil, 2007; Usher & Pajares, 2009). A summary of this research is outside the scope of this chapter, but only one identified belief instrument is being used to evaluate teachers. On the basis of a decade of work by Ferguson and his colleagues in the Tripod Project (Ferguson, 2007), the MET project is using the Tripod assessment to determine the degree to which students’ perceptions on seven topics are predictive of other aspects of teaching quality (Bill and Melinda Gates Foundation, 2011b). Preliminary evidence suggests the assessment is internally reliable (coefficient alpha > .80) when administered in such a way that there are no stakes for students and teachers (i.e., a research setting). Results on how the instrument functions in situations in which there are consequences for teachers have not yet been published. Student Knowledge Value-added models. Over recent years, there has been great enthusiasm for incorporating measures of student achievement into estimates of how well teachers are performing. This approach has led policy makers and researchers to advocate for the use of value-added measures to evaluate individual teachers. 
Value-added measures use complex analytic methods applied to longitudinal student achievement data to estimate teacher effects that are separate from other factors shaping student learning. Comprehensive, nontechnical treatments of valueadded approaches are presented by Braun (2005b) and the National Research Council and National Academy of Education (2010). The attraction of valued-added methods to many is that they are objective measures that avoid the complexities associated with human judgment. They are also relatively low cost once data systems are in place, and they do not require the human capital and ongoing attention required by many of the previously described measures. Finally, policy makers are attracted to the idea of applying a uniform metric to all teachers, provided test scores are available. Although these models are promising, they have important methodological and political limitations that represent challenges to the validity of inferences based on VAM (Braun, 2005b; Clotfelter et al., 2005; Gitomer, 2008a; Kupermintz, 2003; Ladd, 2007; Lockwood et al., 2007; National Research Council and National Academy of Education, 2010; Raudenbush, 2004; Reardon & Raudenbush, 2009). These challenges are summarized into two broad and related categories. These challenges are actually not unique to VAM. However, because VAM has been so widely endorsed in policy circles and because it is viewed as having an objective credibility that other measures do not, it is particularly important to highlight these challenges with respect to VAM. A first validity challenge concerns the nature of the construct. One distinguishes between teacher and teaching effectiveness because a variety of factors may influence the association of scores with an individual teacher. For example, school resources, particularly those targeted at instruction (e.g., Cohen, Raudenbush, & Ball, 2003), specific curricula (e.g., Cohen, 2010; Tyack & Cuban, 1995), and district polices that provide financial, technical, and professional support to achieve instructional goals N Student Beliefs and Student Practices 19 APA-HTA_V3-12-0603-020.indd 19 04/10/12 7:24 PM Gitomer and Bell TI O N continues to attempt to address these validity challenges and to understand the most appropriate use of VAM within evaluation systems. Researchers and policy makers vary in their confidence that these issues will be to the improvement of educational practice (for two distinct perspectives, see Baker et al., 2010; Glazerman et al., 2010). U N CO RR EC TE D PR O O FS © A M ER N PS Y CH O LO G IC A L A SS O CI A Student learning objectives. Evaluation policies must include all teachers. If student achievement is to be a core component of these evaluation systems, policy makers must address the fact that there are no annual achievement test data appropriate to evaluate roughly 50%–70% of teachers, either because of the subjects or grade levels that they teach. One of the solutions proposed has been the development of measures using student learning objectives (SLOs; Community Training and Assistance Center, 2008). In these models, teachers articulate a small set of objectives and appropriate assessments to demonstrate that students are learning important concepts and skills in their classrooms. SLOs are reviewed by school administrators for acceptability. Teachers are evaluated on the basis of how well the SLOs are achieved on the basis of assessment results. 
Because of the limited applicability of VAM, SLOs are being considered for use in numerous state teaching evaluation systems (e.g., Rhode Island, Maryland, and New York). Many of these models include the development of common SLOs for use across a state. The integrity of the process rests on the quality of the objectives and the rigor with which they are produced and reviewed inside the educational system. To date, there is a very limited set of data to judge the validity of these efforts. Available studies have found, first, that developing high-quality objectives that identify important learning goals is challenging. The Community Training and Assistance Center (2004) reported that for the first 3 years of a 4-year study, a majority of teachers produced SLOs that needed improvement. Teachers failed to identify important and coherent learning goals and had low expectations for students. Studies do report, however, that teachers with stronger learning goals tend to have students who demonstrate better achievement (Community Training and Assistance Center, 2004; Lussier & Forgione, 2010). There are A IC (e.g., Ladd, 2007) all can influence what gets taught and how it gets taught, potentially influencing the student test scores that are used to produce VAM estimates and inferences about the teacher. There are other interpretive challenges as well: Other adults (both parents and teachers) may contribute to student test results, and the limits of student tests may inappropriately constrain the inference to the teacher (for a broad discussion of construct-relevant concerns, see Baker et al., 2010). A second set of issues concerns the internal validity of VAM. One aspect of internal validity requires that VAM estimates are attributable to the experience of being in the classroom and not attributable to preexisting differences between students across different classrooms. Furthermore, internal validity requires that VAM estimates are not attributable to other potential modeling problems. Substantial treatment of these methodological issues associated with VAM is provided elsewhere (Harris & McCaffrey, 2010; McCaffrey, Lockwood, Koretz, & Hamilton, 2003; National Research Council and National Academy of Education, 2010; Reardon & Raudenbush, 2009). Key challenges include the fact that students are not randomly assigned to teachers within and across schools. This makes it difficult to interpret whether VAM effects are attributable to teachers or the entering characteristics of students (e.g., Clotfelter et al., 2005; Rothstein, 2009). Model assumptions that attempt to adjust for this sorting have been shown to be problematic (National Research Council and National Academy of Education, 2010; Reardon & Raudenbush, 2009). Finally, choices about the content of the test (e.g., Lockwood et al., 2007), the scaling (e.g., Ballou, 2008; Briggs, 2008; Martineau, 2006), and the fundamental measurement error inherent in achievement tests and especially growth scores can “undermine the trustworthiness of the results of value-added methods” (Linn, 2008, p. 13). Bringing together these two sets of validity concerns suggests that estimates of a particular teacher’s effectiveness may vary substantially as a function of the policies and practices in place for a given teacher, the assignment of students to teachers, and the particular tests and measurement models used to calculate VAM. 
Substantial research into VAM 20 APA-HTA_V3-12-0603-020.indd 20 04/10/12 7:24 PM Evaluating Teaching and Teachers also indications that across systems, SLOs can lead teachers to have stronger buy-in to the evaluation system than has been demonstrated with other evaluation approaches (Brodsky, DeCesare, & KramerWine, 2010). measurement questions will need to be addressed to address the Standards for Educational and Psychological Testing (AERA et al., 1999). Compensatory or Conjunctive Decisions One question concerns the nature of the decision embedded in the system. In a conjunctive system, individuals must satisfy a standard for each constituent measure, whereas in a compensatory system, individuals can do well on some measures and less well on others as long as a total score reaches some criterion. In a conjunctive model, the reliability of each individual measure ought to be sufficiently high such that decisions based on each individual measure are defensible. A compensatory model, such as that used by NBPTS, does not carry the same burden, but it does lead to situations in which someone can satisfy an overall requirement and perform quite poorly on constituent parts. One compromise that is sometimes taken is to adopt a compensatory model, yet set some minimum scores for particular measures. TI O N COMBINING MULTIPLE MEASURES CI SS O A L A IC G LO O CH Y PS N A FS © A M ER IC Standard 14.13—When decision makers integrate information from multiple tests or integrate test and nontest information, the role played by each test in the decision process should be clearly explicated, and the use of each test or test composite should be supported by validity evidence; Standard 14.16—Rules and procedures used to combine scores on multiple assessments to determine the overall outcome of a credentialing test should be reported to test takers, preferably before the test is administered. A Policy discussions are now facing the challenge of integrating information from the various measures considered thus far as well as measures that are specific to particular assessment purposes. The Standards for Educational and Psychological Testing (AERA et al., 1999) provide guidance on the use of multiple measures in decisions about employment and credentialing: TE D PR O O Current policies and practices are only beginning to be developed. For example, the U.S. Department of Education’s (2010) Race to the Top competition asks states to U N CO RR EC Design and implement rigorous, transparent, and fair evaluation systems for teachers and principals that (a) differentiate effectiveness using multiple rating categories that take into account data on student growth (as defined in this notice) as a significant factor, and (b) are designed and developed with teacher and principal involvement. (p. 34) How these multiple ratings are accounted for, however, is left unstated. As states and districts grapple with these issues, a number of fundamental Determining and Using Weighting Schemes Some proposed systems (e.g., Bill and Melinda Gates Foundation, 2010) are trying to establish a single metric of teacher effectiveness that is based on a composite of measures. Efforts like these attempt to determine the weighting of particular measures based on statistical models that will maximize the variance accounted for by particular measures. At least two complexities will need to be kept in mind by whatever weighting scheme is used. 
First, if two measures have the same “true” relationship with a criterion variable, the one that is scored more reliably will be more predictive of the criterion and thus will be assigned a greater weight. Because of the reliability of scoring, some measures, or dimensions of measures, may be viewed as more predictive of the outcome than they actually are when compared with other less reliable measures. A second source of potential complexity is that measures that have greater variability across individuals are likely to have a stronger impact on a total evaluation score and that the effective weighting will be far larger than the assigned weights would indicate. Imagine a system that derived a composite 21 APA-HTA_V3-12-0603-020.indd 21 04/10/12 7:24 PM Gitomer and Bell CO RR EC TE D PR O O FS © A M ER TI O A CI SS O IC A N PS Y CH O LO G IC A L A The exercise of judgment. Systems can range from those in which a single metric is derived from multiple measures via a mathematical model to ones in which decision makers are required to exercise a summative judgment that takes into account multiple measures. Systems that avoid judgment often do so because of a lack of trust in the judgment process. If judgment is valued, as it is in many highperforming education systems, then it will be imperative to ensure that judgments are executed in ways that are credible and transparent. Rare yet important teaching characteristics. Finally, there may be characteristics that do not contribute to variance on valued outcomes that should contribute to composite measures. For example, we may believe that teachers should not make factual errors in content or be verbally abusive to students. These might be rare events and do little to help distinguish between teachers; however, robust evaluation systems might want to include them to make standards of professional conduct clear. Weighting schemes that rely solely on quantitative measurable outcomes run the risk of ignoring these important characteristics. N CONCLUSION U An ambitious policy agenda that includes teacher evaluation as one of its cornerstones places an unprecedented obligation on the field of education measurement to design, develop, and validate measures of teaching quality. There is a pressing need for evaluation systems that can support the full range of purposes for which they are being considered—from employment and compensation decisions to professional development. Doing this responsibly obligates the field to uphold the fundamental principles and standards of education measurement in the face of enormous policy pressures. Well-intentioned policies will be successful only if they are supported by sound measurement practice. Building well-designed measures of effective teaching will require coordinated developments in theory, design, and implementation, along with ongoing monitoring processes. Through ongoing validation efforts, enhancements to each of these critical components should be realized. This process also will require implementation of imperfect systems that can be subject to continued examination and refinement. The discipline needs to continue to examine the fairness and validity of interpretations and develop processes that ensure high-quality and consistent implementation of whichever measures are being employed. Such quality control can range from ensuring quality judgments from individuals rating teacher performance to ensuring that adequate data quality controls are in place for valueadded modeling. 
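The effective-weight issue described above can be illustrated with a small simulation. The sketch below (in Python; all values are invented solely for illustration) forms a composite from a value-added score and a principal rating, each given a nominal weight of 50%, and shows that when principal ratings barely differentiate teachers, the value-added component supplies nearly all of the variance in the composite:

# Hypothetical illustration of nominal versus effective weights in a
# two-component composite; the distributions are invented for illustration.
import random
random.seed(1)

n_teachers = 1000
# Value-added scores that vary widely across teachers (standardized scale).
vam = [random.gauss(0.0, 1.0) for _ in range(n_teachers)]
# Principal ratings that barely differentiate teachers (nearly all rated about 3.9 of 4).
principal = [random.gauss(3.9, 0.1) for _ in range(n_teachers)]

def variance(scores):
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

composite = [0.5 * v + 0.5 * p for v, p in zip(vam, principal)]

# With approximately independent components, the composite variance is close
# to the sum of the weighted component variances, so each component's share
# of that variance indicates its effective weight.
share_vam = 0.25 * variance(vam) / variance(composite)
share_principal = 0.25 * variance(principal) / variance(composite)

print("nominal weights: 50% / 50%")
print(f"share of composite variance from value-added:      {share_vam:.1%}")
print(f"share of composite variance from principal rating: {share_principal:.1%}")

Standardizing components before weighting, or monitoring effective weights against the intended weights as scores are implemented, are ways a system might keep the two aligned.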
It is important that sound measurement practices be developed and deployed for these emerging evaluation systems, but there may be an additional benefit to careful measurement work in this area. Theories of teaching quality continue to be underdeveloped. Sound measures can contribute to both the testing of theory and the evolution of theories about teaching. For example, as educators understand more about how contextual factors influence teaching quality, theories of teaching will evolve. Understanding the relationship between context and teaching quality also may lead to the evolution and improvement of school and district decisions that shape student learning. For the majority of instruments reviewed in this chapter, their design can be considered first generation. Whether measures of teacher knowledge, instructional collections, or observation methods, there is a great deal to be done in terms of design of protocols, design of assessments and items, training and calibration of raters, aggregation of scores, and psychometric modeling. Even the understanding of expected psychometric performance on each class of N teaching quality score based on value-added and principal evaluation scores and also imagine that each was assigned a weight of 50%. Now imagine that the principal did not differentiate teachers much, if at all. In this case, even though each measure was assigned a weight of 50%, the value-added measure actually contributes almost all the variance to the total score. Thus, it is important not to just assign an intended weight but also to understand the effective weight given the characteristics of the scores when implemented (e.g., range, variance, measurement error, etc.). 22 APA-HTA_V3-12-0603-020.indd 22 04/10/12 7:24 PM Evaluating Teaching and Teachers U N CO RR EC TE D PR O O FS © A M ER IC A References American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Authors. Y CH O LO G IC A L A SS O CI A TI O N American Federation of Teachers. (2010). A continuous improvement model for teacher development and evaluation (Working paper). Washington, DC: Author. Armor, D., Conroy-Oseguera, P., Cox, M., King, N., McDonnell, L., Pascal, A., & Zellman, G. (1976). Analysis of the school preferred reading programs in selected Los Angeles minority schools (Report No. R-2007-LAUSD). Santa Monica, CA: RAND. Aschbacher, P. R. (1999). Developing indicators of classroom practice to monitor and support school reform (CSE Technical Report No. 513). Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA. Ashton, P. T., & Webb, R. B. (1986). Making a difference: Teacher efficacy and student achievement. White Plains, NY: Longman. Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., . . . Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers (EPI Briefing Paper No. 278). Washington, DC: Economic Policy Institute. Baker, S. K., Gersten, R., Haager, D., & Dingle, M. (2006). Teaching practice and the reading growth of first-grade English learners: Validation of an observation instrument. Elementary School Journal, 107, 199–220. doi:10.1086/510655 Ball, D. L., Thames, M. H., & Phelps, G. (2008). Content knowledge for teaching: What makes it special? Journal of Teacher Education, 59, 389–407. 
doi:10.1177/0022487108324554 Ballou, D. (2008, October). Value-added analysis: Issues in the economics literature. Paper presented at the workshop of the Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Educational Accountability, National Research Council, Washington, DC. Retrieved from http://www7.nationalacademies.org/ bota/VAM%20Analysis%20-%20Ballou.pdf Barnette, J. J., & Gorham, K. (1999). Evaluation of teacher preparation graduates by NCATE accredited institutions: Techniques used and barriers. Research in the Schools, 6(2), 55–62. Bell, C. A., Gitomer, D. H., McCaffrey, D., Hamre, B., Pianta, R., & Qi, Y. (in press). An argument approach to observation protocol validity. Educational Assessment, 17(2–3), 1–26. Bell, C. A., Little, O. M., & Croft, A. J. (2009, April). Measuring teaching practice: A conceptual review. Paper presented at the Annual Meeting of the American Educational Research Association, San Diego, CA. Bell, C. A., & Youngs, P. (2011). Substance and show: Understanding responses to NCATE accreditation. Teaching and Teacher Education, 27, 298–307. doi:10.1016/j.tate.2010.08.012 PS N measures is at a preliminary stage. Importantly, most of the work on these measures done to date has been conducted in the context of research studies. There is little empirical understanding of how these measures will work in practice, with all the unintended consequences, incentives, disincentives, and competing priorities that characterize education policy. There is at least one crucial aspect of the current policy conversation that may prove to be the Achilles’ heel of the new systems being developed, should it go unchecked. In general, all of the currently envisioned systems layer additional tasks, costs, and data management burdens on school, district, and state resources. Observations take principals’ time. SLOs take teachers’, principals’, and districts’ time. Student questionnaires take students’ time. Data systems that track all of these new measures require money and time. And the list goes on. These systems are massive because they are intended to apply to all teachers in every system. Serious consideration has not been given to how institutions can juggle existing resource demands with these new demands. The resource pressures these evaluation systems place on institutions may result in efficiencies, but they may also result in significant pressure to cut measurement corners that could pose threats to the validity of the systems. Such unintended consequences must be monitored carefully. Although the new evaluation systems will require substantial resources, the justification for moving beyond measures that simply assign a ranking is that these kinds of measures can provide helpful information to stakeholders about both high-quality teaching and the strengths and weaknesses of teachers and school organizations in providing students access to that teaching. Actualizing such a useful system will require commitments by researchers, policy makers, and practitioners alike to proceed in ways that support valid inferences about teaching quality. 23 APA-HTA_V3-12-0603-020.indd 23 04/10/12 7:24 PM Gitomer and Bell Bill and Melinda Gates Foundation. (2011a). Learning about teaching: Initial findings from the Measures of Effective Teaching project. Retrieved from http:// www.metproject.org/downloads/Preliminary_ Findings-Research_Paper.pdf Handbook of research on teaching (pp. 328–375). New York, NY: Macmillan. Buddin, R., & Zamarro, G. (2009). 
Teacher qualifications and student achievement in urban elementary schools. Journal of Urban Economics, 66, 103–115. doi:10.1016/j.jue.2009.05.001 Bill and Melinda Gates Foundation. (2011b). Intensive partnerships for effective teaching. Retrieved from http://www.gatesfoundation.org/college-readyeducation/Pages/intensive-partnerships-for-effectiveteaching.aspx TI O N Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46, 423–429. doi:10.1016/0895-4356(93)90018-V California Commission on Teacher Credentialing. (2008). California teaching performance assessment. Sacramento, CA: Author. CI A Borko, H. (2004). Professional development and teacher learning: Mapping the terrain. Educational Researcher, 33(8), 3–15. doi:10.3102/00131 89X033008003 SS O Cantrell, S., Fullerton, J., Kane, T. J., & Staiger, D. O. (2008). National Board certification and teacher effectiveness: Evidence from a random assignment experiment (NBER Working Paper No. 14608). Cambridge, MA: National Bureau of Economic Research. A L A Borko, H., Stecher, B., & Kuffner, K. (2007). Using artifacts to characterize reform-oriented instruction: The Scoop Notebook and rating guide (CSE Technical Report No. 707). Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA. LO G IC Carnegie Forum on Education and the Economy. (1986). A nation prepared: Teachers for the 21st century. New York, NY: Carnegie. Clotfelter, C., Ladd, H., & Vigdor, J. (2005). Who teaches whom? Race and the distribution of novice teachers. Economics of Education Review, 24, 377–392. doi:10.1016/j.econedurev.2004.06.008 Y CH O Borko, H., Stecher, B. M., Alonzo, A. C., Moncure, S., & McClam, S. (2005). Artifact packages for characterizing classroom practice: A pilot study. Educational Assessment, 10, 73–104. doi:10.1207/ s15326977ea1002_1 A M ER PS N A IC Boston, M., & Wolf, M. K. (2006). Assessing academic rigor in mathematics instruction: The development of the Instructional Quality Assessment Toolkit (CSE Technical Report No. 672). Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA. O O FS © Braun, H. I. (2005a). Using student progress to evaluate teachers: A primer on value-added models. Princeton, NJ: ETS. Retrieved from http://www.ets.org/Media/ Research/pdf/PICVAM.pdf TE D PR Braun, H. I. (2005b). Value-added modeling: What does due diligence require? In R. Lissitz (Ed.), Valueadded models in education: Theory and applications (pp. 19–39). Maple Grove, MN: JAM Press. U N CO RR EC Briggs, D. (2008, November). The goals and uses of valueadded models. Paper presented at the workshop of the Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Educational Accountability, National Research Council, Washington, DC. Retrieved from http:// www7.nationalacademies.org/bota/VAM%20 Goals%20and%20Uses%20paper%20-%20Briggs.pdf Brodsky, A., DeCesare, D., & Kramer-Wine, J. (2010). Design and implementation considerations for alternative teacher compensation systems. Theory Into Practice, 49, 213–222. doi:10.1080/00405841.2010.487757 Brophy, J., & Good, T. L. (1986). Teacher behavior and student achievement. In M. C. Wittrock (Ed.), Cochran-Smith, M., & Zeichner, K. M. (Eds.). (2005). Studying teacher education: The report of the AERA panel on research and teacher education. Mahwah, NJ: Erlbaum. Cohen, D. 
(2010). Teacher quality: An American educational dilemma. In M. M. Kennedy (Ed.), Teacher assessment and the quest for teacher quality: A handbook (pp. 375–402). San Francisco, CA: Jossey-Bass. Cohen, D., Raudenbush, S., & Ball, D. (2003). Resources, instruction, and research. Educational Evaluation and Policy Analysis, 25, 119–142. doi:10.3102/01623737025002119 Community Training and Assistance Center. (2004). Catalyst for change: Pay for performance in Denver. Boston, MA: Author. Community Training and Assistance Center. (2008). Tying earning to learning: The link between teacher compensation and student learning objectives. Boston, MA: Author. Connecticut State Department of Education, Bureau of Program and Teacher Evaluation. (2001). A guide to the BEST program for beginning teachers. Hartford, CT: Author. Danielson, C. (1996). Enhancing professional practice: A framework for teaching. Alexandria, VA: Association for Supervision and Curriculum Development. Danielson, C., & McGreal, T. L. (2000). Teacher evaluation to enhance professional practice. Alexandria, VA: Association for Supervision and Curriculum Development. 24 APA-HTA_V3-12-0603-020.indd 24 04/10/12 7:24 PM Evaluating Teaching and Teachers Desimone, L., Porter, A. C., Garet, M., Suk Yoon, K., & Birman, B. (2002). Effects of professional development on teachers’ instruction: Results from a three-year longitudinal study. Educational Evaluation and Policy Analysis, 24, 81–112. doi:10.3102/01623737024002081 of admissions and licensure testing (ETS Teaching and Learning Report Series No. ETS RR-03-25). Princeton, NJ: ETS. Gitomer, D. H., & Qi, Y. (2010). Score trends for Praxis II. Washington, DC: U.S. Department of Education, Office of Planning, Evaluation and Policy Development, Policy and Program Studies Service. Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D., Raudenbush, S., & Whitehurst, G. (2010). Evaluating teachers the important role of value-added. Washington, DC: Brown Center on Education. Educational Testing Service. (2001). PRAXIS III: Classroom performance assessments orientation guide. Princeton, NJ: Author. Goe, L. (2007). The link between teacher quality and student outcomes. Washington, DC: National Comprehensive Center for Teacher Quality. Fang, Z. (1996). A review of research on teacher beliefs and practices. Educational Research, 38, 47–65. doi:10.1080/0013188960380104 Goldhaber, D., & Hansen, M. (2010). Race, gender, and teacher testing: How informative a tool is teacher licensure testing? American Educational Research Journal, 47, 218–251. doi:10.3102/0002831209348970 L A SS O CI A TI O N Dweck, C. S. (2002). The development of ability conceptions. In A. Wigfield & J. S. Eccles (Eds.), Development of achievement motivation (pp. 57–88). San Diego, CA: Academic Press. doi:10.1016/B978012750053-9/50005-X IC A Ferguson, R. F. (2007). Toward excellence with equity: An emerging vision for closing the achievement gap. Boston, MA: Harvard Education Press. G Gonzales, P., Williams, T., Jocelyn, L., Roey, S., Kastberg, D., & Brenwald, S. (2008). Highlights from TIMSS 2007: Mathematics and science achievement of U.S. fourth- and eighth-grade students in an international context (NCES 2009-001 Revised). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Retrieved from http://nces.ed.gov/ pubs2009/2009001.pdf IC O CH Y PS N A Getzels, J. W., & Jackson, P. W. (1963). The teacher’s personality and characteristics. In N. L. 
Gage (Ed.), Handbook of research on teaching (pp. 506–582). Chicago, IL: Rand McNally. LO Gersten, R., Baker, S. K., Haager, D., & Graves, A. W. (2005). Exploring the role of teacher quality in predicting reading outcomes for first-grade English learners. Remedial and Special Education, 26, 197– 206. doi:10.1177/07419325050260040201 A M ER Gibson, S., & Dembo, M. (1984). Teacher efficacy: A construct validation. Journal of Educational Psychology, 76, 569–582. doi:10.1037/0022-0663.76.4.569 O FS © Gitomer, D. H. (2007). Teacher quality in a changing policy landscape: Improvements in the teacher pool (ETS Policy Information Report No. PIC-TQ). Princeton, NJ: ETS. TE D PR O Gitomer, D. H. (2008a). Crisp measurement and messy context: A clash of assumptions and metaphors— Synthesis of Section III. In D. H. Gitomer (Ed.), Measurement issues and the assessment for teacher quality (pp. 223–233). Thousand Oaks, CA: Sage. U N CO RR EC Gitomer, D. H. (2008b). Reliability and NBPTS assessments. In L. Ingvarson & J. Hattie (Eds.), Assessing teachers for professional certification: The first decade of the National Board for professional teaching standards (pp. 231–253). Greenwich, CT: JAI Press. doi:10.1016/S1474-7863(07)11009-7 Gitomer, D. H., & Bell, C. A. (2012, August). The instructional challenge in improving instruction: Lessons from a classroom observation protocol. Paper presented at the European Association for Research on Learning and Instruction Sig 18 Conference, Zurich, Switzerland. Gitomer, D. H., Latham, A. S., & Ziomek, R. (1999). The academic quality of prospective teachers: The impact Gordon, R., Kane, T. J., & Staiger, D. O. (2006). Identifying effective teachers using performance on the job (Hamilton Project Discussion Paper). Washington, DC: Brookings Institution. Grossman, P., Loeb, S., Cohen, J., Hammerness, K., Wyckoff, J., Boyd, D., & Lankford, H. (2010, May). Measure for measure: The relationship between measures of instructional practice in middle school English language arts and teachers’ value-added scores (NBER Working Paper No. 16015). Cambridge, MA: National Bureau of Economic Research. Gullickson, A. R. (2008). The personnel evaluation standards: How to assess systems for evaluating educators. Thousand Oaks, CA: Corwin Press. Guskey, T. R., & Passaro, P. D. (1994). Teacher efficacy: A study of construct dimensions. American Educational Research Journal, 31, 627–643. Haertel, E. H. (1991). New forms of teacher assessment. Review of Research in Education, 17, 3–29. Hamre, B. K., Pianta, R. C., Downer, J. T., DeCoster, J., Mashburn, A. J., Jones, S., . . . Hakigami, A. (in press). Teaching through interactions: Testing a developmental framework of teacher effectiveness in over 4,000 classrooms. Elementary School Journal. Haney, W. M., Madaus, G., & Kreitzer, A. (1987). Charms talismanic: Testing teachers for the improvement of education. Review of Research in Education, 14, 169–238. 25 APA-HTA_V3-12-0603-020.indd 25 04/10/12 7:24 PM Gitomer and Bell Hanushek, E. A. (2002). Teacher quality. In L. T. Izumi & W. M. Evers (Eds.), Teacher quality (pp. 1–13). Stanford, CA: Hoover Institution Press. A report to the Wallace Foundation. Seattle, WA: The Center for the Study of Teaching and Policy. Horizon Research. (2000). Inside classroom observation and analytic protocol. Chapel Hill, NC: Author. Harris, D., & McCaffrey, D. (2010). Value-added: Assessing teachers’ contributions to student achievement. In M. M. 
Kennedy (Ed.), Teacher assessment and the quest for teacher quality: A handbook (pp. 251–282). San Francisco, CA: Jossey-Bass. Howard, B. B., & Gullickson, A. R. (2010). Setting standards in teacher evaluation. In M. M. Kennedy (Ed.), Teacher assessment and the quest for teacher quality: A handbook (pp. 337–354). San Francisco, CA: Jossey-Bass. TI O N Harris, D. N., & Sass, T. R. (2006). Value-added models and the measurement of teacher quality. Unpublished manuscript. A Ingvarson, L., & Hattie, J. (Eds.). (2008). Assessing teachers for professional certification: The first decade of the National Board for Professional Teaching Standards. Greenwich, CT: JAI Press. SS O CI Heneman, H. G., Milanowski, A., Kimball, S. M., & Odden, A. (2006). Standards-based teacher evaluation as a foundation for knowledge- and skill-based pay (CPRE Policy Briefs No. RB-45). Philadelphia, PA: Consortium for Policy Research in Education, University of Pennsylvania. A L A Jaeger, R. M. (1999). Some psychometric criteria for judging the quality of teacher certification tests. Paper commissioned by the Committee on Assessment and Teacher Quality. Greensboro: University of North Carolina. IC Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., & Ball, D. L. (2008). Mathematical knowledge for teaching and the mathematical quality of instruction: An exploratory study. Cognition and Instruction, 26, 430–511. doi:10.1080/07370000802177235 G LO Y CH O © A M ER PR O O FS Hill, H. C., Schilling, S. G., & Ball, D. L. (2004). Developing measures of teachers’ mathematics knowledge for teaching. Elementary School Journal, 105, 11–30. doi:10.1086/428763 EC TE D Hill, H. C., Umland, K. L., & Kapitula, L. R. (2011). A validity argument approach to evaluating valueadded scores. American Educational Research Journal, 48, 794–831. doi:10.3102/0002831210387916 CO RR Hines, L. M. (2007). Return of the thought police? The history of teacher attitude adjustment. Education Next, 7(2), 58–65. Retrieved from http://educationnext.org/return-of-the-thought-police U N Hirsch, E., & Sioberg, A. (2010). Using teacher working conditions survey data in the North Carolina educator evaluation process. Santa Cruz, CA: New Teacher Center. Retrieved from http://ncteachingconditions. org/sites/default/files/attachments/NC10_brief_ TeacherEvalGuide.pdf Honig, M. I., Copland, M. A., Rainey, L., Lorton, J. A., & Newton, M. (2010, April). School district central office transformation for teaching and learning improvement: N PS Johnson, S. M., Kardos, S. K., Kauffman, D., Liu, E., & Donaldson, M. L. (2004). The support gap: New teachers’ early experiences in high-income and lowincome schools. Education Policy Analysis Archives, 12(61). Retrieved from http://epaa.asu.edu/ojs/ article/viewFile/216/342 A IC Hill, H. C., Dean, C., & Goffney, I. M. (2007). Assessing elemental and structural validity: Data from teachers, non-teachers, and mathematicians. Measurement: Interdisciplinary Research and Perspectives, 5(2–3), 81–92. doi:10.1080/15366360701486999 Hill, H. C., Rowan, B., & Ball, D. L. (2005). Effects of teachers’ mathematical knowledge for teaching on student achievement. American Educational Research Journal, 42, 371–406. doi:10.3102/00028312042002371 Johnson, S. M., Birkeland, S. E., Kardos, S. K., Kauffman, D., Liu, E., & Peske, H. G. (2001, September/October). Retaining the next generation of teachers: The importance of school-based support. Harvard Education Letter. 
Junker, B., Weisberg, Y., Matsumura, L. C., Crosson, A., Wolf, M. K., Levison, A., & Resnick, L. (2006). Overview of the Instructional Quality Assessment (CSE Technical Report No. 671). Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA.

Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (pp. 17–64). New York, NY: Praeger.

Kane, T. J., Rockoff, J. E., & Staiger, D. O. (2006). What does certification tell us about teacher effectiveness? Evidence from New York City. New York, NY: National Bureau of Economic Research.

Kardos, S. K., & Johnson, S. M. (2007). On their own and presumed expert: New teachers’ experiences with their colleagues. Teachers College Record, 109, 2083–2106.

Kellor, E. M. (2002). Performance-based licensure in Connecticut (CPRE-UW Working Paper Series TC-02-10). Madison, WI: Consortium for Policy Research in Education.

Kennedy, M. M. (2010). Approaches to annual performance assessment. In M. M. Kennedy (Ed.), Teacher assessment and the quest for teacher quality: A handbook (pp. 225–250). San Francisco, CA: Jossey-Bass.

Klein, S. P., & Stecher, B. (1991). Developing a prototype licensing examination for secondary school teachers. Journal of Personnel Evaluation in Education, 5, 169–190. doi:10.1007/BF00117336

Koh, K., & Luke, A. (2009). Authentic and conventional assessment in Singapore schools: An empirical study of teacher assignments and student work. Assessment in Education: Principles, Policy, and Practice, 16, 291–318.

Koretz, D. (2008). Measuring up: What educational testing really tells us. Cambridge, MA: Harvard University Press.

Kornfeld, J., Grady, K., Marker, P. M., & Ruddell, M. R. (2007). Caught in the current: A self-study of state-mandated compliance in a teacher education program. Teachers College Record, 109, 1902–1930.

Krauss, S., Baumert, J., & Blum, W. (2008). Secondary mathematics teachers’ pedagogical content knowledge and content knowledge: Validation of the COACTIV constructs. ZDM—The International Journal on Mathematics Education, 40, 873–892.

Kupermintz, H. (2003). Teacher effects and teacher effectiveness: A validity investigation of the Tennessee Value Added Assessment System. Educational Evaluation and Policy Analysis, 25, 287–298. doi:10.3102/01623737025003287
Kyriakides, L., & Creemers, B. P. M. (2008). A longitudinal study on the stability over time of school and teacher effects on student outcomes. Oxford Review of Education, 34, 521–545. doi:10.1080/03054980701782064

Ladd, H. F. (2007, November). Holding schools accountable revisited. Paper presented at the APPAM Fall Research Conference, Washington, DC.

La Paro, K. M., Pianta, R. C., & Stuhlman, M. (2004). The classroom assessment scoring system: Findings from the prekindergarten year. Elementary School Journal, 104, 409–426. doi:10.1086/499760

Linn, R. L. (2008, November 13–14). Measurement issues associated with value-added models. Paper presented at the workshop of the Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Educational Accountability, National Research Council, Washington, DC. Retrieved from http://www7.nationalacademies.org/bota/VAM_Robert_Linn_Paper.pdf

Livingston, S., & Zieky, M. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: ETS.

Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B. M., Le, V., & Martinez, J. F. (2007). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Journal of Educational Measurement, 44, 47–67. doi:10.1111/j.1745-3984.2007.00026.x

Lussier, D. F., & Forgione, P. D., Jr. (2010). Supporting and rewarding accomplished teaching: Insights from Austin, Texas. Theory Into Practice, 49, 233–242. doi:10.1080/00405841.2010.487771

Luyten, H. (2003). The size of school effects compared to teacher effects: An overview of the research literature. School Effectiveness and School Improvement, 14, 31–51. doi:10.1076/sesi.14.1.31.13865

Malmberg, L. E., Hagger, H., Burn, K., Mutton, T., & Colls, H. (2010). Observed classroom quality during teacher education and two years of professional practice. Journal of Educational Psychology, 102, 916–932. doi:10.1037/a0020920

Martineau, J. A. (2006). Distorting value-added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 31, 35–62. doi:10.3102/10769986031001035

Matsumura, L. C., Garnier, H., Slater, S. C., & Boston, M. (2008). Toward measuring instructional interactions at-scale. Educational Assessment, 13, 267–300. doi:10.1080/10627190802602541

Matsumura, L. C., & Pascal, J. (2003). Teachers’ assignments and student work: Opening a window on classroom practice (CSE Report No. 602). Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA.

Matsumura, L. C., Slater, S. C., Junker, B., Peterson, M., Boston, M., Steele, M., & Resnick, L. (2006). Measuring reading comprehension and mathematics instruction in urban middle schools: A pilot study of the Instructional Quality Assessment (CSE Technical Report No. 681). Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA.

McCaffrey, D. F. (2011, April). Sources of variance and mode effects in measures of teaching in algebra classes. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating value-added models for teacher accountability. Santa Monica, CA: RAND.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and Macmillan.

Metzger, S. A., & Wu, M. J. (2008). Commercial teacher selection instruments: The validity of selecting teachers through beliefs, attitudes, and values. Review of Educational Research, 78, 921–940. doi:10.3102/0034654308323035

Milanowski, A. (2004). The relationship between teacher performance evaluation scores and student achievement: Evidence from Cincinnati. Peabody Journal of Education, 79(4), 33–53. doi:10.1207/s15327930pje7904_3
Mitchell, K., Shkolnik, J., Song, M., Uekawa, K., Murphy, R., & Means, B. (2005). Rigor, relevance, and results: The quality of teacher assignments and student work in new and conventional high schools. Washington, DC: American Institutes for Research and SRI.

Molnar, A., Smith, P., Zahorik, J., Palmer, A., Halbach, A., & Ehrle, K. (1999). Evaluating the SAGE program: A pilot program in targeted pupil-teacher reduction in Wisconsin. Educational Evaluation and Policy Analysis, 21, 165–177.

Moss, P. A., & Schutz, A. (1999). Risking frankness in educational assessment. Phi Delta Kappan, 80, 680–687.

Moss, P. A., Sutherland, L. M., Haniford, L., Miller, R., Johnson, D., Geist, P. K., . . . Pecheone, R. L. (2004). Interrogating the generalizability of portfolio assessments of beginning teachers: A qualitative study. Education Policy Analysis Archives, 12(32), 1–70.

National Board for Professional Teaching Standards. (2010). Retrieved from http://nbpts.org/the_standards

National Research Council. (2001). Testing teacher candidates: The role of licensure tests in improving teacher quality. Washington, DC: National Academies Press.

National Research Council. (2008). Assessing accomplished teaching: Advanced-level certification programs. Washington, DC: National Academies Press.

National Research Council and National Academy of Education. (2010). Getting value out of value-added: Report of a workshop. Washington, DC: National Academies Press.

Newmann, F. M., Bryk, A. S., & Nagaoka, J. K. (2001). Authentic intellectual work and standardized tests: Conflict or coexistence? Chicago, IL: Consortium on Chicago School Research.

No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).

Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26, 237–257. doi:10.3102/01623737026003237

Oakes, J. (1987). Tracking in secondary schools: A contextual perspective. Santa Monica, CA: RAND.

Odden, A., & Kelley, C. (2002). Paying teachers for what they know and can do: New and smarter compensation strategies to improve student learning. Thousand Oaks, CA: Corwin Press.

Ohio Department of Education. (2006). Report on the quality of teacher education in Ohio, 2004–2005. Columbus, OH: Author.

Pacheco, A. (2008). Mapping the terrain of teacher quality. In D. H. Gitomer (Ed.), Measurement issues and assessment for teacher quality (pp. 160–178). Thousand Oaks, CA: Sage.

Pearlman, M. (2008). The design architecture of NBPTS certification assessments. In L. Ingvarson & J. Hattie (Eds.), Assessing teachers for professional certification: The first decade of the National Board for Professional Teaching Standards (pp. 55–91). Greenwich, CT: JAI Press. doi:10.1016/S1474-7863(07)11003-6

Phelps, G. (2009). Just knowing how to read isn’t enough! What teachers know about the content of reading. Educational Assessment, Evaluation, and Accountability, 21, 137–154. doi:10.1007/s11092-009-9070-6

Phelps, G., & Schilling, S. (2004). Developing measures of content knowledge for teaching reading. Elementary School Journal, 105, 31–48. doi:10.1086/428764

Pianta, R. C., Hamre, B. K., Haynes, N. J., Mintz, S. L., & La Paro, K. M. (2007). Classroom assessment scoring system manual, middle/secondary version. Charlottesville: University of Virginia.

Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2007). Classroom assessment scoring system. Baltimore, MD: Brookes.

Pianta, R. C., Mashburn, A. J., Downer, J. T., Hamre, B. K., & Justice, L. (2008). Effects of web-mediated professional development resources on teacher–child interactions in pre-kindergarten classrooms. Early Childhood Research Quarterly, 23, 431–451. doi:10.1016/j.ecresq.2008.02.001

Piburn, M., & Sawada, D. (2000). Reformed Teaching Observation Protocol (RTOP) reference manual. Tempe: Arizona State University.

Pilley, J. G. (1941). The National Teacher Examination Service. School Review, 49, 177–186. doi:10.1086/440636

Popham, W. J. (1992). Appropriate expectations for content judgments regarding teacher licensure tests. Applied Measurement in Education, 5, 285–301. doi:10.1207/s15324818ame0504_1

Programme for International Student Assessment. (2006). PISA 2006 science competencies for tomorrow’s world. Organisation for Economic Co-operation and Development. Retrieved from http://www.oei.es/evaluacioneducativa/InformePISA2006-FINALingles.pdf

Pullin, D. (1999). Criteria for evaluating teacher tests: A legal perspective. Washington, DC: National Academies Press.

Pullin, D. (2010). Judging teachers: The law of teacher dismissals. In M. M. Kennedy (Ed.), Teacher assessment and the quest for teacher quality: A handbook (pp. 297–333). San Francisco, CA: Jossey-Bass.
Raudenbush, S. W. (2004). What are value-added models estimating and what does this imply for statistical practice? Journal of Educational and Behavioral Statistics, 29, 121–129. doi:10.3102/10769986029001121

Raudenbush, S. W., Martinez, A., Bloom, H., Zhu, P., & Lin, F. (2010). Studying the reliability of group-level measures with implications for statistical power: A six-step paradigm (Working paper). Chicago, IL: University of Chicago.

Raudenbush, S. W., Martinez, A., & Spybrook, J. (2007). Strategies for improving precision in group-randomized experiments. Educational Evaluation and Policy Analysis, 29, 5–29. doi:10.3102/0162373707299460

Raver, C. C., Jones, S. M., Li-Grining, C. P., Metzger, M., Smallwood, K., & Sardin, L. (2008). Improving preschool classroom processes: Preliminary findings from a randomized trial implemented in Head Start settings. Early Childhood Research Quarterly, 23, 10–26. doi:10.1016/j.ecresq.2007.09.001

Ravitch, D. (2010). The death and life of the great American school system: How testing and choice are undermining education. New York, NY: Basic Books.

Reardon, S. F., & Raudenbush, S. W. (2009). Assumptions of value-added models for estimating school effects. Education Finance and Policy, 4, 492–519. doi:10.1162/edfp.2009.4.4.492

Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4, 537–571. doi:10.1162/edfp.2009.4.4.537

Rowan, B., Camburn, E., & Correnti, R. (2004). Using teacher logs to measure the enacted curriculum in large-scale surveys: Insights from the Study of Instructional Improvement. Elementary School Journal, 105, 75–101. doi:10.1086/428803
Rowan, B., & Correnti, R. (2009). Studying reading instruction with teacher logs: Lessons from A Study of Instructional Improvement. Educational Researcher, 38(2), 120–131. doi:10.3102/0013189X09332375

Rowan, B., Correnti, R., & Miller, R. J. (2002). What large-scale, survey research tells us about teacher effects on student achievement: Insights from the Prospects study of elementary schools. Teachers College Record, 104, 1525–1567. doi:10.1111/1467-9620.00212

Rowan, B., Jacob, R., & Correnti, R. (2009). Using instructional logs to identify quality in educational settings. New Directions for Youth Development, 2009(121), 13–31. doi:10.1002/yd.294

Samaras, A. P., Francis, S. L., Holt, Y. D., Jones, T. W., Martin, D. S., Thompson, J. L., & Tom, A. R. (1999). Lived experiences and reflections of joint NCATE-state reviews. Teacher Educator, 35, 68–83. doi:10.1080/08878739909555218

Schilling, S. G., & Hill, H. C. (2007). Assessing measures of mathematical knowledge for teaching: A validity argument approach. Measurement: Interdisciplinary Research and Perspectives, 5(2–3), 70–80. doi:10.1080/15366360701486965

Shkolnik, J., Song, M., Mitchell, K., Uekawa, K., Knudson, J., & Murphy, R. (2007). Changes in rigor, relevance, and student learning in redesigned high schools. Washington, DC: American Institutes for Research and SRI.

Shulman, L. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2), 4–14.

Stecher, B. M., Vernez, G., & Steinberg, P. (2010). Reauthorizing No Child Left Behind: Facts and recommendations. Santa Monica, CA: RAND.

Stein, B., Haynes, A., Redding, M., Ennis, T., & Cecil, M. (2007). Assessing critical thinking in STEM and beyond. In M. Iskander (Ed.), Innovations in e-learning, instruction technology, assessment, and engineering education (pp. 79–82). Dordrecht, the Netherlands: Springer. doi:10.1007/978-1-4020-6262-9_14

Szpara, M. Y., & Wylie, E. C. (2007). Writing differences in teacher performance assessments: An investigation of African American language and edited American English. Applied Linguistics, 29, 244–266. doi:10.1093/applin/amm003

Taylor, B. M., Pearson, P. D., Peterson, D. S., & Rodriguez, M. C. (2003). Reading growth in high-poverty classrooms: The influence of teacher practices that encourage cognitive engagement in literacy learning. Elementary School Journal, 104, 3–28. doi:10.1086/499740

Toch, T., & Rothman, R. (2008). Rush to judgment: Teacher evaluation in public education. Washington, DC: Education Sector.

Tucker, P. D., Stronge, J. H., Gareis, C. R., & Beers, C. S. (2003). The efficacy of portfolios for teacher evaluation and professional development: Do they make a difference? Educational Administration Quarterly, 39, 572–602. doi:10.1177/0013161X03257304

Turque, B. (2010, July 24). Rhee dismisses 241 D.C. teachers; union vows to contest firings. Washington Post. Retrieved from http://www.washingtonpost.com/wp-dyn/content/article/2010/07/23/AR2010072303093.html

Tyack, D., & Cuban, L. (1995). Tinkering toward utopia: A century of public school reform. Cambridge, MA: Harvard University Press.

U.S. Department of Education. (2010). Race to the Top program: Executive summary. Retrieved from http://www2.ed.gov/programs/racetothetop/executive-summary.pdf

Usher, E. L., & Pajares, F. (2009). Sources of self-efficacy in mathematics: A validation study. Contemporary Educational Psychology, 34, 89–101. doi:10.1016/j.cedpsych.2008.09.002

Wasley, P. (2006, June 16). Accreditor of education schools drops controversial “social justice” language. Chronicle of Higher Education, p. A13.

Wayne, A. J., & Youngs, P. (2003). Teacher characteristics and student achievement gains: A review. Review of Educational Research, 73, 89–122. doi:10.3102/00346543073001089

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. New York, NY: The New Teacher Project.

Wenzel, S., Nagaoka, J. K., Morris, L., Billings, S., & Fendt, C. (2002). Documentation of the 1996–2002 Chicago Annenberg Research Project Strand on Authentic Intellectual Demand exhibited in assignments and student work: A technical process manual. Chicago, IL: Consortium on Chicago School Research.

Wilson, M., Hallam, P. J., Pecheone, R., & Moss, P. (2006, April). Using student achievement test scores as evidence of external validity for indicators of teacher quality: Connecticut’s Beginning Educator Support and Training Program. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Wilson, S. (2008). Measuring teacher quality for professional entry. In D. H. Gitomer (Ed.), Measurement issues and assessment for teaching quality (pp. 8–29). Thousand Oaks, CA: Sage.

Wilson, S. M., & Youngs, P. (2005). Research on accountability processes in teacher education. In M. Cochran-Smith & K. M. Zeichner (Eds.), Studying teacher education: The report of the AERA panel on research and teacher education (pp. 591–643). Mahwah, NJ: Erlbaum.

Woolfolk, A. E., & Hoy, W. K. (1990). Prospective teachers’ sense of efficacy and beliefs about control. Journal of Educational Psychology, 82, 81–91. doi:10.1037/0022-0663.82.1.81