CHAPTER 20
EVALUATING TEACHING
AND TEACHERS
on students’ academic growth to achieve funding
(U.S. Department of Education, 2010). These and
other recent research and policy developments are
changing the way the assessment of teaching is
understood. The goal of this chapter is to provide an
overview and structure to facilitate readers’ understanding of the emerging landscape and attendant
assessment issues.
As well described in a number of recent reports,
current evaluation processes suffer from a number
of problems (Toch & Rothman, 2008; Weisberg,
Sexton, Mulhern, & Keeling, 2009). For example,
the New Teacher Project surveyed evaluation practices in several districts, large and small, and found that teachers were almost all rated highly. In
systems that used binary ratings (i.e., satisfactory or
unsatisfactory), almost 99% of teachers were rated
satisfactory. To complicate matters, the same administrators who gave all teachers high marks also
recognized that staff members varied greatly in performance and some were actually poor teachers. In
addition to an inability to sort teachers, current processes generally do not give teachers useful information to improve their practice, and policy makers do not find the evaluation process credible
(Weisberg et al., 2009).
Measures of teaching should be seen from a
validity perspective, and thus, it is critical to begin
with the purpose and use of the assessment. As
Messick (1989) argued, validity is not an inherent
Almost everything related to the assessment and
evaluation of teaching in the United States is undergoing restructuring. Purposes and uses, data sources,
analytic methods, assessment contexts, and policy
are all being developed, refined, and reconsidered
within a cauldron of research, development, and
policy activity. For example, the District of Columbia made headlines when it announced the firing of
241 teachers based, in part, on poor performance
results from their new evaluation system, IMPACT
(Turque, 2010). The Bill and Melinda Gates Foundation has funded the Measures of Effective Teaching
(MET) study, a $45 million study designed to test
the ways in which a range of measures including
scores on observation protocols, student engagement
data, and value-added test scores might be combined
into a single teaching evaluation metric (Bill and
Melinda Gates Foundation, 2011a). The Foundation
is also spending $290 million in four communities
in intensive partnerships to reform how teachers are
recruited, developed, rewarded, and retained (Bill
and Melinda Gates Foundation, 2011b). In addition
to pressure from districts and private funders,
unions have also pressed for revised standards of
teacher evaluation (e.g., American Federation of
Teachers [AFT], 2010). Perhaps the most consequential contemporary effort is the federally funded
Race to the Top Fund that encourages states to
implement teacher evaluation systems based on multiple measures with a significant component based
Drew H. Gitomer and Courtney A. Bell
The authors would like to thank Andrew Croft, Daniel Eignor, Laura Goe, Heather Hill, Daniel McCaffrey, and Joan Snowden for their careful review
of the manuscript. A special thank you to Andrew Croft and Evelyn Fisch for their assistance in preparing the manuscript.
Each of the authors contributed equally to the preparation of this chapter.
DOI: 10.1037/14049-020
APA Handbook of Testing and Assessment in Psychology: Vol. 3. Testing and Assessment in School Psychology and Education,
K. F. Geisinger (Editor-in-Chief)
Copyright © 2013 by the American Psychological Association. All rights reserved.
composite measures in the context of psychological
assessments within clinical contexts, current validity
research does not address how scores from multiple
measures might be combined or considered jointly
in the evaluation of teachers. The validity argument
for inferences based on multiple measures introduces an additional layer of complexity because support is needed for the composite inference and not
simply inferences based on individual measures. As
almost all the current teacher evaluation schemes
are contemplating some use of multiple measures,
more specific guidance is needed.
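To make concrete what combining multiple measures can involve, the following sketch shows one simple approach a system might take: standardize each measure and form a weighted average. This is an illustration only; the measure names, weights, and scores are hypothetical and do not represent the MET study's procedures or any operational evaluation system.

```python
import numpy as np

# Hypothetical scores for five teachers on three measures (illustration only).
observation = np.array([2.8, 3.4, 2.1, 3.9, 3.0])        # rubric scores, 1-4 scale
student_survey = np.array([3.5, 4.1, 2.9, 4.4, 3.6])      # survey means, 1-5 scale
value_added = np.array([-0.10, 0.25, -0.30, 0.40, 0.05])  # in student SD units

def standardize(x):
    """Convert raw scores to z-scores so measures on different scales are comparable."""
    return (x - x.mean()) / x.std(ddof=1)

# Hypothetical policy weights; a composite-level validity argument must justify
# these choices, not only the individual measures.
weights = {"observation": 0.4, "student_survey": 0.2, "value_added": 0.4}

composite = (weights["observation"] * standardize(observation)
             + weights["student_survey"] * standardize(student_survey)
             + weights["value_added"] * standardize(value_added))

print(np.round(composite, 2))
```

Even this simple scheme embeds consequential decisions, such as the choice of weights, the comparison group used for standardization, and the handling of missing measures, that the validity argument for the composite inference would need to address.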
TEACHER OR TEACHING QUALITY?
The current policy press is to develop measures that
allow for inferences about teacher effectiveness.
Using particular measures, the goal is to be able to
make some type of claim about the qualities of a
teacher. Yet, to varying degrees, the measures we
examine do not tell us only about the teacher. A
broad range of contextual factors also contributes to the evidence of teaching quality, which is more directly observable.
To illustrate why context affects the validity of
what inferences can be made from the observation
of a performance, consider a scenario from medicine. Assume that under the same conditions, two
surgeons would operate using the same processes
and their respective patients would have the same
outcomes. But, as described in the following example, such simplifying assumptions that conditions
are invariant often do not hold.
property of an instrument, but rather it is an evaluation of the inferences and actions made in light of a
set of intended purposes. Given the extraordinary
and unprecedented focus on evaluating teacher
quality, this chapter is focused on measures being
used to make inferences about the quality of practicing teachers, and to a lesser degree, the inferences
made about prospective teachers who are undergoing professional preparation. These measures are
examined through the perspective of modern validity frameworks used to consider the quality of
assessments more generally. Building on M. T.
Kane’s (2006) thinking, the strength of the validity
evidence is considered while paying careful attention to the purposes of various teaching evaluation
instruments.
In considering the validity of inferences made
about teacher quality, the focus is on three issues that
may be at the forefront of discussions about teacher
evaluation for the foreseeable future. The first issue
concerns the validity argument for particular instruments. Guided by M. T. Kane (2006), Messick
(1989), and the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association
[APA], and National Council on Measurement in
Education [NCME], 1999), the respective validity
arguments for a range of measures being used to evaluate teachers are summarized briefly.
The second issue concerns an often underresearched aspect of any validity argument—causal
attribution of scores to particular teachers. Observing teaching or assessing student learning provides
a set of observables that produce a score or set of
scores. Policy makers and many researchers are
seeking ways to establish a causal relationship that
attributes these scores to an individual teacher. Yet
the nature of teaching and the context in which it
occurs raises questions about the extent to which
causal attribution can be made. To date, issues of
causal attribution have not yet been adequately dealt
with across instruments and processes used to measure teaching.
The final issue concerns the consideration of
multiple measures in an evaluation. Although the
most recent Standards for Educational and Psychological Testing (AERA et al., 1999) discussed the use of
Imagine Two Surgeons
We would like to evaluate them on
the quality of their surgical skills using
multiple measures. We will use the size
of the scar, the rate of infection, the quality of pain management, and patient satisfaction as our measures of the quality
of their surgical skills. One is in Miami,
Florida, the other in Moshe, Tanzania.
Both must remove a benign tumor from a
53-year-old man’s abdomen. The surgeon
in Miami has a #10 blade steel scalpel
that is designed for cutting muscle and
the sole control of the teacher and how much might
be attributable to contextual factors that influence
what the teacher does and how well students learn?
For example, although one can judge the quality of
the content being taught, that content is frequently
influenced by district-imposed curricula and texts.
Social interactions that occur among students are
certainly a function of the teacher’s establishment of
a classroom climate, but students also bring a set of
interpersonal dynamics into the classroom. Teachers
may design homework assignments or assessments,
but others may be compelled to use assessments and
assignments developed by the school district. How
do parental actions differentially support the intentions of teachers? The point is that it may be impossible to disentangle the individual teacher from all of
the classroom interactions and outside variables that
influence student outcomes (Braun, 2005a). Outside
the classroom, there are additional contextual effects
(e.g., interactions within schools and the larger
community) that are difficult to isolate (e.g.,
Pacheco, 2008). At a minimum, if we are to ascribe
causal attribution for student learning to teachers,
we must attempt to understand these complexities
and use analytic processes and methods that can
educate stakeholders about the quality and limitations of those causal attributions.
It is possible that neither patient will get an
infection and both will be satisfied with the care
they received. But it is also possible, perhaps
likely, that the patient in Miami will have a smaller
scar than the Moshe patient, due to the knife used;
and the Miami patient will have better pain management than the Moshe patient because of access to an
anesthesiologist. So even in one surgery, one would
expect the Miami surgeon to carry out a more effective surgery than the Moshe surgeon. And over a
number of years, as these surgeons do 100 similar
surgeries, it becomes increasingly likely that the
Moshe surgeon will have poorer surgical outcomes
than the Miami surgeon.
But has the quality of each surgeon’s respective
skills really been judged? The quality of medical
care the two patients have received has been evaluated. Are surgical skill and medical care the same
thing? Perhaps all that has really been learned is that
if someone had a tumor, he or she would like it
removed in Miami, not Moshe. The point is that
even in medicine, with its more objective outcomes
of scar size and infection rate, it is not always so
obvious to attribute surgical outcomes to the surgeon alone. There are many factors beyond the surgeon’s control that can contribute to her success. Of
course, the best conditions in the world will not,
over time, make an incompetent surgeon appear to
be expert.
Now, imagine walking into a classroom and
observing a lesson in order to make judgments
about a teacher. How much of what is seen is under
skin. The surgeon in Moshe has a well-sharpened utility knife that is used for a
range of surgical purposes. The excision
in Miami will occur in a sterile operating room with no external windows, fans
and filters to circulate and clean the air,
an anesthesiologist, and a surgical nurse.
The excision in Moshe will occur in a
clean operating room washed with well
water and bleach, windows opened a
crack to allow the breeze to circulate the
stiflingly hot air, no fans or filters, and a
nurse borrowed from the pediatrics unit
because she was the only available help.
The Purposes of Evaluating Teaching
For a range of reasons, there has been a push for
improved teacher evaluation models. The push is
strong, in part, because it comes from different constituencies with varying purposes for evaluating
teaching. These purposes include educational
accountability, strategic management of human capital, professional development of teachers, and the
evaluation of instructional policies. The confluence
of underlying constituencies and a wide range of
purposes has led to intense research and development activity around teacher effectiveness measures.
The first and perhaps most broadly agreed on
purpose for teaching evaluation is public accountability. The time period during which this chapter is
being written is an era of a pervasive emphasis on
educational accountability. Concerns about persistent achievement gaps between Black and White,
poor and rich, and English language speakers and
actual teaching quality (Goe, 2007; Wayne &
Youngs, 2003).
Stakeholders have grown increasingly frustrated
with the lack of an apparent relationship between
student achievement and measures used to evaluate
teachers (e.g., Weisberg et al., 2009). This has led to a far more empirical view of what defines effective teaching. Largely emanating from
individuals who are not representative of the traditional educational research and measurement communities, another goal of teaching evaluation has
become prominent—the strategic management of
human capital (Odden & Kelley, 2002). This view
rests on basic economic approaches to managing the
supply of teachers by incentives and disincentives
for individuals with specific characteristics. The
logic suggests that if the supply of “effective”
teachers can be increased by replacing “ineffective”
teachers, overall achievement would increase and
the achievement gap would decrease (Gordon, Kane,
& Staiger, 2006). In this view, the evaluation of
teaching is the foundation for managing people via
retention, firing, placement, and compensation policies
(Heneman, Milanowski, Kimball, & Odden, 2006).
A parsimonious definition of teaching quality
guides the measurement approach of human capital
management. This is characterized in the following
remark by Hanushek (2002): “I use a simple definition of teacher quality: good teachers are ones who
get large gains in student achievement for their
classes; bad teachers are just the opposite” (p. 3).
Hanushek adopted this definition because it is
empirically based, straightforward, and in his and
others’ views, tractable. Most of all, such a definition
avoids defining quality by the execution of particular teaching processes or the possession of specific
teacher characteristics, factors that have had modest,
if any, relationships to valued outcomes (e.g.,
Cochran-Smith & Zeichner, 2005).
Although recent approaches to the strategic management of human capital have raised the stakes
substantially for how teacher evaluations are used,
most policies broaden teacher evaluation to include
other factors besides student achievement growth.
English language learners, coupled with concerns
about U.S. academic performance relative to other
countries (Gonzales et al., 2008; Programme for
International Student Assessment [PISA], 2006),
have led policy makers to implement unprecedented
policies that focus on achievement and other measurable outcomes. Nowhere is this press for a public
accounting on measurable outcomes stronger than
in the K–12 accountability policy of the No Child
Left Behind revision of the Elementary and Secondary Education Act in 2002 (No Child Left Behind
Act, 2002).
Supported by a growing body of research that
identifies teachers as the major school-related determinant of student success (Nye, Konstantopoulos, &
Hedges, 2004; Raudenbush, Martinez, & Spybrook,
2007), perhaps it was only a matter of time before
the public accounting of student performance gave
way to a public accounting of teacher performance.
The purpose of teaching evaluation in this way of
thinking is to document publicly measurable outcomes that drive decision making and ensure the
public’s financial investment in teachers is maximized. It is important to recognize that out-of-school factors continue to be most predictive of
student outcomes; but for the variance that can be
accounted for by schools, teachers are the greatest
source of variation in student test score gains (Nye
et al., 2004; Raudenbush, 2004). Estimates of the
size of teachers’ contribution vary with the underlying analytic model employed (Kyriakides &
Creemers, 2008; Luyten, 2003; Rowan, Correnti, &
Miller, 2002).
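To illustrate what it means to treat teachers as a source of variation in student test score gains, the sketch below simulates gains for students nested within teachers and applies a one-way random-effects (ANOVA) estimator of the between-teacher share of variance. The data, sample sizes, and single-level design are hypothetical simplifications; the models cited above condition on much more (schools, prior achievement, multiple years), which is one reason estimates differ across specifications.

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers, students_per_class = 50, 25
teacher_var, student_var = 0.04, 0.80   # assumed variance components (gain-score units)

# Simulate gains: each class shares a teacher effect, plus student-level noise.
teacher_effects = rng.normal(0, np.sqrt(teacher_var), n_teachers)
gains = teacher_effects[:, None] + rng.normal(
    0, np.sqrt(student_var), (n_teachers, students_per_class))

# One-way random-effects ANOVA estimates (balanced design).
n = students_per_class
ms_between = n * gains.mean(axis=1).var(ddof=1)   # mean square between teachers
ms_within = gains.var(axis=1, ddof=1).mean()      # pooled within-class mean square

teacher_var_hat = max((ms_between - ms_within) / n, 0.0)
share = teacher_var_hat / (teacher_var_hat + ms_within)
print(f"Estimated share of gain variance between teachers: {share:.1%}")
```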
Earlier efforts to account publicly for teaching
quality have not been particularly useful or insightful. Characteristics valued in existing compensation
systems, such as postbaccalaureate educational
course-taking, credit and degree attainment, and
years on the job have modest associations with student achievement (e.g., Clotfelter, Ladd, & Vigdor,
2005; Harris & Sass, 2006; T. J. Kane, Rockoff, &
Staiger, 2006).1 In addition, widely used surface
markers of professional preparation, such as certification status and coursework, only weakly predict
1 The relationship of student achievement growth and teacher experience does increase for the first several years of teaching, but levels off after only a few years (e.g., Nye, Konstantopoulos, & Hedges, 2004).
and school and class size (Molnar et al., 1999) relate
to teacher practice. Often, the types of measures
used for this purpose are logs or other surveys that
ask teachers to report on the frequency of important
activities or practices. By evaluating teaching,
researchers and evaluators can assess the degree to
which policies intended to shape teaching and learning are working as intended.
In this chapter, the classes of measures that can
be used to support the evaluation of teaching for one
or more purposes are described: educational
accountability, strategic management of human capital, professional development of teachers, and the
evaluation of instructional policies. The next section
looks briefly at the history of assessing teaching
quality and considers the ways in which these multiple purposes have played out in recent history.
A Selective History of Assessing
Teaching Quality
Only a few years ago S. Wilson (2008) characterized
the U.S. national system of assessing teacher quality
as “undertheorized, conceptually incoherent,
technically unsophisticated, and uneven” (p. 24).
Although Wilson focused on the system of assessments used to characterize teacher quality, the same
characterization can be leveled at the constituent
measures and practices that make up what she
referred to as a “carnival of assessment” (p. 14).
Three dominant assessment purposes at the “carnival” are described, each of which renders judgments
about preservice, in-service, and master teaching,
respectively. Across the three purposes, there are
both strengths and weaknesses that lay the foundation for understanding current research and development activities.
By far, the most common purpose of formal
assessment in teaching occurs for beginning licensure, in which the state ensures that candidates have
sufficient knowledge, typically of content and basic
skills, so that the state can warrant that the individual will “do no harm.”2 These tests have almost
always been, and continue to be, standardized
assessments that require teacher candidates to meet
a particular state-established passing standard to be
Nevertheless, student achievement growth estimates
typically are a dominant factor in making determinations of effectiveness.
In addition to strategic management of human
capital, teacher evaluation has been viewed as a
means for improving individual and organizational
capacity. There have been longstanding concerns
that the professional development of teachers,
beginning even in preservice, is disconnected from
the particular needs of individual teachers and
inconsistent with understandings of how teachers
learn (Borko, 2004) and the supports they need to
teach well (Johnson et al., 2001; Johnson, Kardos,
Kauffman, Liu, & Donaldson, 2004; Kardos & Johnson, 2007). There is also increasing research that
documents how organizational variables—the alignment of curriculum, the presence of professional
learning communities and effective leadership, and
the quality of reform implementation—are related to
the nature and quality of teaching (Honig, Copland,
Rainey, Lorton, & Newton, 2010). With capacity
building as a goal, teaching evaluation can be a tool
that can diagnose the practices most in need of
improvement. The goal of teaching evaluation from
this perspective is to improve what teachers and
schools know and are able to do around instruction.
The measures used toward this end vary dramatically from low-inference checklists of desired behaviors to high-inference holistic rubrics of underlying
teaching quality values to school-level measures
of teaching contexts (Hirsch & Sioberg, 2010;
Kennedy, 2010).
Finally, researchers and evaluators use teaching
evaluation to assess whether and how various education policies are working. Deriving from both
measurement and evaluation perspectives, teaching
evaluation has been used to investigate the degree to
which school and curricular reforms and their
implementation influence instruction (e.g., Rowan,
Camburn, & Correnti, 2004; Rowan & Correnti,
2009; Rowan, Jacob, & Correnti, 2009), the impacts
of professional development (Desimone, Porter,
Garet, Suk Yoon, & Birman, 2002; Malmberg, Hagger, Burn, Mutton, & Colls, 2010), and how particular policies such as academic tracking (Oakes, 1987)
2 That is, in the legal context of licensure, failure to demonstrate sufficient knowledge or skill on an assessment would present some probability of causing harm in the workplace (M. T. Kane, 1982).
to improve individuals’ capacity while continuing to
adhere to the “do no harm” principle. Pass rates,
particularly given multiple opportunities to complete the assessment as is characteristic of these
systems, are very high (more than 95%; e.g.,
Connecticut State Department of Education, Bureau
of Program and Teacher Evaluation, 2001; Ohio
Department of Education, 2006).
Taken together, the licensure testing processes
serve the function of preventing a relatively small
proportion of individuals from becoming teachers,
but they do not support inferences about the quality
of teachers or teaching beyond minimal levels of
competence. Furthermore, because these instruments are disconnected from practice either by not
being able to sort teaching or not being close
enough to practice to provide information about
what a teacher is and is not able to do beyond minimal levels, this group of assessment practices provides modest accountability and capacity-building
information.
In addition to supporting judgments about individual teacher candidates, beginning teacher assessment is also influenced by teacher education
program accreditation. Almost all states use some
combination of teacher testing and program accreditation to regulate and hold programs accountable for
the quality of teachers entering classrooms (S. M.
Wilson & Youngs, 2005). Accreditation is governed
by state and regional accrediting agencies as well as
by two national organizations: National Council for
Accreditation of Teacher Education (NCATE) and
Teacher Education Accreditation Council (TEAC).3
Accreditation requirements vary but generally
include a site visit and a paper review of program
offerings, program coherence, and the alignment of
program standards with national organizations’ subject matter teaching standards. In some accreditation processes, programs must provide evidence that
graduates can teach competently and have acquired
relevant subject matter knowledge and teaching
experiences. That evidence can come from whatever
assessments the program uses and there are few, if
any, common assessments. These processes require
much of teacher education programs (e.g., Barnette &
awarded a license (S. M. Wilson & Youngs, 2005).
Tests have differed in terms of the proportion of
multiple-choice versus constructed-response
questions, whether they are paper-and-pencil or
computer-delivered, whether they are linear or
adaptive, and the methodology by which passing
standards are set. State licensure requirements vary
in terms of the number and types of tests required
and are not guided by a coherent theory of teaching
and learning. Tests are designed most often by
testing companies in collaboration with states.
Although this system results in adequate levels of
standardization within a state, the tests are criticized
as being disconnected from teaching practice
and based on incomplete views of teaching (e.g.,
Klein & Stecher, 1991).
Such tests are designed to support inferences
about the knowledge of prospective teachers. They
publicly account for what teachers know prior to
entering a classroom. They deliberately have not
been designed to encourage inferences about the
quality of teaching, although the implicit assumption on the part of many is that higher scores are
associated with higher levels of teaching proficiency,
however defined. When researchers have investigated this assumption, the evidence of any relationship has been weak, at best (Buddin & Zamarro,
2009; Goldhaber & Hansen, 2010; National
Research Council, 2008).
A number of states more recently adopted the
view that, to attain a full license, there ought to be
some direct evidence of teaching. Treating the initial
license as provisional, they adopted measures that
involved more direct evidence of teaching. Almost
always grounded in professional standards for teachers, assessments included live classroom observations, interviews, and teacher-developed portfolios
that contain artifacts of classroom practice such as
planning documents, assignments, and videotapes
(California Commission on Teacher Credentialing,
2008; Connecticut State Department of Education,
Bureau of Program and Teacher Evaluation, 2001;
Educational Testing Service [ETS], 2001). These
assessments are intended to provide both public
accountability and formative information about how
3 As of October 2010, NCATE and TEAC announced the merger of the two organizations into a unified body, The Council for the Accreditation of Educator Preparation (CAEP).
can be equally applied to particular measures within
an evaluative system.
It is fair to say that there is a substantial chasm
between the values expressed in these standards and
the state of teacher evaluation practice for preservice
and in-service teachers. It is rare to find an evaluation system in which there is any information collected as to the validity or reliability of judgments.
Often principals and other administrators are reluctant to give anything but acceptable ratings because
of the ensuing responsibilities to continue to monitor and support individuals determined to be in
need of improvement. It is extremely rare that
teachers—tenured or not—are removed from their
jobs simply because of poor instructional performance. Routinely, the propriety and accuracy of the
evaluation is challenged at great cost to the school
system (Pullin, 2010).
Current policies are attempting to transform this
historic state of affairs by largely defining teaching
effectiveness as the extent to which student test
scores improve on the basis of year-to-year comparisons. The methods that are used necessarily force a
distribution of individual teachers and are explicitly
tied to student outcomes. The details of these methods and the challenges they present are discussed in
subsequent sections of this chapter.
One other major teacher evaluation purpose that
was first implemented in the mid-1990s is the
National Board for Professional Teaching Standards
(NBPTS) certification system. Growing out of the
report A Nation Prepared: Teachers for the 21st Century (Carnegie Forum on Education and the Economy, 1986), a system of assessments was designed
to recognize and support highly accomplished
teachers. All NBPTS-certified teachers are expected
to demonstrate accomplished teaching that aligns
with five core propositions about what teachers
should know and be able to do as well as subject- and age range–specific standards that detail the
characteristics of highly accomplished teachers
(NBPTS, 2010).
The architecture of the NBPTS system is
described by Pearlman (2008) and was used to guide
Gorham, 1999; Kornfeld, Grady, Marker, & Ruddell, 2007; Samaras et al., 1999) and may produce
changes in program structure and processes (Bell &
Youngs, 2011); however, there is no research that
documents the effects of accreditation on preservice
teacher learning or K–12 pupil learning.
A second dominant assessment purpose occurs
once teachers are hired and teaching in classrooms.
States and districts typically set policies concerning
the frequency of evaluation and its general processes. In states with collective bargaining, evaluation is often negotiated by the administration and
the union. Despite the variety of agencies that have
responsibility for the content of annual evaluations,
evaluations are remarkably similar (Weisberg
et al., 2009). They are administered by a range of
stakeholders—coaches, principals, central office
staff, and peers—and use a wide range of instruments, each with its own idiosyncratic view of quality teaching.4 Although evaluations apply to all teachers, the systematic and consistent application of evaluative judgments is rare (e.g., Howard &
Gullickson, 2010).
Whereas traditional assessment practices for preservice teachers have had standards but are disconnected from teaching practice, in-service assessment
practices have been connected to practice but lack
rigorous standards. This has led in-service teaching
evaluation to be viewed as a bankrupt and uninformative enterprise (Toch & Rothman, 2008; Weisberg et al., 2009). Evaluations are often viewed as
bureaucratic functions that provide little or no useful information to teachers, administrators, institutions, or the public.
Howard and Gullickson (2010) have made the
case that teacher evaluation efforts should meet the
Personnel Evaluation Standards (Gullickson, 2008)
that include the following: propriety standards,
addressing legal and ethical issues; utility standards,
addressing how evaluation reports will be used and
by whom; feasibility standards, addressing the practicality and feasibility of evaluation systems; and
accuracy standards, addressing the validity and credibility of the evaluative inferences. These standards
4 Annual teaching evaluations are generally idiosyncratic within and across districts; however, there are examples of more coherent district-level practices in places like Cincinnati and Toledo. Increasingly, as a part of the Teacher Incentive Fund (TIF) grants, districts are experimenting with pilot programs that have higher technical quality.
evidence suggests that even the most common
assessment practices have had a modest impact on
the structures and capacity of the system to improve
educational performance. Looking across the practices, there is no common view of quality teaching,
and sound measurement principles are missing from
at least some core practices of in-service evaluation.
These findings, along with political reluctance to
make evaluative judgments (e.g., Weisberg et al.,
2009), have led many researchers and policy makers
to conclude that the measures that make up the
field’s most common assessments will be unable to
satisfy the ambitious purposes of accountability,
human resource management, and instructional
improvement that are driving current policy
demands around evaluation.
Thus, the chapter next reviews measures the field
is developing and implementing to support purposes
ranging from accountability to capacity building.
The primary features of different classes of measures, the nature of inferences they potentially can
support, and current validation approaches and
challenges are described.
the development of 25 separate certificates, each
addressing a unique combination of subject area and
age ranges of students. For all certificates, teachers
participate in a year-long assessment process that
contains two major components. The first requires
teachers to develop a portfolio that is designed to
provide a window into practice. Portfolio entries
require teachers to write about particular aspects of
their practice as well as include artifacts that provide
evidence of this practice. Artifacts can include videos and samples of student work. In all cases, teachers are able to choose the lesson(s) they want to
showcase, given the general constraints of the portfolio entry. Examples of a portfolio entry include
videos of the teacher leading a whole-class discussion or teaching an important concept.
The second major component of NBPTS certification is the assessment center activities. Candidates
go to a testing center and, under standardized testing conditions, respond to six constructed-response
prompts about important content and content pedagogy questions within their certificate area. To
achieve certification, candidates need to attain a
total score across all assessment tasks that exceeds a
designated passing standard. On balance, research
suggests that the NBPTS system is able to identify
teachers who are better able to support student
achievement—as measured by standardized test
scores—than are teachers who attempt certification
but do not pass the assessment, but the differences
are quite modest (National Research Council, 2008).
The states in which teachers have been most
likely to participate in the NBPTS system are those
that have provided monetary rewards or salary supplements for certification. This has led to NBPTS
being very active in a relatively small number of
states, with only limited participation in other states.
In contrast to assessment policies that shape preservice and in-service teaching, NBPTS takes a nuanced
and professional view of teaching via a standardized
assessment system that is tied to teaching practice.
However, it is voluntary, touches relatively few
teachers in most states, and is expensive.
Although this discussion does not cover all
teacher evaluation practices, it does provide a synopsis of the most common formal assessment and
evaluation systems for teachers. Taken together, the
Conceptualizing Measures of
Teaching Quality
Teaching quality is defined in many different ways
and operationalized by the particular sets of measures used to characterize quality. Every measure
brings with it, either explicitly or implicitly, a particular perspective as to which aspects and qualities
of teaching should receive attention and how evidence ought to be valued. For example, there have
been heated political arguments about whether dispositions toward teaching ought to be assessed as
part of teacher education (Hines, 2007; Wasley,
2006). Although there is general agreement that the
impact on students ought to be a critical evaluative
consideration, the indicators of impact are not
agreed upon. Some are satisfied with a focus on subject areas that are tested in schools. Others want
both to broaden the academic focus and emphasize
outcomes that are associated with mature participation in a democratic society (Koretz, 2008; Ravitch,
2010).
Although reasonable people disagree about what
distinguishes high-quality teaching, it is important
value-added models [VAM]) does not capture the
whole domain of teaching quality.
In many fields, it is reasonable to expect that particular classes of measures are associated with particular stages of educational or professional
development. For teaching, that has been partially
true, particularly with content knowledge measures
being used as a requirement to attain a teaching
license. By and large, however, the measures
reviewed here are being considered for use throughout the professional span during which teachers are
assessed. At the time of the writing of this chapter,
how the measures actually are used to meet particular assessment purposes remains to be seen.
Nevertheless, because of the lack of any inherent
relationship between category of measure and particular use, the remainder of this chapter is organized by construct focus rather than assessment
purpose.
MEASURES OF TEACHING QUALITY
In this section, an overview of measures that have
been developed to support inferences about constructs associated with teaching quality is presented.
to identify clearly the constructs that comprise
teaching quality and how those constructs may be
understood relative to the measures used in evaluation systems. Figure 20.1 describes a model we have
developed that presumes that teaching quality is
interactional and constructive. Within specific
teaching and learning contexts, teachers and students construct a set of interactions that is defined
as teaching quality. Six broad constructs make up
the domain of teaching quality. These are teachers’
knowledge, practices, and beliefs, and students’
knowledge, practices, and beliefs. The domain of
teaching quality and by extension the constructs
themselves are intertwined with critical contextual
features, such as the curriculum, school leadership,
district policies, and so on. Therefore, by definition,
specific instruments measure both context and construct. As can be seen in the figure, instruments may
detect multiple constructs or a single teaching quality construct. For example, observation protocols
allow the observer to gather evidence on both
teacher and student practices, whereas assessments
of content knowledge for teaching only measure
teachers’ knowledge. Finally, the figure suggests
that any one measure (e.g., a knowledge test or
[Figure 20.1 is a diagram; its recoverable labels follow.] Contextual factors: curriculum, school leadership, policy, community, students and colleagues, resources. Target domain (teaching quality) constructs: teacher knowledge, teacher practices, teacher beliefs; student knowledge, student practices, student beliefs. Measures: content knowledge for teaching tests, knowledge of teaching tests, belief instruments, observation measures, artifact measures, teacher portfolios, student portfolios, growth models, value-added methods, graduation rates, course-taking patterns.
FIGURE 20.1. The contextual factors, constructs, and measures associated with teaching quality.
For each set of measures, their core design characteristics, the range of uses to which they have
been put, and the status of evidence to support a
validity argument for the evaluation of teaching are
reviewed.
Knowledge of content. Knowledge of content has
been a mainstay of the teacher licensure process
since the 1930s, with the advent of the National
Teacher Examinations (Pilley, 1941). With the
requirement for highly qualified teachers in the
reauthorization of the Elementary and Secondary
Education Act (No Child Left Behind Act, 2002),
all states now require teachers to demonstrate some
level of content knowledge about the subjects for
which they are licensed to teach.
These assessments typically consist of multiple-choice questions that sample content derived from
extant disciplinary and teaching standards and then
confirmed through surveys of practitioners and
teacher educators. Individual states set passing
scores for candidates based on standard-setting processes (Livingston & Zieky, 1982) that are used to
define a minimum-level “do no harm” threshold.
Some states also require tests that assess knowledge
of pedagogy and content-specific pedagogy. Although
some of these tests may include constructed-response formats, the basic approach to design and
validation support is similar for both content and
pedagogical tests.5
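To show the arithmetic behind the standard-setting processes mentioned above, the sketch below implements a modified Angoff procedure of the general kind described by Livingston and Zieky (1982): panelists estimate, for each item, the probability that a minimally qualified candidate would answer correctly, and a recommended passing score is derived from those judgments. The panel and its ratings are invented for illustration; operational studies add further steps such as impact data, discussion rounds, and policy review before a state adopts a passing score.

```python
import numpy as np

# Hypothetical modified-Angoff ratings: rows are panelists, columns are items.
# Each entry is a judged probability that a minimally qualified ("do no harm")
# candidate answers the item correctly.
ratings = np.array([
    [0.60, 0.75, 0.40, 0.85, 0.55],
    [0.65, 0.70, 0.45, 0.80, 0.60],
    [0.55, 0.80, 0.35, 0.90, 0.50],
])

# A panelist's recommended raw cut score is the sum of his or her item ratings;
# the panel recommendation is commonly the mean of those sums.
panelist_cuts = ratings.sum(axis=1)
recommended_cut = panelist_cuts.mean()

print("Per-panelist cut scores:", np.round(panelist_cuts, 2))
print(f"Recommended raw passing score: {recommended_cut:.2f} out of {ratings.shape[1]} items")
```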
The validity argument for these kinds of tests has
long been a source of debate. M. T. Kane (1982) discussed two possible interpretations: one concerned
with the ability of the licensure test to predict future
professional performance and the other to evaluate
the current competence on a set of skills and knowledge that was deemed necessary but not sufficient
for professional practice. M. T. Kane (1982) argued
that the latter interpretation was appropriate for
Teacher Knowledge
licensure tests as any single instrument would be
insufficient to capture the set of complex and coordinated skills, understandings, and experiences
necessary for professional competence.
In endorsing the much more limited competence
interpretation, M. T. Kane (1982) argued that establishing content validity was the critical task for a
licensure test validity argument. Evidence is
expected to demonstrate the adequacy of content
needed for minimal job performance, both in terms
of content representation and expectations for meeting the passing standard. Processes that include job
analysis and standard-setting studies typically are
used to provide such evidence. The adequacy of
scores is typically supported through standard psychometric analyses that include test form equating,
reliability, scaling, differential item functioning
(DIF), and group performance studies. Other scholars have agreed that it is both inappropriate and
infeasible to expect a broader validity argument
(e.g., Jaeger, 1999; Popham, 1992).
Even under this relatively constrained set of
requirements, the status of validity evidence in practice is uneven. In its 2001 report, the National
Research Council reviewed the validity evidence of
the two primary organizations that design, develop,
and administer these assessments. ETS6 was viewed
as having evidence to support the content validity
argument, although some assessments were using
studies that were dated. National Evaluation Systems (NES)7 tests were typically unavailable, and so,
the study panel concluded that for a very substantial
amount of teacher licensure testing, the available
evidence did not satisfy even the most basic requirements of available information articulated in the
Standards for Educational and Psychological Testing
(AERA et al., 1999).
M. T. Kane’s (1982) position that content validation is by itself sufficient to establish the validity
of licensure assessments has been argued against
5 While all states require demonstrations of content knowledge, some also require candidates to pass assessments of basic knowledge of reading, writing, and mathematics. We do not include these tests in our analysis because these instruments test knowledge and skills that are equally germane for any college student, not just teacher candidates.
6 The authors of this chapter were both employees of ETS as this chapter was written. The statements included here are a description of the conclusions of the National Research Council (2001) study report Testing Teacher Candidates: The Role of Licensure Tests in Improving Teacher Quality. We believe our statements are a fair representation of the study findings.
7 National Evaluation Systems was acquired by Pearson Education in 2006 and is now known as Evaluation Systems of Pearson.
Knowledge of content for teaching. Teaching
involves much more than simple mastery of content
knowledge. Shulman (1986) argued persuasively
that teachers also needed to master a body of knowledge he identified as pedagogical content knowledge
(PCK). Shulman argued that PCK involves pedagogical strategies and representations that make content
understandable to others and also involves teachers
grasping what makes particular content challenging
for students to understand, what kinds of conceptions and misconceptions students might have, and
how different students might interact with the content in different ways.
Building on Shulman’s (1986) ideas, Hill, Ball,
and colleagues focused on mathematics and developed theory and assessments of what they called
mathematical knowledge for teaching (MKT; Ball,
Thames, & Phelps, 2008). MKT attempts to specify
the knowledge of mathematical content that is used
in practice, differentiating the advanced subject
knowledge that one might learn as a student majoring in a college discipline from the particular forms
of knowledge that teachers need to help their students learn concepts in K–12 education. Content
knowledge for teaching (CKT) is the more general
term applied to this type of knowledge as it is used
across different subject matter domains (e.g., science, social studies, etc.). CKT incorporates what
Shulman called PCK and further specifies both the
content knowledge and the PCK that teachers need
to know in particular subject areas. The argument
that accompanies CKT suggests that teachers must
know mathematics differently than someone who
uses math in her daily life but is not charged with
teaching children math. For example, a common
task of teaching requires that teachers assess the
degree to which certain problems allow students to
practice a particular math objective. In Figure 20.2,
the teacher must be able to recognize whether a proportion can be used to solve the word problem.
Although adults may use proportions in their professional or personal lives, teachers must be able to
look at problems and determine whether that problem can be solved in a specific way that meets a
learning objective.
Ball et al. (2008) highlighted six forms of CKT
that fall into two categories—content knowledge
strongly by other experts (e.g., Haertel, 1991;
Haney, Madaus, & Kreitzer, 1987; Moss & Schutz,
1999; Pullin, 1999). A number of researchers and
policy makers, including the authors of the
National Research Council (2001) study, have
argued that these assessments ought to be evaluated using the predictive criterion interpretation,
including a demonstration of a relationship
between scores on the licensure test and other
measures associated with teaching. To that end,
researchers have conducted studies relating scores
on teacher licensure assessments to student gains
in achievement by studying practicing teachers
who varied on their licensure test scores, including
those who would not have met the passing standard in one state even if they scored sufficiently
high to teach in another. There is some evidence of
a quite modest relationship between licensure test
scores and student outcomes (e.g., Goldhaber &
Hansen, 2010). Gitomer and Qi (2010), however,
observed that the licensure tests were successful in
identifying individuals who performed so substantially below the passing standard that such individuals would not have ever become practicing
teachers in any locale and, thus, would not have
been part of the distribution studied by Goldhaber
and Hansen. Because these individuals do not
attain a license to teach, any studies examining the
relationships between test scores and other outcomes are attenuated.
In part because content knowledge tests have
been used so widely, there is a large body of evidence demonstrating disparate impact for minority
candidates. Scores and passing rates are significantly lower for African American candidates than
White candidates (Gitomer, 2007; Gitomer,
Latham, & Ziomek 1999; Gitomer & Qi, 2010),
which raises questions about the validity of the
assessments and whether bias is associated with
them. Although test developers attempt to ensure
fairness through their test development and analysis
processes (e.g., DIF analyses), it is imperative that
research continue not only to examine issues of bias
but also to pursue strategies to mitigate unequal
educational opportunities that many minority candidates have experienced (e.g., National Research
Council, 2001, 2008).
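To illustrate what a DIF analysis does, the sketch below computes the Mantel-Haenszel common odds ratio for a single item, comparing reference- and focal-group performance within strata matched on total test score, and converts it to the delta-scale index that testing programs commonly report. The counts are fabricated for illustration, and operational procedures involve additional refinements (purified matching criteria, significance tests, and classification rules).

```python
import numpy as np

# Hypothetical counts for one item, stratified by matched total-score level.
# Each row: [ref_correct, ref_incorrect, focal_correct, focal_incorrect]
strata = np.array([
    [30, 20, 25, 25],
    [45, 15, 38, 22],
    [60, 10, 52, 18],
    [70,  5, 66,  9],
])

a, b, c, d = strata.T        # unpack the four cell counts per stratum
n = strata.sum(axis=1)       # total examinees in each stratum

# Mantel-Haenszel common odds ratio across strata.
alpha_mh = np.sum(a * d / n) / np.sum(b * c / n)

# ETS delta-scale DIF index; values near 0 indicate little DIF,
# and negative values indicate the item disfavors the focal group.
mh_d_dif = -2.35 * np.log(alpha_mh)

print(f"MH common odds ratio: {alpha_mh:.2f}, MH D-DIF: {mh_d_dif:.2f}")
```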
Mr. Sucevic is working with his students on understanding the use of proportional relationships in solving problems. He wants to select some problems from a mathematics workbook with which his students can practice. For each of the following problems, indicate whether or not it would be answered by setting up and solving a proportional relationship. (Response options for each problem: Would Be Answered by Setting Up and Solving a Proportional Relationship / Would Not Be Answered by Setting Up and Solving a Proportional Relationship.)
A) Cynthia is making cupcakes from a recipe that requires 4 eggs and 3 cups of milk. If she has only 2 eggs to make the cupcakes, how many cups of milk must she use?
B) John and Robert are each reading their books at the same rate. When John is on page 20, Robert is on page 15. What page will John be on when Robert is on page 60?
C) Julie and Karen are riding their bikes at the same rate. Julie rides 12 miles in 30 minutes. How many miles would Karen ride in 35 minutes?
D) Rashida puts some money into an account that earns the same rate each month. She does not remove or add any money to the account. After 6 months, the balance in the account is $1,093.44. What is the balance in the account after 12 months?
E) A square with area 16 square units can be inscribed in a circle with area 8π square units. How many square units are in the area of a square inscribed in a circle that has area 24π square units?
FIGURE 20.2. A sample question from MET Mathematics 6–8.
and PCK. Each of the main categories has three subcategories. Content knowledge is composed of common content knowledge, specialized content
knowledge, and horizon content knowledge. Specialized content knowledge is knowledge that
enables work with students around content, but not
knowledge that other professionals using the same
content (e.g., mathematics) in their jobs might find
important. For example, a teacher needs to not only
carry out a mathematical operation (e.g., dividing
fractions) but also to understand why the operation
works so that different student solutions can be
understood as being mathematically reasonable or
not. Specialized content knowledge contrasts with
common content knowledge, which is knowledge
held by individuals who use that content in their
work and personal lives. Horizon content knowledge involves an understanding of how different
content is interrelated across curricular topics both
within and across school years.
PCK, the second organizing category of knowledge, is composed of knowledge of content and
students, knowledge of content and teaching, and
knowledge of content and curriculum. Knowledge
of content and students combines content knowledge and knowledge of how students interact with
and learn the subject. It includes, for example,
knowledge of what aspects of a subject students are
likely to find difficult, errors students might make,
and difficulties students might encounter in understanding a subject. Knowledge of content and teaching includes knowledge of the best examples to use,
how to link subject-specific tasks, and ways of
responding to students’ ideas and confusion that
will develop their understanding of the subject.
Finally, knowledge of content and curriculum
focuses on knowledge of how to sequence and organize a subject and of the material programs that can
be used to support students’ developing understanding of the subject.
Hill, Schilling, and Ball (2004) described the
developmental processes for constructing items of
these types and also provided information about the
psychometric quality and structure of assessment
2004), but assessments are being developed and
tested in English language arts. Most validity work
has been done on MKT, not the more general CKT.
Given the use of MKT as a research tool, there is a
relatively strong validity argument. However, the validity argument for other uses of MKT (e.g., teacher preparation program evaluation) is modest but growing. The validity argument for assessments of
CKT, used in both research and for personnel decisions, is nascent but also growing.
Teacher Practices
Observations. Scholarship on observation protocols goes back to the turn of the 20th century
(Kennedy, 2010). The actual practice of individuals
with authority using some type of protocol to make
evaluative decisions about a teacher likely dates back
even further. Kennedy (2010) suggested that for
more than half of the 20th century the protocols in
use have been general, poorly defined, idiosyncratic,
heavily subjective, and often focused on teachers’
personal characteristics rather than teaching.
Given the view of teaching as one involving
socially and culturally situated interactions between
teachers and students to support the construction of
knowledge, instruments that are unable to detect
these types of interactions are not reviewed. This
means that the instruments from the productive history of process–product research in the 1970s and
1980s that used observation protocols to assess
teaching quality are not included (for a review of
this research, see Brophy & Good, 1986). Instead,
the focus is on the relatively new and small number
of instruments and associated research that has been
developed and used over roughly the past 25 to
30 years. These instruments are designed to measure
whole-class instruction (e.g., not tutoring situations) and adopt the view that teaching and learning
occur through interactions that support the construction of knowledge.
The observation protocols currently in use generally adhere to the following description: The protocol begins with an observer developing a record of
evidence from the classroom for some defined segment of time, typically without making any evaluative judgments. At the end of the segment, observers
use a set of scoring criteria or rubric that typically
forms built with these items. Schilling and Hill
(2007) have laid out a validity argument for these
kinds of assessments and have conducted a research
program to marshal evidence to evaluate the argument. To date, these assessments have been used in
the context of research, particularly in the context of
examining the impact of professional development
and curricular interventions. They have not been
used as part of licensure or other high-stakes testing
programs. Thus, the validity argument pertains to
use as a research tool.
In one study, Hill, Dean, and Goffney (2007)
conducted cognitive interviews of problem solutions
by teachers, nonteachers, and mathematicians.
Although they found that mathematical knowledge
itself was critically important to solving the problems, they observed important differences that were
not simply attributable to content knowledge. Mathematicians, for example, sometimes had difficulty
interpreting nonstandard solutions, the kinds of
solutions that students often generate. Although
mathematicians could reason their way through
problems, it was teachers who could call on their
prior experiences with students to reason through
other problems. Krauss, Baumert, and Blum (2008)
developed another measure of PCK and also found
strong but not uniform relationships with content
knowledge—in some cases, teachers brought unique
understandings that allowed them to solve problems
more effectively than others who had far stronger
mathematical content knowledge. Other studies
have found modest relationships between CKT measures and student achievement gains (Hill, Rowan,
& Ball, 2005) and relationships with judgments of
classroom instruction through observation (Hill et al.,
2008). The lack of studies that address questions of
impact on subgroups of teachers (e.g., special education teachers, teachers of color, or teachers of English language learners) likely is due to the purposes
and scope of the existing research studies.
The studies to date have typically relied on relatively small samples. Studies currently being conducted will yield data based on far larger samples
and broader sets of measures of teacher quality (e.g.,
Bill and Melinda Gates Foundation, 2010). To date,
there has been only limited work in other content
domains (e.g., Phelps, 2009; Phelps & Schilling,
2005; La Paro, Pianta, & Stuhlman, 2004; Piburn &
Sawada, 2000). Because much of the research on
these protocols has happened in the context of
university-based research projects, the raters
themselves are often graduate students or faculty
members. With this rater group, trainers are able
to teach raters to see teaching and learning through
the lens of their respective protocol at acceptable
levels of interrater agreement (e.g., Hill et al.,
2008). Initial qualification of raters typically
requires agreement with master codes at some
prespecified level (e.g., 80% exact match on a
4-point scale).
Among both researchers and practitioners, the
best methods and standards for judging rater agreement on holistic observation protocols are evolving.
The simplest and most common way of judging agreement is to calculate the proportion of scores on
which raters agree. For many protocols, agreement
requires an exact match in scores (e.g., Danielson &
McGreal, 2000). But for others with larger scales,
raters are deemed to agree if their scores do not differ by more than 1 score point (e.g., Pianta et al.,
2007). Such models do not take into account the
overall variation in scores assigned—raters may
appear to agree by virtue of not using more than a
very narrow range of the scale. More sophisticated
analyses make use of rater agreement metrics that
take into account the distribution of scores, including Cohen’s kappa,8 intraclass correlations, and
variance component decomposition.
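To make these alternatives concrete, the following sketch computes exact agreement, within-one-point agreement, and Cohen's kappa for two hypothetical raters scoring the same lesson segments on a 4-point scale; the scores are invented for illustration and are not drawn from any of the protocols discussed here.

```python
import numpy as np

def agreement_summary(r1, r2, k=4, tolerance=0):
    """Exact (tolerance=0) or adjacent (tolerance=1) agreement plus Cohen's kappa
    for two raters scoring the same segments on a k-point scale (scores 1..k)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    agree = np.mean(np.abs(r1 - r2) <= tolerance)              # proportion within tolerance
    # Cohen's kappa: chance-corrected exact agreement
    p_o = np.mean(r1 == r2)
    p1 = np.array([np.mean(r1 == c) for c in range(1, k + 1)])  # rater 1 marginals
    p2 = np.array([np.mean(r2 == c) for c in range(1, k + 1)])  # rater 2 marginals
    p_e = np.sum(p1 * p2)                                       # agreement expected by chance
    kappa = (p_o - p_e) / (1 - p_e)
    return agree, kappa

r1 = [3, 3, 2, 4, 3, 3, 2, 3]   # hypothetical scores from two raters
r2 = [3, 2, 2, 4, 3, 4, 2, 3]
print(agreement_summary(r1, r2, tolerance=0))   # exact agreement, kappa
print(agreement_summary(r1, r2, tolerance=1))   # within-one-point agreement, exact-agreement kappa
# Note: when raters use only a narrow slice of the scale, raw agreement can be
# high while kappa is low, and kappa itself is sensitive to skewed distributions.
```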
Emerging models attempt to understand a range
of factors that might affect rater quality and agreement. For example, in addition to rater main effects,
Raudenbush, Martinez, Bloom, Zhu, and Lin (2010)
consider how rater judgments can interact with the
classrooms, days, and lesson segments observed. To
the extent that these variance components (or facets,
if g-study approaches are used; see Volume 1, Chapter 3, this handbook) can be estimated, it may be
possible to develop observation scores that adjust
for such rater effects. When using these models, preliminary findings suggest there are substantial training challenges in obtaining high levels of agreement,
includes a set of Likert scales to make both low- and
high-inference judgments based on the record of
evidence. Those judgments result in numerical
scores. Although some of the protocols have been
used to evaluate thousands of teachers (e.g., Charlotte Danielson’s Framework for Teaching has been
the most widely used), the protocols have rarely
been used for summative consequential decisions,
although this is changing rapidly. Despite the fact
that many districts are considering or have already
begun using these observation protocols for consequential decisions, there is still much not known
about the strength of the validity argument for these
protocols as a group as well as the strength of the
validity argument for individual protocols. Although
there are exceptions, the instruments have been
used in both live and video-based observation settings. Bell et al. (in press) have recently used an
argument approach to evaluate the validity of one
observation protocol.
Protocols tend to fall into two broad categories—
protocols for use across subject areas and those
intended for use in specific subject areas (Baker,
Gersten, Haager, & Dingle, 2006; Danielson, 1996;
Grossman et al., 2010; Hill et al., 2008; Horizon
Research, 2000; Pianta, La Paro, & Hamre, 2007;
Taylor, Pearson, Peterson, & Rodriguez, 2003).
There are subject-specific protocols in mathematics,
science, and English language arts, but none are evident for social studies classrooms (e.g., social studies, history, government, geography, etc.). There are
more protocols for use at the elementary grades than
the secondary ones. Many of the subject-specific protocols have been studied in K–3, K–5, or K–8 classrooms, so it is unclear whether or how the protocols
might function differently in high school classrooms.
These protocols reflect a particular perspective on
teaching quality—some privilege a belief in constructivist perspectives on teaching and others are more
agnostic to the particular teaching methods used.
Observation protocols are generally developed
and vetted within a community of practice that has a
corresponding set of teaching standards (Danielson &
McGreal, 2000; Gersten, Baker, Haager, & Graves,
8 It is important to note that kappa can be sensitive to skewed or uneven distributions and, therefore, may be of limited value depending on the particular score distributions on a given instrument (e.g., Byrt, Bishop, & Carlin, 1993).
Instructional collections and artifacts. A second
group of instruments to measure teaching quality
has emerged in the past 15 to 20 years. Instructional
collections (sometimes referred to as portfolios) and
artifacts have a shorter history than observations.
Research began in earnest on these types of instruments in the early to mid-1990s with peer-reviewed
articles and book chapters beginning to appear in
the late 1990s. Thus far that work has produced a
relatively small number of instruments used and
studied by a relatively small number of researchers. In contrast to observation protocols that were
largely designed as professional development tools,
the design and development of instructional collections and artifact protocols gave more attention
to psychometric quality from the outset. Even so,
research remains highly uneven—a moderate number of studies with very small numbers of teachers and a handful of studies with large numbers of
teachers. Claims about such protocols as a group
should therefore be taken as preliminary.
Instructional collections are evidence collection
and scoring protocols that typically involve one or
more holistic judgments about a range of evidence
that often addresses the multiple constructs that
comprise the teaching quality construct in Figure 20.1.
Instructional collections draw inferences from evidence that can include lesson plans, assignments,
assessments, student work samples, videos of classroom interactions, reflective writings, interviews,
observations, notes from parents, evidence of community involvement, and awards or recognitions.
These protocols identify what types of evidence the
teacher is expected to submit within broad guidelines; the teacher is able to choose the specific materials upon which the judgment is based. Often the
teacher provides both an explicit rationale for the
selection of evidence in the collection and a reflective analysis to help the reader or evaluator of the
collection make sense of the evidence.
Artifact protocols can be thought of as a particular type of instructional collection that is much narrower. The protocols most widely researched are
designed to measure the intellectual rigor and quality of the assignments teachers give students as well
as the student work that is produced in response to
those assignments (e.g., Borko, Stecher, & Kuffner,
2007; Newmann, Bryk, & Nagaoka, 2001). These
protocols are designed to be independent of the academic difficulty of a particular course of study. For
example, an advanced physics assignment would
receive low scores if students were simply asked to
provide definitions. The judgments made focus on
an important but limited part of the teaching quality
domain, focusing almost exclusively on teacher and
particularly with higher inference instruments (e.g.,
Gitomer & Bell, 2012; McCaffrey, 2011). As observation systems are included in evaluation systems,
systems will need to ensure not only that raters are
certified but also that effective monitoring and continuing calibration processes are in place. In general,
there is little or no information provided about
whether or how raters are calibrated over time (Bell,
Little, & Croft, 2009).
A research literature is now beginning to amass
around these observation protocols. Research is
being conducted examining the extent to which
empirical results support the underlying structure of
the instruments (e.g., La Paro et al., 2004) and
changes in practice as the result of teacher education
(Malmberg et al., 2010) and professional development (Pianta, Mashburn, Downer, Hamre, & Justice,
2008; Raver et al., 2008). A number of studies are
now being reported that examine the relationship of
observation scores to student achievement gains
(Bell et al., in press; Bill and Melinda Gates Foundation, 2011b; Grossman et al., 2010; Hamre et al., in
press; Hill, Umland, & Kapitula, 2011; Milanowski,
2004). Thus, over the next 5 to 10 years, a very
strong body of research is likely to emerge that will
provide information about the validity and potential
of classroom observation tools.
It is important to understand that these protocols
are designed to evaluate the quality of classroom
practice. As described in Figure 20.1, factors such as
curriculum, school policy, and environment as well
as the characteristics of the students in the classroom are being detected by these observation protocols. Thus, causal claims about the teacher require
another inferential step and are not transparent.
Furthermore, given the high-stakes uses to which
these instruments are being applied, the state of the
current validity argument is weak.
committees were consulted extensively. For this class of protocols, significant attention has been paid to raters and score quality. Although there have been
graduate students, principals, and other education
professionals trained to rate, raters have overwhelmingly been teachers with experience at the grade
level and subject area being assessed (Aschbacher,
1999; Boston & Wolf, 2006; Matsumura et al., 2006;
Matsumura, Garnier, Slater, & Boston, 2008; Newmann et al., 2001). Training on both instructional
collection and artifact protocols is usually intensive
(e.g., 3 to 5 days for artifacts and sometimes longer
for instructional collections) and makes use of
benchmark and training papers. For almost all protocols, raters are required to pass a certification test
before scoring. Although the quality of the training
as judged by interrater agreement varies across protocols and studies, the literature suggests it is possible to train raters to acceptable levels of agreement
(more than 70%) with significant effort (Borko
et al., 2007; Gitomer, 2008b; Ingvarson & Hattie,
2008; Matsumura et al., 2006; M. Wilson, Hallam,
Pecheone, & Moss, 2006).
As with observations, score accuracy is often a
challenge to the validity of interpretations of evidence for instructional collections. Accuracy problems, most often in the form of rater drift and bias,
have been addressed by putting in place procedures
for bias training (e.g., Ingvarson & Hattie, 2008)
and retraining raters, rescoring, and, in some cases,
modeling rater severity using Rasch models
(Gitomer, 2008b; Kellor, 2002; National Research
Council, 2008; Shkolnik et al., 2007; Wenzel, Nagaoka, Morris, Billings, & Fendt, 2002). Because there
is such a wide range of practices to account for rater
agreement across the instruments and purposes of
those instruments, it is difficult to generalize about
the quality of scores except to say it is uneven.
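As an illustration of how rater severity can enter such models, the following sketch computes score-category probabilities under a simple facets-style rating scale formulation in which the log-odds of moving from one category to the next depend on the teacher's standing minus the rater's severity. The thresholds and severity values are hypothetical, and operational Rasch analyses of these protocols involve estimation procedures not shown here.

```python
import numpy as np

def rating_probabilities(theta, severity, thresholds):
    """Category probabilities for one teacher-rater pairing under a facets-style
    rating scale model: the log-odds of category k versus k-1 is
    (theta - severity - thresholds[k-1])."""
    steps = theta - severity - np.asarray(thresholds)    # one step per threshold
    cum = np.concatenate(([0.0], np.cumsum(steps)))      # cumulative logits per category
    expcum = np.exp(cum - cum.max())                     # stabilize before normalizing
    return expcum / expcum.sum()

# Example: a lenient rater (severity -0.5) versus a severe rater (severity +0.5)
thresholds = [-1.0, 0.0, 1.0]          # 4-point scale implies 3 step thresholds
for sev in (-0.5, 0.5):
    print(sev, rating_probabilities(theta=0.3, severity=sev, thresholds=thresholds).round(2))
```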
Instructional collections and artifact protocols
examine evidence that is often produced as a regular
part of teaching and learning. Perhaps in part
because of this closeness to practice, instructional
collections have high levels of face validity, and
for at least some protocols, teachers report that
student practices, with much less teacher description and analysis called for than with other instructional collections. The protocols circumscribe what
types of assignments are assessed, often asking for
a mix of four to six typical and challenging assignments that produce written student work. Often
researchers sample assignments across the school
year and allow for some teacher choice in which
assignment is assessed.
Both artifact and instructional collection instruments have been used for various purposes, ranging
from formative feedback for the improvement of
teaching practice to licensure and high-stakes
advanced certification decisions. For example, the
Scoop Notebook is an instructional collection protocol that has been used to improve professional practice (Borko et al., 2007; Borko, Stecher, Alonzo,
Moncure, & McClam, 2005). The portfolio protocol
for NBPTS certification is used as a part of a voluntary high-stakes assessment for advanced certification status (e.g., Cantrell, Fullerton, Kane, &
Staiger, 2008; National Research Council, 2008;
Szpara & Wylie, 2007). Related protocols have been
used as part of licensure (e.g., California Commission on Teacher Credentialing, 2008; Connecticut
State Department of Education, Bureau of Program
and Teacher Evaluation, 2001), and three artifact
protocols documented in the research literature
have been used for the improvement of practice,
judgments about school quality, and the evaluation
of school reform models (Junker et al., 2006; Koh &
Luke, 2009; Matsumura & Pascal, 2003; Mitchell
et al., 2005; Newmann et al., 2001). These protocols
vary in the degree to which they require the teacher
to submit evidence that is naturalistic (i.e., already
exists as a regular part of teaching practice) or evidence that is created specifically for inclusion in the
assessment (e.g., written reflections or interviews).
All of the protocols reviewed have been developed to reflect a community’s view of quality teaching. In the high-stakes assessments (e.g., the
now-redesigned Connecticut’s Beginning Educator
Support and Training [BEST] portfolio assessment9
and the NBPTS observation protocol), stakeholder
9 BEST has been redesigned and, as of the 2009–2010 school year, is now known as the Teacher Education and Mentoring (TEAM) Program. This paper considers BEST as it existed before the redesign.
Validity challenges to the measurement of teacher
practices. Across these different measures of
teacher and practice, valid inferences about teaching quality will depend in large part on the ability
to address the following issues. First, claims about
teacher effectiveness must take into account contextual factors that individuals do not control. For
example, as teachers are required to focus on test
preparation activities, an increasingly common practice in recent years (Stecher, Vernez, & Steinberg,
2010), qualities of instruction valued by particular
protocols may become less visible. Teachers who
work within certain curricula may be judged to be
more effective, not necessarily because of their own
abilities, but because they are working with a curriculum that supports practices valued by particular measurement instruments (e.g., Cohen, 2010).
Causal claims based on any single instrument may
be inappropriate and can be better justified by considering evidence from multiple measures.
Second, issues of bias and fairness need to be
examined and addressed. As with other assessment
measures, there must be vigilance to ensure that
measures do not, for construct-irrelevant reasons,
privilege teachers with particular backgrounds.
Aside from the NBPTS and Connecticut’s previous
BEST instructional collection research, there is very
little research to suggest the field understands the
bias and fairness implications of specific protocols.
This is understandable given the more formative
uses of many of the instruments; however, as stakes
are attached, this will not be an acceptable state of
affairs for either legal or ethical reasons.
Finally, implementation of these protocols is critical to the validity of the instrument for specific
uses. Even if there is validity evidence for a particular measure, such evidence is dependent on implementing the protocols in particular ways, for
example, with well-trained and calibrated raters.
Using a protocol that has yielded valid inferences in
one context with a specific set of processes in place
does not guarantee that inferences made in a similar
context with different implementation processes will
yield those valid inferences. States and districts will
have to monitor implementation specifics closely,
given the budgetary and human capital constraints
under which they will operate.
preparing an instructional collection improves their
practice (e.g., Moss et al., 2004; Tucker, Stronge,
Gareis, & Beers, 2003). Across protocols, however,
teachers often feel that the collections are burdensome.
Evidence on the relationship of scores to teaching practice and student achievement is modest and mixed,
depending on the instrument under investigation
(e.g., National Research Council, 2008; M. Wilson
et al., 2006). Instruments that focus on evaluating
the products of classroom interactions rather than
the teacher’s commentary on those products in collections seem to have stronger evidence for a relationship to student learning (e.g., Cantrell et al.,
2008; Matsumura et al., 2006). Consistent with this
trend, there is a somewhat stronger, more moderate
relationship between scores on artifact protocols
and student achievement (Matsumura & Pascal,
2003; Matsumura et al., 2006; Mitchell et al., 2005;
Newmann et al., 2001). This relationship may be
due to the fact that artifact protocols are, by definition, more narrowly connected to teaching practice.
If these instruments are to become more widely used
in teacher evaluation, there will need to be a stronger understanding of teacher choice in selecting
assignments and teacher-provided description and
reflection. There will also have to be a stronger
understanding of the role of grade-level, school, and
district curricular decisions that could prove thorny
when attributing scores to individual teachers.
Teacher Beliefs
This category represents a mix of various kinds of
measures that have been used to target different
constructs about teaching. They include measures
that range from personal characteristics and
teacher beliefs to abilities to make judgments on
others’ teaching, typically through some type of
survey or questionnaire methodology. Collectively,
this body of work has tried to identify proxy measures of beliefs, attitudes, and understandings that
could predict who would become a good teacher
and that could provide guidance for individuals
and systems as to whether individuals were suited
to the profession of teaching, generally, and to particular teaching specialties and job placements,
more specifically.
believes that teachers in general can determine student outcomes. This work highlights the continuing
challenges in clarifying the personality constructs of
interest. Ashton and Webb (1986), Gibson and
Dembo (1984), and Woolfolk and Hoy (1990) all
made the distinction between beliefs about what
teachers in general can do to affect student outcomes (teacher efficacy) and beliefs about what they as individuals can do to affect student outcomes (personal efficacy). Guskey and Passaro (1994) rejected
this distinction as an artifact of instrument design
and instead argued that two factors of efficacy—
internal and external locus of control—reflected the
extent to which teachers viewed themselves as having the ability to influence student learning. This
work builds on the finding of Armor et al. (1976),
who did find a modest relationship between student
achievement gains and a composite measure of
teacher beliefs based on the following statements:
The regrettable fact is that many of the
studies have not produced significant
results. Many others have produced only
pedestrian findings. For example, it is
said after the usual inventory tabulation
that good teachers are friendly, cheerful,
sympathetic, and morally virtuous rather
than cruel, depressed, unsympathetic,
and morally depraved. But when this has
been said, not very much that is especially useful has been revealed . . . . What
is needed is not research leading to the
self-evident but to the discovery of specific and distinctive features of teacher
personality and of the effective teacher.
(Getzels & Jackson, 1963, p. 574)
In the ensuing years, efforts have been undertaken to make progress beyond this earlier state of
affairs. A large body of work has focused on teacher
efficacy—that is, the extent to which an individual
1. When it comes right down to it, a teacher really
can’t do much because most of a student’s motivation and performance depends on his or her
home environment.
2. If I try really hard, I can get through to even the
most difficult or unmotivated student.
Almost 50 years ago, Getzels and Jackson (1963)
reviewed the extant literature linking personality
characteristics to teaching quality. Finding relationships somewhat elusive, they highlighted three substantial obstacles that remain relevant in the 21st
century. First, they raised the problem of defining
personality. Although personality theory has certainly
evolved substantially over the last half-century, the
identification of personality characteristics that are
theoretically and empirically important to teaching is
still underspecified. Second, they argued that the instrumentation and theory available to measure personality were relatively weak. The reliance on correlations among measures without strong theories that link personality constructs to practice persists (e.g., Fang,
1996). The third fundamental challenge is the limitation of the criterion measures—what are the measures of teacher quality that personality measures are
referenced against? Typical criterion measures that
Getzels and Jackson (1963) reviewed included principal ratings, teacher self-reports, and experience. As
can be seen throughout this review, although great
effort has been and is being made in defining quality
of teaching, the issues are hardly resolved.
After reviewing a large body of research, they reached humbling conclusions:
Students who showed the greatest gains had teachers who disagreed with the first statement and
agreed with the second.
The field continues to be characterized by, at
best, modest correlations between measures of personality, dispositions, and beliefs and academic outcome measures. This, however, has not stopped the
search for such measures. Metzger and Wu (2008)
reviewed the available evidence for a widely used
commercially available product, Gallup’s Teacher
Perceiver Interview (TPI). They attributed the modest findings to possibilities that teachers’ responses
in these self-report instruments may not be accurate
reflections of their operating belief systems and that
the manifestation of characteristics may be far more
context bound than general instruments acknowledge. They concluded, as others have, that the constructs being examined are “slippery” (Metzger &
Wu, 2008, p. 934) and multifaceted, making it very
difficult to detect relationships. The validity argument for this group of measures is weak.
Student Beliefs and Student Practices
Both teachers and students contribute to teaching
quality. The measures used to assess teaching quality through the assessment of student beliefs and
practices may be considered. As Figure 20.1 indicates, some of the instruments being used to assess
teacher beliefs and practices also assess student
beliefs and practices. For example, on the holistic
observation protocol called the Classroom Assessment Scoring System (CLASS) developed by Pianta,
Hamre, Haynes, Mintz, and La Paro (2007), raters
are trained to observe both teacher practices and
student practices. Secondary classrooms that, for
example, receive high scores on the quality of feedback dimension of CLASS would have students
engaging in back-and-forth exchanges with the
teacher, demonstrating persistence, and explaining
their thinking in addition to all of the teacher’s
actions specified in that dimension. This focus on
both teacher and student practices is common
across the observation protocols reviewed in this
section.
Many instruments are designed to measure student beliefs and practices on a wide range of topics
from intelligence to self-efficacy to critical thinking
(e.g., Dweck, 2002; Stein, Haynes, Redding, Ennis, &
Cecil, 2007; Usher & Pajares, 2009). A summary of
this research is outside the scope of this chapter, but
only one identified belief instrument is being used to
evaluate teachers. On the basis of a decade of work
by Ferguson and his colleagues in the Tripod Project
(Ferguson, 2007), the MET project is using the Tripod assessment to determine the degree to which
students’ perceptions on seven topics are predictive
of other aspects of teaching quality (Bill and Melinda
Gates Foundation, 2011b). Preliminary evidence
suggests the assessment is internally reliable
(coefficient alpha > .80) when administered in such
a way that there are no stakes for students and
teachers (i.e., a research setting). Results on how
the instrument functions in situations in which
there are consequences for teachers have not yet
been published.
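For readers unfamiliar with the internal-consistency index cited above, the following sketch computes coefficient alpha from a small matrix of hypothetical survey responses; the values are invented and are not Tripod data.

```python
import numpy as np

def coefficient_alpha(items):
    """Cronbach's coefficient alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    n_items = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()    # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_vars / total_var)

# Hypothetical 5-point responses from six students on four survey items
responses = [[4, 5, 4, 4],
             [2, 2, 3, 2],
             [5, 4, 5, 5],
             [3, 3, 3, 4],
             [4, 4, 5, 4],
             [1, 2, 2, 1]]
print(round(coefficient_alpha(responses), 2))
```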
Student Knowledge
Value-added models. Over recent years, there has
been great enthusiasm for incorporating measures
of student achievement into estimates of how well
teachers are performing. This approach has led policy makers and researchers to advocate for the use of
value-added measures to evaluate individual teachers. Value-added measures use complex analytic
methods applied to longitudinal student achievement data to estimate teacher effects that are separate from other factors shaping student learning.
Comprehensive, nontechnical treatments of valueadded approaches are presented by Braun (2005b)
and the National Research Council and National
Academy of Education (2010).
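The core logic, stripped of the statistical machinery used in operational systems, can be illustrated with a covariate-adjustment sketch in which current scores are regressed on prior scores and teacher indicators. The data below are fabricated, and real value-added models typically add student and classroom covariates, multiple prior years, and shrinkage of the teacher effects.

```python
import numpy as np
import pandas as pd

# Toy longitudinal records: one row per student with prior and current test scores.
df = pd.DataFrame({
    "teacher": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "prior":   [480, 510, 495, 520, 500, 515, 470, 490, 505],
    "current": [500, 525, 512, 548, 530, 543, 478, 499, 515],
})

# Design matrix: intercept, prior score, and teacher indicator contrasts.
X = pd.get_dummies(df[["prior", "teacher"]], columns=["teacher"], drop_first=True)
X.insert(0, "intercept", 1.0)

# Ordinary least squares; the teacher coefficients are crude "value-added"
# estimates relative to the omitted teacher, conditional on prior achievement.
beta, *_ = np.linalg.lstsq(X.to_numpy(dtype=float),
                           df["current"].to_numpy(dtype=float), rcond=None)
print(dict(zip(X.columns, beta.round(2))))
```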
The attraction of value-added methods to many
is that they are objective measures that avoid the
complexities associated with human judgment. They
are also relatively low cost once data systems are in
place, and they do not require the human capital
and ongoing attention required by many of the previously described measures. Finally, policy makers
are attracted to the idea of applying a uniform metric to all teachers, provided test scores are available.
Although these models are promising, they have
important methodological and political limitations
that represent challenges to the validity of inferences
based on VAM (Braun, 2005b; Clotfelter et al., 2005;
Gitomer, 2008a; Kupermintz, 2003; Ladd, 2007;
Lockwood et al., 2007; National Research Council
and National Academy of Education, 2010; Raudenbush, 2004; Reardon & Raudenbush, 2009).
These challenges can be summarized into two broad and related categories; they are not unique to VAM. However, because VAM has
been so widely endorsed in policy circles and
because it is viewed as having an objective credibility that other measures do not, it is particularly
important to highlight these challenges with respect
to VAM.
A first validity challenge concerns the nature of
the construct. One distinguishes between teacher
and teaching effectiveness because a variety of factors may influence the association of scores with an
individual teacher. For example, school resources,
particularly those targeted at instruction (e.g.,
Cohen, Raudenbush, & Ball, 2003), specific curricula (e.g., Cohen, 2010; Tyack & Cuban, 1995), and
district policies that provide financial, technical, and
professional support to achieve instructional goals
continues to attempt to address these validity challenges and to understand the most appropriate use
of VAM within evaluation systems. Researchers and
policy makers vary in their confidence that these issues can be resolved in ways that lead to the improvement of educational
practice (for two distinct perspectives, see Baker
et al., 2010; Glazerman et al., 2010).
Student learning objectives. Evaluation policies
must include all teachers. If student achievement is
to be a core component of these evaluation systems,
policy makers must address the fact that there are no
annual achievement test data appropriate to evaluate roughly 50%–70% of teachers, either because of
the subjects or grade levels that they teach. One of
the solutions proposed has been the development
of measures using student learning objectives (SLOs;
Community Training and Assistance Center, 2008).
In these models, teachers articulate a small set of
objectives and appropriate assessments to demonstrate that students are learning important concepts
and skills in their classrooms. SLOs are reviewed
by school administrators for acceptability. Teachers
are evaluated on how well the SLOs are achieved, as indicated by assessment results. Because
of the limited applicability of VAM, SLOs are being
considered for use in numerous state teaching evaluation systems (e.g., Rhode Island, Maryland, and
New York). Many of these models include the development of common SLOs for use across a state.
The integrity of the process rests on the quality
of the objectives and the rigor with which they are
produced and reviewed inside the educational system. To date, there is a very limited set of data to
judge the validity of these efforts. Available studies
have found, first, that developing high-quality objectives that identify important learning goals is challenging. The Community Training and Assistance
Center (2004) reported that for the first 3 years of a
4-year study, a majority of teachers produced SLOs
that needed improvement. Teachers failed to identify important and coherent learning goals and had
low expectations for students. Studies do report,
however, that teachers with stronger learning goals
tend to have students who demonstrate better
achievement (Community Training and Assistance
Center, 2004; Lussier & Forgione, 2010). There are
(e.g., Ladd, 2007) all can influence what gets taught
and how it gets taught, potentially influencing the
student test scores that are used to produce VAM
estimates and inferences about the teacher. There
are other interpretive challenges as well: Other
adults (both parents and teachers) may contribute to
student test results, and the limits of student tests
may inappropriately constrain the inference to the
teacher (for a broad discussion of construct-relevant
concerns, see Baker et al., 2010).
A second set of issues concerns the internal
validity of VAM. One aspect of internal validity
requires that VAM estimates are attributable to the
experience of being in the classroom and not attributable to preexisting differences between students
across different classrooms. Furthermore, internal
validity requires that VAM estimates are not attributable to other potential modeling problems. Substantial treatment of these methodological issues
associated with VAM is provided elsewhere (Harris &
McCaffrey, 2010; McCaffrey, Lockwood, Koretz, &
Hamilton, 2003; National Research Council and
National Academy of Education, 2010; Reardon &
Raudenbush, 2009). Key challenges include the fact
that students are not randomly assigned to teachers
within and across schools. This makes it difficult to
interpret whether VAM effects are attributable to
teachers or the entering characteristics of students
(e.g., Clotfelter et al., 2005; Rothstein, 2009). Model
assumptions that attempt to adjust for this sorting
have been shown to be problematic (National
Research Council and National Academy of Education, 2010; Reardon & Raudenbush, 2009). Finally,
choices about the content of the test (e.g., Lockwood et al., 2007), the scaling (e.g., Ballou, 2008;
Briggs, 2008; Martineau, 2006), and the fundamental measurement error inherent in achievement tests
and especially growth scores can “undermine the
trustworthiness of the results of value-added methods” (Linn, 2008, p. 13).
Bringing together these two sets of validity concerns suggests that estimates of a particular teacher’s
effectiveness may vary substantially as a function of
the policies and practices in place for a given teacher,
the assignment of students to teachers, and the
particular tests and measurement models used
to calculate VAM. Substantial research into VAM
also indications that across systems, SLOs can lead
teachers to have stronger buy-in to the evaluation
system than has been demonstrated with other evaluation approaches (Brodsky, DeCesare, & Kramer-Wine, 2010).
measurement questions will need to be addressed to meet the Standards for Educational and Psychological Testing (AERA et al., 1999).
Compensatory or Conjunctive Decisions
One question concerns the nature of the decision
embedded in the system. In a conjunctive system,
individuals must satisfy a standard for each constituent measure, whereas in a compensatory system,
individuals can do well on some measures and less
well on others as long as a total score reaches some
criterion. In a conjunctive model, the reliability of
each individual measure ought to be sufficiently high
such that decisions based on each individual measure
are defensible. A compensatory model, such as that
used by NBPTS, does not carry the same burden, but
it does lead to situations in which someone can satisfy an overall requirement and perform quite poorly
on constituent parts. One compromise that is sometimes taken is to adopt a compensatory model, yet set
some minimum scores for particular measures.
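The distinction, including the common compromise of per-measure minimums, can be expressed as two simple decision rules; the measure names, weights, and cut scores below are hypothetical.

```python
def conjunctive_pass(scores, cut_scores):
    """Conjunctive rule: every measure must meet its own cut score."""
    return all(scores[m] >= cut_scores[m] for m in cut_scores)

def compensatory_pass(scores, weights, composite_cut, minimums=None):
    """Compensatory rule: the weighted composite must meet a single cut score,
    optionally with per-measure minimums as a compromise."""
    if minimums and any(scores[m] < minimums[m] for m in minimums):
        return False
    composite = sum(weights[m] * scores[m] for m in weights)
    return composite >= composite_cut

scores = {"observation": 3.4, "value_added": 1.8, "slo": 3.0}   # hypothetical measures
print(conjunctive_pass(scores, {"observation": 3.0, "value_added": 2.5, "slo": 2.5}))  # False
print(compensatory_pass(scores, {"observation": 0.5, "value_added": 0.3, "slo": 0.2},
                        composite_cut=2.6))                                            # True
```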
COMBINING MULTIPLE MEASURES
Standard 14.13—When decision makers
integrate information from multiple tests
or integrate test and nontest information,
the role played by each test in the decision process should be clearly explicated,
and the use of each test or test composite
should be supported by validity evidence;
Standard 14.16—Rules and procedures
used to combine scores on multiple
assessments to determine the overall
outcome of a credentialing test should be
reported to test takers, preferably before
the test is administered.
Policy discussions are now facing the challenge of
integrating information from the various measures
considered thus far as well as measures that are specific to particular assessment purposes. The Standards for Educational and Psychological Testing
(AERA et al., 1999) provide guidance on the use of
multiple measures in decisions about employment
and credentialing:
Current policies and practices are only beginning
to be developed. For example, the U.S. Department
of Education’s (2010) Race to the Top competition
asks states to
Design and implement rigorous, transparent, and fair evaluation systems for
teachers and principals that (a) differentiate effectiveness using multiple rating
categories that take into account data
on student growth (as defined in this
notice) as a significant factor, and (b) are
designed and developed with teacher and
principal involvement. (p. 34)
How these multiple ratings are accounted for,
however, is left unstated. As states and districts
grapple with these issues, a number of fundamental
Determining and Using Weighting
Schemes
Some proposed systems (e.g., Bill and Melinda Gates
Foundation, 2010) are trying to establish a single
metric of teacher effectiveness that is based on a
composite of measures. Efforts like these attempt to
determine the weighting of particular measures
based on statistical models that will maximize the
variance accounted for by particular measures.
At least two complexities will need to be kept in
mind by whatever weighting scheme is used. First, if
two measures have the same “true” relationship with
a criterion variable, the one that is scored more reliably will be more predictive of the criterion and thus
will be assigned a greater weight. Because of such differences in scoring reliability, some measures, or dimensions of measures, may appear more predictive of the outcome than they actually are when compared with other, less reliably scored measures.
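A small simulation illustrates this attenuation: two measures with the same true relationship to a criterion, differing only in how reliably they are scored, yield different observed correlations. The reliabilities and sample size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_quality = rng.normal(size=n)                       # latent teacher quality
criterion = true_quality + rng.normal(scale=1.0, size=n)

def observed(reliability):
    """Add error so that the observed measure has the requested reliability."""
    error_var = (1 - reliability) / reliability         # true-score variance is 1
    return true_quality + rng.normal(scale=np.sqrt(error_var), size=n)

reliable = observed(0.90)      # carefully scored measure
unreliable = observed(0.50)    # noisily scored measure
print(np.corrcoef(reliable, criterion)[0, 1].round(2))    # higher observed correlation
print(np.corrcoef(unreliable, criterion)[0, 1].round(2))  # attenuated, despite same true relation
```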
A second source of potential complexity is that
measures that have greater variability across individuals are likely to have a stronger impact on a total
evaluation score, so that their effective weight will be far larger than the assigned weight would indicate. Imagine a system that derived a composite
The exercise of judgment. Systems can range
from those in which a single metric is derived from
multiple measures via a mathematical model to ones
in which decision makers are required to exercise a
summative judgment that takes into account multiple measures. Systems that avoid judgment often
do so because of a lack of trust in the judgment
process. If judgment is valued, as it is in many high-performing education systems, then it will be imperative to ensure that judgments are executed in ways
that are credible and transparent.
Rare yet important teaching characteristics.
Finally, there may be characteristics that do not contribute to variance on valued outcomes but that should nevertheless be included in composite measures. For example, we
may believe that teachers should not make factual
errors in content or be verbally abusive to students.
These might be rare events and do little to help distinguish between teachers; however, robust evaluation systems might want to include them to make
standards of professional conduct clear. Weighting
schemes that rely solely on quantitative measurable
outcomes run the risk of ignoring these important
characteristics.
CONCLUSION
An ambitious policy agenda that includes teacher
evaluation as one of its cornerstones places an
unprecedented obligation on the field of education
measurement to design, develop, and validate
measures of teaching quality. There is a pressing
need for evaluation systems that can support the
full range of purposes for which they are being
considered—from employment and compensation
decisions to professional development. Doing this
responsibly obligates the field to uphold the fundamental principles and standards of education measurement in the face of enormous policy pressures.
Well-intentioned policies will be successful only if
they are supported by sound measurement practice.
Building well-designed measures of effective
teaching will require coordinated developments in
theory, design, and implementation, along with
ongoing monitoring processes. Through ongoing
validation efforts, enhancements to each of these
critical components should be realized. This process
also will require implementation of imperfect systems that can be subject to continued examination
and refinement. The discipline needs to continue to
examine the fairness and validity of interpretations
and develop processes that ensure high-quality and
consistent implementation of whichever measures
are being employed. Such quality control can range
from ensuring quality judgments from individuals
rating teacher performance to ensuring that adequate data quality controls are in place for value-added modeling.
It is important that sound measurement practices
be developed and deployed for these emerging evaluation systems, but there may be an additional
benefit to careful measurement work in this area.
Theories of teaching quality continue to be underdeveloped. Sound measures can contribute to both the
testing of theory and the evolution of theories about
teaching. For example, as educators understand
more about how contextual factors influence teaching quality, theories of teaching will evolve. Understanding the relationship between context and
teaching quality also may lead to the evolution and
improvement of school and district decisions that
shape student learning.
For the majority of instruments reviewed in this
chapter, their design can be considered first generation. Whether for measures of teacher knowledge, instructional collections, or observation methods,
there is a great deal to be done in terms of design of
protocols, design of assessments and items, training
and calibration of raters, aggregation of scores, and
psychometric modeling. Even the understanding of
expected psychometric performance on each class of
teaching quality score based on value-added and
principal evaluation scores and also imagine that
each was assigned a weight of 50%. Now imagine
that the principal did not differentiate teachers
much, if at all. In this case, even though each measure was assigned a weight of 50%, the value-added
measure actually contributes almost all the variance
to the total score. Thus, it is important not to just
assign an intended weight but also to understand
the effective weight given the characteristics of the
scores when implemented (e.g., range, variance,
measurement error, etc.).
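The scenario above can be checked directly: with nominal 50/50 weights but a principal rating that barely varies, nearly all of the composite variance comes from the value-added component. The values below are invented solely to illustrate the point.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
value_added = rng.normal(loc=0.0, scale=1.0, size=n)           # standardized VAM scores
principal = np.full(n, 3.0) + rng.normal(scale=0.05, size=n)   # principal barely differentiates

composite = 0.5 * value_added + 0.5 * principal                # nominal 50/50 weights

# Share of composite variance contributed by each (independent) component:
va_share = (0.5 ** 2) * value_added.var() / composite.var()
pr_share = (0.5 ** 2) * principal.var() / composite.var()
print(round(va_share, 3), round(pr_share, 3))   # roughly 1.0 versus 0.0: VAM dominates
```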
References
American Educational Research Association, American
Psychological Association, & National Council on
Measurement in Education. (1999). Standards for
educational and psychological testing. Washington,
DC: Authors.
American Federation of Teachers. (2010). A continuous
improvement model for teacher development and evaluation (Working paper). Washington, DC: Author.
Armor, D., Conroy-Oseguera, P., Cox, M., King, N.,
McDonnell, L., Pascal, A., & Zellman, G. (1976).
Analysis of the school preferred reading programs in
selected Los Angeles minority schools (Report No.
R-2007-LAUSD). Santa Monica, CA: RAND.
Aschbacher, P. R. (1999). Developing indicators of classroom practice to monitor and support school reform
(CSE Technical Report No. 513). Los Angeles, CA:
Center for the Study of Evaluation, National Center
for Research on Evaluation, Standards, and Student
Testing (CRESST)/UCLA.
Ashton, P. T., & Webb, R. B. (1986). Making a difference: Teacher efficacy and student achievement. White
Plains, NY: Longman.
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel,
E., Ladd, H. F., Linn, R. L., . . . Shepard, L. A. (2010).
Problems with the use of student test scores to evaluate
teachers (EPI Briefing Paper No. 278). Washington,
DC: Economic Policy Institute.
Baker, S. K., Gersten, R., Haager, D., & Dingle, M.
(2006). Teaching practice and the reading growth of
first-grade English learners: Validation of an observation instrument. Elementary School Journal, 107,
199–220. doi:10.1086/510655
Ball, D. L., Thames, M. H., & Phelps, G. (2008).
Content knowledge for teaching: What makes it
special? Journal of Teacher Education, 59, 389–407.
doi:10.1177/0022487108324554
Ballou, D. (2008, October). Value-added analysis: Issues
in the economics literature. Paper presented at the
workshop of the Committee on Value-Added
Methodology for Instructional Improvement,
Program Evaluation, and Educational Accountability,
National Research Council, Washington, DC.
Retrieved from http://www7.nationalacademies.org/
bota/VAM%20Analysis%20-%20Ballou.pdf
Barnette, J. J., & Gorham, K. (1999). Evaluation of
teacher preparation graduates by NCATE accredited
institutions: Techniques used and barriers. Research
in the Schools, 6(2), 55–62.
Bell, C. A., Gitomer, D. H., McCaffrey, D., Hamre,
B., Pianta, R., & Qi, Y. (in press). An argument approach to observation protocol validity.
Educational Assessment, 17(2–3), 1–26.
Bell, C. A., Little, O. M., & Croft, A. J. (2009, April).
Measuring teaching practice: A conceptual review.
Paper presented at the Annual Meeting of the
American Educational Research Association, San
Diego, CA.
Bell, C. A., & Youngs, P. (2011). Substance and show:
Understanding responses to NCATE accreditation. Teaching and Teacher Education, 27, 298–307.
doi:10.1016/j.tate.2010.08.012
measures is at a preliminary stage. Importantly,
most of the work on these measures done to date
has been conducted in the context of research studies. There is little empirical understanding of how
these measures will work in practice, with all the
unintended consequences, incentives, disincentives,
and competing priorities that characterize education
policy.
There is at least one crucial aspect of the current
policy conversation that may prove to be the Achilles’ heel of the new systems being developed, should
it go unchecked. In general, all of the currently envisioned systems layer additional tasks, costs, and data
management burdens on school, district, and state
resources. Observations take principals’ time. SLOs
take teachers’, principals’, and districts’ time. Student questionnaires take students’ time. Data systems that track all of these new measures require
money and time. And the list goes on. These systems
are massive because they are intended to apply to all
teachers in every system. Serious consideration has
not been given to how institutions can juggle existing resource demands with these new demands. The
resource pressures these evaluation systems place on
institutions may result in efficiencies, but they may
also result in significant pressure to cut measurement corners that could pose threats to the validity
of the systems. Such unintended consequences must
be monitored carefully.
Although the new evaluation systems will require
substantial resources, the justification for moving
beyond measures that simply assign a ranking is that
these kinds of measures can provide helpful information to stakeholders about both high-quality teaching and the strengths and weaknesses of teachers and
school organizations in providing students access to
that teaching. Actualizing such a useful system will
require commitments by researchers, policy makers,
and practitioners alike to proceed in ways that support valid inferences about teaching quality.
Bill and Melinda Gates Foundation. (2011a). Learning
about teaching: Initial findings from the Measures
of Effective Teaching project. Retrieved from http://
www.metproject.org/downloads/Preliminary_
Findings-Research_Paper.pdf
Handbook of research on teaching (pp. 328–375). New
York, NY: Macmillan.
Buddin, R., & Zamarro, G. (2009). Teacher qualifications and student achievement in urban elementary
schools. Journal of Urban Economics, 66, 103–115.
doi:10.1016/j.jue.2009.05.001
Bill and Melinda Gates Foundation. (2011b). Intensive
partnerships for effective teaching. Retrieved from
http://www.gatesfoundation.org/college-readyeducation/Pages/intensive-partnerships-for-effectiveteaching.aspx
Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46,
423–429. doi:10.1016/0895-4356(93)90018-V
California Commission on Teacher Credentialing.
(2008). California teaching performance assessment.
Sacramento, CA: Author.
Borko, H. (2004). Professional development and
teacher learning: Mapping the terrain. Educational
Researcher, 33(8), 3–15. doi:10.3102/00131
89X033008003
Cantrell, S., Fullerton, J., Kane, T. J., & Staiger, D. O.
(2008). National Board certification and teacher effectiveness: Evidence from a random assignment experiment (NBER Working Paper No. 14608). Cambridge,
MA: National Bureau of Economic Research.
Borko, H., Stecher, B., & Kuffner, K. (2007). Using artifacts to characterize reform-oriented instruction: The
Scoop Notebook and rating guide (CSE Technical
Report No. 707). Los Angeles, CA: Center for the
Study of Evaluation, National Center for Research
on Evaluation, Standards, and Student Testing
(CRESST)/UCLA.
Carnegie Forum on Education and the Economy. (1986).
A nation prepared: Teachers for the 21st century. New
York, NY: Carnegie.
Clotfelter, C., Ladd, H., & Vigdor, J. (2005). Who teaches
whom? Race and the distribution of novice teachers. Economics of Education Review, 24, 377–392.
doi:10.1016/j.econedurev.2004.06.008
Borko, H., Stecher, B. M., Alonzo, A. C., Moncure,
S., & McClam, S. (2005). Artifact packages for
characterizing classroom practice: A pilot study.
Educational Assessment, 10, 73–104. doi:10.1207/
s15326977ea1002_1
Boston, M., & Wolf, M. K. (2006). Assessing academic
rigor in mathematics instruction: The development of
the Instructional Quality Assessment Toolkit (CSE
Technical Report No. 672). Los Angeles, CA: Center
for the Study of Evaluation, National Center for
Research on Evaluation, Standards, and Student
Testing (CRESST)/UCLA.
Braun, H. I. (2005a). Using student progress to evaluate
teachers: A primer on value-added models. Princeton,
NJ: ETS. Retrieved from http://www.ets.org/Media/
Research/pdf/PICVAM.pdf
Braun, H. I. (2005b). Value-added modeling: What does
due diligence require? In R. Lissitz (Ed.), Valueadded models in education: Theory and applications
(pp. 19–39). Maple Grove, MN: JAM Press.
Briggs, D. (2008, November). The goals and uses of valueadded models. Paper presented at the workshop of
the Committee on Value-Added Methodology for
Instructional Improvement, Program Evaluation,
and Educational Accountability, National Research
Council, Washington, DC. Retrieved from http://
www7.nationalacademies.org/bota/VAM%20
Goals%20and%20Uses%20paper%20-%20Briggs.pdf
Brodsky, A., DeCesare, D., & Kramer-Wine, J. (2010).
Design and implementation considerations for alternative teacher compensation systems. Theory Into Practice,
49, 213–222. doi:10.1080/00405841.2010.487757
Brophy, J., & Good, T. L. (1986). Teacher behavior
and student achievement. In M. C. Wittrock (Ed.),
Cochran-Smith, M., & Zeichner, K. M. (Eds.). (2005).
Studying teacher education: The report of the AERA
panel on research and teacher education. Mahwah, NJ:
Erlbaum.
Cohen, D. (2010). Teacher quality: An American educational dilemma. In M. M. Kennedy (Ed.), Teacher
assessment and the quest for teacher quality: A handbook (pp. 375–402). San Francisco, CA: Jossey-Bass.
Cohen, D., Raudenbush, S., & Ball, D. (2003).
Resources, instruction, and research. Educational
Evaluation and Policy Analysis, 25, 119–142.
doi:10.3102/01623737025002119
Community Training and Assistance Center. (2004).
Catalyst for change: Pay for performance in Denver.
Boston, MA: Author.
Community Training and Assistance Center. (2008).
Tying earning to learning: The link between teacher
compensation and student learning objectives. Boston,
MA: Author.
Connecticut State Department of Education, Bureau of
Program and Teacher Evaluation. (2001). A guide to
the BEST program for beginning teachers. Hartford,
CT: Author.
Danielson, C. (1996). Enhancing professional practice: A
framework for teaching. Alexandria, VA: Association
for Supervision and Curriculum Development.
Danielson, C., & McGreal, T. L. (2000). Teacher evaluation to enhance professional practice. Alexandria,
VA: Association for Supervision and Curriculum
Development.
Desimone, L., Porter, A. C., Garet, M., Suk Yoon,
K., & Birman, B. (2002). Effects of professional
development on teachers’ instruction: Results
from a three-year longitudinal study. Educational
Evaluation and Policy Analysis, 24, 81–112.
doi:10.3102/01623737024002081
of admissions and licensure testing (ETS Teaching
and Learning Report Series No. ETS RR-03-25).
Princeton, NJ: ETS.
Gitomer, D. H., & Qi, Y. (2010). Score trends for
Praxis II. Washington, DC: U.S. Department of
Education, Office of Planning, Evaluation and Policy
Development, Policy and Program Studies Service.
Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D.,
Raudenbush, S., & Whitehurst, G. (2010).
Evaluating teachers the important role of value-added.
Washington, DC: Brown Center on Education.
Educational Testing Service. (2001). PRAXIS III:
Classroom performance assessments orientation guide.
Princeton, NJ: Author.
Goe, L. (2007). The link between teacher quality and
student outcomes. Washington, DC: National
Comprehensive Center for Teacher Quality.
Fang, Z. (1996). A review of research on teacher beliefs
and practices. Educational Research, 38, 47–65.
doi:10.1080/0013188960380104
Goldhaber, D., & Hansen, M. (2010). Race, gender, and
teacher testing: How informative a tool is teacher
licensure testing? American Educational Research
Journal, 47, 218–251. doi:10.3102/0002831209348970
Dweck, C. S. (2002). The development of ability conceptions. In A. Wigfield & J. S. Eccles (Eds.),
Development of achievement motivation (pp. 57–88).
San Diego, CA: Academic Press. doi:10.1016/B978012750053-9/50005-X
Ferguson, R. F. (2007). Toward excellence with equity:
An emerging vision for closing the achievement gap.
Boston, MA: Harvard Education Press.
Gonzales, P., Williams, T., Jocelyn, L., Roey, S., Kastberg,
D., & Brenwald, S. (2008). Highlights from TIMSS
2007: Mathematics and science achievement of U.S.
fourth- and eighth-grade students in an international
context (NCES 2009-001 Revised). Washington,
DC: National Center for Education Statistics,
Institute of Education Sciences, U.S. Department
of Education. Retrieved from http://nces.ed.gov/
pubs2009/2009001.pdf
Getzels, J. W., & Jackson, P. W. (1963). The teacher’s
personality and characteristics. In N. L. Gage (Ed.),
Handbook of research on teaching (pp. 506–582).
Chicago, IL: Rand McNally.
Gersten, R., Baker, S. K., Haager, D., & Graves, A. W.
(2005). Exploring the role of teacher quality in
predicting reading outcomes for first-grade English
learners. Remedial and Special Education, 26, 197–
206. doi:10.1177/07419325050260040201
Gibson, S., & Dembo, M. (1984). Teacher efficacy: A construct validation. Journal of Educational Psychology,
76, 569–582. doi:10.1037/0022-0663.76.4.569
Gitomer, D. H. (2007). Teacher quality in a changing policy landscape: Improvements in the teacher pool (ETS
Policy Information Report No. PIC-TQ). Princeton,
NJ: ETS.
Gitomer, D. H. (2008a). Crisp measurement and messy
context: A clash of assumptions and metaphors—
Synthesis of Section III. In D. H. Gitomer (Ed.),
Measurement issues and the assessment for teacher
quality (pp. 223–233). Thousand Oaks, CA: Sage.
Gitomer, D. H. (2008b). Reliability and NBPTS assessments. In L. Ingvarson & J. Hattie (Eds.), Assessing
teachers for professional certification: The first decade
of the National Board for professional teaching standards (pp. 231–253). Greenwich, CT: JAI Press.
doi:10.1016/S1474-7863(07)11009-7
Gitomer, D. H., & Bell, C. A. (2012, August). The instructional challenge in improving instruction: Lessons
from a classroom observation protocol. Paper presented at the European Association for Research on
Learning and Instruction Sig 18 Conference, Zurich,
Switzerland.
Gitomer, D. H., Latham, A. S., & Ziomek, R. (1999). The
academic quality of prospective teachers: The impact
Gordon, R., Kane, T. J., & Staiger, D. O. (2006).
Identifying effective teachers using performance
on the job (Hamilton Project Discussion Paper).
Washington, DC: Brookings Institution.
Grossman, P., Loeb, S., Cohen, J., Hammerness, K.,
Wyckoff, J., Boyd, D., & Lankford, H. (2010, May).
Measure for measure: The relationship between measures of instructional practice in middle school English
language arts and teachers’ value-added scores (NBER
Working Paper No. 16015). Cambridge, MA:
National Bureau of Economic Research.
Gullickson, A. R. (2008). The personnel evaluation standards: How to assess systems for evaluating educators.
Thousand Oaks, CA: Corwin Press.
Guskey, T. R., & Passaro, P. D. (1994). Teacher efficacy: A study of construct dimensions. American
Educational Research Journal, 31, 627–643.
Haertel, E. H. (1991). New forms of teacher assessment.
Review of Research in Education, 17, 3–29.
Hamre, B. K., Pianta, R. C., Downer, J. T., DeCoster,
J., Mashburn, A. J., Jones, S., . . . Hakigami, A. (in
press). Teaching through interactions: Testing a
developmental framework of teacher effectiveness in
over 4,000 classrooms. Elementary School Journal.
Haney, W. M., Madaus, G., & Kreitzer, A. (1987).
Charms talismanic: Testing teachers for the improvement of education. Review of Research in Education,
14, 169–238.
Hanushek, E. A. (2002). Teacher quality. In L. T. Izumi &
W. M. Evers (Eds.), Teacher quality (pp. 1–13).
Stanford, CA: Hoover Institution Press.
Horizon Research. (2000). Inside classroom observation
and analytic protocol. Chapel Hill, NC: Author.
Harris, D., & McCaffrey, D. (2010). Value-added:
Assessing teachers’ contributions to student achievement. In M. M. Kennedy (Ed.), Teacher assessment and the quest for teacher quality: A handbook
(pp. 251–282). San Francisco, CA: Jossey-Bass.
Howard, B. B., & Gullickson, A. R. (2010). Setting
standards in teacher evaluation. In M. M. Kennedy
(Ed.), Teacher assessment and the quest for teacher
quality: A handbook (pp. 337–354). San Francisco,
CA: Jossey-Bass.
Harris, D. N., & Sass, T. R. (2006). Value-added models
and the measurement of teacher quality. Unpublished
manuscript.
Ingvarson, L., & Hattie, J. (Eds.). (2008). Assessing teachers for professional certification: The first decade of the
National Board for Professional Teaching Standards.
Greenwich, CT: JAI Press.
Heneman, H. G., Milanowski, A., Kimball, S. M., &
Odden, A. (2006). Standards-based teacher evaluation as a foundation for knowledge- and skill-based
pay (CPRE Policy Briefs No. RB-45). Philadelphia,
PA: Consortium for Policy Research in Education,
University of Pennsylvania.
Jaeger, R. M. (1999). Some psychometric criteria for judging the quality of teacher certification tests. Paper
commissioned by the Committee on Assessment and
Teacher Quality. Greensboro: University of North
Carolina.
Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis,
J. M., Phelps, G. C., Sleep, L., & Ball, D. L. (2008).
Mathematical knowledge for teaching and the
mathematical quality of instruction: An exploratory study. Cognition and Instruction, 26, 430–511.
doi:10.1080/07370000802177235
Hill, H. C., Schilling, S. G., & Ball, D. L. (2004).
Developing measures of teachers’ mathematics
knowledge for teaching. Elementary School Journal,
105, 11–30. doi:10.1086/428763
Hill, H. C., Umland, K. L., & Kapitula, L. R. (2011). A
validity argument approach to evaluating value-added scores. American Educational Research Journal,
48, 794–831. doi:10.3102/0002831210387916
Hines, L. M. (2007). Return of the thought police? The
history of teacher attitude adjustment. Education
Next, 7(2), 58–65. Retrieved from http://educationnext.org/return-of-the-thought-police
Hirsch, E., & Sioberg, A. (2010). Using teacher working
conditions survey data in the North Carolina educator evaluation process. Santa Cruz, CA: New Teacher
Center. Retrieved from http://ncteachingconditions.
org/sites/default/files/attachments/NC10_brief_
TeacherEvalGuide.pdf
Honig, M. I., Copland, M. A., Rainey, L., Lorton, J. A., & Newton, M. (2010, April). School district central office transformation for teaching and learning improvement: A report to the Wallace Foundation. Seattle, WA: The Center for the Study of Teaching and Policy.
Johnson, S. M., Kardos, S. K., Kauffman, D., Liu, E., &
Donaldson, M. L. (2004). The support gap: New
teachers’ early experiences in high-income and low-income schools. Education Policy Analysis Archives,
12(61). Retrieved from http://epaa.asu.edu/ojs/
article/viewFile/216/342
Hill, H. C., Dean, C., & Goffney, I. M. (2007). Assessing
elemental and structural validity: Data from teachers, non-teachers, and mathematicians. Measurement:
Interdisciplinary Research and Perspectives, 5(2–3),
81–92. doi:10.1080/15366360701486999
Hill, H. C., Rowan, B., & Ball, D. L. (2005). Effects
of teachers’ mathematical knowledge for
teaching on student achievement. American
Educational Research Journal, 42, 371–406.
doi:10.3102/00028312042002371
Johnson, S. M., Birkeland, S. E., Kardos, S. K., Kauffman,
D., Liu, E., & Peske, H. G. (2001, September/October).
Retaining the next generation of teachers: The importance of school-based support. Harvard Education
Letter. Retrieved from http://www.umd.umich.edu/
casl/natsci/faculty/zitzewitz/curie/TeacherPrep/99.pdf
Junker, B., Weisberg, Y., Matsumura, L. C., Crosson,
A., Wolf, M. K., Levison, A., & Resnick, L. (2006).
Overview of the Instructional Quality Assessment
(CSE Technical Report No. 671). Los Angeles, CA:
Center for the Study of Evaluation, National Center
for Research on Evaluation, Standards, and Student
Testing (CRESST)/UCLA.
Kane, M. T. (1982). A sampling model for validity.
Applied Psychological Measurement, 6, 125–160.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.),
Educational measurement (pp. 17–64). New York,
NY: Praeger.
Kane, T. J., Rockoff, J. E., & Staiger, D. O. (2006).
What does certification tell us about teacher effectiveness? Evidence from New York City. New York, NY:
National Bureau of Economic Research.
Kardos, S. K., & Johnson, S. M. (2007). On their own
and presumed expert: New teachers’ experiences
with their colleagues. Teachers College Record, 109,
2083–2106.
Kellor, E. M. (2002). Performance-based licensure in
Connecticut (CPRE-UW Working Paper Series
TC-02–10). Madison, WI: Consortium for Policy
Research in Education.
Klein, S. P., & Stecher, B. (1991). Developing a prototype
licensing examination for secondary school teachers. Journal of Personnel Evaluation in Education, 5,
169–190. doi:10.1007/BF00117336
Lussier, D. F., & Forgione, P. D., Jr. (2010). Supporting
and rewarding accomplished teaching: Insights from
Austin, Texas. Theory Into Practice, 49, 233–242. doi:10.1080/00405841.2010.487771
Koh, K., & Luke, A. (2009). Authentic and conventional
assessment in Singapore schools: An empirical study
of teacher assignments and student work. Assessment
in Education: Principles, Policy, and Practice, 16,
291–318.
Luyten, H. (2003). The size of school effects compared
to teacher effects: An overview of the research literature. School Effectiveness and School Improvement, 14,
31–51. doi:10.1076/sesi.14.1.31.13865
Kennedy, M. M. (2010). Approaches to annual performance assessment. In M. M. Kennedy (Ed.), Teacher
assessment and the quest for teacher quality: A handbook (pp. 225–250). San Francisco, CA: Jossey-Bass.
Malmberg, L. E., Hagger, H., Burn, K., Mutton, T., &
Colls, H. (2010). Observed classroom quality during
teacher education and two years of professional practice. Journal of Educational Psychology, 102, 916–932.
doi:10.1037/a0020920
Koretz, D. (2008). Measuring up: What educational testing
really tells us. Cambridge, MA: Harvard University
Press.
Kornfeld, J., Grady, K., Marker, P. M., & Ruddell, M. R.
(2007). Caught in the current: A self-study of state-mandated compliance in a teacher education program. Teachers College Record, 109, 1902–1930.
Martineau, J. A. (2006). Distorting value-added: The use
of longitudinal, vertically scaled student achievement
data for growth-based, value-added accountability.
Journal of Educational and Behavioral Statistics, 31,
35–62. doi:10.3102/10769986031001035
Krauss, S., Baumert, J., & Blum, W. (2008). Secondary
mathematics teachers’ pedagogical content knowledge and content knowledge: Validation of the
COACTIV Constructs. ZDM—The International
Journal on Mathematics Education, 40, 873–892.
Kupermintz, H. (2003). Teacher effects and teacher effectiveness: A validity investigation of the Tennessee
Value Added Assessment System. Educational
Evaluation and Policy Analysis, 25, 287–298.
doi:10.3102/01623737025003287
Matsumura, L. C., Garnier, H., Slater, S. C., & Boston,
M. (2008). Toward measuring instructional interactions at-scale. Educational Assessment, 13, 267–300.
doi:10.1080/10627190802602541
Kyriakides, L., & Creemers, B. P. M. (2008). A
longitudinal study on the stability over time
of school and teacher effects on student outcomes. Oxford Review of Education, 34, 521–545.
doi:10.1080/03054980701782064
Ladd, H. F. (2007, November). Holding schools accountable revisited. Paper presented at APPAM Fall
Research Conference, Washington, DC.
La Paro, K. M., Pianta, R. C., & Stuhlman, M. (2004).
The classroom assessment scoring system: Findings
from the prekindergarten year. Elementary School
Journal, 104, 409–426. doi:10.1086/499760
Linn, R. L. (2008, November 13–14). Measurement issues
associated with value-added models. Paper presented
at the workshop of the Committee on Value-Added
Methodology for Instructional Improvement,
Program Evaluation, and Educational Accountability,
National Research Council, Washington, DC.
Retrieved from http://www7.nationalacademies.org/
bota/VAM_Robert_Linn_Paper.pdf
Livingston, S., & Zieky, M. (1982). Passing scores: A manual for setting standards of performance on educational
and occupational tests. Princeton, NJ: ETS.
Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B. M., Le, V., & Martinez, J. F. (2007). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Journal of Educational Measurement, 44, 47–67. doi:10.1111/j.1745-3984.2007.00026.x
Matsumura, L. C., & Pascal, J. (2003). Teachers’ assignments and student work: Opening a window on classroom practice (CSE Report No. 602). Los Angeles,
CA: Center for the Study of Evaluation, National
Center for Research on Evaluation, Standards, and
Student Testing (CRESST)/UCLA.
Matsumura, L. C., Slater, S. C., Junker, B., Peterson, M.,
Boston, M., Steele, M., & Resnick, L. (2006). Measuring
reading comprehension and mathematics instruction in
urban middle schools: A pilot study of the Instructional
Quality Assessment (CSE Technical Report No. 681).
Los Angeles, CA: Center for the Study of Evaluation,
National Center for Research on Evaluation, Standards,
and Student Testing (CRESST)/UCLA.
McCaffrey, D. F. (2011, April). Sources of variance and
mode effects in measures of teaching in algebra classes.
Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., &
Hamilton, L. S. (2003). Evaluating value-added models
for teacher accountability. Santa Monica, CA: RAND.
Messick, S. (1989). Validity. In R. L. Linn (Ed.),
Educational measurement (3rd ed., pp. 13–103).
New York, NY: American Council on Education and
Macmillan.
Metzger, S. A., & Wu, M. J. (2008). Commercial teacher
selection instruments: The validity of selecting
teachers through beliefs, attitudes, and values.
Review of Educational Research, 78, 921–940.
doi:10.3102/0034654308323035
Milanowski, A. (2004). The relationship between
teacher performance evaluation scores and student
achievement: Evidence from Cincinnati. Peabody
Journal of Education, 79(4), 33–53. doi:10.1207/
s15327930pje7904_3
Mitchell, K., Shkolnik, J., Song, M., Uekawa, K., Murphy,
R., & Means, B. (2005). Rigor, relevance, and results:
The quality of teacher assignments and student work in
new and conventional high schools. Washington, DC:
American Institutes for Research and SRI.
Ohio Department of Education. (2006). Report on the
quality of teacher education in Ohio, 2004–2005.
Columbus, OH: Author.
Pacheco, A. (2008). Mapping the terrain of teacher quality. In D. H. Gitomer (Ed.), Measurement issues
and assessment for teaching quality (pp. 160–178).
Thousand Oaks, CA: Sage.
Pearlman, M. (2008). The design architecture of NBPTS
certification assessments. In L. Ingvarson & J. Hattie
(Eds.), Assessing teachers for professional certification:
The first decade of the National Board for Professional
Teaching Standards (pp. 55–91). Greenwich, CT: JAI
Press. doi:10.1016/S1474-7863(07)11003-6
Phelps, G. (2009). Just knowing how to read isn’t enough!
What teachers know about the content of reading.
Educational Assessment, Evaluation, and Accountability,
21, 137–154. doi:10.1007/s11092-009-9070-6
Phelps, G., & Schilling, S. (2004). Developing measures of
content knowledge for teaching reading. Elementary
School Journal, 105, 31–48. doi:10.1086/428764
Pianta, R. C., Hamre, B. K., Haynes, N. J., Mintz, S. L., &
La Paro, K. M. (2007). Classroom assessment
scoring system manual, middle/secondary version.
Charlottesville: University of Virginia.
Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2007).
Classroom assessment scoring system. Baltimore, MD:
Brookes.
Pianta, R. C., Mashburn, A. J., Downer, J. T., Hamre, B.
K., & Justice, L. (2008). Effects of web-mediated
professional development resources on teacher-child interactions in pre-kindergarten classrooms.
Early Childhood Research Quarterly, 23, 431–451.
doi:10.1016/j.ecresq.2008.02.001
Piburn, M., & Sawada, D. (2000). Reformed Teaching
Observation Protocol (RTOP) reference manual.
Tempe: Arizona State University.
Pilley, J. G. (1941). The National Teacher
Examination Service. School Review, 49, 177–186.
doi:10.1086/440636
Popham, W. J. (1992). Appropriate expectations for
content judgments regarding teacher licensure
tests. Applied Measurement in Education, 5, 285–301.
doi:10.1207/s15324818ame0504_1
Programme for International Student Assessment. (2006).
PISA 2006 science competencies for tomorrow’s world.
Organisation for Economic Co-operation and
Development. Retrieved from http://www.oei.es/eval
uacioneducativa/InformePISA2006-FINALingles.pdf
Pullin, D. (1999). Criteria for evaluating teacher tests:
A legal perspective. Washington, DC: National
Academies Press.
Pullin, D. (2010). Judging teachers: The law of teacher
dismissals. In M. M. Kennedy (Ed.), Teacher assessment and the quest for teacher quality: A handbook
(pp. 297–333). San Francisco, CA: Jossey-Bass.
Molnar, A., Smith, P., Zahorik, J., Palmer, A., Halbach, A., &
Ehrle, K. (1999). Evaluating the SAGE program: A
pilot program in targeted pupil-teacher reduction
in Wisconsin. Educational Evaluation and Policy
Analysis, 21, 165–177.
Moss, P. A., & Schutz, A. (1999). Risking frankness
in educational assessment. Phi Delta Kappan, 80,
680–687.
Moss, P. A., Sutherland, L. M., Haniford, L., Miller, R.,
Johnson, D., Geist, P. K., . . . Pecheone, R. L. (2004).
Interrogating the generalizability of portfolio assessments of beginning teachers: A qualitative study.
Education Policy Analysis Archives, 12(32), 1–70.
National Research Council. (2001). Testing teacher candidates: The role of licensure tests in improving teacher
quality. Washington, DC: National Academies Press.
National Board for Professional Teaching Standards.
(2010). Retrieved from http://nbpts.org/the_standards
National Research Council. (2008). Assessing accomplished teaching: Advanced-level certification programs.
Washington, DC: National Academies Press.
National Research Council and National Academy of
Education. (2010). Getting value out of value-added:
Report of a workshop. Washington, DC: National
Academies Press.
Newmann, F. M., Bryk, A. S., & Nagaoka, J. K. (2001).
Authentic intellectual work and standardized tests:
Conflict or coexistence? Chicago, IL: Consortium on
Chicago School Research.
No Child Left Behind Act of 2001, Pub. L. No. 107-110,
§ 115 Stat 1425 (2002).
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004).
How large are teacher effects? Educational Evaluation
and Policy Analysis, 26, 237–257. doi:10.3102/
01623737026003237
Oakes, J. (1987). Tracking in secondary schools: A contextual perspective. Santa Monica, CA: RAND.
Odden, A., & Kelley, C. (2002). Paying teachers for what
they know and can do: New and smarter compensation
strategies to improve student learning. Thousand Oaks,
CA: Corwin Press.
Schilling, S. G., & Hill, H. C. (2007). Assessing measures
of mathematical knowledge for teaching: A validity
argument approach. Measurement: Interdisciplinary
Research and Perspectives, 5(2–3), 70–80.
doi:10.1080/15366360701486965
Raudenbush, S. W. (2004). What are value-added models
estimating and what does this imply for statistical practice? Journal of Educational and Behavioral Statistics, 29,
121–129. doi:10.3102/10769986029001121
Raudenbush, S. W., Martinez, A., Bloom, H., Zhu, P., &
Lin, F. (2010). Studying the reliability of group-level
measures with implications for statistical power: A
six-step paradigm (Working paper). Chicago, IL:
University of Chicago.
Shkolnik, J., Song, M., Mitchell, K., Uekawa, K.,
Knudson, J., & Murphy, R. (2007). Changes in rigor,
relevance, and student learning in redesigned high
schools. Washington, DC: American Institutes for
Research and SRI.
Raudenbush, S. W., Martinez, A., & Spybrook, J. (2007).
Strategies for improving precision in group-randomized
experiments. Educational Evaluation and Policy
Analysis, 29, 5–29. doi:10.3102/0162373707299460
Shulman, L. (1986). Those who understand: Knowledge
growth in teaching. Educational Researcher, 15(2),
4–14.
Stecher, B. M., Vernez, G., & Steinberg, P. (2010).
Reauthorizing No Child Left Behind: Facts and recommendations. Santa Monica, CA: RAND.
Raver, C. C., Jones, S. M., Li-Grining, C. P., Metzger,
M., Smallwood, K., & Sardin, L. (2008). Improving
preschool classroom processes: Preliminary findings
from a randomized trial implemented in Head Start
settings. Early Childhood Research Quarterly, 23,
10–26. doi:10.1016/j.ecresq.2007.09.001
Stein, B., Haynes, A., Redding, M., Ennis, T., & Cecil,
M. (2007). Assessing critical thinking in STEM and
beyond. In M. Iskander (Ed.), Innovations in E-learning,
instruction technology, assessment, and engineering
education (pp. 79–82). Dordrecht, the Netherlands:
Springer. doi:10.1007/978-1-4020-6262-9_14
Rowan, B., Camburn, E., & Correnti, R. (2004). Using
teacher logs to measure the enacted curriculum
in large-scale surveys: Insights from the study of
instructional improvement. Elementary School
Journal, 105, 75–101. doi:10.1086/428803
Rowan, B., & Correnti, R. (2009). Studying reading
instruction with teacher logs: Lessons from A Study
of Instructional Improvement. Educational Researcher,
38(2), 120–131. doi:10.3102/0013189X09332375
Rowan, B., Correnti, R., & Miller, R. J. (2002). What
large-scale, survey research tells us about teacher
effects on student achievement: Insights from the
Prospects study of elementary schools. Teachers
College Record, 104, 1525–1567. doi:10.1111/
1467-9620.00212
Rowan, B., Jacob, R., & Correnti, R. (2009). Using
instructional logs to identify quality in educational
settings. New Directions for Youth Development,
2009(121), 13–31. doi:10.1002/yd.294
Samaras, A. P., Francis, S. L., Holt, Y. D., Jones, T.
W., Martin, D. S., Thompson, J. L., & Tom, A. R.
(1999). Lived experiences and reflections of joint
NCATE-state reviews. Teacher Educator, 35, 68–83.
doi:10.1080/08878739909555218
Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and
unobservables. Education Finance and Policy, 4, 537–
571. doi:10.1162/edfp.2009.4.4.537
Szpara, M. Y., & Wylie, E. C. (2007). Writing differences
in teacher performance assessments: An investigation of African American language and edited
American English. Applied Linguistics, 29, 244–266.
doi:10.1093/applin/amm003
Reardon, S. F., & Raudenbush, S. W. (2009).
Assumptions of value-added models for estimating school effects. Education Finance and Policy, 4,
492–519. doi:10.1162/edfp.2009.4.4.492
Ravitch, D. (2010). The death and life of the great American
school system: How testing and choice are undermining
education. New York, NY: Basic Books.
Taylor, B. M., Pearson, P. D., Peterson, D. S., &
Rodriguez, M. C. (2003). Reading growth in high-poverty classrooms: The influence of teacher practices that encourage cognitive engagement in literacy
learning. Elementary School Journal, 104, 3–28.
doi:10.1086/499740
Toch, T., & Rothman, R. (2008). Rush to judgment:
Teacher evaluation in public education. Washington,
DC: Education Sector.
Tucker, P. D., Stronge, J. H., Gareis, C. R., & Beers, C. S.
(2003). The efficacy of portfolios for teacher evaluation and professional development: Do they make a
difference? Educational Administration Quarterly, 39,
572–602. doi:10.1177/0013161X03257304
Turque, B. (2010, July 24). Rhee dismisses 241 D.C. teachers; union vows to contest firings. Washington Post.
Retrieved from http://www.washingtonpost.com/wp-dyn/content/article/2010/07/23/AR2010072303093.html
Tyack, D., & Cuban, L. (1995). Tinkering toward utopia:
A century of public school reform. Cambridge, MA:
Harvard University Press.
U.S. Department of Education. (2010). Race to the Top
program: Executive summary. Retrieved from http://
www2.ed.gov/programs/racetothetop/executive-summary.pdf
Usher, E. L., & Pajares, F. (2009). Sources of self-efficacy
in mathematics: A validation study. Contemporary
Educational Psychology, 34, 89–101. doi:10.1016/j.
cedpsych.2008.09.002
Wilson, M., Hallam, P. J., Pecheone, R., & Moss, P.
(2006, April). Using student achievement test scores
as evidence of external validity for indicators of teacher
quality: Connecticut’s Beginning Educator Support and
Training Program. Paper presented at the annual
meeting of the American Educational Research
Association, San Francisco, CA.
Wilson, S. (2008). Measuring teacher quality for professional entry. In D. H. Gitomer (Ed.), Measurement
issues and assessment for teaching quality (pp. 8–29).
Thousand Oaks, CA: Sage.
Wilson, S. M., & Youngs, P. (2005). Research on
accountability processes in teacher education. In
M. Cochran-Smith & K. M. Zeichner (Eds.), Studying
teacher education: The report of the AERA panel
on research and teacher education (pp. 591–643).
Mahwah, NJ: Erlbaum.
Woolfolk, A. E., & Hoy, W. K. (1990). Prospective
teachers’ sense of efficacy and beliefs about control. Journal of Educational Psychology, 82, 81–91.
doi:10.1037/0022-0663.82.1.81
Wasley, P. (2006, June 16). Accreditor of education
schools drops controversial “social justice” language.
Chronicle of Higher Education, p. A13.
Wayne, A. J., & Youngs, P. (2003). Teacher characteristics and student achievement gains: A review.
Review of Educational Research, 73, 89–122.
doi:10.3102/00346543073001089
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D.
(2009). The widget effect: Our national failure to
acknowledge and act on differences in teacher effectiveness. New York, NY: The New Teacher Project.
Wenzel, S., Nagaoka, J. K., Morris, L., Billings, S., &
Fendt, C. (2002). Documentation of the 1996–2002
Chicago Annenberg Research Project Strand on
Authentic Intellectual Demand exhibited in assignments and student work: A technical process manual.
Chicago, IL: Consortium on Chicago School
Research.