
ISSUES & ANSWERS    REL 2007–No. 017

At Pennsylvania State University

The predictive validity of selected benchmark assessments used in the Mid-Atlantic Region

November 2007

Prepared by
Richard S. Brown
University of Southern California
Ed Coughlin
Metiri Group

U.S. Department of Education

Issues & Answers is an ongoing series of reports from short-term Fast Response Projects conducted by the regional educational laboratories on current education issues of importance at local, state, and regional levels. Fast Response Project topics change to reflect new issues, as identified through lab outreach and requests for assistance from policymakers and educators at state and local levels and from communities, businesses, parents, families, and youth. All Issues & Answers reports meet Institute of Education Sciences standards for scientifically valid research.

November 2007

This report was prepared for the Institute of Education Sciences (IES) under Contract ED-06-CO-0029 by Regional Educational Laboratory Mid-Atlantic administered by Pennsylvania State University. The content of the publication does not necessarily reflect the views or policies of IES or the U.S. Department of Education nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

This report is in the public domain. While permission to reprint this publication is not necessary, it should be cited as:

Brown, R. S., & Coughlin, E. (2007). The predictive validity of selected benchmark assessments used in the Mid-Atlantic Region (Issues & Answers Report, REL 2007–No. 017). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic. Retrieved from http://ies.ed.gov/ncee/edlabs

This report is available on the regional educational laboratory web site at http://ies.ed.gov/ncee/edlabs.
Summary

The predictive validity of selected benchmark assessments used in the Mid-Atlantic Region

This report examines the availability and quality of predictive validity data for a selection of benchmark assessments identified by state and district personnel as in use within Mid-Atlantic Region jurisdictions. The report finds that evidence is generally lacking of their predictive validity with respect to state assessment tests.

Many districts and schools across the United States have begun to administer periodic assessments to complement end-of-year state testing and provide additional information for a variety of purposes. These assessments are used to provide information to guide instruction (formative assessment), monitor student learning, evaluate teachers, predict scores on future state tests, and identify students who are likely to score below proficient on state tests.

Some of these assessments are locally developed, but many are provided by commercial test developers. Locally developed assessments are not usually adequately validated for any of these purposes, but commercially available testing products should provide evidence of validity for the explicit purposes for which the assessment has been developed (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). But the availability of such information and its interpretability by district personnel vary across instruments. When the information is not readily available, it is important for the user to establish such evidence of validity. A major constraint on district testing programs is the lack of resources and expertise to conduct validation studies of this type.

As an initial step in collecting evidence on the validity of district tests, this study focuses on the use of benchmark assessments to predict performance on state tests (predictive validity). Based on a review of practices within the school districts in the Mid-Atlantic Region, this report details the benchmark assessments being used, in which states and grade levels, and the technical evidence available to support the use of these assessments for predictive purposes. The report also summarizes the findings of conversations with test publishing company personnel and of technical reports, administrative manuals, and similar materials.

The key question this study addresses is: What evidence is there, for a selection of commonly used commercial benchmark assessments, of the predictive relationship of each instrument with respect to the state assessment?

The study investigates the evidence provided to establish a relationship between district and state test scores, and between performance on district-administered benchmark assessments and proficiency levels on state assessments (for example, at what cutpoints on benchmark assessments do students tend to qualify as proficient or advanced on state tests?). When particular district benchmark assessments cover only a subset of state test content, the study sought evidence of whether district tests correlate not only with overall performance on the state test but also with relevant subsections of the state test.

While the commonly used benchmark assessments in the Mid-Atlantic Region jurisdictions may possess strong internal psychometric characteristics, the report finds that evidence is generally lacking of their predictive validity with respect to the required state or summative assessments. A review of the evidence for the four benchmark assessments considered—Northwest Evaluation Association’s Measures of Academic Progress (MAP; Northwest Evaluation Association, 2003), Renaissance Learning’s STAR Math/STAR Reading (Renaissance Learning, 2001a, 2002), Scholastic’s Study Island (Study Island, 2006a), and CTB/McGraw Hill’s TerraNova (CTB/McGraw Hill, 2001b)—finds documentation of criterion validity of some sort for three of them (STAR, MAP, and TerraNova), but only one was truly a predictive study and demonstrated strong evidence of predictive validity (TerraNova).

Moreover, nearly all of the criterion validity studies showing a link between these benchmark assessments and state test scores in the Mid-Atlantic Region used the Pennsylvania State System of Assessment (CTB/McGraw-Hill, 2002a; Renaissance Learning, 2001a, 2002) as the object of prediction. One study used the Delaware Student Testing Program test as the criterion measure at a single grade level, and several studies for MAP and STAR were related to the Stanford Achievement Test–Version 9 (SAT–9) (Northwest Evaluation Association, 2003, 2004; Renaissance Learning, 2001a, 2002) used in the District of Columbia. None of the studies showed predictive or concurrent validity evidence for tests used in the other Mid-Atlantic Region jurisdictions. Thus, no predictive or concurrent validity evidence was found for any of the benchmark assessments reviewed here for state assessments in Maryland and New Jersey.

To provide the Mid-Atlantic Region jurisdictions with additional information on the predictive validity of the benchmark assessments currently used, further research is needed linking these benchmark assessments and the state tests currently in use. Additional research could help to develop the type of predictive validity evidence school districts need to make informed decisions about which benchmark assessments correspond to state assessment outcomes, so that instructional decisions meant to improve student learning as measured by state tests have a reasonable chance of success.

November 2007

Table of contents

The importance of validity testing
Purposes of assessments
Review of previous research
About this study
Review of benchmark assessments
    Northwest Evaluation Association’s Measures of Academic Progress (MAP) Math and Reading assessments
    Renaissance Learning’s STAR Math and Reading assessments
    Scholastic’s Study Island Math and Reading assessments
    CTB/McGraw-Hill’s TerraNova Math and Reading assessments
Need for further research
Appendix A  Methodology
Appendix B  Glossary
Appendix C  Detailed findings of benchmark assessment analysis
Notes
References

Boxes
1  Key terms used in the report
2  Methodology and data collection

Tables
1  Mid-Atlantic Region state assessment tests
2  Benchmark assessments with significant levels of use in Mid-Atlantic Region jurisdictions
3  Northwest Evaluation Association’s Measures of Academic Progress: assessment description and use
4  Northwest Evaluation Association’s Measures of Academic Progress: predictive validity
5  Renaissance Learning’s STAR: assessment description and use
6  Renaissance Learning’s STAR: predictive validity
7  Scholastic’s Study Island: assessment description and use
8  Scholastic’s Study Island: predictive validity
9  CTB/McGraw-Hill’s TerraNova: assessment description and use
10  CTB/McGraw-Hill’s TerraNova: predictive validity
A1  Availability of assessment information
C1  Northwest Evaluation Association’s Measures of Academic Progress: reliability coefficients
C2  Northwest Evaluation Association’s Measures of Academic Progress: predictive validity
C3  Northwest Evaluation Association’s Measures of Academic Progress: content/construct validity
C4  Northwest Evaluation Association’s Measures of Academic Progress: administration of the assessment
C5  Northwest Evaluation Association’s Measures of Academic Progress: reporting
C6  Renaissance Learning’s STAR: reliability coefficients
C7  Renaissance Learning’s STAR: content/construct validity
C8  Renaissance Learning’s STAR: appropriate samples for assessment validation and norming
C9  Renaissance Learning’s STAR: administration of the assessment
C10  Renaissance Learning’s STAR: reporting
C11  Scholastic’s Study Island: reliability coefficients
C12  Scholastic’s Study Island: content/construct validity
C13  Scholastic’s Study Island: appropriate samples for assessment validation and norming
C14  Scholastic’s Study Island: administration of the assessment
C15  Scholastic’s Study Island: reporting
C16  CTB/McGraw-Hill’s TerraNova: reliability coefficients
C17  CTB/McGraw-Hill’s TerraNova: content/construct validity
C18  CTB/McGraw-Hill’s TerraNova: appropriate samples for test validation and norming
C19  CTB/McGraw-Hill’s TerraNova: administration of the assessment
C20  CTB/McGraw-Hill’s TerraNova: reporting
The importance of validity testing

In a small Mid-Atlantic school district performance on the annual state assessment had the middle school in crisis. For a second year the school had failed to achieve adequate yearly progress, and scores in reading and math were the lowest in the county. The district assigned a central office administrator, “Dr. Williams,” a former principal, to solve the problem. Leveraging Enhancing Education Through Technology (EETT) grant money, Dr. Williams purchased a comprehensive computer-assisted instruction system to target reading and math skills for struggling students. According to the sales representative, the system had been correlated to state standards and included a benchmark assessment tool that would provide monthly feedback on each student so staff could monitor progress and make necessary adjustments. A consultant recommended by the publisher of the assessment tool was contracted to implement and monitor the program. Throughout the year the benchmark assessments showed steady progress. EETT program evaluators, impressed by the ongoing data gathering and analysis, selected the school for a web-based profile. When spring arrived, the consultant and the assessment tool were predicting that students would achieve significant gains on the state assessment. But when the scores came in, the predicted gains did not materialize. The data on the benchmark assessments seemed unrelated to those on the state assessment. By the fall the assessment tool, the consultant, and Dr. Williams had been removed from the school.1

This story points to the crucial role of predictive validity—the ability of one measure to predict performance on a second measure of the same outcome—in the assessment process (see box 1 for definitions of key terms).
Box 1
Key terms used in the report

Benchmark assessment. A benchmark assessment is a formative assessment, usually with two or more equivalent forms so that the assessment can be administered to the same children at multiple times over a school year without evidence of practice effects (improvements in scores resulting from taking the same version of a test multiple times). In addition to formative functions, benchmark assessments allow educators to monitor the progress of students against state standards and to predict performance on state exams.

Criterion. A standard or measure on which a judgment may be based.

Criterion validity. The ability of a measure to predict performance on a second measure of the same construct, computed as a correlation. If both measures are administered at approximately the same time, this is described as concurrent validity. If the second measure is taken after the first, the ability is described as predictive validity.

Formative assessment. An assessment designed to provide information to guide instruction.

Predictive validity. The ability of one assessment tool to predict future performance either in some activity (success in college, for example) or on another assessment of the same construct.

Reliability. The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and repeatable for an individual test taker; the degree to which scores are free of errors of measurement for a given group.

The school in this example had accepted the publisher’s claim that performance on the benchmark assessments would predict performance on the state assessment. It did not.

Many districts and schools across the United States have begun to administer periodic assessments to complement end-of-year state testing and provide additional information for a variety of purposes. These assessments are used to provide information to guide instruction (formative assessment), monitor student learning, evaluate teachers, predict scores on future state tests, and identify students who are likely to score below proficient on state tests.

Some of these assessments are locally developed, but many are provided by commercial test developers. Locally developed assessments are not usually adequately validated for any of these purposes, but commercially available testing products should provide validity evidence for the explicit purposes for which the assessment has been developed (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). But the availability of this type of information and its interpretability by district personnel vary across instruments. When such information is not readily available, it is important for the user to establish evidence of validity. A major constraint on district testing programs is the lack of resources and expertise to conduct validation studies of this type.

The most recent edition of Standards for Educational and Psychological Testing states that predictive evidence indicates how accurately test data can predict criterion scores, or scores on other tests used to make judgments about student performance, obtained at a later time (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999, pp. 179–180). As an initial step in collecting evidence on the validity of district tests, this study focuses on use of benchmark assessments to predict performance on state tests. It investigates whether there is evidence of a relationship between district and state test scores and between performance on locally administered benchmark assessments and proficiency levels on state tests (for example, at what cutpoints on benchmark assessments do students tend to qualify as proficient or advanced on state tests?). (Table 1 lists the state assessment tests for each Mid-Atlantic Region jurisdiction.)
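The calculation behind this kind of evidence is simple even though the data collection is not. The short sketch below, in Python, illustrates the distinction drawn above with entirely hypothetical scores and variable names (nothing here comes from any study cited in this report): a predictive validity coefficient correlates an earlier benchmark administration with a later state test, while a concurrent coefficient correlates two measures given at about the same time.

```python
# Illustrative only: hypothetical scores for eight students.
import numpy as np

fall_benchmark   = np.array([195, 201, 188, 210, 205, 190, 215, 198])          # benchmark, fall administration
spring_benchmark = np.array([201, 208, 190, 218, 211, 196, 222, 204])          # benchmark, spring administration
spring_state     = np.array([1150, 1230, 1080, 1310, 1260, 1110, 1340, 1190])  # state test, spring

# Predictive validity evidence: the earlier measure correlated with the later criterion.
predictive_r = np.corrcoef(fall_benchmark, spring_state)[0, 1]

# Concurrent validity evidence: both measures administered at about the same time.
concurrent_r = np.corrcoef(spring_benchmark, spring_state)[0, 1]

print(f"predictive r = {predictive_r:.2f}, concurrent r = {concurrent_r:.2f}")
```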
Table 1
Mid-Atlantic Region state assessment tests

State or jurisdiction | State assessment test | Source
Delaware | Delaware Student Testing Program (DSTP) | Retrieved March 14, 2007, from http://www.doe.k12.de.us/AAB/DSTP_publications_2005.html
District of Columbia | District of Columbia Comprehensive Assessment System (DC CAS) [a] | Retrieved March 14, 2007, from http://www.k12.dc.us/dcps/data/dcdatahome.html and www.greatschools.com
Maryland | Maryland High School Assessments (HSA) [b]; Maryland School Assessment (MSA) | Retrieved March 14, 2007, from http://www.marylandpublicschools.org/MSDE/testing/
New Jersey | New Jersey Assessment of Skills and Knowledge (NJ ASK 3–7); Grade Eight Proficiency Assessment (GEPA); High School Proficiency Assessment (HSPA) | Retrieved March 14, 2007, from http://www.state.nj.us/njded/assessment/schedule.shtml
Pennsylvania | Pennsylvania State System of Assessment (PSSA) | Retrieved March 14, 2007, from http://www.pde.state.pa.us/a_and_t/site/default.asp?g=0&a_and_tNav=|630|&k12Nav=|1141|

a. In the 2005/06 school year the District of Columbia replaced the Stanford Achievement Test–Version 9 with the District of Columbia Comprehensive Assessment System.
b. Beginning with the class of 2009, students are required to pass the Maryland High School Assessments in order to graduate. Students graduating before 2009 must also take the assessments, but are not required to earn a particular passing score.

When a district benchmark assessment covers only a subset of state test content, the study looks for evidence that the assessment correlates not only with overall performance on the state test but also with relevant subsections of the state test.

Purposes of assessments

Assessments have to be judged against their intended uses. There is no absolute criterion for judging assessments. It is not possible to say, for example, that a given assessment is good for any and all purposes; it is only possible to say, based on evidence, that the assessment has evidence of validity for specific purposes. Furthermore, professional assessment standards require that assessments be validated for all their intended uses. A clear statement of assessment purposes also provides essential guidance for test and assessment item developers. Different purposes may require different content coverage, different types of items, and so on. Thus, it is critical to identify how assessment information is to be used and to validate the assessments for those uses.

This study examines the availability and quality of predictive validity data for a selection of benchmark assessments that state and district personnel identified to be in use within the Mid-Atlantic Region jurisdictions (Delaware, District of Columbia, Maryland, New Jersey, and Pennsylvania).

For this review a benchmark assessment is defined as a formative assessment (providing data for instructional decisionmaking), usually with two or more equivalent forms so that the assessment can be administered to the same children at multiple times over a school year without evidence of practice effects (improvements in scores resulting from taking the same version of a test multiple times). In addition to formative functions, benchmark assessments allow educators to monitor the progress of students against state standards and should predict performance on state exams.
Frequently, benchmark assessments are used to identify students who may not perform well on the state exams or to evaluate how well schools are preparing students for the state exams. These uses may require additional analysis by the districts. The predictive ability of an assessment is not a use but rather a quality of the assessment. For example, college admissions tests are supposed to predict future performance in college, but the tests are used to decide who to admit to college. Part of the evidence of predictive validity for these tests consists of data on whether students who perform well on the test also do well in college. Similar correlation evidence should be obtained for the benchmark assessments used in the Mid-Atlantic Region. That is, do scores on the benchmark assessments correlate highly with state test scores taken at a later date? For example, is there evidence that students who score highly on a benchmark assessment in the fall also score highly on the state assessment taken in the spring?

Review of previous research

A review of the research literature shows few published accounts of similar investigations. There is no evidence of a large-scale multistate review of the predictive validity of specific benchmark assessments (also referred to as curriculum-based measures). Many previous studies were narrowly focused, both in the assessment area and the age of students. Many have been conducted with only early elementary school students. For example, researchers studied the validity of early literacy measures in predicting kindergarten through third grade scores on the Oregon Statewide Assessment (Good, Simmons, & Kame’enui, 2001) and fourth grade scores on the Washington Assessment of Student Learning (Stage & Jacobsen, 2001). Researchers in Louisiana investigated the predictive validity of early readiness measures for predicting performance on the Comprehensive Inventory of Basic Skills, Revised (VanDerHeyden, Witt, Naquin, & Noell, 2001), and others reviewed the predictive validity of the Dynamic Indicators of Basic Early Literacy Skills for its relationship to the Iowa Test of Basic Skills for a group of students in Michigan (Schilling, Carlisle, Scott, & Zheng, 2007). McGlinchey and Hixson (2004) studied the predictive validity of curriculum-based measures for student reading performance on the Michigan Educational Assessment Program’s fourth-grade reading assessment.

Similar investigations studied mathematics. Clarke and Shinn (2004) investigated the predictive validity of four curriculum-based measures in predicting first-grade student performance on three distinct criterion measures in one school district in the Pacific Northwest, and VanDerHeyden, Witt, Naquin, & Noell (2001) included mathematics outcomes in their review of the predictive validity of readiness probes for kindergarten students in Louisiana. Each of these studies focused on the predictive validity of a given benchmark assessment for a given assessment, some of them state-mandated tests. Most of these investigations dealt with the early elementary grades. Generally, these studies showed that various benchmark assessments could predict outcomes such as test scores and need for retention in grade, but there was much variability in the magnitude of these relationships.

About this study

This study differs from the earlier research by reviewing evidence of the predictive validity of benchmark assessments in use across a wide region and by looking beyond early elementary students.

The key question addressed in this study is: What evidence is there, for a selection of commonly used commercial benchmark assessments, of the predictive validity of each instrument with respect to the state assessment?
Based on a review of practices within the school districts in the Mid-Atlantic Region, this report details the benchmark assessments being used, in which states and grade levels, and the technical evidence available to support the use of these assessments for predictive purposes. The report also summarizes conversations with test publishing company personnel and the findings of technical reports, administrative manuals, and similar materials (see box 2 and appendix A on methodology and data collection).

While the commonly used benchmark assessments in the Mid-Atlantic Region jurisdictions may possess strong internal psychometric characteristics, the report finds that evidence is generally lacking of their predictive validity with respect to the required summative assessments in the Mid-Atlantic Region jurisdictions. A review of the evidence for the four benchmark assessments considered (table 2)—Northwest Evaluation Association’s Measures of Academic Progress (MAP; Northwest Evaluation Association, 2003), Renaissance Learning’s STAR Math/STAR Reading (Renaissance Learning, 2001a, 2002), Scholastic’s Study Island (Study Island, 2006a), and CTB/McGraw Hill’s TerraNova (CTB/McGraw Hill, 2001b)—finds documentation of concurrent or predictive validity of some sort for three of them (STAR, MAP, and TerraNova), but only one was truly a predictive study and demonstrated strong evidence of predictive validity (TerraNova).

Moreover, nearly all of the criterion validity studies showing a link between these benchmark assessments and state outcome measures used the Pennsylvania State System of Assessment (CTB/McGraw-Hill, 2002a; Renaissance Learning, 2001a, 2002) as the criterion measure. One study used the Delaware Student Testing Program test at a single grade level as the criterion measure, and several studies for MAP and STAR were related to the Stanford Achievement Test–Version 9 (SAT–9) (Northwest Evaluation Association, 2003, 2004; Renaissance Learning, 2001a, 2002) used in the District of Columbia. None of the studies showed predictive or concurrent validity evidence for tests used in the other Mid-Atlantic Region jurisdictions. Thus, no predictive or concurrent validity evidence was found for any of the benchmark assessments reviewed here for state assessments in Maryland and New Jersey.

Table 2
Benchmark assessments with significant levels of use in Mid-Atlantic Region jurisdictions

Benchmark assessment | Publisher | Publisher classification | State or jurisdiction
4Sight Math and Reading [a] | Success For All | Nonprofit organization | New Jersey, Pennsylvania
Measures of Academic Progress (MAP) Math and Reading | Northwest Evaluation Association | Nonprofit organization | Delaware, Maryland, New Jersey, Pennsylvania, District of Columbia
STAR Math and Reading | Renaissance Learning | Commercial publisher | Delaware, Maryland, New Jersey
Study Island Math & Reading | Scholastic | Commercial publisher | Maryland, New Jersey, Pennsylvania
TerraNova Math and Reading | CTB/McGraw-Hill | Commercial publisher | Maryland, New Jersey, Pennsylvania

a. The 4Sight assessments were reviewed for this report but were subsequently dropped from the analysis as the purpose of the assessments, according to the publisher, is not to predict a future score on the state assessment but rather “to provide a formative evaluation of student progress that predicts how a group of students would perform if the PSSA [Pennsylvania State System of Assessment] were given on the same day.” As a result, it was argued that concurrent, rather than predictive, validity evidence was a more appropriate form of evidence of validity in evaluating this assessment. Users of the 4Sight assessments, as with users of other assessments, are strongly encouraged to use the assessments consistent with their stated purposes, not to use any assessments to predict state test scores obtained at a future date without obtaining or developing evidence of validity to support such use, and to carefully adhere to the Standards for Educational and Psychological Testing (specifically, Standards 1.3 and 1.4) (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999).
Source: Education Week, Quality Counts 2007, individual test publishers’ web sites, and state department of education web sites.
Box 2
Methodology and data collection

This report details the review of several benchmark assessments identified to be in widespread use throughout the Mid-Atlantic Region. The report is illustrative, not exhaustive, identifying only a small number of these benchmark assessments.

Some 40 knowledgeable stakeholders were consulted in the identification process, yielding a list of more than 20 assessment tools. Three criteria were used to make the final selection: the assessments were used in more than one jurisdiction, the assessments were not developed for a single district or small group of districts but would be of interest to many schools and districts in the jurisdictions, and there was evidence, anecdotal or otherwise, of significant levels of use of the assessments within the region.

While not all of the assessments selected are widely used in every jurisdiction, each has significant penetration within the region, as reported through the stakeholder consultations. Short of a large-scale survey study, actual numbers are difficult to derive as some of the publishers of these assessments consider that information proprietary. For the illustrative purposes of this report the less formal identification process is sufficient.

This process yielded four assessments in both reading and mathematics: Scholastic’s Study Island Math and Reading assessments, Renaissance Learning’s STAR Math and STAR Reading assessments, Northwest Evaluation Association’s Measures of Academic Progress (MAP) Math and Reading assessments, and CTB/McGraw-Hill’s TerraNova Math and Reading assessments.1

Direct measures of technical adequacy and predictive validity were collected from December 2006 through February 2007. Extensive efforts were made to obtain scoring manuals, technical reports, and predictive validity evidence associated with each benchmark assessment, but test publishers vary in the amount and quality of information they provide in test manuals. Some test manuals, norm tables, bulletins, and other materials were available online. However, since none of the test publishers provided access to a comprehensive technical manual on their web site and because critical information is often found in unpublished reports, publishers were contacted directly for unpublished measures, manuals, and technical reports. All provided some additional materials.

A benchmark assessment rating guide was developed for reviewing the documentation for each assessment, based on accepted standards in the testing profession and recommendations in the research literature (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Rudner, 1994). Ratings were provided for each element on the rating guide. First, the lead author, a trained psychometrician, rated each element based on the information collected, with scores ranging from 3 (“yes or true”) through 1 (“no or not true”). Next, the assessment publishers were asked to confirm or contest the ratings and invited to submit additional information. This second phase resulted in modifications to fewer than 10 percent of the initial ratings, mostly due to the acquisition of additional documentation providing previously unavailable evidence.

Note
1. 4Sight Math and Reading assessments, published by Success for All, were reviewed for this report but were subsequently dropped from the analysis as the purpose of the assessments, according to the publisher, is not to predict a future score on the state assessment but rather “to provide a formative evaluation of student progress that predicts how a group of students would perform if the PSSA [Pennsylvania State System of Assessment] were given on the same day.” As a result, it was argued that concurrent, rather than predictive, validity evidence was a more appropriate form of validity evidence in evaluating this assessment.

Review of benchmark assessments

This study reviewed the presence and quality of specific predictive validity evidence for a collection of benchmark assessments in widespread use in the Mid-Atlantic Region. The review focused on available technical documentation along with other supporting documentation provided by the test publishers to identify a number of important components when evaluating a benchmark assessment that will be used for predicting student performance on a later test. These components included the precision of the benchmark assessment scores, use of and rationale for criterion measures for establishing predictive validity, the distributional properties of the criterion scores, if any were used, and the predictive accuracy of the benchmark assessments. Judgments regarding these components were made and reported along with justifications for the judgments. While additional information regarding other technical qualities of the benchmark assessments is provided in appendix C, only a brief description of the assessment and the information on predictive validity evidence is described here.

A rating guide was developed for reviewing the documentation for each benchmark assessment, based on accepted standards in the testing profession and sound professional recommendations in the research literature (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Rudner, 1994). The review occurred in multiple stages. First, the lead author, a trained psychometrician, rated each element of the rating guide based on a review of information collected for each assessment. Each element was rated either 3, indicating yes or true; 2, indicating somewhat true; 1, indicating no or not true; or na, indicating that the element was not applicable. In most cases the judgment dealt primarily with the presence or absence of a particular type of information. Professional judgment was employed in cases requiring more qualitative determinations.

To enhance the fairness of the reviews, each profile was submitted to the assessment publisher or developer for review by its psychometric staff. The publishers were asked to confirm or contest the initial ratings and were invited to submit additional information that might better inform the evaluation of that assessment. This second phase resulted in modifications to fewer than 10 percent of the ratings, mostly due to the acquisition of additional documentation providing evidence that was previously unavailable.

For each benchmark assessment below, a brief summary of the documentation reviewed is followed by two tables. The first table (tables 3, 5, 7, and 9) describes the assessment and its use, and the second table (tables 4, 6, 8, and 10) presents judgments about the predictive validity evidence identified in the documentation. Overall, the evidence reviewed for this set of benchmark assessments is varied but generally meager with respect to supporting predictive judgments on student performance on the state tests used in the Mid-Atlantic Region. Although the MAP, STAR, and TerraNova assessments are all strong psychometrically regarding test score precision and their correlations with other measures, only TerraNova provided evidence of predictive validity, and that was limited to a single state assessment in only a few grades.

Northwest Evaluation Association’s Measures of Academic Progress (MAP) Math and Reading assessments

The Northwest Evaluation Association’s (NWEA) Measures of Academic Progress (MAP) assessments are computer-adaptive tests in reading, mathematics, and language usage. Several documents were consulted for this review. The first is a 2004 research report by the assessment developer (Northwest Evaluation Association, 2004). The others include the MAP administration manual for teachers (Northwest Evaluation Association, 2006) and the MAP technical manual (Northwest Evaluation Association, 2003). These reports provide evidence of reliability and validity for the NWEA assessments, including reliability coefficients derived from the norm sample (1994, 1999, and 2002) for MAP. With rare exceptions these measures indicate strong interrelationships among the test items for these assessments.2
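Reliability coefficients of this kind bear on the first criterion in the rating guide, whether scores are precise enough for decisions about individual students. The sketch below shows the conventional classical test theory relationship between a reliability coefficient and the standard error of measurement; the reliability, scale values, and student score are hypothetical and are not figures taken from any publisher’s documentation.

```python
# Hypothetical values; not taken from the MAP, STAR, Study Island, or TerraNova manuals.
import math

reliability = 0.94   # reliability coefficient for a scale score (hypothetical)
score_sd = 15.0      # standard deviation of the scale scores (hypothetical)

# Classical test theory: SEM = SD * sqrt(1 - reliability)
sem = score_sd * math.sqrt(1 - reliability)
band = 1.96 * sem    # approximate 95 percent band around an observed score

observed = 210.0     # one student's observed scale score (hypothetical)
print(f"SEM = {sem:.1f}; approximate 95% band: {observed - band:.1f} to {observed + band:.1f}")
```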
Table 3
Northwest Evaluation Association’s Measures of Academic Progress: assessment description and use

Item | Description
Publisher | Northwest Evaluation Association
What is measured | Math, reading, language usage
Scoring method | Computer scored—using item response theory (IRT)
Type | Computerized adaptive
Target groups | All students in grades 2–10
Mid-Atlantic Region jurisdictions where used | Delaware, District of Columbia, Maryland, New Jersey, and Pennsylvania
Intended uses | “Both MAP and ALT are designed to deliver assessments matched to the capabilities of each individual student” (technical manual, p. 1).

Table 4
Northwest Evaluation Association’s Measures of Academic Progress: predictive validity

Criterion | Score [a] | Comments
Is the assessment score precise enough to use the assessment as a basis for decisions concerning individual students? | 3 | Estimated score precision based on standard error of measurement values suggests the scores are sufficiently precise (generally below .40) for individual students (technical manual, p. 58).
Are criterion measures used to provide evidence of predictive validity? | 1 | Criterion measures are used to provide evidence of concurrent validity but not of predictive validity.
Is the rationale for choosing these measures provided? | 3 | Criterion measures are other validated assessments used in the states in which the concurrent validity studies were undertaken (technical manual, p. 52).
Is the distribution of scores on the criterion measure adequate? | 3 | Criterion measures in concurrent validity studies span multiple grade levels and student achievement.
Is the overall predictive accuracy of the assessment adequate? | 1 | The overall levels of relationship with the criterion measures are adequate, but they do not assess predictive validity.
Are predictions for individuals whose scores are close to cutpoints of interest accurate? | 3 | The nature of the computer-adaptive tests allows for equally precise measures across the ability continuum.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table 4 indicates that although the MAP scores are sufficiently precise overall and at the cutpoints of interest, and criterion measures with adequate distributions across grade levels were used in the research studies, these studies did not provide evidence of predictive validity. Rather, the criterion measures are used to provide evidence of concurrent validity. The concurrent relationships are adequate, but they do not provide the type of evidence necessary to support predictive judgments.

Renaissance Learning’s STAR Math and Reading assessments

For both STAR Reading and Math assessments, reports titled “Understanding Reliability and Validity” provided a wealth of statistical information on reliability and correlations with other outcome measures in the same domain (Renaissance Learning, 2000, 2001b). While evidence is found correlating STAR assessments with a multitude of other measures of mathematics and reading, none of these estimates are of predictive validity. Most are identified as concurrent validity studies, while the rest are labeled “other external validity data” in the technical reports. These data show relationships between the STAR tests and state tests given prior to, rather than subsequent to, the administration of the STAR assessments. Although the documentation provides evidence of relationships between the STAR assessment and many assessments, including three used as state assessments in the Mid-Atlantic Region (Pennsylvania State System of Assessment, SAT–9, and Delaware Student Testing Program), these reports provided no evidence of predictive validity for the STAR assessments and the assessments used in the Mid-Atlantic Region.
Table 5
Renaissance Learning’s STAR: assessment description and use

Item | Description
Publisher | Renaissance Learning
What is measured | STAR Math / STAR Reading
Target groups | All students in grades 1–12
Scoring method | Computer scored using item response theory (IRT)
Type | Computerized adaptive
Mid-Atlantic jurisdictions where used | Delaware, Maryland, and New Jersey
Intended use | “First, it provides educators with quick and accurate estimates of students’ instructional math levels relative to national norms. Second, it provides the means for tracking growth in a consistent manner over long time periods for all students” (STAR Math technical manual, p. 2).

Table 6
Renaissance Learning’s STAR: predictive validity

Criterion | Score [a] | Comments
Is the assessment score precise enough to use the assessment as a basis for decisions concerning individual students? | 3 | Adaptive test score standard errors are sufficiently small to use as a predictive measure of future performance.
Are criterion measures used to provide evidence of predictive validity? | 1 | Numerous criterion studies were found. For Math, however, there were only two studies for the Mid-Atlantic Region (Delaware and Pennsylvania), and neither provided evidence of predictive validity. The Delaware study had a low correlation coefficient (.27).
Is the rationale for choosing these measures provided? | 3 | Rationale for assessments used is clear.
Is the distribution of scores on the criterion measure adequate? | 3 | Criterion scores span a wide grade range, with large samples.
Is the overall predictive accuracy of the assessment adequate? | 1 | Criterion relationships vary across grade and outcome, but there is evidence that in some circumstances the coefficients are quite large. The average coefficients (mid-.60s) are modest for Math and higher for Reading (.70–.90). However, these are coefficients of concurrent validity, not predictive validity.
Are predictions for individuals whose scores are close to cutpoints of interest accurate? | 3 | Because of the computer-adaptive nature of the assessment, scores across the ability continuum can be estimated with sufficient precision.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

As with the MAP test, the evidence suggests that the STAR tests provide sufficiently precise scores all along the score continuum and that several studies offer correlations with criterion measures that are well distributed across grades and student ability levels. These correlations are generally stronger in reading than in math. However, while these studies provide evidence of concurrent relationships between the STAR tests and state test measures, they do not provide the kind of validity evidence that would support predictive judgments regarding the STAR test and state tests in the Mid-Atlantic region.
Scholastic’s Study Island Math and Reading assessments

The documentation for Scholastic’s Study Island assessments was limited to the administrator’s handbook and some brief research reports (Study Island, 2006a, 2006b) on the Study Island web site (www.studyisland.com). Only one report contained information pertaining to the Mid-Atlantic Region, a study comparing proficiency rates on the Pennsylvania State System of Assessment (PSSA) between Pennsylvania schools using Study Island and those not using Study Island. However, since analyses were not conducted relating scores from the Study Island assessments to the PSSA scores, there was no evidence of predictive validity for the Study Island assessments. Nor was evidence of predictive validity found relating the Study Island assessments to the state assessments used by any of the other Mid-Atlantic Region jurisdictions.

Whereas the documentation reviewed for the MAP and STAR tests provides evidence of test score precision and correlations between these tests and state test scores, documentation for the Study Island assessments lacks any evidence to support concurrent or predictive judgments—there was no evidence of test score precision or predictive validity for this instrument (table 8).

Table 7
Scholastic’s Study Island: assessment description and use

Item | Description
Publisher | Scholastic
What is measured | Math and Reading content standards
Target groups | All K–12 students
Scoring method | Computer scored
Type | Computer delivered
Mid-Atlantic jurisdictions where used | Maryland, New Jersey, and Pennsylvania
Intended use | To “help your child master the standards specific to their grade in your state” (administrators handbook, p. 23).

Table 8
Scholastic’s Study Island: predictive validity

Criterion | Score [a] | Comments
Is the assessment score precise enough to use the assessment as a basis for decisions concerning individual students? | 1 | No evidence of score precision is provided.
Are criterion measures used to provide evidence of predictive validity? | 1 | No predictive validity evidence is provided.
Is the rationale for choosing these measures provided? | na
Is the distribution of scores on the criterion measure adequate? | na
Is the overall predictive accuracy of the assessment adequate? | na
Are predictions for individuals whose scores are close to cutpoints of interest accurate? | na

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.


CTB/McGraw-Hill’s TerraNova Math and Reading assessments

Table 9
CTB/McGraw-Hill’s TerraNova: assessment description and use

Item | Description
Publisher | CTB/McGraw-Hill
What is measured | Reading, math, language arts
Target groups | All K–12 students
Scoring method | Item response theory (IRT) models (a three-parameter logistic and a two-parameter logistic partial credit)
Type | Nonadaptive
Mid-Atlantic jurisdictions where used | Maryland, New Jersey, and Pennsylvania
Intended use | TerraNova consists of three test editions: Survey, Complete Battery, and Multiple Assessment. TerraNova Multiple Assessment contains multiple-choice and constructed-response items providing measures of academic performance in various content areas including reading, language arts, science, social studies, and mathematics. “TerraNova is an assessment system designed to measure concepts, processes, and skills taught throughout the nation” (technical manual, p. 1).

Table 10
CTB/McGraw-Hill’s TerraNova: predictive validity

Criterion | Score [a] | Comments
Is the assessment score precise enough to use the assessment as a basis for decisions concerning individual students? | 3 | Adequately small standard errors of measurement reflect sufficient score precision for individual students.
Are criterion measures used to provide evidence of predictive validity? | 3 | Linking study to Pennsylvania System of School Assessments (PSSA) provides evidence of predictive validity for grades 3–11 in mathematics and reading.
Is the rationale for choosing these measures provided? | 3 | Linking study documentation provides rationale for using PSSA as outcome.
Is the distribution of scores on the criterion measure adequate? | 3 | Distribution of PSSA scores shows sufficient variability within and between grade levels.
Is the overall predictive accuracy of the assessment adequate? | 3 | Linking study provides predictive validity coefficients ranging from .67 to .82.
Are predictions for individuals whose scores are close to cutpoints of interest accurate? | 3 | Accuracy at the cutpoints is sufficient.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

CTB/McGraw-Hill’s TerraNova assessments had the most comprehensive documentation of the assessments reviewed for this study. In addition to a robust technical report of more than 600 pages (CTB/McGraw-Hill, 2001b), there was a teachers guide (CTB/McGraw-Hill, 2002b) and a research study linking the TerraNova assessment to the Pennsylvania System of School Assessments (PSSA) test (CTB/McGraw-Hill, 2002a). The technical report exhaustively details the extensive test development, standardization, and validation procedures undertaken to ensure a credible, reliable, and valid assessment instrument. The teachers guide details the assessment development procedure and provides information on assessment content, usage, and score interpretation for teachers. A linking study provided clear and convincing evidence of predictive validity for the TerraNova Reading and Math assessments in predicting student performance on the PSSA for grades 5, 8, and 11 (CTB/McGraw-Hill, 2002a).
No predictive validity evidence was found, however, relating TerraNova assessments to state assessments in Delaware, the District of Columbia, Maryland, or New Jersey.

In contrast to the other measures reviewed in this report, the TerraNova documentation provided support for predictive judgments regarding use of TerraNova in relation to at least one state test measure in use in the Mid-Atlantic region. TerraNova possesses good test score precision overall and at the relevant cutpoints. The criterion measure used in the predictability or linking study (the Pennsylvania System of School Assessments) has adequate variability both within and across grades (table 10). Further, the rationale for the use of this assessment is well specified and, more important, the predictive relationships range from adequate (.67) to strong (.82). While this evidence supports the use of TerraNova as a benchmark assessment to predict scores on the Pennsylvania System of School Assessments, comparable evidence to support the use of TerraNova to predict scores on state assessments used in the other Mid-Atlantic Region states is lacking.

Need for further research

To provide the Mid-Atlantic Region jurisdictions with adequate information on the predictive validity of the benchmark assessments they are currently using, additional research is needed specifically linking these reviewed benchmark assessments with the state assessments currently in use. Even in the one case where evidence of predictive validity was provided, it is clear that more evidence is needed: “The study presents preliminary evidence of the relationship between the two instruments and does present cause for future investigations based upon the changing nature of the Pennsylvania PSSA” (CTB/McGraw-Hill, 2002a, p. 7).

Additional research is therefore recommended to develop the type of evidence of predictive validity for each of these benchmark assessments with the state assessments for all grade levels tested across the entire Mid-Atlantic Region. Such evidence is crucial for school districts to make informed decisions about which benchmark assessments correspond to state assessment outcomes so that instructional decisions meant to improve student learning, as measured by state tests, have a reasonable chance of success.

In some jurisdictions such studies are already under way. For example, a study is being conducted in Delaware to document the predictive validity of assessments used in that state. To judge the efficacy of remedial programs that target outcomes identified through high-stakes state assessments, the data provided by benchmark assessments are crucial. While some large school districts may have the psychometric staff resources to document the predictive qualities of the benchmark assessments in use, most districts do not.
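The analyses such local studies require are not elaborate. The sketch below, using hypothetical data throughout, shows the two quantities a district-level linking study would typically report for one grade and subject: the correlation between fall benchmark scores and the later state test, and how accurately a candidate benchmark cutpoint classifies students who go on to reach proficiency. All scores, cutpoints, and sample sizes are invented for illustration and do not describe any assessment reviewed here.

```python
# Hypothetical data for one grade and subject; invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 500
benchmark = rng.normal(200, 15, n)                      # fall benchmark scale scores
state = 5 * benchmark + rng.normal(0, 60, n)            # later state test scores (related but noisy)

predictive_r = np.corrcoef(benchmark, state)[0, 1]      # predictive validity coefficient

state_cut = np.quantile(state, 0.40)                    # hypothetical proficiency cut score on the state test
benchmark_cut = np.quantile(benchmark, 0.40)            # candidate benchmark cutpoint to evaluate

flagged_proficient = benchmark >= benchmark_cut         # prediction made in the fall
later_proficient = state >= state_cut                   # outcome observed in the spring
accuracy = np.mean(flagged_proficient == later_proficient)

print(f"predictive r = {predictive_r:.2f}; classification accuracy at the cutpoint = {accuracy:.2f}")
```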
Appendix A
Methodology

The benchmark assessments included in this review were identified through careful research and consideration; however, it was not the intent to conduct a comprehensive survey of all benchmark assessments in all districts, but rather to identify a small number of benchmark assessments widely used in the Mid-Atlantic Region. Thus, this report is intended to be illustrative, not exhaustive.

Data collection

Some 40 knowledgeable stakeholders were consulted in the five jurisdictions that constitute the Mid-Atlantic Region: Delaware, the District of Columbia, Maryland, New Jersey, and Pennsylvania. Participants included state coordinators and directors of assessment and accountability, state content area coordinators and directors, district assessment and testing coordinators, district administrators of curriculum and instruction, district superintendents, representatives of assessment publishers, and university faculty.

These consultations yielded a list of more than 20 assessment tools currently in use in the region, along with some information on their range of use. Precise penetration data were not available because of publisher claims of proprietary information and the limits imposed on researchers by U.S. Office of Management and Budget requirements. A future curriculum review study will include questions about benchmark assessment usage to clarify this list.

Three criteria were used to make the final selection of benchmark assessments. First, researchers looked for assessments that were used in more than one Mid-Atlantic jurisdiction. In New Jersey the Riverside Publishing Company has created a library of assessments known as “NJ PASS” that have been developed around and correlated against the New Jersey State Standards and so would be relevant only to New Jersey districts. The State of Delaware has contracted for the development of assessments correlated with that state’s standards. Assessments that are in use in most of the jurisdictions rather than just one were selected.

Second, given the small number of assessments that a study of this limited scope might review, the decision was made to incorporate assessments of interest to a wide range of schools and districts rather than local assessments of interest to a single district or a small group of districts. Thus district-authored assessments were excluded. Maryland, for example, has a history of rigorous development of local assessments and relies heavily on them for benchmarking. Local assessments might be included in a future, more comprehensive study.

Finally, researchers looked for assessments for which there was evidence, anecdotal or otherwise, of significant levels of use within the region. In some cases, a high level of use was driven by state support. In other cases, as with the Study Island Reading and Math assessments, adoption is driven by teachers and schools.

While not all of the assessments selected are widely used in every state, each has significant penetration within the region, as reported through the consultations with stakeholders. Short of a large-scale survey study, actual numbers are difficult to derive as some of the publishers of these assessments consider that information to be proprietary. For the illustrative purposes of this report, the less formal identification process employed is considered adequate.

This process yielded five assessments in both reading and mathematics: Northwest Evaluation Association’s Measures of Academic Progress (MAP) Math and Reading assessments; Renaissance Learning’s STAR Math and STAR Reading assessments; Scholastic’s Study Island Math and Reading assessments; TerraNova Math and Reading assessments; and Success For All’s 4Sight Math and Reading assessments.
14 The predictive validity of selected benchmark assessments used in the Mid-Atlantic Region

the assessments, according to the publisher, is not to predict a future score on state assessments but rather "to provide a formative evaluation of student progress that predicts how a group of students would perform if the [state assessment] were given on the same day." As a result, it was argued that concurrent, rather than predictive, validity evidence was a more appropriate form of validity evidence in evaluating this assessment.

For the review of the benchmark assessments summarized here, direct measures (rather than self-report questionnaires) of technical adequacy and predictive validity were collected from December 2006 through February 2007. National standards call for a technical manual to be made available by the publisher so that any user can obtain information about the norms, reliability, and validity of the instrument as well as other relevant topics (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999, p. 70). Extensive efforts were made to obtain scoring manuals, technical reports, and predictive validity evidence associated with each benchmark assessment. Because assessment publishers vary in the amount and quality of information they provide in test manuals, this review encompasses a wide range of published and unpublished measures obtained from the four assessment developers.

Two sources of data were used to collect information: the publishers' web sites and the publishers themselves. Each publisher's web site was searched for test manuals, norm tables, bulletins, and other materials. Technical and administrative information was found online for STAR Math and Reading, MAP Math and Reading, and TerraNova Math and Reading. None of the assessment publishers provided access to a comprehensive technical manual on their web sites, however.

Since detailed technical information was not available online, and because unpublished reports often contain critical information, publishers were contacted directly. All provided some additional materials. Table A1 provides details on the availability of test manuals and other relevant research used in this review, including what information was available online and what information was available only by request in hard copy.

Rating

A rating guide, based on accepted standards in the testing profession and sound professional recommendations in the research literature (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Rudner, 1994), was developed for reviewing the documentation for each benchmark assessment.

Table A1
Availability of assessment information

Type of information | Measures of Academic Progress (MAP) | STAR | Study Island | TerraNova

Available online (on test developers' web sites)
Technical manual
Test manual (users guide) ✓ ✓ ✓
Predictive validity research and relevant psychometric information

Hard copy materials provided on request
Technical manual ✓ ✓ ✓
Test manual (users guide) ✓
Predictive validity research and relevant psychometric information ✓
Appendix A 15

First, the lead author, a trained psychometrician, rated each element on the rating guide based on a review of information collected for each assessment. Each element was rated either 3, indicating yes or true; 2, indicating somewhat true; 1, indicating no or not true; or na, indicating that the element was not applicable. Each profile was submitted to the assessment publisher or developer for review by its psychometric staff. The publishers were asked to confirm or contest the initial ratings and invited to submit additional information that might better inform the evaluation of that assessment. This second phase resulted in modifications to fewer than 10 percent of the ratings, mostly due to the acquisition of additional documentation providing specific evidence previously unavailable.
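To make the rating scheme concrete, the sketch below shows how one assessment's profile might be recorded with the guide's 3/2/1/na scale. It is purely illustrative: the element names and ratings are hypothetical and do not reproduce the actual rating guide or any publisher's results.

```python
# Hypothetical sketch of a rating profile using the review's 3/2/1/na scale.
# Element names and ratings are invented for illustration only.
RATING_LABELS = {
    "3": "yes or true",
    "2": "somewhat true",
    "1": "no or not true",
    "na": "not applicable",
}

profile = {
    "Adequate criterion validity evidence provided": "3",
    "Norm sample collected within the last five years": "2",
    "Predictive validity evidence provided": "1",
    "Appropriate samples used in pilot testing": "na",
}

# Print the profile in a reviewer-friendly form.
for element, rating in profile.items():
    print(f"{element}: {rating} ({RATING_LABELS[rating]})")
```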

Appendix B
Glossary

Benchmark assessment. A benchmark assessment is a formative test of performance, usually with multiple equated forms, that is administered at multiple times over a school year. In addition to formative functions, benchmark assessments allow educators to monitor the progress of students against state standards and to predict performance on state exams.

Computerized adaptive tests. A computer-based, sequential form of individual testing in which successive items in the test are chosen based primarily on the psychometric properties and content of the items and the test taker's response to previous items.

Concurrent validity. The relationship of one measure to another that assesses the same attribute and is administered at approximately the same time. See criterion validity.

Construct validity. A term used to indicate the degree to which the test scores can be interpreted as indicating the test taker's standing on the theoretical variable to be measured by the test.

Content validity. A term used in the 1974 Standards to refer to an aspect of validity that was "required when the test user wishes to estimate how an individual performs in the universe of situations the test is intended to represent" (American Psychological Association, 1974, p. 28). In the 1985 Standards the term was changed to content-related evidence, emphasizing that it referred to one type of evidence within a unitary conception of validity (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1985). In the current Standards, this type of evidence is characterized as "evidence based on test content" (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999).

Correlation. The tendency for certain values or levels of one variable to occur with particular values or levels of another variable.

Correlation coefficient. A measure of association between two variables that can range from –1.00 (perfect negative relationship) to 0 (no relationship) to +1.00 (perfect positive relationship).

Criterion. A standard or measure on which a judgment may be based.

Criterion validity. The ability of a measure to predict performance on a second measure of the same outcome, computed as a correlation. If both measures are administered at approximately the same time, this is described as concurrent validity. If the second measure is taken after the first, the ability is described as predictive validity.

Curriculum-based measure. A set of measures tied to the curriculum and used to assess student progress and to identify students who may need additional or specific instruction.

Decision consistency. The extent to which an assessment, if administered multiple times to the same respondent, would classify the respondent in the same way. For example, an instrument has strong decision consistency if students classified as proficient on one administration of the assessment would be highly likely to be classified as proficient on a second administration.

Formative assessment. An assessment designed to provide information to guide instruction.

Internal consistency. The extent to which the items in an assessment that are intended to measure the same outcome or construct do so consistently.

Internal consistency coefficient. An index of the reliability of test scores derived from the statistical interrelationships among item responses or scores on separate parts of a test.
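For reference, the correlation coefficient defined above is usually computed as a Pearson product-moment correlation, and one widely used internal consistency coefficient is coefficient alpha. Both are sketched below in standard textbook notation; the formulas are supplied here for convenience and are not drawn from any publisher's documentation.

```latex
% Pearson correlation between observed scores X and Y for n test takers
r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

% Coefficient alpha for a test with k items, item variances \sigma_i^2,
% and total score variance \sigma_X^2
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)
```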

Item response. The correct or incorrect answer to a question designed to elicit the presence or absence of some trait.

Item response function (IRF). An equation or the plot of an equation that indicates the probability of an item response for different levels of the overall performance.

Item response theory (IRT). Test analysis procedures that assume a mathematical model for the probability that a test taker will respond correctly to a specific test question, given the test taker's overall performance and the characteristics of the test questions.

Item scaling. A mathematical process through which test items are located on a measurement scale reflecting the construct the items purport to measure.

Norms. The distribution of test scores of some specified group. For example, this could be a national sample of all fourth graders, a national sample of all fourth-grade males, or all fourth graders in some local district.

Outcome. The presence or absence of an educationally desirable trait.

Predictive accuracy. The extent to which a test accurately predicts a given outcome, such as designation into a given category on another assessment.

Predictive validity. The ability of one assessment tool to predict future performance either in some activity (a job, for example) or on another assessment of the same construct.

Rasch model. One of a family of mathematical formulas, or item response models, describing the relationship between the probability of correctly responding to an assessment item and an individual's level of the trait being measured by the assessment item.

Reliability. The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and repeatable for an individual test taker; the degree to which scores are free of errors of measurement for a given group.

Reliability coefficient. A coefficient of correlation between two administrations of a test or among items within a test. The conditions of administration may involve variations in test forms, raters, or scorers or the passage of time. These and other changes in conditions give rise to varying descriptions of the coefficient, such as parallel form reliability, rater reliability, and test-retest reliability.

Standard errors of measurement. The standard deviation of an individual's observed scores from repeated administrations of a test (or parallel forms of a test) under identical conditions. Because such data cannot generally be collected, the standard error of measurement is usually estimated from group data.

Test documents. Publications such as test manuals, technical manuals, users guides, specimen sets, and directions for test administrators and scorers that provide information for evaluating the appropriateness and technical adequacy of a test for its intended purpose.

Test norming. The process of establishing normative responses to a test instrument by administering the test to a specified sample of respondents, generally representative of a given population.

Test score precision. The level of test score accuracy, or absence of error, at a given test score value.

Validity. The degree to which evidence and theory support specific interpretations of test scores entailed by proposed uses of a test.

Variation in test scores. The degree to which individual responses to a particular test vary across individuals or administrations.
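As a point of reference for the Rasch model and standard error of measurement entries above, their usual textbook forms are sketched below, with θ the individual's trait level, b the item difficulty, σ_X the standard deviation of observed scores, and ρ_XX′ the reliability coefficient. The notation is standard psychometric convention rather than something taken from the reviewed manuals.

```latex
% Rasch model: probability that person j answers item i correctly
P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}

% Standard error of measurement in terms of the observed-score standard
% deviation and the reliability coefficient
SEM = \sigma_X \sqrt{1 - \rho_{XX'}}
```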

Appendix C 
Detailed findings of benchmark
assessment analysis

Findings for Northwest Evaluation Association’s


Measures of Academic Progress (MAP)
Math and Reading assessments

Table C1
Northwest Evaluation Association's Measures of Academic Progress: reliability coefficients

Reliability coefficient (a) | Coefficient value | Interpretation
✓ Internal consistency | .92–.95 | These values reflect strong internal consistency.
✓ Test-retest with same form | .79–.94 | Almost all coefficients above .80. Exceptions in grade 2.
✓ Test-retest with equivalent forms | .89–.96 | Marginal reliabilities calculated from norm sample (technical manual, p. 55).
✓ Item/test information (IRT scaling) | | Uses Rasch model.
✓ Standard errors of measurement (SEM) | 2.5–3.5 on the RIT (Rasch unit) scale (or .25–.35 in logit values) | These values reflect adequate measurement precision. Scores typically range from 150–300 on the RIT scale.
Decision consistency | |

Note: See appendix B for definitions of terms.
a. Checkmarks indicate reliability information that is relevant to the types of interpretations being made with scores from this instrument.
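The SEM values in table C1 imply a simple rescaling between the Rasch logit metric and the RIT reporting scale. The slope of 10 follows directly from the table (an SEM of .25–.35 logits corresponds to 2.5–3.5 RIT points); the offset of roughly 200 is an assumption consistent with the reported 150–300 score range, not a figure taken from the technical manual.

```latex
% Assumed linear rescaling of the Rasch ability estimate (in logits) to RIT units;
% the slope of 10 is implied by the SEM ranges in table C1, and the offset of
% about 200 is an assumption consistent with the reported score range.
\mathrm{RIT} \approx 10\,\hat{\theta} + 200
\qquad\Rightarrow\qquad
\mathrm{SEM}_{\mathrm{RIT}} = 10 \times \mathrm{SEM}_{\theta}
```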

Table C2
Northwest Evaluation Association's Measures of Academic Progress: predictive validity

Criterion | Score (a) | Comments
Is the assessment score precise enough to use the assessment as a basis for decisions concerning individual students? | 3 | Estimated score precision based on SEM values suggests the scores are sufficiently precise for individual students (technical manual, p. 58).
Are criterion measures used to provide evidence of predictive validity? | 1 | Criterion measures are used to provide evidence of concurrent validity but not of predictive validity.
Is the rationale for choosing these measures provided? | 3 | Criterion measures are other validated assessments used in the states in which the concurrent validity studies were undertaken (technical manual, p. 52).
Is the distribution of scores on the criterion measure adequate? | 3 | Criterion measures in concurrent validity studies span multiple grade levels and student achievement.
Is the overall predictive accuracy of the assessment adequate? | 1 | The overall levels of relationship with the criterion measures are adequate, but they do not indicate evidence of predictive validity.
Are predictions for individuals whose scores are close to cutpoints of interest accurate? | 3 | The nature of the computer-adaptive tests allows for equally precise measures across the ability continuum.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.
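The predictive validity evidence that the second and fifth criteria in table C2 call for is, at its simplest, a correlation between benchmark scores collected early in the year and state assessment scores collected later, sometimes supplemented by a check of classification accuracy at the proficiency cutpoints. A minimal sketch of such an analysis follows; the file name, column names, and cut scores are hypothetical placeholders rather than values from any of the reviewed assessments.

```python
# Minimal sketch of a predictive validity analysis.
# The CSV, column names, and cut scores below are hypothetical.
import pandas as pd

scores = pd.read_csv("district_scores.csv")  # one row per student

# Predictive validity coefficient: correlation between the earlier
# benchmark score and the later state assessment score.
r = scores["benchmark_fall"].corr(scores["state_spring"])
print(f"Predictive validity coefficient (Pearson r): {r:.2f}")

# Predictive accuracy at a cutpoint: how often the benchmark's
# proficiency call agrees with the state assessment's call.
benchmark_cut, state_cut = 210, 1200  # hypothetical cut scores
agreement = (
    (scores["benchmark_fall"] >= benchmark_cut)
    == (scores["state_spring"] >= state_cut)
).mean()
print(f"Classification agreement at the cutpoints: {agreement:.0%}")
```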



Table C3
Northwest Evaluation Association's Measures of Academic Progress: content/construct validity

Criterion | Score (a) | Comments
Is there a clear statement of the universe of skills represented by the assessment? | 2 | No clear statement of universe of skills in reviewed documents, but there are vague statements about curriculum coverage with brief examples in technical manual. No listing of learning objectives is provided in reviewed documentation.
Was sufficient research conducted to determine desired assessment content and evaluate content? | 2 | Not clear from reviewed documentation. However, the content alignment guidelines detail a process that likely ensures appropriate content coverage.
Is sufficient evidence of construct validity provided for the assessment? | 3 | Concurrent validity estimates are provided in the technical manual and the process for defining the test content is found in the content alignment guidelines.
Is adequate criterion validity evidence provided? | 3 | Criterion validity evidence in the form of concurrent validity is provided for a number of criteria.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table C4
Northwest Evaluation Association's Measures of Academic Progress: administration of the assessment

Criterion | Score (a) | Comments
Are the administration procedures clear and easy to understand? | 3 | Procedures are clearly explained in the technical manual (pp. 36–39).
Do the administration procedures replicate the conditions under which the assessment was validated and normed? | 3 | Norm and validation samples were obtained using the same administration procedure outlined in the technical manual.
Are the administration procedures standardized? | 3 | Administration procedures are standardized in the technical manual.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table C5
Northwest Evaluation Association's Measures of Academic Progress: reporting

Criterion | Score (a) | Comments
Are materials and resources available to aid in interpreting assessment results? | 3 | Materials to aid in interpreting results are in the technical manual (pp. 44–45) and available on the publisher's web site.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.



Findings for Renaissance Learning’s STAR


Math and Reading assessments

Table C6
Renaissance Learning's STAR: reliability coefficients

Reliability coefficient (a) | Coefficient value | Interpretation
✓ Internal consistency | .77–.88 (Math), .89–.93 (Reading) | Strong internal consistency.
✓ Test-retest with same form | .81–.87 (Math), .82–.91 (Reading) | Strong stability.
✓ Test-retest with equivalent forms | .72–.79 (Math), .82–.89 (Reading) | Strong equivalence across forms.
✓ Item/test information (IRT scaling) | | Both tests use the Rasch model.
✓ Standard errors of measurement (SEM) | Average 40 (Math scale), average 74 (Reading scale) | Math scale ranges from 1 to 1,400 through linear transformation of the Rasch scale. Reading scale ranges from 1 to 1,400, but the Rasch scale is transformed through a conversion table. For reading this equates to roughly .49 on the IRT (logit) scale, or a classical reliability of approximately .75.
Decision consistency | |

Note: See appendix B for definitions of terms.
a. Checkmarks indicate reliability information that is relevant to the types of interpretations being made with scores from this instrument.
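The conversion in the last row of table C6, from an average SEM of about 74 reading scale-score points to a classical reliability of roughly .75, follows from the standard relation between the SEM, the score standard deviation, and reliability. Working backward from the table's own figures (a rough consistency check, not a value reported in the STAR manuals), the implied standard deviation of the reading scale scores is on the order of 150 points:

```latex
% Standard relation between reliability, SEM, and the score standard deviation
\rho_{XX'} = 1 - \frac{\mathrm{SEM}^2}{\sigma_X^2}
\qquad\Rightarrow\qquad
\sigma_X = \frac{\mathrm{SEM}}{\sqrt{1 - \rho_{XX'}}} \approx \frac{74}{\sqrt{1 - .75}} \approx 148
```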

Table C7
Renaissance Learning's STAR: content/construct validity

Criterion | Score (a) | Comments
Is there a clear statement of the universe of skills represented by the assessment? | 3 | Content domain well specified for both Math and Reading.
Was sufficient research conducted to determine desired assessment content and evaluate content? | 3 | Content specifications well documented for both Math and Reading.
Is sufficient evidence of construct validity provided for the assessment? | 3 | Construct validity evidence provided for both Math and Reading assessments in technical manuals.
Is adequate criterion validity evidence provided? | 3 | Criterion validity estimates are provided for 28 tests in Math (276 correlations) and 26 tests in Reading (223 correlations).

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.



Table C8
Renaissance Learning's STAR: appropriate samples for assessment validation and norming

Criterion | Score (a) | Comments
Is the purpose of the assessment clearly stated? | 3 | The purpose for both Math and Reading assessments is clearly stated in the documentation.
Is a description of the framework for the assessment clearly stated? | 3 | The framework for the Math assessment is well delineated in the technical manual, along with a list of the objectives covered. The STAR Reading assessment technical manual states, "After an exhaustive search, the point of reference for developing STAR Reading items that best matched appropriate word-level placement information was found to be the 1995 updated vocabulary lists that are based on the Educational Development Laboratory's (EDL) A Revised Core Vocabulary (1969)" (p. 11).
Is there evidence that the assessment adequately addresses the knowledge, skills, abilities, behavior, and values associated with the intended outcome? | 3 | Yes, a list of objectives is provided in the technical manual for STAR Math. For STAR Reading the measures are restricted to vocabulary and words in the context of an authentic text passage.
Were appropriate samples used in pilot testing? | 3 | Sufficient samples were used in assessment development processes (calibration stages).
Were appropriate samples used in validation? | 3 | Sufficient samples were used in validation.
Were appropriate samples used in norming? | 3 | Norm samples are described in detail in the technical manual.
If normative data are provided, was the norm sample collected within the last five years? | 2 | A norm sample was collected in spring 2002 for Math, but in 1999 for Reading.
Are the procedures associated with the gathering of the normative data sufficiently well described so that procedures can be properly evaluated? | 3 | The norming procedure is well described in the technical manuals for both assessments.
Is there sufficient variation in assessment scores? | 3 | Scores are sufficiently variable across grades as indicated by scale score standard deviations in the technical manuals from calibration or norm samples.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table C9
Renaissance Learning's STAR: administration of the assessment

Criterion | Score (a) | Comments
Are the administration procedures clear and easy to understand? | 3 | Procedures are defined in the technical manual (pp. 7–8).
Do the administration procedures replicate the conditions under which the assessment was validated and normed? | 3 | Procedures appear consistent with procedures used in norm sample.
Are the administration procedures standardized? | 3 | Administration procedures are standardized (technical manuals, pp. 7–8).

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.



Table C10
Renaissance Learning's STAR: reporting

Criterion | Score (a) | Comments
Are materials and resources available to aid in interpreting assessment results? | 3 | Information on assessment score interpretation provided in technical documentation.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Findings for Scholastic's Study Island
Math and Reading assessments

Table C11
Scholastic's Study Island: reliability coefficients

Reliability coefficient (a) | Coefficient value | Interpretation
Internal consistency | None |
Test-retest with same form | None |
Test-retest with equivalent forms | None |
Item/test information (IRT scaling) | None |
Standard errors of measurement | None |
Decision consistency | None |

Note: See appendix B for definitions of terms.
a. Checkmarks indicate reliability information that is relevant to the types of interpretations being made with scores from this instrument.

Table C12
Scholastic's Study Island: content/construct validity

Criterion | Score (a) | Comments
Is there a clear statement of the universe of skills represented by the assessment? | 3 | Statement indicates entirety of state standards.
Was sufficient research conducted to determine desired assessment content and evaluate content? | 1 | None provided.
Is sufficient evidence of construct validity provided for the assessment? | 1 | None provided.
Is adequate criterion validity evidence provided? | 1 | No criterion validity evidence is provided.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.



Table C13
Scholastic's Study Island: appropriate samples for assessment validation and norming

Criterion | Score (a) | Comments
Is the purpose of the assessment clearly stated? | 3 | Yes, in the administrators handbook.
Is a description of the framework for the assessment clearly stated? | 3 | Framework relates to state standards.
Is there evidence that the assessment adequately addresses the knowledge, skills, abilities, behavior, and values associated with the intended outcome? | 1 | No validation evidence is provided.
Were appropriate samples used in pilot testing? | na | No pilot testing information provided.
Were appropriate samples used in validation? | na | No validation information provided.
Were appropriate samples used in norming? | na | No normative information provided.
If normative data are provided, was the norm sample collected within the last five years? | na | None provided.
Are the procedures associated with the gathering of the normative data sufficiently well described so that procedures can be properly evaluated? | na | No normative information provided.
Is there sufficient variation in assessment scores? | na | None provided.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table C14
Scholastic's Study Island: administration of the assessment

Criterion | Score (a) | Comments
Are the administration procedures clear and easy to understand? | 3 | Outlined in the administrators handbook (p. 20).
Do the administration procedures replicate the conditions under which the assessment was validated and normed? | na |
Are the administration procedures standardized? | 3 | Computer delivers assessments in a standardized fashion.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table C15
Scholastic's Study Island: reporting

Criterion | Score (a) | Comments
Are materials and resources available to aid in interpreting assessment results? | 3 | The administrators handbook and web site offer interpretative guidance.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.



Findings for CTB/McGraw Hill’s TerraNova


Math and Reading assessments

Table C16
CTB/McGraw-Hill's TerraNova: reliability coefficients

Reliability coefficient (a) | Coefficient value | Interpretation
✓ Internal consistency | .80–.95 | Strong internal consistency.
Test-retest with same form | |
✓ Test-retest with equivalent forms | .67–.84 | Moderate to strong evidence of stability for various grade levels.
Item/test information (IRT scaling) | |
✓ Standard errors of measurement (SEM) | 2.8–4.5 | Variability in SEMs across grade, but standard errors are sufficiently small. These standard errors equate to roughly .25–.33 standard deviation units for the test scale. Scores typically range from 423 to 722 in reading in grade 3 and from 427 to 720 in math in grade 3.
✓ Decision consistency | | Generalizability coefficients exceed .86.

Note: See appendix B for definitions of terms.
a. Checkmarks indicate reliability information that is relevant to the types of interpretations being made with scores from this instrument.

Table C17
CTB/McGraw-Hill's TerraNova: content/construct validity

Criterion | Score (a) | Comments
Is there a clear statement of the universe of skills represented by the assessment? | 3 | The domain tested is derived from careful examination of content of recently published textbook series, instructional programs, and national standards publications (technical manual, p. 17).
Was sufficient research conducted to determine desired assessment content and/or evaluate content? | 3 | "Comprehensive reviews were conducted of curriculum guides from almost every state, and many districts and dioceses, to determine common educational goals" (technical manual, p. 17).
Is sufficient evidence of construct validity provided for the assessment? | 3 | Construct validity evidence provided in technical manual (pp. 32–58).
Is adequate criterion validity evidence provided? | 3 | Linking study to the Pennsylvania System of School Assessment (PSSA) provides evidence of predictive validity for grades 3–11.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table C18
CTB/McGraw-Hill's TerraNova: appropriate samples for test validation and norming

Criterion | Score (a) | Comments
Is the purpose of the assessment clearly stated? | 3 | Purpose clearly stated in the technical manual (p. 1).
Is a description of the framework for the assessment clearly stated? | 3 | The framework is described in the development process in the technical manual (p. 18).
Is there evidence that the assessment adequately addresses the knowledge, skills, abilities, behavior, and values associated with the intended outcome? | 3 | Construct validity evidence provided in the technical manual (pp. 32–58).
Were appropriate samples used in pilot testing? | 3 | "The test design entailed a target N of at least 400 students in the standard sample and 150 students in each of the African-American and Hispanic samples for each level and content area. More than 57,000 students were involved in the TerraNova tryout study" (technical manual, p. 26).
Were appropriate samples used in validation? | 3 | Standardization sample was appropriate and is described in the technical manual (pp. 63–66).
Were appropriate samples used in norming? | 3 | Standardization sample was appropriate and is described in the technical manual (pp. 63–66). More than 275,000 students participated in the standardization sample.
If normative data are provided, was the norm sample collected within the last five years? | 3 | According to the technical manual, TerraNova national standardization occurred in the spring and fall of 1996 (p. 61). According to the technical quality report, standardization samples were revised in 1999 and 2000, and further documentation from the vendor indicates that the norms were updated again in 2005.
Are the procedures associated with the gathering of the normative data sufficiently well described so that the procedures can be properly evaluated? | 3 | National standardization study detailed in technical manual (pp. 61–90).
Is there sufficient variation in assessment scores? | 3 | TerraNova assessment scores from the national sample reflect score variability across and within grades.

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table C19
CTB/McGraw-Hill's TerraNova: administration of the assessment

Criterion | Score (a) | Comments
Are the administration procedures clear and easy to understand? | 3 | The teachers guide details the test administration procedure in an understandable manner (pp. 3–4).
Do the administration procedures replicate the conditions under which the assessment was validated and normed? | 3 | Standardization sample was drawn from users, thus the conditions of assessment use and standardized sample use are comparable.
Are the administration procedures standardized? | 3 | Procedures are standardized and detailed in the teachers guide (pp. 3–4).

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.

Table C20
CTB/McGraw-Hill's TerraNova: reporting

Criterion | Score (a) | Comments
Are materials and resources available to aid in interpreting assessment results? | 3 | An "information system" was developed to "ensure optimal application of the precise data provided by TerraNova" (technical quality report, p. 29).

a. 3 is yes, true; 2 is somewhat true; 1 is not true; and na is not applicable.



Notes

1. This example comes from Enhancing Education Through Technology site visit reports (Metiri Group, 2007).

2. In addition, the technical manual (Northwest Evaluation Association, 2003) provides concurrent validity evidence for the tests through correlation analysis with a number of criterion outcome measures. These include: Arizona Instrument to Measure Standards, Colorado Student Assessment Program, Illinois Standards Achievement Test, Indiana Statewide Testing for Educational Progress-Plus, Iowa Test of Basic Skills, Minnesota Comprehensive Assessment and Basic Skills Test, Nevada Criterion Referenced Assessment, Palmetto Achievement Challenge Tests, Stanford Achievement Test, Texas Assessment of Knowledge and Skills, Washington Assessment of Student Learning, and the Wyoming Comprehensive Assessment System.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (Joint Committee). (1999). Standards for educational and psychological testing. Washington, DC: Authors.

American Psychological Association. (1974). Standards for educational and psychological tests. Washington, DC: Author.

Clarke, B., & Shinn, M. R. (2004). A preliminary investigation into the identification and development of early mathematics curriculum-based measurement. School Psychology Review, 33(2), 234–248.

CTB/McGraw-Hill. (2001a). TerraNova technical bulletin (2nd ed.). Monterey, CA: Author.

CTB/McGraw-Hill. (2001b). TerraNova technical report. Monterey, CA: Author.

CTB/McGraw-Hill. (2002a). A linking study between the TerraNova assessment series and Pennsylvania PSSA Assessment for reading and mathematics. Monterey, CA: Author.

CTB/McGraw-Hill. (2002b). Teacher's guide to TerraNova (2nd ed.). Monterey, CA: Author.

Education Market Research. (2005a, February). The complete K-12 newsletter: The reading market. Rockaway Park, NY. Retrieved March 2007 from http://www.ed-market.com/r_c_archives/display_article.php?article_id=76

Education Market Research. (2005b, June). The complete K-12 newsletter: The mathematics market. Rockaway Park, NY. Retrieved March 10, 2007 from http://www.ed-market.com/r_c_archives/display_article.php?article_id=83

Good, R. H., Simmons, D. C., & Kame'enui, E. J. (2001). The importance and decision-making utility of a continuum of fluency-based indicators of foundational reading skills for third-grade high-stakes outcomes. Scientific Studies of Reading, 5, 257–288.

McGlinchey, M. T., & Hixson, M. D. (2004). Using curriculum-based measurement to predict performance on state assessments in reading. School Psychology Review, 33(2), 193–203.

Metiri Group. (2007). Evaluation of the Pennsylvania Enhancing Education Through Technology Program: 2004–2006. Culver City, CA: Author.

Northwest Evaluation Association. (2003). Technical manual for the NWEA Measures of Academic Progress and Achievement Level Tests. Lake Oswego, OR: Author.

Northwest Evaluation Association. (2004). Reliability and validity estimates: NWEA Achievement Level Tests and Measures of Academic Progress. Lake Oswego, OR: Author.

Northwest Evaluation Association. (2005, September). RIT scale norms for use with Achievement Level Tests and Measures of Academic Progress. Lake Oswego, OR: Author.

Northwest Evaluation Association. (2006). Teacher handbook: Measures of Academic Progress (MAP). Lake Oswego, OR: Author. Available online at http://www.nwea.org/assets/documentLibrary/Step%201%20-%20MAP%20Administration%20Teacher%20Handbook.pdf

Northwest Evaluation Association. (n.d.). NWEA Measures of Academic Progress (MAP) content alignment guidelines and processes and item development processes. Lake Oswego, OR: Author.

Olson, L. (2005, December 30). Benchmark assessments offer regular checkups on student achievement. Education Week, 22(37), 13–14.

Renaissance Learning. (2000). STAR Reading: Understanding reliability and validity. Wisconsin Rapids, WI: Author. Retrieved from http://research.renlearn.com/research/pdfs/133.pdf

Renaissance Learning. (2001a). STAR Math technical manual. Wisconsin Rapids, WI: Author.

Renaissance Learning. (2001b). STAR Math: Understanding reliability and validity. Wisconsin Rapids, WI: Author. Retrieved from http://research.renlearn.com/research/pdfs/135.pdf

Renaissance Learning. (2002). STAR Reading technical manual. Wisconsin Rapids, WI: Author.

Renaissance Learning. (2003). The research foundation for Reading Renaissance goal-setting practices. Madison, WI: Author. Retrieved from http://research.renlearn.com/research/pdfs/162.pdf

Rudner, L. M. (1994). Questions to ask when evaluating tests. Practical Assessment, Research & Evaluation, 4(2). Retrieved February 14, 2007 from http://PAREonline.net/getvn.asp?v=4&n=2

Schilling, S. G., Carlisle, J. F., Scott, S. E., & Zeng, J. (2007). Are fluency measures accurate predictors of reading achievement? The Elementary School Journal, 107, 429–448.

Scholastic. (2006a). Study Island administrator handbook. Dallas, TX: Author.

Scholastic. (2006b). Evidence of success: Pennsylvania System of School Assessments: Solid research equals solid results research. Retrieved from http://www.studyisland.com/salesheets

Stage, S. A., & Jacobsen, M. D. (2001). Predicting student success on state-mandated performance-based assessment using oral reading fluency. School Psychology Review, 30(3), 407–419.

VanDerHeyden, A. M., Witt, J. C., Naquin, G., & Noell, G. (2001). The reliability and validity of curriculum-based measurement readiness probes for kindergarten students. School Psychology Review, 30(3), 363–382.
