Advantages of Norm-Referenced Tests
A norm-referenced test is a type of test, assessment, or evaluation that yields an
estimate of the position of the tested individual in a predefined population, with respect to the
trait being measured. The estimate is derived from the analysis of test scores and possibly other
relevant data from a sample drawn from the population.[1] That is, this type of test identifies
whether the test taker performed better or worse than other test takers, not whether the test taker
knows either more or less material than is necessary for a given purpose.
The term normative assessment refers to the process of comparing one test-taker to his or her
peers.[1]
Norm-Referenced Test
Norm-referenced refers to standardized tests that are designed to compare and rank test takers in
relation to one another. Norm-referenced tests report whether test takers performed better or
worse than a hypothetical average student, which is determined by comparing scores against the
performance results of a statistically selected group of test takers, typically of the same age or
grade level, who have already taken the exam.
Calculating norm-referenced scores is called the “norming process,” and the comparison group is
known as the “norming group.” Norming groups typically comprise only a small subset of
previous test takers, not all or even most previous test takers. Test developers use a variety of
statistical methods to select norming groups, interpret raw scores, and determine performance
levels.
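To make the norming idea concrete, the short Python sketch below converts a raw score into a
percentile rank against a small, made-up norming group. This is only a hypothetical illustration;
actual norming uses much larger samples and more sophisticated statistical models.

    # Hypothetical sketch: converting a raw score to a percentile rank
    # against a norming group (illustrative data, not a real norm table).
    def percentile_rank(raw_score, norming_scores):
        """Percent of the norming group scoring at or below raw_score."""
        at_or_below = sum(1 for s in norming_scores if s <= raw_score)
        return 100.0 * at_or_below / len(norming_scores)

    # Made-up raw scores from a norming group of 20 students
    norming_group = [12, 15, 18, 20, 21, 22, 23, 25, 25, 26,
                     27, 28, 29, 30, 31, 33, 34, 36, 38, 40]

    print(percentile_rank(27, norming_group))  # a raw score of 27 -> 55.0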
Norm-referenced tests often use a multiple-choice format, though some include open-ended,
short-answer questions. They are usually based on some form of national standards, not locally
determined standards or curricula. IQ tests are among the most well-known norm-referenced
tests, as are developmental-screening tests, which are used to identify learning disabilities in
young children or determine eligibility for special-education services. A few major norm-
referenced tests include the California Achievement Test, Iowa Test of Basic Skills, Stanford
Achievement Test, and TerraNova.
The following are a few representative examples of how norm-referenced tests and scores may
be used:
To determine a young child’s readiness for preschool or kindergarten. These tests may be
designed to measure oral-language ability, visual-motor skills, and cognitive and social
development.
To evaluate basic reading, writing, and math skills. Test results may be used for a wide
variety of purposes, such as measuring academic progress, making course assignments,
determining readiness for grade promotion, or identifying the need for additional
academic support.
To identify specific learning disabilities, such as autism, dyslexia, or nonverbal learning
disability, or to determine eligibility for special-education services.
To make program-eligibility or college-admissions decisions (in these cases, norm-
referenced scores are generally evaluated alongside other information about a student).
Scores on SAT or ACT exams are a common example.
Norm-referenced tests are specifically designed to rank test takers on a “bell curve,” or a
distribution of scores that resembles, when graphed, the outline of a bell—i.e., a small
percentage of students performing well, most performing average, and a small percentage
performing poorly. To produce a bell curve each time, test questions are carefully designed to
accentuate performance differences among test takers, not to determine if students have achieved
specified learning standards, learned certain material, or acquired specific skills and knowledge.
Tests that measure performance against a fixed set of standards or criteria are called criterion-
referenced tests.
Criterion-referenced test results are often based on the number of correct answers provided by
students, and scores might be expressed as a percentage of the total possible number of correct
answers. On a norm-referenced exam, however, the score would reflect how many more or fewer
correct answers a student gave in comparison to other students. Hypothetically, if all the students
who took a norm-referenced test performed poorly, the least-poor results would rank students in
the highest percentile. Similarly, if all students performed extraordinarily well, the least-strong
performance would rank students in the lowest percentile.
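The following small Python sketch (hypothetical scores and a class of five, chosen only for
illustration) makes the contrast concrete: every student answers fewer than half the items
correctly, so criterion-referenced percent-correct scores are uniformly low, yet the least-poor
result still lands at the top of the norm-referenced ranking.

    # Hypothetical contrast between criterion-referenced and norm-referenced scoring.
    # All students answer fewer than half of 50 items correctly.
    scores = {"Ana": 10, "Ben": 14, "Cruz": 18, "Dee": 22, "Eli": 24}
    total_items = 50

    for name, correct in scores.items():
        # Criterion-referenced view: percent of items answered correctly
        percent_correct = 100.0 * correct / total_items
        # Norm-referenced view: percent of this group scoring at or below this student
        percentile = 100.0 * sum(1 for s in scores.values() if s <= correct) / len(scores)
        print(f"{name}: {percent_correct:.0f}% correct, {percentile:.0f}th percentile of the group")

    # Eli answers only 48% of the items correctly yet still ranks at the
    # top (100th percentile) of this hypothetically weak group.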
It should be noted that norm-referenced tests cannot measure the learning achievement or
progress of an entire group of students, but only the relative performance of individuals within a
group. For this reason, criterion-referenced tests are used to measure whole-group performance.
Reform
Norm-referenced tests have historically been used to make distinctions among students, often for
the purposes of course placement, program eligibility, or school admissions. Yet because norm-
referenced tests are designed to rank student performance on a relative scale—i.e., in relation to
the performance of other students—norm-referenced testing has been abandoned by many
schools and states in favor of criterion-referenced tests, which measure student performance in
relation to a common set of fixed criteria or standards.
It should be noted that norm-referenced tests are typically not the form of standardized test
widely used to comply with state or federal policies—such as the No Child Left Behind Act—
that are intended to measure school performance, close “achievement gaps,” or hold schools
accountable for improving student learning results. In most cases, criterion-referenced tests are
used for these purposes because the goal is to determine whether schools are successfully
teaching students what they are expected to learn.
Similarly, the assessments being developed to measure student achievement of the Common
Core State Standards are also criterion-referenced exams. However, some test developers
promote their norm-referenced exams—for example, the TerraNova Common Core—as a way
for teachers to “benchmark” learning progress and determine if students are on track to perform
well on Common Core–based assessments.
Debate
While norm-referenced tests are not the focus of ongoing national debates about “high-stakes
testing,” they are nonetheless the object of much debate. The essential disagreement is between
those who view norm-referenced tests as objective, valid, and fair measures of student
performance, and those who believe that relying on relative performance results is inaccurate,
unhelpful, and unfair, especially when making important educational decisions for students.
While part of the debate centers on whether or not it is ethically appropriate, or even
educationally useful, to evaluate individual student learning in relation to other students (rather
than evaluating individual performance in relation to fixed and known criteria), much of
the debate is also focused on whether there is a general overreliance on standardized-test scores
in the United States, and whether a single test, no matter what its design, should be used—to
the exclusion of other measures—to evaluate school or student performance.
The following are representative of the kinds of arguments typically made by proponents of
norm-referenced testing:
Norm-referenced tests are relatively inexpensive to develop, simple to administer, and
easy to score. As long as the results are used alongside other measures of performance,
they can provide valuable information about student learning.
The quality of norm-referenced tests is usually high because they are developed by
testing experts, piloted, and revised before they are used with students, and they are
dependable and stable for what they are designed to measure.
Norm-referenced tests can help differentiate students and identify those who may have
specific educational needs or deficits that require specialized assistance or learning
environments.
The tests are an objective evaluation method that can decrease bias or favoritism when
making educational decisions. If there are limited places in a gifted and talented program,
for example, one transparent way to make the decision is to give every student the same
test and allow the highest-scoring students to gain entry.
The following are representative of the kinds of arguments typically made by critics of norm-
referenced testing:
Although testing experts and test developers warn that major educational decisions
should not be made on the basis of a single test score, norm-referenced scores are often
misused in schools when making critical educational decisions, such as grade promotion
or retention, which can have potentially harmful consequences for some students and
student groups.
Norm-referenced tests encourage teachers to view students in terms of a bell curve, which
can lead them to lower academic expectations for certain groups of students, particularly
special-needs students, English-language learners, or minority groups. And when
academic expectations are consistently lowered year after year, students in these groups
may never catch up to their peers, creating a self-fulfilling prophecy. For a related
discussion, see high expectations.
Multiple-choice tests—the dominant norm-referenced format—are better suited to
measuring remembered facts than more complex forms of thinking. Consequently, norm-
referenced tests promote rote learning and memorization in schools over more
sophisticated cognitive skills, such as writing, critical reading, analytical thinking,
problem solving, or creativity.
Overreliance on norm-referenced test results can lead to inadvertent discrimination
against minority groups and low-income student populations, both of which tend to face
more educational obstacles than non-minority students from higher-income households.
For example, many educators have argued that the overuse of norm-referenced testing has
resulted in a significant overrepresentation of minority students in special-education
programs. On the other hand, using norm-referenced scores to determine placement in
gifted and talented programs, or other “enriched” learning opportunities, leads to the
underrepresentation of minority and lower-income students in these programs. Similarly,
students from higher-income households may have an unfair advantage in the college-
admissions process because they can afford expensive test-preparation services.
An overreliance on norm-referenced test scores undervalues important achievements, skills, and
abilities in favor of the narrower set of skills measured by the tests.
Norm- and Criterion-Referenced Testing
Linda A. Bond
North Central Regional Educational Laboratory
Tests can be categorized into two major groups: norm-referenced tests and
criterion-referenced tests. These two tests differ in their intended purposes, the
way in which content is selected, and the scoring process, which defines how the
test results must be interpreted. This brief paper will describe the differences
between these two types of assessments and explain the most appropriate uses of
each.
INTENDED PURPOSES
The major reason for using a norm-referenced test (NRT) is to classify students.
NRTs are designed to highlight achievement differences between and among
students to produce a dependable rank order of students across a continuum of
achievement from high achievers to low achievers (Stiggins, 1994). School
systems might want to classify students in this way so that they can be properly
placed in remedial or gifted programs. These types of tests are also used to help
teachers select students for different ability level reading or mathematics
instructional groups.
With norm-referenced tests, a representative group of students is given the test
prior to its availability to the public. The scores of the students who take the test
after publication are then compared to those of the norm group. Tests such as the
California Achievement Test (CTB/McGraw-Hill), the Iowa Test of Basic Skills
(Riverside), and the Metropolitan Achievement Test (Psychological Corporation)
are normed using a national sample of students. Because norming a test is such
an elaborate and expensive process, the norms are typically used by test
publishers for seven years. All students who take the test during that seven-year
period have their scores compared to the original norm group.
While norm-referenced tests ascertain the rank of students, criterion-referenced
tests (CRTs) determine "...what test takers can do and what they know, not how
they compare to others" (Anastasi, 1988, p. 102). CRTs report how well students
are doing relative to a predetermined performance level on a specified set of
educational goals or outcomes included in the school, district, or state
curriculum.
Educators or policy makers may choose to use a CRT when they wish to see how
well students have learned the knowledge and skills which they are expected to
have mastered. This information may be used as one piece of information to
determine how well the student is learning the desired curriculum and how well
the school is teaching that curriculum.
Both NRTs and CRTs can be standardized. The U.S. Congress, Office of
Technology Assessment (1992) defines a standardized test as one that uses
uniform procedures for administration and scoring in order to assure that the
results from different people are comparable. Any kind of test, from multiple
choice to essays to oral examinations, can be standardized if uniform scoring and
administration are used (p. 165). This means that the comparison of student
scores is possible. Thus, it can be assumed that two students who receive
identical scores on the same standardized test demonstrate corresponding levels
of performance. Most national, state and district tests are standardized so that
every score can be interpreted in a uniform manner for all students and schools.
SELECTION OF TEST CONTENT
Test content is an important factor in choosing between an NRT and a CRT. The
content of an NRT is selected according to how well it ranks students from high
achievers to low. The content of a CRT is determined by
how well it matches the learning outcomes deemed most important. Although no
test can measure everything of importance, the content selected for the CRT is
selected on the basis of its significance in the curriculum while that of the NRT is
chosen by how well it discriminates among students.
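One common way of quantifying how well an item discriminates among students is to compare how
often high-scoring and low-scoring examinees answer it correctly. The Python sketch below shows a
simplified upper-lower discrimination index with invented response data; operational item analysis
uses larger samples and statistics such as point-biserial correlations.

    # Hypothetical sketch of an upper-lower item discrimination index.
    # 1 = answered the item correctly, 0 = answered incorrectly.
    # Responses are listed in order of students' total test scores, highest first.
    item_responses = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

    def discrimination_index(responses):
        """Proportion correct in the top third minus proportion correct in the bottom third."""
        k = len(responses) // 3
        upper, lower = responses[:k], responses[-k:]
        return sum(upper) / k - sum(lower) / k

    # Values near +1 mean high scorers get the item right while low scorers miss it,
    # so the item separates students well; values near 0 mean it barely distinguishes them.
    print(discrimination_index(item_responses))  # 0.75 for this made-up item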
Any national, state or district test communicates to the public the skills that
students should have acquired as well as the levels of student performance that
are considered satisfactory. Therefore, education officials at any level should
carefully consider the content of the test which is selected or developed. Because of
the importance placed upon high scores, the content of a standardized test can be
very influential in the development of a school's curriculum and standards of
excellence.
NRTs have recently come under attack because, critics contend, they have traditionally
focused on low-level basic skills. This emphasis is in direct contrast
to the recommendations made by the latest research on teaching and learning
which calls for educators to stress the acquisition of conceptual understanding as
well as the application of skills. The National Council of Teachers of
Mathematics (NCTM) has been particularly vocal about this concern. In an
NCTM publication (1991), Romberg (1989) reported that "a recent study of the six
most commonly used commercial achievement tests found that at grade 8, on
average, only 1 percent of the items were problem solving while 77 percent were
computation or estimation" (p. 8).
In order to best prepare their students for the standardized achievement tests,
teachers usually devote much time to teaching the information which is found on
the standardized tests. This is particularly true if the standardized tests are also
used to measure an educator's teaching ability. This pressure on teachers
for their students to perform well on these tests has resulted in an
emphasis on low-level skills in the classroom (Corbett & Wilson, 1991). With
curriculum specialists and educational policy makers alike calling for more
attention to higher level skills, these tests may be driving classroom practice in
the opposite direction of educational reform.
TEST INTERPRETATION
As mentioned earlier, a student's performance on an NRT is interpreted in
relation to the performance of a large group of similar students who took the test
when it was first normed. For example, if a student receives a percentile rank
score on the total test of 34, this means that he or she performed as well as or
better than 34% of the students in the norm group. This type of information can be
useful for deciding whether a student needs remedial assistance or is a candidate
for a gifted program. However, the score gives little information about what the
student actually knows or can do. The validity of the score in these decision
processes depends on whether or not the content of the NRT matches the
knowledge and skills expected of the students in that particular school system.
It is easier to ensure the match to expected skills with a CRT. CRTs give detailed
information about how well a student has performed on each of the educational
goals or outcomes included on that test. For instance, "... a CRT score might
describe which arithmetic operations a student can perform or the level of
reading difficulty he or she can comprehend" (U.S. Congress, OTA, 1992, p. 170).
As long as the content of the test matches the content that is considered
important to learn, the CRT gives the student, the teacher, and the parent more
information about how much of the valued content has been learned than an
NRT.
SUMMARY
Public demands for accountability, and consequently for high standardized tests
scores, are not going to disappear. In 1994, thirty-one states administered NRTs,
while thirty-three states administered CRTs. Among these states, twenty-two
administered both. Only two states relied on NRTs exclusively, while one state
relied exclusively on a CRT. Acknowledging the recommendations for educational
reform and the popularity of standardized tests, some states are designing tests
that "reflect, insofar as possible, what we believe to be appropriate educational
practice" (NCTM, 1991, p.9). In addition to this, most states also administer
other forms of assessment, such as a writing sample, some form of open-ended
performance assessment or a portfolio (CCSSO/NCREL, 1994).
Before a state can choose what type of standardized test to use, the state
education officials will have to consider if that test meets three standards. These
criteria are whether the assessment strategy(ies) of a particular test matches the
state's educational goals, addresses the content the state wishes to assess, and
allows the kinds of interpretations state education officials wish to make about
student performance. Once they have determined these three things, the task of
choosing between an NRT and a CRT becomes easier.
REFERENCES
Anastasi, A. (1988). Psychological Testing. New York: Macmillan Publishing Company.
Corbett, H.D., & Wilson, B.L. (1991). Testing, Reform and Rebellion. Norwood, NJ: Ablex
Publishing Company.
Romberg, T.A., Wilson, L., & Khaketla, M. (1991). "The Alignment of Six Standardized Tests
with NCTM Standards," unpublished paper, University of Wisconsin-Madison. In J.K. Stenmark
(Ed.), Mathematics Assessment: Myths, Models, Good Questions, and Practical Suggestions.
Reston, VA: National Council of Teachers of Mathematics (NCTM).
Stenmark, J.K. (Ed.). (1991). Mathematics Assessment: Myths, Models, Good Questions, and
Practical Suggestions. Reston, VA: National Council of Teachers of Mathematics (NCTM).
Stiggins, R.J. (1994). Student-Centered Classroom Assessment. New York: Merrill.
U.S. Congress, Office of Technology Assessment (1992). Testing in America's Schools: Asking
the Right Questions. OTA-SET-519. Washington, DC: U.S. Government Printing Office.
A serious limitation of norm-referenced tests is that the reference group may not represent the
current population of interest. As noted by the Oregon Research Institute's International
Personality Item Pool website, "One should be very wary of using canned 'norms' because it
isn't obvious that one could ever find a population of which one's present sample is a
representative subset. Most 'norms' are misleading, and therefore they should not be used. Far
more defensible are local norms, which one develops oneself. For example, if one wants to give
feedback to members of a class of students, one should relate the score of each individual to the
means and standard deviations derived from the class itself. To maximize informativeness, one
can provide the students with the frequency distribution for each scale, based on these local
norms, and the individuals can then find (and circle) their own scores on these relevant
distributions." [8]
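A minimal Python sketch of the local-norms idea, assuming a hypothetical class of twelve scores
on one scale: each student's score is related to the class's own mean and standard deviation, and
a simple frequency distribution is printed so each student can locate (and circle) their own score.

    import statistics
    from collections import Counter

    # Hypothetical class scores; the local norming group is the class itself.
    class_scores = [31, 28, 35, 22, 27, 30, 33, 29, 26, 30, 24, 32]

    mean = statistics.mean(class_scores)
    sd = statistics.stdev(class_scores)

    # Relate one student's score to the class mean and standard deviation.
    my_score = 33
    z = (my_score - mean) / sd
    print(f"class mean = {mean:.1f}, sd = {sd:.1f}, z for {my_score} = {z:+.2f}")

    # Frequency distribution based on these local norms.
    for score, count in sorted(Counter(class_scores).items()):
        print(f"{score:3d} | {'*' * count}")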
Norm-referencing does not ensure that a test is valid (i.e. that it measures the construct it is
intended to measure).
Another disadvantage of norm-referenced tests is that they cannot measure progress of the
population as a whole, only where individuals fall within the whole. To measure the success of
an educational reform program that seeks to raise the achievement of all students, for instance,
one must instead measure against a fixed goal.
With a norm-referenced test, grade level was traditionally defined by the middle 50
percent of scores.[9] By contrast, the National Children's Reading Foundation believes that it is
essential to assure that virtually all children read at or above grade level by third grade, a goal
which cannot be achieved with a norm-referenced definition of grade level.[10]
Norms do not automatically imply a standard. A norm-referenced test does not seek to enforce
any expectation of what test takers should know or be able to do. It measures the test takers'
current level by comparing the test takers to their peers. A rank-based system produces only data
that tell which students perform at an average level, which students do better, and which students
do worse. It does not identify which test takers are able to correctly perform the tasks at a level
that would be acceptable for employment or further education.