STATISTICS

Rasch Analysis: An Introduction to Objective Measurement

Donna Surges Tatum, PhD
Dr Tatum is director of Examination Activities and staff psychometrician, ASCP Board of Registry, Chicago.
As described in the following article, not all statistics are created equal: some describe, while others measure. We at Laboratory Medicine wanted to measure your attitudes and needs, because we believe this will provide us with the information necessary to produce the best publication for all our readers.
Rasch analysis is being used to measure the responses to the Laboratory Medicine Reader Surveys that appeared in the January, February, March, and April issues. The results of this analysis will be published in the coming months and will help determine the future direction of Laboratory Medicine.
Research Problems
Decision Making
We conduct research because we have questions
about how to act or react to a given situation. The
time, energy, and money invested in the research
project and the effects of decisions require a great
deal of confidence in the research process. The com-
plete information contained in the data does not
always see the light of day because traditional data
analysis techniques do not deal with the subtleties
and complexities inherent in a research situation.
Instinctively we know that there are problems we
should deal with when analyzing the data. However,
because we do not know how, we do the best we can
with what we have. New cutting-edge techniques
mean we can now address these problems instead of
just having nightmares about them.
Rating Scales
Rating Scales Don't Have a Uniform, Linear Structure
Rating scales are one of the most commonly used
research tools. Surveys, evaluation instruments,
and psychological tests use ratings that are treated
as if the choices were all even steps away from one
another. This is not the case.
Research shows the nonlinearity of rating scales. Many raters have a tendency to group scores around the middle of the scale values. Because some raters do not like to make extreme judgments, the end points are further from the points next to them than the other points are from each other.
Instead of the ideal in which each point on the scale is evenly spaced:

1      2      3      4      5      6

reality is considerably messier, with the categories unevenly spaced:

1            2      3    4     5               6
Terrible     Poor   Fair Good  Very good       Excellent
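As a rough, hypothetical illustration of this point (the category counts below are invented, not survey data), the spacing between adjacent categories can be approximated from how often each category is chosen: under a Rasch-type rating scale model, the log of the ratio of counts in adjacent categories gives a crude first estimate of the step between them, and those steps are rarely equal.

import math

# Invented counts of how often each rating category was chosen;
# most responses cluster in the middle and the extremes are avoided.
labels = ["Terrible", "Poor", "Fair", "Good", "Very good", "Excellent"]
counts = [12, 85, 240, 310, 180, 23]

# Crude step estimates: log of the ratio of adjacent category counts.
# (A real Rasch analysis estimates these thresholds jointly with the
# person and item measures; this is only a first approximation.)
steps = [math.log(counts[k - 1] / counts[k]) for k in range(1, len(counts))]

for k, step in enumerate(steps, start=1):
    print(f"step {labels[k - 1]} -> {labels[k]}: {step:+.2f} logits")

The printed steps differ considerably, so treating the six choices as equally spaced integers distorts any arithmetic done on the raw ratings.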
Items
All Items Are Not Created Equal
When surveying or testing for such things as atti-
tudes, opinions, or interest, the items are not all at
the same point on the scale. Some items reflect a
more intense attitude than others, or a greater
level of interest.
For example, it is easier for respondents to agree that they need a good way to keep up to date with new instrumentation, assays, and tests than to agree that Laboratory Medicine is the best journal they receive every month. Or, there is higher agreement that Laboratory Medicine is useful than that it is their favorite magazine.
It is not useful if all items measure at the same
point on the scale because that does not allow us
to examine the structure of the variable. Impor-
tant information is contained in the differences
between elements, not just in how many things are
alike. It would be like giving high school seniors a
math test that consisted only of addition. The test
does not distinguish one student from another.
Understanding the structure of the items gives
improved information for decision making.
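A small, hypothetical sketch of this idea: if one statement draws agreement from 90% of respondents and another from only 35%, the two items sit at different points on the variable. Converting each agreement proportion to log-odds (a logit) places the items along a single line; the proportions and item wordings below are invented for illustration only.

import math

# Invented agreement proportions for three survey items.
items = {
    "needs a way to keep up with new instrumentation": 0.90,
    "Laboratory Medicine is useful": 0.75,
    "Laboratory Medicine is the best journal received": 0.35,
}

# Log-odds of agreement: items that are easy to endorse get large positive
# values; items that reflect a more intense attitude get lower values.
for wording, p in items.items():
    logit = math.log(p / (1 - p))
    print(f"{logit:+.2f} logits  {wording}")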
NOTE: Raw scores are not suitable for adding and averaging.
Items Must Be Proven Valid and Reliable
Items must be examined to determine whether
they are all related to the same variable or if there
are different subscales. The items must also be
looked at for fit. That is, are the items behaving in a predictable manner? Do people who use the rating form misunderstand some items? Are the items valid and reliable? Do they fit a theoretical construct, an idea that reflects the underlying reality of the research question?
Raters
All Raters Are Not Equal; They Are Individual in the Way They Judge a Performance, Product, or Situation
Raters are a crucial element in many research situ-
ations. However, we know from communication
and psychology theory that we all live in our own
perceptual world and attend to different things.
One person may react more to how a speech is
organized than how it is delivered, and another
may react the opposite way.
No matter how hard we try to establish inter-
rater reliability, we will never achieve the assump-
tion of all raters being equal. Instead of the false
assumption of sameness, we must address the issue
of differences. In fact, the differences between
raters can provide additional information.
Different raters use different levels of severity
when judging an event. That means we cannot take
their raw scores and add them to come up with an
objective measure. One rater's "3" may be worth more than another rater's "4" because that person is consistently more critical in her judgments. Again we are lacking linearity and cannot use the raw scores for mathematical functions.
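A made-up example of the problem: two raters score the same three performances, but one is consistently about one category more severe. Summing raw scores penalizes whoever happens to draw the severe rater, while adjusting each score for the rater's estimated severity (here, simply each rater's deviation from the overall mean) restores comparability. A full many-facet Rasch analysis makes this adjustment on the logit scale rather than on raw scores.

# Ratings of the same three performances by a lenient and a severe rater.
lenient = [4, 5, 3]
severe = [3, 4, 2]  # consistently one category lower

overall_mean = (sum(lenient) + sum(severe)) / (len(lenient) + len(severe))

for name, scores in [("lenient", lenient), ("severe", severe)]:
    severity = sum(scores) / len(scores) - overall_mean
    adjusted = [s - severity for s in scores]
    print(name, "raw:", scores,
          "severity:", round(severity, 2),
          "adjusted:", [round(a, 2) for a in adjusted])

# After adjustment, both raters place the three performances at the same
# points, which the raw sums alone would have hidden.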
Raters Must Be Consistent in Their Judgments
We assume, hope, or pray that our raters are well
trained and well behaved. If raters are not consis-
tent in their judgment of an event, then we have
no real basis upon which to make comparisons
and decisions. We must be able to tell who is or is not
providing consistent evaluations. For example,
research has shown that more than 85% of histol-
ogy slide raters are effective and consistent.
Results
An Average or Percentage Is Not a Measure
When results are given in terms of raw scores with
averages or percentages, they are descriptive of
one-time events. The results are not true measures,
for as we have seen in the preceding paragraphs,
none of the components (rating scale, items, or
raters) is linear, and there are inherent problems.
Thus, they cannot be used to perform arithmetic
such as addition, subtraction, and multiplication.
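A small, self-contained illustration of why raw percentages are not linear measures (the numbers are arbitrary): equal raw-score gains correspond to very different amounts of change once the scores are expressed as log-odds.

import math

def logit(p):
    """Log-odds of a proportion: a linear scale underlying a raw percentage."""
    return math.log(p / (1 - p))

# Two pairs of percentage scores, each differing by the same 9 raw points.
pairs = [(0.50, 0.59), (0.90, 0.99)]

for low, high in pairs:
    print(f"{low:.0%} -> {high:.0%}: {logit(high) - logit(low):.2f} logits of change")

# 50% -> 59% is a small step in logits, while 90% -> 99% is a large one,
# so adding and averaging raw percentages mixes unequal units.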
One of the fundamental errors made in research is asking statistics to perform a function for which they are not equipped: to measure instead of describe. This is like using a rubber
ruler; there is no consistency or comparability
between persons, items, or groups. Statistics
describe a one-time event, after which the rubber
ruler has to be thrown away because it is of no fur-
ther use. It is not a calibrated ruler of units with
fixed intervals. There is no common frame of ref-
erence with standardized measures. Subsequent
research will be measured with another rubber
ruler that is not really the same thing, even though
the appearance is the same. This can lead to fuzzy
descriptions instead of the facts of measurement.
Direct Comparisons Require a Straight Line
Without a straight line marked in equal intervals,
direct comparisons lack precision and accuracy.
Tracking products over time, from group to group,
or in field tests can be tedious, difficult, and impre-
cise. If a calibrated ruler is used to measure instead
of the aforementioned rubber ruler, then pictures
and maps can be drawn to more easily comprehend
the results. A well-drawn picture is worth a thou-
sand numbers, for it creates perspective.
Solution
Nearly 10 years of research produced a scientific
method based on the Rasch Model. This system
for research and data analysis is objective mea-
surement. In 1953 Georg Rasch, a Danish mathe-
matician, was hired by the Danish government to
develop achievement tests to place army recruits.
He discovered a mathematical model that was
completely different from any used previously for
this type of data analysis. In 1960 Rasch came to
the University of Chicago for a year where he met
Benjamin D. Wright. Professor Wright, a psychol-
ogist who originally trained as a physicist, saw the
implications of this method. In 1963 he founded
the MESA Psychometric Laboratory at the Univer-
sity of Chicago where he and his colleagues refined
and extended the Rasch model. In the process, they
revolutionized social science research.
It is the extension of this model upon which
objective measurement is based:
log [ P_njik / P_nji(k-1) ] = B_n - C_j - D_i - F_k

where
B_n = measure of examinee n (n = 1, ..., N)
C_j = severity of rater j
D_i = calibration of item i
F_k = difficulty of rating scale step k (k = 1, ..., 6)
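The adjacent-category form above can be turned into the probability of each rating category. The short sketch below, with invented measures for one examinee, one rater, and one item, illustrates that algebra only; it is not the estimation procedure itself, which in practice is carried out by software such as FACETS.

import math

def category_probabilities(b_n, c_j, d_i, f_steps):
    """Probability of each rating category for one examinee/rater/item
    combination, derived from log(P_k / P_(k-1)) = B_n - C_j - D_i - F_k."""
    numerators = [1.0]  # lowest category: empty sum of steps
    cumulative = 0.0
    for f_k in f_steps:
        cumulative += b_n - c_j - d_i - f_k
        numerators.append(math.exp(cumulative))
    total = sum(numerators)
    return [num / total for num in numerators]

# Invented measures, in logits, for illustration only.
b_n = 1.0    # examinee measure
c_j = 0.3    # rater severity
d_i = -0.2   # item calibration
f_steps = [-2.0, -0.8, 0.2, 1.1, 2.3]  # steps into categories 2 through 6

probs = category_probabilities(b_n, c_j, d_i, f_steps)
for k, p in enumerate(probs, start=1):
    print(f"P(category {k}) = {p:.3f}")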
NOTE: Differences in raters must be accommodated in order to achieve objectivity.

NOTE: Examine items for order of difficulty as well as validity.

NOTE: A stable frame of reference must be created and maintained to make meaning out of data.
Objective measurement allows one to examine the various elements in an assessment situation. In this case, these are the examinees, the judges, and the items on the evaluation form. All of the facets are calibrated
in common units of measure within a common
frame of reference. Objective measurement analysis
performs the following functions:
1. Detects rating scale step structure
2. Provides a calibration of items
3. Produces objective measures of the respondents
4. Measures the severity of the raters
5. Discovers rater inconsistency
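One common way rater inconsistency (point 5) is flagged in Rasch work is through residual-based fit statistics. The sketch below is a simplified, hypothetical version of that idea: compare each observed rating with the rating the model expects and summarize the squared standardized residuals; values far above 1 suggest erratic ratings. It assumes the expected category probabilities have already been estimated (here they are simply made up).

def fit_mean_square(observations):
    """Unweighted (outfit-style) mean-square fit statistic.
    Each observation is (observed category, list of category probabilities)."""
    z_squared = []
    for observed, probs in observations:
        categories = range(len(probs))
        expected = sum(k * p for k, p in zip(categories, probs))
        variance = sum((k - expected) ** 2 * p for k, p in zip(categories, probs))
        z_squared.append((observed - expected) ** 2 / variance)
    return sum(z_squared) / len(z_squared)

# Hypothetical ratings by one rater, each paired with the model-expected
# probabilities of categories 0 through 4.
ratings = [
    (3, [0.05, 0.15, 0.45, 0.25, 0.10]),
    (2, [0.10, 0.30, 0.40, 0.15, 0.05]),
    (0, [0.02, 0.08, 0.30, 0.40, 0.20]),  # a surprisingly low rating
]

print(f"mean-square fit: {fit_mean_square(ratings):.2f}")
# A value near 1 indicates ratings consistent with the model;
# much larger values flag an erratic rater.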
Method in Brief
This is a brief explanation of the concepts inherent
to understanding objective measurement. This
unique approach to rater-mediated evaluations
provides the most objective means for assessment
yet discovered.
The Research Situation
A classic psychometric analysis of raw scores is pri-
marily descriptive. It gives us a simple snapshot of
the research situation. It portrays a specific group of
people using a particular set of test items at a given
time. All the elements are inextricably bound
together. Raw scores are not linear and do not have
the mathematical properties of true measurement.
Social scientists take a snapshot of the research situation. They or others replicate the snapshot and then compare snapshots. However, these snapshots are not directly comparable. Each one is unique unto itself, reflecting a particular, discrete situation. Averages, percentages, or percentiles based on raw scores are sample dependent and can only represent what is happening in that situation, with those elements, at that time. The results are not a measure that transcends from the particular to the general.
Measured Elements
When raw scores are conditioned using objective
measurement techniques, something wondrously
useful occurs. The strands in the analysis are dis-
entangled from each other and smoothed out into
straight lines. They are calibrated into common
units, providing context-free rulers that are able to
measure at any time and any place. These results are precise, reproducible measurements instead of the fuzzy, idiosyncratic descriptions that statistics provide.
Investigation is now possible in a manner that
conforms to scientific principles. Instruments are
constructed and calibrated to produce generalizable
results. Each element can be examined separately,
allowing us to delve into the data in a far deeper way
than has been possible with traditional methods. We
discover information heretofore unavailable.
This Is It in a Nutshell
Observational statistics like raw scores and ratings
describe a one-time event with all elements inter-
woven. Objective measurement gives us straight
lines, precise measures, and separated elements
that remain stable across time and samples.
Bibliography
Arnold B. Scaling techniques. In: Emmert P, Barker LL, eds. Measurement of Communication Behavior. New York, NY: Longman; 1989.
Aristotle. Rhetoric. Rhys Roberts W, trans. New York, NY: Modern Library; 1954.
Bock DG, Bock EH. Evaluating Classroom Speaking. Urbana, IL: ERIC Clearinghouse on Reading and Communication Skills, National Institute of Education; Annandale, VA: Speech Communication Association; 1981.
Guilford JP. Psychometric Methods. 2nd ed. New York, NY: McGraw-Hill; 1954.
Kerlinger FN. Foundations in Behavioral Research. 3rd ed. New York, NY: Holt, Rinehart and Winston; 1973.
Likert R. A technique for the measurement of attitudes. Arch of Psychol. 1932;140:1-55.
Linacre JM. FACETS computer program. Chicago, IL: MESA Press; 1988-1998.
Linacre JM. Many-Faceted Rasch Measurement. Chicago, IL: MESA Press; 1998.
Linacre JM, et al. Measurement with judges: many-faceted conjoint measurement. Inter J Educ Res. 1994;21:569-577.
Lunz ME, Wright BD, Linacre JM. Measuring the impact of judge severity on examination scores. Appl Meas Education. 1990;3:331-345.
McCroskey JC, et al. Attitude intensity and the neutral point on semantic differential scales. Public Opin Q. 1967;31:642-645.
Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danmarks Paedagogiske Institut; 1960/80.
Stevens SS. Measurement, statistics, and the schemapiric view. Science. 1968;161:849-856.
Surges-Tatum D. A Measurement System for Speech Evaluation [dissertation]. Chicago, IL: University of Chicago; 1991.
Surges-Tatum D. Controlling for judge differences in the measurement of public speaking. Paper presented at: American Educational Research Association; April 1992; Atlanta, GA.
Surges-Tatum D. Using FACETS to Investigate Speech Competency: Proceedings SCA Summer Assessment Conference, July 1994.
Surges-Tatum D. An operational definition of speaking ability. In: Linacre JM, ed. Rasch Measurement Transactions, Part 1. Chicago, IL: MESA Press; 1995.
Surges-Tatum D. Psychometric issues and fairness in K-12 assessment of oral communication. Paper presented at: Speech Communication Association; November 1995; San Antonio, TX.
Wright BD, Masters GN. Rating Scale Analysis. Chicago, IL: MESA Press; 1982.
Wright BD, Stone MH. Best Test Design. Chicago, IL: MESA Press; 1979.