Using Rasch Analysis To Inform Rating Scale Development
DOI 10.1007/s11162-017-9448-0
SHORT PAPER/NOTE
Abstract The use of surveys, questionnaires, and rating scales to measure important
outcomes in higher education is pervasive, but reliability and validity information is often
based on problematic Classical Test Theory approaches. Rasch Analysis, based on Item
Response Theory, provides a better alternative for examining the psychometric quality of
rating scales and informing scale improvements. This paper outlines a six-step process for
using Rasch Analysis to review the psychometric properties of a rating scale. The Partial
Credit Model and Andrich Rating Scale Model will be described in terms of the
psychometric information (i.e., reliability, validity, and item difficulty) and diagnostic
indices generated. Further, this approach will be illustrated through the example of
authentic data from a university-wide student evaluation of teaching.
Keywords Higher education · Rasch Analysis · Likert-type scale · Partial Credit Model · Rating Scale Model · Scale development
Introduction
Surveys and rating scales dominate measurement in higher education outside of the
classroom. Borden and Kernel (2012) have compiled an inventory of measures used to
assess quality in higher education, and of the 251 instruments identified, most are surveys
and rating scales. Further, most campuses use many locally designed measures of student
satisfaction, teaching effectiveness, campus climate, and more. To further
illustrate this prevalence, a search of the ERIC database using the keywords measurement, evaluation, or assessment; higher education, college, or university; and survey or questionnaire returns 31,710 items. However, when reliability and validity is added as a keyword filter, only 732 (2.3%) remain. When rating scale is substituted for survey or questionnaire, the results are less drastic: 290 of 2,545 items (11.4%) include a direct reference to reliability and validity as a keyword.
When the psychometric quality of these surveys and ratings scales is explored, the most
common approach is to use principles of Classical Test Theory (CTT), which suffers from
several limitations, including the fact that derived scores are sample dependent and biased
toward central scores (Bradley et al. 2015). Further, missing data presents a problem for
computing overall scores. Measure reliability is often presented as Cronbach's alpha, and
evidence of validity is based on the content of the items and correlations of scale scores
with other measures, which may or may not be reliable and valid themselves. Finally, it is
very difficult to examine the operation of individual items to determine effectiveness of
these items for the target population and their contribution to measurement of the overall
latent construct. Determining the functioning of rating scale response options is very
difficult and usually conducted by a superficial examination of mean scores on alternate
versions of the scale. Further complicating measurement issues with surveys, question-
naires, and rating scales is the fact that these are indirect measures in which respondents
self-report on their perceptions and are subject to many kinds of response bias (Bradley
et al. 2015; Zlatkin-Troitschanskaia et al. 2015).
Rasch Analysis, based on Item Response Theory (IRT; Embretson and Reise 2000),
provides a very effective alternative for exploring the psychometric properties of measures
and accounting for response bias (Bradley et al. 2015). The original Rasch model was
developed for use with dichotomously scored items (i.e., those that are marked as either
correct or incorrect), and is based on the early work of Thurstone and Guttman (Osterlind
2009). Unlike in CTT, where the standard error of measurement is assumed to be equiv-
alent across all test takers and is sample dependent, in IRT, measurement error is assumed
to vary across individuals and does not depend on a particular sample of respondents.
Estimates of the latent trait being measured are based on both person and item charac-
teristics, and both person ability and item difficulty are measured on the same scale (logits).
Thus, we can use analyses based on IRT to determine if item difficulties are appropriate to
person ability levels on the latent trait. By more appropriately matching item difficulties to
person abilities, IRT allows us to develop measures with greater score reliability using
fewer test items.
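For reference, in standard notation (e.g., Embretson and Reise 2000) the dichotomous Rasch model expresses the probability that person n succeeds on item i as a function of the difference between the person's ability θ_n and the item's difficulty δ_i, both in logits:

P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

When ability equals difficulty, the probability of success is 0.50, and each additional logit of separation multiplies the odds of success by a factor of e.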
The Andrich Rating Scale Model (RSM; Andrich 1978) is a variation of the traditional
Rasch model used for polytomous data (e.g., Likert-type items). As with all Rasch models,
information is provided about item difficulty, person ability, and reliability. In the case of a
non-achievement measure, difficulty refers to how much of the latent trait the individual
must possess before they positively endorse an item. Reliability information is provided for
both item measurement and person measurement in the form of separation indices and
reliability indices. In addition, detailed information about the contribution of individual
rating scale options to the measurement of the latent construct is provided. The indices of
interest include category frequencies and average measures, infit and outfit mean squares,
and threshold calibrations. By using the RSM, we can determine the sufficiency of the
rating scale options for the level of measurement precision required, both in terms of
number of options and their labels.
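In the notation commonly used for the RSM, the probability that person n selects category k (k = 0, ..., m) of item i is

P(X_{ni} = k) = \frac{\exp\left[\sum_{j=0}^{k} (\theta_n - \delta_i - \tau_j)\right]}{\sum_{h=0}^{m} \exp\left[\sum_{j=0}^{h} (\theta_n - \delta_i - \tau_j)\right]}, \qquad \tau_0 \equiv 0,

where δ_i is the overall item difficulty and the thresholds τ_1, ..., τ_m, shared by all items, are the quantities reported as threshold calibrations in the category diagnostics discussed below.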
The Partial Credit Model (PCM; Wright and Masters 1982) allows for the compilation
of items on different scales into an overall latent score using linking items that are on a
common scale. Through the use of linking items and the PCM, it is possible to ensure that
several different versions of a rating scale are measuring a latent trait in an equivalent
fashion (Bond and Fox 2012). This paper will illustrate how the PCM can be used in
conjunction with the RSM to compare several rating scales to each other to select the most
appropriate version. Six steps (see Fig. 1) will be described for the review of the psy-
chometric quality of rating scales according to objective criteria. However, this same
review can be performed on a single rating scale without the steps involving the PCM; this
parallel process will be highlighted throughout the discussion below. Additional infor-
mation about the underlying theory and calculations for Rasch Analysis can be found in the
Embretson and Reise (2000) volume. Practical applications of Rasch Analysis using
Winsteps software are available from Bond and Fox (2012).
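The PCM has the same general form as the RSM, except that each item i carries its own step difficulties δ_{ij} and its own number of categories m_i:

P(X_{ni} = k) = \frac{\exp\left[\sum_{j=0}^{k} (\theta_n - \delta_{ij})\right]}{\sum_{h=0}^{m_i} \exp\left[\sum_{j=0}^{h} (\theta_n - \delta_{ij})\right]}, \qquad \delta_{i0} \equiv 0.

Because the category structure is estimated item by item, items rated on different versions of a scale can be calibrated together, and the common linking items anchor all respondents to a single logit metric.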
For most users of Rasch Analysis, the question or issue that brings them to Rasch involves
the quality of an established rating scale. The purpose of the analysis will be to establish
reliability and to ensure that the scale is indeed measuring the construct with precision. However, a
very valuable use of the Andrich Rating Scale Model is the information it provides about
the functioning of the options in a Likert-type scale. Through an examination of category
diagnostic indices, a great deal of information can be gleaned about the functioning of the
scale options in providing adequate measurement (Bond and Fox 2012).
Fig. 1 Steps in using Rasch Analysis to review psychometric properties of rating scales
In the illustrative example presented here, Rasch Analysis is used to identify the most
appropriate number of rating scale options for a student evaluation of teaching instrument
(shown in Table 1). However, this same approach can be used for any measure that uses a
Likert-type rating scale. The items in the example measure are based on the educational
quality factors identified by Marsh (1987) in his work with the Student Evaluation of
Educational Quality (SEEQ) instrument. The 11 items are divided into two measures:
Course Effectiveness (items about the course itself) and Teaching Effectiveness (items
specific to the individual instructor, repeated for each instructor in a multiple instructor
course). For both measures, the neutral option was excluded as a strategy to encourage
students to express either a positive or negative opinion, rather than remaining neutral. The
4-point scale included the following options: 1 = strongly disagree, 2 = disagree,
3 = agree, 4 = strongly agree. Further, all 11 items were required, with no mechanism for
students to opt out if they truly had no basis for forming an opinion.
This approach to scale development achieved its intended purpose of maximizing
collected data from those students who completed the evaluation. Every student who reached the end of the form provided a complete set of responses because there was no way to skip questions. In addition, as shown in Table 1, the Cronbach's alpha reliabilities for Course and Teaching Effec-
tiveness were very high (0.95 and 0.96, respectively), indicating that the selected items in
each measure are very highly correlated with one another. However, many instructors and
students alike expressed concerns about the fact that students were forced to respond to all
items, even when they could not form an opinion, and many students expressed feeling
pressured to complete their evaluations. From a measurement perspective, the extent of
bias in responses was unknown. How many students were simply selecting any response in order to proceed through and complete the evaluation, and were students tending to mark on the
positive side, the negative side, or both?
The psychometric question in this example concerns the issue of psychological comfort
versus meaningful responses. Can the scale options be adjusted to provide an option for
those who truly are neutral and to allow an opt out while still maintaining reliable measurement?
To ensure enough responses for each scale option for each item, a sufficiently large sample
is required. Linacre (2014b) has prepared guidelines for appropriate sample size and
suggests a minimum of 10 respondents for each scale point to achieve adequate statistical
power. In the present study, at the very minimum, 60 respondents are required for each of
the four conditions, or 240 total respondents. A sample of this size allows for item cali-
bration precisions within ±½ logit (α < 0.05). As the decisions based on the mea-
surement results become more serious, the desired measurement precision will be greater.
However, the greatest number of respondents indicated by Linacre, even at the most high-
stakes levels of decision making, is 500.
With regard to the course evaluation example, oversampling was needed due to tradi-
tionally poor response rates. At this institution, response rates per class range from 30 to
40%. To achieve the needed sample size in light of the low response rate, large under-
graduate sections of seated courses (150 or more students enrolled) were selected for the
recruitment pool. This pool was further narrowed to include only 36 sections with a single
instructor. Ten of these instructors consented to participate (27.8%), and two of them
volunteered additional course sections, resulting in a final sample of 1271 completed
course evaluations. The total student response rate across all four conditions was 43.4%,
and sufficient responses were collected to meet the sample size requirements recommended
by Linacre (2014b).
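The planning arithmetic is simple enough to script. The sketch below is illustrative only, assuming Linacre's 10-responses-per-category guideline, the largest (6-point) scale version, and a hypothetical 35% response rate within the 30-40% range reported above:

import math

def planned_sample(n_scale_points, n_conditions, per_category=10, response_rate=0.35):
    """Minimum respondents needed and the enrollment required to reach them."""
    per_condition = per_category * n_scale_points      # e.g., 10 x 6 = 60
    total_needed = per_condition * n_conditions        # e.g., 60 x 4 = 240
    to_recruit = math.ceil(total_needed / response_rate)  # inflate for non-response
    return per_condition, total_needed, to_recruit

print(planned_sample(n_scale_points=6, n_conditions=4))   # -> (60, 240, 686)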
Within the participating course sections, students were randomly assigned to one of four conditions based on the version of the rating scale they would see on the evaluation form:

Condition 1: response codes 1-4 (the original 4-point agreement scale)
Condition 2: response codes 1-5 (4-point scale plus a neutral midpoint)
Condition 3: response codes 1-5 (4-point scale plus a don't know/not applicable opt out)
Condition 4: response codes 1-6 (4-point scale plus both the midpoint and the opt out)

In addition to the 11 common course evaluation items, all study participants received five additional linking items that used the Condition 4 scale (midpoint and opt out), selected from the University Course Evaluation Item Bank (Purdue
University Center for Instructional Excellence 2014):
1. Relationships among course topics are clearly explained.
2. My instructor makes good use of examples and illustrations.
3. My instructor indicates relationship of course content to recent developments.
4. My instructor effectively blends facts with theory.
5. Difficult concepts are explained in a helpful way.
These items are used to link all versions of the scales together so that overall ratings of
course and teacher effectiveness can be estimated on the same scale using the PCM and
compared, regardless of condition (Step 3; Linacre 2014a). This additional step is not
required for projects where only the psychometric properties of a single scale are
examined.
In sum, students in each identified class section randomly received one of four varia-
tions of the course evaluation rating scale, but all received the five common linking items
rated on the 6-point scale. All student responses were completely anonymous and
instructor identifiers were stripped from the data before data analysis began.
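A sketch of the assignment logic is shown below; the roster, seed, and condition labels are hypothetical and stand in for the enrollment data actually used:

import random

# Hypothetical condition labels matching the four scale versions described above.
CONDITIONS = {1: "4-point", 2: "5-point + midpoint",
              3: "5-point + opt out", 4: "6-point + midpoint and opt out"}

def assign_conditions(student_ids, seed=1):
    """Deal an anonymized roster across the four conditions in near-equal groups."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    return {sid: (position % 4) + 1 for position, sid in enumerate(shuffled)}

roster = [f"anon_{i:04d}" for i in range(180)]   # anonymized IDs only
assignments = assign_conditions(roster)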
In Step 3, latent course and instructional effectiveness scores for each respondent were
estimated using the Rasch PCM (Linacre 2014a), a step that is not required when one is
examining the psychometric properties of a single version of a rating scale. In the course
evaluation example, five additional linking items used the rating scale with all possible
options, allowing Winsteps to calibrate all responses regardless of rating scale condition
and to estimate measures of course effectiveness for all respondents across all conditions.
Each respondent's estimated latent Course Effectiveness score was then saved to a data file
that could be exported to SPSS.
A full factorial analysis of variance (ANOVA) model was estimated, using Course
Effectiveness as the dependent variable and course section as a control factor to allow for
the fact that different kinds of courses and different instructors will likely have different
course ratings. Results are shown in Table 3. Controlling for section effects, the main
effect of rating scale condition was not statistically significant (F(3, 1242) = 0.69), indicating
that the format of the scale does not impact the measure of course effectiveness. Based on
this finding, we can proceed to Step 4. Note that this finding is not consistent with the
findings of Bradley et al. (2015) in their examination of academic readiness of college
freshmen, suggesting that the neutral mid-point option may function differently in different
measurement contexts and with different latent constructs. This fact further illustrates why
Rasch Analysis is such an important tool for teasing out the impact of response options on
measurement of the latent construct.
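For readers who export the Winsteps person measures to a statistical package, the Step 3 comparison can be scripted along the following lines. This is a hedged sketch, not the original analysis code: the file name and column names (measure, condition, section) are assumptions about the exported data, and the interaction term is omitted for simplicity.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical export of latent Course Effectiveness measures from Winsteps.
df = pd.read_csv("course_effectiveness_measures.csv")

# Condition is the factor of interest; course section is the control factor.
model = smf.ols("measure ~ C(condition) + C(section)", data=df).fit()
print(anova_lm(model, typ=2))
# A non-significant condition effect supports proceeding to Step 4.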
In Step 4, Winsteps (Linacre 2014a) is used to run the Andrich RSM (Andrich 1978; Bond
and Fox 2012) and generate rating scale diagnostics and reliability indices. This step is
relevant for any examination of psychometric quality. When the focus of the examination is
on a single rating scale, this analysis will be run just one time. In the present example, it
was run four times, once for each condition. Options that allow respondents to opt out, such as don't know or not applicable, are coded as missing values in these analyses.
Procedures and resulting fit indices outlined by Bond and Fox (2012) are used to analyze
the measurement precision of the rating scale. These include item separation and reliability
and person separation and reliability. Category diagnostics are examined to determine the
appropriateness of the number of response options for each scale, including category
frequencies and average measures, infit and outfit mean squares, and threshold calibrations.
Probability curves, showing the likelihood of responses for each response option, are
generated to provide a visual analysis of the appropriateness of each option.
Item separation and reliability estimates indicate the degree to which the item estimates
are expected to remain stable in a new sample. In general, an item separation index greater
than 3.0 coupled with reliability greater than 0.90 is an indication that the hierarchical
structure of items according to level of latent trait will be stable in a new sample (Bond and
Fox 2012). The criteria for stability of item difficulty are most likely to be achieved with
large sample sizes and items that have a wide range of levels of the latent trait (Linacre
2014a). Person reliability indices reflect the degree to which people in new samples can be
classified along the latent trait being measured, and stability of classification is found when
the person separation index is greater than 2.0 and the reliability estimate is greater than
0.80 (Linacre 2014a). The person-level estimates indicate the level of generalizability of
the measurement to new samples.
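These rule-of-thumb criteria are easy to encode for routine screening. The helper below simply restates the cutoffs quoted above; the function name and inputs are illustrative assumptions rather than part of any Rasch software.

def stability_check(item_sep, item_rel, person_sep, person_rel):
    """Apply the separation/reliability cutoffs from Bond and Fox (2012) and Linacre (2014a)."""
    return {
        "item_hierarchy_stable": item_sep > 3.0 and item_rel > 0.90,
        "person_classification_stable": person_sep > 2.0 and person_rel > 0.80,
    }

# Hypothetical values echoing the pattern reported in Table 4:
print(stability_check(item_sep=2.1, item_rel=0.81, person_sep=2.4, person_rel=0.85))
# -> {'item_hierarchy_stable': False, 'person_classification_stable': True}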
The item separation and reliability estimates and person separation and reliability
estimates for the Course Effectiveness conditions are shown in Table 4. In the present
example, the item reliability indices do not achieve acceptable levels for stability of item
difficulty across samples. The separation and reliability estimates are extremely consistent
across Conditions 1, 3, and 4, with Condition 2 having the lowest item separation and
reliability estimates. This lack of item stability is likely due to the fact that all of the items
on this measure are very closely clustered together in terms of difficulty.
Three of the four versions of the Course Effectiveness measure have adequate person
reliability (i.e., person separation greater than 2.0 and reliability greater than 0.80):
Conditions 1, 2, and 4. These results suggest that the scales used in these three conditions
are able to sufficiently measure respondents' endorsement of Course Effectiveness items.
Table 5 displays the category diagnostic indices for each category within each scale
condition. Criteria for each of these indices against which results can be compared have
been established (Bond and Fox 2012). In terms of category frequencies, each category
should have at least 10 responses, and average measures should increase monotonically
from the lowest rating point to the highest rating point. Infit and outfit mean squares should
be less than 2.0; values higher than this suggest that the category is not contributing to the
measurement of the latent trait and, in fact, may be working to diminish precision. Finally,
with regard to thresholds, each threshold, or step up the scale (for example, from the rating
of strongly disagree to disagree), should be at least 1.4 logits greater than the last to show
appropriate distinction between categories. However, intervals of more than 5 logits
indicate that there is a gap in the measurement of the trait.
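The category diagnostic criteria can likewise be expressed as a small screening function. The sketch below is an assumption about how one might tabulate Winsteps output by hand, not a Winsteps feature, and all example values are hypothetical.

def category_diagnostics(freqs, avg_measures, infit, outfit, thresholds):
    """Check category frequencies, monotonicity, fit, and threshold advances."""
    advances = [b - a for a, b in zip(thresholds, thresholds[1:])]
    return {
        "min_frequency_ok": min(freqs) >= 10,
        "measures_monotonic": all(b > a for a, b in zip(avg_measures, avg_measures[1:])),
        "fit_ok": max(infit + outfit) < 2.0,
        "threshold_advances": advances,
        "advances_ok": all(1.4 <= d <= 5.0 for d in advances),
    }

# Hypothetical 4-category example with an overly wide top gap:
print(category_diagnostics(
    freqs=[35, 120, 610, 520],
    avg_measures=[-2.1, -0.3, 1.8, 5.9],
    infit=[1.1, 0.9, 0.8, 1.0],
    outfit=[1.3, 0.9, 0.8, 1.1],
    thresholds=[-3.2, -0.2, 5.9],
))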
As shown, each version of the scale meets the criteria for category frequency and
monotonicity of average measures. The lowest category frequencies are for the don't know/not applicable option in Conditions 3 and 4, but each of these still exceeds the
minimum criterion of 10. Further, all of the infit and outfit mean squares are less than 2.0,
suggesting that the amount of error is in an acceptable range. Thus, the thresholds appear to
be the index of most value for determining the appropriateness of each of the four scales.
For Condition 1, the threshold distances between points 1 and 2 and between points 2 and 3 fall within the appropriate range of widths (2.99 and 1.43 logits, respectively), but the distance between points 3 and 4 (6.11 logits) suggests that another option would be appropriate between these two. In Condi-
tions 2 and 4, the threshold distance between options 3 and 4 is too small, but inclusion of a
midpoint results in more optimal spacing between the top two options. For Condition 3, the
threshold distances are most appropriately spaced, but the distance between points 3 and 4 (5.14 logits)
is a little too wide. Probability curves showing category frequencies and thresholds are
shown in Fig. 2. These curves illustrate the data shown in Table 5: for every version of the
scale, respondents are most likely to be grouped in the top two categories. In the versions
used in Conditions 2 and 4, the neutral midpoint appears to have a minimal role, but
threshold distances between the last two options are at the most appropriate width when the
midpoint is included.
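Probability curves such as those in Fig. 2 are produced by Winsteps, but they can also be reconstructed from a set of Rasch-Andrich thresholds. The sketch below uses hypothetical threshold values resembling a 4-point condition and is offered only as an illustration of how the curves relate to the threshold estimates.

import numpy as np
import matplotlib.pyplot as plt

def rsm_category_probs(theta, delta, taus):
    """Rating Scale Model category probabilities across a range of person measures."""
    taus = np.concatenate(([0.0], np.asarray(taus, dtype=float)))  # tau_0 defined as 0
    cumulative = np.cumsum(theta[:, None] - delta - taus[None, :], axis=1)
    exp_cum = np.exp(cumulative)
    return exp_cum / exp_cum.sum(axis=1, keepdims=True)

theta = np.linspace(-8, 8, 400)                                       # person measures (logits)
probs = rsm_category_probs(theta, delta=0.0, taus=[-3.0, -0.5, 4.5])  # hypothetical thresholds

for k in range(probs.shape[1]):
    plt.plot(theta, probs[:, k], label=f"category {k + 1}")
plt.xlabel("Person measure (logits)")
plt.ylabel("Category probability")
plt.legend()
plt.show()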
There is no clear-cut winner among the scale conditions. The probability curves
suggest that students are either on the very low side in terms of ratings or on the very high
side with not much distinction in between. For this reason, an examination of the func-
tioning of individual items is necessary.
The final step in the psychometric analysis is an examination of the individual items.
Through an examination of the item separation index and the probability curves, we can
begin to get a sense of the range of difficulty levels of items included in the measure. In an
appropriately designed measure, respondents at all levels of the latent trait will be matched
to items that assess their level of that trait, and we should see a full range of item
difficulties. Item separation indices that do not meet the criterion of 3.0 suggest that the
difficulty levels of the items may be mismatched with respondents. Additional evidence of
inappropriate item difficulties may be seen in the probability curves. Taking the Course
Effectiveness results as an example, none of the versions of the scale met the item sepa-
ration criterion of 3.0. In Fig. 2, regardless of the version of the scale, respondents from all
levels of perceptions of course effectiveness, from very low levels of perception that the
course is effective to very high levels, selected the agree option, which was the most
common option in each of the scales.
Wright maps illustrate how the difficulty of items, measured in logits, is matched to the
overall level of the latent trait in each respondent, also measured in logits (Bond and Fox
2012). The Wright map for the maintained scale version of Course Effectiveness is shown in Fig. 3. The majority of students are clustered at +6.0 or +7.0 logits, at the highest levels of
perception of course effectiveness. The item difficulties, however, never exceed 1.0 logits,
and all of the items are grouped together at the same levels of difficulty. With the existing
measure, it takes very low levels of perceived course effectiveness to rate courses highly
(positive bias; Darby 2008). These findings suggest that the items need to be reviewed for
potential rewording to better capture a wider range of perceptions of course effectiveness
or that additional items from a wider range of difficulties should be added.
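A rough sense of this mismatch can be conveyed even without the Winsteps graphic by binning person measures and item difficulties on the same logit scale. The sketch below is purely illustrative: the simulated values merely echo the pattern described above (respondents clustered near +6 logits, items bunched near 0).

import numpy as np

def text_wright_map(person_measures, item_difficulties, lo=-4.0, hi=8.0, step=1.0):
    """Print a crude text Wright map: persons (#) and items (x) on shared logit bins."""
    edges = np.arange(lo, hi + step, step)
    person_counts, _ = np.histogram(person_measures, bins=edges)
    item_counts, _ = np.histogram(item_difficulties, bins=edges)
    print(f"{'logits':>7} | persons          | items")
    for left, p, i in zip(edges[:-1][::-1], person_counts[::-1], item_counts[::-1]):
        print(f"{left:>7.1f} | {'#' * int(p):<16} | {'x' * int(i)}")

rng = np.random.default_rng(0)
persons = rng.normal(6.0, 1.5, size=60)   # simulated respondents clustered near +6 logits
items = rng.normal(0.3, 0.4, size=11)     # simulated items bunched near 0 logits
text_wright_map(persons, items)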
Based on results of the analyses, the next step is to determine if the measure is performing
optimally or if revisions are needed in either the wording of the items or the scale itself. When
measures fail to meet the criteria for Item and Person Separation, it is likely due to the fact that
items are not sufficiently different in terms of difficulty levels and that they do not match all
the ability levels represented among the respondents. The Wright Map can provide confir-
mation of both of these conclusions. Items that are too similar in terms of difficulty level and/
or do not correspond to the ability levels of the respondents should be reviewed and reworded
or new items with more appropriate levels of difficulty should be added.
With regard to the diagnostic indices for the scale itself, when the infit and outfit mean
squares are too large and/or when the threshold distances are not appropriately spaced, the
scale options need to be reviewed and the probability curves examined to see where there are
inappropriate gaps in measurement. When the threshold distances are too small, options can
be removed, and when too large, additional options can be added. Item difficulty results will
indicate if additional items of higher or lower difficulty levels should be included.
However, it is important to keep in mind that the measurement context and purposes for
the measurement must be taken into account when making a final decision about revision.
With the present example, there was no clear-cut best choice of scale version for either
measure. None of the options met the criteria for item separation and reliability or for
threshold distances. The context had to guide the decision. The Faculty Senate was open to
the idea of adding a neutral midpoint to increase students' psychological comfort with the
scale and, in turn, hopefully boost response rates. However, for Course Effectiveness,
the Senate was not open to adding an opt out, since all items were written to be relevant for all
course types. Items from a wider range of difficulty levels will be phased in over time to
improve the psychometric quality. However, in the short term, the overall university
response rate improved from 42.0% with the four-point scale to 48.0% with the use of the
revised scales, and student complaints about the scale itself are virtually non-existent.
Conclusions
This paper and the course evaluation example illustrate how Rasch Analysis can be used to
empirically review the psychometric properties and quality of rating scales. Through the
use of a systematic process to collect and analyze the data and compare results against
specific, predetermined criteria, we can draw conclusions about the quality of our rating
scales and our items. The Andrich Rating Scale Model provides diagnostic indicators for
each response option that indicate if each is working optimally to precisely measure the
construct. The PCM is helpful to compare different versions of scales to determine which is
providing the most precise and most reliable measurement of the construct. We can use
information from these two forms of analyses iteratively to review, revise, and refine our
measures until they achieve the level of measurement precision needed for our decision-
making purposes.
It is important to keep in mind, however, that the measurement context and the purposes
and needs for the measurement must also be brought to bear on the decision to revise a
measure. In some cases, for the sake of consistency over time, it is important to make small
revisions to a measure rather than to revise the measure in its entirety. In the end, the
empirical results of Rasch analysis need to be combined with professional judgment and
measurement context to determine the best course to take.
References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573.
Bond, T. G., & Fox, C. M. (2012). Applying the Rasch Model: Fundamental measurement in the human
sciences (2nd ed.). New York: Routledge.
Bradley, K., Peabody, M., Akers, K., & Knutson, N. (2015). Rating scales in survey research: Using the
Rasch model to illustrate the middle category measurement flaw. Survey Practice, 8(2). Retrieved from
http://www.surveypractice.org/index.php/SurveyPractice/article/view/266.
Darby, J. A. (2008). Course evaluations: A tendency to respond favourably on scales? Quality Assurance
in Education, 16, 7–18.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence
Erlbaum Associates.
Linacre, J. M. (2014a). A user's guide to Winsteps Ministep Rasch-model computer programs. Chicago, IL:
Author.
Linacre, J. M. (2014b). Sample size and item calibration (or person measure) stability. Institute for Objective
Measurement. Retrieved October 7, 2014, from http://www.rasch.org/rmt/rmt74m.htm.
Marsh, H. W. (1987). Students' evaluations of university teaching: Research findings, methodological
issues, and directions for future research. International Journal of Educational Research, 11(3),
253–388.
Osterlind, S. J. (2009). Modern measurement: Theory, principles, and applications of mental appraisal.
Upper Saddle River, NJ: Pearson.
Purdue University Center for Instructional Excellence. (2014). PICES item catalog. Retrieved August 26,
2014, from http://www.purdue.edu/cie/data/pices.html.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.
Zlatkin-Troitschanskaia, O., Shavelson, R. J., & Kuhn, C. (2015). The international state of research on
measurement of competency in higher education. Studies in Higher Education, 40(3), 393–411.