Measurement: Scaling, Reliability and Validity


MEASUREMENT: SCALING, RELIABILITY, VALIDITY

GROUP MEMBERS:
AHMED JAN DAHRI (15-MBA-03)
ASGHAR ALI ARAIN (15-MBA-09)

MEASUREMENT: SCALING, RELIABILITY, VALIDITY

1. Rating Scales
Dichotomous scale
Category scale
Likert scale
Semantic differential scale
Numerical scale
Itemized rating scale
Fixed or constant sum rating scale
Stapel scale
Graphic rating scale
Consensus scale

1.1 Dichotomous Scale


The dichotomous scale is used to elicit a Yes or No answer, as in the example
below. Note that a nominal scale is used to elicit the response.
Example 9.1
Do you own a car?

Yes

No

1.2 Category Scale


The category scale uses multiple items to elicit a single response, as in the
following example. This also uses a nominal scale.
Example 9.2
Where in northern California do you reside? ____North Bay
____South Bay
____East Bay
____Peninsula
____Other

1.3 Likert Scale


The Likert scale is designed to examine how strongly subjects agree
or disagree with statements on a 5-point scale with the following anchors:

1 = Strongly Disagree
2 = Disagree
3 = Neither Agree Nor Disagree
4 = Agree
5 = Strongly Agree

The responses over a number of items tapping a particular concept or
variable (as per the following example) are then summated for every
respondent. This is an interval scale, and the differences in the responses
between any two points on the scale remain the same.
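As a small illustration, the following is a minimal sketch in Python, with
made-up responses and hypothetical items, of how the summation works:

import numpy as np

# Rows = respondents, columns = four hypothetical Likert items scored
# 1 (strongly disagree) to 5 (strongly agree).
responses = np.array([
    [4, 5, 3, 4],
    [2, 1, 2, 3],
    [5, 5, 4, 5],
])

# Summated score for each respondent across all items.
totals = responses.sum(axis=1)
print(totals)  # [16  8 19]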

1.4 Semantic Differential Scale


Several bipolar attributes are identified at the extremes of the scale, and
respondents are asked to indicate their attitudes, on what may be called a
semantic space, toward a particular individual, object, or event on each of the
attributes. The bipolar adjectives used, for instance, would employ such terms
as Good–Bad, Strong–Weak, Hot–Cold. The responses can be plotted to
obtain a good idea of the respondents' perceptions. This is treated as an
interval scale.

Example 9.4

1.5 Numerical Scale


The numerical scale is similar to the semantic differential scale, with the
difference that numbers on a 5-point or 7-point scale are provided, with bipolar
adjectives at both ends, as illustrated below. This is also an interval scale.
Example 9.5

1.6 Itemized Rating Scale


A 5-point or 7-point scale with anchors, as needed, is provided for each item and
the respondent states the appropriate number on the side of each item, or circles
the relevant number against each item. The responses to the items are then
summated. This uses an interval scale.
Example 9.6 (I)
Respond to each item using the scale below, and indicate your response number
on the line by each item.

Example 9.6 (ii)


Circle the number that is closest to how you feel for the item below.

1.7 Fixed or Constant Sum Scale


The respondents are here asked to distribute a given number of points
across various items as per the example below. This is more in the nature of
an ordinal scale.
Example 9.7
In choosing a toilet soap, indicate the importance you attach to each of the
following five aspects by allotting points for each to total 100 in all.

1.8 Stapel Scale


This scale simultaneously measures both the direction and intensity
of the attitude toward the items under study. The characteristic of interest
to the study is placed at the center and a numerical scale ranging, say, from
+3 to -3, on either side of the item as illustrated below. This gives an idea of
how close or distant the individual response to the stimulus is, as shown in
the example below. Since this does not have an absolute zero point, this is
an interval scale.

Example 9.8

1.9 Graphic Rating Scale


A graphical representation helps the respondents to indicate on this
scale their answers to a particular question by placing a mark at the
appropriate point on the line, as in the following example. This is an
ordinal scale, though its continuous line might make it look like an interval
scale.

Example 9.9

1.10 Consensus Scale


Scales are also developed by consensus, where a panel of judges selects
certain items, which in its view measure the relevant concept. The items are
chosen particularly based on their pertinence or relevance to the concept. One
such consensus scale is the Thurstone Equal Appearing Interval Scale,
where a concept is measured by a complex process followed by a panel of
judges. Using a pile of cards containing several descriptions of the concept, a
panel of judges indicates how close each statement is to the concept under
study. The scale is then developed on the basis of the consensus reached.

1.11 Other Scales


There are also some advanced scaling methods such as multidimensional
scaling, where objects, people, or both, are visually scaled, and a conjoint
analysis is performed. This provides a visual image of the relationships in
space among the dimensions of a construct.

2. Ranking Scales
Ranking scales are used to tap preferences between two objects or among more
than two objects or items (ordinal in nature).
2.1 Paired Comparison
The paired comparison scale is used when, among a small number of
objects, respondents are asked to choose between two objects at a time. This
helps to assess preferences. If, for instance, respondents consistently show a
preference for product one over products two, three, and four during the paired
comparisons, the manager reliably understands which product line demands the
most attention. However, as the number of objects to be compared increases, so
does the number of paired comparisons: n objects require n(n-1)/2 of them, as
the sketch below illustrates. Hence paired comparison is a good method when
the number of stimuli presented is small.
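A minimal sketch in Python of how quickly the comparisons multiply (the
object counts are illustrative):

from math import comb

# n objects require n(n-1)/2 paired comparisons.
for n in (4, 6, 10):
    print(n, "objects ->", comb(n, 2), "paired comparisons")
# 4 objects -> 6, 6 objects -> 15, 10 objects -> 45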

2.2 Forced Choice


The forced choice enables respondents to rank objects relative to one
another, among the alternatives provided. This is easier for the respondents,
particularly if the number of choices to be ranked is limited.

Example 9.10
Rank the following magazines that you would like to subscribe to in the
order of preference, assigning 1 to the most preferred choice and 5 to the
least preferred.

2.3 Comparative Scale


The comparative scale provides a benchmark or a point of reference to
assess attitudes towards the current object, event, or situation under study.
An example of the use of comparative scale follows.

Example 9.11

Rating scales are used to measure most behavioral concepts. Ranking scales
are used to make comparisons or rank the variables that have been tapped on
a nominal scale.

3. Goodness of Measures
It is important to make sure that the instrument we develop to measure a
particular concept is indeed accurately measuring the variable, and that,
in fact, we are actually measuring the concept we set out to measure. This
ensures that in operationally defining perceptual and attitudinal variables,
we have not overlooked some important dimensions and elements or included
some irrelevant ones.

3.1. Item Analysis


Item analysis is done to see if the items in the instrument belong there or not.
Each item is examined for its ability to discriminate between those subjects
whose total scores are high and those with low scores. In item analysis, the
means of the high-score group and the low-score group are compared on each
item to detect significant differences through the t-values. The items with a
high t-value (that is, those that discriminate sharply between the two groups)
are then included in the instrument.
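A minimal sketch of the t-test step, assuming made-up item scores; the scipy
call is standard, the data are hypothetical:

from scipy import stats

# Scores on one item from respondents whose TOTAL scores were high vs. low.
item_high = [5, 4, 5, 4, 5]
item_low = [2, 1, 2, 3, 2]

# A large, significant t-value marks a discriminating item worth keeping.
t, p = stats.ttest_ind(item_high, item_low)
print(round(t, 2), round(p, 4))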

4. Reliability
The reliability of a measure indicates the extent to which it is without bias
(error free) and hence ensures consistent measurement across time and across
the various items in the instrument. In other words, the reliability of a
measure is an indication of the stability and consistency with which the
instrument measures the concept and helps to assess the goodness of a
measure.

4.1 Stability of Measures


The ability of a measure to remain the same over time, despite uncontrollable
testing conditions or the state of the respondents themselves, is indicative of
its stability and low vulnerability to changes in the situation. This attests to its
goodness because the concept is stably measured, no matter when it is done.
Two tests of stability are test-retest reliability and parallel-form reliability.

4.2 Test-Retest Reliability


The reliability coefficient obtained with a repetition of the same measure on
a second occasion is called test-retest reliability. That is, when a
questionnaire is administered to a set of respondents now, and again to the
same respondents, say, several weeks to six months later, then the correlation
between the scores obtained at the two different times from one and the
same set of respondents is called the test-retest coefficient. The higher it is,
the better the test-retest reliability, and consequently, the stability of the
measure across time.
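A minimal sketch of the test-retest coefficient, assuming made-up scores from
two administrations of the same instrument to the same respondents:

import numpy as np

time1 = np.array([12, 18, 15, 20, 16])  # scores at first administration
time2 = np.array([13, 17, 14, 21, 15])  # scores weeks or months later

# The correlation between the two sets is the test-retest coefficient;
# the closer to 1, the more stable the measure.
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 3))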

4.3 Parallel-Form Reliability


When responses on two comparable sets of measures tapping the same
construct are highly correlated, we have parallel-form reliability. Both forms
have similar items and the same response format, the only changes being the
wordings and the order or sequence of the questions. What we try to
establish here is the error variability resulting from wording and ordering of
the questions. If two such comparable forms are highly correlated, the
measures are reasonably reliable.

4.4 Internal Consistency of Measures


The internal consistency of measures is indicative of the homogeneity of the
items in the measure that tap the construct. In other words, the items should
hang together as a set, and be capable of independently measuring the
same concept so that the respondents attach the same overall meaning to
each of the items. This can be seen by examining whether the items and the
subsets of items in the measuring instrument are highly correlated. Consistency can
be examined through the inter-item consistency reliability and split-half
reliability tests.

a. Interitem Consistency Reliability


This is a test of the consistency of respondents' answers to all the items in a
measure. To the degree that items are independent measures of the same
concept, they will be correlated with one another. The most popular test of
interitem consistency reliability is Cronbach's coefficient alpha
(Cronbach's alpha; Cronbach, 1951), which is used for multipoint-scaled
items, and the Kuder-Richardson formulas (Kuder & Richardson, 1937),
used for dichotomous items. The higher the coefficients, the better the
measuring instrument.
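A minimal sketch of Cronbach's alpha computed directly from its standard
formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total
scores), with made-up data:

import numpy as np

def cronbach_alpha(items):
    # items: rows = respondents, columns = items of one measure
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of summated scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

scores = [[4, 5, 3, 4],
          [2, 1, 2, 3],
          [5, 5, 4, 5],
          [3, 2, 3, 3]]
print(round(cronbach_alpha(scores), 3))  # about 0.92 for these data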

b. Split-Half Reliability
Split-half reliability reflects the correlation between two halves of an
instrument. The estimates vary depending on how the items in the
measure are split into two halves. Split-half reliabilities can be higher than
Cronbach's alpha only when more than one underlying response dimension is
tapped by the measure and certain other conditions are met as well. Hence, in
almost all cases, Cronbach's alpha can be considered a perfectly adequate
index of interitem consistency reliability.
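A minimal sketch of one split-half estimate (odd items vs. even items, made-up
data), stepped up with the Spearman-Brown formula r_full = 2r / (1 + r); a
different split would give a different estimate, which is exactly why the
estimates vary:

import numpy as np

scores = np.array([[4, 5, 3, 4],
                   [2, 1, 2, 3],
                   [5, 5, 4, 5],
                   [3, 2, 3, 3]])

half1 = scores[:, ::2].sum(axis=1)   # sum of odd-numbered items
half2 = scores[:, 1::2].sum(axis=1)  # sum of even-numbered items

r = np.corrcoef(half1, half2)[0, 1]  # correlation between the halves
print(round(2 * r / (1 + r), 3))     # Spearman-Brown corrected reliability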

5. Validity
Several types of validity tests are used to test the goodness of measures and
writers use different terms to denote them. For the sake of clarity, we may
group validity tests under three broad headings: content validity,
criterion-related validity, and construct validity.
5.1 Content Validity
Content validity ensures that the measure includes an adequate and
representative set of items that tap the concept. The more the scale items
represent the domain or universe of the concept being measured, the greater
the content validity. To put it differently, content validity is a function of how
well the dimensions and elements of a concept have been delineated.
Face validity is considered by some as a basic and very minimal index of
content validity. Face validity indicates that the items that are intended to
measure a concept do, on the face of it, look like they measure the concept.

5.2 Criterion-Related Validity


Criterion-related validity is established when the measure differentiates
individuals on a criterion it is expected to predict. This can be done by
establishing concurrent validity or predictive validity, as explained below.
Concurrent validity is established when the scale discriminates individuals
who are known to be different; that is, they should score differently on the
instrument, as in the example that follows. Predictive validity indicates the
ability of the measuring instrument to differentiate among individuals with
reference to a future criterion.

5.3 Construct Validity


Construct validity testifies to how well the results obtained from the use of the
measure fit the theories around which the test is designed. This is assessed
through convergent and discriminant validity, which are explained below.
Convergent validity is established when the scores obtained with two different
instruments measuring the same concept are highly correlated.
Discriminant validity is established when, based on theory, two variables are
predicted to be uncorrelated, and the scores obtained by measuring them are
indeed empirically found to be so. A sketch of both checks follows.
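A minimal sketch of both checks on made-up scores; instrument_a and
instrument_b are hypothetical measures of the same concept, and unrelated is a
variable theory predicts to be uncorrelated with it:

import numpy as np

instrument_a = np.array([10, 14, 12, 18, 16])  # concept X, instrument 1
instrument_b = np.array([11, 15, 11, 19, 17])  # concept X, instrument 2
unrelated = np.array([6, 7, 4, 6, 5])          # theoretically unrelated variable

# Convergent validity: expect a high correlation (here about 0.97).
print(np.corrcoef(instrument_a, instrument_b)[0, 1])
# Discriminant validity: expect a correlation near zero (here about 0.14).
print(np.corrcoef(instrument_a, unrelated)[0, 1])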

MEASURES FROM MANAGEMENT RESEARCH
Below is a sample of five scales used to measure five variables related to
management research.

V. LEAST PREFERRED COWORKER SCALE (TO ASSESS WHETHER
EMPLOYEES ARE PRIMARILY PEOPLE-ORIENTED OR TASK-ORIENTED)
Look at the words at both ends of the line before you put in your X. Please remember that
there are no right or wrong answers. Work rapidly; your first answer is likely to be the best.
Please do not omit any items, and mark each item only once.
LPC
Think of the person with whom you can work least well. He may be someone you work with
now, or he may be someone you knew in the past.
He does not have to be the person you like least well, but should be the person with whom
you had the most difficulty in getting a job done. Describe this person as he appears to you.

ANY QUESTIONS?
