Measurement and Scaling


Scaling is the procedure of measuring objects and assigning them numbers according to specified rules. In other words, scaling is the process of locating the measured objects on a continuum, a continuous sequence of numbers to which the objects are assigned.

Measurement is the process of assigning numbers or symbols to the characteristics of an object according to specified rules. Here, the researcher assigns numbers not to the object itself but to its characteristics, such as perceptions, attitudes, preferences, and other relevant traits.

For example, consider a scale from 1 to 10 for locating a consumer characteristic (preference for a product). Each respondent is assigned a number from 1 to 10 denoting his or her degree of favorableness toward the product, with ‘1’ indicating extremely unfavorable and ‘10’ indicating extremely favorable. Here, measurement is the process of assigning an actual number from 1 to 10 to each respondent, while scaling is the process of placing respondents on a continuum with respect to their preference for the product.

In research, numbers are usually assigned to the qualitative traits of an object because quantitative data help in the statistical analysis of the resulting data and facilitate the communication of measurement rules and results.

Variables are defined and categorized using different scales of measurement. Each level of measurement has specific properties that determine which statistical analyses are appropriate. In this article, we will cover the four types of scales: nominal, ordinal, interval, and ratio.
What is a Scale?
A scale is a device or object used to measure or quantify an event or another object.

Levels of Measurements
There are four different scales of measurement, and any data can be classified as belonging to one of them. The four types of scales are:

 Nominal Scale
 Ordinal Scale
 Interval Scale
 Ratio Scale

Nominal Scale
A nominal scale is the 1st level of measurement, in which the numbers serve as “tags” or “labels” to classify or identify objects. A nominal scale usually deals with non-numeric variables, or with numbers that carry no quantitative value.
Characteristics of Nominal Scale

 A nominal scale variable is classified into two or more categories; in this measurement mechanism, the answer must fall into one of the classes.
 It is qualitative. The numbers are used here only to identify the objects.
 The numbers do not describe the characteristics of the objects. The only permissible operation on numbers in the nominal scale is “counting.”
Example:
An example of a nominal scale measurement is given below:
What is your gender?
M- Male
F- Female
Here, the variables are used as tags, and the answer to this question should be
either M or F.
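Since counting is the only permissible operation on nominal data, analysis typically reduces to tallying category frequencies. A minimal Python sketch (the responses below are hypothetical):

    from collections import Counter

    # Hypothetical nominal responses to "What is your gender?"
    responses = ["M", "F", "F", "M", "F", "M", "M", "F", "F"]

    # Counting category frequencies is the only valid operation here.
    counts = Counter(responses)
    print(counts)  # Counter({'F': 5, 'M': 4})

    # The mode (most frequent category) is the only meaningful
    # "average" for nominal data.
    print(counts.most_common(1)[0][0])  # 'F'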

Ordinal Scale
The ordinal scale is the 2nd level of measurement that reports the ordering and
ranking of data without establishing the degree of variation between them.
Ordinal represents the “order.” Ordinal data is known as qualitative data or
categorical data. It can be grouped, named, and also ranked.
Characteristics of the Ordinal Scale

 The ordinal scale shows the relative ranking of the variables
 It identifies and describes the magnitude of a variable
 Along with the information provided by the nominal scale, ordinal scales give the rankings of those variables
 The interval properties are not known
 The surveyors can quickly analyze the degree of agreement concerning the identified order of variables
Example:

 Ranking of school students – 1st, 2nd, 3rd, etc.
 Ratings in restaurants
 Evaluating the frequency of occurrences
o Very often
o Often
o Not often
o Not at all
 Assessing the degree of agreement
o Totally agree
o Agree
o Neutral
o Disagree
o Totally disagree
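Because ordinal data can be ranked but the gaps between ranks are unknown, the median (middle rank) is a meaningful summary while the mean is not. A minimal sketch with hypothetical agreement responses coded by rank:

    import statistics

    # Hypothetical responses on the 5-point agreement scale above,
    # coded by rank only: 1 = totally disagree ... 5 = totally agree.
    responses = [5, 4, 4, 2, 3, 5, 1, 4, 3]

    # Rank-based summaries are valid on ordinal data.
    print(statistics.median(responses))  # 4

    # The mean would treat rank differences as equal-sized intervals,
    # which ordinal data do not guarantee.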

Interval Scale
The interval scale is the 3rd level of the measurement scale. It is defined as a
quantitative measurement scale in which the difference between the two
variables is meaningful. In other words, the variables are measured in an exact
manner, not in a relative way in which the presence of zero is arbitrary.
Characteristics of Interval Scale:

 The interval scale is quantitative as it can quantify the difference between the values
 It allows calculating the mean and median of the variables
 To understand the difference between the variables, you can subtract the values between the variables
 The interval scale is the preferred scale in Statistics as it helps to assign numerical values to arbitrary assessments such as feelings, calendar types, etc.
Example:

 Likert Scale
 Net Promoter Score (NPS)
 Bipolar Matrix Table
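Temperature in Celsius is the classic interval-scale illustration: differences are meaningful, but ratios are not, because 0 °C is an arbitrary zero. A short sketch:

    # Hypothetical daily temperatures in degrees Celsius.
    temps = [10.0, 20.0, 30.0]

    # Differences are meaningful on an interval scale:
    print(temps[1] - temps[0])  # 10.0 degrees warmer

    # Mean and median are permitted:
    print(sum(temps) / len(temps))  # 20.0

    # Ratios are NOT meaningful: 20 C is not "twice as warm" as 10 C,
    # because the zero point (0 C) is arbitrary.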

Ratio Scale
The ratio scale is the 4th level of the measurement scale, which is quantitative. It
is a type of variable measurement scale. It allows researchers to compare the
differences or intervals. The ratio scale has a unique feature. It possesses the
character of the origin or zero points.
Characteristics of Ratio Scale:

 The ratio scale has the feature of absolute zero
 It doesn’t have negative numbers, because of its zero-point feature
 It affords unique opportunities for statistical analysis. The variables can be meaningfully added, subtracted, multiplied, and divided. Mean, median, and mode can be calculated using the ratio scale.
 The ratio scale has unique and useful properties. One such feature is that it allows unit conversions like kilogram – calories, gram – calories, etc.
Example:
An example of a ratio scale is:
What is your weight in Kgs?

 Less than 55 kgs
 55 – 75 kgs
 76 – 85 kgs
 86 – 95 kgs
 More than 95 kgs
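Because a ratio scale has a true zero, ratios and unit conversions are meaningful. A minimal sketch with hypothetical weights:

    # Hypothetical weights in kilograms (a ratio scale: true zero exists).
    weights_kg = [55.0, 110.0, 82.5]

    # Ratios are meaningful: 110 kg really is twice 55 kg.
    print(weights_kg[1] / weights_kg[0])  # 2.0

    # Unit conversion preserves the ratios (1 kg is roughly 2.20462 lb).
    weights_lb = [w * 2.20462 for w in weights_kg]
    print(weights_lb[1] / weights_lb[0])  # still 2.0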
Definition: Scaling is the process of generating the continuum, a
continuous sequence of values, upon which the measured objects are
placed.

In Marketing Research, several scaling techniques are employed to study the relationship between the objects. The most commonly used techniques can be classified as:

1. Comparative Scales: In comparative scaling, there is a direct comparison of stimulus objects. For example, the respondent might be asked directly about his preference between an ink pen and a gel pen. Comparative data can only be interpreted in relative terms and hence possess ordinal or rank-order properties. This is why comparative scaling is also called nonmetric scaling. Comparative scaling includes the following techniques:

 Paired Comparison Scaling
 Rank Order Scaling
 Constant Sum Scaling
 Q-Sort Scaling

2. Noncomparative Scales: The noncomparative scale, also called a monadic or metric scale, is a scale in which each object is scaled independently of the other objects in the stimulus set under study. Generally, the resulting data are assumed to be interval or ratio scaled. For example, a respondent may be asked to rate their preference for a gel pen on a preference scale (1 = not at all preferred, 6 = greatly preferred). The noncomparative scale includes the following techniques:

 Continuous Rating Scale
 Itemized Rating Scale
Paired Comparison Scaling
Definition: The Paired Comparison Scaling is a comparative scaling
technique wherein the respondent is shown two objects at the same time
and is asked to select one according to the defined criterion. The resulting
data are ordinal.
Paired comparison scaling is often used when the stimulus objects are physical products. The comparison data so obtained can be analyzed in either of two ways. First, the researcher can compute the percentage of respondents who prefer one object over another by summing the matrices for all respondents, dividing the sum by the number of respondents, and then multiplying it by 100. Through this method, all the stimulus objects can be evaluated simultaneously.

Second, under the assumption of transitivity (which implies that if brand X is preferred to brand Y, and brand Y to brand Z, then brand X is preferred to brand Z), the paired comparison data can be converted into a rank order. To determine the rank order, the researcher counts the number of times each object is preferred by summing all the matrices.
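A minimal sketch of both computations, assuming hypothetical 0/1 preference matrices in which entry [i][j] = 1 means the row object was preferred over the column object:

    import numpy as np

    # Hypothetical data: one 0/1 preference matrix per respondent
    # for three stimulus objects.
    respondent_matrices = [
        np.array([[0, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]]),
        np.array([[0, 1, 0],
                  [0, 0, 0],
                  [1, 1, 0]]),
    ]

    total = sum(respondent_matrices)

    # Method 1: percentage of respondents preferring object i over j
    # (sum the matrices, divide by respondents, multiply by 100).
    percent = total / len(respondent_matrices) * 100
    print(percent)

    # Method 2 (assuming transitivity): count how often each object
    # was preferred, which yields a rank order.
    times_preferred = total.sum(axis=1)
    print(np.argsort(-times_preferred))  # most preferred object first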

The paired comparison method is effective when the number of objects is limited, because it requires direct comparison, and with a large number of stimulus objects the comparisons become cumbersome. Also, if the assumption of transitivity is violated, the order in which the objects are presented may bias the results.

Rank Order Scaling
Definition: The Rank Order Scaling is yet another comparative scaling technique wherein the respondents are presented with several objects simultaneously and are required to order or rank them according to some specified criterion.

Rank order scaling is often used to measure preference for brands and attributes. Ranking data are typically obtained from respondents in conjoint analysis (a statistical technique used to determine how a brand and the combination of its attributes, such as features, functions, and benefits, influence a person's decision-making), as ranking forces the respondents to discriminate among the stimulus objects. Rank order scaling results in ordinal data.

Compared with paired comparison scaling, rank order scaling more closely resembles the shopping environment; it also takes less time and eliminates intransitive (circular) responses. For instance, if there are ‘n’ stimulus objects, only ‘n − 1’ scaling decisions are needed in rank order scaling, while paired comparison scaling requires ‘n(n − 1)/2’ scaling decisions. Moreover, rank order scaling is an easy method to understand. However, the major limitation of this technique is that it results only in ordinal data.
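A quick computation makes the difference concrete:

    # Number of scaling decisions needed for n stimulus objects.
    n = 10
    rank_order_decisions = n - 1                    # 9
    paired_comparison_decisions = n * (n - 1) // 2  # 45
    print(rank_order_decisions, paired_comparison_decisions)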

Note: Under the assumption of transitivity (which implies that if brand X is preferred to brand Y, and brand Y is preferred to brand Z, then brand X is preferred to brand Z), rank order data can be converted into equivalent paired comparison data and vice versa.

Constant Sum Scaling
Definition: Constant Sum Scaling is a technique wherein the respondents are asked to allocate a constant sum of units, such as points, dollars, chips, or chits, among the stimulus objects according to some specified criterion.

In other words, a scaling technique that involves the assignment of a fixed number of units to each attribute of an object, reflecting the importance a respondent attaches to it, is called constant sum scaling. For example, suppose a respondent is asked to allocate 100 points to the attributes of a body wash based on the importance he attaches to each attribute. If he feels an attribute is unimportant, he can allocate it zero points; if an attribute is twice as important as another, he can assign it twice the points. The sum of all the points allocated across the attributes must equal 100.

Once the points are allocated, the attributes are scaled by summing the points assigned to each attribute by all respondents and then dividing by the number of respondents under analysis, as sketched below. Such information cannot be obtained from rank order data unless they are transformed into interval data. Constant sum scaling is considered an ordinal scale because of its comparative nature and lack of generalization.
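A minimal sketch of that averaging step, with hypothetical allocations from three respondents:

    # Hypothetical constant sum data: each row is one respondent's
    # allocation of 100 points across three body-wash attributes
    # (fragrance, lather, moisturizer).
    allocations = [
        [50, 30, 20],
        [30, 50, 20],
        [40, 40, 20],
    ]

    # Each respondent's allocation must sum to the constant total.
    assert all(sum(row) == 100 for row in allocations)

    # Scale each attribute by averaging its points across respondents.
    n = len(allocations)
    mean_points = [sum(col) / n for col in zip(*allocations)]
    print(mean_points)  # [40.0, 40.0, 20.0]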

One of the advantages of the constant sum scaling technique is that it allows proper discrimination among the stimulus objects without consuming too much time. However, it suffers from two serious limitations. First, the respondent might allocate more or fewer units than specified. Second, rounding errors may arise if too few units are allocated; on the other hand, if a large number of units is used, the task may burden the respondents and cause confusion and fatigue.
Q-Sort Scaling
Definition: The Q-Sort Scaling is a Rank order scaling technique wherein
the respondents are asked to sort the presented objects into piles based on
similarity according to a specified criterion such as preference, attitude,
perception, etc.

In other words, it is a scaling technique in which the respondents sort a number of statements or attitudes into piles, usually 11, based on some specified criterion. For example, suppose the respondents are given 100 motivational statements on individual cards and are asked to place them in 11 piles, ranging from the “most agreed with” to the “least agreed with”. Generally, the most agreed-with statements are placed at the top, while the least agreed-with statements are at the bottom.

Ideally, the number of objects to be sorted should be no fewer than 60 and no more than 140, with 60 to 90 considered the most reasonable range. The number of objects to be placed in each pile is prespecified, such that the resulting data approximate a normal distribution of objects over the whole set under analysis. Q-Sort Scaling was developed to facilitate quick discrimination among a relatively large number of stimulus objects.

Thus, Q-Sort Scaling helps in assigning ranks to different objects within the
same group, and the differences among the groups (piles) are visible.
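As a rough illustration of how pile sizes approximating a normal distribution might be prespecified, the sketch below uses binomial weights as a discrete stand-in for the normal curve (the card and pile counts follow the example above; rounding can leave the total off by a card or two):

    from math import comb

    # 100 hypothetical cards sorted into 11 piles whose sizes follow
    # binomial(10, 0.5) weights, a discrete approximation of the
    # normal distribution.
    n_cards, n_piles = 100, 11
    weights = [comb(10, k) for k in range(n_piles)]  # 1, 10, 45, ...
    total = sum(weights)                             # 1024
    pile_sizes = [round(n_cards * w / total) for w in weights]
    print(pile_sizes)  # few cards in the extreme piles, many in the middle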

Continuous Rating Scale
Definition: The Continuous Rating Scale is a noncomparative scaling technique wherein the respondents are asked to rate the stimulus objects by placing a point/mark at the appropriate position on a line running from one extreme of the criterion variable to the other.

The continuous rating scale is also called a Graphic Rating Scale. Here the respondent can place a mark anywhere on the line based on his opinion and is not restricted to selecting from values previously set by the researcher. The continuous scale can take many forms, i.e. it can be either vertical or horizontal; scale points, in the form of numbers or brief descriptions, may be provided, and if they are provided, the scale points might be few or many.
Once the ratings are obtained, the researcher splits the line into several categories and then assigns scores depending on the category in which each rating falls. We can say that the continuous rating scale possesses the characteristics of description, order, and distance. By description, we mean the unique tags, names, or labels used to designate each scale value. Order refers to the relative position of the descriptors, and distance means that an absolute difference between the descriptors is known and can be expressed in unitary terms.
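A minimal sketch of that scoring step, assuming a 100 mm line split into five equal categories:

    # Hypothetical marks (in mm) placed by respondents on a 100 mm line.
    marks = [12.5, 47.0, 88.3, 60.0]

    def score(mark_mm, line_len=100.0, n_cats=5):
        # Split the line into equal categories and assign scores 1..n_cats.
        cat = int(mark_mm / line_len * n_cats) + 1
        return min(cat, n_cats)  # a mark at the far end falls in the last category

    print([score(m) for m in marks])  # [1, 3, 5, 4]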

One of the advantages of the continuous rating scale is that it is easy to construct. However, scoring it is burdensome and unreliable, and such rating scales provide little additional information. Therefore, the continuous rating scale has limited use in marketing research.

Despite these limitations, the use of the continuous scaling technique has increased with the growing popularity of computer-assisted personal interviewing (CAPI), a technique wherein the respondent and interviewer record answers via computer. The continuous rating scale can be implemented well in CAPI or on the internet, with a cursor moving continuously along the scale to select the position that best describes the respondent’s evaluation.

Itemized Rating Scale
Definition: The Itemized Rating Scale is an ordinal scale that has a brief description or number associated with each category, ordered in terms of scale position. The respondents are asked to select the category that best describes the stimulus object being rated.

The following are the most commonly used itemized rating scales:

1. Likert Scale: A Likert scale is a scale with five response categories ranging from “strongly disagree” to “strongly agree”, wherein the respondent is asked to indicate the degree of agreement or disagreement with each of a series of statements related to the stimulus object under analysis.
2. Semantic Differential Scale: The semantic differential scale is
a seven-point rating scale with the extreme points having semantic
meaning. The scale is used to measure the meaning or semantics of
words, especially the bipolar adjectives (such as “evil” or “good”,
“warm” or “cold”) to derive the respondent’s attitude towards the
stimulus object.
3. Stapel Scale: The Stapel scale is a single-adjective rating scale with 10 categories ranging from -5 to +5, with no zero point. The scale is usually presented vertically, with the single adjective placed in the middle of the even-numbered range (-5 to +5). The respondent is asked to indicate how accurately or inaccurately each term describes the stimulus object by choosing an appropriate response category.
The itemized rating scale is widely used in marketing research and serves
as a basic component of more complex scales, such as Multi-Item Scales.

 Reliability refers to the consistency of a measure (whether the results can be reproduced under the same conditions).
 Validity refers to the accuracy of a measure (whether the results really do represent what they are supposed to measure).

Reliability

Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the
scores they obtain should also be consistent across time. Test-retest reliability is the extent to
which this is actually the case. For example, intelligence is generally thought to be consistent
across time. A person who is highly intelligent today will be highly intelligent next week.
This means that any good measure of intelligence should produce roughly the same scores for
this individual next week as it does today. Clearly, a measure that produces highly
inconsistent scores over time cannot be a very good measure of a construct that is supposed to
be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s r. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
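A minimal sketch of that computation, using SciPy's Pearson correlation on hypothetical time-1 and time-2 scores:

    from scipy.stats import pearsonr

    # Hypothetical scores for the same five people at two times.
    time1 = [100, 115, 92, 108, 124]
    time2 = [98, 118, 95, 105, 126]

    r, p = pearsonr(time1, time2)
    print(round(r, 2))  # about +.98 here; +.80 or greater is conventionally
                        # taken to indicate good test-retest reliability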

Again, high test-retest correlations make sense when the construct being measured is
assumed to be consistent over time, which is the case for intelligence, self-esteem, and the
Big Five personality dimensions. But other constructs are not assumed to be stable over time.
The very nature of mood, for example, is that it changes. So a measure of mood that produced
a low test-retest correlation over a period of a month would not be a cause for concern.
Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other,
then it would no longer make sense to claim that they are all measuring the same underlying
construct. This is as true for behavioural and physiological measures as for self-report
measures. For example, people might make a series of bets in a simulated game of roulette as
a measure of their level of risk seeking. This measure would be internally consistent to the
extent that individual participants’ bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and
analyzing data. One approach is to look at a split-half correlation. This involves splitting the
items into two sets, such as the first and second halves of the items or the even- and odd-
numbered items. Then a score is computed for each set of items, and the relationship between
the two sets of scores is examined.
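A minimal sketch of a split-half correlation on hypothetical item responses (rows are respondents, columns are items; the data here are random, so expect a low value, whereas items on a real scale should correlate highly):

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical responses: 6 respondents x 10 items, scored 1-5.
    rng = np.random.default_rng(0)
    data = rng.integers(1, 6, size=(6, 10))

    # Split the items into odd- and even-numbered sets, score each half,
    # and correlate the two sets of scores.
    odd_scores = data[:, 0::2].sum(axis=1)
    even_scores = data[:, 1::2].sum(axis=1)
    r, _ = pearsonr(odd_scores, even_scores)
    print(r)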

Perhaps the most common measure of internal consistency used by researchers in psychology
is a statistic called Cronbach’s α (the Greek letter alpha). Conceptually, α is the mean of all
possible split-half correlations for a set of items. For example, there are 252 ways to split a
set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half
correlations. Note that this is not how α is actually computed, but it is a correct way of
interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken
to indicate good internal consistency.
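A minimal sketch of the standard variance-based formula by which α is actually computed (rather than by averaging split halves):

    import numpy as np

    def cronbach_alpha(data):
        # data: rows are respondents, columns are items.
        k = data.shape[1]
        item_vars = data.var(axis=0, ddof=1)      # variance of each item
        total_var = data.sum(axis=1).var(ddof=1)  # variance of total scores
        return k / (k - 1) * (1 - item_vars.sum() / total_var)

    # Hypothetical 4-item scale answered by five respondents.
    scores = np.array([
        [4, 5, 4, 5],
        [2, 3, 2, 2],
        [3, 3, 4, 3],
        [5, 4, 5, 5],
        [1, 2, 1, 2],
    ])
    print(round(cronbach_alpha(scores), 2))  # +.80 or greater indicates
                                             # good internal consistency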

Interrater Reliability

Many behavioural measures involve significant judgment on the part of an observer or a rater. Inter-rater reliability is the extent to which different observers are consistent in their
judgments. For example, if you were interested in measuring university students’ social
skills, you could make video recordings of them as they interacted with another student
whom they are meeting for the first time. Then you could have two or more observers watch
the videos and rate each student’s level of social skills. To the extent that each participant
does in fact have some level of social skills that can be detected by an attentive observer,
different observers’ ratings should be highly correlated with each other. Inter-rater reliability
would also have been measured in Bandura’s Bobo doll study. In this case, the observers’
ratings of how many acts of aggression a particular child committed while playing with the
Bobo doll should have been highly positively correlated. Interrater reliability is often
assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic
called Cohen’s κ (the Greek letter kappa) when they are categorical.
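For categorical judgments, a minimal sketch using scikit-learn's implementation of Cohen's κ on hypothetical ratings from two observers:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical categorical judgments by two observers rating the
    # same ten children as aggressive ("agg") or not ("not").
    rater_a = ["agg", "not", "agg", "agg", "not", "not", "agg", "not", "agg", "not"]
    rater_b = ["agg", "not", "agg", "not", "not", "not", "agg", "not", "agg", "agg"]

    # Kappa corrects the raw agreement rate for chance agreement.
    print(cohen_kappa_score(rater_a, rater_b))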

Validity

Validity is the extent to which the scores from a measure represent the variable they are
intended to. But how do researchers make this judgment? We have already considered one
factor that they take into account—reliability. When a measure has good test-retest reliability
and internal consistency, researchers should be more confident that the scores represent what
they are supposed to. There has to be more to it, however, because a measure can be
extremely reliable but have no validity whatsoever. As an absurd example, imagine someone
who believes that people’s index finger length reflects their self-esteem and therefore tries to
measure self-esteem by holding a ruler up to people’s index fingers. Although this measure
would have extremely good test-retest reliability, it would have absolutely no validity. The
fact that one person’s index finger is a centimetre longer than another’s would indicate
nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct “types.” But a good way to
interpret these types is that they are other kinds of evidence—in addition to reliability—that
should be taken into account when judging the validity of a measure. Here we consider three
basic kinds: face validity, content validity, and criterion validity.

Face Validity

Face validity is the extent to which a measurement method appears “on its face” to measure
the construct of interest. Most people would expect a self-esteem questionnaire to include
items about whether they see themselves as a person of worth and whether they think they
have good qualities. So a questionnaire that included these kinds of items would have good
face validity. The finger-length method of measuring self-esteem, on the other hand, seems to
have nothing to do with self-esteem and therefore has poor face validity. Although face
validity can be assessed quantitatively—for example, by having a large sample of people rate
a measure in terms of whether it appears to measure what it is intended to—it is usually
assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring
what it is supposed to. One reason is that it is based on people’s intuitions about human
behaviour, which are frequently wrong. It is also the case that many established measures in
psychology work quite well despite lacking face validity. The Minnesota Multiphasic
Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders
by having people decide whether each of 567 different statements applies to them—
where many of the statements do not have any obvious relationship to the construct that they
measure. For example, the items “I enjoy detective or mystery stories” and “The sight of
blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In
this case, it is not the participants’ literal answers to these questions that are of interest, but
rather whether the pattern of the participants’ responses to a series of questions matches those
of individuals who tend to suppress their aggression.

Content Validity

Content validity is the extent to which a measure “covers” the construct of interest. For
example, if a researcher conceptually defines test anxiety as involving both sympathetic
nervous system activation (leading to nervous feelings) and negative thoughts, then his
measure of test anxiety should include items about both nervous feelings and negative
thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and
actions toward something. By this conceptual definition, a person has a positive attitude
toward exercise to the extent that he or she thinks positive thoughts about exercising, feels
good about exercising, and actually exercises. So to have good content validity, a measure of
people’s attitudes toward exercise would have to reflect all three of these aspects. Like face
validity, content validity is not usually assessed quantitatively. Instead, it is assessed by
carefully checking the measurement method against the conceptual definition of the
construct.

Criterion Validity

Criterion validity is the extent to which people’s scores on a measure are correlated with
other variables (known as criteria) that one would expect them to be correlated with. For
example, people’s scores on a new measure of test anxiety should be negatively correlated
with their performance on an important school exam. If it were found that people’s scores
were in fact negatively correlated with their exam performance, then this would be a piece of
evidence that these scores really represent people’s test anxiety. But if it were found that
people scored equally well on the exam regardless of their test anxiety scores, then this would
cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the
construct being measured, and there will usually be many of them. For example, one would
expect test anxiety scores to be negatively correlated with exam performance and course
grades and positively correlated with general anxiety and with blood pressure during an
exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s
scores on this measure should be correlated with their participation in “extreme” activities
such as snowboarding and rock climbing, the number of speeding tickets they have received,
and even the number of broken bones they have had over the years. When the criterion is
measured at the same time as the construct, criterion validity is referred to as concurrent
validity; however, when the criterion is measured at some point in the future (after the
construct has been measured), it is referred to as predictive validity (because scores on the
measure have “predicted” a future outcome).

Criteria can also include other measures of the same construct. For example, one would
expect new measures of test anxiety or physical risk taking to be positively correlated with
existing measures of the same constructs. This is known as convergent validity.

Assessing convergent validity requires collecting data using the measure. Researchers John
Cacioppo and Richard Petty did this when they created their self-report Need for Cognition
Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982)[1].
In a series of studies, they showed that people’s scores were positively correlated with their
scores on a standardized academic achievement test, and that their scores were negatively
correlated with their scores on a measure of dogmatism (which represents a tendency toward
obedience). In the years since it was created, the Need for Cognition Scale has been used in
literally hundreds of studies and has been shown to be correlated with a wide variety of other
variables, including the effectiveness of an advertisement, interest in politics, and juror
decisions (Petty, Briñol, Loersch, & McCaslin, 2009)[2].

Discriminant Validity

Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-
esteem is a general attitude toward the self that is fairly stable over time. It is not the same as
mood, which is how good or bad one happens to be feeling right now. So people’s scores on a
new measure of self-esteem should not be very highly correlated with their moods. If the new
measure of self-esteem were highly correlated with a measure of mood, it could be argued
that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence
of discriminant validity by showing that people’s scores were not correlated with certain
other variables. For example, they found only a weak correlation between people’s need for
cognition and a measure of their cognitive style—the extent to which they tend to think
analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.”
They also found no correlation between people’s need for cognition and measures of their test
anxiety and their tendency to respond in socially desirable ways. All these low correlations
provide evidence that the measure is reflecting a conceptually distinct construct.
