Cluster Analysis-Unit 11
Cluster Analysis
LEARNING OBJECTIVES
Upon completing this chapter, you should be able to do
the following:
• Define cluster analysis, its roles and its limitations.
• Identify the types of research questions addressed
by cluster analysis.
• Understand how interobject similarity is measured.
• Understand why different distance measures are
sometimes used.
Cluster Analysis Defined
What is Cluster Analysis?
Three Cluster Diagram Showing
Between-Cluster and Within-Cluster Variation
Scatter Diagram for Cluster Observations
[Figure: scatter plot. Vertical axis: frequency of eating out (low to high).
Horizontal axis: frequency of going to fast food restaurants (low to high).]
Criticisms of Cluster Analysis
The following must be addressed by
conceptual rather than empirical support:
What Can We Do With Cluster
Analysis?
Research Questions in Cluster Analysis
Stage 1: Objectives of Cluster Analysis
Other Research Questions?
Selecting Cluster Variables
Two Issues . . .
1. Conceptual considerations – include only
variables that . . .
• Characterize the objects being clustered
• Relate specifically to the objectives of the
cluster analysis
2. Practical considerations.
Rules of Thumb 9–1
OBJECTIVES OF CLUSTER ANALYSIS
• Cluster analysis is used for:
  • Taxonomy description – identifying natural groups within the data.
  • Data simplification – the ability to analyze groups of similar
    observations instead of all individual observations.
  • Relationship identification – the simplified structure from
    cluster analysis portrays relationships not revealed otherwise.
• Theoretical, conceptual and practical considerations must be
  observed when selecting clustering variables for cluster analysis:
  • Include only variables that relate specifically to the objectives
    of the cluster analysis, since “irrelevant” variables cannot be
    excluded from the analysis once it begins.
  • Select variables that characterize the individuals (objects)
    being clustered.
Stage 2: Research Design in Cluster
Analysis
Four Questions . . .
• Is the sample size adequate?
• Can outliers be detected and, if so, should
they be deleted?
• How should object similarity be measured?
• Should the data be standardized?
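The standardization question can be illustrated with a minimal sketch. The data below are made up for illustration, and the helper names `z_scores` and `euclidean` are hypothetical, not part of any package used in the course. The sketch assumes z-score standardization (mean 0, standard deviation 1) and Euclidean distance.

```python
from math import sqrt
from statistics import mean, pstdev

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def euclidean(a, b):
    """Straight-line distance between two observations."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up data: customers A, B, C on two differently scaled variables.
income = [40000, 41000, 40500]   # dollars
rating = [1, 2, 9]               # 1-10 scale

raw = list(zip(income, rating))
std = list(zip(z_scores(income), z_scores(rating)))

raw_ab, raw_ac = euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2])
std_ab, std_ac = euclidean(std[0], std[1]), euclidean(std[0], std[2])

print(f"raw:          d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"standardized: d(A,B)={std_ab:.2f}  d(A,C)={std_ac:.2f}")
```

On the raw data the income variable dominates the distance, so C appears closer to A than B does. After standardization both variables contribute on a comparable scale, and B becomes the closer neighbor of A.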
Measuring Similarity
• Correlational Measures
• Distance Measures
• Association
Types of Distance Measures
• Euclidean distance
• Squared (or absolute) Euclidean
distance
• Chebychev distance
• Mahalanobis distance (D²)
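The distance measures listed above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the two observations and the covariance matrix are made up, and the Mahalanobis helper is restricted to two variables so that the 2x2 matrix inverse can be written out by hand.

```python
from math import sqrt

def squared_euclidean(a, b):
    """Sum of squared differences across the variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the squared Euclidean distance."""
    return sqrt(squared_euclidean(a, b))

def chebychev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))

def mahalanobis_d2(a, b, cov):
    """Mahalanobis D^2 for two variables, given a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    d1, d2 = a[0] - b[0], a[1] - b[1]
    # d' * inverse(cov) * d, with the 2x2 inverse expanded by hand
    return (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det

a, b = (1, 2), (4, 6)                          # made-up observations
uncorrelated = [[1.0, 0.0], [0.0, 1.0]]        # assumed covariance matrices
correlated = [[1.0, 0.5], [0.5, 1.0]]

print(euclidean(a, b), squared_euclidean(a, b), chebychev(a, b))
print(mahalanobis_d2(a, b, uncorrelated), mahalanobis_d2(a, b, correlated))
```

With uncorrelated, unit-variance variables, D² equals the squared Euclidean distance. When the two variables are positively correlated, a difference in the same direction on both is partly redundant, so D² is smaller.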
Rules of Thumb 9–2
RESEARCH DESIGN IN CLUSTER ANALYSIS
• Similarity measures calculated across the entire set of clustering
  variables allow observations to be compared with each other and
  grouped.
• Distance measures are used most often. Strictly, they measure
  dissimilarity: higher values represent greater distance between
  cases, not greater similarity.
• There are many different distance measures, including:
  • Euclidean (straight-line) distance, the most common measure of
    distance.
  • Squared Euclidean distance, the sum of squared differences across
    the clustering variables. It is the recommended measure for the
    centroid and Ward’s methods of clustering.
  • Mahalanobis distance, which accounts for variable
    intercorrelations and weights each variable equally. It is most
    appropriate when variables are highly intercorrelated.
• Correlational measures are used less frequently. For these, large
  values do indicate similarity.
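The contrast between distance and correlational measures can be seen in a small sketch. The three profiles are made up, and `pearson` and `euclidean` are hypothetical helper names written out here so the example depends only on the standard library.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance: larger values mean less similar."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation between two profiles: larger values mean more similar."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

# Made-up profiles across four clustering variables.
p = [1, 2, 3, 4]
q = [6, 7, 8, 9]   # same pattern as p, but at a higher level
r = [4, 3, 2, 1]   # similar level to p, but the opposite pattern

print(f"p vs q: distance={euclidean(p, q):.2f}  correlation={pearson(p, q):.2f}")
print(f"p vs r: distance={euclidean(p, r):.2f}  correlation={pearson(p, r):.2f}")
```

The two kinds of measure disagree here. Distance treats r as the closer match to p because their magnitudes are similar, while correlation treats q as the closer match because its pattern across the variables is the same.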
Stage 3: Assumptions of Cluster
Analysis
Rules of Thumb 9 – 3
Stage 4: Deriving Clusters and Assessing
Overall Fit
Two Types of Hierarchical
Clustering Procedures
Stage 5: Interpretation of the
Clusters
Stage 6: Validation and Profiling of the
Clusters
Validation . . .
• Cross-validation
• Criterion validity
Rules of Thumb 9–5
DERIVING THE FINAL CLUSTER SOLUTION
• There is no single objective procedure to determine the “correct”
  number of clusters. Rather, the researcher must evaluate
  alternative cluster solutions on the following considerations to
  select the “best” solution:
  • Single-member or extremely small clusters are generally not
    acceptable and should be eliminated.
  • For hierarchical methods, ad hoc stopping rules, based on the
    rate of change in a total similarity measure as the number of
    clusters increases or decreases, are an indication of the number
    of clusters.
  • All clusters should be significantly different across the set
    of clustering variables.
  • Cluster solutions ultimately must have theoretical validity,
    assessed through external validation.
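One possible form of a rate-of-change stopping rule is sketched below. It is only one ad hoc rule among several, and everything in it is an assumption for illustration: the heterogeneity values are made up, and the function names `percent_increases` and `suggested_clusters` are hypothetical. The idea is that a large jump in heterogeneity when two clusters are merged suggests stopping just before that merge.

```python
def percent_increases(coefficients):
    """Map each cluster count k to the percent increase in heterogeneity
    caused by merging from k clusters down to the next smaller solution."""
    ks = sorted(coefficients, reverse=True)
    increases = {}
    for k_before, k_after in zip(ks, ks[1:]):
        before, after = coefficients[k_before], coefficients[k_after]
        increases[k_before] = 100.0 * (after - before) / before
    return increases

def suggested_clusters(coefficients):
    """Return the cluster count just before the largest percent increase."""
    increases = percent_increases(coefficients)
    return max(increases, key=increases.get)

# Made-up heterogeneity (agglomeration) values: number of clusters -> value.
example = {6: 100.0, 5: 112.0, 4: 127.0, 3: 145.0, 2: 230.0, 1: 300.0}

for k, pct in percent_increases(example).items():
    print(f"merging from {k} clusters: +{pct:.1f}%")
print("suggested number of clusters:", suggested_clusters(example))
```

In this made-up series the largest jump occurs when moving from three clusters to two, so the rule points to the three-cluster solution. The result is an indication only and would still need to be weighed against the other considerations listed above.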
Steps in Cluster Analysis . . .
Step 1: Cluster Analysis – Variable Selection
Description of HBAT Primary Database Variables

Variable  Description                                        Variable Type

Data Warehouse Classification Variables
X1        Customer Type                                      nonmetric
X2        Industry Type                                      nonmetric
X3        Firm Size                                          nonmetric
X4        Region                                             nonmetric
X5        Distribution System                                nonmetric

Performance Perceptions Variables
X6        Product Quality                                    metric
X7        E-Commerce Activities/Website                      metric
X8        Technical Support                                  metric
X9        Complaint Resolution                               metric
X10       Advertising                                        metric
X11       Product Line                                       metric
X12       Salesforce Image                                   metric
X13       Competitive Pricing                                metric
X14       Warranty & Claims                                  metric
X15       New Products                                       metric
X16       Ordering & Billing                                 metric
X17       Price Flexibility                                  metric
X18       Delivery Speed                                     metric

Outcome/Relationship Measures
X19       Satisfaction                                       metric
X20       Likelihood of Recommendation                       metric
X21       Likelihood of Future Purchase                      metric
X22       Current Purchase/Usage Level                       metric
X23       Consider Strategic Alliance/Partnership in Future  nonmetric
Cluster Analysis
Learning Checkpoint