Cluster Analysis-Unit 11

Unit-11- Cluster Analysis

9-1
Cluster Analysis

LEARNING OBJECTIVES
Upon completing this chapter, you should be able to do
the following:
• Define cluster analysis, its roles and its limitations.
• Identify the types of research questions addressed
by cluster analysis.
• Understand how interobject similarity is measured.
• Understand why different distance measures are
sometimes used.

9-2
Cluster Analysis

LEARNING OBJECTIVES continued . . .


Upon completing this chapter, you should be able to do
the following:
• Understand the differences between hierarchical and
nonhierarchical clustering techniques.
• Know how to interpret the results from cluster
analysis.
• Follow the guidelines for cluster validation.

9-3
Cluster Analysis Defined

Cluster analysis . . . groups objects
(respondents, products, firms, variables,
etc.) so that each object is similar to
the other objects in the cluster and
different from objects in all the other
clusters.

9-4
What is Cluster Analysis?

Cluster analysis . . . is a group of multivariate
techniques whose primary purpose is to group
objects based on the characteristics they
possess.

• It has been referred to as Q analysis, typology
construction, classification analysis, and
numerical taxonomy.

• The essence of all clustering approaches is the
classification of data as suggested by “natural”
groupings of the data themselves.

9-5
Three Cluster Diagram Showing
Between-Cluster and Within-Cluster Variation

Between-Cluster Variation = Maximize


Within-Cluster Variation = Minimize
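
The maximize/minimize idea on this slide can be illustrated with a small calculation. The sketch below is illustrative only: the three clusters and their two-variable observations are hypothetical, and the sum-of-squares decomposition is one common way to quantify the two kinds of variation, not necessarily the one behind the original diagram.

```python
from statistics import mean

# Hypothetical two-variable observations, already assigned to three clusters.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    "B": [(6.0, 6.5), (7.0, 6.0), (6.5, 7.0)],
    "C": [(1.0, 7.0), (1.5, 6.5), (2.0, 7.5)],
}

def centroid(points):
    """Mean of each variable across the points."""
    return tuple(mean(coord) for coord in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

all_points = [p for pts in clusters.values() for p in pts]
grand = centroid(all_points)

# Within-cluster variation: squared distances of points to their own centroid.
within_ss = sum(sq_dist(p, centroid(pts)) for pts in clusters.values() for p in pts)

# Between-cluster variation: squared distances of cluster centroids to the
# grand centroid, weighted by cluster size.
between_ss = sum(len(pts) * sq_dist(centroid(pts), grand) for pts in clusters.values())

# Total variation: squared distances of all points to the grand centroid.
total_ss = sum(sq_dist(p, grand) for p in all_points)

print(f"within={within_ss:.2f} between={between_ss:.2f} total={total_ss:.2f}")
```

A good solution has a small within-cluster value relative to the between-cluster value; the two add up to the total variation.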

9-6
Scatter Diagram for Cluster Observations
[Figure, repeated across four slides: scatter plot of observations. Vertical axis: frequency of eating out (low to high). Horizontal axis: frequency of going to fast food restaurants (low to high).]

9-10
Criticisms of Cluster Analysis
The following must be addressed by
conceptual rather than empirical support:

• Cluster analysis is descriptive, atheoretical,
and noninferential.
• Cluster analysis will always create clusters, regardless of
the actual existence of any structure in the
data.
• The cluster solution is not generalizable
because it is totally dependent upon the
variables used as the basis for the similarity
measure.

9-11
What Can We Do With Cluster
Analysis?

1. Determine if statistically different clusters
exist.

2. Identify the meaning of the clusters.

3. Explain how the clusters can be used.

9-12
Research Questions in Cluster Analysis

The primary objective of cluster analysis is to define
the structure of the data by placing the most similar
observations into groups. To do so, we must answer
three questions:
• How do we measure similarity?
• How do we form clusters?
• How many groups do we form?

9-13
Stage 1: Objectives of Cluster Analysis

Primary Goal = to partition a set of objects into two
or more groups based on the similarity of the
objects for a set of specified characteristics (the
cluster variate).

Two key issues:
• The research questions being addressed, and
• The variables used to characterize objects in the
clustering process.

9-14
Other Research Questions?

Three basic questions . . .
• How to form the taxonomy – an empirically
based classification of objects.
• How to simplify the data – by grouping
observations for further analysis.
• Which relationships can be identified – the
process reveals relationships among the
observations.

9-15
Selecting Cluster Variables

Two Issues . . .
1. Conceptual considerations – include only
variables that . . .
• Characterize the objects being clustered
• Relate specifically to the objectives of the
cluster analysis
2. Practical considerations.

9-16
Rules of Thumb 9–1
OBJECTIVES OF CLUSTER ANALYSIS
• Cluster analysis is used for:
  - Taxonomy description – identifying natural groups within the
    data.
  - Data simplification – the ability to analyze groups of similar
    observations instead of all individual observations.
  - Relationship identification – the simplified structure from
    cluster analysis portrays relationships not revealed otherwise.
• Theoretical, conceptual and practical considerations must be
  observed when selecting clustering variables for cluster analysis:
  - Only variables that relate specifically to objectives of the
    cluster analysis are included, since “irrelevant” variables
    cannot be excluded from the analysis once it begins.
  - Variables are selected which characterize the individuals
    (objects) being clustered.
9-17
Stage 2: Research Design in Cluster
Analysis

Four Questions . . .
• Is the sample size adequate?
• Can outliers be detected and, if so, should
they be deleted?
• How should object similarity be measured?
• Should the data be standardized?
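
The standardization question matters because variables on large scales dominate distance calculations. The sketch below is illustrative only: the respondents, the two variables, and the choice of z-scores are all assumptions made for the example, not data from these slides.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical respondents: (annual income in dollars, satisfaction on a 1-10 scale).
data = [(30000, 2), (32000, 9), (45000, 3), (90000, 5), (85000, 8)]

def euclidean(p, q):
    """Straight-line distance between two observations."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(rows):
    """Convert each variable to z-scores (mean 0, standard deviation 1)."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

z = standardize(data)

# Raw data: income differences swamp satisfaction differences.
raw_01 = euclidean(data[0], data[1])
raw_02 = euclidean(data[0], data[2])

# Standardized data: both variables contribute on a comparable scale.
z_01 = euclidean(z[0], z[1])
z_02 = euclidean(z[0], z[2])

print(f"raw:          d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"standardized: d(0,1)={z_01:.2f}  d(0,2)={z_02:.2f}")
```

On the raw data respondent 0 looks closest to respondent 1 (similar income); after standardization respondent 0 is closer to respondent 2, because the large gap in satisfaction is no longer drowned out.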

9-18
Measuring Similarity

Interobject similarity is an empirical
measure of correspondence, or resemblance,
between objects to be clustered. It can be
measured in a variety of ways, but three methods
dominate the applications of cluster analysis:

• Correlational Measures
• Distance Measures
• Association Measures

9-19
Types of Distance Measures

• Euclidean distance
• Squared (or absolute) Euclidean
distance
• Chebychev distance
• Mahalanobis distance (D²)
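
Three of the measures listed above can be sketched in a few lines. The function names and the two sample profiles are hypothetical. Mahalanobis distance is left out of this sketch because it also requires the covariance matrix of the variables.

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    """Sum of squared differences, with no square root taken."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chebychev(p, q):
    """Largest absolute difference on any single variable."""
    return max(abs(a - b) for a, b in zip(p, q))

# Two hypothetical objects measured on three variables.
a = (2.0, 5.0, 1.0)
b = (5.0, 1.0, 1.0)

print(euclidean(a, b))          # 5.0
print(squared_euclidean(a, b))  # 25.0
print(chebychev(a, b))          # 4.0
```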

9-20
Rules of Thumb 9–2

Research Design in Cluster Analysis

• The sample size required is not based on statistical
  considerations for inference testing, but rather:
  - Sufficient size is needed to ensure
    representativeness of the population and its
    underlying structure, particularly small groups within
    the population.
  - Minimum group sizes are based on the relevance of
    each group to the research question and the
    confidence needed in characterizing that group.

9-21
Rules of Thumb 9–2 continued . . .
Research Design in Cluster Analysis
• Similarity measures calculated across the entire set of clustering
  variables allow for the grouping of observations and their comparison
  to each other.
  - Distance measures are most often used as a measure of similarity,
    with higher values representing greater dissimilarity (distance
    between cases), not similarity.
  - There are many different distance measures, including:
      - Euclidean (straight line) distance is the most common
        measure of distance.
      - Squared Euclidean distance is the sum of squared differences
        across the variables (no square root is taken) and is the
        recommended measure for the centroid and Ward’s methods of
        clustering.
      - Mahalanobis distance accounts for variable intercorrelations
        and weights each variable equally. When variables are highly
        intercorrelated, Mahalanobis distance is most appropriate.
  - Less frequently used are correlational measures, where large
    values do indicate similarity.
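
The contrast between distance and correlational measures can be seen on a small example. This is an illustrative sketch only: the three rating profiles are hypothetical, and Pearson correlation is assumed as the correlational measure.

```python
from math import sqrt
from statistics import mean

def euclidean(p, q):
    """Distance measure: larger values mean less similar."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(p, q):
    """Correlational measure: larger values mean more similar patterns."""
    mp, mq = mean(p), mean(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = sqrt(sum((a - mp) ** 2 for a in p))
    sq = sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)

# Hypothetical ratings of three objects on five variables.
x = [1, 3, 2, 4, 3]
y = [6, 8, 7, 9, 8]  # same pattern as x, shifted up by 5 on every variable
w = [3, 1, 3, 2, 4]  # similar magnitude to x, different pattern

print(f"x vs y: distance={euclidean(x, y):.2f} correlation={pearson(x, y):.2f}")
print(f"x vs w: distance={euclidean(x, w):.2f} correlation={pearson(x, w):.2f}")
```

The correlation pairs x with y (identical pattern), while the distance pairs x with w (similar magnitudes). Which grouping is wanted depends on whether the research question is about patterns or about levels.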
9-22
Stage 3: Assumptions of Cluster
Analysis

• Representativeness of the sample.


• Impact of multicollinearity.

9-23
Rules of Thumb 9–3

ASSUMPTIONS IN CLUSTER ANALYSIS

• Input variables should be examined for substantial
  multicollinearity and if present . . .
  - Reduce the variables to equal numbers in each
    set of correlated measures.
  - Use a distance measure that compensates for
    the correlation, like Mahalanobis Distance.
  - Take a proactive approach and include only
    cluster variables that are not highly correlated.
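
How Mahalanobis distance compensates for correlation can be sketched for the two-variable case, where the covariance matrix can be inverted by hand. The data and the comparison points are hypothetical, and the function names are chosen for this example only.

```python
from statistics import mean

# Hypothetical observations on two highly correlated variables.
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]

def covariance_terms(rows):
    """Sample variances and covariance (sxx, sxy, syy) of two variables."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    n = len(rows)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_sq(p, q, cov):
    """Squared Mahalanobis distance d' S^-1 d for two variables."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - q[0], p[1] - q[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is (1/det) * [[syy, -sxy], [-sxy, sxx]].
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def euclidean_sq(p, q):
    """Squared Euclidean distance, which ignores correlation."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

cov = covariance_terms(data)
origin = (3.0, 3.0)
along = (4.0, 4.0)    # differs from origin in the direction of the correlation
against = (4.0, 2.0)  # differs from origin against the correlation

print(euclidean_sq(origin, along), euclidean_sq(origin, against))
print(mahalanobis_sq(origin, along, cov), mahalanobis_sq(origin, against, cov))
```

Both points are equally far from the origin in Euclidean terms, but the Mahalanobis distance treats the difference that runs against the correlation as far larger, since such a difference is unusual given how the two variables move together.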

9-24
Stage 4: Deriving Clusters and Assessing
Overall Fit

The researcher must . . .
• Select the partitioning procedure used
  for forming clusters
  - Hierarchical
  - Non-hierarchical
• Decide on the number of clusters to be
  formed.
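
A non-hierarchical procedure needs the number of clusters chosen in advance. The slides do not name a specific algorithm, so the sketch below assumes k-means, one common non-hierarchical procedure. The points and the starting seeds are hypothetical.

```python
from statistics import mean

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, seeds, max_iter=100):
    """Assign each point to its nearest centroid, recompute the centroids,
    and repeat until the assignments stop changing."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(mean(c) for c in zip(*members))
    return assignment, centroids

# Hypothetical observations on two variables; the number of clusters (2)
# is fixed in advance by supplying two seed points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
seeds = [points[0], points[3]]

labels, centers = k_means(points, seeds)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [(1.5, 1.5), (6.5, 6.5)]
```

The result can depend on the seed points, which is one reason the choice of procedure and the number of clusters are treated as separate research decisions.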

9-25
Two Types of Hierarchical
Clustering Procedures

1. Agglomerative Methods (buildup)

2. Divisive Methods (breakdown)
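
The agglomerative (buildup) idea can be sketched as follows: start with every object in its own cluster and repeatedly merge the two closest clusters. The slides do not specify how the distance between clusters is defined, so single linkage (nearest members) is assumed here, and the points are hypothetical. A divisive method would run in the opposite direction, starting from one cluster and splitting it.

```python
from itertools import combinations
from math import sqrt

def euclidean(p, q):
    """Straight-line distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2, points):
    """Distance between two clusters = distance between their closest members."""
    return min(euclidean(points[a], points[b]) for a in c1 for b in c2)

def agglomerate(points):
    """Single-linkage agglomerative clustering; returns the merge history
    as a list of (member indices of the merged cluster, merge distance)."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        pair, d = min(
            (
                ((i, j), single_link(clusters[i], clusters[j], points))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        i, j = pair
        merged = clusters[i] + clusters[j]
        history.append((sorted(merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical observations on two variables.
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)]

history = agglomerate(points)
for members, distance in history:
    print(members, round(distance, 2))
```

Each line of output is one step of the buildup; reading the history from top to bottom goes from five single-object clusters to one cluster containing everything.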

9-26
Stage 5: Interpretation of the
Clusters

• This stage involves examining each
cluster in terms of the cluster variate to
name or assign a label accurately
describing the nature of the clusters.

9-28
Stage 6: Validation and Profiling of the
Clusters

Validation . . .
• Cross-validation
• Criterion validity

Profiling . . . describing the characteristics of each
cluster to explain how they may differ on relevant
dimensions. This typically involves the use of
discriminant analysis or ANOVA.
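
Profiling with ANOVA can be sketched by computing the one-way F ratio of a descriptive variable across the clusters. The cluster memberships and the age values are hypothetical. Judging significance would also require the F distribution, which this sketch does not compute.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F ratio: between-group mean square divided by
    within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical profiling variable (age) for the members of three clusters.
age_by_cluster = [[23, 25, 27, 24], [41, 39, 44, 40], [31, 30, 33, 34]]

# Hypothetical variable on which the clusters do not differ at all.
flat_by_cluster = [[5, 6, 7], [6, 5, 7], [7, 6, 5]]

f_age = f_statistic(age_by_cluster)
f_flat = f_statistic(flat_by_cluster)
print(f"F(age)={f_age:.1f}  F(flat)={f_flat:.1f}")
```

A large F ratio suggests the clusters differ on that variable and that it is useful for describing them; a ratio near zero suggests it does not distinguish the clusters.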

9-29
Rules of Thumb 9–5
DERIVING THE FINAL CLUSTER SOLUTION
• There is no single objective procedure to determine the ‘correct’
  number of clusters. Rather the researcher must evaluate
  alternative cluster solutions on the following considerations to
  select the “best” solution:
  - Single-member or extremely small clusters are generally
    not acceptable and should be eliminated.
  - For hierarchical methods, ad hoc stopping rules, based on
    the rate of change in a total similarity measure as the
    number of clusters increases or decreases, are an
    indication of the number of clusters.
  - All clusters should be significantly different across the set
    of clustering variables.
  - Cluster solutions ultimately must have theoretical validity
    assessed through external validation.
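
One way to apply a stopping rule based on rate of change is sketched below. The heterogeneity values are hypothetical, and "stop before the merge with the largest percentage jump" is only one possible ad hoc rule, in keeping with the point that no single objective procedure exists.

```python
# Hypothetical agglomeration schedule from a hierarchical run:
# number of clusters -> total within-cluster heterogeneity at that stage.
coefficients = {6: 10.2, 5: 12.1, 4: 14.5, 3: 17.9, 2: 41.6, 1: 88.3}

def pct_increase_on_merge(coeffs):
    """Percentage increase in heterogeneity when moving from k clusters
    to k - 1 clusters, for every k where both values are available."""
    return {
        k: 100.0 * (coeffs[k - 1] - coeffs[k]) / coeffs[k]
        for k in sorted(coeffs, reverse=True)
        if k - 1 in coeffs
    }

changes = pct_increase_on_merge(coefficients)
for k, pct in changes.items():
    print(f"{k} -> {k - 1} clusters: +{pct:.1f}%")

# Ad hoc rule assumed for this sketch: keep the solution that exists just
# before the largest percentage jump in heterogeneity.
suggested_k = max(changes, key=changes.get)
print("suggested number of clusters:", suggested_k)
```

The suggestion is only a starting point: the other considerations in the list above, such as cluster size and theoretical validity, still have to be checked for the candidate solution.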
9-30
Steps in Cluster Analysis . . .

1. Select the variables.


2. Determine if clusters exist. To do so, verify
the clusters are statistically different and
theoretically meaningful (a logical name can
be assigned).
3. Decide how many clusters to use.
4. Describe the characteristics of the derived
clusters using demographics, psychographics,
etc.

9-31
Step 1: Cluster Analysis – Variable Selection

• Variables are typically measured
metrically, but the technique can be applied
to non-metric variables.

• Variables must be logically related to a
single underlying concept or construct.

9-32
Description of HBAT Primary Database Variables

Variable  Description                                          Variable Type
--------  ---------------------------------------------------  -------------
Data Warehouse Classification Variables
X1        Customer Type                                        nonmetric
X2        Industry Type                                        nonmetric
X3        Firm Size                                            nonmetric
X4        Region                                               nonmetric
X5        Distribution System                                  nonmetric
Performance Perceptions Variables
X6        Product Quality                                      metric
X7        E-Commerce Activities/Website                        metric
X8        Technical Support                                    metric
X9        Complaint Resolution                                 metric
X10       Advertising                                          metric
X11       Product Line                                         metric
X12       Salesforce Image                                     metric
X13       Competitive Pricing                                  metric
X14       Warranty & Claims                                    metric
X15       New Products                                         metric
X16       Ordering & Billing                                   metric
X17       Price Flexibility                                    metric
X18       Delivery Speed                                       metric
Outcome/Relationship Measures
X19       Satisfaction                                         metric
X20       Likelihood of Recommendation                         metric
X21       Likelihood of Future Purchase                        metric
X22       Current Purchase/Usage Level                         metric
X23       Consider Strategic Alliance/Partnership in Future    nonmetric
9-33
Cluster Analysis
Learning Checkpoint

1. Why might we use cluster analysis?


2. What are the three major steps in
cluster analysis?
3. How can we apply cluster analysis in
business research?

9-37
