Cluster Analysis

• Factor analysis reduces the number of variables and is generally known as dimension reduction.
Cluster analysis, on the other hand, reduces the number of records or cases by grouping them.
• Clustering is used to create groups of similar cases; each such group is a cluster.
• Factor analysis is based on the correlation between variables. Cluster analysis is based on the
similarity between cases: the cases within a cluster should be highly similar to each other and
quite dissimilar to the cases in other clusters.
• How is similarity calculated, and which algorithms are commonly used?
• Several distance measures can be used in R to calculate the similarity matrix. The most popular
is Euclidean distance. The distance between cases a and b is
sqrt((a1-b1)^2 + (a2-b2)^2 + … + (an-bn)^2), where n is the number of variables in the study.
• Chebyshev distance is also used. It takes the absolute differences |a1-b1|, |a2-b2|, …, |an-bn|
and uses the largest of them as the distance.
• Manhattan distance calculates the same absolute differences and then adds them up.
• In R we will use the Euclidean method to calculate the distance.
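The three distance measures above can be checked on a small example with R's built-in dist function. This is a minimal sketch: the two cases a and b and their values are hypothetical, chosen only so that the arithmetic is easy to follow. Note that dist calls the Chebyshev distance "maximum".

```r
# Two hypothetical cases measured on three variables (illustrative values only)
a <- c(1, 5, 2)
b <- c(4, 1, 2)
m <- rbind(a, b)

# Differences a - b are (-3, 4, 0)
euclid    <- as.numeric(dist(m, method = "euclidean"))  # sqrt(9 + 16 + 0) = 5
chebyshev <- as.numeric(dist(m, method = "maximum"))    # max(3, 4, 0)     = 4
manhattan <- as.numeric(dist(m, method = "manhattan"))  # 3 + 4 + 0        = 7

print(c(euclidean = euclid, chebyshev = chebyshev, manhattan = manhattan))
```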

Process to do clustering

• Step 1: Identify the variables for clustering. Generally, the more variables are included, the
better the result will be.
• Step 2: Decide the type of clustering: hierarchical or non-hierarchical. Hierarchical clustering
is the preferred choice.
• Step 3: Calculate the similarity (dissimilarity) matrix using Euclidean distance.
• Step 4: Select the clustering method.
• Step 5: Decide the number of clusters, which generally comes from the business context.
• Step 6: Create the cluster profile and check which case falls under which cluster.
• The objective with the dataset cust.csv is to cluster the customers based on the following
variables:
o Monthly average spending
o Number of visits to the departmental store
o Number of apparel purchases
o Number of high-value item purchases
o Number of staple item purchases
• Libraries: NbClust, fpc, cluster
• The structure of the file cust.csv shows that there are 10 observations with 7 variables.
• Before running the clustering, bring all the variables to the same scale.
• Scale only the variables that are inputs to the clustering.
• To scale these variables we will use the built-in function scale, which standardizes each
variable by default.

Normalization: (X - Xmin) / (Xmax - Xmin), or standardization: (X - µ) / σ

• A new data frame, scaled.RCDF, is created to hold the scaled variables.
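Both scaling formulas can be sketched in a few lines of R. Since cust.csv is not reproduced in these notes, the sketch assumes the first 10 rows of R's built-in USArrests data as a stand-in; the name RCDF for the unscaled data frame is inferred from scaled.RCDF, and min_max and normalized.RCDF are hypothetical names introduced here.

```r
# Stand-in data (assumption): first 10 rows of the built-in USArrests data set,
# used only because cust.csv is not available here.
RCDF <- USArrests[1:10, ]

# Standardization, (X - mean) / sd: this is what scale() does by default.
scaled.RCDF <- as.data.frame(scale(RCDF))

# Min-max normalization, (X - Xmin) / (Xmax - Xmin), written by hand.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
normalized.RCDF <- as.data.frame(lapply(RCDF, min_max))
rownames(normalized.RCDF) <- rownames(RCDF)

print(round(colMeans(scaled.RCDF), 10))  # column means are numerically zero
print(sapply(scaled.RCDF, sd))           # column standard deviations are one
print(sapply(normalized.RCDF, range))    # every column runs from 0 to 1
```

Standardization is usually the simpler choice because scale handles it directly; min-max normalization needs the small helper shown above.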

• Calculate the similarity (distance) matrix.
• In the Euclidean distance matrix, the distance between cases 8 and 9 is the smallest
(0.7272685), so the clustering starts by merging cases 8 and 9 and proceeds from there.
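The distance matrix and its closest pair can be found as sketched below. The data are again an assumed stand-in (the first 10 rows of the built-in USArrests data), so the closest pair and its distance will differ from the 8 and 9 pair reported for cust.csv; dm and closest are hypothetical helper names.

```r
# Stand-in data (assumption): cust.csv is not available here.
RCDF <- USArrests[1:10, ]
scaled.RCDF <- scale(RCDF)

# Euclidean distance between every pair of cases
d <- dist(scaled.RCDF, method = "euclidean")
print(round(d, 2))

# Locate the closest pair of cases
dm <- as.matrix(d)
diag(dm) <- Inf   # ignore the zero distance of each case to itself
closest <- which(dm == min(dm), arr.ind = TRUE)[1, ]

print(rownames(dm)[closest])  # names of the two closest cases
print(min(dm))                # their distance
```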
• To create the clusters we will use the average linkage method.
• The procedure is hierarchical and the method is average. The function we will use is hclust,
where h stands for hierarchical clustering.
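A minimal sketch of the hclust call with average linkage follows, again assuming the first 10 rows of the built-in USArrests data as a stand-in for cust.csv. It also shows that the first merge happens at the smallest distance in the matrix, which is why clustering starts from the closest pair.

```r
# Stand-in data (assumption): cust.csv is not available here.
RCDF <- USArrests[1:10, ]
d <- dist(scale(RCDF), method = "euclidean")

# Hierarchical clustering with average linkage
hc <- hclust(d, method = "average")

# Each row of hc$merge is one merge step; negative numbers are original cases.
print(hc$merge[1, ])   # the two cases joined first
print(hc$height[1])    # the distance at which they were joined
print(min(d))          # the smallest distance in the matrix
```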
• Plot the dendrogram to see how the cases are combined.
• The plot should show the names of the customers as labels.
• Customers A, C, and D form the first cluster; H, I, F, and J form the second cluster; B, E, and
G form the third cluster.
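Plotting the dendrogram and reading off cluster membership can be sketched as follows. The data are an assumed stand-in (first 10 rows of the built-in USArrests data), so the labels are state names rather than customers A to J, and k = 3 is chosen only to mirror the three clusters described above.

```r
# Stand-in data (assumption): cust.csv is not available here.
RCDF <- USArrests[1:10, ]
hc <- hclust(dist(scale(RCDF), method = "euclidean"), method = "average")

# Dendrogram labelled with the row names of the data
plot(hc, labels = rownames(RCDF), hang = -1,
     main = "Average linkage dendrogram")
rect.hclust(hc, k = 3)   # draw boxes around the three clusters

# Cut the tree into three clusters and list the members of each
groups <- cutree(hc, k = 3)
print(split(names(groups), groups))
```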
• To find the characteristics of each cluster, aggregate the variables by cluster. This
aggregation shows the typical value of each variable in each cluster.
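One way to build such a cluster profile is with aggregate, taking the mean of each variable per cluster. This is a sketch under the same assumption as before (first 10 rows of the built-in USArrests data instead of cust.csv); the names groups and profile are hypothetical.

```r
# Stand-in data (assumption): cust.csv is not available here.
RCDF <- USArrests[1:10, ]
hc <- hclust(dist(scale(RCDF), method = "euclidean"), method = "average")
groups <- cutree(hc, k = 3)

# Mean of every (unscaled) variable within each cluster
profile <- aggregate(RCDF, by = list(cluster = groups), FUN = mean)
print(profile)

# Number of cases in each cluster
print(table(groups))
```

Profiling on the unscaled variables keeps the means in their original units, which makes the clusters easier to describe.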
• Dataset in our LMS: cities.sav
• The clustering methods available are average linkage, single linkage, complete linkage, the
centroid method, and Ward's method. hclust is used for hierarchical clustering. The alternative
is a non-hierarchical algorithm, the most important of which is K-means clustering; it follows
the same overall procedure as hclust does for hierarchical clustering.
• Hierarchical clustering is preferred over non-hierarchical methods such as K-means.
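For comparison, a K-means run on the same scaled data can be sketched with the built-in kmeans function. The data are an assumed stand-in (first 10 rows of the built-in USArrests data), and the seed, centers = 3, and nstart = 25 are illustrative choices rather than settings taken from these notes.

```r
# Stand-in data (assumption): cust.csv is not available here.
set.seed(42)   # K-means starts from random centers, so fix the seed
RCDF <- USArrests[1:10, ]
scaled.RCDF <- scale(RCDF)

# Non-hierarchical clustering: K-means with three clusters
km <- kmeans(scaled.RCDF, centers = 3, nstart = 25)

# Hierarchical clustering with average linkage, cut into three clusters
hc_groups <- cutree(hclust(dist(scaled.RCDF), method = "average"), k = 3)

# Cross-tabulate the two assignments; cluster numbers are arbitrary labels
print(table(hierarchical = hc_groups, kmeans = km$cluster))
```

Unlike hclust, kmeans needs the number of clusters up front and does not produce a dendrogram.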

