Clustering X


Cluster Analysis

 Factor analysis reduces the number of variables and is generally known as dimension reduction. On
the other hand, cluster analysis reduces the number of records or cases and is commonly known as
segmentation.
 Clustering creates groups of similar cases; each such group is called a cluster.
 Factor analysis is based on the concept of correlation between the variables. In cluster analysis,
the cases within a cluster should be highly similar to each other and quite dissimilar to the cases
in the other clusters.
 How is similarity calculated, and which algorithms do people use?
 To calculate the similarity matrix, different kinds of distance measures can be used in R. The
most popular one is 1) Euclidean distance.
The distance between a and b is sqrt((a1-b1)^2 + (a2-b2)^2 + … + (an-bn)^2), where n is the
number of variables in the study.
 Chebyshev distance is also used. In this case we take the absolute differences |a1-b1|, |a2-b2|,
…, |an-bn|, and the largest of them is the distance.
 Manhattan distance: this method calculates the same absolute differences and then adds them up.
 In the R example we will use the Euclidean method to calculate the distance.
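The three distance measures above can be compared on a small example. This is a minimal sketch using base R's `dist()` function; the two cases `a` and `b` and their values are made up for illustration and do not come from the notes' dataset.

```r
# Two hypothetical cases measured on three variables (illustrative values only)
a <- c(1, 2, 3)
b <- c(4, 6, 3)
x <- rbind(a, b)

# dist() computes the pairwise distances between the rows of x
euclidean <- as.numeric(dist(x, method = "euclidean"))  # sqrt(sum((a - b)^2))
chebyshev <- as.numeric(dist(x, method = "maximum"))    # max(abs(a - b))
manhattan <- as.numeric(dist(x, method = "manhattan"))  # sum(abs(a - b))

print(c(euclidean = euclidean, chebyshev = chebyshev, manhattan = manhattan))
```

For these values the differences are 3, 4 and 0, so the Euclidean distance is 5, the Chebyshev distance is 4 and the Manhattan distance is 7. Note that `dist()` calls the Chebyshev distance `"maximum"`.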

Process to do clustering

 Step 1: Identify the variables for clustering. In general, the more relevant variables are
included, the better the clustering will be.
 Step 2: Decide the clustering procedure. One option is hierarchical clustering, the other is
non-hierarchical clustering. These notes prefer hierarchical clustering.
 Step 3: Calculate the similarity (dissimilarity) matrix using Euclidean distance.
 Step 4: Select the clustering method.
 Step 5: Decide the number of clusters, which generally comes from the business context.
 Step 6: Create the cluster profile and check which case falls under which cluster.
 The objective with the dataset cust.csv is to cluster the customers based on the following
variables.
o Monthly average spending
o Number of visits to the departmental store
o Number of apparel purchases
o Number of high-value item purchases
o Number of staple value purchases
 Libraries used: NbClust, fpc and cluster.
 The structure of the file cust.csv shows that there are 10 observations of 7 variables.
 Before running the clustering, bring all the variables to the same scale.
 Scale only the variables that are input to the clustering, that is, the variables on which you
require clustering.
 To scale these variables we will use the built-in R function scale.

Min-max normalization: (X - Xmin)/(Xmax - Xmin), or standardization: (X - µ)/σ

 A new data frame, scaled.RCDF, is created to hold the scaled variables.
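A minimal sketch of both scaling options follows. The unscaled data frame name `RCDF`, its column names and its values are assumptions made for illustration; only the name `scaled.RCDF` appears in the notes. By default, base R's `scale()` performs standardization (subtract the column mean, divide by the column standard deviation), so min-max normalization is written by hand here as the hypothetical helper `minmax`.

```r
# Hypothetical stand-in for the customer data (names and values are assumptions)
RCDF <- data.frame(
  spend  = c(1200, 300, 950, 1100, 250, 700),
  visits = c(10, 3, 8, 9, 2, 6)
)

# Standardization: (X - mean) / sd, the default behaviour of scale()
scaled.RCDF <- as.data.frame(scale(RCDF))

# Min-max normalization: (X - Xmin) / (Xmax - Xmin), applied column by column
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
minmax.RCDF <- as.data.frame(lapply(RCDF, minmax))

print(round(colMeans(scaled.RCDF), 10))  # column means after standardization
print(sapply(scaled.RCDF, sd))           # column standard deviations
print(sapply(minmax.RCDF, range))        # column minimum and maximum
```

After standardization every column has mean 0 and standard deviation 1; after min-max normalization every column runs from 0 to 1. Either way, no single variable dominates the distance calculation just because it is measured in larger units.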


 Calculation of the similarity matrix.
 In the Euclidean distance matrix, the distance between cases 8 and 9 is the smallest
(0.7272685), so the clustering starts by merging 8 and 9 and proceeds from there.
 To create the clusters we will use the average linkage method.
 The procedure is hierarchical and the method is average. The function we will use is hclust,
where h stands for hierarchical.
 The dendrogram shows how the cases are combined step by step.
 The plot should show the names of the customers.
 Customers A, C and D form one cluster; customers H, I, F and J form the second cluster; customers
B, E and G form the third cluster.
 To find the characteristics of each cluster, aggregate the variables by cluster. This aggregation
shows the typical value of each variable in each cluster.
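The steps above (distance matrix, average-linkage clustering, dendrogram, cluster membership and cluster profile) can be sketched with base R functions only. The data frame `cust`, its column names and all of its values are invented for illustration; they are not the contents of cust.csv, so the resulting clusters will not necessarily match the ones reported in these notes.

```r
# Hypothetical data for 10 customers named A to J (NOT the real cust.csv)
cust <- data.frame(
  name    = LETTERS[1:10],
  spend   = c(9000, 2500, 8800, 9500, 2300, 5200, 2700, 5000, 5100, 4800),
  visits  = c(12, 4, 11, 13, 3, 7, 5, 8, 8, 6),
  apparel = c(6, 1, 5, 7, 1, 3, 2, 3, 3, 2)
)

# Scale the clustering variables (every column except the name)
scaled <- scale(cust[, -1])
rownames(scaled) <- cust$name  # these labels appear on the dendrogram

# Euclidean distance matrix between customers
d <- dist(scaled, method = "euclidean")

# Find the closest pair of customers: this pair is merged first
dm <- as.matrix(d)
diag(dm) <- Inf  # ignore the zero distance of each customer to itself
closest <- which(dm == min(dm), arr.ind = TRUE)[1, ]
print(rownames(dm)[closest])

# Hierarchical clustering with the average linkage method
fit <- hclust(d, method = "average")
plot(fit, main = "Dendrogram (average linkage)")

# Cut the tree into 3 clusters and list the customers in each
groups <- cutree(fit, k = 3)
print(split(cust$name, groups))

# Cluster profile: mean of each original variable within each cluster
profile <- aggregate(cust[, -1], by = list(cluster = groups), FUN = mean)
print(profile)
```

The number of clusters passed to `cutree()` (3 here) is the choice that, according to Step 5, usually comes from the business context. The profile is computed on the unscaled variables so that the cluster means can be read in their original units.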
 In our LMS cities.sav
 The different clustering methods available are average linkage, single linkage, complete linkage,
the centroid method and Ward's method. hclust is used for hierarchical clustering. The alternative
is non-hierarchical algorithms, and one of the most important of these is K-means clustering, for
which I followed the same procedure as with hclust for hierarchical clustering.
 In these notes, hierarchical clustering is preferred over non-hierarchical methods such as K-means.
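To see how the choice of method affects the result, the linkage methods listed above can be run on the same distance matrix and compared with K-means. This is a sketch on made-up data (the data frame `pts` and its values are assumptions); `"ward.D2"` is used as one of the two Ward variants that `hclust()` offers, and the seed is fixed only to make the K-means run repeatable.

```r
# Hypothetical data for 10 customers named A to J (illustrative values only)
pts <- data.frame(
  spend  = c(9000, 2500, 8800, 9500, 2300, 5200, 2700, 5000, 5100, 4800),
  visits = c(12, 4, 11, 13, 3, 7, 5, 8, 8, 6)
)
rownames(pts) <- LETTERS[1:10]

scaled <- scale(pts)
d <- dist(scaled)  # Euclidean distance is the default

# Hierarchical clustering with several linkage methods, each cut into 3 clusters.
# Note: the hclust documentation describes "centroid" as intended for squared
# Euclidean distances; plain distances are used here only to keep the sketch short.
methods <- c("average", "single", "complete", "centroid", "ward.D2")
hier <- sapply(methods, function(m) cutree(hclust(d, method = m), k = 3))
print(hier)  # one row per customer, one column per method

# Non-hierarchical alternative: K-means with the same number of clusters
set.seed(42)
km <- kmeans(scaled, centers = 3, nstart = 25)

# Cross-tabulate average-linkage clusters against K-means clusters
print(table(hierarchical = hier[, "average"], kmeans = km$cluster))
```

Cluster numbers are arbitrary labels, so two methods agree when each row of the cross-tabulation has a single non-zero cell, even if the numbers differ. Unlike `hclust()`, `kmeans()` needs the number of clusters up front and works on the scaled data directly rather than on a distance matrix.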
