Machine Learning and Business Analytics Surprise Quiz
Data Transformation
To do this, we transformed the transactional sales data into customer-wise sales data using basic Excel functions and a pivot table. For recency, we set the latest month in which a customer transacted to 0 and the earliest month to 12. For frequency, we counted the number of distinct invoices against the customer in the data range. For monetary value, we took the total amount of money spent by the customer in the data range. We have now successfully converted a sheet with 100,000 line-items of individual sales transactions into one with 1158 line-items of consolidated customer data.
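The aggregation above can also be sketched in code. The following Python snippet is an illustrative equivalent of the Excel pivot-table step, not the actual workbook logic; the field names and tuple layout are hypothetical.

```python
from collections import defaultdict

def rfm_table(transactions, latest_month):
    """Aggregate transaction rows into customer-wise RFM values.

    transactions: list of (customer_id, invoice_no, month, amount) tuples
    latest_month: the most recent month in the data range (recency 0)
    """
    last_seen = {}                 # customer -> most recent month of purchase
    invoices = defaultdict(set)    # customer -> distinct invoice numbers
    monetary = defaultdict(float)  # customer -> total spend

    for cust, invoice, month, amount in transactions:
        last_seen[cust] = max(last_seen.get(cust, month), month)
        invoices[cust].add(invoice)
        monetary[cust] += amount

    return {
        cust: {
            "recency": latest_month - last_seen[cust],  # 0 = bought in latest month
            "frequency": len(invoices[cust]),           # count of distinct invoices
            "monetary": round(monetary[cust], 2),       # total spend in the range
        }
        for cust in last_seen
    }
```

Each output row corresponds to one consolidated customer line-item as described above.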
Outlier Analysis
The standard method of outlier detection is to check whether values lie in the range (Q1 - 1.5*IQR) to (Q3 + 1.5*IQR) and remove those that do not. However, there is NO ONE WAY to remove outliers from a data set; it depends heavily on the business context. In this case, for the 1158 line-items of customer data, the quartile values were as follows.
Quartile            Frequency    Monetary Value (MV)
Q1                  1            329.63
Q3                  5            1840.60
IQR                 4            1510.97
Acceptance Range    -5 to 11     -1936.82 to 4107.05
However, for frequency, we found that 109 line-items (9.4%) of the data lay outside the acceptance range. Similarly, for monetary value, 114 line-items (9.8%) lay outside the acceptance range. We cannot treat all of these as outliers: removing nearly 10% of the data would discard genuine high-value customers, whereas outliers should, as a rule of thumb, constitute no more than about 1% of the data set. Hence, we cannot eliminate these outliers based on the IQR logic.
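The IQR fence check itself is simple to compute. Here is a minimal Python sketch using the standard library; note that quartile conventions differ slightly between Excel, R, and Python, so the exact cut-offs may vary from the table above.

```python
import statistics

def iqr_fences(values):
    """Return the acceptance range (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles (exclusive method)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def flag_outliers(values):
    """List the values falling outside the acceptance range."""
    lo, hi = iqr_fences(values)
    return [v for v in values if v < lo or v > hi]
```

On the customer data, applying flag_outliers to the frequency and monetary columns reproduces the 9.4% and 9.8% counts discussed above.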
In this case, we instead eliminate outliers by a graphical method, using the histogram (hist) function in R. We eliminate 2 line-items for abnormally high frequency and 2 line-items for abnormally high monetary value, as seen on the histograms.
From the business context, we also eliminate returns (negative sales), as they add nothing to a model of customer segmentation. We have eliminated 10 line-items from the data set for that reason.
We now save the data in a new, separate .csv file to import into R.
Clustering using R
Identifying optimal number of clusters
We perform cluster analysis in R to identify the different customer segments, using the kmeans function. Following is a snippet of the code used for the same.
We use the tot.withinss component of the kmeans output, the total within-cluster sum of squares, to record the error against the number of clusters, storing the values in an array called nocl. We then plot this curve and apply the elbow technique to identify the optimal number of clusters. The elbow curve is shown above as well. Based on the errors, we identify n = 4 as the optimal number of clusters for our customer data.
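The elbow computation can be sketched as follows. This is a from-scratch illustration in plain Python of what the R call to kmeans and its tot.withinss component produce; it is not the original R code.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=42):
    """Plain Lloyd's k-means on a list of feature tuples.
    Returns (labels, total within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # recompute each center as the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    wss = sum(dist2(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wss

# elbow curve: record the error for k = 1..6 (analogue of tot.withinss in R)
# points = [...RFM tuples...]
# nocl = [kmeans(points, k)[1] for k in range(1, 7)]
```

Plotting nocl against k and looking for the bend in the curve is the elbow technique described above.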
We create an array of the cluster assignments named clus and bind it to the customer data set using the cbind function. We name the result "newdata", shown below.
We have now assigned a cluster number to each line-item of the customer data. To check the size of each cluster we use km$size; the sizes of the 4 clusters are given below.
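In the same illustrative Python used above (not the report's R code), the label-binding and size check look like this; the row structure is hypothetical.

```python
from collections import Counter

def bind_clusters(customers, labels):
    """Attach a cluster label to each customer row (analogue of cbind in R)."""
    return [dict(row, cluster=lab) for row, lab in zip(customers, labels)]

def cluster_sizes(labels):
    """Number of customers per cluster (analogue of km$size in R)."""
    return dict(Counter(labels))
```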
Note that a high recency value indicates a very old customer who has not visited in the recent past.
To generate the summary of individual clusters, we run the following R code on the "newdata" file.
Cluster 1:
Recency: Cluster mean is very low (0.083) compared to overall mean (3.11) => Very Recent Customer
Frequency: Cluster mean is very high (48.83) compared to overall mean (4.96) => Very High Volume Ordering
Monetary: Cluster mean is very high (22617) compared to overall mean (1784) => High Monetary Value
Cluster 2: (Cluster Size = 756)
Recency: Cluster mean is low (1.642) compared to overall mean (3.11) => Recent Customer
Frequency: Cluster mean is slightly low (3.734) compared to overall mean (4.96) => Avg. Volume Ordering
Monetary: Cluster mean is low (1221.1) compared to overall mean (1784) => Low Monetary Value
Cluster 3:
Recency: Cluster mean is very high (8.244) compared to overall mean (3.11) => Very Old Customer
Frequency: Cluster mean is low (1.638) compared to overall mean (4.96) => Low Volume Ordering
Monetary: Cluster mean is very low (466.7) compared to overall mean (1784) => Low Monetary Value
Cluster 4:
Recency: Cluster mean is very low (0.781) compared to overall mean (3.11) => Recent Customer
Frequency: Cluster mean is high (17.37) compared to overall mean (4.96) => High Volume Ordering
Monetary: Cluster mean is high (6862) compared to overall mean (1784) => High Monetary Value
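The per-cluster summaries above amount to comparing each cluster's mean against the overall mean for every RFM field. A sketch of that computation, continuing the illustrative Python (the R report would use aggregate or summary on "newdata"):

```python
from collections import defaultdict

def cluster_means(rows, field):
    """Mean of one RFM field per cluster, plus the overall mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    total, n = 0.0, 0
    for row in rows:
        sums[row["cluster"]] += row[field]
        counts[row["cluster"]] += 1
        total += row[field]
        n += 1
    per_cluster = {c: sums[c] / counts[c] for c in sums}
    return per_cluster, total / n
```

Reading a cluster as "very low" or "very high" then simply means its mean sits far below or far above the overall mean, as in the interpretations above.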