Data Mining Assignment: Sudhanva Saralaya


Data Mining

Assignment

Sudhanva Saralaya

Contents

Problem 1: Clustering
  Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
  Do you think scaling is necessary for clustering in this case? Justify
  Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
  Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
  Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
Problem 2
  Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
  Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural Network
  Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.

Problem 1: Clustering

A leading bank wants to develop a customer segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of users during the past few months. You are
given the task to identify the segments based on credit card usage.

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.
Dataset for Problem 1: bank_marketing_part1_Data.csv
Data Dictionary for Market Segmentation:
1. spending: Amount spent by the customer per month (in 1000s)
2. advance_payments: Amount paid by the customer in advance by cash (in 100s)
3. probability_of_full_payment: Probability of payment done in full by the customer to the
bank
4. current_balance: Balance amount left in the account to make purchases (in 1000s)
5. credit_limit: Limit of the amount in the credit card (in 10000s)
6. min_payment_amt: Minimum amount paid by the customer while making payments for
purchases made monthly (in 100s)
7. max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).

Below are the results of the EDA done on the data. Here we see details such as the value count, mean, standard deviation, minimum, maximum, and the various percentiles for each column.

Here we see each column and its data type. Along with the data type, we also see whether it contains any null values. There are no null values present in this data, and all columns are of float type.
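These checks can be sketched in pandas as below. This is a minimal sketch: the column values are illustrative stand-ins, since the real frame would come from reading bank_marketing_part1_Data.csv.

```python
import pandas as pd

# Stand-in rows with illustrative values; in the assignment the frame
# would come from pd.read_csv("bank_marketing_part1_Data.csv").
df = pd.DataFrame({
    "spending": [19.94, 15.99, 18.95, 10.83, 17.99],
    "advance_payments": [16.92, 14.89, 16.42, 12.96, 15.86],
    "credit_limit": [3.57, 3.34, 3.68, 2.74, 3.52],
})

print(df.describe())      # count, mean, std, min, quartiles, max per column
df.info()                 # data type plus non-null count per column
print(df.isnull().sum())  # explicit null count per column
```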

Here we see how each of the columns interacts with the others via the pairwise plot.
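A pairwise plot of this kind can be produced, for example, with pandas' built-in scatter_matrix (the report's figure may have been made with seaborn's pairplot instead; the three columns and their values here are illustrative stand-ins):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import pandas as pd
from pandas.plotting import scatter_matrix

# Illustrative stand-in for a few of the bank data's columns.
df = pd.DataFrame({
    "spending": [19.94, 15.99, 18.95, 10.83, 17.99, 12.70],
    "advance_payments": [16.92, 14.89, 16.42, 12.96, 15.86, 13.41],
    "credit_limit": [3.57, 3.34, 3.68, 2.74, 3.52, 2.98],
})

# One scatter panel per pair of columns, histograms on the diagonal.
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
print(axes.shape)  # a 3 x 3 grid of axes for 3 columns
```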

Do you think scaling is necessary for clustering in this case? Justify

Yes. Clustering algorithms such as K-Means need feature scaling before the data is fed to the algorithm. Since clustering techniques use Euclidean distance to form the cohorts, it is wise to scale variables measured on different scales (for example heights in metres and weights in kilograms) before calculating the distance. From the data dictionary we see that each column has its own scale: some are in 1000s and some are in 100s. When we standardize the data prior to performing cluster analysis, the clusters change. With more equal scales, the features contribute more evenly to defining the clusters; standardization prevents variables with larger scales from dominating how clusters are defined.
Data before scaling (first 5 rows)

Data after scaling (first 5 rows)
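The scaling step can be sketched with scikit-learn's StandardScaler. The two columns below are illustrative stand-ins on deliberately different scales; the real input would be the full seven-column frame.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows on two very different scales (spending in 1000s,
# credit_limit in 10000s), standing in for the assignment's data.
X = np.array([[19.94, 3.57],
              [15.99, 3.34],
              [18.95, 3.68],
              [10.83, 2.74]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling every column has mean ~0 and unit variance, so no single
# variable dominates the Euclidean distances used by clustering.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```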

Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them

We apply hierarchical clustering to the scaled data using the dendrogram function from the scipy.cluster.hierarchy package, with Ward's linkage method. Once done, we get the results below. Based on the dendrogram, the optimal number of clusters could be anywhere from 3 to 5; beyond that the clusters are too tightly packed.

To check this, we look at the last 5 clusters in the dendrogram. We see that one cluster holds very few observations and another has just 10, whereas three clusters have over 50 observations each. So we can safely assume that 3 clusters is the optimum number.
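This step can be sketched as follows, using synthetic stand-in data (the assignment's CSV is not reproduced here, so make_blobs provides 210 rows with three underlying groups):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled bank data: 210 rows, 3 underlying groups.
X, _ = make_blobs(n_samples=210, centers=3, n_features=7, random_state=1)
X = StandardScaler().fit_transform(X)

Z = linkage(X, method="ward")  # Ward's linkage, as used for the dendrogram

# scipy.cluster.hierarchy.dendrogram(Z, truncate_mode="lastp", p=10)
# would draw the truncated dendrogram shown in the report's figure.

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree at 3 clusters
print(np.bincount(labels)[1:])  # size of each of the 3 clusters
```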

Now, cutting the tree at the last 3 clusters, we get the below.

Here we see that the data has been split into clusters of considerable and comparable size: three clusters of 75, 70, and 65 observations each.

Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.

We apply K-Means clustering using the sklearn.cluster package. Once applied, we can see how each cluster behaves and how many observations it contains.

The within-cluster sum of squares (WSS) for k = 1 to 4 clusters is:
1469.9999999999995
659.1717544870411
430.65897315130064
371.38509060801107

Here we see a significant drop going from 1 to 2 to 3 clusters, but after the 3rd cluster there is no such drop. So we are safe to assume that 3 clusters is optimum.

Now we can see the elbow curve in a plot. The plot shows the curve clearly. For this we use the WSS (within sum of squares) method.

The WSS for k = 1 to 10 is as below:

[1469.9999999999995, 659.1717544870411, 430.65897315130064, 371.38509060801107,
327.2127816566134, 289.315995389595, 262.98186570162267, 241.8189465608603,
223.91254221002728, 206.3961218478669]
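The loop that produces such a WSS list can be sketched as below. Synthetic stand-in data is used, so the numbers will differ from those above, but the shape of the curve is the same.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled data: 210 rows, 3 underlying groups.
X, _ = make_blobs(n_samples=210, centers=3, n_features=7, random_state=1)
X = StandardScaler().fit_transform(X)

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k

print([round(w, 1) for w in wss])
# import matplotlib.pyplot as plt
# plt.plot(range(1, 11), wss, marker="o")
# For three underlying groups the elbow should appear near k = 3.
```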

Now, plotting the same in a point plot, we get the below.

As we saw with the WSS values, there is a significant drop until the 3rd cluster, after which the drops are very small. So we can conclude that 3 is the optimum number of clusters.
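The silhouette check asked for in the question can be sketched the same way, again on synthetic stand-in data with three underlying groups:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=210, centers=3, n_features=7, random_state=1)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette, in [-1, 1]
    print(k, round(scores[k], 3))
# The k with the highest silhouette score is preferred; for three
# underlying groups the peak falls at k = 3, agreeing with the elbow.
```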

Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.

Now that we have finalized the clusters, we can put them in a frequency table and see how the clusters are divided.

We can see there are 3 clusters, so we can make inferences for each as below.
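The frequency table and cluster profiles can be built with a groupby, sketched here on a few hypothetical labelled rows (the column names follow the data dictionary; the values and labels are illustrative):

```python
import pandas as pd

# Hypothetical cluster labels attached back to a few illustrative
# (unscaled) rows; the real frame would hold all 7 columns.
df = pd.DataFrame({
    "spending": [19.9, 11.2, 14.9, 18.7, 10.8, 15.3],
    "credit_limit": [3.8, 2.7, 3.3, 3.7, 2.6, 3.2],
    "cluster": [2, 0, 1, 2, 0, 1],
})

sizes = df["cluster"].value_counts().sort_index()  # frequency table
profile = df.groupby("cluster").mean()             # per-cluster averages
print(sizes)
print(profile)
```

Reading the per-cluster means column by column is what yields the low-spender / high-spender profiles described below.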

Cluster 0 – Low spenders. These customers spend little on their credit cards; accordingly they have a low credit limit and low maximum spending in a single purchase. They do not use credit cards very often, so the bank can direct some marketing towards these customers to encourage them to use their cards more.
Cluster 1 – The lowest spenders. These customers spend the least on their credit cards; they have the lowest credit limits and the lowest spending in a single purchase. They belong to a category that does not use credit cards very often, so, as with Cluster 0, targeted marketing might get them to start using their cards more.
Cluster 2 – The high spenders. This is a group of people who use credit cards heavily and do most of their spending with them. These customers usually make advance payments and have a high probability of making full payments. They maintain a high balance as well as a high credit limit, have a somewhat high minimum payment amount, and do a very high amount of shopping on credit. The bank can market to these customers with shopping offers and related rewards that motivate them to use the card for other purposes as well. This category likely contains VIP or very affluent customers, so basic offers will not work on them; the bank may need to present high-value offers worth their attention.

PROBLEM 2
An insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task of making a model which
predicts the claim status and providing recommendations to management. Use CART, RF & ANN and
compare the models' performances on the train and test sets.

2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).

2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest,
Artificial Neural Network

2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification
reports for each model.

2.4 Final Model: Compare all the models and write an inference which model is best/optimized.

2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations

Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).

Below are the results of the EDA done on the data. Here we see details such as the value count, mean, standard deviation, minimum, maximum, and the various percentiles for each column.

Here we see each column and its data type. Along with the data type, we also see whether it contains any null values. There are no null values present in this data, and all columns are of float type.

Data Split: Split the data into test and train, build classification
model CART, Random Forest, Artificial Neural Network.
We make 2 data frames where we drop Claimed column and one where we have
only claimed column.
After we split the data as such we split the data into train and test for test size 30
and random size 1
Cart Random forest and ANN in the Attached jupyter.
Not sure how to describe in the report. Please advice on how to do this.
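This split-and-fit workflow can be sketched as below on synthetic stand-in data; in the assignment, X would be the frame with the Claimed column dropped and y the Claimed column itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the insurance data (predictors X, target y).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# 70/30 split with random_state=1, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {
    "CART": DecisionTreeClassifier(random_state=1),
    "RF": RandomForestClassifier(random_state=1),
    "ANN": MLPClassifier(max_iter=2000, random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))  # test accuracy
```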

Performance Metrics: Comment and Check the performance of
Predictions on Train and Test sets using Accuracy, Confusion
Matrix, Plot ROC curve and get ROC_AUC score, classification
reports for each model.

ROC AUC for test data

ROC AUC for train data
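These metrics can be computed as sketched below, shown for a single model on synthetic stand-in data; the same calls are repeated for each model and for the train set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the insurance data, split 70/30 as in the report.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # class-1 scores for ROC AUC

acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)        # rows: actual, columns: predicted
auc = roc_auc_score(y_test, proba)
print(acc, auc)
print(cm)
print(classification_report(y_test, pred))  # precision/recall/F1 per class
# sklearn.metrics.RocCurveDisplay.from_estimator(model, X_test, y_test)
# would draw the ROC curve shown in the report's figures.
```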

Confusion Matrix

