Project Solution
Introduction
This assignment is designed for you to practice several activities in the six phases of CRISP-DM using a
semi-real-life data set. The data set customer_churn-data3.csv that we are going to use in this
assignment covers a well-known marketing topic: customer churn (loss of customers). Since customer churn in
this data set is a categorical response variable (churn versus loyal), we have a classification problem on
our hands. You will be asked to build models to predict customer churn and verify their robustness.
You will also be asked to create screenshots of your Rapidminer Process View and the result. If you are
not familiar with how to capture screenshots, please Google it and teach yourself. There are plenty of
tutorials that show how it is done in your favorite operating system.
You will document your analysis with screenshots in Assignment 1.docx and submit it along with the
exported process file. Please carefully follow the steps below.
All analytical steps below must be completed in Rapidminer. Do not use Excel or other programs unless it
is specifically noted.
Note: I would like to keep the amount of your work reasonable. Therefore, we will not be engaged in all
the activities in each of the CRISP-DM phases, but we will cover some essential ones.
Note: Please clearly show the CRISP-DM phases as your section headings in the Word file.
We are trying to create a model for customer churn using the decision tree algorithm. We first split the data
into training and testing parts. We build the model on 80% of the data and test it on the remaining
20%. Once the model is built, we can predict whether a customer will churn or stay loyal based on the other
attributes.
Phase 2: Data understanding
Data are already provided to you, but you need to describe the data and verify data quality. We will focus
on just the data quality here for this assignment.
Drag the dataset to the Process View in Rapidminer and look at both the Data and Statistics tabs in the
result.
Write a paragraph or two with a screenshot of the Statistics tab in the result to report what you see
regarding the following issues. Highlight these issues in your Statistics screenshot.
1. Missing values. Report the number of missing values and your reason why there is or isn’t a
possibility for MNAR.
There are 3 missing values in Gender. Since this number is small, the values were most likely just not entered,
so they are missing at random. There are 5 missing values in Payment Method as well. Again,
compared to the number of observations, this is small, and the values may be missing at random, for example
because they were not typed into the data set. So these are not MNAR. Likewise, the 4 missing values in
LastTransaction are not MNAR.
There are 87 missing values in Churn. That many cannot be missing by chance. There has to be a reason, for
example that those customers have not yet been classified as churn or loyal. Because there is a systematic
reason for the missing values, these are MNAR (missing not at random).
2. Strange values. Go through each column and identify those values that are apparently wrong for
the column. A strange value could be something outrageous (e.g., Age > 100), something that is
not written in the same format as the other values in the same column, etc. Explain why they are
wrong.
There are strange values in Age: -400 and 721, both outside any plausible age range. The values -222 and
-183 in LastTransaction are strange as well, since a transaction amount should be a positive number.
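The two checks above (counting missing values and flagging out-of-range values) can also be sketched outside Rapidminer, purely as an illustration of the logic. The sketch below is a minimal Python example under stated assumptions: the three sample rows are hypothetical stand-ins, not the real contents of customer_churn-data3.csv, the column spellings are taken from this document, and the helper names `count_missing` and `strange_values` are invented for the example. The assignment itself must still be completed in Rapidminer.

```python
# Hypothetical sample rows (NOT the real data set); None marks a missing value.
rows = [
    {"Gender": "male", "Age": 34, "Payment Method": "cash",
     "LastTransaction": 120, "Churn": "loyal"},
    {"Gender": None, "Age": -400, "Payment Method": "credit card",
     "LastTransaction": -222, "Churn": "churn"},
    {"Gender": "female", "Age": 721, "Payment Method": None,
     "LastTransaction": None, "Churn": None},
]


def count_missing(rows):
    """Return {column: number of missing values} over all rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def strange_values(rows):
    """Collect values that break the plausibility rules used in this assignment."""
    found = {"Age": [], "LastTransaction": []}
    for row in rows:
        age = row["Age"]
        # Assumed rule: a valid age lies between 0 and 100.
        if age is not None and not 0 <= age <= 100:
            found["Age"].append(age)
        amount = row["LastTransaction"]
        # Assumed rule: a valid transaction value is strictly positive.
        if amount is not None and amount <= 0:
            found["LastTransaction"].append(amount)
    return found


missing = count_missing(rows)
strange = strange_values(rows)
print(missing)
print(strange)
```

The counts produced this way play the same role as the Missing, Min, and Max columns of the Statistics tab.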
Phase 3: Data preparation
1. LastTransaction has to be a positive value.
2. Perform mean substitution on LastTransaction after the above issue is fixed.
3. Perform mode substitution on Payment Method and Gender.
4. Age should be between 0 and 100.
5. Perform listwise removal after the above issues are fixed.
6. Show a screenshot of the resulting statistics for the data. Discuss if the issues in phase 2 have
been resolved.
As you can see, we no longer have any missing or strange values. The Min and Max values
are in a reasonable range. The variable types are also correct (they were correct initially as well).
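To make the order of the preparation steps concrete, here is a minimal Python sketch of the same sequence: invalid LastTransaction to missing, mean substitution, mode substitution, Age range check, then listwise removal. It is only an illustration under assumptions. The five sample rows are hypothetical, the function name `clean` is invented, and treating an out-of-range Age as missing (so that listwise removal drops the row) is one possible reading of step 4, not necessarily how the Rapidminer process was built.

```python
from statistics import mean, mode

# Hypothetical sample rows (NOT the real data set); None marks a missing value.
rows = [
    {"Gender": "male", "Age": 30, "Payment Method": "cash",
     "LastTransaction": 100, "Churn": "loyal"},
    {"Gender": "male", "Age": 45, "Payment Method": "cash",
     "LastTransaction": -222, "Churn": "churn"},
    {"Gender": None, "Age": 50, "Payment Method": None,
     "LastTransaction": 200, "Churn": "loyal"},
    {"Gender": "female", "Age": 721, "Payment Method": "cheque",
     "LastTransaction": 150, "Churn": "churn"},
    {"Gender": "male", "Age": 28, "Payment Method": "cash",
     "LastTransaction": 150, "Churn": None},
]


def clean(rows):
    """Apply the preparation steps in order and return the cleaned rows."""
    rows = [dict(row) for row in rows]  # work on copies, keep the input intact

    # Step 1: a non-positive LastTransaction is invalid, so mark it as missing.
    for row in rows:
        value = row["LastTransaction"]
        if value is not None and value <= 0:
            row["LastTransaction"] = None

    # Step 2: mean substitution on LastTransaction (mean of the valid values).
    fill = mean(r["LastTransaction"] for r in rows if r["LastTransaction"] is not None)
    for row in rows:
        if row["LastTransaction"] is None:
            row["LastTransaction"] = fill

    # Step 3: mode substitution on Payment Method and Gender.
    for column in ("Payment Method", "Gender"):
        most_common = mode([r[column] for r in rows if r[column] is not None])
        for row in rows:
            if row[column] is None:
                row[column] = most_common

    # Step 4 (assumed reading): an Age outside 0..100 is marked as missing.
    for row in rows:
        if row["Age"] is not None and not 0 <= row["Age"] <= 100:
            row["Age"] = None

    # Step 5: listwise removal drops every row that still has a missing value.
    return [row for row in rows if None not in row.values()]


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

The order matters: if the mean were computed before the negative values were marked as missing, the invalid values would distort the substituted mean.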
Phase 4: Modeling
1. Set ‘Churn’ to be your response variable. Split the data into 80/20.
2. Use the 80% as the training data set to build your Decision Tree model. Use the following
settings for the Decision Tree model. All these values are the default values, but some
installations of Rapidminer may not be using the same parameters. Let’s standardize these to the
following:
3. Show the screenshots of resulting tree model, performance (i.e., confusion matrix) and
prediction results of the test data set.
4. Also answer the following questions:
a. Q1: Is the accuracy of your model acceptable? Let’s consider Accuracy >= 80% to be
acceptable for this assignment. In real life, the threshold of acceptance is determined by
the specific application, context and company. Show screenshot of confusion matrix and
describe what all numbers mean.
There were 901 observations after cleaning. We split them 80%/20%: we trained the model
on 721 observations and held out 180 observations for testing. That is why the confusion matrix shows a total
of 180 results (104 + 15 + 12 + 49 = 180).
104 were predicted loyal and were actually loyal. 12 were predicted churn but were actually
loyal. 15 were predicted loyal but were actually churn. 49 were predicted churn and were
actually churn.
Accuracy is the share of correct predictions: (104 + 49) / 180 = 153 / 180 = 85.00%. Since this is at
least 80%, the accuracy is acceptable for this assignment.
Q2: Explain class recall and class precision shown in the confusion matrix. I don’t need you to explain
what ‘class recall’ and ‘class precision’ mean. Instead, I want you to interpret numbers. Also show how
the numbers are calculated.
Class recall: Out of the customers who were actually loyal, 104/116 = 89.66% were predicted correctly. Out of
the customers who actually churned, 49/64 = 76.56% were predicted correctly.
Class precision: Out of the customers predicted loyal, 104/119 = 87.39% were actually loyal. Out of those
predicted churn, 49/61 = 80.33% were actually churn.
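The same arithmetic can be checked with a short script. This is a minimal sketch that uses only the four counts reported for the confusion matrix (104, 12, 15, 49). The dictionary layout and the function names are choices made for this illustration and are not part of Rapidminer.

```python
# Confusion matrix counts from the test set, keyed as (predicted, actual).
matrix = {
    ("loyal", "loyal"): 104,
    ("churn", "loyal"): 12,
    ("loyal", "churn"): 15,
    ("churn", "churn"): 49,
}
labels = ("loyal", "churn")


def accuracy(matrix):
    """Correct predictions divided by all predictions."""
    correct = sum(matrix[(label, label)] for label in labels)
    return correct / sum(matrix.values())


def class_recall(matrix, label):
    """Of all customers that actually belong to `label`, the share predicted as `label`."""
    actual_total = sum(matrix[(predicted, label)] for predicted in labels)
    return matrix[(label, label)] / actual_total


def class_precision(matrix, label):
    """Of all customers predicted as `label`, the share that actually belong to it."""
    predicted_total = sum(matrix[(label, actual)] for actual in labels)
    return matrix[(label, label)] / predicted_total


print(f"accuracy: {accuracy(matrix):.2%}")
for label in labels:
    print(f"{label}: recall {class_recall(matrix, label):.2%}, "
          f"precision {class_precision(matrix, label):.2%}")
```

Recall divides by a column total of actual classes, while precision divides by a row total of predicted classes, which is why the two numbers differ for the same class.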
b. Q3: tweak the minimum gain parameter a bit. Report your findings. Does it improve or
degrade model prediction accuracy? After you are done reporting, reset minimum gain
to 0.1. (Minimum gain refers to the minimum reduction in impurity before a node split
will occur.)
Increasing the minimum gain simplifies the tree by producing fewer branches, which may decrease
accuracy. Decreasing the minimum gain makes the tree too branchy and detailed, because it captures
even the smallest details in the training data. Such an overly detailed model may not perform as well on
the test data: accuracy on the training data may increase, but accuracy on the test data may drop
(overfitting). Thus, a suitable minimum gain has to be found by trial and error.
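The effect of the minimum gain threshold can be illustrated with a small calculation. The sketch below is an assumption-laden illustration, not a description of Rapidminer's internals: it uses plain entropy-based information gain (Rapidminer offers several split criteria, and the exact comparison it performs is not shown in this document), the class counts are hypothetical, and the function names are invented for the example.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a node, given its class counts, e.g. [loyal, churn]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Reduction in entropy achieved by splitting `parent` into `children`."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


def should_split(parent, children, minimum_gain=0.1):
    """Split only when the gain reaches the threshold (assumed comparison: >=)."""
    return information_gain(parent, children) >= minimum_gain


# Hypothetical node with 10 loyal and 10 churn customers.
parent = [10, 10]
strong_split = [[8, 2], [2, 8]]  # separates the classes fairly well
weak_split = [[6, 4], [4, 6]]    # barely separates the classes

print(round(information_gain(parent, strong_split), 3), should_split(parent, strong_split))
print(round(information_gain(parent, weak_split), 3), should_split(parent, weak_split))
```

With a threshold of 0.1 only the strong split is kept. Lowering the threshold far enough would also admit the weak split, which is how a small minimum gain leads to a larger and more detailed tree.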
Phase 5: Evaluation
Answer the following questions based on the results before Q2:
c. Q4: What are the top two predictors? How do you know?
The top two predictors are Gender and Age, because they sit at the top of the tree: the attributes chosen
for the first splits are the ones that separate churn from loyal best.
d. Q5: What strategy do you recommend to the top-management? Your strategy should be
based on the result of your model before Q2.
We can analyze the tree and give recommendations based on the model. For example, most males are
already loyal, but we cannot say the same about females: only some of the females are loyal.
Females older than 89.5 are loyal. We could go deeper and walk through the rest of the tree. The point is
that, based on the model, management should offer retention incentives such as coupons or specials,
targeted especially at females, since males are already loyal.
Phase 6: Deployment
A model is not going to be useful if you don’t use it, right? Since Decision Tree is one of the Predictive
Analytics techniques, we can use it to predict the churn intention of some new customers. There are
usually two ways to make predictions: simple prediction and batch prediction.
I have highlighted the path that leads to the conclusion LOYAL. We simply walk through the tree based on
the customer's attribute values. Even the final conclusion is not 100% certain: as you can see, there are
some churners in that group as well. But since the majority of that group are loyal, we can conclude that
she is loyal.
a. Note 1: you do NOT need to create the response variable ‘Churn’ in this data set. That is
the variable we want our model to predict.
b. Note 2: Don’t forget to double-check the spelling of your columns. Column names
should exactly match (case, spelling, etc.) those of the columns in the original training
data set.
c. Note 3: The column names have to be exactly the same as the columns in the data
passed to the Decision Tree operator.
3. Sample design using the Golf example. Your assignment is not exactly the same as this example,
but the screenshot on the next page conveys a similar idea, especially about how to use the
second Apply Model to predict new data of interest.
Always remember that the purpose of Apply Model is to apply the model to a dataset to generate predictions.
a. The first Apply Model is used to validate the model on held-out data, making sure the model is still
robust when predicting new data. The new data in this case is the test dataset split from
the original data.
b. The second Apply Model is used to predict the new data of interest to you.
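Notes 2 and 3 above require the column names of the new data to match the training columns exactly. A minimal sketch of such a check is shown below. It is an illustration only: the column spellings are assumed from this document, the function name `column_mismatches` is invented, and Rapidminer performs its own matching when Apply Model runs.

```python
def column_mismatches(training_columns, new_columns, label="Churn"):
    """Compare new-data columns with the training predictors (exact, case-sensitive).

    Returns (missing, unexpected): predictors absent from the new data, and
    new-data columns the model was not trained on.
    """
    expected = [c for c in training_columns if c != label]  # label is predicted, not supplied
    missing = [c for c in expected if c not in new_columns]
    unexpected = [c for c in new_columns if c not in expected]
    return missing, unexpected


# Assumed column spellings; "gender" is deliberately misspelled (wrong case).
training_columns = ["Gender", "Age", "Payment Method", "LastTransaction", "Churn"]
new_columns = ["gender", "Age", "Payment Method", "LastTransaction"]

missing, unexpected = column_mismatches(training_columns, new_columns)
print("missing:", missing)
print("unexpected:", unexpected)
```

Both lists should be empty before the new records are passed to the second Apply Model.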
Q7: Show the screenshot of your final Process View AND the predicted result of the above five
records.
4. This technique of performing prediction can be used for any supervised learning models,
including ensemble methods (section 4.7 in the book), regression, logistic regression, and other
classification & regression models. Please keep a note of it. It will be very useful next time when
you wish to perform predictions using a verified model.