2 Data Mining Process


2.1 CRISP-DM
Problem solving is a process. Where do you start? Is there a recommended sequence of steps
you should follow from start to finish? Most problem-solving processes start with a general
understanding of the situation and a determination of goals. What does success look like? We may
then break the problem down into sub-problems, because the main problem is frequently too
large and complex to attack head-on, while the sub-problems are smaller and more manageable. For
each sub-problem we first explore, then may construct models and evaluate them, and
eventually act on the results and assess the outcome before final deployment. Data
analytics is no different. One of the most popular frameworks for organizing the problem-solving
process is known as CRISP-DM (CRoss-Industry Standard Process for Data Mining). This
methodology has been adopted by IBM and is used within its SPSS Modeler software. You can
download documentation from
ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf

CRISP-DM breaks down the steps into six phases:


• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment

Figure 2-1 CRISP-DM Framework for Data Mining

You will note that there is some back-and-forth among phases. What you discover in one phase
may cause you to go back and re-examine a previous phase. Although it looks like we start with
business understanding and end with deployment, the outer circle suggests that we
continuously repeat all the steps. Data analytics should be seen as an integral part of all business
processes, so we are never finished doing data analytics. There will always be new decisions to
be made.

2.2 Business Understanding


• What are the business objectives?
• Which business units will be affected?
• Are there conflicting/competing goals?
• Does the firm use/understand data mining?
• Can we measure success? Is the goal general, such as "construct a profile of customers likely to
leave (churn)", or specific, such as "reduce the churn rate by 10%"?
• Translate the business goal into a data mining goal. For example, "reduce customer churn"
may become "build a model to predict when a customer will leave based upon known
characteristics of the customer's past service usage".
• Develop a project plan that identifies the steps in the project, assumptions, resource needs,
constraints and risks.
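
A specific goal such as "reduce the churn rate by 10%" only becomes measurable once we pin down how the rate is computed and what the 10% refers to. The sketch below shows one possible reading. It assumes that churn rate means customers lost during a period divided by customers at the start of the period, and that the 10% is a relative reduction. The function names and all figures are hypothetical.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the starting customers who left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def goal_met(baseline_rate: float, current_rate: float,
             relative_reduction: float = 0.10) -> bool:
    """True if current_rate is at or below the target implied by a relative cut."""
    target_rate = baseline_rate * (1 - relative_reduction)
    return current_rate <= target_rate


# Made-up figures for illustration only.
baseline = churn_rate(customers_at_start=2000, customers_lost=100)  # 5.0% churn
current = churn_rate(customers_at_start=2000, customers_lost=88)    # 4.4% churn
print(goal_met(baseline, current))
```

If the 10% were instead meant as percentage points, the target would be computed by subtraction rather than multiplication, which is why the definition should be agreed on during this phase.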

Lewis Carroll's Alice's Adventures in Wonderland contains one of the most valuable
passages on goal-setting. Alice has fallen down the rabbit hole and entered
Wonderland. She meets the Cheshire Cat and asks for directions.

Alice: Would you tell me, please, which way I ought to go from here?
Cat: That depends a good deal on where you want to get to.
Alice: I don’t much care where.
Cat: Then it doesn’t matter which way you go.
Alice: ... so long as I get somewhere.
Cat: Oh, you are sure to do that.

We must have a clear goal and a method for measuring progress towards attaining this goal.

Target: "If I know which customers are pregnant and when their baby is due, I can send them
appropriate promotions to better engage them as customers. Can you determine which
customers are pregnant and when the baby is due?"1

2.3 Data Understanding
• Identify the data that will be needed for the project.
o What variables should we measure?
o How much history do we need? Is data from 2 years ago too old to be useful?
o How does this data link back to our business problem?
• Acquire the data (extract from databases, do a survey, capture online activity,…).
• Look at the overall data quality:
o Are there missing values or attributes?
o Are values reasonable?
o Talk to those who know the data, how it is captured, what it really means, …
• Perform high level descriptive analysis of the data. Do the results look like what we
expected?
• Explore the data in more depth. Form suppositions for further analysis.
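
The data quality questions above (missing values, reasonable values, a first descriptive summary) can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The records, field names and plausible ranges are hypothetical, and in practice the ranges should come from the people who know the data.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "monthly_spend": 120.0},
    {"customer_id": 2, "age": None, "monthly_spend": 80.5},
    {"customer_id": 3, "age": 212, "monthly_spend": 95.0},   # implausible age
    {"customer_id": 4, "age": 45, "monthly_spend": None},
]

# Assumed plausible ranges (inclusive) for each field we want to check.
plausible_range = {"age": (0, 120), "monthly_spend": (0.0, 10_000.0)}


def profile_quality(rows, ranges):
    """Per field: count missing and out-of-range values, and average the rest."""
    report = {}
    for field, (low, high) in ranges.items():
        values = [row.get(field) for row in rows]
        present = [v for v in values if v is not None]
        in_range = [v for v in present if low <= v <= high]
        report[field] = {
            "missing": len(values) - len(present),
            "out_of_range": len(present) - len(in_range),
            "mean_in_range": mean(in_range) if in_range else None,
        }
    return report


quality = profile_quality(records, plausible_range)
for field, stats in quality.items():
    print(field, stats)
```

The report only flags suspicious values. Deciding what to do with them belongs to the Data Preparation phase.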

Target: "We have a wealth of data that we have captured about customers and we can buy more
data. But with respect to pregnancy, we have data about women that have signed up for our
baby shower registry. We know when the baby is due and we have detailed purchase records
(date of purchase, items bought, which Target store, cash or credit,….). We can construct a
purchasing profile before pregnancy and see how the profile changes through the pregnancy."2

2.4 Data Preparation


• Select the data that are relevant to the data mining project. Collect additional data if
necessary.
• Clean the data. How will you deal with
o missing values?
o outliers?
o unreasonable or inconsistent values?
• Will we be biasing our results and our decisions by “cleaning” the data? The dirty data may
hold important secrets.
• Construct data. Some variables may be functions of others (e.g., total cost of an order = sum
of the items in the order, percentage change since the previous month,…). We may wish to scale or
normalize values so that items measured on different scales can be compared.
• Filter data. Through the data exploration and cleaning process, you may wish to exclude
certain sub-populations.
• Keep a detailed record of all the changes that you have made – sampling, cleaning,
constructing, transforming, filtering,… You must be able to defend your analysis and be
prepared to explain your assumptions and actions.
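
The construction and record-keeping steps above can be illustrated with a short sketch. It assumes min-max scaling as the normalization method (one of several options) and keeps a plain list as the change record. The order data, sales figures and helper names are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def percent_change(previous, current):
    """Percentage change from previous to current; None when previous is zero."""
    if previous == 0:
        return None
    return 100.0 * (current - previous) / previous


change_log = []  # running record of every transformation applied

# Hypothetical orders: each order is a list of item prices.
orders = [[10.0, 5.0], [20.0], [2.5, 2.5, 5.0]]
order_totals = [sum(items) for items in orders]
change_log.append("constructed order_total = sum of item prices")

scaled_totals = min_max_scale(order_totals)
change_log.append("min-max scaled order_total to [0, 1]")

monthly_sales = [200.0, 250.0]  # previous month, current month
growth = percent_change(monthly_sales[0], monthly_sales[1])
change_log.append("constructed percentage change since previous month")

print(order_totals, scaled_totals, growth)
print(change_log)
```

Writing each step to the log as it happens makes it easier to explain and defend the prepared data set later.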

Target: Need to organize the purchase transactions into a usable form. May need to group
transactions by product category (e.g., body lotions) and summarize transactions (e.g., average
number of purchases of lotions per month, or average volume of body lotion purchased per
month).
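
A grouping and summarizing step of this kind might look like the sketch below. The transaction layout, customer IDs and categories are made up for illustration and are not Target's actual data. The averages are taken only over months in which the customer bought something in the category, which is an assumption; averaging over all calendar months would give lower figures.

```python
from collections import defaultdict

# Hypothetical transactions: (customer_id, "YYYY-MM", product_category, quantity)
transactions = [
    ("c1", "2012-01", "body lotion", 1),
    ("c1", "2012-01", "body lotion", 2),
    ("c1", "2012-02", "body lotion", 1),
    ("c1", "2012-02", "vitamins", 1),
    ("c2", "2012-01", "vitamins", 3),
]


def monthly_averages(rows):
    """Average purchases and units per active month, by customer and category."""
    purchases = defaultdict(lambda: defaultdict(int))  # key -> month -> purchase count
    quantity = defaultdict(lambda: defaultdict(int))   # key -> month -> units bought
    for customer, month, category, qty in rows:
        key = (customer, category)
        purchases[key][month] += 1
        quantity[key][month] += qty
    summary = {}
    for key, by_month in purchases.items():
        months = len(by_month)  # only months with at least one purchase are counted
        summary[key] = {
            "avg_purchases_per_month": sum(by_month.values()) / months,
            "avg_quantity_per_month": sum(quantity[key].values()) / months,
        }
    return summary


purchase_profile = monthly_averages(transactions)
print(purchase_profile[("c1", "body lotion")])
```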

2.5 Modeling
• Select the modeling technique(s). There is an incredibly diverse range of modeling tools
available.
• Identify the assumptions and requirements necessary for the chosen technique.
• Develop a plan to train, test and evaluate the performance of the tool.
• Build the model. Describe the model’s behavior and interpretation, if possible.
• Evaluate the model. Assess the performance of the model based upon the evaluation data
set using the criteria established in the Business Understanding phase.
• Is the model reasonable? What do the domain experts think?
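
The plan to train, test and evaluate can be sketched with a hold-out split. The example below is deliberately simple and is not a recommended modeling technique: it fits a single cut-off rule on the training rows and then checks it on rows the model has not seen. The data, the 30% test fraction, the seed and the function names are all assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and hold out the last test_fraction for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def fit_threshold(train_rows):
    """Pick the cut-off on x that gives the highest accuracy on the training rows."""
    best_cut, best_correct = None, -1
    for candidate in sorted({x for x, _ in train_rows}):
        correct = sum((x >= candidate) == bool(y) for x, y in train_rows)
        if correct > best_correct:
            best_cut, best_correct = candidate, correct
    return best_cut


def accuracy(cut_value, rows):
    """Share of rows where the rule 'x >= cut_value' matches the label."""
    return sum((x >= cut_value) == bool(y) for x, y in rows) / len(rows)


# Hypothetical data: x = months of inactivity, y = 1 if the customer later churned.
data = [(x, 1 if x >= 6 else 0) for x in range(12)] * 5
train, test = train_test_split(data)
cut = fit_threshold(train)
print(cut, accuracy(cut, train), accuracy(cut, test))
```

The point of the split is that the test rows play no part in choosing the cut-off, so accuracy on them is a fairer estimate of how the rule behaves on new customers.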

Target: “… able to identify about 25 products that, when analyzed together, allowed him to
assign each shopper a “pregnancy prediction” score. More importantly, he could also estimate
her due date to within a small window, so Target could send coupons timed to very specific
stages of her pregnancy.”3

Target used a classification model to identify whether a customer was pregnant and a value
estimation model to predict the due date.

2.6 Evaluation
• Does it meet the business objectives?
• How accurate are the predictions?
• What are your recommendations for future projects?
• What are lessons learned? What would you have done differently?
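
For a classification model, "how accurate are the predictions?" is usually answered with more than one number. The sketch below computes accuracy, precision and recall from a confusion matrix. The label lists are hypothetical, with 1 marking the positive class (for example, "will churn").

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    return tp, fp, fn, tn


def summarize(actual, predicted):
    """Accuracy, precision and recall; a ratio is None when its denominator is zero."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# Hypothetical hold-out labels and model predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
metrics = summarize(actual, predicted)
print(metrics)  # accuracy 0.8, precision 0.75, recall 0.75
```

Which of these measures matters most depends on the business objective: a false positive and a false negative rarely cost the same.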

Target: … Take a fictional Target shopper named Jenny Ward, who is 23, lives in Atlanta and in
March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and
magnesium supplements and a bright blue rug. There’s, say, an 87 percent chance that she’s
pregnant and that her delivery date is sometime in late August.4

2.7 Deployment
• Develop a plan for implementing the results. Who will do what? What systems need to be
changed? How? By whom? How will you monitor results going forward? How will you
evaluate benefits? What are risks/pitfalls that are possible during deployment?
• What is the maintenance plan?
• Produce a final report.
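
One simple way to monitor results going forward is to compare the model's recent accuracy with the accuracy measured during Evaluation and to flag the model for review when the gap grows too large. The sketch below assumes a tolerance of 5 percentage points and a window of recently scored predictions. Both are hypothetical choices, not recommended values.

```python
def needs_review(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Flag the model when recent accuracy falls more than `tolerance` below baseline.

    recent_outcomes is a list of booleans: True where the prediction was correct.
    """
    if not recent_outcomes:
        return False  # nothing to judge yet
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance


# Hypothetical monitoring window of 10 scored predictions, 7 of them correct.
window = [True] * 7 + [False] * 3
print(needs_review(baseline_accuracy=0.80, recent_outcomes=window))
```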

Target: "What’s more, because of the data attached to her Guest ID number, Target knows how
to trigger Jenny’s habits. They know that if she receives a coupon via e-mail, it will most likely cue
her to buy online. They know that if she receives an ad in the mail on Friday, she frequently uses
it on a weekend trip to the store. And they know that if they reward her with a printed receipt
that entitles her to a free cup of Starbucks coffee, she’ll use it when she comes back again."5

The next 12 units look at activities and issues associated with the phases from Data Understanding to
Evaluation. Modeling will be discussed at a high level, with specific modeling approaches
examined in depth after we understand “evaluation”. Business Understanding is problem
specific and will be indirectly addressed through examples in the chapters. Deployment is also
problem specific, but some issues will be alluded to in examples. An important dimension of
deployment is “storytelling”: how do you effectively communicate what you discover? This is
not part of the course, but some advice is given in an appendix.

Image Citations:
Figure 2-1: Image courtesy of Kenneth Jensen under CC BY-SA 3.0

Footnotes
1 Duhigg, C. How Companies Learn Your Secrets. New York Times Magazine, 16 February 2012.
https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html
2 Ibid.
3 Ibid.
4 Ibid.
5 Ibid.
