2 Data Mining Process
2.1 CRISP-DM
Problem solving is a process. Where do you start? Is there a recommended sequence of steps
you should follow from start to finish? Most problem-solving processes start with a general
understanding of the situation and determination of goals. What does success look like? We may
then break down the problem into many sub-problems. Frequently, the main problem is too
large and complex to attack head-on. The sub-problems are smaller and more manageable. For
each sub-problem, we first do some exploration and then may construct models, which are
evaluated. Eventually this may lead to action and further evaluation before final deployment. Data
analytics is no different. One of the most popular frameworks for organizing the problem-solving
process is known as CRISP-DM (CRoss-Industry Standard Process for Data Mining). This
methodology has been adopted by IBM and is used within its SPSS Modeler software. You can
download documentation from
ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/User
Manual/CRISP-DM.pdf
Lewis Carroll’s book, Alice's Adventures in Wonderland, contains one of the most valuable
passages on goal-setting. Alice has fallen down the rabbit hole and entered
Wonderland. She meets the Cheshire Cat and asks for directions.
Alice: Would you tell me, please, which way I ought to go from here?
Cat: That depends a good deal on where you want to get to.
Alice: I don’t much care where.
Cat: Then it doesn’t matter which way you go.
Alice: ...so long as I get somewhere.
Cat: Oh, you are sure to do that.
We must have a clear goal and a method for measuring progress towards attaining this goal.
Target: "If I know which customers are pregnant and when their baby is due, I can send them
appropriate promotions to better engage them as customers. Can you determine which
customers are pregnant and when the baby is due?"1
Target: "We have a wealth of data that we have captured about customers and we can buy more
data. But with respect to pregnancy, we have data about women that have signed up for our
baby shower registry. We know when the baby is due and we have detailed purchase records
(date of purchase, items bought, which Target store, cash or credit, …). We can construct a
purchasing profile before pregnancy and see how the profile changes through the pregnancy."2
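One way to picture such a purchasing profile is sketched below. This is only an illustration, not Target's method: the transaction records, the due date, and the function name `purchasing_profile` are all hypothetical, and the sketch assumes a pregnancy of roughly 40 weeks (280 days) to separate "before" from "during".

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical records: (purchase date, product category, amount in dollars).
transactions = [
    (date(2011, 1, 10), "body lotion", 6.00),
    (date(2011, 2, 3), "snacks", 4.50),
    (date(2011, 6, 15), "body lotion", 12.00),
    (date(2011, 7, 1), "supplements", 9.00),
    (date(2011, 8, 20), "body lotion", 14.00),
]

# Hypothetical due date, as it might appear in a baby shower registry.
due_date = date(2012, 1, 15)

# Assumption: pregnancy lasts about 40 weeks, so it began ~280 days earlier.
pregnancy_start = due_date - timedelta(days=280)


def purchasing_profile(records, start):
    """Total spend per category, split into 'before' and 'during' pregnancy."""
    profile = {"before": defaultdict(float), "during": defaultdict(float)}
    for when, category, amount in records:
        period = "before" if when < start else "during"
        profile[period][category] += amount
    return profile


profile = purchasing_profile(transactions, pregnancy_start)
for period in ("before", "during"):
    print(period, dict(profile[period]))
```

Comparing the two dictionaries category by category shows how the profile shifts once the pregnancy begins, which is the kind of change a model could later learn from.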
Target: Need to organize the purchase transactions into a usable form. May need to group
transactions by product category (e.g., body lotions) and summarize transactions (e.g., average
2.5 Modeling
• Select the modeling technique(s). There is an incredibly diverse range of modeling tools
available.
• Identify the assumptions and requirements necessary for the chosen technique.
• Develop a plan to train, test and evaluate the performance of the tool.
• Build the model. Describe the model’s behavior and interpretation, if possible.
• Evaluate the model. Assess the performance of the model based upon the evaluation data
set using the criteria established in the Business Understanding phase.
• Is the model reasonable? What do the domain experts think?
Target: “… able to identify about 25 products that, when analyzed together, allowed him to
assign each shopper a “pregnancy prediction” score. More importantly, he could also estimate
her due date to within a small window, so Target could send coupons timed to very specific
stages of her pregnancy.”3
Target used a classification model to identify whether a customer was pregnant and a value
estimation (regression) model to predict the due date.
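The two kinds of model can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model Target built: the product list, the weights in `WEIGHTS`, the `BIAS`, and the "typical week of purchase" values in `TYPICAL_WEEK` are all invented for the example. A real model would estimate them from training data such as the registry records. The classification part uses a logistic score, and the value estimation part averages the due dates implied by each signal purchase.

```python
import math
from datetime import date, timedelta

# Hypothetical parameters; a real model would learn these from data.
BIAS = -3.0
WEIGHTS = {"body lotion": 1.5, "supplements": 2.0, "large purse": 1.0}

# Hypothetical week of pregnancy in which each product is typically bought.
TYPICAL_WEEK = {"body lotion": 14, "supplements": 18, "large purse": 30}


def pregnancy_score(basket):
    """Classification: logistic score in (0, 1) from the products present."""
    z = BIAS + sum(WEIGHTS.get(item, 0.0) for item in set(basket))
    return 1.0 / (1.0 + math.exp(-z))


def estimate_due_date(purchases):
    """Value estimation: average the due dates implied by signal purchases.

    purchases is a list of (purchase date, product) pairs. Assumes a
    40-week pregnancy. Returns None if no signal product was bought.
    """
    implied = [
        when + timedelta(weeks=40 - TYPICAL_WEEK[item])
        for when, item in purchases
        if item in TYPICAL_WEEK
    ]
    if not implied:
        return None
    mean_ordinal = round(sum(d.toordinal() for d in implied) / len(implied))
    return date.fromordinal(mean_ordinal)


purchases = [
    (date(2011, 3, 1), "body lotion"),
    (date(2011, 3, 20), "supplements"),
]
print(pregnancy_score([item for _, item in purchases]))
print(estimate_due_date(purchases))
```

The first function answers a yes/no question with a probability-like score, while the second predicts a quantity (a date). That difference is what separates classification from value estimation.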
2.6 Evaluation
• Does it meet the business objectives?
• How accurate are the predictions?
• What are your recommendations for future projects?
• What are lessons learned? What would you have done differently?
Target: … Take a fictional Target shopper named Jenny Ward, who is 23, lives in Atlanta and in
March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and
magnesium supplements and a bright blue rug. There’s, say, an 87 percent chance that she’s
pregnant and that her delivery date is sometime in late August.4
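The question "How accurate are the predictions?" can be made concrete with a few standard measures. The sketch below assumes we hold back customers whose true status is known (for example, from the registry) and compare it with the model's scores. The scores, labels, 0.5 threshold, and the function name `evaluate` are hypothetical choices for illustration.

```python
def evaluate(scores, labels, threshold=0.5):
    """Compare thresholded scores with known outcomes.

    scores: predicted probabilities of pregnancy.
    labels: True if the customer really was pregnant, else False.
    """
    predictions = [s >= threshold for s in scores]
    pairs = list(zip(predictions, labels))
    tp = sum(p and y for p, y in pairs)          # correctly flagged
    fp = sum(p and not y for p, y in pairs)      # flagged, not pregnant
    fn = sum(not p and y for p, y in pairs)      # missed
    tn = sum(not p and not y for p, y in pairs)  # correctly not flagged
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical evaluation data set of six customers.
scores = [0.87, 0.60, 0.40, 0.10, 0.75, 0.30]
labels = [True, True, True, False, False, False]

print(evaluate(scores, labels))
```

Accuracy alone can mislead when few customers are pregnant, so precision (how many flagged customers really are pregnant) and recall (how many pregnant customers were flagged) are reported alongside it. Which measure matters most depends on the criteria set in the Business Understanding phase.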
2.7 Deployment
• Develop a plan for implementing the results. Who will do what? What systems need to be
changed? How? By whom? How will you monitor results going forward? How will you
evaluate benefits? What are risks/pitfalls that are possible during deployment?
• What is the maintenance plan?
Target: "What’s more, because of the data attached to her Guest ID number, Target knows how
to trigger Jenny’s habits. They know that if she receives a coupon via e-mail, it will most likely cue
her to buy online. They know that if she receives an ad in the mail on Friday, she frequently uses
it on a weekend trip to the store. And they know that if they reward her with a printed receipt
that entitles her to a free cup of Starbucks coffee, she’ll use it when she comes back again."5
The next 12 units look at the activities and issues associated with the phases from Data
Understanding through Evaluation. Modeling is discussed at a high level, with specific modeling
approaches examined in depth once we understand “evaluation”. Business Understanding is
problem specific and is addressed indirectly through examples in the chapters. Deployment is
also problem specific, but some issues are touched on in examples. An important dimension of
deployment is “storytelling”: how do you effectively communicate what you discover? This is
not part of the course, but some advice is given in an appendix.
Image Citations:
Figure 1-1: Image courtesy of Kenneth Jensen under CC BY-SA 3.0
Footnotes
1 Duhigg, C. How Companies Learn Your Secrets. New York Times Magazine, 16 February 2012.
https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html
2 Ibid
3 Ibid
4 Ibid
5 Ibid