Data Mining: Learning Objectives For Chapter 5
Data Mining
CHAPTER OVERVIEW
CHAPTER OUTLINE
5.2 DATA MINING CONCEPTS AND APPLICATIONS
Application Case 5.1: Smarter Insurance: Infinity P&C
Improves Customer Service and Combats Fraud with
Predictive Analytics
A. DEFINITIONS, CHARACTERISTICS, AND BENEFITS
Technology Insights 5.1: A Simple Taxonomy of Data
Application Case 5.2: Harnessing Analytics to Combat
Crime: Predictive Analytics Helps Memphis Police
Department Pinpoint Crime and Focus Police Resources
B. HOW DATA MINING WORKS
1. Prediction
2. Classification
3. Clustering
4. Associations
5. Visualization and Time-Series Forecasting
C. DATA MINING VERSUS STATISTICS
Section 5.2 Review Questions
1. Determining the Optimal Number of Clusters
2. Analysis Methods
3. K-Means Clustering Algorithm
D. ASSOCIATION RULE MINING
1. Apriori Algorithm
Section 5.5 Review Questions
Chapter Highlights
Key Terms
Questions for Discussion
Exercises
Teradata University Network (TUN) and Other Hands-on Exercises
Team Assignments and Role-Playing Projects
Internet Exercises
End of Chapter Application Case: Macys.com Enhances Its
Customers’ Shopping Experience with Analytics
Questions for the Case
References
The application cases throughout the chapter help bring the material alive for students. The Internet exercises at the end are also helpful, especially the ones that send students to vendor sites for case studies and white papers. It can be a good class or group project for students to go to several different vendor sites and compare what they find on each.
Utilizing large and information-rich transactional and customer data (that they
collect on a daily basis) to optimize their business processes is not a choice for
large-scale retailers anymore, but a necessity to stay competitive.
2. What are the top challenges for multi-channel retailers? Can you think of other
industry segments that face similar problems?
The retail industry is among the most challenging because of the constant change it must deal with. Understanding customer needs, wants, likes, and dislikes is an ongoing challenge. As the volume and complexity of data increase, so does the time spent on preparing and analyzing it.
Prior to the integration of SAS and Teradata, data for modeling and scoring
customers was stored in a data mart. This process required a large amount of time
to construct, bringing together disparate data sources and keeping statisticians
from working on analytics.
3. What are the sources of data that retailers such as Cabela’s use for their data
mining projects?
Cabela’s uses large and information-rich transactional and customer data (that
they collect on a daily basis). In addition, through Web mining they track
clickstream patterns of customers shopping online.
4. What does it mean to have “a single view of the customer”? How can it be
accomplished?
Having a single view of the customer means treating the customer as a single
entity across whichever channels the customer utilizes. Shopping channels include
brick-and-mortar, television, catalog, and e-commerce (through computers and
mobile devices). Achieving this single view helps to better focus marketing
efforts and drive increased sales.
5. What type of analytics help did Cabela’s get from their efforts? Can you think of
any other potential benefits of analytics for large-scale retailers like Cabela’s?
Cabela’s has long relied on SAS statistics and data mining tools to help analyze
the data it gathers from sales transactions, market research, and demographic data
associated with its large database of customers. Using SAS data mining tools,
Cabela’s analysts create predictive models to optimize customer selection for all
customer contacts. Cabela’s uses these prediction scores to maximize marketing
spending across channels and within each customer’s personal contact strategy.
These efforts have allowed Cabela’s to continue its growth in a profitable manner.
In addition, dismantling the information silos and integrating SAS and Teradata enabled them to create “a holistic view of the customer.” Since this works so well for the sales/customer side of the business, it could work in other areas as well. Supply chain is one example. Analytics could help produce a
“holistic view of the vendor” as well.
6. What was the reason for Cabela’s to bring together SAS and Teradata, the two
leading vendors in the analytics marketplace?
Cabela’s was already using both for different elements of their business. Each of
the two systems was producing actionable analysis of data. But by being separate,
too much time was required to construct data marts, bringing together disparate
data sources and keeping statisticians from working on analytics. Now, with the
integration of the two systems, statisticians can leverage the power of SAS using
the Teradata warehouse as one source of information.
1. Define data mining. Why are there many different names and definitions for data
mining?
Data mining is the process through which previously unknown patterns in data are discovered. Another definition would be “a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful
information and subsequent knowledge from large sets of data.” This includes
most types of automated data analysis. A third definition: Data mining is the
process of finding mathematical patterns from (usually) large sets of data; these
can be rules, affinities, correlations, trends, or prediction models.
Data mining has many definitions because the term has been stretched beyond its original bounds by some software vendors, who label most forms of data analysis as data mining in order to capitalize on its popularity.
3. How would data mining algorithms deal with qualitative data like unstructured
texts from interviews?
Data in data mining is classified into two main categories: numerical and categorical. Qualitative data is known as categorical data, which includes nominal data (finite, non-ordered values), such as gender (male and female), and ordinal data (finite, ordered values), such as credit ratings and linguistic values (low, medium, and high). Qualitative data derived from a survey must be transformed into a categorical form before it can be handled by conventional data mining algorithms.
Qualitative data such as sex, age group, and education level need to be converted to a categorical type by allocating discrete variables with a finite number of values (nominal values), such as binomial values, or by properly coding linguistic values such as low, medium, and high.
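A minimal pandas sketch of such coding, with hypothetical survey columns: dummy (binomial) coding for nominal attributes and explicit ordered codes for ordinal ones.

```python
# Hypothetical survey data: column names and values are illustrative only.
import pandas as pd

survey = pd.DataFrame({
    "gender":    ["male", "female", "female", "male"],               # nominal
    "education": ["high school", "bachelor", "master", "bachelor"],  # nominal
    "risk":      ["low", "high", "medium", "low"],                   # ordinal
})

# Nominal attributes: binomial/dummy coding, since no order is implied.
coded = pd.get_dummies(survey, columns=["gender", "education"])

# Ordinal attribute: explicit codes that preserve the low < medium < high order.
coded["risk"] = survey["risk"].map({"low": 0, "medium": 1, "high": 2})

print(coded)
```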
Generally speaking, data mining tasks can be classified into three main categories:
prediction, association, and clustering. Based on the way in which the patterns are
extracted from the historical data, the learning algorithms of data mining methods
can be classified as either supervised or unsupervised. With supervised learning
algorithms, the training data includes both the descriptive attributes (i.e.,
independent variables or decision variables) as well as the class attribute (i.e.,
output variable or result variable). In contrast, with unsupervised learning the
training data includes only the descriptive attributes. Figure 5.3 (p. 198) shows a
simple taxonomy for data mining tasks, along with the learning methods, and
popular algorithms for each of the data mining tasks.
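A minimal scikit-learn sketch of this distinction (the data is synthetic and purely illustrative): the supervised learner is trained on descriptive attributes X together with class labels y, while the unsupervised learner sees X alone.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: X holds descriptive attributes, y the class attribute.
X, y = make_classification(n_samples=200, n_features=4, random_state=1)

# Supervised learning: training data includes both X and the class labels y.
clf = DecisionTreeClassifier(random_state=1).fit(X, y)
print("predicted classes:", clf.predict(X[:5]))

# Unsupervised learning: training data includes only the descriptive attributes X.
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print("cluster labels:", km.labels_[:5])
```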
5. What are the key differences between the major data mining methods?
Prediction: the act of telling about the future. It differs from simple guessing by
taking into account the experiences, opinions, and other relevant information
in conducting the task of foretelling. A term that is commonly associated with
prediction is forecasting. Even though many believe that these two terms are
synonymous, there is a subtle but critical difference between the two. Whereas
prediction is largely experience and opinion based, forecasting is data and
model based. That is, in order of increasing reliability, one might list the
relevant terms as guessing, predicting, and forecasting, respectively. In data
mining terminology, prediction and forecasting are used synonymously, and
the term prediction is used as the common representation of the act.
Classification: analyzing the historical behavior of groups of entities with similar
characteristics, to predict the future behavior of a new entity from its
similarity to those groups
Clustering: finding groups of entities with similar characteristics
Association: establishing relationships among items that occur together
Sequence discovery: finding time-based associations
Visualization: presenting results obtained through one or more of the other
methods
Regression: a statistical estimation technique based on fitting a curve defined by a mathematical equation of known type but unknown parameters to existing data (see the sketch after this list)
Forecasting: estimating a future data value based on past data values.
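As a small illustration of the regression/forecasting pairing noted above, the NumPy sketch below (with made-up sales figures) fits a line of known form but unknown parameters to past data, then uses the fitted model to estimate a future value.

```python
import numpy as np

# Illustrative historical data only.
years = np.array([2008, 2009, 2010, 2011, 2012])
sales = np.array([10.2, 11.1, 12.3, 12.9, 14.1])

# Regression: estimate the unknown parameters of a curve of known type (a line).
slope, intercept = np.polyfit(years, sales, deg=1)

# Forecasting: estimate a future data value based on past data values.
forecast_2013 = slope * 2013 + intercept
print(f"forecast for 2013: {forecast_2013:.1f}")
```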
Applications are listed near the beginning of this section (pp. 201-203): CRM,
banking, retailing and logistics, manufacturing and production, brokerage and
securities trading, insurance, computer hardware and software, government and
defense, travel, healthcare, medicine, entertainment, homeland security and law
enforcement, and sports.
2. Identify at least five specific applications of data mining and list five common
characteristics of these applications.
This question expands on the prior question by asking for common characteristics.
Several such applications and their characteristics are listed on pp. 201-203.
3. What do you think is the most prominent application area for data mining? Why?
Students’ answers will differ depending on which of the applications (most likely
banking, retailing and logistics, manufacturing and production, government,
healthcare, medicine, or homeland security) they think is most in need of greater
certainty. Their reasons for selection should relate to the application area’s need
for better certainty and the ability to pay for the investments in data mining.
4. Can you think of other application areas for data mining not discussed in this
section? Explain.
Students should be able to identify an area that can benefit from greater prediction
or certainty. Answers will vary depending on their creativity.
Similar to other information systems initiatives, a data mining project must follow
a systematic project management process to be successful. Several data mining
processes have been proposed: CRISP-DM, SEMMA, and KDD.
2. Why do you think the early phases (understanding of the business and
understanding of the data) take the longest in data mining projects?
Students should explain that the early steps are the most unstructured phases
because they involve learning. Those phases (learning/understanding) cannot be
automated. Extra time and effort are needed upfront because any mistake in
understanding the business or data will most likely result in a failed BI project.
4. What are the main data preprocessing steps? Briefly describe each step and
provide relevant examples.
Data preprocessing is essential to any successful data mining study. Good data
leads to good information; good information leads to good decisions. Data
preprocessing includes four main steps (listed in Table 5.1 on page 209 and sketched in code after the list):
data consolidation: access, collect, select and filter data
data cleaning: handle missing data, reduce noise, fix errors
data transformation: normalize the data, aggregate data, construct new
attributes
data reduction: reduce number of attributes and records; balance skewed
data
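A minimal pandas sketch of the four steps; the file and column names are hypothetical, but real projects follow the same outline.

```python
import pandas as pd

# 1. Data consolidation: access, collect, select, and filter data.
orders = pd.read_csv("orders.csv")        # hypothetical source files
customers = pd.read_csv("customers.csv")
data = orders.merge(customers, on="customer_id")

# 2. Data cleaning: handle missing data, reduce noise, fix errors.
data["income"] = data["income"].fillna(data["income"].median())
data = data[data["age"].between(18, 100)]  # drop implausible ages

# 3. Data transformation: normalize, aggregate, construct new attributes.
data["income_z"] = (data["income"] - data["income"].mean()) / data["income"].std()
data["spend_per_order"] = data["total_spend"] / data["order_count"]

# 4. Data reduction: fewer attributes and records; sampling can also rebalance data.
reduced = data[["income_z", "spend_per_order", "churned"]].sample(frac=0.5, random_state=1)
```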
Association rule mining is a popular data mining method that is commonly used to explain to a technologically less savvy audience what data mining is and what it can do. Association rule mining aims to find interesting relationships (affinities) between variables (items) in large databases.
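The sketch below computes the two standard rule metrics, support and confidence, over a toy set of market-basket transactions. (The Apriori algorithm covered in the chapter searches for frequent itemsets efficiently; this sketch shows only the measures themselves.)

```python
# Toy market-basket transactions, purely illustrative.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {milk} -> {bread}: confidence = support(milk and bread) / support(milk).
antecedent, consequent = {"milk"}, {"bread"}
confidence = support(antecedent | consequent) / support(antecedent)
print(f"support = {support(antecedent | consequent):.2f}, confidence = {confidence:.2f}")
```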
2. In the final step of data processing, how would data reduction facilitate decision
analysis? Give an example.
Data reduction is the final step of data preprocessing. It can help analysts gain insight into a data set and focus on the interrelationships of its key variables. If a randomly selected sample can fully represent the data set, the decision analysis and the interpretation of the processed data will be straightforward. For example, in an organ transplantation data set, a single binary variable can be used to show a blood-type match (1) or no-match (0) instead of using multiple nominal values to represent the blood types of the donor and the recipient. Such simplification increases the information content of the data, moving it toward a knowledge pattern, while reducing the complexity of the data relationships.
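A minimal pandas sketch of that simplification, with hypothetical column names (and deliberately ignoring real donor-compatibility rules, such as type O being a universal donor).

```python
import pandas as pd

# Hypothetical donor/recipient records; real compatibility rules are richer.
pairs = pd.DataFrame({
    "donor_type":     ["A", "O", "B", "AB"],
    "recipient_type": ["A", "B", "B", "O"],
})

# Data reduction: one informative binary attribute replaces two nominal ones.
pairs["blood_match"] = (pairs["donor_type"] == pairs["recipient_type"]).astype(int)
print(pairs)
```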
4. What are some of the criteria for comparing and selecting the best classification
technique?
6. Define Gini index. What does it measure?
The Gini index and information gain (entropy) are two popular ways to determine
branching choices in a decision tree. The Gini index measures the purity of a
sample. If everything in a sample belongs to one class, the Gini index value is
zero.
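A minimal sketch of the measure, using the standard formula Gini = 1 - sum of the squared class proportions: a pure sample scores 0, and a 50/50 two-class sample scores 0.5.

```python
from collections import Counter

def gini(labels):
    """Gini index of a sample of class labels: 0 means perfectly pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes"]))       # 0.0 -> everything belongs to one class
print(gini(["yes", "no", "yes", "no"]))  # 0.5 -> maximally mixed two-class sample
```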
Cluster algorithms are used when the data records do not have predefined class
identifiers (i.e., it is not known to what class a particular record belongs).
Classification methods learn from previous examples containing inputs and the
resulting class labels, and once properly trained they are able to classify future
cases. Clustering partitions pattern records into natural segments or clusters.
The most commonly used clustering algorithms are k-means and self-organizing
maps.
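A minimal scikit-learn sketch of k-means on synthetic data, together with the common “elbow” heuristic for determining the optimal number of clusters (one of the analysis methods noted in the chapter outline).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four natural groupings.
X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

# Inertia = total within-cluster sum of squares; the "elbow" where the
# improvement levels off (here around k=4) suggests a reasonable k.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}")
```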
Examples of these vendors include IBM (IBM SPSS Modeler), SAS (Enterprise
Miner), StatSoft (Statistica Data Miner), KXEN (Infinite Insight), Salford
(CART, MARS, TreeNet, RandomForest), Angoss (KnowledgeSTUDIO,
KnowledgeSeeker), and Megaputer (PolyAnalyst). Most of the more popular tools
are developed by the largest statistical software companies (SPSS, SAS, and
StatSoft).
2. Why do you think the most popular tools are developed by statistics companies?
Data mining techniques involve the use of statistical analysis and modeling, so data mining is a natural extension of these companies’ business offerings.
Probably the most popular free and open source data mining tool is Weka. Others
include RapidMiner and Microsoft’s SQL Server.
4. What are the main differences between commercial and free data mining software
tools?
The main difference between commercial tools, such as Enterprise Miner and
Statistica, and free tools, such as Weka and RapidMiner, is computational
efficiency. The same data mining task involving a rather large dataset may take a
whole lot longer to complete with the free software, and in some cases it may not
even be feasible (i.e., crashing due to the inefficient use of computer memory).
5. What would be your top five selection criteria for a data mining tool? Explain.
Students’ answers will differ. Criteria they are likely to mention include cost,
user-interface, ease-of-use, computational efficiency, hardware compatibility,
type of business problem, vendor support, and vendor reputation.
Data that is collected, stored, and analyzed in data mining often contains
information about real people. This includes identification, demographic,
financial, personal, and behavioral information. Most of these data can be
accessed through some third-party data providers. In order to maintain the privacy
and protection of individuals’ rights, data mining professionals have ethical (and
often legal) obligations.
2. How do you think the discussion between privacy and data mining will progress?
Why?
Answers will vary by student. Most will speculate on whether society will be able to find a happy medium between privacy and data mining.
4. What do you think are the reasons for these myths about data mining?
Students’ answers will differ. Some answers might relate to fear of analytics, fear
of the unknown, or fear of looking dumb.
5. What are the most common data mining mistakes/blunders? How can they be
minimized and/or eliminated?
1. How did Infinity P&C improve customer service with data mining?
One out of five claims is fraudulent. Rather than putting all five customers
through an investigatory process, SPSS helps Infinity ‘fast-track’ four of them and
close their cases within a matter of days. This results in much happier customers,
contributes to a more efficient workflow with improved cycle times, and improves
retention due to an overall better claims experience.
2. What were the challenges, the proposed solution, and the obtained results?
The solution involved using IBM SPSS predictive analytics tools. Based on “red flag” claims, Infinity used these tools to develop rules for rating claims and identifying potential fraud. A key benefit of the IBM SPSS system is its ability to continually analyze and score these claims, which helps ensure that each claim gets to the right adjuster at the right time.
As a result of implementing the IBM SPSS analytics tools, Infinity P&C has
doubled the accuracy of its fraud identification, contributing to a return on
investment of 403 percent, according to a Nucleus Research study. With SPSS, Infinity
P&C has reduced SIU referral time from an average of 45–60 days to
approximately 1–3 days. Predictive analytics also helped with subrogation, the
process of collecting damages from the at-fault driver’s insurance company.
The Special Investigative Unit (SIU) started with low-hanging fruit, applying data mining to construct discovery rules from “red flag” claims. They then leveraged the company’s credit-rating approach to develop a similar scoring mechanism for claims (higher scores indicate a greater chance of fraud).
They added rules to the SPSS rule-base to help flag potential subrogation claims.
This incremental approach to implementation works well with a tool like SPSS.
1. How did the Memphis Police Department use data mining to better combat crime?
They started mining MPD’s crime data banks to help zero in on where and when
criminals were hitting hardest. This began as a pilot program called Operation
Blue CRUSH, or Crime Reduction Utilizing Statistical History. Their use of data
mining enabled them to focus police resources intelligently by putting them in the
right place, on the right day, at the right time. Today, the MPD is continuing to
explore new ways to exploit statistical analysis in its crime-fighting mission.
2. What were the challenges, the proposed solution, and the obtained results?
Crime across the metro area was surging, and city leaders were growing
impatient. The solution, a project called Blue CRUSH, involves IBM SPSS
Modeler, which enables officers to unlock the intelligence hidden in the
department’s huge digital library of crime records and police reports going back
nearly a decade. This has put a serious dent in Memphis area crime. Since the
program was launched, the number of Part One crimes—a category of serious
offenses including homicide, rape, aggravated assault, auto theft, and larceny—
has plummeted, dropping 27 percent from 2006 to 2010.
1. How can data mining be used to fight terrorism? Comment on what else can be
done beyond what is covered in this short application case.
The application case discusses use of data mining to detect money laundering and
other forms of terrorist financing. Other applications could be to track the
behavior and movement of potential terrorists, as well as text mining emails,
blogs, and social media threads.
2. Do you think that, although data mining is essential for fighting terrorist cells, it
also jeopardizes individuals’ rights to privacy?
1. How can data mining be used for ultimately curing illnesses like cancer?
Even though cancer research has traditionally been clinical and biological in
nature, in recent years data-driven analytic studies have become a common
complement. In medical domains where data- and analytics-driven research have
been applied successfully, novel research directions have been identified to
further advance the clinical and biological studies. Using data mining techniques,
researchers are able to identify novel patterns, paving the road toward a cancer-
free society.
2. What do you think are the promises and major challenges for data miners in
contributing to medical and biological research endeavors?
According to the American Cancer Society, half of all men and one-third of all
women in the United States will develop cancer during their lifetimes;
approximately 1.5 million new cancer cases will be diagnosed in 2013. Cancer is
the second most common cause of death in the United States and in the world,
exceeded only by cardiovascular disease. Data mining shows tremendous promise
for helping to understand cancer, leading to better treatment and saved lives. Data
mining is not meant to replace medical professionals and researchers, but to
complement their invaluable efforts to provide data-driven new research
directions and to ultimately save more lives. Without the cooperation and
feedback from the medical experts, data mining results are not of much use. The
patterns found via data mining methods should be evaluated by medical
professionals who have years of experience in the problem domain to decide
whether they are logical, actionable, and novel to warrant new research directions.
Application Case 5.5: 2degrees Gets a 1275 Percent Boost in Churn Identification
1. What does 2degrees do? Why is it important for 2degrees to accurately identify
churn?
2. What were the challenges, the proposed solution, and the obtained results?
Their main challenge was to identify the most likely churn customers. For this,
they used 11Ants Customer Churn Analyzer. This tool allowed them to get up and
running quickly and very economically. They could test the water and determine
what the ROI was likely to be for predictive analytics. A carefully controlled
experiment was run over a period of 3 months. Customers identified as churners
were 1275 percent more likely to be churners than customers chosen at random.
The benefits to 2degrees and its customers included better identification and more appropriate targeting, so that likely churners receive more intensive retention marketing than unlikely churners.
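To make the headline figure concrete, a back-of-the-envelope sketch: “1275 percent more likely” corresponds to a lift of about 13.75 over random selection. The base churn rate below is an assumed value for illustration, not a figure from the case.

```python
base_rate = 0.02                 # assumed churn rate among randomly chosen customers
lift = 1 + 1275 / 100            # "1275 percent more likely" -> 13.75x
flagged_rate = base_rate * lift  # implied churn rate among flagged customers
print(f"lift = {lift:.2f}, flagged churn rate = {flagged_rate:.1%}")  # 27.5%
```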
3. How can data mining help in identifying customer churn? How do some
companies do it without using data mining tools and techniques?
Application Case 5.6: Data Mining Goes to Hollywood: Predicting Financial Success of Movies
The movie industry is the “land of hunches and wild guesses” due to the difficulty
associated with forecasting product demand, making the movie business in
Hollywood a risky endeavor. If Hollywood could better predict financial success,
this would mitigate some of the financial risk.
2. How can data mining be used to predict the financial success of movies before the
start of their production process?
The way Sharda and Delen did it was to use data from movies between 1998 and
2005 as training data, and movies of 2006 as test data. They applied individual
and ensemble prediction models, and were able to identify significant variables
impacting financial success. They also showed that by using sensitivity analysis,
decision makers can predict with fairly high accuracy how much value a specific
actor (or a specific release date, or the addition of more technical effects, etc.)
brings to the financial success of a film, making the underlying system an
invaluable decision aid.
3. How do you think Hollywood performed, and perhaps is still performing, this task
without the help of data mining tools and techniques?
Most is done by gut feel and trial-and-error. This may keep the movie business as
a financially risky endeavor, but also allows for creativity. Sometimes uncertainty
is a good thing.
1. What do you think about data mining and its implications concerning privacy?
What is the threshold between knowledge discovery and privacy infringement?
2. Did Target go too far? Did they do anything illegal? What do you think they
should have done? What do you think they should do now (quit these types of
practices)?
Target might have made a tactical mistake, but they certainly didn’t do anything
illegal. They did not use any information that violates customer privacy; rather,
they used transactional data that most every other retail chain is collecting and
storing (and perhaps analyzing) about their customers. Indeed, even the father
apologized upon realizing that his daughter was actually pregnant. The fact is, we live
in a world of massive data, and we are all as consumers leaving traces of our
buying behavior for anyone to see. (Answers will vary by student.)
1. Define data mining. Why are there many names and definitions for data mining?
Data mining is the process through which previously unknown patterns in data are discovered. Another definition would be “a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful
information and subsequent knowledge from large sets of data.” This includes
most types of automated data analysis. A third definition: Data mining is the
process of finding mathematical patterns from (usually) large sets of data; these
can be rules, affinities, correlations, trends, or prediction models.
Data mining has many definitions because the term has been stretched beyond its original bounds by some software vendors, who label most forms of data analysis as data mining in order to capitalize on its popularity.
2. What are the main reasons for the recent popularity of data mining?
Reasons include the availability of software tools to perform the analyses, the availability of historical data, and a business need for the data mining software.
Students can view the answer in Figure 5.1 (p. 193), which shows that data
mining is a composite or blend of multiple disciplines or analytical tools and
techniques.
5. Discuss the main data mining methods. What are the fundamental differences
among them?
Prediction is the act of telling about the future. It differs from simple guessing by
taking into account the experiences, opinions, and other relevant information in
conducting the task of foretelling. A term that is commonly associated with
prediction is forecasting. Even though many believe that these two terms are
synonymous, there is a subtle but critical difference between the two. Whereas
prediction is largely experience and opinion based, forecasting is data and model
based.
6. What are the main data mining application areas? Discuss the commonalities of
these areas that make them a prospect for data mining studies.
Applications are listed near the beginning of section 5.3: CRM, banking, retailing
and logistics, manufacturing and production, brokerage and securities trading,
insurance, computer hardware and software, government and defense, travel,
healthcare, medicine, entertainment, homeland security and law enforcement, and
sports.
The commonalities are the need for predictions and forecasting for planning
purposes and to support decision making.
7. Why do we need a standardized data mining process? What are the most
commonly used data mining processes?
8. Explain how the validity of available data affects model development and
assessment.
9. In what step of data processing should missing data be distinguished? How will it
affect data clearing and transformation?
Missing data must be identified in the data cleaning stage and filled with the most appropriate values. Data transformation cannot be completed while data are missing, because each variable involved must have a complete range of (non-missing) values before mathematical functions can be applied to create new, informative variables.
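A minimal pandas sketch of this ordering, with hypothetical columns: missing values are identified and imputed during cleaning, after which transformation can proceed.

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing values.
data = pd.DataFrame({
    "age":    [34, np.nan, 51, 29],
    "income": [48_000, 52_000, np.nan, 39_000],
})

print(data.isna().sum())  # data cleaning starts by identifying the missing values

data["age"] = data["age"].fillna(data["age"].median())         # impute with median
data["income"] = data["income"].fillna(data["income"].mean())  # impute with mean

# Transformation can now proceed: every variable has a complete range of values.
data["log_income"] = np.log(data["income"])
```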
10. Why do we need data preprocessing? What are the main tasks and relevant
techniques used in data preprocessing?
Data preprocessing is essential to any successful data mining study. Good data
leads to good information; good information leads to good decisions. Data
preprocessing includes four main steps (listed in Table 5.1 on page 209):
data consolidation: access, collect, select and filter data
data cleaning: handle missing data, reduce noise, fix errors
data transformation: normalize the data, aggregate data, construct new
attributes
data reduction: reduce number of attributes and records; balance skewed
data
11. Distinguish between an inconsistent value and missing value. When can
inconsistent values be removed from the data set and missing data be ignored?
A missing value is a natural part of a data set and needs to be imputed with the most appropriate value. In contrast, an inconsistent value is an unusual value within a variable in a data set and should be handled using domain knowledge and/or expert opinions.
12. What is the main difference between classification and clustering? Explain using
concrete examples.
13. Discuss the common ways to deal with the missing data? How is missing data
dealt with in a survey where some respondents have not replied to a few
questions?
14. What are the privacy issues with data mining? Do you think they are
substantiated?
Data that is collected, stored, and analyzed in data mining often contains
information about real people. This includes identification, demographic,
financial, personal, and behavioral information. Most of these data can be
accessed through some third-party data providers. In order to maintain the privacy
and protection of individuals’ rights, data mining professionals have ethical (and
often legal) obligations. As time goes on, this will continue to be a public debate.
15. What are the most common myths and mistakes about data mining?