Data Mining: Learning Objectives For Chapter 5


Chapter 5:

Data Mining

Learning Objectives for Chapter 5


1. Describe data mining as an enabling technology for business analytics
2. Understand the objectives and benefits of data mining
3. Become familiar with the wide range of applications of data mining
4. Learn the standardized data mining processes
5. Understand the steps involved in data preprocessing for data mining
6. Learn different methods and algorithms of data mining
7. Build awareness of the existing data mining software tools
8. Understand the privacy issues, pitfalls, and myths of data mining

CHAPTER OVERVIEW

Generally speaking, data mining is a way to develop intelligence (i.e., actionable
information or knowledge) from data that an organization collects, organizes, and stores.
A wide range of data mining techniques are being used by organizations to gain a better
understanding of their customers and their own operations and to solve complex
organizational problems. In this chapter, we study data mining as an enabling technology
for business analytics, learn about the standard processes of conducting data mining
projects, understand and build expertise in the use of major data mining techniques,
develop awareness of the existing software tools, and explore the privacy issues,
common myths, and pitfalls that are often associated with data mining.

CHAPTER OUTLINE

5.1 OPENING VIGNETTE: CABELA’S REELS IN MORE CUSTOMERS
WITH ADVANCED ANALYTICS AND DATA MINING
• Questions for the Opening Vignette
A. WHAT WE CAN LEARN FROM THIS VIGNETTE

Copyright © 2014 Pearson Education, Inc.
5.2 DATA MINING CONCEPTS AND APPLICATIONS
• Application Case 5.1: Smarter Insurance: Infinity P&C
Improves Customer Service and Combats Fraud with
Predictive Analytics
A. DEFINITIONS, CHARACTERISTICS, AND BENEFITS
• Technology Insights 5.1: A Simple Taxonomy of Data
• Application Case 5.2: Harnessing Analytics to Combat
Crime: Predictive Analytics Helps Memphis Police
Department Pinpoint Crime and Focus Police Resources
B. HOW DATA MINING WORKS
1. Prediction
2. Classification
3. Clustering
4. Associations
5. Visualization and Time-Series Forecasting
C. DATA MINING VERSUS STATISTICS
• Section 5.2 Review Questions

5.3 DATA MINING APPLICATIONS
• Application Case 5.3: A Mine on Terrorist Funding
• Section 5.3 Review Questions

5.4 DATA MINING PROCESS
A. Step 1: Business Understanding
B. Step 2: Data Understanding
C. Step 3: Data Preparation
D. Step 4: Model Building
• Application Case 5.4: Data Mining in Cancer Research
E. Step 5: Testing and Evaluation
F. Step 6: Deployment
G. OTHER DATA MINING STANDARDIZED PROCESSES AND
METHODOLOGIES
• Section 5.4 Review Questions

5.5 DATA MINING METHODS
A. CLASSIFICATION
B. ESTIMATING THE TRUE ACCURACY OF CLASSIFICATION
MODELS
1. Simple Split
2. k-Fold Cross-Validation
3. Additional Classification Assessment Methodologies
4. Classification Techniques
5. Decision Trees
• Application Case 5.5: 2degrees Gets a 1275 Percent Boost
in Churn Identification
C. CLUSTER ANALYSIS FOR DATA MINING

1. Determining the Optimal Number of Clusters
2. Analysis Methods
3. K-Means Clustering Algorithm
D. ASSOCIATION RULE MINING
1. Apriori Algorithm
• Section 5.5 Review Questions

5.6 DATA MINING SOFTWARE TOOLS
• Application Case 5.6: Data Mining Goes to Hollywood:
Predicting Financial Success of Movies
• Section 5.6 Review Questions

5.7 DATA MINING PRIVACY ISSUES, MYTHS, AND BLUNDERS
A. DATA MINING AND PRIVACY ISSUES
• Application Case 5.7: Predicting Customer Buying Patterns
—The Target Story
B. DATA MINING MYTHS AND BLUNDERS
• Section 5.7 Review Questions

Chapter Highlights
Key Terms
Questions for Discussion
Exercises
Teradata University Network (TUN) and Other Hands-on Exercises
Team Assignments and Role-Playing Projects
Internet Exercises
• End of Chapter Application Case: Macys.com Enhances Its
Customers’ Shopping Experience with Analytics
• Questions for the Case
References

TEACHING TIPS/ADDITIONAL INFORMATION        


Data mining was mentioned briefly in previous chapters. Students have
encountered the term several times but will only have a general idea of what it means. By
the time they finish this chapter, students will have a comprehensive understanding of
key data mining topics and applications.
The order of sections in this chapter follows a logical sequence to make teaching
this topic easy. After the opening vignette, Section 5.2 discusses the general concepts.
Section 5.3 covers a wide range of applications. Section 5.4 focuses on the data mining
process. Section 5.5 describes the most popular data mining methods and explains their
representative techniques. Section 5.6 covers software vendors, commercial tools, and
open source tools for modeling and data visualization. Finally, Section 5.7 covers privacy
issues, myths, and blunders.
This chapter contains seven application cases that can help the material come
alive for students. The Internet exercises at the end are also helpful, especially the ones
that send students to vendor sites for case studies and white papers. It can be a good class
or group project for students to go to several different vendor sites and compare what
they find on each.

ANSWERS TO END OF SECTION REVIEW QUESTIONS      


Section 5.1 Review Questions

1. Why should retailers, especially omni-channel retailers, pay extra attention to
advanced analytics and data mining?

Utilizing large and information-rich transactional and customer data (that they
collect on a daily basis) to optimize their business processes is not a choice for
large-scale retailers anymore, but a necessity to stay competitive.

2. What are the top challenges for multi-channel retailers? Can you think of other
industry segments that face similar problems?

The retail industry is among the most challenging because of the constant change
that retailers have to deal with. Understanding customer needs, wants, likes,
and dislikes is an ongoing challenge. As the volume and complexity of data
increase, so does the time spent on preparing and analyzing it.

Prior to the integration of SAS and Teradata, data for modeling and scoring
customers was stored in a data mart. Constructing the data mart required a large
amount of time: disparate data sources had to be brought together, which kept
statisticians from working on analytics.

3. What are the sources of data that retailers such as Cabela’s use for their data
mining projects?

Cabela’s uses large and information-rich transactional and customer data (that
they collect on a daily basis). In addition, through Web mining they track
clickstream patterns of customers shopping online.

4. What does it mean to have “a single view of the customer”? How can it be
accomplished?

Having a single view of the customer means treating the customer as a single
entity across whichever channels the customer utilizes. Shopping channels include
brick-and-mortar, television, catalog, and e-commerce (through computers and
mobile devices). Achieving this single view helps to better focus marketing
efforts and drive increased sales.

5. What type of analytics help did Cabela’s get from their efforts? Can you think of
any other potential benefits of analytics for large-scale retailers like Cabela’s?

Cabela’s has long relied on SAS statistics and data mining tools to help analyze
the data it gathers from sales transactions, market research, and demographic data
associated with its large database of customers. Using SAS data mining tools,
Cabela’s analysts create predictive models to optimize customer selection for all
customer contacts. Cabela’s uses these prediction scores to maximize marketing
spending across channels and within each customer’s personal contact strategy.
These efforts have allowed Cabela’s to continue its growth in a profitable manner.
In addition, dismantling the information silos and integrating SAS and
Teradata enabled them to create “a holistic view of the customer.” Since this
works so well for the sales/customer side of the business, it could work in
other areas as well. Supply chain is one example: analytics could help produce a
“holistic view of the vendor.”

6. What was the reason for Cabela’s to bring together SAS and Teradata, the two
leading vendors in the analytics marketplace?

Cabela’s was already using both for different elements of its business. Each of
the two systems was producing actionable analysis of data. Because the systems
were separate, however, too much time was required to construct data marts:
disparate data sources had to be brought together, which kept statisticians from
working on analytics. Now, with the integration of the two systems, statisticians
can leverage the power of SAS using the Teradata warehouse as one source of
information.

7. What is in-database analytics, and why would you need it?

In-database analytics refers to the practice of applying analytics directly to a
database or data warehouse rather than the traditional practice of first
transforming the data into the analytics application’s data format. Moving and
transforming the data in this way can take a very long time. In-database
analytics eliminates this need.

Section 5.2 Review Questions

1. Define data mining. Why are there many different names and definitions for data
mining?

Data mining is the process through which previously unknown patterns in data
are discovered. Another definition would be “a process that uses statistical,
mathematical, and artificial intelligence techniques to extract and identify useful
information and subsequent knowledge from large sets of data.” This includes
most types of automated data analysis. A third definition: Data mining is the
process of finding mathematical patterns from (usually) large sets of data; these
can be rules, affinities, correlations, trends, or prediction models.

Data mining has many definitions because some software vendors have stretched
the term beyond these limits to cover most forms of data analysis, using the
popularity of data mining to increase sales.

2. What recent factors have increased the popularity of data mining?

Following are some of the most pronounced reasons:


• More intense competition at the global scale driven by customers’ ever-changing
needs and wants in an increasingly saturated marketplace.
• General recognition of the untapped value hidden in large data sources.
• Consolidation and integration of database records, which enables a single view
of customers, vendors, transactions, etc.
• Consolidation of databases and other data repositories into a single location in
the form of a data warehouse.
• The exponential increase in data processing and storage technologies.
• Significant reduction in the cost of hardware and software for data storage and
processing.
• Movement toward the de-massification (conversion of information resources
into nonphysical form) of business practices.

3. How would data mining algorithms deal with qualitative data like unstructured
texts from interviews?

Data in data mining is classified into two main categories: numerical and
categorical. Qualitative data is known as categorical data, which includes
nominal data (finite non-ordered values) such as gender (male and female)
and ordinal data (finite ordered values) such as credit scores and linguistic values
(low, medium, and high). Qualitative data derived from a survey must be
transformed into categorical form before conventional data mining algorithms
can handle it.
Qualitative data such as sex, age group, and education level needs to be
converted to a categorical type by defining discrete variables with a finite
number of values (nominal values) such as binomial values, or by properly coding
linguistic values such as low, medium, and high.
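
The coding step described above can be sketched in a few lines of Python. This is only an illustration: the attribute names (channel, satisfaction), the category lists, and both helper functions are hypothetical and do not come from the textbook.

```python
# Hypothetical category lists; a real study would derive them from the data.
ORDINAL_LEVELS = ["low", "medium", "high"]   # ordered linguistic values
CHANNELS = ["store", "catalog", "web"]       # nominal values, no order


def encode_ordinal(value, levels=ORDINAL_LEVELS):
    """Map an ordinal label to its rank (low -> 0, medium -> 1, high -> 2)."""
    return levels.index(value.strip().lower())


def encode_nominal(value, categories):
    """One-hot encode a nominal label: one 0/1 indicator per category."""
    value = value.strip().lower()
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Two invented survey responses.
survey = [
    {"channel": "catalog", "satisfaction": "High"},
    {"channel": "store", "satisfaction": "low"},
]

encoded = [
    encode_nominal(r["channel"], CHANNELS) + [encode_ordinal(r["satisfaction"])]
    for r in survey
]
print(encoded)
```

Each response becomes a purely numeric row (three channel indicators followed by the satisfaction rank), which is the form most conventional algorithms expect.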

4. What are some major data mining methods and algorithms?

Generally speaking, data mining tasks can be classified into three main categories:
prediction, association, and clustering. Based on the way in which the patterns are
extracted from the historical data, the learning algorithms of data mining methods
can be classified as either supervised or unsupervised. With supervised learning
algorithms, the training data includes both the descriptive attributes (i.e.,
independent variables or decision variables) as well as the class attribute (i.e.,
output variable or result variable). In contrast, with unsupervised learning the
training data includes only the descriptive attributes. Figure 5.3 (p. 198) shows a
simple taxonomy for data mining tasks, along with the learning methods, and
popular algorithms for each of the data mining tasks.

5. What are the key differences between the major data mining methods?

Prediction: the act of telling about the future. It differs from simple guessing by
taking into account the experiences, opinions, and other relevant information
in conducting the task of foretelling. A term that is commonly associated with
prediction is forecasting. Even though many believe that these two terms are
synonymous, there is a subtle but critical difference between the two. Whereas
prediction is largely experience and opinion based, forecasting is data and
model based. That is, in order of increasing reliability, one might list the
relevant terms as guessing, predicting, and forecasting, respectively. In data
mining terminology, prediction and forecasting are used synonymously, and
the term prediction is used as the common representation of the act.
Classification: analyzing the historical behavior of groups of entities with similar
characteristics, to predict the future behavior of a new entity from its
similarity to those groups
Clustering: finding groups of entities with similar characteristics
Association: establishing relationships among items that occur together
Sequence discovery: finding time-based associations
Visualization: presenting results obtained through one or more of the other
methods
Regression: a statistical estimation technique based on fitting a curve defined by
a mathematical equation of known type but unknown parameters to existing
data
Forecasting: estimating a future data value based on past data values.

Section 5.3 Review Questions

1. What are the major application areas for data mining?

Applications are listed near the beginning of this section (pp. 201-203): CRM,
banking, retailing and logistics, manufacturing and production, brokerage and
securities trading, insurance, computer hardware and software, government and
defense, travel, healthcare, medicine, entertainment, homeland security and law
enforcement, and sports.

2. Identify at least five specific applications of data mining and list five common
characteristics of these applications.

This question expands on the prior question by asking for common characteristics.
Several such applications and their characteristics are listed on pp. 201-203.

3. What do you think is the most prominent application area for data mining? Why?

Students’ answers will differ depending on which of the applications (most likely
banking, retailing and logistics, manufacturing and production, government,
healthcare, medicine, or homeland security) they think is most in need of greater
certainty. Their reasons for selection should relate to the application area’s need
for better certainty and the ability to pay for the investments in data mining.

4. Can you think of other application areas for data mining not discussed in this
section? Explain.

Students should be able to identify an area that can benefit from greater prediction
or certainty. Answers will vary depending on their creativity.

Section 5.4 Review Questions

1. What are the major data mining processes?

Similar to other information systems initiatives, a data mining project must follow
a systematic project management process to be successful. Several data mining
processes have been proposed: CRISP-DM, SEMMA, and KDD.

2. Why do you think the early phases (understanding of the business and
understanding of the data) take the longest in data mining projects?

Students should explain that the early steps are the most unstructured phases
because they involve learning. Those phases (learning/understanding) cannot be
automated. Extra time and effort are needed upfront because any mistake in
understanding the business or data will most likely result in a failed BI project.

3. List and briefly define the phases in the CRISP-DM process.

CRISP-DM provides a systematic and orderly way to conduct data mining
projects. This process has six steps: (1) business understanding and (2) data
understanding, which are developed concurrently; (3) data preparation, in which
data are prepared for modeling; (4) model building; (5) testing and evaluation
of the model results; and (6) deployment of the models for regular use.

4. What are the main data preprocessing steps? Briefly describe each step and
provide relevant examples.

Data preprocessing is essential to any successful data mining study. Good data
leads to good information; good information leads to good decisions. Data
preprocessing includes four main steps (listed in Table 5.1 on page 209):
• data consolidation: access, collect, select, and filter data
• data cleaning: handle missing data, reduce noise, fix errors
• data transformation: normalize the data, aggregate data, construct new
attributes
• data reduction: reduce number of attributes and records; balance skewed
data
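
As a rough illustration of how the four steps fit together, the sketch below runs them in order on a handful of toy records. All field names and values are invented, and each step uses one simple technique chosen for this example (mean imputation, min-max normalization); real projects choose among many alternatives.

```python
from statistics import mean

# Toy records from two hypothetical sources (all names and values are invented).
store_sales = [{"id": 1, "age": 34, "spend": 120.0},
               {"id": 2, "age": None, "spend": 80.0}]
web_sales = [{"id": 3, "age": 51, "spend": 200.0},
             {"id": 4, "age": 45, "spend": -1.0}]   # -1.0 is an entry error

# 1. Consolidation: collect the records and select the fields of interest.
records = [{"age": r["age"], "spend": r["spend"]} for r in store_sales + web_sales]

# 2. Cleaning: drop records with invalid spend, impute missing ages with the mean.
records = [r for r in records if r["spend"] >= 0]
known_ages = [r["age"] for r in records if r["age"] is not None]
for r in records:
    if r["age"] is None:
        r["age"] = mean(known_ages)

# 3. Transformation: min-max normalize spend to the range [0, 1].
lo = min(r["spend"] for r in records)
hi = max(r["spend"] for r in records)
for r in records:
    r["spend_norm"] = (r["spend"] - lo) / (hi - lo)

# 4. Reduction: keep only the attributes the model will use.
reduced = [{"age": r["age"], "spend_norm": r["spend_norm"]} for r in records]
print(reduced)
```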

5. In the CRISP-DM process, what does ‘data reduction’ mean?

In the CRISP-DM process, data reduction means reduction of records and/or
variables. In real-world business problems the data sets are huge and the number
of variables is large. To keep the data manageable, the variables (usually
columns in a database) and the records (usually rows in the database) need to
be reduced for the purposes of a data mining project. Each
variable can be considered a dimension of the problem, so a large
number of variables produces greater complexity for data mining. Likewise,
it is often infeasible to deal with millions of records, so a number
of sampling methods are used to select part of a large data set for data mining.
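
A minimal sketch of both kinds of reduction, assuming simple random sampling for the records and a hand-picked list of variables to keep. The function name reduce_data, the column names, and the generated data are hypothetical.

```python
import random


def reduce_data(rows, keep_columns, sample_size, seed=0):
    """Reduce variables (columns) and records (rows) of a list-of-dicts data set."""
    rng = random.Random(seed)            # fixed seed so the sample is repeatable
    # Simple random sample without replacement reduces the number of records.
    sample = rng.sample(rows, min(sample_size, len(rows)))
    # Projecting onto keep_columns reduces the number of variables.
    return [{c: row[c] for c in keep_columns} for row in sample]


# Hypothetical data set: 10,000 records with four variables each.
rows = [{"id": i, "age": 20 + i % 50, "region": i % 4, "spend": i * 0.5}
        for i in range(10_000)]

subset = reduce_data(rows, keep_columns=["age", "spend"], sample_size=500)
print(len(subset), sorted(subset[0]))
```

Simple random sampling is only one option; stratified sampling is a common alternative when some classes are rare.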

Section 5.5 Review Questions

1. Identify at least three of the main data mining methods.

Classification learns patterns from past data (a set of information—traits,
variables, features—on characteristics of the previously labeled items, objects, or
events) in order to place new instances (with unknown labels) into their respective
groups or classes. The objective of classification is to analyze the historical data
stored in a database and automatically generate a model that can predict future
behavior.

Cluster analysis is an exploratory data analysis tool for solving classification
problems. The objective is to sort cases (e.g., people, things, events) into groups,
or clusters, so that the degree of association is strong among members of the same
cluster and weak among members of different clusters.

Association rule mining is a popular data mining method that is commonly used
as an example to explain what data mining is and what it can do to a
technologically less savvy audience. Association rule mining aims to find
interesting relationships (affinities) between variables (items) in large databases.

2. In the final step of data processing, how would data reduction facilitate decision
analysis? Give an example.

Data reduction is the final step of data processing. Data reduction can help in
gaining insight into the data set and focusing on the interconnection of key
variables. If the randomly selected sample fully represents the data set, the
decision analysis and the interpretation of the processed data will be
straightforward. For example, in an organ transplantation data set, a single
binary variable can be used to show the blood type match (1) or no match (0)
instead of using multiple nominal values to represent the blood types of the
donor and the recipient. Such simplification can
increase the information content towards a knowledge pattern while reducing the
complexity of the data relationships.
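
The blood type example can be sketched as follows. The function and field names are hypothetical, and the sketch assumes that "match" simply means identical type labels; real donor-recipient matching follows clinical compatibility rules that are not modeled here.

```python
def blood_type_match(donor, recipient):
    """Return 1 if donor and recipient blood types are identical, else 0.

    Assumption: "match" means identical type labels; clinical compatibility
    rules are deliberately left out of this sketch.
    """
    return 1 if donor.strip().upper() == recipient.strip().upper() else 0


# Invented cases, each with two multi-valued nominal attributes.
cases = [
    {"donor": "A+", "recipient": "A+"},
    {"donor": "O-", "recipient": "B+"},
]

# Replace the two nominal attributes with one binary attribute.
reduced = [{"blood_match": blood_type_match(c["donor"], c["recipient"])}
           for c in cases]
print(reduced)
```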

3. List and briefly define at least two classification techniques.

• Decision tree analysis. Decision tree analysis (a machine-learning technique) is
arguably the most popular classification technique in the data mining arena.
• Statistical analysis. Statistical classification techniques include logistic
regression and discriminant analysis, both of which make the assumptions that
the relationships between the input and output variables are linear in nature,
the data is normally distributed, and the variables are not correlated and are
independent of each other.
• Case-based reasoning. This approach uses historical cases to recognize
commonalities in order to assign a new case into the most probable category.
• Bayesian classifiers. This approach uses probability theory to build classification
models based on the past occurrences that are capable of placing a new
instance into a most probable class (or category).
• Genetic algorithms. The use of the analogy of natural evolution to build directed
search-based mechanisms to classify data samples.
• Rough sets. This method takes into account the partial membership of class
labels to predefined categories in building models (collection of rules) for
classification problems.

4. What are some of the criteria for comparing and selecting the best classification
technique?

• The amount and availability of historical data
• The types of data: categorical, interval, ratio, etc.
• What is being predicted: class or numeric value
• The purpose or objective

5. Briefly describe the general algorithm used in decision trees.

A general algorithm for building a decision tree is as follows:

1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the data into
mutually exclusive (nonoverlapping) subsets along the lines of the specific
split and move to the branches.
4. Repeat steps 2 and 3 for each and every leaf node until the stopping criterion is
reached (e.g., the node is dominated by a single class label).
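
The four steps above can be sketched as a short recursive procedure. This is an illustrative sketch, not the textbook's implementation: it assumes categorical attributes, picks the split with the lowest weighted Gini impurity (one of several possible criteria), and stops when a node is pure or no attributes remain. The churn data and all names are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_attribute(rows, attributes, target):
    """Step 2: pick the attribute whose split gives the lowest weighted impurity."""
    def weighted_impurity(attr):
        total = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            total += len(subset) / len(rows) * gini(subset)
        return total
    return min(attributes, key=weighted_impurity)


def build_tree(rows, attributes, target):
    """Steps 1-4: split recursively until a node is pure or attributes run out."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:       # stopping criterion
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class label
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {r[attr] for r in rows}:             # step 3: one branch per value
        subset = [r for r in rows if r[attr] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {"attribute": attr, "branches": branches}


def classify(tree, row):
    """Follow branches until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical churn data, invented for illustration.
data = [
    {"contract": "monthly", "support_calls": "many", "churn": "yes"},
    {"contract": "monthly", "support_calls": "few", "churn": "no"},
    {"contract": "yearly", "support_calls": "many", "churn": "no"},
    {"contract": "yearly", "support_calls": "few", "churn": "no"},
]
tree = build_tree(data, ["contract", "support_calls"], "churn")
print(classify(tree, {"contract": "monthly", "support_calls": "many"}))
```

Note that classify raises a KeyError for attribute values never seen in training; production tools handle unseen values and also prune the tree to limit overfitting.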

6. Define Gini index. What does it measure?

The Gini index and information gain (entropy) are two popular ways to determine
branching choices in a decision tree. The Gini index measures the impurity of a
sample: if everything in a sample belongs to one class, the Gini index value is
zero, and the value rises as the classes become more evenly mixed.
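
For reference, the usual formula is Gini = 1 - (p1^2 + p2^2 + ... + pk^2), where each p is the proportion of the sample belonging to one class. A small sketch follows; the function name gini_index is chosen here for illustration.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a sample: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        raise ValueError("sample must not be empty")
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini_index(["yes", "yes", "yes", "yes"])    # all one class
mixed = gini_index(["yes", "yes", "no", "no"])     # two classes, evenly mixed
print(pure, mixed)
```

The pure sample scores 0.0 and the evenly mixed two-class sample scores 0.5, the maximum for two classes.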

7. Give examples of situations in which cluster analysis would be an appropriate
data mining technique.

Cluster algorithms are used when the data records do not have predefined class
identifiers (i.e., it is not known to what class a particular record belongs).

8. What is the major difference between cluster analysis and classification?

Classification methods learn from previous examples containing inputs and the
resulting class labels, and once properly trained they are able to classify future
cases. Clustering partitions pattern records into natural segments or clusters.

9. What are some of the methods for cluster analysis?

The most commonly used clustering algorithms are k-means and self-organizing
maps.
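
To make k-means concrete, here is a minimal sketch for two-dimensional points. It assumes the caller supplies the initial centroids (real tools usually pick them at random or with a seeding heuristic), and the data points are invented.

```python
def k_means(points, centroids, iterations=10):
    """Basic k-means on 2-D points, starting from the given initial centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)     # keep an empty cluster's centroid
        if new_centroids == centroids:        # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The number of clusters k is fixed by the number of initial centroids, which is why choosing k is a separate decision in cluster analysis.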

10. Give examples of situations in which association would be an appropriate data
mining technique.

Association rule mining is appropriate to use when the objective is to discover
two or more items (or events or concepts) that go together. Students’ answers will
differ.
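
One way to make "items that go together" concrete is with the standard support and confidence measures. The sketch below checks item pairs only; it is not the full Apriori algorithm, which also grows larger itemsets level by level. The transactions and thresholds are invented for illustration.

```python
from itertools import combinations

# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rules(min_support=0.4, min_confidence=0.7):
    """Yield (antecedent, consequent, support, confidence) for item pairs."""
    items = sorted(set().union(*transactions))
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue                      # pair is too rare to be interesting
        for x, y in ((a, b), (b, a)):
            # Confidence of x -> y: share of x's transactions that also contain y.
            confidence = pair_support / support({x})
            if confidence >= min_confidence:
                yield x, y, pair_support, confidence


for x, y, s, c in rules():
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```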

Section 5.6 Review Questions

1. What are the most popular commercial data mining tools?

Examples of these vendors include IBM (IBM SPSS Modeler), SAS (Enterprise
Miner), StatSoft (Statistica Data Miner), KXEN (Infinite Insight), Salford
(CART, MARS, TreeNet, RandomForest), Angoss (KnowledgeSTUDIO,
KnowledgeSeeker), and Megaputer (PolyAnalyst). Most of the more popular tools
are developed by the largest statistical software companies (SPSS, SAS, and
StatSoft).

2. Why do you think the most popular tools are developed by statistics companies?

Data mining techniques involve the use of statistical analysis and modeling, so
data mining is a natural extension of these companies’ business offerings.

3. What are the most popular free data mining tools?

Probably the most popular free and open source data mining tool is Weka. Others
include RapidMiner and Microsoft’s SQL Server.

4. What are the main differences between commercial and free data mining software
tools?

The main difference between commercial tools, such as Enterprise Miner and
Statistica, and free tools, such as Weka and RapidMiner, is computational
efficiency. The same data mining task involving a rather large dataset may take
much longer to complete with the free software, and in some cases it may not
even be feasible (e.g., the tool may crash because of inefficient use of computer
memory).

5. What would be your top five selection criteria for a data mining tool? Explain.

Students’ answers will differ. Criteria they are likely to mention include cost,
user-interface, ease-of-use, computational efficiency, hardware compatibility,
type of business problem, vendor support, and vendor reputation.

Section 5.7 Review Questions

1. What are the privacy issues in data mining?

Data that is collected, stored, and analyzed in data mining often contains
information about real people. This includes identification, demographic,
financial, personal, and behavioral information. Most of these data can be
accessed through some third-party data providers. In order to maintain the privacy
and protection of individuals’ rights, data mining professionals have ethical (and
often legal) obligations.

2. How do you think the discussion between privacy and data mining will progress?
Why?

As technology advances and more information about people becomes easier to
get, the privacy debate will adjust accordingly. People’s expectations about
privacy will become tempered by their desires for the benefits of data mining,
from individualized customer service to higher security. As with all issues of
social import, the privacy issue will include social discourse, legal and legislative
decisions, and corporate decisions. The fact that companies often choose to
self-regulate (e.g., by ensuring their data is de-identified) implies that we may as a
society be able to find a happy medium between privacy and data mining.
(Answers will vary by student.)

3. What are the most common myths about data mining?

• Data mining provides instant, crystal-ball predictions.
• Data mining is not yet viable for business applications.
• Data mining requires a separate, dedicated database.
• Only those with advanced degrees can do data mining.
• Data mining is only for large firms that have lots of customer data.

4. What do you think are the reasons for these myths about data mining?

Students’ answers will differ. Some answers might relate to fear of analytics, fear
of the unknown, or fear of looking dumb.

5. What are the most common data mining mistakes/blunders? How can they be
minimized and/or eliminated?

• Selecting the wrong problem for data mining.
• Ignoring what your sponsor thinks data mining is and what it really can
and cannot do.
• Leaving insufficient time for data preparation. It takes more effort than
one often expects.
• Looking only at aggregated results and not at individual records.
• Being sloppy about keeping track of the mining procedure and results.
• Ignoring suspicious findings and quickly moving on.
• Running mining algorithms repeatedly and blindly. It is important to think
hard enough about the next stage of data analysis. Data mining is a very
hands-on activity.
• Believing everything you are told about data.
• Believing everything you are told about your own data mining analysis.
• Measuring your results differently from the way your sponsor measures
them.
Ways to minimize these risks are basically the reverse of these items.

ANSWERS TO APPLICATION CASE QUESTIONS FOR DISCUSSION  


Application Case 5.1: Smarter Insurance: Infinity P&C Improves Customer Service
and Combats Fraud with Predictive Analytics

1. How did Infinity P&C improve customer service with data mining?

One out of five claims is fraudulent. Rather than putting all five customers
through an investigatory process, SPSS helps Infinity ‘fast-track’ four of them and
close their cases within a matter of days. This results in much happier customers,
contributes to a more efficient workflow with improved cycle times, and improves
retention due to an overall better claims experience.

2. What were the challenges, the proposed solution, and the obtained results?

Infinity P&C faces a significant challenge in recognizing fraudulent claims. Fraud
represents a $20 billion exposure to the insurance industry and in certain venues
could be an element in around 40 percent of claims.

The solution involved using IBM SPSS predictive analytics tools. Starting from “red
flag” claims, the company used these tools to develop rules for rating and
identifying potential fraud. A key benefit of the IBM SPSS system is its ability
to continually analyze and score claims, which helps ensure that each claim gets
to the right adjuster at the right time.

As a result of implementing the IBM SPSS analytics tools, Infinity P&C has
doubled the accuracy of its fraud identification, contributing to a return on
investment of 403 percent per a Nucleus Research study. With SPSS, Infinity
P&C has reduced SIU referral time from an average of 45–60 days to
approximately 1–3 days. Predictive analytics also helped with subrogation, the
process of collecting damages from the at-fault driver’s insurance company.

3. What was their implementation strategy? Why is it important to produce results as
early as possible in data mining studies?

The Special Investigative Unit (SIU) started with low-hanging fruit, applying data
mining to construct discovery rules from “red flag” claims. They then
leveraged the company’s credit-rating approach to devise a similar
scoring mechanism for claims (higher scores indicate greater chance of fraud).
They added rules to the SPSS rule-base to help flag potential subrogation claims.
This incremental approach to implementation works well with a tool like SPSS.

Application Case 5.2: Harnessing Analytics to Combat Crime: Predictive Analytics
Helps Memphis Police Department Pinpoint Crime and Focus Police Resources

1. How did the Memphis Police Department use data mining to better combat crime?

They started mining MPD’s crime data banks to help zero in on where and when
criminals were hitting hardest. This began as a pilot program called Operation
Blue CRUSH, or Crime Reduction Utilizing Statistical History. Their use of data
mining enabled them to focus police resources intelligently by putting them in the
right place, on the right day, at the right time. Today, the MPD is continuing to
explore new ways to exploit statistical analysis in its crime-fighting mission.

2. What were the challenges, the proposed solution, and the obtained results?

Crime across the metro area was surging, and city leaders were growing
impatient. The solution, a project called Blue CRUSH, involves IBM SPSS
Modeler, which enables officers to unlock the intelligence hidden in the
department’s huge digital library of crime records and police reports going back
nearly a decade. This has put a serious dent in Memphis area crime. Since the
program was launched, the number of Part One crimes—a category of serious
offenses including homicide, rape, aggravated assault, auto theft, and larceny—
has plummeted, dropping 27 percent from 2006 to 2010.

Application Case 5.3: A Mine on Terrorist Funding

1. How can data mining be used to fight terrorism? Comment on what else can be
done beyond what is covered in this short application case.

The application case discusses use of data mining to detect money laundering and
other forms of terrorist financing. Other applications could be to track the
behavior and movement of potential terrorists, as well as text mining emails,
blogs, and social media threads.

2. Do you think that, although data mining is essential for fighting terrorist cells, it
also jeopardizes individuals’ rights to privacy?

Yes, because it inevitably involves tracking personal and financial data of
individuals. (As an opinion question, students’ answers will vary.)

Application Case 5.4: Data Mining in Cancer Research

1. How can data mining be used for ultimately curing illnesses like cancer?

Even though cancer research has traditionally been clinical and biological in
nature, in recent years data-driven analytic studies have become a common
complement. In medical domains where data- and analytics-driven research have
been applied successfully, novel research directions have been identified to
further advance the clinical and biological studies. Using data mining techniques,
researchers are able to identify novel patterns, paving the road toward a cancer-
free society.

2. What do you think are the promises and major challenges for data miners in
contributing to medical and biological research endeavors?

According to the American Cancer Society, half of all men and one-third of all
women in the United States will develop cancer during their lifetimes;
approximately 1.5 million new cancer cases will be diagnosed in 2013. Cancer is
the second most common cause of death in the United States and in the world,
exceeded only by cardiovascular disease. Data mining shows tremendous promise
for helping to understand cancer, leading to better treatment and saved lives. Data
mining is not meant to replace medical professionals and researchers, but to
complement their invaluable efforts to provide data-driven new research
directions and to ultimately save more lives. Without the cooperation and
feedback from the medical experts, data mining results are not of much use. The
patterns found via data mining methods should be evaluated by medical
professionals who have years of experience in the problem domain to decide
whether they are logical, actionable, and novel to warrant new research directions.

Application Case 5.5: 2degrees Gets a 1275 Percent Boost in Churn Identification

1. What does 2degrees do? Why is it important for 2degrees to accurately identify
churn?

2degrees is New Zealand’s fastest growing mobile telecommunications company.
Customer churn (customers leaving) is an all-too-common problem in the mobile
telecommunications industry. So, it makes sense that they would be interested in
identifying customers most at risk of churning.

2. What were the challenges, the proposed solution, and the obtained results?

Their main challenge was to identify the most likely churn customers. For this,
they used 11Ants Customer Churn Analyzer. This tool allowed them to get up and
running quickly and very economically. They could test the water and determine
what the ROI was likely to be for predictive analytics. A carefully controlled
experiment was run over a period of 3 months. Customers identified as churners
were 1275 percent more likely to be churners than customers chosen at random.
The benefits to 2degrees and their customers included identification and
appropriateness, so that likely churners would get more intense marketing than
unlikely churners.
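
The 1275 percent figure is a lift-style comparison: the churn rate among flagged
customers relative to the churn rate among customers chosen at random. A minimal
Python sketch of that arithmetic follows. The function name and the counts are
hypothetical, chosen only to make the calculation concrete; they are not
2degrees' data.

```python
def percent_more_likely(flagged_churners, flagged_total, random_churners, random_total):
    """Return how much more likely (in percent) flagged customers are to churn
    than customers chosen at random."""
    flagged_rate = flagged_churners / flagged_total
    base_rate = random_churners / random_total
    lift = flagged_rate / base_rate          # ratio of the two churn rates
    return (lift - 1.0) * 100.0              # "percent more likely"

# Hypothetical counts: 275 of 1,000 flagged customers churn, versus 20 of 1,000
# customers picked at random.
print(round(percent_more_likely(275, 1000, 20, 1000)))  # 1275 with these made-up counts
```

Note that "1275 percent more likely" corresponds to a lift of 13.75, not 12.75,
because the baseline itself counts as 100 percent.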

3. How can data mining help in identifying customer churn? How do some
companies do it without using data mining tools and techniques?

Identifying customers most at risk of churning is done by analyzing data such as
time on network, days since last top-up, activation channel, whether the customer
ported their number or not, customer plan, and outbound calling behaviors over
the preceding 90 days. Companies without data mining cannot score customers as
often; previously 2degrees would do this once per month, but now they can do it
daily if they want.

Application Case 5.6: Data Mining Goes to Hollywood: Predicting Financial Success
of Movies

1. Why is it important for Hollywood professionals to predict the financial success
of movies?

The movie industry is the “land of hunches and wild guesses” due to the difficulty
associated with forecasting product demand, making the movie business in
Hollywood a risky endeavor. If Hollywood could better predict financial success,
this would mitigate some of the financial risk.

2. How can data mining be used to predict the financial success of movies before the
start of their production process?

The way Sharda and Delen did it was to use data from movies between 1998 and
2005 as training data, and movies of 2006 as test data. They applied individual
and ensemble prediction models, and were able to identify significant variables
impacting financial success. They also showed that by using sensitivity analysis,
decision makers can predict with fairly high accuracy how much value a specific
actor (or a specific release date, or the addition of more technical effects, etc.)
brings to the financial success of a film, making the underlying system an
invaluable decision aid.
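
The important design point in this setup is that the split is by release year
rather than random, so the models are tested on movies they could not have seen
during training. A minimal sketch of such a split, assuming hypothetical records
with a "year" field (the titles, field names, and function name are illustrative
and not taken from Sharda and Delen's data set):

```python
def split_by_year(movies, train_start=1998, train_end=2005, test_year=2006):
    """Split records into training and test sets by release year."""
    train = [m for m in movies if train_start <= m["year"] <= train_end]
    test = [m for m in movies if m["year"] == test_year]
    return train, test

# Hypothetical records for illustration only.
movies = [
    {"title": "A", "year": 1999},
    {"title": "B", "year": 2005},
    {"title": "C", "year": 2006},
    {"title": "D", "year": 2007},  # outside both windows, so it is dropped
]
train, test = split_by_year(movies)
print([m["title"] for m in train], [m["title"] for m in test])
```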

3. How do you think Hollywood performed, and perhaps is still performing, this task
without the help of data mining tools and techniques?

Most of it is done by gut feel and trial and error. This may keep the movie business
a financially risky endeavor, but it also allows for creativity. Sometimes uncertainty
is a good thing.

Application Case 5.7: Predicting Customer Buying Patterns—The Target Story

1. What do you think about data mining and its implications concerning privacy?
What is the threshold between knowledge discovery and privacy infringement?

There is a tradeoff between knowledge discovery and privacy rights. Retailers
should be sensitive about this when targeting their advertising based on data
mining results, especially regarding topics that could be embarrassing to their
customers. Otherwise they risk offending these customers, which could hurt their
bottom line. (Answers will vary by student.)

2. Did Target go too far? Did they do anything illegal? What do you think they
should have done? What do you think they should do now (quit these types of
practices)?

Target might have made a tactical mistake, but they certainly didn’t do anything
illegal. They did not use any information that violates customer privacy; rather,
they used transactional data that most every other retail chain is collecting and
storing (and perhaps analyzing) about their customers. Indeed, even the father
apologized when realizing his daughter was actually pregnant. The fact is, we live
in a world of massive data, and we are all as consumers leaving traces of our
buying behavior for anyone to see. (Answers will vary by student.)

ANSWERS TO END OF CHAPTER QUESTIONS FOR DISCUSSION   

1. Define data mining. Why are there many names and definitions for data mining?

Data mining is the process through which previously unknown patterns in data
are discovered. Another definition would be “a process that uses statistical,
mathematical, and artificial intelligence techniques to extract and identify useful
information and subsequent knowledge from large sets of data.” This includes
most types of automated data analysis. A third definition: Data mining is the
process of finding mathematical patterns from (usually) large sets of data; these
can be rules, affinities, correlations, trends, or prediction models.

Data mining has many definitions because it’s been stretched beyond those limits
by some software vendors to include most forms of data analysis in order to
increase sales using the popularity of data mining.

2. What are the main reasons for the recent popularity of data mining?

Following are some of the most pronounced reasons:

• More intense competition at the global scale driven by customers’ ever-changing
needs and wants in an increasingly saturated marketplace.
• General recognition of the untapped value hidden in large data sources.
• Consolidation and integration of database records, which enables a single view
of customers, vendors, transactions, etc.
• Consolidation of databases and other data repositories into a single location in
the form of a data warehouse.
• The exponential increase in data processing and storage technologies.
• Significant reduction in the cost of hardware and software for data storage and
processing.
• Movement toward the de-massification (conversion of information resources
into nonphysical form) of business practices.

3. Discuss what an organization should consider before making a decision to
purchase data mining software.

Before making a decision to purchase data mining software, organizations should
consider the standard criteria to use when investing in any major software:
cost/benefit analysis, people with the expertise to use the software and perform
the analyses, availability of historical data, and a business need for the data
mining software.

4. Distinguish data mining from other analytical tools and techniques.

Students can view the answer in Figure 5.1 (p. 193), which shows that data
mining is a composite or blend of multiple disciplines or analytical tools and
techniques.

5. Discuss the main data mining methods. What are the fundamental differences
among them?

Three broad categories of data mining methods are prediction (classification or
regression), clustering, and association.

Prediction is the act of telling about the future. It differs from simple guessing by
taking into account the experiences, opinions, and other relevant information in
conducting the task of foretelling. A term that is commonly associated with
prediction is forecasting. Even though many believe that these two terms are
synonymous, there is a subtle but critical difference between the two. Whereas
prediction is largely experience and opinion based, forecasting is data and model
based.

Classification is analyzing the historical behavior of groups of entities with
similar characteristics, to predict the future behavior of a new entity from its
similarity to those groups.

Clustering is finding groups of entities with similar characteristics.

Association is establishing relationships among items that occur together.

The fundamental differences are:

• Prediction (classification or regression) predicts future cases or conditions
based on historical data.
• Clustering partitions pattern records into natural segments or clusters.
Each segment’s members share similar characteristics.
• Association is used to discover two or more items (or events or concepts)
that go together.
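
To make the association idea concrete, the sketch below counts how often two
items appear in the same transaction and reports support and confidence, two
commonly used measures of association-rule quality. This is an illustration
only: the baskets are made up, and the function name is hypothetical.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent in t)
    with_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = with_both / n                       # share of all baskets with both items
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence

# Made-up market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
support, confidence = support_and_confidence(baskets, "bread", "butter")
print(support, confidence)
```

Here bread and butter appear together in two of four baskets, and butter
appears in two of the three baskets that contain bread.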

6. What are the main data mining application areas? Discuss the commonalities of
these areas that make them a prospect for data mining studies.

Applications are listed near the beginning of section 5.3: CRM, banking, retailing
and logistics, manufacturing and production, brokerage and securities trading,
insurance, computer hardware and software, government and defense, travel,
healthcare, medicine, entertainment, homeland security and law enforcement, and
sports.

The commonalities are the need for predictions and forecasting for planning
purposes and to support decision making.

7. Why do we need a standardized data mining process? What are the most
commonly used data mining processes?

In order to systematically carry out data mining projects, a general process is
usually followed. Similar to other information systems initiatives, a data mining
project must follow a systematic project management process to be successful.
Several data mining processes have been proposed: CRISP-DM, SEMMA, and
KDD.

8. Explain how the validity of available data affects model development and
assessment.

Validity of available data is crucial for appropriate model development and
assessment. Given a valid, large data set, a sample is randomly selected so that
it truly represents the whole data set. A model is then built and developed
based on the data sample to address the business needs. The model is assessed
for accuracy and generality. The results of model evaluation and the proper
interpretation of knowledge patterns depend fully on the validity of the
sampled data used for model building.

9. In what step of data processing should missing data be distinguished? How will it
affect data cleaning and transformation?

Missing data must be identified in the data cleaning stage and filled in with
the most appropriate values. Data transformation cannot be completed with
missing data because each variable involved must have a complete range of
(non-missing) values before a mathematical function can be applied to it to
create new, informative variables.

10. Why do we need data preprocessing? What are the main tasks and relevant
techniques used in data preprocessing?

Data preprocessing is essential to any successful data mining study. Good data
leads to good information; good information leads to good decisions. Data
preprocessing includes four main steps (listed in Table 5.1 on page 209):
• data consolidation: access, collect, select and filter data
• data cleaning: handle missing data, reduce noise, fix errors
• data transformation: normalize the data, aggregate data, construct new
attributes
• data reduction: reduce number of attributes and records; balance skewed
data
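
The four steps above can be sketched on a toy data set. This is a minimal
illustration under stated assumptions, not a prescribed procedure: the record
layout, the field names ("id", "income", "notes"), the choice of median
imputation, and the function name are all hypothetical, and the sketch assumes
at least one record has a known income.

```python
from statistics import median

def preprocess(source_a, source_b):
    # 1. Consolidation: collect records from both sources, keep those with an id.
    records = [dict(r) for r in source_a + source_b if r.get("id") is not None]

    # 2. Cleaning: fill missing income with the median of the known values.
    known = [r["income"] for r in records if r.get("income") is not None]
    fill = median(known)
    for r in records:
        if r.get("income") is None:
            r["income"] = fill

    # 3. Transformation: min-max normalize income into [0, 1].
    lo = min(r["income"] for r in records)
    hi = max(r["income"] for r in records)
    for r in records:
        r["income_norm"] = (r["income"] - lo) / (hi - lo) if hi > lo else 0.0

    # 4. Reduction: keep only the attributes needed for modeling.
    return [{"id": r["id"], "income_norm": r["income_norm"]} for r in records]

# Made-up records from two hypothetical sources.
a = [{"id": 1, "income": 40000, "notes": "x"}, {"id": 2, "income": None}]
b = [{"id": 3, "income": 60000}, {"id": None, "income": 10}]
print(preprocess(a, b))
```

The record without an id is filtered out during consolidation, so it does not
influence the imputed median or the normalization range.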

11. Distinguish between an inconsistent value and missing value. When can
inconsistent values be removed from the data set and missing data be ignored?

A missing value is a natural part of a data set and needs to be imputed with the
most appropriate value. In contrast, an inconsistent value is an unusual value
of a variable in the data set that should be handled using domain knowledge
and/or expert opinions.

12. What is the main difference between classification and clustering? Explain using
concrete examples.

Classification learns patterns from past data (a set of information—traits,
variables, features—on characteristics of the previously labeled items, objects, or
events) in order to place new instances (with unknown labels) into their respective
groups or classes. The objective of classification is to analyze the historical data
stored in a database and automatically generate a model that can predict future
behavior. Classifying customer-types as likely to buy or not buy is an example.

Cluster analysis is an exploratory data analysis tool for solving classification
problems. The objective is to sort cases (e.g., people, things, events) into groups,
or clusters, so that the degree of association is strong among members of the same
cluster and weak among members of different clusters. Customers can be
grouped according to demographics.
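
The contrast can be shown on one small data set: classification needs labels
from past cases, while clustering works on unlabeled values. The sketch below
uses a one-nearest-neighbor rule for classification and a simple two-group,
one-dimensional k-means for clustering. Both techniques are stand-ins chosen
for brevity, and the spending figures and labels are made up.

```python
def classify_1nn(labeled, x):
    """Classification: predict the label of x from the nearest labeled example."""
    nearest = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def cluster_two_groups(values, iterations=10):
    """Clustering: split unlabeled values into two groups (1-D k-means, k=2)."""
    centers = [min(values), max(values)]   # deterministic starting centers
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

# Made-up monthly spend; buy / no-buy labels are known only for past customers.
past = [(20, "no-buy"), (25, "no-buy"), (80, "buy"), (95, "buy")]
print(classify_1nn(past, 70))
print(cluster_two_groups([20, 25, 30, 80, 90, 95]))
```

The classifier returns one of the predefined labels, whereas the clustering
routine only returns groups, which an analyst must then interpret and name.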

13. Discuss the common ways to deal with missing data. How is missing data
dealt with in a survey where some respondents have not replied to a few
questions?

There are several techniques such as imputation methods (mean, median,
min/max, mode), which can help estimate and allocate appropriate values to the
missing values. The responses to the survey questions would reflect the values of
corresponding variables, which are investigated in the survey. If the missing
values occur completely at random, the few individuals with incomplete data
may not affect the prospective knowledge pattern and removing them from the
analysis could be justified. Otherwise, the missing values must be estimated by an
appropriate imputation method.
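
Both options described above, imputing a value or removing incomplete
respondents, can be sketched with the Python standard library. The function
names and the survey responses are hypothetical, and missing answers are
assumed to be recorded as None.

```python
from statistics import mean, median, mode

# Imputation strategies named in the answer above.
STRATEGIES = {"mean": mean, "median": median, "mode": mode, "min": min, "max": max}

def impute(values, strategy="mean"):
    """Replace None entries with a summary of the observed values."""
    observed = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in values]

def drop_incomplete(rows):
    """Listwise deletion: keep only respondents who answered every question."""
    return [row for row in rows if all(v is not None for v in row)]

answers = [4, None, 5, 3, 4]   # made-up responses on a 1-5 scale
print(impute(answers, "mean"))
print(impute(answers, "max"))
print(drop_incomplete([[4, 5], [None, 3], [2, 2]]))
```

Deletion is the simpler choice but is defensible only when the values are
missing completely at random, as the answer notes.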

14. What are the privacy issues with data mining? Do you think they are
substantiated?

Data that is collected, stored, and analyzed in data mining often contains
information about real people. This includes identification, demographic,
financial, personal, and behavioral information. Most of these data can be
accessed through some third-party data providers. In order to maintain the privacy
and protection of individuals’ rights, data mining professionals have ethical (and
often legal) obligations. As time goes on, this will continue to be a public debate.

As technology advances and more information about people becomes easier to
get, the privacy debate will adjust accordingly. People’s expectations about
privacy will become tempered by their desires for the benefits of data mining,
from individualized customer service to higher security. As with all issues of
social import, the privacy issue will include social discourse, legal and legislative
decisions, and corporate decisions. The fact that companies often choose to self-
regulate (e.g., by ensuring their data is de-identified) implies that we may as a
society be able to find a happy medium between privacy and data mining.

15. What are the most common myths and mistakes about data mining?

• Data mining provides instant, crystal-ball predictions.
• Data mining is not yet viable for business applications.
• Data mining requires a separate, dedicated database.
• Only those with advanced degrees can do data mining.
• Data mining is only for large firms that have lots of customer data.
