Techniques for Predictive Modeling: Learning Objectives for Chapter 6
CHAPTER OVERVIEW
Predictive modeling is perhaps the most commonly practiced branch of data mining. It
allows decision makers to estimate what the future holds by means of learning from the
past. In this chapter, we study the internal structures, capabilities/limitations, and
applications of the most popular predictive modeling techniques, such as artificial neural
networks, support vector machines, and k-nearest neighbor. These techniques are capable
of addressing both classification- and regression-type prediction problems. Often, they
are applied to complex prediction problems where other techniques are not capable of
producing satisfactory results. In addition to the three covered in this chapter, other notable predictive modeling techniques include regression (linear or nonlinear), logistic regression (for classification-type prediction problems), naive Bayes (probabilistically oriented classification modeling), and various types of decision trees (covered in Chapter 5).
CHAPTER OUTLINE
6.2 BASIC CONCEPTS OF NEURAL NETWORKS
A. BIOLOGICAL AND ARTIFICIAL NEURAL NETWORKS
Technology Insights 6.1: The Relationship Between
Biological and Artificial Neural Networks
Application Case 6.1: Neural Networks Are Helping to
Save Lives in the Mining Industry
B. ELEMENTS OF ANN
1. Processing Elements
2. Network Structure
C. NETWORK INFORMATION PROCESSING
1. Inputs
2. Outputs
3. Connection Weights
4. Summation Function
5. Transformation (Transfer) Function
6. Hidden Layers
D. NEURAL NETWORK ARCHITECTURES
1. Kohonen’s Self-Organizing Feature Maps
2. Hopfield Networks
Application Case 6.2: Predictive Modeling Is Powering the
Power Generators
Section 6.2 Review Questions
1. Numericizing the Data
2. Normalizing the Data
3. Selecting the Kernel Type and Kernel Parameters
4. Deploying the Model
A. SUPPORT VECTOR MACHINES VERSUS ARTIFICIAL NEURAL
NETWORKS
Section 6.6 Review Questions
Chapter Highlights
Key Terms
Questions for Discussion
Exercises
Teradata University Network (TUN) and Other Hands-On Exercises
Team Assignments and Role-Playing Projects
Internet Exercises
End of Chapter Application Case: Coors Improves Beer
Flavors with Neural Networks
Questions for the Case
References
approach that emulates the human brain and that is the most frequently depicted in
popular media (for example, consider Data’s “brain” in the Star Trek series).
Where the book refers to artificial neurons “more or less resembling the structure
of” their biological counterparts in Section 6.2 (the first “content” section of the chapter),
students must understand that this does not mean a physical resemblance. An artificial
neuron is a computational construct, usually embodied in a software model. The ANN equivalent
of sending a signal to another biological neuron is placing data in a shared location or
sending a software message. The structural resemblance is functional, not physical. This
may seem obvious to an instructor, but is often not equally obvious to students seeing the
concept for the first time.
Be sure to discuss common and distinguishing features among the three approaches. Bring up the distinction between “classification” and “regression” problems,
as well as what it means to “train” a system in a supervised learning environment. Stress
the importance in all of these models of picking the right parameters, and discuss
approaches (such as cross-validation) for this. Students should be able to associate
“hyperplanes” with SVM, “similarity measures” with nearest-neighbor, and
“backpropagation” with neural networks. Also, a distinction should be clearly made
between feedforward and recurrent neural networks. Even if students can’t implement the
algorithms, knowing the terminology will help them conceptualize the different
approaches. Also, point out that in many of the application cases, multiple approaches can
be used. In fact, the opening vignette explicitly espouses the value of using and comparing several algorithms for accuracy and efficiency.
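For instructors who want a quick in-class demonstration of such a comparison, here is a minimal sketch in Python with scikit-learn (our choice of tooling here, not the book's; the data are synthetic and the parameter settings are illustrative only):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=1)

    models = {
        "ANN (MLP)": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=1),
        "SVM (RBF)": SVC(kernel="rbf"),
        "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    }
    for name, model in models.items():
        pipe = make_pipeline(StandardScaler(), model)  # scaling matters for all three
        scores = cross_val_score(pipe, X, y, cv=10)    # 10-fold cross-validation
        print(f"{name}: mean accuracy = {scores.mean():.3f}")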
Demand for healthcare services is increasing because of the aging population, but
the supply side is having problems keeping up with the level and quality of
service. Healthcare systems ought to significantly improve their operational
effectiveness (doing the right thing, such as diagnosing and treating accurately)
and efficiency (doing it the right way, such as using the least amount of resources
and time). Clinical decision support systems that use the outcome of data mining
studies can support healthcare managers and/or medical professionals in making
accurate and timely decisions to optimally allocate resources in order to increase
the quantity and quality of medical services.
2. What factors do you think are the most important in better understanding and
managing healthcare? Consider both managerial and clinical aspects of
healthcare.
Healthcare systems ought to significantly improve their operational effectiveness
(doing the right thing, such as diagnosing and treating accurately) and efficiency
(doing it the right way, such as using the least amount of resources and time).
Effectiveness is probably more of a clinical concern, while efficiency is more of a
managerial concern.
Clinical decision support systems that use the outcome of data mining studies
(such as the ones presented in this case study) are shown to be useful and
reasonably accurate predictors, especially if used in combination. These are not
meant to replace healthcare managers and/or medical professionals. Rather, they are intended to support them in making accurate and timely decisions to optimally
allocate resources in order to increase the quantity and quality of medical
services. There still is a long way to go before we can see these decision aids
being used extensively in healthcare practices. Among other factors, there are behavioral, ethical, and political reasons for this resistance to adoption. Perhaps the need for better healthcare systems, along with government incentives, will expedite adoption.
4. What were the outcomes of the study? Who can use these results? How can the
results be implemented?
The main outcome of this study was to show the efficacy of data mining in
predicting the outcome and in analyzing the prognostic factors of complex
medical procedures such as CABG surgery. The study showed that using a
number of prediction methods (as opposed to only one) in a competitive
experimental setting has the potential to produce better predictive as well as
explanatory results. SVM, ANN, and both C5 and CART decision trees were
used.
5. Search the Internet to locate two additional cases where predictive modeling is
used to understand and manage complex medical procedures.
1. What is an ANN?
2. Explain the following terms: neuron, axon, and synapse.
A neuron is defined in the book as one of the cells in the human brain. More
generally, and as students may have learned in biology courses, it is an
electrically excitable cell in the nervous system of an animal.
An axon is the long projection of a neuron that carries outgoing signals away from the cell body toward other neurons.

A synapse is a junction between two neurons through which signals pass from one neuron to the next, perhaps being altered en route.
Weights define the impact that a given input has on a neuron in the next layer. As
such, they embody what the network has learned so far. As a network learns, its
weights are adjusted.
The summation function determines the total input to a neuron by calculating the
weighted sum of its individual input values. Its output is input to the
transformation (transfer) function. The transformation (or transfer) function
determines the output of a neuron from its input (i.e., output of the summation
function).
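A tiny worked example often helps here. The following sketch (Python/NumPy; the inputs and weights are made up) computes one neuron's weighted sum and passes it through a sigmoid transfer function:

    import numpy as np

    # Toy neuron: three inputs and three connection weights (all hypothetical)
    x = np.array([1.0, 0.5, -1.0])     # inputs
    w = np.array([0.2, 0.8, 0.4])      # connection weights

    s = np.dot(w, x)                   # summation function: weighted sum = 0.2
    y = 1.0 / (1.0 + np.exp(-s))       # sigmoid transfer function: about 0.55
    print(s, y)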
5. What are the most common ANN architectures? How do they differ from each
other?
Kohonen’s self-organizing feature maps use unsupervised learning to project multidimensional input data onto a low-dimensional (typically two-dimensional) grid, clustering similar input patterns near one another. With Hopfield networks, by contrast, each neuron is connected to every other neuron within the network, so the network operates recurrently rather than in a single feedforward pass.
These are shown in the flowchart of Figure 6.9. After testing (step 8), it is
possible to return to a previous step:
Step 1: Collect data
Step 2: Separate (data) into training and test sets
Step 3: Define a network structure
Step 4: Select a learning algorithm
Step 5: Set parameters and values
Step 6: Initialize weights and start training
Step 7: Stop training, freeze the network weights
Step 8: Test the trained network
Step 9: Implementation: use the network with new cases
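If you want to show how these steps map onto code, this compressed sketch (Python/scikit-learn, with synthetic data; steps 1 and 9 are represented by stand-ins) walks through the process:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, random_state=0)       # 1: collect data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)  # 2: training/test sets
    net = MLPClassifier(hidden_layer_sizes=(8,),                    # 3: network structure
                        solver="adam",                              # 4: learning algorithm
                        learning_rate_init=0.01, max_iter=1000)     # 5: parameters and values
    net.fit(X_tr, y_tr)                # 6-7: train, then freeze the learned weights
    print(net.score(X_te, y_te))       # 8: test the trained network
    # 9: implementation -- call net.predict(...) on new cases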
2. What are some of the design parameters for developing a neural network?
Some design parameters to consider include the type of network to employ, the
number of nodes (input, hidden, and output) and layers, the types of
transformation functions within each neuron, the original weight settings, and the
acceptable delta (error) level.
Many commercial ANN software products function like software shells. They
provide a set of standard architectures, learning algorithms, and parameters, along
with the ability to manipulate the data. Some development tools can support up to
several dozen network paradigms and learning algorithms. Most of the leading
data mining tools (e.g., SAS Enterprise Miner, IBM SPSS Modeler, Statistica
Data Miner) include neural network learning algorithms. Some specialized neural
network products include California Scientific (BrainMaker), NeuralWare,
NeuroDimension Inc., Ward Systems Group (Neuroshell), and Megaputer. Others
are implemented as spreadsheet add-ins. In addition, there are class libraries and
APIs for languages such as Java and C++. Mathematical applications such as
MATLAB also include neural network algorithms.
After training and testing, the network is deployed for use on new, unknown cases.
It might be used as a stand-alone system or as part of another software system
where new input data will be presented to it and its output will be a recommended
decision. At this point the recommendations provided by the neural network are
considered to be valid, because it has been extensively trained on training data
and tested with test data.
Section 6.5 Review Questions
Support vector machines (SVM) are supervised learning methods that produce
input-output functions from a set of labeled training data. Both classification
functions and regression functions are possible in SVMs, and these can be either
linear or nonlinear functions. For example, given a classification-type prediction problem, linear classifiers (hyperplanes) can separate the data into multiple subsections, each representing one of the classes. If the data are represented in an n-dimensional input space, a separating hyperplane has n-1 dimensions.
SVMs are popular because of their superior predictive power and their theoretical
foundation. SVMs have demonstrated highly competitive performance in
numerous real-world prediction problems. A significant advantage of SVMs is
that while ANNs may suffer from multiple local minima, the solutions to SVMs
are global and unique. Two more advantages of SVMs are that they have a simple
geometric interpretation and give a sparse solution. One reason that SVMs often outperform ANNs in practice is that they deal successfully with the overfitting problem, which is a big issue with ANNs. One disadvantage of SVMs is the need to select the kernel type and kernel function parameters. A second, and perhaps more important, limitation is speed and size, in both the training and testing cycles. Model building in SVMs involves complex and time-demanding calculations. From a practical point of view, perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the quadratic programming required in large-scale tasks.
3. What is the meaning of “maximum margin hyperplanes”? Why are they important
in SVM?
Although many linear classifiers (hyperplanes) can separate the data into multiple
subsections, only one hyperplane achieves the maximum separation between the
classes. This is the hyperplane whose distance from the nearest data points is
maximized. The trick is to find the parallel hyperplanes that separate the classes
and whose margin of distance is at a maximum. The assumption is that the larger
the margin or distance between these parallel hyperplanes, the better the
generalization power of the classifier.
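To make the margin concrete, the following sketch (Python/scikit-learn, toy two-dimensional data; the large C value approximates a hard margin) fits a linear SVM and recovers the margin width 2/||w|| from the learned weight vector:

    import numpy as np
    from sklearn.svm import SVC

    # Toy, linearly separable two-dimensional data
    X = np.array([[1, 1], [2, 2], [2, 0], [0, 0], [1, 0], [0, 1]])
    y = np.array([1, 1, 1, 0, 0, 0])

    clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
    clf.fit(X, y)

    w = clf.coef_[0]                    # normal vector of the separating hyperplane
    print("margin width:", 2 / np.linalg.norm(w))
    print("support vectors:", clf.support_vectors_)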
The kernel trick is a method for converting a linear classifier algorithm into a
nonlinear one by using a nonlinear function to map the original observations into
a higher-dimensional space; this makes a linear classification in the new space
equivalent to nonlinear classification in the original space. This is what enables
the general hyperplane approach to SVMs (which are inherently linear) to solve
nonlinear classification problems. Common kernel types are polynomial, radial
basis function (RBF), Gaussian RBF, and sigmoid.
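A small numeric check can demystify the kernel trick. The sketch below (Python/NumPy, toy vectors) shows that a degree-2 polynomial kernel computed in the original two-dimensional space equals an ordinary dot product in an explicit three-dimensional feature space:

    import numpy as np

    def phi(v):
        # explicit feature map for the kernel K(x, z) = (x . z)**2 in two dimensions
        x1, x2 = v
        return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 1.0])

    print((x @ z) ** 2)       # kernel value computed in the original space: 25.0
    print(phi(x) @ phi(z))    # identical dot product in the mapped space: 25.0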
1. What are the main steps and decision points in developing a SVM model?
First, numericize the data. Each data instance must be represented as a vector of numeric values, which means converting categorical variables (including the class label) into numbers. Second,
normalize (scale) the data. This prevents larger-magnitude attributes from
dominating the others during the learning process. Next, select the kernel type and
the kernel parameters. Finally, deploy the model.
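These four steps can be demonstrated end to end. The sketch below (Python/scikit-learn; the column layout and all values are hypothetical) numericizes a categorical column, scales the numeric ones, trains an RBF SVM, and applies it:

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.svm import SVC

    # Hypothetical data: column 0 is categorical, columns 1-2 are numeric
    X = np.array([["a", 1.0, 200.0], ["b", 2.0, 100.0],
                  ["a", 3.0, 150.0], ["b", 0.5, 300.0]], dtype=object)
    y = np.array([0, 1, 0, 1])

    prep = ColumnTransformer([
        ("numericize", OneHotEncoder(), [0]),     # step 1: numericize the data
        ("normalize", StandardScaler(), [1, 2]),  # step 2: normalize (scale) the data
    ])
    model = Pipeline([("prep", prep),
                      ("svm", SVC(kernel="rbf", C=1.0))])  # step 3: kernel and parameters
    model.fit(X, y)
    print(model.predict(X))   # step 4: deploy the model on (new) cases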
2. How do you determine the optimal kernel type and kernel parameters?
You can do this experimentally, trying different ones out and comparing results.
Often the RBF is a good start. Deciding on the best parameters for a kernel
involves a parameter search method, such as cross-validation or grid search.
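A grid search is easy to demonstrate. The following sketch (Python/scikit-learn, synthetic data; the candidate values for C and gamma are illustrative only) cross-validates every combination and reports the best:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)

    # Candidate kernel parameters (illustrative values only)
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold CV per combination
    search.fit(X, y)
    print(search.best_params_, search.best_score_)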
A significant advantage of SVMs is that while ANNs may suffer from multiple
local minima, the solutions to SVMs are global and unique. Two more advantages
of SVMs are that they have a simple geometric interpretation and give a sparse
solution. The reason that SVMs often outperform ANNs in practice is that they
successfully deal with the overfitting problem, which is a big issue with ANNs.
4. What are the common application areas for SVM? Conduct a search on the
Internet to identify popular application areas and specific SVM software tools
used in those applications.
The k-nearest neighbor algorithm is among the simplest of all machine-learning
algorithms. It is easy to understand (and explain to others) what it does and how it
does it.
2. What are the advantages and disadvantages of kNN as compared to ANN and
SVM?
Compared to both ANN and SVM, the k-nearest neighbor algorithm is very simple to learn and implement. But it is a lazy learner: all computation is deferred until prediction time, so classifying new cases can be slow and memory intensive on large datasets. In addition, the accuracy of the kNN algorithm can differ significantly with different values of k. Furthermore, the predictive power of the kNN algorithm degrades in the presence of noisy, inaccurate, or irrelevant features.
One critical factor is selection of the best similarity metric for determining what is
a “nearest” neighbor. A second is the selection of the correct parameter (i.e., the k
value). This can be done using cross-validation.
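For example, the sketch below (Python/scikit-learn, synthetic data) tries several candidate k values under a Euclidean similarity metric and scores each by 10-fold cross-validation:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=300, random_state=0)

    for k in [1, 3, 5, 7, 9, 11]:
        knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
        score = cross_val_score(knn, X, y, cv=10).mean()   # 10-fold cross-validation
        print(f"k={k}: mean accuracy = {score:.3f}")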
1. How did neural networks help save lives in the mining industry?
The Council for Scientific and Industrial Research (CSIR) in South Africa
developed a device with an embedded neural network that assists any miner in
making an objective decision when determining the integrity of the hanging wall.
This helps prevent death and injury from rock falls, a common danger to miners.
2. What were the challenges, the proposed solution, and the obtained results?
In the mining industry, most of the underground injuries and fatalities are due to
rock falls (i.e., fall of hanging wall/roof). The method that has been used for many
years in the mines when determining the integrity of the hanging wall is to tap the
hanging wall with a sounding bar and listen to the sound emitted. An experienced
miner can differentiate an intact/solid hanging wall from a detached/loose hanging
wall by the sound that is emitted. But this method is subjective. The proposed
solution is to provide miners with a device that uses a trained neural network to
record and classify sounds to identify a hanging wall as either intact or detached.
The multilayer perceptron-type ANN architecture that was built achieved better
than 70 percent prediction accuracy on sample data. At this point, the system is in
its prototype-testing phase.
1. What are the key environmental concerns in the electric power industry?
Even though some energy-generation methods are favored over others, all forms
of electricity generation have positive and negative aspects. Some are
environmentally favored but are economically unjustifiable; others are
economically superior but environmentally prohibitive. In a market economy, the
options with lower overall costs are generally chosen over all other sources. It is
not clear yet which form can best meet the necessary demand for electricity
without permanently damaging the environment. Current trends indicate that
increasing the shares of renewable energy and distributed generation from mixed
sources has the promise of reducing/balancing environmental and economic risks.
2. What are the main application areas for predictive modeling in the electric power
industry?
Predictive modeling can be used to optimize operational parameters to produce
cleaner combustion and more stable flame temperatures. Another application is to
predict problems, such as failures or maintenance issues, before they happen.
Modeling can also be used to reduce NOx emissions.
3. How was predictive modeling used to address a variety of problems in the electric
power industry?
Application Case 6.3: Sensitivity Analysis Reveals Injury Severity Factors in Traffic
Accidents
1. How does sensitivity analysis shed light on the black box (i.e., neural networks)?
2. Why would someone choose to use a black-box tool like neural networks over
theoretically sound, mostly transparent statistical tools like logistic regression?
3. In this case, how did neural networks and sensitivity analysis help identify injury-
severity factors in traffic accidents?
ANN and sensitivity analysis helped estimate the significance of the crash factors
on the level of injury severity sustained by the driver. This study was a two-step
process. In the first step, the testers developed a series of prediction models (one
for each injury severity level) to capture the in-depth relationships between the
crash-related factors and a specific level of injury severity. In the second step,
they conducted sensitivity analysis on the trained neural network models to
identify the prioritized importance of crash-related factors as they relate to
different injury severity levels.
The study revealed that the variable seatbelt use was the most important
determinant for predicting higher levels of injury severity, but it was one of the
least significant predictors for lower levels of injury severity. Other interesting
findings involved gender (good predictor for low injury severity, but not for high)
and age (vice versa).
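If you want to illustrate the mechanics of sensitivity analysis, here is a minimal sketch (Python/scikit-learn, with synthetic data standing in for the crash records) that perturbs one input at a time on a trained network and measures how much the predictions shift:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    # Synthetic stand-in for the crash data
    X, y = make_classification(n_samples=400, n_features=6, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                        random_state=0).fit(X, y)

    base = net.predict_proba(X)[:, 1]
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += X[:, j].std()   # perturb one input by one standard deviation
        shift = np.abs(net.predict_proba(Xp)[:, 1] - base).mean()
        print(f"feature {j}: mean output shift = {shift:.3f}")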
Student attrition has become one of the most challenging problems for decision
makers in academic institutions. In spite of all of the programs and services to
help retain students, according to the U.S. Department of Education, Center for
Educational Statistics (nces.ed.gov), only about half of those who enter higher
education actually graduate with a bachelor’s degree. High rates of student
attrition usually result in loss of financial resources, lower graduation rates, and
inferior perception of the school in the eyes of all stakeholders.
2. How can predictive analytics (ANN, SVM, and so forth) be used to better manage
student retention?
3. What are the main challenges and potential solutions to the use of analytics in
retention management?
In order to meet the challenges cited in the answer to #2, the main goals of
analytic studies in this area are to (1) develop models to correctly identify the
freshman students who are most likely to drop out after their freshman year, and
(2) identify the most important variables by applying sensitivity analyses on
developed models. The model building approach will involve the same steps you
usually perform in predictive analytics tasks, including data
collection/consolidation/preprocessing, cross-validation for parameter selection,
use of various algorithms (ANN, SVM, etc.), and sensitivity analysis.
Application Case 6.5: Efficient Image Recognition and Categorization with kNN
Application areas of image recognition and categorization range from agriculture to homeland security and from personalized marketing to environmental protection. Image
recognition is an integral part of an artificial intelligence field called computer
vision. While the field of visual recognition and category recognition has been
progressing rapidly, much remains to be done to reach human-level performance.
Current approaches are capable of dealing with only a limited number of
categories (100 or so categories) and are computationally expensive.
kNN classifiers are natural in this setting, and have computational advantages
over SVMs. But they suffer from high variance (in the bias-variance decomposition) when sampling is limited. By combining kNN with SVM, you can improve performance while maintaining a computational advantage. Another possible hybrid combines kNN with the naïve Bayes algorithm (see the sketch below).
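One way to sketch such a kNN+SVM hybrid (this is an illustrative reconstruction, not the exact method from the case) is to fall back on a small, locally trained SVM whenever a query's nearest neighbors disagree:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neighbors import NearestNeighbors
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)
    queries = X[:5]   # pretend these are new images' feature vectors

    nn = NearestNeighbors(n_neighbors=25).fit(X)
    for q in queries:
        _, idx = nn.kneighbors([q])
        neighbors, labels = X[idx[0]], y[idx[0]]
        if len(set(labels)) == 1:      # neighbors all agree: a plain kNN vote suffices
            print(labels[0])
        else:                          # otherwise refine the vote with a local SVM
            print(SVC(kernel="linear").fit(neighbors, labels).predict([q])[0])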
1. Discuss the evolution of ANN. How have biological networks contributed to the
development of artificial networks? How are the two networks similar?
Research on artificial neural networks (ANN) started more than half a century ago.
McCulloch and Pitts (1943) introduced a simple model of a binary artificial
neuron that captured some functions of biological neurons. Using information-
processing machines to model the brain, McCulloch and Pitts built their neural
network model using a large number of interconnected artificial binary neurons.
With this foundation, neural network research became quite popular in the late 1950s and early 1960s. The introduction of new network topologies, new activation functions, and new algorithms, together with progress in neuroscience and cognitive science, has influenced recent ANN research to a great extent. Advances in
theory and methodology have overcome many obstacles that hindered neural
network research a few decades ago. Evidenced by the appealing results of
numerous studies, neural networks are gaining in acceptance and popularity.
The functional aspects of biological networks have contributed to the elementary development of ANN, although the intricacies of biological neural networks cannot be replicated artificially. ANNs have far fewer neurons than biological networks. (See Section 6.2, Basic Concepts of Neural Networks, for more details.)
2. What are the major concepts related to network information processing in ANN?
Explain the summation and transformation functions and their combined effects
on ANN performance.
The major concepts that are related to network information processing in ANN
are inputs, outputs, connection weights, summation function, transformation
(transfer) function, and hidden layers.
The summation and transformation functions use a neuron’s inputs to create its output. First, the summation function aggregates the inputs into a single value based on weighted summation. Second, the transformation function converts this aggregated value into an output value, generally between 0 and 1 (see Section 6.2, Basic Concepts of Neural Networks, Network Information Processing, for more details).
3. Discuss the common ANN architectures. What are the main differences between
Kohonen’s self-organizing feature maps and Hopfield networks?
1. Collect the data
2. Separate data into training, validation, and testing sets
3. Decide on a network architecture and structure
4. Select a learning algorithm
5. Set network parameters and initialize their values
6. Initialize weights and start training (and validation)
7. Stop training, freeze the network weights
8. Test the trained network
9. Deploy the network for use on unknown new cases.
This procedure is repeated for the entire set of input vectors until the desired output and the actual output agree within some predetermined tolerance. Given the calculation requirements of each iteration, a large network can take a very long time to train; therefore, in one variation, a set of cases is run forward and an aggregated error is fed backward to speed up learning.
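For students who want to see the update mechanics, here is a minimal NumPy sketch (all numbers hypothetical) of one backpropagation step for a single sigmoid neuron:

    import numpy as np

    x = np.array([1.0, 0.5, -1.0])    # one training case's inputs
    target = 1.0                      # its desired output
    w = np.array([0.2, 0.8, 0.4])     # current connection weights
    lr = 0.5                          # learning rate

    out = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # forward pass (sigmoid neuron)
    error = target - out                       # desired vs. actual output
    delta = error * out * (1.0 - out)          # error times sigmoid derivative
    w += lr * delta * x                        # feed the error backward to adjust weights
    print(out, w)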
The score for “yes” is stronger than the score for “no.” The factors that led to the
scores are unknown, but the two scores can be expected to be independent of each
other and may be based on different combinations of inputs. If a simple threshold value of 0.5 were used to determine the validity of an output value, this network would have affirmed creditworthiness and denied non-creditworthiness.
The ANN output suggests that the applicant is probably a good (though perhaps
not outstanding) credit risk. The relatively high score for non-creditworthiness
suggests a possible problem in the applicant’s background that should be looked
into further before credit is granted.
6. Stock markets can be unpredictable. Factors that may cause changes in stock prices are only imperfectly known, and attempts to understand them have met with limited success. Would ANN be a viable solution? Compare and contrast ANN with
other decision support technologies.
The factors that lead to changes in stock prices are imperfectly known. An ANN
might be able to find them in a mass of data because it has no preconceived
notions about what they should be. However, an ANN could fail in this
application if it cannot identify the relevant factors, or if it identifies a set of
factors that would have predicted stock movements during one time period (when
the market was driven by one set of factors) but are not useful in predicting them
in another period (when it is driven by different factors). Attempting to use factors
derived from analysis of an earlier period to guide investments in a later one could
be a recipe for bankruptcy.
Other decision support technologies are driven by human guidance in some way
and rely ultimately on human decision makers to identify the relevant factors.
To determine the link between the chemical composition of a beer, which can be
measured, and its flavor, which cannot be measured but which consumers can
detect and care about.
3. Why were the results of the Coors neural network initially poor, and what was
done to improve the results?
The results were initially poor for two reasons. First, the network was only trained
with one type of beer, so the variation in its inputs was low. Second, it was only
trained for one flavor factor, so its performance was impacted by the inputs that
had no impact on that factor but whose variations within the training sample
created distracting “noise.”
To improve the results, Coors trained the neural network using a wider variety of
products and more combinations of inputs.
If this project is successful, Coors might ultimately be able to control the flavors
of its beers via their chemical composition, which could be monitored
automatically and presumably controlled during the brewing process. Being able
to do this depends, of course, on more than knowing the relationships between the
chemicals in beer and its flavor.
5. What modifications would you provide to improve the results of beer flavor
prediction?
Based on the next-to-last paragraph of the case, I would refine the sensitivity of
the instrumentation, measure a larger number of flavor-active compounds, and
measure the factors that contribute to mouth-feel and the beer’s physical
characteristics. Whether this is practical, whether (if practical) it is cost-effective, and whether it would result in a more effective beer production process than Coors can achieve with experienced, professional brewmasters are, as far as we can tell from this case, still open questions.