
Data Warehousing & Mining

UNIT – IV

Prof. S. K. Pandey, I.T.S, Ghaziabad


Syllabus of Unit - IV
 Knowledge Discovery
 Data Mining - Introduction to Data-Mining
 Techniques of Data-Mining:
– Decision Trees
– Neural Networks
– Nearest Neighbor & Clustering
– Genetic Algorithms

 Rule Induction: Selecting & Using the Right Technique.
Knowledge Discovery
 Knowledge discovery in databases (KDD) is the field that is
evolving to provide automated analysis solutions.
 Knowledge discovery is defined as "the process of extraction of
implicit, unknown, and potentially useful information from data". By
convention, the knowledge discovery process takes the raw results from
data mining (the process of extracting trends or patterns from data) and
carefully and accurately transforms them into useful and understandable
information. This information is typically not retrievable by standard
techniques but is uncovered through the use of AI techniques.
 KDD is a growing field: There are many knowledge discovery
methodologies in use and under development. Some of these
techniques are generic, while others are domain-specific.



Knowledge Discovery Techniques
• Learning algorithms are an integral part of KDD. Learning
techniques may be supervised or unsupervised. In general,
supervised learning techniques enjoy a better success rate as defined in
terms of usefulness of discovered knowledge.

• There are many different approaches that are classified as KDD
techniques:
• Probabilistic Approach
• Statistical Approach
• Classification Approach (e.g. Bayesian classification, Decision tree analysis)
• Deviation and Trend Analysis
• Genetic Algorithms and Neural Networks, and
• Multi-paradigmatic Approach (Hybrid Approach)



 Probabilistic Approach
This family of KDD techniques uses graphical representation models to
compare different knowledge representations. These models are based on
probabilities and data independencies. They are useful for applications
involving uncertainty and applications structured such that a probability
may be assigned to each "outcome" or bit of discovered knowledge.
Probabilistic techniques may be used in diagnostic systems and in planning and
control systems. Automated probabilistic tools are available both commercially
and in the public domain.
 Statistical Approach
The statistical approach uses rule discovery and is based on data relationships.
An "inductive learning algorithm can automatically select useful join paths and
attributes to construct rules from a database with many relations". This type of
induction is used to generalize patterns in the data and to construct rules from the
noted patterns. Online analytical processing (OLAP) is an example of a
statistically oriented approach.
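
As a rough illustration of this statistically oriented style, the following
sketch uses pandas (a common Python data analysis library) to build an
OLAP-style summary; the table, its columns, and the figures are invented
purely for the example.

import pandas as pd

# Hypothetical sales records, invented purely for illustration.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100.0, 120.0, 90.0, 110.0],
})

# Roll revenue up by region and quarter, the kind of aggregate
# view an OLAP tool produces interactively.
cube = sales.pivot_table(values="revenue", index="region",
                         columns="quarter", aggfunc="sum")
print(cube)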


Classification Approach
Classification is probably the oldest and most widely used of all the KDD approaches. This
approach groups data according to similarities or classes. There are many types of
classification techniques and numerous automated tools available.
1. The Bayesian Approach to KDD "is a graphical model that uses directed arcs exclusively
to form a directed acyclic graph". Although the Bayesian approach uses probabilities and a
graphical means of representation, it is also considered a type of classification (a minimal
sketch follows after this list).
Bayesian networks are typically used when the uncertainty associated with an outcome can be
expressed in terms of a probability. This approach relies on encoded domain knowledge and
has been used for diagnostic systems. Other pattern recognition applications, including the
Hidden Markov Model, can be modeled using a Bayesian approach.
2. Pattern Discovery and Data Cleaning is another type of classification that systematically
reduces a large database to a few pertinent and informative records. If redundant and
uninteresting data are eliminated, the task of discovering patterns in the data is simplified.
This approach works on the premise of the old adage "less is more". The pattern discovery
and data cleaning techniques are useful for reducing enormous volumes of application data,
such as those encountered when analyzing automated sensor recordings. Once the sensor
readings are reduced to a manageable size using a data cleaning technique, the patterns in
the data may be more easily recognized.
3. The Decision Tree Approach uses production rules, builds a directed acyclic graph based
on data premises, and classifies data according to their attributes. This method requires that
data classes be discrete and predefined. The primary use of this approach is for predictive
models that may be appropriate for either classification or regression techniques.
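
As a small illustration of the Bayesian flavor of classification, the sketch
below uses scikit-learn's naive Bayes classifier, a much simpler relative of
the full Bayesian networks described in point 1; the data are synthetic and
the class means are arbitrary choices.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two synthetic classes drawn from Gaussians with different means.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)
# Each prediction carries a probability, as the Bayesian approach requires.
print(model.predict_proba([[1.5, 1.5]]))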


 Deviation and Trend Analysis - Pattern detection by filtering important
trends is the basis for this KDD approach. Deviation and trend analysis techniques
are normally applied to temporal databases. A good application for this type of
KDD is the analysis of traffic on large telecommunications networks.
 Neural Networks may be used as a method of knowledge discovery. Neural
networks are particularly useful for pattern recognition, and are sometimes
grouped with the classification approaches. There are tools available in the public
domain and commercially. Genetic algorithms, also used for classification, are
similar to neural networks although they are typically considered more powerful.
There are tools for the genetic approach available commercially.
 Hybrid Approach – This approach combines more than one approach and
is also called a multi-paradigmatic approach. Although implementation may be
more difficult, hybrid tools are able to combine the strengths of various
approaches. Some of the commonly used methods combine visualization
techniques, induction, neural networks, and rule-based systems to achieve the
desired knowledge discovery. Deductive databases and genetic algorithms have
also been used in hybrid approaches.
Data Mining - Introduction to Data Mining
Data mining has been defined in almost as many ways as there are authors who
have written about it. Because it sits at the interface between statistics, computer
science, artificial intelligence, machine learning, database management and data
visualization (to name some of the fields), the definition changes with the
perspective of the user:
"Data mining is the process of exploration and analysis, by automatic or
semiautomatic means, of large quantities of data in order to discover meaningful
patterns and rules." (M. J. A. Berry and G. S. Linoff)
 
"Data mining is finding interesting structure (patterns, statistical models,
relationships) in databases." (U. Fayyad, S. Chaudhuri and P. Bradley)
 
"Data mining is the application of statistics in the form of exploratory data
analysis and predictive models to reveal patterns and trends in very large data
sets." ("Insightful Miner 3.0 User Guide")


Data Mining Contd….
The non-trivial extraction of novel, implicit, and actionable knowledge from
large datasets.
• Extremely large datasets
• Discovery of the non-obvious
• Useful knowledge that can improve processes
• Cannot be done manually
Technology to enable data exploration, data analysis, and data visualization of
very large databases at a high level of abstraction, without a specific hypothesis
in mind. Sophisticated data search capability that uses statistical algorithms to
discover patterns and correlations in data.
Contd…
• We think of data mining as the process of identifying valid, novel, potentially
useful, and ultimately understandable patterns or models in data to make crucial
business decisions.
• "Valid" means that the patterns hold in general, "novel" that we did not know
the pattern beforehand, and "understandable" that we can interpret and
comprehend the patterns. Hence, like statistics, data mining is not only modeling
and prediction, nor a product that can be bought, but a whole problem-solving
cycle/process that must be mastered through team effort.
• Defining the right business problem is the trickiest part of successful data
mining because it is exclusively a communication problem. The technical people
analyzing the data need to understand what the business really needs. Even the
most advanced algorithms cannot figure out what is most important.
• Data preprocessing (also called data cleaning or data preparation) is also a key
part of data mining. Quality decisions and quality mining results come from
quality data. In the real world, data are always dirty and not ready for mining.
For example, data need to be integrated from different sources; data contain
missing values, i.e. are incomplete; data are noisy, i.e. contain outliers or errors;
data contain inconsistent values (discrepancies in codes or names); and data are
not at the right level of aggregation.
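
A minimal pandas sketch of the cleaning steps just listed; the DataFrame,
its columns, and the chosen repairs are all hypothetical.

import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "age":    [25, np.nan, 240, 31],     # a missing value and an outlier
    "city":   ["NY", "ny", "LA", "LA"],  # inconsistent codes
    "income": [50_000, 62_000, 58_000, np.nan],
})

clean = raw.copy()
clean["city"] = clean["city"].str.upper()      # reconcile the codes
clean.loc[clean["age"] > 120, "age"] = np.nan  # flag an impossible age
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].mean())
print(clean)
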
Evolution of Data Mining


Techniques of Data-Mining
– Decision Trees
– Neural Networks
– Nearest Neighbor & Clustering
– Genetic Algorithms



Decision Trees
 Decision trees are a simple yet successful technique for supervised
classification learning.
 A decision tree (or tree diagram) is a decision support tool that uses a tree-
like graph or model of decisions and their possible consequences, including
chance event outcomes, resource costs, and utility. Decision trees are
commonly used in operations research, specifically in decision analysis, to help
identify a strategy most likely to reach a goal. Another use of decision trees is
as a descriptive means for calculating conditional probabilities.
 A decision tree consists of three types of nodes:
1. Decision nodes - commonly represented by squares
2. Chance nodes - represented by circles
3. End nodes - represented by triangles
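
A minimal decision-tree sketch using scikit-learn on its bundled iris data;
the depth limit is an illustrative choice, not a recommendation.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned tree is a white-box model: its decision rules can be
# printed and read directly.
print(export_text(tree, feature_names=load_iris().feature_names))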


Applications of Decision Trees
 Decision trees are a form of data mining technology. They have been used
for problems in varying domains, ranging from credit card attrition prediction
to time series prediction of exchange rates. Some of the major applications
can be summarized as:
 Exploration
 Data Pre-processing
 Prediction
 Applications Score Card
 Clusters
 Links (between predictors)
 Outliers
 Rules
 Sequences (in time series prediction)
 Text Classification and Information Retrieval


Advantages of Decision Trees
Amongst other data mining methods, decision trees have various advantages:
1. Simple to understand and interpret. People are able to understand decision tree models
after a brief explanation.
2. Requires little data preparation. Other techniques often require data normalization,
dummy variables to be created, and blank values to be removed.
3. Able to handle both numerical and categorical data. Other techniques are usually
specialized for datasets that have only one type of variable. For example, relation rules can
be used only with nominal variables, while neural networks can be used only with numerical
variables.
4. Uses a white box model. If a given situation is observable in a model, the explanation for
the condition is easily expressed in Boolean logic. An example of a black box model is an
artificial neural network, since the explanation for the results is difficult to understand.
5. Possible to validate a model using statistical tests. That makes it possible to account for
the reliability of the model.
6. Robust. Performs well even if its assumptions are somewhat violated by the true model
from which the data were generated.
7. Performs well with large data in a short time. Large amounts of data can be analysed
using personal computers in a time short enough to enable stakeholders to make decisions
based on the analysis.


Limitations
 The problem of learning an optimal decision tree is known to be NP-complete.
Consequently, practical decision-tree learning algorithms are based on
heuristic algorithms such as the greedy algorithm where locally optimal
decisions are made at each node. Such algorithms cannot guarantee to return
the globally optimal decision tree.

 Decision-tree learners can create over-complex trees that do not generalize the
data well. This is called over-fitting. Mechanisms such as pruning are necessary to
avoid this problem.

 There are concepts that are hard to learn because decision trees do not express
them easily, such as XOR, parity or multiplexer problems. In such cases, the
decision tree becomes prohibitively large. Approaches to solve the problem
involve either changing the representation of the problem domain or using
learning algorithms based on more expressive representations (such as
statistical relational learning or inductive logic programming).
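
The XOR case is easy to demonstrate: with a single axis-aligned split a tree
can do no better than chance, so extra levels of nodes are needed just to
express the concept. A small scikit-learn sketch:

from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR of the two inputs

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
deep = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(stump.score(X, y))  # 0.5: no single split separates XOR
print(deep.score(X, y))   # 1.0: but it takes several nodes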


Neural Networks
 True neural networks are biological systems (also known as
brains) that detect patterns, make predictions, and learn.
 Artificial neural networks are computer programs implementing
sophisticated pattern detection and machine learning algorithms
on a computer to build predictive models from large historical
databases.
 Neural networks are very powerful predictive modeling
techniques, but some of that power comes at the expense of ease
of use and ease of deployment.
 Neural networks create very complex models that are almost
always impossible to fully understand, even by the experts.
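
As a small illustration, the following scikit-learn sketch trains a
multilayer perceptron on the library's bundled breast cancer data; the
network size and the other parameters are illustrative assumptions, not
recommendations.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small two-hidden-layer network; scaling the inputs first helps
# training converge. The learned weights are the "black box".
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(20, 10),
                                  max_iter=1000, random_state=0))
net.fit(X_tr, y_tr)
print(net.score(X_te, y_te))  # accuracy on held-out records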


 Neural networks, with their remarkable ability to derive meaning from
complicated or imprecise data, can be used to extract patterns and detect trends
that are too complex to be noticed by either humans or other computer
techniques.
 A trained neural network can be thought of as an "expert" in the category of
information it has been given to analyze. This expert can then be used to provide
projections for new situations of interest and to answer "what if" questions.
 Other advantages include:
– Adaptive learning: An ability to learn how to do tasks based on the data given for
training or initial experience.
– Self-Organization: An ANN can create its own organization or representation of the
information it receives during learning time.
– Real Time Operation: ANN computations may be carried out in parallel, and special
hardware devices are being designed and manufactured which take advantage of this
capability.
– Fault Tolerance via Redundant Information Coding: Partial destruction of a
network leads to the corresponding degradation of performance. However, some
network capabilities may be retained even with major network damage.
Contd…….
The shortcomings in the understanding of the neural network
model have been successfully addressed in two ways:
1. The neural network is packaged into a complete solution, such as fraud
detection. This allows the neural network to be carefully crafted for one particular
application, and once it has proved successful, it can be used over and over
again without requiring a deep understanding of how it works.

2. The neural network is packaged with expert consulting services. Here the
neural network is deployed by trusted experts who have a track record of
success. The experts are either able to explain the models or trust that the
models do work.

The first technique has worked quite well because, when the technique is used for a
well-defined problem, many of the difficulties in preprocessing the data can be automated,
and interpretation of the model is less of an issue once an entire industry begins to use the
technology successfully and a level of trust is created. Examples of such applications are
the Falcon system by HNC for credit card fraud detection and the ModelMax package for
direct marketing by Advanced Software Applications.
Nearest Neighbor & Clustering
 Nearest Neighbor Prediction techniques are among the oldest
techniques used in Data Mining.
 Nearest neighbor is a prediction technique that is quite similar to
clustering; its essence is that, in order to determine the prediction
value for one record, the user should look for records with similar
predictor values in the historical database and use the prediction
value from the record that is "nearest" to the unknown record.
 Consider, for example, incomes in your area of residence. If you had to
predict someone's income based only on knowledge of their neighbors' incomes,
your best chance of being right would be to use the incomes of the neighbors
who live closest to the unknown person.
 Nearest neighbor prediction algorithms work in much the same
way, except that "nearness" in a database may consist of a variety
of factors other than just where the person lives.
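
A minimal nearest-neighbor prediction sketch with scikit-learn, echoing the
income example above; the predictor columns and all figures are invented.

from sklearn.neighbors import KNeighborsRegressor

# Predictors: [years_of_education, age]; target: income.
X = [[12, 30], [16, 35], [16, 45], [20, 50], [10, 25]]
y = [30_000, 55_000, 65_000, 90_000, 25_000]

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
# The prediction averages the incomes of the two "nearest" records.
print(knn.predict([[16, 40]]))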
Nearest Neighbor for Prediction
 The Nearest-Neighbor Prediction Algorithm states that:

"Objects that are 'near' to each other will have similar
prediction values as well. Thus, if we know the prediction
value of one of the objects, we can predict it for its nearest
neighbors."
 One of the classic places where nearest neighbor has been used for
prediction is text retrieval.


Business Score Card
 The Business Score Card is used to assess the business value of
Data Mining techniques.
 The real-world problems of the business community are thus taken
into consideration in evaluating the technique, rather than more
academic measures such as speed and performance.
 Three measures are the most critical factors in building a usable data
mining system into the business process:
 Automation
 Clarity
 Return on Investment
 These measures reflect some of the most critical aspects of
whether a data mining system is successfully deployed, remains just an academic
exercise, or becomes a case study in how not to implement such a system.


Where to use Nearest Neighbor and
Clustering Predictions
 This method is used in a wide variety of applications, ranging from
personal bankruptcy prediction to computer recognition of a
person's handwriting.
 These methods are also used every day by people who may not
even realize that they are doing a kind of clustering.
 Clustering is sometimes used to mean segmentation, which, as most
marketing people will tell you, is useful for getting a bird's-eye
view of the business.
 Two commercial applications of clustering systems are the PRIZM
system from Claritas Corp. and MicroVision from Equifax
Corporation.
 Clustering is also applied in outlier analysis.
Difference between Nearest Neighbor
and Clustering Predictions
 The main distinctions between clustering and the nearest-neighbor
technique are summarized in the following table:

S. No.  Nearest Neighbor                           Clustering
1       It is used for prediction as well as       It is used mostly for consolidating data
        data consolidation.                        into a high-level view and general
                                                   grouping of records into like behaviors.
2       Space is defined by the problem to be      Space is defined as a default n-dimensional
        solved (supervised learning), using        space, or is defined by the user, or is a
        Euclidean distance. Space is allocated     predefined space driven by past
        by assigning one dimension to each         experience (unsupervised learning).
        predictor.
3       Generally uses only distance metrics       Can use metrics other than distance to
        to determine nearness.                     determine the nearness of two records,
                                                   for example by linking points together.
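
As a small unsupervised counterpart, the sketch below groups synthetic
records into three clusters with scikit-learn's KMeans; every parameter is
an illustrative choice.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Each record gets a cluster label; the centers give the
# high-level, bird's-eye view of the data.
print(km.cluster_centers_)
print(km.labels_[:10])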


Genetic Algorithms
• Genetic Algorithms were invented to mimic some of the
processes observed in natural evolution.

• Many people, biologists included, are astonished that life at the
level of complexity that we observe could have evolved in the
relatively short time suggested by the fossil record.

• The idea behind Genetic Algorithms is to use this power of
evolution to solve optimization problems.

• The father of the original Genetic Algorithm was John Holland,
who invented it in the early 1970s.


Example
A simple example of Genetic Algorithms, first proposed by Alex Singer, would be a
two-gene chromosome that encodes the solution to a simple direct marketing
problem: "What is the optimal number of coupons that should be put into a
coupon mailer in order to optimize profit?"

At first it might seem a very simple problem to solve: simply mail out as many
coupons as possible, maximizing the chance of a consumer both receiving and
actually using a coupon. The problem is a little more complicated,
however, because several factors affect whether a coupon mailer makes a
profit:
- The more coupons there are, the more the mailer weighs and the higher the
mailing cost (thus decreasing profit).
- Any coupon that does not appear in the mailer is not used by the consumer,
resulting in lost revenue.
- If there are too many coupons in the mailer, the consumer will be overloaded
and will not choose to use any of the coupons.
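
To make the trade-offs concrete, here is a hypothetical profit function for
the coupon problem, the quantity a genetic algorithm would use as its
fitness; every constant in it is invented purely for illustration.

def coupon_profit(n_coupons: int) -> float:
    revenue_per_coupon = 2.0               # expected revenue per redeemed coupon
    mailing_cost = 0.5 + 0.1 * n_coupons   # heavier mailers cost more
    # Past some point, extra coupons overload the consumer and the
    # redemption rate drops off.
    redemption_rate = max(0.0, 0.3 - 0.01 * n_coupons)
    return n_coupons * redemption_rate * revenue_per_coupon - mailing_cost

# A GA would evolve the coupon count; here we simply scan it to show
# the profit curve has an interior optimum rather than "mail everything".
best = max(range(1, 31), key=coupon_profit)
print(best, coupon_profit(best))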


What are Genetic Algorithms
Genetic Algorithms loosely refer to these simulated evolutionary
systems, but more precisely they are the algorithms that dictate
how populations of organisms should be formed, evaluated, and
modified.

They can also define how the genetic material of the simulated
chromosome is converted into a computer program that can
solve a real-world problem.

The problems that can be solved using Genetic Algorithms range
from optimizing a variety of data mining techniques, such as
neural networks and nearest neighbor, to optimizing the
negotiation of oil rights.


How do they relate to Evolution
• In many ways Genetic Algorithms stay true to the processes of
biological evolution. Some of the analogs in Genetic Algorithms that appear in
natural evolution include:
• Organism – represents the computer program being optimized.
• Population – the collection of organisms undergoing simulated evolution.
• Chromosome – in biology, the chromosome contains the genetic makeup of
the organism and fully defines how the organism will develop from its
genotype (genetic definition), with environmental influences, to its phenotype
(outward behavior and appearance).
• Fitness – the calculation by which an organism's value can be
determined for selection and survival of the fittest.
• Gene – the basic building block of the chromosome, which defines one
particular feature of the simulated organism.
• Locus – the position on the chromosome that contains a particular gene
(e.g. the location that determines eye color).
Contd……
• Allele – the value of a gene (e.g. blue for the eye-color locus).

• Mutation – a random change in the value of a gene (allele).

• Mating – the process by which two simulated organisms swap pieces of
their computer programs in a simulated crossover.

• Selection – the process by which the simulated organisms that are best
at solving the particular problem are retained and the less successful are
weeded out by deleting them from computer memory.


How can they be used in Businesses
• Although they would generally have to be classified as an emerging
science, genetic algorithms have a wide variety of uses in business.
There are three main areas to which they can be applied:
• Optimization – Given a business problem with certain variables and a
well-defined definition of profit, a genetic algorithm can be used to
automatically determine the values of the variables that optimize the
profit.

• Prediction – Genetic algorithms have been used as meta-level operators
that help optimize other data mining algorithms.

• Simulation – Sometimes a specific business problem is not well defined in
terms of what the profit is or whether one solution is better than another.
The business person instead just has a large number of entities (usually
customers or competitors) that they would like to simulate via simple
interaction rules over time.
How the Genetic Algorithm Works
• The following steps generally occur in a computer system using genetic
algorithms (a minimal sketch of this loop follows the list):
1. Define the problem to be solved, providing a way to encode the problem in a
chromosome and a way to measure the goodness of a solution encoded in
the chromosome.
2. Initialize a population of chromosomes with random values.
3. Evaluate the fitness of each organism in the population using the previously
defined fitness function.
4. Allow multiple copies of the genetic material of the best chromosomes to
be made, and delete the organisms that are less fit.
5. Allow the new population of organisms to undergo mutation and sexual
reproduction.
6. Evaluate the fitness of each organism in the population using the previously
defined fitness function.
7. Stop if any of the following criteria are met:
• A solution has been found that is good enough.
• The system has run for the prespecified number of generations.
• The system has stopped making progress towards improvement.
8. If no stopping criterion is met, return to step 4.
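
A minimal sketch of this loop, evolving bit-string chromosomes toward the
all-ones string (the classic "one-max" toy problem); the population size,
mutation rate, and other parameters are illustrative assumptions.

import random

random.seed(0)
GENES, POP, GENERATIONS, MUTATION = 20, 30, 100, 0.01

def fitness(chrom):                      # step 1: a goodness measure
    return sum(chrom)

def crossover(a, b):                     # mating / sexual reproduction
    cut = random.randrange(1, GENES)
    return a[:cut] + b[cut:]

# Step 2: a population of chromosomes with random values.
pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]

for gen in range(GENERATIONS):           # steps 3-8: the evolutionary loop
    pop.sort(key=fitness, reverse=True)  # steps 3/6: evaluate fitness
    if fitness(pop[0]) == GENES:         # step 7: solution is good enough
        break
    parents = pop[:POP // 2]             # step 4: keep the fittest
    children = [crossover(random.choice(parents), random.choice(parents))
                for _ in range(POP - len(parents))]
    for child in children:               # step 5: random mutation
        for i in range(GENES):
            if random.random() < MUTATION:
                child[i] ^= 1
    pop = parents + children

best = max(pop, key=fitness)
print(gen, fitness(best))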
Rule Induction: Selecting & Using
the Right Technique
 Rule Induction is one of the major forms
of Data Mining and is perhaps the most
common form of Knowledge Discovery in
unsupervised learning systems.