Data Warehousing & Mining: UNIT – IV
• We think of data mining as the process of identifying valid, novel, potentially
useful, and ultimately understandable patterns or models in data in order to
make crucial business decisions.
• "Valid" means that the patterns hold in general, "novel" that we did not know
the pattern beforehand, and "understandable" means that we can interpret and
comprehend the patterns. Hence, like statistics, data mining is not only modeling
and prediction, nor a product that can be bought, but a whole problem solving
cycle/process that must be mastered through team effort.
• Defining the right business problem is the trickiest part of successful data
mining because it is exclusively a communication problem. The technical people
analyzing data need to understand what the business really needs. Even the most
advanced algorithms cannot figure out what is most important.
• Data preprocessing (also called data cleaning or data preparation) is another
key part of data mining: quality decisions and quality mining results come from
quality data. In the real world, data are always dirty and not ready for mining.
For example, data need to be integrated from different sources; data contain
missing values (i.e., are incomplete); data are noisy (i.e., contain outliers or
errors) and inconsistent (i.e., contain discrepancies in codes or names); and data
are not at the right level of aggregation.
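As a minimal sketch of these cleaning steps (the DataFrame, its column names, and its problems are invented for illustration; pandas is assumed to be available):

    import pandas as pd
    import numpy as np

    # Hypothetical raw data showing the problems listed above: missing
    # values, an obvious outlier, and inconsistent codes for one category.
    df = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "age":         [34, np.nan, 29, 2900],        # missing value and outlier
        "region":      ["N", "North", "S", "South"],  # inconsistent codes
        "income":      [52000, 61000, np.nan, 48000], # incomplete data
    })

    # Resolve inconsistent codes by mapping them to one canonical scheme.
    df["region"] = df["region"].replace({"N": "North", "S": "South"})

    # Treat implausible ages as noise and mark them missing.
    df.loc[~df["age"].between(0, 120), "age"] = np.nan

    # Impute missing values with the column median (deleting or flagging
    # the affected rows are other common choices).
    df["age"] = df["age"].fillna(df["age"].median())
    df["income"] = df["income"].fillna(df["income"].median())

    print(df)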
Evolution of Data Mining
– Decision Trees
– Neural Networks
– Nearest Neighbor & Clustering
– Genetic Algorithms
Decision-tree learners can create over-complex trees that do not generalize well
beyond the training data. This is called over-fitting. Mechanisms such as pruning
are necessary to avoid this problem.
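A brief illustration of how pruning helps in practice, assuming scikit-learn is available (the dataset and the ccp_alpha value are illustrative choices that would normally be tuned by cross-validation):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An unpruned tree grows until it fits the training data perfectly.
    full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Cost-complexity pruning (ccp_alpha > 0) trades training accuracy
    # for a smaller tree that tends to generalize better.
    pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

    print("unpruned:", full.get_n_leaves(), "leaves, test accuracy",
          round(full.score(X_test, y_test), 3))
    print("pruned:  ", pruned.get_n_leaves(), "leaves, test accuracy",
          round(pruned.score(X_test, y_test), 3))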
There are concepts that are hard to learn because decision trees do not express
them easily, such as XOR, parity or multiplexer problems. In such cases, the
decision tree becomes prohibitively large. Approaches to solve the problem
involve either changing the representation of the problem domain or using
learning algorithms based on more expressive representations (such as
statistical relational learning or inductive logic programming).
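The parity case can be checked directly: the sketch below (assuming scikit-learn) fits a decision tree to the n-bit parity function and reports the tree size, which grows exponentially because every leaf must fix all n bits:

    import itertools
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # The n-bit parity label depends on every bit jointly, so axis-aligned
    # splits cannot summarize it compactly: the tree needs 2^n leaves.
    for n in range(2, 9):
        X = np.array(list(itertools.product([0, 1], repeat=n)))
        y = X.sum(axis=1) % 2  # parity (XOR of all bits)
        tree = DecisionTreeClassifier().fit(X, y)
        print(f"{n}-bit parity: {tree.get_n_leaves()} leaves for {len(X)} examples")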
1. The neural network is packaged up as part of a complete solution to a specific,
well-defined business problem, so that data preparation and model interpretation
are handled inside the application.
2. The neural network is packaged up with expert consulting services. Here the
neural network is deployed by trusted experts who have a track record of
success. The experts either are able to explain the models or trust that the
models work.
The first technique has worked quite well: when the network is applied to a
well-defined problem, many of the difficulties of preprocessing the data can be
automated, and interpretation of the model is less of an issue, since entire
industries begin to use the technology successfully and a level of trust is
created. Examples of such applications are the Falcon system by HNC for credit
card fraud detection and the ModelMax package for direct marketing by Advanced
Software Applications.
Nearest Neighbor & Clustering
Nearest neighbor prediction techniques are among the oldest techniques used in
data mining.
Nearest neighbor is a prediction technique quite similar to clustering: to
determine the prediction value for one record, the user looks for records with
similar predictor values in the historical database and uses the prediction
value from the record that is "nearest" to the unknown record.
Consider, for example, predicting income from place of residence. If you had to
predict someone's income based only on knowledge of their neighbors' incomes,
your best chance of being right would be to use the incomes of the neighbors
who live closest to the unknown person.
Nearest neighbor prediction algorithms work in very much the same way, except
that "nearness" in a database may consist of a variety of factors other than
just where the person lives.
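A minimal sketch of the income example, using scikit-learn's k-nearest-neighbors regressor as a stand-in for the generic technique described above (the coordinates and incomes are invented):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical data: (x, y) location of each known household and its income.
    locations = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
    incomes   = np.array([48000, 52000, 50000, 95000, 99000, 97000])

    # Predict an unknown household's income from its single nearest neighbor;
    # "nearness" here is plain Euclidean distance, but in a real database it
    # could combine many predictor variables.
    model = KNeighborsRegressor(n_neighbors=1).fit(locations, incomes)
    print(model.predict([[5.5, 5.2]]))  # falls in the high-income neighborhood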
Nearest Neighbor for Prediction vs. Clustering

1. Nearest neighbor: used for prediction as well as data consolidation.
   Clustering: used mostly for consolidating data into a high-level view and
   general grouping of records into like behaviors.

2. Nearest neighbor: the space is defined by the problem to be solved
   (supervised learning), using Euclidean distance; space is allocated by
   assigning one dimension to each predictor.
   Clustering: the space is defined as a default n-dimensional space, or is
   defined by the user, or is a predefined space driven by past experience
   (unsupervised learning).

3. Nearest neighbor: generally uses only distance metrics to determine nearness.
   Clustering: can use other metrics besides distance to determine the nearness
   of two records, for example, linking points together.
At first it might seem a very simple problem to solve: simply mail out as many
coupons as possible, thus maximizing the chance of a consumer both receiving and
actually using a coupon. The problem is made a little more complicated, however,
because several factors affect whether a coupon-packet mailer makes a profit:
- The more coupons there are, the more the mailer weighs and the higher the
mailing cost (thus decreasing profit).
- Any coupon that does not appear in the mailer cannot be used by the consumer,
resulting in lost revenue.
- If there are too many coupons in the mailer, the consumer will be overloaded
and not choose to use any of the coupons.
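As an illustrative sketch of how a genetic algorithm could search for a profitable coupon subset (the revenue figures, cost per coupon, overload model, and GA parameters are all assumptions invented for the sketch, not part of the original example):

    import random

    random.seed(0)

    N_COUPONS = 20
    # Assumed expected revenue if each coupon is included and used.
    REVENUE = [random.uniform(1.0, 5.0) for _ in range(N_COUPONS)]
    COST_PER_COUPON = 1.2   # assumed mailing/printing cost per coupon
    OVERLOAD_LIMIT = 8      # assumed point at which response starts to drop

    def fitness(chrom):
        """Profit of a mailer; chrom is a bit string (1 = include coupon)."""
        k = sum(chrom)
        revenue = sum(r for r, bit in zip(REVENUE, chrom) if bit)
        if k > OVERLOAD_LIMIT:              # consumer overload: response decays
            revenue *= OVERLOAD_LIMIT / k
        return revenue - COST_PER_COUPON * k

    def evolve(pop_size=50, generations=100):
        pop = [[random.randint(0, 1) for _ in range(N_COUPONS)]
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            survivors = pop[: pop_size // 2]          # selection: weed out the worst
            children = []
            while len(children) < pop_size - len(survivors):
                a, b = random.sample(survivors, 2)
                cut = random.randrange(1, N_COUPONS)  # one-point crossover
                child = a[:cut] + b[cut:]
                i = random.randrange(N_COUPONS)       # mutation: flip one gene
                child[i] ^= 1
                children.append(child)
            pop = survivors + children
        return max(pop, key=fitness)

    best = evolve()
    print("best mailer:", best, "profit:", round(fitness(best), 2))

Here each chromosome is simply the list of include/exclude decisions for the coupons, so crossover mixes two plausible mailers and mutation perturbs one decision at a time.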
Genetic algorithms can also define how the genetic material of the simulated
chromosome is converted into a computer program that can solve a real-world
problem.
• Allele – the value of a gene (e.g., blue at the locus for eye color).
• Selection – the process by which the simulated organisms that are best at
solving the particular problem are retained, while the less successful ones are
weeded out by deleting them from computer memory.
• Prediction – genetic algorithms have been used as meta-level operators that
help optimize other data mining algorithms.
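As an illustration of that meta-level use (the choice of k-nearest neighbors as the algorithm being tuned, the dataset, and the search parameters are all assumptions for the sketch), a genetic-style search can evolve another algorithm's hyperparameter:

    import random
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    random.seed(0)
    X, y = load_iris(return_X_y=True)

    def fitness(k):
        """Cross-validated accuracy of a k-nearest-neighbor classifier."""
        return cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()

    # A tiny evolutionary loop over the single "gene" k: keep the fitter
    # half of the population, then mutate survivors to produce offspring.
    pop = [random.randint(1, 30) for _ in range(10)]
    for _ in range(10):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:5]
        pop = survivors + [max(1, s + random.choice([-2, -1, 1, 2]))
                           for s in survivors]

    pop.sort(key=fitness, reverse=True)
    print("best k:", pop[0], "accuracy:", round(fitness(pop[0]), 3))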