Data Mining
Agenda
Data Mining Definition
Data Mining Comparisons
Data Evolution
KDD Process
Data Mining Process
Data Mining Techniques
Applications
Case Study: DM Application in GIS
Pattern Recognition
Definition
DATA MINING is defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and as "the science of extracting useful information from large data sets or databases".
Fraud detection:
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data Collection "What was my total Computers, tapes, IBM, CDC Retrospective,
(1960s) revenue in the last disks static data delivery
five years?"
Data Warehousing "What were unit On-line analytic SPSS, Comshare, Retrospective,
& Decision sales in New processing Arbor, Cognos, dynamic data
Support England last (OLAP), Microstrategy,NCR delivery at multiple
(1990s) March? Drill down multidimensional levels
to Boston." databases, data
warehouses
Data Mining Process
Problem formulation
Data collection
  subset data: sampling might hurt if the data are highly skewed
  feature selection: e.g., principal component analysis
Pre-processing: cleaning
  name/address cleaning, reconciling different terms with the same meaning (annual, yearly), duplicate removal, supplying missing values
Transformation:
  map complex objects to features, e.g., time-series data to frequency components (see the sketch after this list)
Choosing the mining task and mining method
Result evaluation and visualization
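To make these steps concrete, here is a minimal Python sketch (pandas and scikit-learn) of duplicate removal, missing-value imputation, and PCA-based feature reduction; the toy DataFrame and its column names are invented for illustration:

```python
# A minimal sketch of the cleaning and transformation steps above.
# The toy DataFrame and its columns are invented for illustration.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "income": [52_000, 48_000, None, 52_000, 91_000],
    "age":    [34, 41, 29, 34, 55],
})

# Cleaning: remove exact duplicates and supply missing values.
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

# Transformation / feature selection: project onto principal components.
pca = PCA(n_components=1)
features = pca.fit_transform(df[["income", "age"]])
print(features.shape)  # (n_samples, 1)
```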
Process Standardization
CRISP-DM:
CRoss Industry Standard Process for Data Mining
Initiative launched in September 1996
Phases
Business Understanding
  understanding project objectives and data; mining problem identification
Data Understanding
  capture and understand the data, checking for quality issues
Data Preparation
  data cleaning, merging data, and deriving attributes
Modeling
  select the data mining technique and build the model
Evaluation
  evaluate results and approve the model
Deployment
  put the models into practice; monitoring and maintenance
DM Techniques
Based on two models:
Verification Model
- the user makes a hypothesis
- tests the hypothesis to verify its validity
Discovery Model
- automatically discovers important information hidden in the data
- data is sifted in search of frequently occurring patterns, trends, and generalizations
- no guidance from the user
DM Techniques
Collaborative filtering
Decision trees (e.g., a tree whose root node asks "Is the company software?")
· Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features
· Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
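As an illustration of the pros listed above (fast application, easy to interpret), here is a minimal decision-tree sketch in scikit-learn; the tiny stock-screening dataset and feature names are invented, loosely echoing the "Is the company software?" example node:

```python
# Minimal decision-tree sketch (scikit-learn); the data are invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [is_software_company, revenue_growth_pct]; label: 1 = stock rose.
X = [[1, 20], [1, 5], [0, 15], [0, 2], [1, 30], [0, 8]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned rules can be printed directly: "easy to interpret".
print(export_text(tree, feature_names=["is_software", "growth_pct"]))
print(tree.predict([[1, 12]]))  # fast to apply to new records
```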
Neural networks
Involves developing mathematical structures with the ability to
learn
Best at identifying patterns or trends in data
Well suited for prediction or forecasting needs
Useful for learning complex data like handwriting, speech and
image recognition
· Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features
· Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
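A minimal neural-network sketch along the same lines, using scikit-learn's MLPClassifier on a synthetic nonlinearly separable dataset; the hidden-layer size here is an arbitrary choice, illustrating the trial-and-error con noted above:

```python
# Minimal neural-network sketch (scikit-learn MLPClassifier); data are synthetic.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# A nonlinearly separable dataset: a complicated class boundary.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# The hidden-layer size is picked by trial and error (the "hard to implement" con).
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(f"training accuracy: {mlp.score(X, y):.2f}")
```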
Data Mining Issues
Limited information
Noise or missing data
User interaction
Uncertainty
Size, updates, and irrelevant fields
Other Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events, identify fraudulent events
Manufacturing and production:
automatically adjust knobs when process parameters change
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationship between
diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for subclusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
Introduction
In the USA, traffic crashes cause one death every 13 minutes and an injury every 15 seconds.
If the current US crash rate remains unchanged:
* One child out of every 84 born today will die violently in a motor vehicle crash.
1. Data Capture
2. Data Merging
3. Data Description
4. Statistical Analysis
5. Data Pre-Processing
Data Transformation
Data Cleaning
6. Building Data-Mining Models
Data Capture
Data Merging
Data Cleaning
Irregularities were tracked, listed and corrected or removed from the
dataset
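A hedged sketch of what the merging and cleaning steps might look like in Python with pandas; the columns, join key, and range check are invented for illustration and are not the study's actual data:

```python
# Hedged sketch of the merging and cleaning steps; the columns, join key,
# and range check are invented for illustration, not the study's actual data.
import pandas as pd

accidents = pd.DataFrame({
    "site_id": [101, 102, 103],
    "driver_age": [34, 212, 29],  # 212 is an irregularity to be caught below
})
roadway = pd.DataFrame({
    "site_id": [101, 102, 103],
    "lanes": [4, 6, 4],
})

# Merging: join accident records with roadway characteristics on a shared key.
merged = accidents.merge(roadway, on="site_id", how="left")

# Cleaning: track irregularities, then correct or remove them.
bad_age = ~merged["driver_age"].between(10, 110)
print(f"{bad_age.sum()} record(s) with implausible driver age")
merged = merged[~bad_age]
```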
Building Data-Mining Models
This process is iterative.
Several input variables and model parameters were explored.
The 45 variables were narrowed down to the most important ones, resulting in the set of variables used to build the data-mining models.
Clustering input variables:
area type, accident type, main contributing cause, site location, highway name, driver age, accident lane, traffic control, and time of the accident
Profiling input variables:
other discrete and categorical variables, and some interesting continuous variables, were input as supplementary variables
these variables were used to profile the clusters but not to define them (a clustering sketch follows below)
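The study's exact clustering algorithm is not reproduced here; as one common way to cluster such mostly categorical records, the sketch below one-hot encodes the variables and applies k-means (all rows and column names are invented):

```python
# Hedged sketch: clustering mostly categorical accident records by one-hot
# encoding them and applying k-means. The study's actual algorithm and data
# are not reproduced; all rows and column names are invented.
import pandas as pd
from sklearn.cluster import KMeans

records = pd.DataFrame({
    "area_type": ["urban", "urban", "rural", "rural"],
    "accident":  ["rear-end", "angle", "rear-end", "sideswipe"],
    "cause":     ["careless", "careless", "improper lane change", "careless"],
    "hour":      [17, 14, 22, 13],
})

X = pd.get_dummies(records, columns=["area_type", "accident", "cause"], dtype=int)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster assignment per accident record
```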
Building Data-Mining Models
Model 1
The freeway accidents dataset involving one or more injuries
Modal values for each cluster variable are given below, with clusters listed by size (format: local name; type of accident; accident time; site location; accident lane; contributing cause; driver 1 age; type of control).
Cluster 4 (36.15%): Palmetto Expressway; Rear-End; 4-7 PM; Not at Interchange; Lane 1; Careless Driving Act; [25–35]; No Control
Cluster 6 (11.60%): Palmetto Expressway; Hit Concrete Barrier Wall; 9 PM-7 AM; Not at Interchange; Median; Careless Driving Act; [25–35]; No Control
Cluster 5 (11.11%): Palmetto Expressway; Rear-End; 4-7 PM; Exit Ramp; Ramp; Careless Driving Act; [15–25]; No Control
Cluster 9 (8.54%): I-95; Sideswipe; 1-4 PM; Not at Interchange; Lane 3; Improper Lane Change; [35–45]; No Control
Cluster 3 (7.95%): I-95; Rear-End; 4-7 PM; Not at Interchange; Lane 2; No Improper Driving Act; [25–35]; No Control
Cluster 8 (7.95%): I-95; Angle; 9 PM-7 AM; Not at Interchange; Side of Road; Careless Driving Act; [25–35]; No Control
Cluster 1 (5.90%): I-95; Rear-End; 1-4 PM; Bridge; Ramp; Careless Driving Act; [25–35]; No Control
Cluster 2 (5.76%): I-95; Angle; 1-4 PM; Not at Interchange; Side of Road; Careless Driving Act; [25–35]; No Control
Cluster 7 (5.03%): Palmetto Expressway; Angle; 9 PM-7 AM; At Intersection; Ramp; No Improper Driving Act; [25–35]; No Control
Model 2
The freeway accidents dataset that involved one or more fatalities
APPLICATIONS:
Computer Vision
Speech Recognition
Automated Target Recognition
Optical Character Recognition
Seismic Analysis
Man and Machine Diagnostics
Fingerprint Identification
Image Preprocessing / Segmentation
Industrial Inspection
Financial Forecast
Medical Diagnosis
ECG Signal Analysis
Emerging PR Applications
Problem: speech recognition. Input: speech waveforms. Output: spoken words, speaker identity.
Problem: non-destructive testing. Input: ultrasound, eddy current, acoustic emission waveforms. Output: presence/absence of flaw, type of flaw.
Problem: detection and diagnosis of disease. Input: EKG, EEG waveforms. Output: types of cardiac conditions, classes of brain conditions.
Problem: natural resource identification. Input: multispectral images. Output: terrain forms, vegetation cover.
Problem: aerial reconnaissance. Input: visual, infrared, radar images. Output: tanks, airfields.
Problem: character recognition (page readers, zip code, license plate). Input: optically scanned image. Output: alphanumeric characters.
Template Matching
[Figure: an input scene containing patterns from categories "A" and "B" is compared against a stored template.]
Statistical Pattern Recognition
• Patterns represented in a feature space
• Statistical model for pattern generation in feature space
• Basically adopted for numeric data
Structural Pattern Recognition
Describe complicated objects in terms of simple primitives and
structural relationships.
Decision-making when features are non-numeric or structural
[Figure: a scene decomposed hierarchically into Object and Background, which are further decomposed into primitives such as D, E, L, M, N, T, X, Y, Z.]
Artificial Neural Networks
Massive parallelism is essential for complex pattern recognition
tasks (e.g., speech and image recognition)
Humans take only a few hundred milliseconds for most cognitive tasks, which suggests parallel computation.
A typical Pattern Recognition System:
INPUT → SENSING → SEGMENTATION → FEATURE EXTRACTION → CLASSIFICATION → POST-PROCESSING → OUTPUT

Sensing:
The sensor converts images, sounds, or other physical inputs into signal data. The input to the PR system is a transducer such as a camera or a microphone; the difficulty of the problem therefore depends on the limitations and characteristics of the transducer: its bandwidth, resolution, sensitivity, distortion, signal-to-noise ratio, etc.

Segmentation:
Isolates sensed objects from the background or from other data.
• Feature extraction:
Measures object properties that are useful for classification; the aim is to extract discriminative features that are "good" for classification.
Good features:
Objects from the same class have similar feature values.
Objects from different classes have different feature values.
[Figure: feature-space plots contrasting "good" features (classes well separated) with "bad" features (classes overlapping).]
Classification:
Takes in the feature vector from the feature extractor and assigns the object to a
category.
The variability of the feature values for objects in the same category may be due to the complexity of the data or to noise.
How can the classifier best cope with this variability, and what is the best performance possible?
Is it possible to extract all the values of a particular feature for a particular input?
What should the classifier do when some of the feature data is missing? (One common answer is sketched after this list.)
In case of multiple classifiers, how does the “super” classifier pool the
evidence to arrive at the best decision? Is the majority always right?
How would a classifier know when to base the decision on a minority
opinion?
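As a hedged sketch of one common answer to the missing-feature question above, the pipeline below imputes missing values before classifying; the data and the choice of model are illustrative only:

```python
# Hedged sketch: imputing missing feature values inside the classifier
# pipeline. The data and the choice of model are illustrative only.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 1.0], [4.0, 0.5]])
y = np.array([0, 0, 1, 1])

clf = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
clf.fit(X, y)
print(clf.predict([[2.5, np.nan]]))  # missing value is imputed, then classified
```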
The Design Cycle
Data collection
Feature Choice
Model Choice
Training
Evaluation
Computational Complexity
The pattern recognition design cycle
Data collection
Probably the most time-intensive component of a PR project.
Can account for a large part of the cost of developing a PR system.
How many examples are enough?
Is the data set adequately large to yield accurate results?
Feature choice
Critical design step.
Requires basic prior knowledge
Does prior knowledge always yield relevant information?
Model choice
How do we determine when a hypothetical model differs from the true
model underlying the existing patterns?
Determining which model is best suited for the problem at hand.
Training
Using the data to determine the classifier – training the classifier.
Supervised, unsupervised and reinforcement learning.
Evaluation
How well does the trained model do?
Evaluation necessary to measure the performance of the system and also
to identify the need for improvements in its components.
Overfitting vs. generalization:
The need to arrive at the best complexity for a model (a sketch follows below).
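A hedged sketch of this evaluation step: holding out a test set exposes the gap between training and test accuracy, so an over-complex model (here, a fully grown tree) typically scores near-perfectly on training data while generalizing worse; the dataset is synthetic:

```python
# Hedged sketch: a held-out test set exposes overfitting. The dataset is
# synthetic and the model choice is illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (2, None):  # a shallow tree vs. a fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```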
Conclusion
The number, magnitude, and complexity of subproblems in PR are overwhelming.
Though it is still an emerging field, many applications are already running successfully.
There still remain several fascinating unsolved problems providing
immense opportunities for progress.