Data Mining


By Rajeev Kumar

Agenda
Data Mining Definition
Data Mining Comparisons
Data Evolution
KDD Process
Data Mining Process
Data Mining Techniques
Applications
Case Study: DM Application in GIS
Pattern Recognition
Definition
 DATA MINING is defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” and as “the science of extracting useful information from large data sets or databases”.

 It is also said to be the search for relationships and global patterns that exist in large databases but are hidden among vast amounts of data.

 The patterns must be:
 valid: hold on new data with some certainty
 novel: non-obvious to the system
 useful: should be possible to act on the item
 understandable: humans should be able to interpret the pattern
Why Data Mining?
Competitive Advantage!
“The secret of success is to know something that nobody else knows.”
- Aristotle Onassis
 Human analysis skills are inadequate
 Volume and dimensionality of the data
 High data growth rate
 Availability of:
• Data
• Storage
• Computational power
• Off-the-shelf software
• Expertise
Why Data Mining?
 Credit ratings/targeted marketing:
 Given a database of 100,000 names, which persons are the least likely to default on
their credit cards?
 Identify likely responders to sales promotions

 Fraud detection
 Which types of transactions are likely to be fraudulent, given the demographics and
transactional history of a particular customer?
 Customer relationship management:
 Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data Mining helps extract such information


Data Mining works with Warehouse Data

Data Warehousing provides the enterprise with a memory.

Data Mining provides the enterprise with intelligence.
Evolution of Data Analysis
Data Collection (1960s)
 Business question: “What was my total revenue in the last five years?”
 Enabling technologies: computers, tapes, disks
 Product providers: IBM, CDC
 Characteristics: retrospective, static data delivery

Data Access (1980s)
 Business question: “What were unit sales in New England last March?”
 Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
 Product providers: Oracle, Sybase, Informix, IBM, Microsoft
 Characteristics: retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
 Business question: “What were unit sales in New England last March? Drill down to Boston.”
 Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
 Product providers: SPSS, Comshare, Arbor, Cognos, MicroStrategy, NCR
 Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
 Business question: “What’s likely to happen to Boston unit sales next month? Why?”
 Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
 Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
 Characteristics: prospective, proactive information delivery
The Knowledge Discovery Process

 Problem formulation
 Data collection
 subset data: sampling may hurt if the data is highly skewed
 feature selection: e.g. principal component analysis
 Pre-processing: cleaning
 name/address cleaning, resolving different meanings (annual, yearly), duplicate removal, supplying missing values
 Transformation:
 map complex objects to simpler features, e.g. time series data to frequency components
 Choosing the mining task and mining method
 Result evaluation and visualization

Knowledge discovery is an iterative process.
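To make the early steps concrete, here is a minimal sketch of sampling, cleaning, and PCA-based feature selection using pandas and scikit-learn; the input file and its columns are hypothetical.

```python
# Hypothetical sketch of KDD pre-processing steps (sampling, cleaning, PCA).
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("customers.csv")             # data collection (hypothetical file)
sample = df.sample(frac=0.1, random_state=0)  # subset data by sampling

# Pre-processing: duplicate removal and supplying missing numeric values
sample = sample.drop_duplicates()
sample = sample.fillna(sample.mean(numeric_only=True))

# Feature selection via principal component analysis
numeric = sample.select_dtypes("number")
features = PCA(n_components=2).fit_transform(numeric)  # input to the mining step
```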
The Data Mining Process

Why Should There Be a Standard Process?

The data mining process must be reliable and repeatable by people with little data mining background.

Process Standardization

CRISP-DM:
 CRoss Industry Standard Process for Data Mining
 Initiative launched September 1996

• CRISP-DM provides a uniform framework for
– guidelines
– experience documentation

• CRISP-DM is flexible to account for differences
– different business/agency problems
– different data
The Data Mining Process
Phases in the DM Process:

Business Understanding
understand project objectives and identify the data mining problem
Data Understanding
capture the data and understand it, assessing quality issues
Data Preparation
data cleaning, merging data and deriving attributes
Modeling
select the data mining technique and build the model
Evaluation
evaluate results and approve the model
Deployment
put the models into practice; monitoring and maintenance
DM Techniques
Based on two models

Verification Model
- the user makes a hypothesis
- tests the hypothesis to verify its validity

Discovery Model
- automatically discovers important information hidden in the data
- data is sifted in search of frequently occurring patterns, trends and generalizations
- no guidance from the user
DM Techniques

1. Discovery of Association Rules
2. Clustering
3. Discovery of Classification Rules
4. Frequent Episodes
5. Deviation Detection
6. Neural Networks
7. Genetic Algorithms
8. Rough Set Techniques
9. Support Vector Machines
Association Rules
The purchase of one product when another product is purchased represents an association rule (AR)
 Used mainly in retail stores to
- assist in marketing
- shelf management
- inventory control
Support is how often X and Y occur together, as a percentage of the total transactions
Confidence is how much a particular item depends on the other
 Given a set T of groups of items
Example: a set of itemsets purchased
 Goal: find all rules on itemsets of the form a --> b such that
 support of a and b > user threshold s
 conditional probability (confidence) of b given a > user threshold c
 Example: milk --> bread
 Purchase of product A --> service B
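To make support and confidence concrete, here is a minimal worked sketch in Python; the transactions and thresholds are hypothetical.

```python
# Support and confidence for a candidate rule a --> b over transactions T.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, T):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in T) / len(T)

def confidence(a, b, T):
    """Conditional probability of b given a: support(a union b) / support(a)."""
    return support(a | b, T) / support(a, T)

s, c = 0.3, 0.6  # user thresholds
a, b = {"milk"}, {"bread"}
if support(a | b, transactions) > s and confidence(a, b, transactions) > c:
    print("rule milk --> bread holds")  # support 0.5, confidence ~0.67
```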
Clustering

Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on the time series of their payment history, so that similar customers fall in the same cluster.
Key requirement: a good measure of similarity between instances.
Identify micro-markets and develop policies for each.
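Below is a minimal clustering sketch with scikit-learn's KMeans; the payment-history matrix is hypothetical, and plain Euclidean distance stands in for the similarity measure, which in practice should be chosen to suit the data.

```python
# Cluster customers by a (hypothetical) monthly payment-history matrix.
import numpy as np
from sklearn.cluster import KMeans

payments = np.array([
    [100, 110,  90, 105],   # steady payers
    [ 98, 102, 100,  99],
    [200,  10, 180,  15],   # erratic payers
    [190,   5, 170,  20],
])

# KMeans uses Euclidean distance as its similarity measure; a time-series
# specific measure may be more appropriate for real payment histories.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(payments)
print(km.labels_)  # cluster assignment per customer, e.g. [0 0 1 1]
```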
Clustering- Applications
 Customer segmentation, e.g. for targeted marketing
 Group/cluster existing customers based on the time series of their payment history, so that similar customers fall in the same cluster
 Identify micro-markets and develop policies for each

 Collaborative filtering:
Given a database of user preferences, predict the preferences of a new user
Example: predict what new movies you will like based on
 your past preferences
 others with similar past preferences
 their preferences for the new movies
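As a hedged sketch of the collaborative filtering idea, the snippet below predicts a new user's rating for a movie by weighting existing users' ratings by preference similarity; the ratings matrix is hypothetical and cosine similarity is just one reasonable choice.

```python
# Memory-based collaborative filtering on a tiny hypothetical ratings matrix.
import numpy as np

ratings = np.array([      # existing users x movies; 0 means "not rated"
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 1, 5, 4],
])
new_user = np.array([5, 5, 0, 0])  # shares tastes with users 0 and 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

sims = np.array([cosine(new_user, r) for r in ratings])

# Predicted rating for movie 2: similarity-weighted average of others' ratings
movie = 2
pred = sims @ ratings[:, movie] / sims.sum()
print(round(pred, 2))  # ~0.5: a naive estimate; unrated entries count as 0 here
```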
Decision trees
 Tree where internal nodes are simple decision rules on
one or more attributes and leaf nodes are predicted
class labels.

[Figure: toy decision tree. The root node “Is company software?” branches to “Product development” and “On-site job”, with leaf labels Good / Bad / Bad / Good]


Decision tree
 Widely used learning method
 Easy to interpret: can be re-represented as if-then-else rules
 Does not require any prior knowledge of data distribution, works well
on noisy data.
 Has been applied to:
 classify medical patients based on the disease,
 equipment malfunction by cause,
 loan applicant by likelihood of payment.

· Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

· Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
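Below is a minimal decision-tree sketch with scikit-learn, loosely echoing the toy tree above; the two binary features and the labels are hypothetical, and export_text prints the learned rules in if-then-else form.

```python
# Fit a small decision tree and print it as readable if-then-else rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [is_software_company, is_product_development]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["Good", "Bad", "Bad", "Good"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["software", "product_dev"]))
print(tree.predict([[1, 1]]))  # ['Good']
```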
Neural networks
 Involves developing mathematical structures with the ability to
learn
 Best at identifying patterns or trends in data
 Well suited for prediction or forecasting needs
 Useful for learning complex data like handwriting, speech and
image recognition

· Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

· Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
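For illustration, here is a small neural-network sketch using scikit-learn's MLPClassifier on a synthetic two-class problem; the hidden layer size is the "number of nodes" that, as noted above, is typically chosen by trial and error.

```python
# Train a small multi-layer perceptron on a non-linear synthetic problem.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(16,),  # chosen by trial and error
                    max_iter=2000, random_state=0)
net.fit(X, y)
print(net.score(X, y))  # the learned class boundary is non-linear
```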
Data Mining Applications

 Some examples of “successes”:

 1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan.
 2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
 3. “Bread and Butter”: the observation that customers who buy bread are more likely than average to buy butter allowed supermarkets to place bread and butter nearby, knowing many customers would walk between them.
 4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
 5. Comparison of the genotypes of people with and without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining has become much more important as the human genome has been fully decoded.
Issues and Challenges in DM

Limited information
Noise or missing data
User interaction
Uncertainty
Size, updates and irrelevant fields
Other Applications
 Banking: loan/credit card approval
 predict good customers based on old customers
 Customer relationship management:
 identify those who are likely to leave for a competitor.
 Targeted marketing:
 identify likely responders to promotions
 Fraud detection: telecommunications, financial transactions
 from an online stream of events, identify the fraudulent ones
 Manufacturing and production:
 automatically adjust knobs when process parameter changes
 Medicine: disease outcome, effectiveness of treatments
 analyze patient disease history: find relationship between
diseases
 Molecular/Pharmaceutical: identify new drugs
 Scientific data analysis:
 identify new galaxies by searching for sub-clusters
 Web site/store design and promotion:
 find affinity of visitor to pages and modify layout
Introduction
In the USA, traffic crashes cause one death every 13 minutes and an injury every 15 seconds
If the current US crash rate remains unchanged:
* One child out of every 84 born today will die violently in a motor vehicle crash
* Six out of every 10 children will be injured in a highway crash over a lifetime
 The economic impact on the U.S. is roughly $150 billion
Deaths and injuries on Florida’s highways cost society $14 billion annually, ranking Florida in the top five nationally (FDOT, 2003)
In addition, insecurity on the roads has an important effect on the economic costs associated with traffic accidents
 Over the last two decades, increasingly large amounts of transportation data have been stored electronically
 This volume is expected to continue to grow considerably in the future
 A tremendous amount of this data pertains to transportation safety
 Despite this wealth of data, we have been unable to fully capitalize on its value
 This is because the information implicit in the data is not easily discernible without proper data analysis techniques
 Advanced modeling and analysis of such data are needed to understand relationships and dependencies
 Two major challenges face transportation engineers today:
 How to extract the most relevant information from the vast amount of available traffic data
 How to use this wealth of data to enhance our understanding of traffic behavior
 Data mining is an emerging field that promotes the progress of data analysis
Why Data Mining?

When dealing with large and complex data sets, such as transportation data, data mining techniques are useful for knowledge discovery and for identifying the relevant variables that contribute most strongly to a better understanding of safety patterns, problems and causes.
Data mining applications in transportation are still in their infancy, but have high potential for growth.

Applications of Data Mining in Transportation:

 Improve traffic signal timing plans
 Measure freeway performance
 Improve aviation safety
 Improve the delivery and quality of traffic safety information


Developed Analytical Methodology
Step 1 - GIS: identify relevant freeway features at each accident location and integrate them with the crash database
Step 2 - Preliminary statistical analysis: develop a better understanding of the data
Step 3 - Data mining clustering techniques: identify clusters of common accidents, and the conditions under which accidents are more likely to cause death or injury
Step 4 - Data mining profiling techniques: profile accidents in terms of accident and freeway characteristics
Step 5 - Visualization techniques: better understanding and presentation of results
Clustering Analysis:

 Clustering = unsupervised classification
 Place objects into clusters suggested by the data, not defined a priori
 Studies have shown that clustering methods are an important tool when analyzing traffic accidents, as these methods are able to identify groups of road users, vehicles and road segments which would be suitable targets for countermeasures
 Demographic clustering is a distribution-based data mining technique that provides fast and natural clustering of very large databases
The Data Mining Process:

1. Data Capture
2. Data Merging
3. Data Description
4. Statistical Analysis
5. Data Pre-Processing
 Data Transformation
 Data Cleaning
6. Building Data-Mining Models
Data Capture

 The 1999 crash data for Miami-Dade County (MDC) freeways was utilized
 Crash data:
 A rich source of accident-related information
 Contains 39 attributes that describe the accident: roadway section, date, time, accident type, driver age, lighting condition, traffic control, and road condition
But
 This does not provide information about the physical characteristics of the roadway at the accident location, which is necessary to develop appropriate countermeasures.
Data Merging
 The Roadway Characteristics Inventory (RCI) database contains various physical and administrative features related to the roadway network
 Speed limit, local name, number of lanes, shoulder type, median type, and median width were extracted from the RCI database using a GIS software package (ArcView)
 Avenue (ArcView’s scripting language) was used to write a special script to merge the crash data and roadway attribute tables
 Spatial referencing and analysis were performed to identify the roadway features at each accident location
[Figure: crash and roadway characteristic data merging. Median width and shoulder type layers are overlaid on accident locations to obtain the median width and shoulder type at each accident location]
Data Description
 The merged dataset contains 5,870 records (one for each accident)
 The 45 attributes describing each accident can be divided into seven dimensions:
 Accident information (e.g. number of injuries, number of fatalities, accident type, point of impact)
 Road information (e.g. road condition, road surface, road side)
 Traffic information (e.g. accident lane number)
 Driver information (e.g. driver’s age)
 Geographical information (e.g. roadway section, area type)
 Environmental conditions information (e.g. lighting condition, weather condition, date and time of the accident)
 Roadway feature information (e.g. number of lanes, speed limit, local name, median width)
Statistical Analysis
 The database is:
 Large
 Noisy and/or contains missing data
 Without clearly defined expectations about the kinds of clusters to be discovered
 A preliminary statistical analysis was performed:
 Basic statistics, such as maximum, minimum, mean, variance, and frequencies, were calculated for numeric fields
 Frequencies were calculated for continuous numeric fields, categorical fields and discrete numeric fields
 These statistics helped provide a better understanding of the data and sped up the problem identification process
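A minimal sketch of this preliminary statistical pass with pandas is shown below; the file name and column are hypothetical stand-ins for the merged crash dataset.

```python
# Preliminary statistics over a (hypothetical) merged crash dataset.
import pandas as pd

crashes = pd.read_csv("crashes_1999_merged.csv")

# Maximum, minimum, mean and variance for numeric fields
print(crashes.describe())
print(crashes.var(numeric_only=True))

# Frequencies for categorical and discrete numeric fields
print(crashes["accident_type"].value_counts())  # hypothetical column name
```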
Data Pre-Processing
Data transformation
 Dataset variable types: most variables are categorical; others are discrete numeric or continuous

 Data cleaning
Irregularities were tracked, listed, and corrected or removed from the dataset
Building Data-Mining Models
 This process is iterative
 Several input variables and model parameters are explored
 The 45 variables were narrowed down to the most important ones, resulting in the set of variables used to build the data-mining models
 Clustering input variables:
 Area type, accident type, main contributing cause, site location, highway name, driver age, accident lane, traffic control, and time of the accident
 Profiling input variables:
 Other discrete and categorical variables and some interesting continuous variables were input as supplementary variables
 These variables were used to profile the clusters but not to define them
Building Data-Mining Models
 Model 1
The freeway accidents dataset involving one or more injuries.
Modal values for each cluster variable (local name; type of accident; accident time; site location; accident lane; contributing cause; driver 1 age; type of control) are given below:

Cluster 4 (36.15%): Palmetto Expressway; Rear-End; 4-7 PM; Not at Interchange; Lane 1; Careless Driving Act; 25-35; No Control
Cluster 6 (11.60%): Palmetto Expressway; Hit Concrete Barrier Wall; 9 PM-7 AM; Not at Interchange; Median; Careless Driving Act; 25-35; No Control
Cluster 5 (11.11%): Palmetto Expressway; Rear-End; 4-7 PM; Exit Ramp; Ramp; Careless Driving Act; 15-25; No Control
Cluster 9 (8.54%): I-95; Sideswipe; 1-4 PM; Not at Interchange; Lane 3; Improper Lane Change; 35-45; No Control
Cluster 3 (7.95%): I-95; Rear-End; 4-7 PM; Not at Interchange; Lane 2; No Improper Driving Act; 25-35; No Control
Cluster 8 (7.95%): I-95; Angle; 9 PM-7 AM; Not at Interchange; Side of Road; Careless Driving Act; 25-35; No Control
Cluster 1 (5.90%): I-95; Rear-End; 1-4 PM; Bridge; Ramp; Careless Driving Act; 25-35; No Control
Cluster 2 (5.76%): I-95; Angle; 1-4 PM; Not at Interchange; Side of Road; Careless Driving Act; 25-35; No Control
Cluster 7 (5.03%): Palmetto Expressway; Angle; 9 PM-7 AM; At Intersection; Ramp; No Improper Driving Act; 25-35; No Control
 Model 2
The freeway accidents dataset that involved one or more fatalities.

[Figure: detailed visualization of one of the clusters, involving 26.67% of the population]


The analytical methodology developed in this study:
 Provides insight for identifying roadway and driver characteristics that contribute to severe accidents on the Miami-Dade County freeway system
 Provides information for transportation planners to improve planning, operating, and managing the freeway system
 Could help interested agencies effectively allocate resources to improve safety measures in areas with high accident frequency
Definition:
“The assignment of a physical object or event to one of several pre-specified categories” - Duda and Hart
“The science that concerns the description or classification (recognition) of measurements” - Schalkoff
“Pattern Recognition is concerned with answering the question ‘What is this?’” - Morse

It is the study of how machines can:
• observe the environment
• learn to distinguish patterns of interest from their background
• make sound and reasonable decisions about the categories of the patterns

A pattern is an object, process or event that can be given a name.
A pattern class (or category) is a set of patterns sharing common attributes, usually originating from the same source.
During recognition (or classification), given objects are assigned to prescribed classes.
A classifier is a machine which performs classification.

APPLICATIONS:
 Computer Vision
 Speech Recognition
 Automated Target Recognition
 Optical Character Recognition
 Seismic Analysis
 Man and Machine Diagnostics
 Fingerprint Identification
 Image Preprocessing / Segmentation
 Industrial Inspection
 Financial Forecast
 Medical Diagnosis
 ECG Signal Analysis
Emerging PR Applications
Speech recognition: input = speech waveforms; output = spoken words, speaker identity
Non-destructive testing: input = ultrasound, eddy current, acoustic emission waveforms; output = presence/absence of flaw, type of flaw
Detection and diagnosis of disease: input = EKG, EEG waveforms; output = types of cardiac conditions, classes of brain conditions
Natural resource identification: input = multispectral images; output = terrain forms, vegetation cover
Aerial reconnaissance: input = visual, infrared, radar images; output = tanks, airfields
Character recognition (page readers, zip code, license plate): input = optically scanned image; output = alphanumeric characters
Emerging PR Applications (cont’d)

Identification and counting of cells: input = slides of blood samples, micro-sections of tissues; output = type of cells
Inspection (PC boards, IC masks, textiles): input = scanned image (visible, infrared); output = acceptable/unacceptable
Manufacturing: input = 3-D images (structured light, laser, stereo); output = identified objects, pose, assembly
Web search: input = key words specified by a user; output = text relevant to the user
Fingerprint identification: input = image from fingerprint sensors; output = owner of the fingerprint, fingerprint classes
Online handwriting retrieval: input = query word written by a user; output = occurrences of the word in the database
 Key objectives:
 Process the sensed data to eliminate noise
 Given a sensed pattern, choose the best-fitting model for it and then assign it to the class associated with that model.

Classification vs. Clustering

• Classification (known categories): supervised classification (recognition)
• Clustering (creation of new categories): unsupervised classification

[Figure: sample points grouped into Category “A” and Category “B” under each approach]
A basic pattern classification system contains
A sensor
A preprocessing mechanism
A feature extraction mechanism (manual or automated)
A classification algorithm
A set of examples (training set) already classified or described
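As an illustrative sketch, the components above can be lined up as a scikit-learn pipeline; the digits dataset stands in for sensor output, and the scaler, PCA features, and k-NN classifier are assumptions, not the only possible choices.

```python
# The five components of a basic pattern classification system in one pipeline.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)        # sensed patterns + labeled examples

pipeline = make_pipeline(
    StandardScaler(),                      # preprocessing mechanism
    PCA(n_components=20),                  # feature extraction mechanism
    KNeighborsClassifier(n_neighbors=3),   # classification algorithm
)
pipeline.fit(X, y)                         # training set already classified
print(pipeline.predict(X[:1]))             # predicted class of one pattern
```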
Main PR Approaches:
Template matching
 The pattern to be recognized is matched against a stored
template while taking into account all allowable pose
(translation and rotation) and scale changes.
Statistical pattern recognition
 Focuses on the statistical properties of the patterns (i.e.,
probability densities).
Structural Pattern Recognition
 Describe complicated objects in terms of simple primitives and
structural relationships.
Syntactic pattern recognition
 Decisions consist of logical rules or grammars.
Artificial Neural Networks
 Inspired by biological neural network models.
Template Matching

[Figure: a stored template matched against an input scene]
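A minimal sketch of the matching step itself, using normalized cross-correlation over all translations of a 1-D signal (the rotation and scale search is omitted for brevity, and the data is hypothetical).

```python
# Template matching by normalized cross-correlation over all translations.
import numpy as np

scene    = np.array([0, 1, 3, 7, 3, 1, 0, 0], dtype=float)
template = np.array([1, 3, 7, 3, 1], dtype=float)

def ncc(a, b):
    """Normalized cross-correlation of two equal-length windows."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

scores = [ncc(scene[i:i + len(template)], template)
          for i in range(len(scene) - len(template) + 1)]
print(int(np.argmax(scores)))  # best match at offset 1, where the score peaks
```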
Statistical Pattern Recognition
• Patterns are represented in a feature space
• Statistical model for pattern generation in feature space
• Basically adopted for numeric data
Structural Pattern Recognition
 Describe complicated objects in terms of simple primitives and
structural relationships.
 Decision-making when features are non-numeric or structural

[Figure: a scene decomposed hierarchically into an object and background, each described in terms of primitives such as L, T, X, Y, Z and relations such as D, E, M, N]
Artificial Neural Networks
 Massive parallelism is essential for complex pattern recognition
tasks (e.g., speech and image recognition)
 Humans take only a few hundred milliseconds for most cognitive tasks; this suggests parallel computation
A Typical Pattern Recognition System:

INPUT → SENSING → SEGMENTATION → FEATURE EXTRACTION → CLASSIFICATION → POST-PROCESSING → OUTPUT

 Sensing:
The sensor converts images, sounds or other physical inputs into signal data. The input to the PR system is a transducer, such as a camera or a microphone, so the difficulty of the problem depends on the limitations and characteristics of the transducer: its bandwidth, resolution, sensitivity, distortion, signal-to-noise ratio, etc.

 Segmentation:
Isolates sensed objects from the background or from other data.
• Feature extraction:
Measures object properties that are useful for classification; the aim is to extract features which are “good” for classification.
Good features:
Objects from the same class have similar feature values.
Objects from different classes have different feature values.

[Figure: feature-space plots contrasting “good” features (well-separated classes) with “bad” features (overlapping classes)]
 Classification:

Takes the feature vector from the feature extractor and assigns the object to a category.
The variability of feature values for objects in the same category may be due to the complexity of the data or due to noise.
How can the classifier best cope with this variability, and what is the best performance possible?
Is it possible to extract all the values of a particular feature of a particular input?
What should the classifier do in case some of the feature data is missing?

Classification consists of determining to which decision region a feature vector x belongs.
The feature space is divided into decision regions; the borders between them are called decision boundaries.
 Post-processing
A post-processor takes other considerations into account, such as the effects of context and the costs of errors, to decide on the appropriate action.

This poses several challenges:

 How do we account for the error rate when deciding whether to act on a classification?

 How do we incorporate knowledge about costs and use the minimum-cost model without compromising our classification decision?

 In the case of multiple classifiers, how does the “super” classifier pool the evidence to arrive at the best decision? Is the majority always right? How would a classifier know when to base the decision on a minority opinion?
The Design Cycle

 Data collection
 Feature choice
 Model choice
 Training
 Evaluation
 Computational complexity

[Figure: the pattern recognition design cycle]
 Data collection
Probably the most time-intensive component of a PR project.
Can account for a large part of the cost of developing a PR system.
How many examples are enough?
Is the data set adequately large to yield accurate results?

 Feature choice
Critical design step.
Requires basic prior knowledge
Does prior knowledge always yield relevant information?

 Model choice
How do we determine when a hypothetical model differs from the true
model underlying the existing patterns?
Determining which model is best suited for the problem at hand.
 Training
Using the data to determine the classifier – training the classifier.
Supervised, unsupervised and reinforcement learning.

 Evaluation
How well does the trained model do?
Evaluation necessary to measure the performance of the system and also
to identify the need for improvements in its components.
Overfitting vs. generalization
The need to arrive at a best complexity for a model.
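To illustrate the overfitting-vs-generalization point, the hedged sketch below holds out test data, exposing the gap between training fit and true performance; the synthetic dataset and tree depths are purely illustrative.

```python
# Overfitting vs. generalization: compare train and held-out test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unconstrained
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))  # 1.0 train, lower test

pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))  # smaller gap
```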
Conclusion
 The number, magnitude and complexity of subproblems in PR are overwhelming.
 Though it is still an emerging field, various applications are already running successfully.
 There still remain several fascinating unsolved problems providing
immense opportunities for progress.
