Data Mining
Data Mining Definition
• Finding hidden information in a huge store of data
• Fit data to a model
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Inductive learning
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs
Potential Applications
• Market analysis and management
– target marketing, CRM, market basket analysis, cross selling,
market segmentation
• Risk analysis and management
– Forecasting, customer retention, quality control, competitive
analysis
• Fraud detection and management
• Text mining (news group, email, documents) and Web analysis.
– Intelligent query answering
Market Analysis and Management
• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls
• Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
• Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
Market Analysis and Management
• Customer profiling
– data mining can tell you what types of customers buy what
products (clustering or classification)
• Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new
customers
• Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency and
variation)
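The statistical summary mentioned above can be sketched in a few lines. This is a minimal illustration using Python's standard `statistics` module; the purchase amounts are made-up values, not data from these slides.

```python
import statistics

# Hypothetical purchase amounts for one customer segment (illustration only).
purchases = [12.0, 15.5, 15.5, 18.0, 22.0, 95.0]

summary = {
    "mean": statistics.mean(purchases),      # central tendency
    "median": statistics.median(purchases),  # less sensitive to the 95.0 value
    "mode": statistics.mode(purchases),      # most frequent amount
    "stdev": statistics.stdev(purchases),    # variation (sample standard deviation)
    "range": max(purchases) - min(purchases),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

The single large purchase pulls the mean well above the median, which is why a summary report usually shows both.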
Fraud Detection and Management
• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Approach
– use historical data to build models of fraudulent behavior and
use data mining to help identify similar instances
• Examples
– auto insurance: detect a group of people who stage accidents
to collect on insurance
– money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of
doctors, and rings of references
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of
finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the
information and patterns derived by the KDD process.
KDD Process: Several Key Steps
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction,
invariant representation
• Choosing functions of data mining
– summarization, classification, regression, association,
clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
• Use of discovered knowledge
Knowledge Discovery (KDD) Process
• Data mining: core of the knowledge discovery process
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Layers, top to bottom:
Making Decisions (End User);
Data Exploration: Statistical Analysis, Querying and Reporting;
Data Warehouses / Data Marts: OLAP, MDA (DBA);
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Data Mining Development
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms
DATA MINING, VESIT
M. VIJAYALAKSHMI
Data Mining Functionalities
• Multidimensional concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions
• Frequent patterns, association, correlation vs. causality
– Milk → Fruit [0.5%, 75%], i.e. support 0.5%, confidence 75% (correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish
classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars
based on gas mileage
– Predict some unknown or missing numerical values
Data Mining Functionalities (2)
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
– Maximizing intra-class similarity & minimizing interclass
similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general
behavior of the data
– Noise or exception? Useful in fraud detection and rare events analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
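Outlier analysis can be illustrated with a small sketch. The slides do not name a detection method; the modified z-score below (based on the median and the median absolute deviation, with the commonly used 3.5 cut-off) is one common choice and is an assumption here, as are the transaction amounts.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread around the median, so no score can be computed
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical transaction amounts (illustration only).
amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 250.0]
print(mad_outliers(amounts))
```

Whether a flagged value is noise or a meaningful exception (for example, fraud) still has to be decided by a person or a further model.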
Data Mining Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of
abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and
global information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem
– Protection of data security, integrity, and privacy
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
Are All the “Discovered” Patterns Interesting?
• Data mining may generate thousands of patterns, and not all of
them are interesting
– Suggested approach: human-centered, query-based, focused mining
• Interestingness measures
– A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty,
potentially useful, novel, or validates some hypothesis that a
user seeks to confirm
• Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
– Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
– Can a data mining system find all the interesting patterns? Do
we need to find all of the interesting patterns?
– Heuristic vs. exhaustive search
– Association vs. classification vs. clustering
[Figure: data mining system architecture. Layer 1: databases, data warehouse, and data repository, reached through data cleaning, data integration, filtering and integration, and a database API. Layer 2: MDDB and metadata. Above these: data mining engine with knowledge base, then pattern evaluation.]
July 7, 2024
Data Mining Models and Tasks
© Prentice Hall
Predictive models
Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.
Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
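As a rough sketch of "predict future values", the snippet below forecasts the next point of a series as the mean of its most recent observations. A moving average is only a naive baseline chosen for illustration, not a method named in the slides, and the prices are invented.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

# Hypothetical daily closing prices (illustration only).
prices = [101.0, 103.0, 102.0, 105.0, 107.0, 106.0]
forecast = moving_average_forecast(prices)
print(forecast)
```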
KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
Data Mining Techniques
http://www.alice-soft.com/html/tech_dt.htm
Example: Using Decision Trees to Predict
Classifications - ALICE d'ISoft
Split the records according to most discriminating attribute:
housing type
Example Classification Rule: People who rent their home and
earn more than 7853 Francs have an 86% success rate.
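One way to pick the "most discriminating attribute" for a split is information gain, the entropy reduction a split achieves. The slides do not say which criterion ALICE d'ISoft uses, so treat the criterion, the attribute names, and the records below as assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy reduction obtained by splitting `records` on `attribute`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# Hypothetical applicant records (illustration only).
records = [
    {"housing": "rent", "income": "high", "success": "yes"},
    {"housing": "rent", "income": "high", "success": "yes"},
    {"housing": "rent", "income": "low", "success": "yes"},
    {"housing": "own", "income": "high", "success": "no"},
    {"housing": "own", "income": "low", "success": "no"},
    {"housing": "own", "income": "low", "success": "yes"},
]

gains = {a: information_gain(records, a, "success") for a in ("housing", "income")}
best = max(gains, key=gains.get)
print(best, gains)
```

In this toy data, splitting on housing type separates the classes better than splitting on income, so it would be chosen for the first split; a full tree repeats the same choice inside each branch.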
Data Mining Techniques
Association (or link analysis): search all details or
transactions from operational systems for patterns
with a high probability of repetition
• Leads to an associative algorithm that correlates one set of
events or items with another set of events or items
• Example association rule:
– 83% of all records that contain items A, B, C also
contain items D and E
– Here 83% is the confidence factor of the rule
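The confidence factor of a rule can be computed directly from transaction data. The sketch below uses the standard definitions (support is the fraction of transactions containing an itemset; confidence is the support of both sides divided by the support of the left-hand side). The five transactions are invented for illustration and do not reproduce the 83% figure.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimated P(rhs | lhs); assumes `lhs` occurs in at least one transaction."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical transactions (illustration only).
transactions = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"C", "D", "E"},
]

rule_support = support(transactions, {"A", "B", "C", "D", "E"})
rule_confidence = confidence(transactions, {"A", "B", "C"}, {"D", "E"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here three transactions contain A, B and C, and two of those also contain D and E, so the rule {A, B, C} → {D, E} holds with confidence 2/3.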
Data Mining Techniques
[Figure: LEFT and RIGHT sides of an association rule]
Marakas, G.M. (2002) Decision support systems in the 21st Century. 2nd Ed, Prentice Hall
An Example
Rule                              Lift   Support   Confidence
Green Peppers IMPLIES Bananas     1.37   3.77      85.96
Red Peppers IMPLIES Bananas       1.43   8.58      89.47
Yellow Peppers IMPLIES Bananas    1.17   22.12     73.09
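Lift is conventionally defined as the rule's confidence divided by the overall frequency of the right-hand side. The slide does not define it, so that definition, and reading the confidence values as percentages, are assumptions here. Under them, each rule in the table implies a share of baskets containing bananas, and the three estimates should roughly agree:

```python
# (confidence in %, lift) per rule, copied from the table above.
rules = {
    "Green Peppers -> Bananas": (85.96, 1.37),
    "Red Peppers -> Bananas": (89.47, 1.43),
    "Yellow Peppers -> Bananas": (73.09, 1.17),
}

# lift = confidence / P(Bananas)  =>  P(Bananas) = confidence / lift
implied_banana_share = {rule: conf / lift for rule, (conf, lift) in rules.items()}

for rule, share in implied_banana_share.items():
    print(f"{rule}: bananas in about {share:.1f}% of baskets")
```

All three rules point to bananas appearing in roughly 62 to 63 percent of baskets, so a lift above 1 means pepper buyers purchase bananas more often than shoppers in general.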
[Table: item co-occurrence counts; only one row survives: Pretzels 0 1 1 0 2]
Pizza and Cola sell together more often than any other
combo; a cross-marketing opportunity?
Milk sells well with everything – people probably come here
specifically to buy it.
Marakas, G.M. (2002) Decision support systems in the 21st Century. 2nd Ed, Prentice Hall
Limitations of Market Basket Analysis
Marakas, G.M. (2002) Decision support systems in the 21st Century. 2nd Ed, Prentice Hall
Market Basket Analysis in PolyAnalyst
http://www.megaputer.com/products/pa/algorithms/ba.php3
HealthCare Fraud Example
Market basket analysis plus summary statistics reveal providers
sharing a large number of patients, a sign of potential provider fraud
http://www.megaputer.com
Data Mining Techniques
Sequencing or time-series analysis: techniques that
relate events in time
• Prediction of interest rate fluctuations or stock performance
based on a series of preceding events
• E.g. buying sequence: parents buy promotional toys associated
with a particular movie within 2 weeks after renting the movie,
so a flyer campaign for promotional toys should be linked to
customer lists created as a result of movie rentals
• A sequence of customer purchases indicates which catalogue of
specific product types can be target-mailed to the customer
Marakas, G.M. (2002) Decision support systems in the 21st Century. 2nd Ed, Prentice Hall
Association and Sequencing
Association and sequencing tools analyse data to discover
rules that identify patterns of behaviour. An association
tool will find rules such as:
• When people buy diapers they also buy beer 50 percent of the time.
• People who have purchased a VCR are three times more likely to
purchase a camcorder in the time period two to four months after the
VCR was purchased.
http://www.dbmsmag.com/9807m03.html
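A sequencing rule such as the VCR and camcorder example adds a time window to an association. The sketch below counts, among customers who bought a first item, the share who bought a second item within a given number of days afterwards. The function name, the 60 to 120 day window, and the purchase histories are hypothetical.

```python
from datetime import date

def followed_within(purchases, first, then, min_days, max_days):
    """Share of buyers of `first` who bought `then` between `min_days`
    and `max_days` after their earliest purchase of `first`."""
    buyers = 0
    followers = 0
    for history in purchases.values():
        first_dates = [d for item, d in history if item == first]
        if not first_dates:
            continue
        buyers += 1
        start = min(first_dates)
        if any(item == then and min_days <= (d - start).days <= max_days
               for item, d in history):
            followers += 1
    return followers / buyers if buyers else 0.0

# Hypothetical purchase histories keyed by customer id (illustration only).
purchases = {
    "c1": [("VCR", date(1998, 1, 10)), ("camcorder", date(1998, 4, 2))],
    "c2": [("VCR", date(1998, 2, 1)), ("camcorder", date(1998, 2, 15))],
    "c3": [("VCR", date(1998, 3, 5))],
    "c4": [("camcorder", date(1998, 3, 20))],
}

rate = followed_within(purchases, "VCR", "camcorder", min_days=60, max_days=120)
print(f"{rate:.2f}")
```

To state that VCR buyers are "three times more likely" to buy a camcorder, this rate would still need to be compared with the camcorder purchase rate among customers who did not buy a VCR.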
Association and Sequencing
Example in care management, procedure interactions
and pharmaceutical interactions
http://www.dbmsmag.com/9807m03.html
Association and Sequencing
Example in fraud detection in telecommunications and
insurance:
http://www.dbmsmag.com/9807m03.html
Data Mining Techniques
Clustering: technique for creating partitions so that
all members of each set are similar according to
some metric or set of metrics
• e.g., credit card purchase data
• Cluster 1: business-issued gold card, meals charged
on weekdays, mean value greater than $250
• Cluster 2: personal platinum card, meals charged on
weekends, mean value $175, bottle of wine
charged more than 65% of the time
Marakas, G.M. (2002) Decision support systems in the 21st Century. 2nd Ed, Prentice Hall
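A minimal clustering sketch, assuming k-means with Euclidean distance as the metric (the slide only says "some metric"). Each hypothetical point is a cardholder described by mean meal amount in dollars and the share of meals charged on weekends; the data and the starting centroids are invented for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) from the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical cardholders: (mean meal amount in $, share of weekend charges).
points = [(260.0, 0.1), (300.0, 0.0), (280.0, 0.2),
          (170.0, 0.9), (180.0, 1.0), (175.0, 0.8)]

final_centroids, clusters = kmeans(points, [points[0], points[3]])
print(final_centroids)
```

Because the dollar amounts are much larger than the weekend shares, they dominate the distance here; in practice the features would normally be rescaled before clustering.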
Clustering- Example
Identifying natural clusters of patient populations
p://www.enee.umd.edu/medlab/papers/dcsThShort/thpaper1.html
Current Limitations and Challenges to
Data Mining
Marakas, G.M. (2002) Decision support systems in the 21st Century. 2nd Ed, Prentice Hall
Summary
• Business intelligence systems (BIS) with data mining tools can
find hidden patterns in large datasets and use these patterns
to turn data into actionable information
• BIS using data mining tools need data visualisation tools
to present such hidden patterns to the end user
• Hidden patterns, when placed in the hands of decision
makers, become actionable information or business
intelligence
References
Marakas, G.M. (2002) Decision support systems in the 21st
Century. 2nd Ed, Prentice Hall (or other editions)
Power, D. (2002) Decision Support Systems: Concepts and
Resources for Managers, Quorum Books.
FREE online resource: Data Mining booklet
http://www.twocrows.com/intro-dm.pdf
Regression Analysis
Module 3
Regression
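Simple linear regression model:
$y = b_0 + b_1 x + \varepsilon$, with prediction $\hat{y} = b_0 + b_1 x$
• $b_0$ = y intercept
• $b_1$ = slope = $\Delta y / \Delta x$
• $\varepsilon$ = prediction error, $\varepsilon = y - \hat{y}$
[Figure: dependent variable (y) plotted against independent variable (x), marking an observation $y$, its prediction $\hat{y}$, and the prediction error $\varepsilon$]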
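$SSR = \sum (\hat{y} - \bar{y})^2$ (measure of explained variation)
$SSE = \sum (y - \hat{y})^2$ (measure of unexplained variation)
$R^2 = \frac{SSR}{SST} = \frac{SSR}{SSR + SSE}$
Here $\bar{y}$ is the mean of the observed values and $SST = \sum (y - \bar{y})^2$ is the total variation.
The value of $R^2$ ranges between 0 and 1; the higher its value, the
larger the share of variation the regression model explains.
It is often quoted as a percentage.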
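The quantities above can be computed by hand for a small data set. The sketch below fits a line by ordinary least squares and then forms SSR, SSE and R squared as defined on this slide; the five (x, y) pairs are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def r_squared(xs, ys, b0, b1):
    """R^2 = SSR / (SSR + SSE) for the fitted line."""
    y_bar = sum(ys) / len(ys)
    predictions = [b0 + b1 * x for x in xs]
    ssr = sum((p - y_bar) ** 2 for p in predictions)          # explained variation
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained variation
    return ssr / (ssr + sse)

# Hypothetical observations (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
r2 = r_squared(xs, ys, b0, b1)
print(f"b0={b0:.2f} b1={b1:.2f} R^2={r2:.4f}")
```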
Standard Error of Regression
$y = A + \beta x + \varepsilon$, where $\beta = \Delta y / \Delta x$
Multiple Linear Regression
A multiple regression takes the form:
$y = A + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$
SUMMARY OUTPUT
Model: Y = B0 + B1 X1 + B2 X2 + B3 X3 + ... +/- Error
(Total = Estimated/Predicted +/- Error)

Regression Statistics
Multiple R          0.982655
R Square            0.96561
Adjusted R Square   0.959879
Standard Error      26.01378
Observations        15

ANOVA
             df   SS         MS         F          Significance F
Regression   2    228014.6   114007.3   168.4712   1.65E-09
Residual     12   8120.603   676.7169
Total        14   236135.2

              Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept     562.151        21.0931          26.65094    4.78E-12   516.1931    608.1089
Temperature   -5.436581      0.336216         -16.1699    1.64E-09   -6.169133   -4.704029
Insulation    -20.01232      2.342505         -8.543127   1.91E-06   -25.1162    -14.90844
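The regression statistics in this output follow from the ANOVA table, which can be checked with a few lines of arithmetic. The sketch below recomputes them from the printed sums of squares and builds a prediction function from the Coefficients column. The slides do not name the dependent variable or its units, and the inputs passed to `predict` are hypothetical.

```python
# Values copied from the SUMMARY OUTPUT above.
n, k = 15, 2                 # observations, predictors (Temperature, Insulation)
ss_regression = 228014.6
ss_residual = 8120.603
ss_total = 236135.2

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
ms_regression = ss_regression / k
ms_residual = ss_residual / (n - k - 1)
f_statistic = ms_regression / ms_residual
standard_error = ms_residual ** 0.5

def predict(temperature, insulation):
    """Fitted equation built from the Coefficients column."""
    return 562.151 - 5.436581 * temperature - 20.01232 * insulation

print(f"R^2={r_square:.5f} adj R^2={adjusted_r_square:.6f}")
print(f"F={f_statistic:.3f} SE={standard_error:.5f}")
print(f"prediction={predict(30, 6):.2f}")  # hypothetical inputs
```

Both coefficients are negative with very small p-values, so in this fitted model a higher temperature or more insulation is associated with a lower value of the dependent variable.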