Lec 1
Books:
1. Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber
and Jian Pei.
2. Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and
Vipin Kumar
3. Machine Learning: A Probabilistic Perspective by Kevin P. Murphy
Why Data Mining?
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
What Is Data Mining?
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: KDD pipeline: Databases -> Data Cleaning -> Data Integration -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
KDD Process: Several Key Steps
• Learning the application domain
  – relevant prior knowledge and goals of the application
• Identifying a target data set: data selection
• Data preprocessing (a code sketch follows this list)
  – Data cleaning (remove noise and inconsistent data)
  – Data integration (multiple data sources may be combined)
  – Data selection (data relevant to the analysis task are retrieved from the database)
  – Data transformation (data are transformed or consolidated into forms appropriate for mining)
• Data mining (an essential process where intelligent methods are applied to extract data patterns)
• Pattern evaluation (identify the truly interesting patterns)
• Knowledge presentation (mined knowledge is presented to the user with visualization or representation techniques)
• Use of discovered knowledge
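As a rough illustration of the preprocessing steps above, here is a minimal sketch using pandas; the file names and columns (sales.csv, customers.csv, customer_id, age, amount) are hypothetical placeholders, not part of the lecture.

```python
# Minimal sketch of the preprocessing steps above, using pandas.
# File and column names are hypothetical placeholders.
import pandas as pd

# Data integration: combine multiple data sources
sales = pd.read_csv("sales.csv")          # e.g. transaction records
customers = pd.read_csv("customers.csv")  # e.g. customer profiles
data = sales.merge(customers, on="customer_id")

# Data cleaning: remove noise and inconsistent data
data = data.drop_duplicates()
data = data.dropna(subset=["age", "amount"])
data = data[data["amount"] > 0]           # discard impossible values

# Data selection: keep only attributes relevant to the analysis task
task_data = data[["customer_id", "age", "amount"]]

# Data transformation: consolidate into a form appropriate for mining
task_data = task_data.assign(
    amount_z=(task_data["amount"] - task_data["amount"].mean())
             / task_data["amount"].std()
)
```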
A Typical View from ML and Statistics
Multi-Dimensional View of Data Mining
• Data to be mined
– Database data (extended-relational, object-oriented, heterogeneous,
legacy), transactional data, stream, spatiotemporal, time-series, sequence,
text and web, multi-media, graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
– Regression, association, classification, clustering, trend/deviation, outlier
analysis, summarization, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Machine learning, statistics, pattern recognition, visualization, Natural
Language Processing, Image Processing, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Different Kinds of Datasets
Relational Databases
• Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
• Description Tasks
– Find human-interpretable patterns that describe the data.
• Preprocessing: images of different fishes are isolated from one another and from the background
• Domain knowledge:
  ◦ A sea bass is generally longer than a salmon
• Relevant feature (or attribute):
  ◦ Length
• Training the classifier:
  ◦ Some examples are provided to the classifier in this form: <fish_length, fish_name>
  ◦ These examples are called training examples
  ◦ The classifier learns from the training examples how to distinguish salmon from sea bass based on fish_length
Classification
• So the overall process looks like this:
[Figure: classification workflow. Labeled training data -> Pre-processing, Feature extraction -> Feature vectors -> Training -> Classification Model; Unlabeled test data -> Pre-processing, Feature extraction -> Feature vectors -> Testing against the model (Classification) -> Prediction/Evaluation]
• Training data shown in the figure: <12, salmon>, <15, sea bass>, <8, salmon>, <5, sea bass>
• Learned model: if len > 12, then sea bass, else salmon
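The workflow above can be sketched in a few lines of Python. The labeled examples and the threshold of 12 are taken from the figure; the helper names are illustrative.

```python
# Sketch of the training/testing workflow: a one-feature rule-based
# classifier on fish length, using the labeled data from the figure.
labeled_data = [(12, "salmon"), (15, "sea bass"), (8, "salmon"), (5, "sea bass")]

def classify(length, threshold=12):
    """Apply the learned rule: if length > threshold then sea bass, else salmon."""
    return "sea bass" if length > threshold else "salmon"

# "Testing against the model": evaluate the rule on the labeled examples
correct = sum(classify(length) == label for length, label in labeled_data)
accuracy = correct / len(labeled_data)
print(f"accuracy = {accuracy:.0%}")   # 3 out of 4 correct -> 75%
```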
Classification
• Why error?
Insufficient training data
Too few features
Too many/irrelevant features
Overfitting / specialization
Overfitting: model learns to perform well on training data but fails to perform on unseen data
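To make the overfitting point concrete, here is an illustrative sketch (not from the lecture) using scikit-learn: an unrestricted decision tree fits the training set almost perfectly, while its test accuracy lags behind. The dataset is synthetic.

```python
# Illustrative sketch of overfitting: a very deep decision tree memorizes
# the training data (near-perfect training accuracy) while its accuracy on
# held-out test data is noticeably lower.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, None):  # shallow vs. unrestricted tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")
```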
Classification
• New Feature:
– Average lightness of the fish scales
Classification: Evaluation/Prediction
[Figure: test data -> feature vector -> trained model -> prediction]
Classification: Performance
• Accuracy:
  – % of test data correctly classified
  – In our first example, accuracy was 3 out of 4 = 75%
  – In our second example, accuracy was 4 out of 4 = 100%
• False positive:
  – Negative class incorrectly classified as positive
  – Usually, the larger class is the negative class
  – Suppose salmon is the negative class and sea bass is the positive class
Classification
[Figure: classification results with false positives and false negatives marked]
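A small sketch of how accuracy, false positives, and false negatives can be computed, with sea bass as the positive class and salmon as the negative class as in the slide; the example labels are illustrative.

```python
# Sketch of the performance measures above.
def evaluate(true_labels, predicted_labels, positive="sea bass"):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    accuracy = correct / len(true_labels)
    false_positives = sum(t != positive and p == positive
                          for t, p in zip(true_labels, predicted_labels))
    false_negatives = sum(t == positive and p != positive
                          for t, p in zip(true_labels, predicted_labels))
    return accuracy, false_positives, false_negatives

# Illustrative 3-out-of-4 case: one sea bass is missed
true = ["salmon", "sea bass", "salmon", "sea bass"]
pred = ["salmon", "sea bass", "salmon", "salmon"]
print(evaluate(true, pred))   # (0.75, 0, 1): 75% accuracy, 1 false negative
```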
Clustering: Application 1
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their geographical
and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
Cluster Analysis
– Class label is unknown: group data to form new classes
– Clusters of objects are formed based on the principle of
maximizing intra-class similarity & minimizing inter-class similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing.
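A minimal clustering sketch, assuming scikit-learn and made-up customer attributes (age and annual spend); the resulting cluster labels could then be compared against the customers' buying patterns as described above.

```python
# Minimal sketch of cluster analysis for customer segmentation.
# The attributes (age, annual spend) and values are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = np.array([
    [22, 300], [25, 350], [27, 420],      # young, low spend
    [45, 2200], [48, 2500], [52, 2100],   # middle-aged, high spend
    [63, 800], [67, 750], [70, 900],      # older, moderate spend
], dtype=float)

# Scale features so age and spend contribute comparably to the distance
X = StandardScaler().fit_transform(customers)

# Group customers into k=3 homogeneous subpopulations
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster id per customer, e.g. [0 0 0 1 1 1 2 2 2]
```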
Clustering: Application 2
• Document Clustering:
– Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate
a new document or search term to clustered documents.
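A sketch of the document-clustering approach, assuming scikit-learn: weight frequently occurring terms with TF-IDF, derive similarity from the resulting vectors, and cluster. The toy documents are made up for illustration.

```python
# Sketch of document clustering: TF-IDF term weights plus k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock market prices fell sharply today",
    "investors watch the stock market closely",
    "the salmon swam upstream to spawn",
    "fishing for sea bass and salmon in the river",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)          # term-frequency based document vectors

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents about the same topic should share a cluster id
```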
Fraud Detection
• Outlier Analysis
  – Data that do not comply with the general behavior or model.
  – Outliers are usually discarded as noise or exceptions.
  – Useful for fraud detection.
    • E.g. Detect purchases of extremely large amounts
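A minimal outlier-analysis sketch for the fraud-detection example above, flagging purchase amounts that lie far from the rest of the data with a simple z-score rule; the amounts are made up for illustration.

```python
# Sketch of outlier analysis: flag purchases whose amount is far from the
# bulk of the data (here, more than 2 standard deviations from the mean).
import numpy as np

amounts = np.array([25, 40, 33, 28, 51, 37, 45, 30, 4200], dtype=float)

mean, std = amounts.mean(), amounts.std()
z_scores = (amounts - mean) / std

outliers = amounts[np.abs(z_scores) > 2]   # "extremely large amounts"
print(outliers)                            # -> [4200.]
```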
ASSOCIATION RULE MINING
Association Rule Discovery

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
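The discovered rules are typically scored by support and confidence. Here is a small sketch of those two standard measures computed over the five transactions from the table; the code layout itself is illustrative.

```python
# Score the rules above by support and confidence over the table's transactions.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))   # 0.6, 0.75
print(support({"Diaper", "Milk", "Beer"}),
      confidence({"Diaper", "Milk"}, {"Beer"}))                    # 0.4, ~0.67
```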
REGRESSION
Regression
• Predict future values based on past values
• Linear Regression assumes a linear relationship exists:
  y = c0 + c1 x1 + … + cn xn
• Find the coefficient values that best fit the data
• Some real-world examples for regression analysis include:
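As a minimal sketch of fitting such a linear model by ordinary least squares with numpy; the data points below are made up for illustration.

```python
# Fit y = c0 + c1*x1 by least squares and predict a future value.
import numpy as np

# one predictor x1 and observed responses y (roughly y = 1 + 2*x)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a leading column of ones for the intercept c0
X = np.column_stack([np.ones_like(x), x])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
c0, c1 = coeffs
print(f"y ~ {c0:.2f} + {c1:.2f} * x1")

# Predict a future value from the fitted model
print(c0 + c1 * 5.0)
```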
Major Issues in Data Mining
– Data mining query languages and ad hoc data mining
  • High-level query languages need to be developed
  • Should be integrated with a DB/DW query language
– Presentation and visualization of results
  • Knowledge should be easily understood and directly usable
  • High-level languages, visual representations or other expressive forms
  • Require the DM system to adopt the above techniques
– Handling noisy or incomplete data
  • Require data cleaning methods and data analysis methods that can handle noise
– Pattern evaluation – the interestingness problem
  • How to develop techniques to assess the interestingness of discovered patterns, especially with subjective measures based on user beliefs or expectations
Major Issues in Data Mining
• Performance Issues
– Efficiency and scalability
• Huge amount of data
• Running time must be predictable and acceptable
– Parallel, distributed and incremental mining algorithms
• Divide the data into partitions and process them in parallel
• Incorporate database updates without having to mine the entire data again
from scratch