Lec 1


Data Mining

Books:
1. Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber
and Jian Pei.
2. Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and
Vipin Kumar
3. Machine Learning: A Probabilistic Perspective by Kevin P. Murphy
Why Data Mining?

• The Explosive Growth of Data: from terabytes to petabytes


– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
Evolution of Sciences
• Before 1600, empirical science
– Empirical means based on observations or experience; theoretical means
based on theories and hypotheses.
• 1600-1950s, theoretical science
– Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
• 1950s-1990s, computational science
– Over the last 50 years, most disciplines have grown a third, computational
branch (e.g. empirical, theoretical, and computational ecology, or physics, or
linguistics.)
– Computational Science traditionally meant simulation. It grew out of our inability
to find closed-form solutions for complex mathematical models.
• 1990-now, data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally
accessible
– Scientific info. management, acquisition, organization, query, and visualization
tasks scale almost linearly with data volumes.
Evolution of Database Technology

• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems

What Is Data Mining?

• Data mining (knowledge discovery from data)


– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data

Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process

[Figure: the KDD pipeline]
Databases → Data Cleaning → Data Integration → Data Warehouse → Selection →
Task-relevant Data → Data Mining → Pattern Evaluation
KDD Process: Several Key Steps
• Learning the application domain
– relevant prior knowledge and goals of application
• Identifying a target data set: data selection
• Data preprocessing
– Data cleaning (remove noise and inconsistent data)
– Data integration (multiple data sources may be combined)
– Data selection (data relevant to the analysis task are retrieved from the database)
– Data transformation (data transformed or consolidated into forms appropriate for mining)
• Data mining (an essential process where intelligent methods are applied to extract
data patterns)
• Pattern evaluation (identify the truly interesting patterns)
• Knowledge presentation (mined knowledge is presented to the user with
visualization or representation techniques)
• Use of discovered knowledge
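The preprocessing steps above can be sketched in a few lines of Python; the records, field names, and values below are entirely made up for illustration.

```python
# A minimal sketch of data cleaning, integration, selection and
# transformation on hypothetical in-memory sales records.

# Two data sources to integrate: store sales and an online channel.
store_sales = [{"id": 1, "item": "laptop", "price": 900.0},
               {"id": 2, "item": "mouse",  "price": None},   # noisy record
               {"id": 3, "item": "phone",  "price": 600.0}]
online_sales = [{"id": 4, "item": "laptop", "price": 850.0}]

# Data cleaning: remove records with missing values.
cleaned = [r for r in store_sales if r["price"] is not None]

# Data integration: combine the two sources.
integrated = cleaned + online_sales

# Data selection: keep only the attributes relevant to the mining task.
selected = [{"item": r["item"], "price": r["price"]} for r in integrated]

# Data transformation: min-max normalize price into [0, 1].
lo = min(r["price"] for r in selected)
hi = max(r["price"] for r in selected)
for r in selected:
    r["price"] = (r["price"] - lo) / (hi - lo)

print(selected)
```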
A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

– Data pre-processing: data integration, normalization, feature selection,
dimension reduction
– Data mining: pattern discovery, association & correlation, classification,
clustering, outlier analysis, …
– Post-processing: pattern evaluation, pattern selection, pattern interpretation,
pattern visualization

• This is a view from typical machine learning and statistics communities
Multi-Dimensional View of Data Mining
• Data to be mined
– Database data (extended-relational, object-oriented, heterogeneous,
legacy), transactional data, stream, spatiotemporal, time-series, sequence,
text and web, multi-media, graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
– Regression, association, classification, clustering, trend/deviation, outlier
analysis, summarization, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Machine learning, statistics, pattern recognition, visualization, Natural
Language Processing, Image Processing, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Different Kinds of Datasets

• Database-oriented data sets and applications


– Relational database, data warehouse, transactional database
(a transactional database stores records of interactions between entities)
• Advanced data sets and advanced applications
– Data streams and sensor data
– Temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web

Relational Databases

• DBMS (database management system): contains a collection of interrelated
databases
e.g. faculty database, student database, publications database
• Each database contains a collection of tables and functions to manage and
access the data.
e.g. student_bio, student_graduation, student_parking
• Each table contains columns and rows, with columns as attributes of the data
and rows as records.
• Tables can also be used to represent the relationships between or among
multiple tables.
Relational Databases – AllElectronics store
Relational Databases

• With a relational query language, e.g. SQL, we will be able to find
answers to questions such as:
– How many items were sold last year?
– Who has earned commissions higher than 10%?
– What are the total sales of last month for Dell laptops?

• When data mining is applied to relational databases, we can search


for trends or data patterns.

• Relational databases are one of the most commonly available and


rich information repositories, and thus are a major data form in our
study.
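As a sketch of how such questions map onto SQL, the snippet below builds a tiny in-memory table with Python's sqlite3 module; the table and column names are hypothetical, since the slides do not show the AllElectronics schema.

```python
import sqlite3

# Hypothetical sales table (the real AllElectronics schema is not shown
# in the slides, so table and column names here are made up).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (item TEXT, qty INTEGER, year INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("Dell laptop", 3, 2023),
                 ("mouse",       5, 2023),
                 ("Dell laptop", 2, 2024)])

# "How many items were sold last year?" (treating 2023 as last year)
(total,) = con.execute(
    "SELECT SUM(qty) FROM sales WHERE year = 2023").fetchone()
print(total)  # 8
```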
Transactional Databases

• Consists of a file where each record represents a transaction


• A transaction typically includes a unique transaction ID and a
list of the items making up the transaction.

• Either stored in a flat file or unfolded into relational tables


• Easy to identify items that are frequently sold together
Data Mining Tasks

• Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
• Description Tasks
– Find human-interpretable patterns that describe the data.

Common data mining tasks


– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]
Classification: A Predictive task
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the
class.
• Find a model for class attribute as a function of
the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of
the model. Usually, the given data set is divided
into training and test sets, with training set
used to build the model and test set used to
validate it.
Classification: An Example

(from Pattern Classification by Duda & Hart & Stork –


Second Edition, 2001)
• A fish-packing plant wants to automate the
process of sorting incoming fish according to
species

• As a pilot project, it is decided to try to separate


sea bass from salmon using optical sensing
Classification

• Features (to distinguish):


(i) Length
(ii) Lightness
(iii) Width
(iv) Position of mouth
Classification

 Preprocessing: Images of different fish are isolated from one another and
from the background

 Feature extraction: The information of a single fish is then sent to a
feature extractor, which measures certain “features” or “properties”

 Classification: The values of these features are passed to a classifier
that evaluates the evidence presented and builds a model to discriminate
between the two species
Classification

 Domain knowledge:
◦ A sea bass is generally longer than a salmon
 Related feature: (or attribute)
◦ Length
 Training the classifier:
◦ Some examples are provided to the classifier in this
form: <fish_length, fish_name>
◦ These examples are called training examples
◦ The classifier learns from the training examples how to
distinguish Salmon from Bass based on the fish_length
Classification

 Classification model (hypothesis):


◦ The classifier generates a model from the training data to
classify future examples (test examples)
◦ An example of the model is a rule like:

Rule: If Length >= l* then sea bass otherwise salmon

◦ Here the value of l* is determined by the classifier


 Testing the model
◦ Once we get a model out of the classifier, we may use the
classifier to test future examples.
◦ The test data is provided in the form <fish_length>
◦ The classifier outputs <fish_type> by checking fish_length
against the model
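A minimal sketch of this classifier in Python. The slides do not specify how l* is chosen, so here it is simply the candidate threshold with the best training accuracy; the training lengths are illustrative.

```python
# Learn the threshold l* from <fish_length, fish_name> training examples,
# then classify by the rule: if length >= l* then sea bass else salmon.

def train(examples):
    """examples: list of (length, label) pairs; returns threshold l*."""
    candidates = sorted(length for length, _ in examples)
    def accuracy(t):
        return sum((("sea bass" if length >= t else "salmon") == label)
                   for length, label in examples) / len(examples)
    # Pick the candidate threshold with the best training accuracy.
    return max(candidates, key=accuracy)

def classify(length, l_star):
    return "sea bass" if length >= l_star else "salmon"

training = [(12, "salmon"), (15, "sea bass"), (8, "salmon"), (5, "sea bass")]
l_star = train(training)
print(l_star, classify(18, l_star))  # → 15 sea bass
```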
Classification

• So the overall classification process goes like this:
– Training: Training Data → Preprocessing and feature extraction → Feature
vectors → Training → Classification Model
– Testing: Test/Unlabeled Data → Preprocessing and feature extraction →
Feature vectors → Testing against model → Prediction/Evaluation
Classification

• Training (preprocessing and feature extraction produce the feature vectors):
Training data (labeled): 12, salmon | 15, sea bass | 8, salmon | 5, sea bass
Learned model: If len > 12 then sea bass else salmon

• Testing (evaluation/prediction):
15, salmon → sea bass (error!)
10, salmon → salmon (correct)
18, ? → sea bass
8, ? → salmon
Classification

• Why error?
 Insufficient training data
 Too few features
 Too many/irrelevant features
 Overfitting / specialization
Overfitting: the model learns to perform well on the training data but fails to generalize to unseen data
Classification

• New Feature:
– Average lightness of the fish scales
Classification

• Training with the new feature added:
Training data (length, lightness, label): 12, 4, salmon | 15, 8, sea bass |
8, 2, salmon | 5, 10, sea bass
Learned model: If ltns > 6 or len*5 + ltns*2 > 100 then sea bass else salmon

• Testing (evaluation/prediction):
15, 2, salmon → salmon (correct)
10, 7, salmon → salmon (correct)
18, 7, ? → sea bass
8, 5, ? → salmon
Classification: Performance

 Accuracy:
 % of test data correctly classified
 In our first example, accuracy was 3 out of 4 = 75%
 In our second example, accuracy was 4 out of 4 = 100%
 False positive:
 Negative class incorrectly classified as positive
 Usually, the larger class is the negative class
 Suppose
 salmon is negative class
 sea bass is positive class
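These measures can be computed directly. In the sketch below the last two true labels are assumptions (the slides leave them as "?"), chosen so that the accuracy matches the 3 out of 4 above.

```python
# Accuracy, false positives and false negatives for the first example's
# test set. Salmon is the negative class, sea bass the positive class.
# The last two "actual" labels are assumed; the slides show them as "?".
actual    = ["salmon", "salmon", "sea bass", "salmon"]
predicted = ["sea bass", "salmon", "sea bass", "salmon"]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
false_pos = sum(a == "salmon" and p == "sea bass"     # negative → positive
                for a, p in zip(actual, predicted))
false_neg = sum(a == "sea bass" and p == "salmon"     # positive → negative
                for a, p in zip(actual, predicted))
print(accuracy, false_pos, false_neg)  # 0.75 1 0
```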
Classification

[Figure: decision boundary between the two classes, showing false positives
(negative-class fish on the positive side) and false negatives (positive-class
fish on the negative side)]
Classification

 Cross validation (3 fold): the data is split into three partitions, and
each partition is used for testing exactly once while the other two are
used for training:

Fold 1    Fold 2    Fold 3
Training  Training  Testing
Training  Testing   Training
Testing   Training  Training
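A minimal 3-fold cross-validation loop might look like the sketch below; the dataset and the trivial majority-class learner are made up for illustration.

```python
# 3-fold cross validation: each partition serves as the test set once
# while the other two are used for training.

data = [(5, "salmon"), (8, "salmon"), (10, "salmon"),
        (12, "salmon"), (15, "sea bass"), (18, "salmon")]

k = 3
folds = [data[i::k] for i in range(k)]   # partition the data into k folds

scores = []
for i in range(k):
    test = folds[i]
    train = [x for j in range(k) if j != i for x in folds[j]]
    # Trivial learner: always predict the training set's majority class.
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    correct = sum(label == majority for _, label in test)
    scores.append(correct / len(test))

print(sum(scores) / k)   # average accuracy over the 3 folds
```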


CLUSTERING
Clustering

• Given a set of data points, each having a set of


attributes, and a similarity measure among them,
find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one
another.
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
Illustrating Clustering

• Euclidean Distance Based Clustering in 3-D space:
– Intracluster distances are minimized
– Intercluster distances are maximized
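One concrete way to cluster with Euclidean distance is k-means. The sketch below uses made-up 2-D points (the slide's picture is 3-D, but the algorithm is identical) and a deterministic initialization.

```python
import math

# A bare-bones k-means sketch using Euclidean distance.

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=10):
    centers = points[:k]                      # deterministic init
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = kmeans(points, k=2)
print(clusters)
```

After a few iterations the two tight groups separate: intracluster distances shrink while intercluster distances stay large, exactly the criterion stated above.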
Clustering: Application 1

• Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their geographical
and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
Cluster Analysis
– Class label is unknown: group data to form new classes
– Clusters of objects are formed based on the principle of
maximizing intra-class similarity & minimizing interclass similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing.

Clustering: Application 2

• Document Clustering:
– Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate
a new document or search term to clustered documents.
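The approach above can be sketched with term frequencies and cosine similarity; the tiny documents below are made up.

```python
import math
from collections import Counter

# Count term frequencies per document, then compare documents with
# cosine similarity over those frequencies.

docs = ["data mining finds patterns in data",
        "mining data reveals hidden patterns",
        "the cat sat on the mat"]

freqs = [Counter(d.split()) for d in docs]

def cosine(f, g):
    dot = sum(f[t] * g[t] for t in f)          # Counter returns 0 for missing terms
    norm = lambda h: math.sqrt(sum(c * c for c in h.values()))
    return dot / (norm(f) * norm(g))

print(cosine(freqs[0], freqs[1]))  # ≈ 0.63: many shared terms
print(cosine(freqs[0], freqs[2]))  # 0.0: no shared terms
```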
Fraud Detection

• Outlier Analysis
– Data that do not comply with the general behavior or model.
– Outliers are usually discarded as noise or exceptions.
– Useful for fraud detection.
• E.g. Detect purchases of extremely large amounts
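A minimal outlier-analysis sketch for the purchase example: flag amounts far from the mean. The amounts and the two-standard-deviation cutoff are illustrative choices, not a prescription from the slides.

```python
import statistics

# Flag purchase amounts more than 2 standard deviations from the mean.
amounts = [25, 30, 28, 27, 31, 26, 29, 30, 5000]   # one extreme purchase

mean = statistics.mean(amounts)
sd = statistics.stdev(amounts)

outliers = [a for a in amounts if abs(a - mean) > 2 * sd]
print(outliers)  # [5000]
```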

ASSOCIATION RULE MINING
Association Rule Discovery

• Given a set of records, each of which contains some number of items
from a given collection:
– Produce dependency rules which will predict the occurrence of an item
based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
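Support and confidence for these rules can be computed directly from the transaction table above:

```python
# Support and confidence over the five transactions shown above.

transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # ≈ 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # ≈ 0.67
```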
REGRESSION
Regression
• Predict future values based on past values
• Linear Regression assumes linear relationship
exists.
y = c0 + c1 x1 + … + cn xn
• Find coefficient values that best fit the data
• Some real-world examples for regression analysis include:

(i) Predicting the price of a house given house features


(ii) Predicting the impact of SAT/GRE scores on college admissions
(iii) Predicting the sales based on input parameters
(iv) Predicting the weather, etc.
Linear Regression
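For the one-variable case y = c0 + c1·x, the least-squares coefficients have a closed form; the (x, y) data points below are made up (roughly y = 2x).

```python
# Closed-form least-squares fit for y = c0 + c1*x.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope c1 = cov(x, y) / var(x); intercept c0 = mean_y - c1 * mean_x.
c1 = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) \
     / (sum(x * x for x in xs) - n * mean_x ** 2)
c0 = mean_y - c1 * mean_x

print(c0, c1)   # predict future values with c0 + c1 * x
```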
Major Issues in Data Mining

• Mining methodology and User interaction


– Mining different kinds of knowledge
• DM should cover a wide spectrum of data analysis and knowledge
discovery tasks
• Enable the database to be used in different ways
• Require the development of numerous data mining techniques
– Interactive mining of knowledge at multiple levels of
abstraction
• Difficult to know exactly what will be discovered
• Allow users to focus the search, refine data mining requests
– Incorporation of background knowledge
• Guide the discovery process
• Allow discovered patterns to be expressed in concise terms and different
levels of abstraction

Major Issues in Data Mining
– Data mining query languages and ad hoc data mining
• High-level query languages need to be developed
• Should be integrated with a DB/DW query language
– Presentation and visualization of results
• Knowledge should be easily understood and directly usable
• High level languages, visual representations or other expressive forms
• Require the DM system to adopt the above techniques
– Handling noisy or incomplete data
• Require data cleaning methods and data analysis methods that can handle noise
– Pattern evaluation – the interestingness problem
• How to develop techniques to assess the interestingness of discovered patterns,
especially with subjective measures based on user beliefs or expectations

Major Issues in Data Mining
• Performance Issues
– Efficiency and scalability
• Huge amount of data
• Running time must be predictable and acceptable
– Parallel, distributed and incremental mining algorithms
• Divide the data into partitions and process them in parallel
• Incorporate database updates without having to mine the entire data again
from scratch

• Diversity of Database Types


– Other databases that contain complex data objects, multimedia data,
spatial data, etc.
– Expect to have different DM systems for different kinds of data
– Heterogeneous databases and global information systems
• Web mining becomes a very challenging and fast-evolving field in data
mining
