L1 CH 1 Introd
Data Mining
• Chapter 1 Introduction
• Chapter 2 Know your Data
• Chapter 3 Data Preprocessing
• Chapter 4 Data Warehousing
• Chapter 6 Mining Frequent Patterns;
Association and Correlation: Basic Concepts
• Chapter 7 Advanced Frequent Patterns
Book 3rd Edition and Course Content
• Data – as in databases
• Information or knowledge is
meta-information ABOUT the
patterns hidden in the data
– The patterns must be discovered
automatically
Why Data Mining?
• Data Mining:
Extraction of interesting knowledge (rules,
regularities, patterns, constraints) from data in large
databases
What is Data Mining?
DM is a process to extract
previously unknown knowledge
from large volumes of data
• classification (chapter 6)
• association (chapter 5)
• prediction (chapter 6)
• clustering (chapter 7)
Data Mining
• DM often presents the knowledge as a set
of rules of the form
IF.... THEN ...
In this case it is called descriptive DM
• DM detects deviations
DM: Some Historical Applications
• More Applications
• Text mining
• News groups, emails, documents
• Web analysis
• Intelligent query answering
• Scientific Applications
DM: Business Advantages
• Applications
– widely used in health care, retail, credit
card services, telecommunications
(phone card fraud), etc.
• Approach
– use historical data to build models of
fraudulent behavior and use data mining
to help identify similar instances
Fraud Detection and Management (B2)
• Examples
– auto insurance: detect characteristics of
groups of people who stage accidents to collect
on insurance
– money laundering: detect characteristics of
suspicious money transactions (US Treasury's
Financial Crimes Enforcement Network)
– medical insurance: detect characteristics of
fraudulent patients and doctors
Fraud Detection and Management (B3)
• Detecting inappropriate medical treatment
– Australian Health Insurance Commission detected
that in many cases blanket screening tests were
requested (save Australian $1m/yr)
• Detecting telephone fraud
– DM builds telephone call model: destination of the
call, duration, time of day or week.
– Detects patterns that deviate from an expected norm.
– British Telecom identified discrete groups of callers
with frequent intra-group calls, especially mobile
phones, and broke a multimillion dollar fraud
Fraud Detection and Management (B4)
• Retail
– Analysts used Data Mining techniques
to estimate that 38% of retail shrink is
due to dishonest employees
– and more….
Data Mining vs Data Marketing
• Data Marketing:
• Applications of Data Mining methods in
which the goal is to find buying patterns
in transactional databases
• Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
Market Analysis and Management (MA3)
• Customer profiling
– DM can tell you what types of customers buy
what products (clustering or classification)
• Competition:
– monitor competitors and market
directions
– group customers into classes and apply a
class-based pricing procedure
– set pricing strategy in a highly
competitive market
Business Summary
• Data Mining helps to improve the competitive
advantage of organizations in a dynamically
changing environment
• It improves client retention and
conversion
• Different Data Mining methods are
required for different kinds of data and
different kinds of goals
Scientific Applications
• Networks failure detection
• Controllers
• Geographic Information Systems
• Genome- Bioinformatics
• Intelligent robots
• Intelligent rooms
• etc… etc ….
What is NOT Data Mining
• Once patterns are found Data Mining
process is finished
• 1960s:
– Data collection, database creation, IMS and
network DBMS
• 1970s:
– Relational data model, relational DBMS
implementation
Evolution of Database Technology
• 1980s:
– RDBMS, advanced data models (extended-
relational, OO, deductive, etc.) and
application-oriented DBMS (spatial, scientific,
engineering, etc.)
• 1990s–2000s:
– Data mining and data warehousing,
multimedia databases, and Web databases
• 2000s:
– Big Data
Short History of Data Mining
• 1989: the term KDD (Knowledge Discovery
in Databases) appears (IJCAI
Workshop)
• 2000–present:
– the term Data Mining becomes established
and evolves into Big Data
Data Mining: Confluence of
Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
KDD process Definition
[Piatetsky-Shapiro 97]
[Figure: KDD process, from Data through SELECTION to Target data, then Processed Data, then DATA MINING (proper), ending in Knowledge]
DM: Data Mining
• Remember:
• Preprocessing operations must be applied first
to clean and prepare the data in order to obtain
significant patterns
KDD vs DM
• KDD was a term used by academia
• DM was often used as a commercial
term
• DM term is now being used in academia,
as it has become a “brand name” for both
KDD process and its DM sub-process
• The important point is to see DM as a
process with Data Mining Proper as part
of it
• BIG DATA – a new, widely used term
Steps of the DM process
• Preprocessing: includes all the
operations that have to be performed before
a data mining algorithm is applied
• Data Mining (proper): knowledge
discovery algorithms are applied in order to
obtain the patterns
• Interpretation: discovered patterns are
presented in a proper format and the user
decides whether it is necessary to re-run the
algorithms
Architecture of a Typical Data Mining System
(book slide)
[Figure: system architecture; surviving labels: Databases, Data Warehouse, Pattern evaluation]
What Kind of Data?
• Relational Databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Descriptive Data Mining:
Concept Description
• For example:
• climate=wet is a description of the concept WET
CLIMATE, and
• WET CLIMATE = {records: climate=wet}
• We use words: decision attribute, class
attribute, concept attribute
• We talk about decision or class description
• REMEMBER: all definitions are relative to
the database we deal with.
Descriptive DM
Decision, Concept, Class Characteristics
• A=0 & B=1 → C=1 (support 33%, confidence 83%; confidence is the conditional
probability of the concept given the characteristics)
• A=2 & B=0 → C=1 (support 27%, confidence 80%)
• A=1 & B=1 → C=1 (support 12%, confidence 76%)
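As a rough illustration of how the two percentages attached to each rule could be computed, here is a minimal Python sketch. It assumes the common definitions: support is the fraction of all records that satisfy both sides of the rule, and confidence is the fraction of records satisfying the left side that also satisfy the right side. The function name `support_confidence` and the toy table are hypothetical and are not the data behind the percentages above.

```python
def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    records: list of dicts mapping attribute name to value.
    antecedent, consequent: dicts of attribute=value conditions.
    """
    def matches(record, conditions):
        return all(record.get(k) == v for k, v in conditions.items())

    n_antecedent = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(1 for r in records
                 if matches(r, antecedent) and matches(r, consequent))
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy table (assumption for illustration only).
records = [
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 0},
    {"A": 2, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
]
support, confidence = support_confidence(records, {"A": 0, "B": 1}, {"C": 1})
```

On this toy table the rule A=0 & B=1 → C=1 holds in 2 of 5 records and in 2 of the 3 records that match its left side.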
Classification - Supervised Learning
– Classification
– Finding models (rules) that describe
(characterize) and/or distinguish
(discriminate) classes or concepts for future
prediction
– Example: classify countries based on climate
(characteristics)
– classify cars based on gas mileage and use it
to predict classification of a new car
Classification Algorithms
Models, Basic Classifiers
– Presentation of results:
– characteristic and/or discriminant rules
– in the case of descriptive DM
• Outlier analysis
– Outlier: a data object that does not comply
with the general behavior of the data
– It can be considered as noise or exception
but is quite useful in fraud detection, rare
events analysis and others
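One simple way to flag objects that "do not comply with the general behavior of the data" is a z-score test on a single numeric attribute. The sketch below is only one of many possible outlier tests; the function name `find_outliers`, the threshold of 2.0, and the sample call durations are assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes; 95 is far from the rest.
call_minutes = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(call_minutes, threshold=2.0)
```

Whether a flagged value is noise to discard or an exception worth investigating (as in fraud detection) is a decision left to the analyst.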
Statistical DM
• Consistency
Other preprocessing tasks
• Generalization vs specification
• Discretization
• Sampling
• Reducing number of attributes at the
preprocessing stage
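Of the tasks listed above, discretization is easy to sketch in a few lines. The example below uses equal-width binning, which is just one possible discretization scheme; the function name `equal_width_bins` and the sample ages are assumptions for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # a single distinct value: one bin
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric attribute to be turned into 3 categories.
ages = [18, 22, 25, 31, 40, 47, 58, 63]
bins = equal_width_bins(ages, 3)
```

Here the range 18..63 is split into three intervals of width 15, so the first four ages fall into bin 0, the next two into bin 1, and the last two into bin 2.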
Summary
• The preprocessing is required and is an
essential part of the DM process
[Figure: concept X with its lower approximation and boundary region]
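\[
c(X, Y) =
\begin{cases}
0 & \text{if } \operatorname{card}(X) = 0 \\
1 - \dfrac{\operatorname{card}(X \cap Y)}{\operatorname{card}(X)} & \text{if } \operatorname{card}(X) > 0
\end{cases}
\]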
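The measure c(X, Y) is 0 when X is empty and otherwise equals 1 minus the fraction of X that also lies in Y. A minimal Python sketch, with the hypothetical function name `inclusion_error` and made-up sets:

```python
def inclusion_error(x, y):
    """c(X, Y): 0 if X is empty, else 1 - card(X ∩ Y) / card(X)."""
    x, y = set(x), set(y)
    if not x:
        return 0.0
    return 1.0 - len(x & y) / len(x)


# Hypothetical sets of record identifiers.
half_inside = inclusion_error({1, 2, 3, 4}, {3, 4, 5})   # 2 of 4 elements of X are in Y
fully_inside = inclusion_error({1, 2}, {1, 2, 3})        # X is a subset of Y
empty_case = inclusion_error(set(), {1, 2})              # X is empty
```

A value of 0 means X is entirely contained in Y (or X is empty), and a value of 1 means X and Y share no elements.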
Rough Sets in SQL
Begin UPPER
  setdb(dbName);
  exec(conn, "BEGIN");
  while not_end_records() do
    equ_class = exec("FETCH 1 IN cursor");
    first_decision_value = get_value(equ_class("D"));
    insert(equ_class, upper[first_decision_value]);
    while (equ_class == exec("FETCH 1 IN cursor")) do
      decision_value = get_value(equ_class("D"));
      insert(equ_class, upper[decision_value]);
    end while
  end while
End UPPER
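The cursor-based pseudocode above appears to collect, for each decision value, every equivalence class in which that value occurs. The following in-memory Python sketch shows the same idea without a database, and additionally computes lower approximations. It is an illustration under assumptions, not the course's implementation: the function name `approximations`, the attribute names, and the toy records are all hypothetical.

```python
from collections import defaultdict


def approximations(records, condition_attrs, decision_attr="D"):
    """Return (lower, upper): decision value -> set of record indices."""
    # Equivalence classes: records indistinguishable on the condition attributes.
    classes = defaultdict(list)
    for i, rec in enumerate(records):
        key = tuple(rec[a] for a in condition_attrs)
        classes[key].append(i)

    lower, upper = defaultdict(set), defaultdict(set)
    for members in classes.values():
        decisions = {records[i][decision_attr] for i in members}
        for d in decisions:
            upper[d].update(members)  # the class intersects concept d
        if len(decisions) == 1:
            # The class lies entirely inside one concept.
            lower[next(iter(decisions))].update(members)
    return dict(lower), dict(upper)


# Hypothetical decision table: records 0 and 1 are indistinguishable
# on (a, b) but carry different decisions.
records = [
    {"a": 1, "b": 0, "D": "yes"},
    {"a": 1, "b": 0, "D": "no"},
    {"a": 0, "b": 1, "D": "yes"},
    {"a": 0, "b": 0, "D": "no"},
]
lower, upper = approximations(records, ["a", "b"])
```

In this toy table the conflicting records 0 and 1 end up in the upper approximation of both concepts and in the lower approximation of neither, which is what places them in the boundary region.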
Statistical Methods
• Transactional data
• There is no need to specify the right and
left sides of the rules
• There are algorithms to tackle any kind
of data
• Minimum support
• Maximum number of rules to be
obtained
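To show how the two parameters above, minimum support and maximum number of rules, could act on transactional data, here is a small Python sketch that counts co-occurring item pairs. It is a simplification rather than a full association-rule algorithm: it only finds frequent pairs, from which rules in either direction could later be derived. The function name `frequent_pairs` and the baskets are hypothetical.

```python
from itertools import combinations


def frequent_pairs(transactions, min_support, max_rules):
    """Return up to max_rules item pairs whose support is >= min_support."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        # Count each unordered pair once per transaction.
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    frequent = [(pair, c / n) for pair, c in counts.items() if c / n >= min_support]
    # Highest support first; ties broken alphabetically for a stable order.
    frequent.sort(key=lambda pc: (-pc[1], pc[0]))
    return frequent[:max_rules]


# Hypothetical market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
top = frequent_pairs(baskets, min_support=0.5, max_rules=2)
```

All three pairs in this example occur in half of the baskets, so the cap of two keeps only the first two in alphabetical order.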
Clustering: requirements
• Set of attributes
• Maximum number of clusters
• Number of iterations
• Minimum number of elements in any
cluster
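The parameters listed above map naturally onto a basic k-means procedure, used here only as one example of a clustering algorithm. In this sketch the points are assumed to be already projected onto the chosen set of attributes, `k` plays the role of the maximum number of clusters, and clusters smaller than `min_size` are dropped at the end. The function name, the naive initialization from the first `k` points (so `k` must not exceed the number of points), and the sample data are all assumptions.

```python
def kmeans(points, k, n_iterations, min_size=1):
    """Plain k-means on tuples of numbers; returns clusters with >= min_size points."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = []
    for _ in range(n_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign p to the nearest centroid (squared Euclidean distance).
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if empty.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return [cl for cl in clusters if len(cl) >= min_size]


# Hypothetical two-attribute records forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
clusters = kmeans(points, k=2, n_iterations=5, min_size=2)
```

With this data the assignment stabilizes after the first iteration, giving one cluster around (1, 1) and one around (9, 9).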