Csb4318 DWDM Unit II PPT Word
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Knowledge Discovery (KDD) Process
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing the functions of data mining
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining: On What Kinds of Data?
The new database applications include spatial data such as maps; engineering data such as the design of buildings, system components, or integrated circuits; hypertext and multimedia data including text, image, video, and audio data; time-related data such as historical records or stock exchange data; stream data such as video surveillance and sensor data; and the World Wide Web, a huge, widely distributed information repository made available by the Internet.
These applications require efficient data structures and scalable methods for handling complex object structures; variable-length records; semi-structured or unstructured data; text, spatiotemporal, and multimedia data; and database schemas with complex structures and dynamic changes.
Relational Queries:
Show me a list of all items that were sold in the last quarter.
Data warehouse data
Data warehouse
Data mining
Can analyze transactional, spatial, textual, and multimedia
data that are difficult to model with current
multidimensional technology.
Transaction Database
Consists of a file where each record represents a
transaction.
Other tables may record additional information, such as the sale date, customer ID number, salesperson ID number, and the branch at which the sale occurred.
Spatial and Spatiotemporal Databases
Geographic databases have numerous applications, ranging from forestry and ecology planning to providing public service information regarding the location of telephone and electric cables, pipes, and sewage systems.
Discover patterns describing the characteristics of houses located near a specified kind of location, such as a park.
Describe the climate of mountainous areas located at various altitudes
Describe the change in trend of metropolitan poverty rates based on
city distances from major highways.
Spatial Temporal Database
Discover general and concise descriptions of the text documents, content associations, as well as the clustering behavior of the text documents.
the World Wide Web, and speech-based user interfaces that recognize spoken commands.
School Example: Data mining provides an interesting solution to the information exchange problem by performing statistical data distribution and correlation analysis and transforming the given data to a higher, more generalized conceptual level (such as fair, good, excellent), from which information exchange can then more easily be performed.
Data streams
Many applications involve the generation and analysis of a new kind of data, where data flow in and out of an observation platform dynamically.
Data streams
The data model for stream data is the continuous query model, which consists of predefined queries that constantly evaluate incoming streams, collect aggregate data, report the current status of the data streams, and respond to their changes.
World Wide Web
The WWW and its associated information services, such as Yahoo! and Google, provide rich, worldwide, online information services, where data objects are linked together to facilitate interactive access.
World Wide Web
Capturing user access patterns in such distributed information environments is called Web usage mining.
Object-Relational Databases
Constructed based on the object-relational data model.
This model extends the relational model by providing rich data types for handling complex objects and object orientation.
Each object has associated with it a set of variables that describe the object, a set of messages that the object can use to communicate with other objects or with the rest of the database system, and a set of methods, where each method holds the code to implement a message.
Temporal, Sequence, and Time-Series Databases
A temporal database stores relational data that include time-related attributes. These attributes may involve several timestamps, each having different semantics.
Stock exchange data can be mined to discover trends that could help you plan investment strategies (e.g., when is the best time to purchase AllElectronics stock?). This analysis requires defining multiple granularities of time.
Temporal, Sequence, and Time-Series Databases
Data mining functionalities are classified as descriptive mining and predictive mining.
Data characterization is a summarization of the general
characteristics or features of a target class of data.
Data Mining Functionalities
Frequent patterns, association, correlation vs. causality
A frequent item set typically refers to a set of items that
frequently appear together in a transactional data set, such as milk
and bread.
age(X, “20..29”) ∧ income(X, “20K..29K”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]
The derived model is based on the analysis of a set of training data (i.e.,
data objects whose class label is known).
E.g., a loan applicant is accepted if he or she has a good credit risk, or rejected if he or she has a bad credit risk, based on income, employment status, savings, age, and home ownership.
Predict some unknown or missing numerical values
The mining process constructs a model to identify the person as belonging to a class.
classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
Outlier: Data object that does not comply with the general
behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
E.g., detecting purchases of unusually large amounts compared to regular purchases.
Trend and evolution analysis
Describes and models regularities or trends for objects whose behavior changes over time (several years of time-series data help to predict the trend).
Trend and deviation: e.g., regression analysis
Sequential pattern mining, periodicity analysis, similarity-based analysis
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
First generate all the patterns and then filter out the uninteresting ones
Generate only the interesting patterns—mining query optimization
Foundation for design of graphical user interface
Standardization of data mining industry and practice
low_profit_margin(X) <= price(X, P1) and cost(X, P2) and ((P1 - P2) < $50)
Primitive 4: Pattern Interestingness Measure
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification
reliability or accuracy, certainty factor, rule strength, rule
quality, discriminating weight, etc.
Utility: potential usefulness, e.g., support (association), noise threshold (description)
Novelty
not previously known, surprising (used to remove redundant
rules, e.g., Illinois vs. Champaign rule implication support ratio)
Primitive 5: Presentation of Discovered Patterns
DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
Architecture: Typical Data Mining System
Pattern evaluation
Data mining engine
Knowledge base
Database or data warehouse server
Data Preprocessing
Data cleaning
Data integration and transformation
Data reduction
Summary
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the
same or similar analytical results
Data discretization: part of data reduction, of particular
importance for numerical data
Data Cleaning
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics
“Data cleaning is the number one problem in data warehousing” (DCI survey)
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Data cleaning tasks:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Ignore the tuple: usually done when class label is
missing (when doing classification)—not effective
when the % of missing values per attribute varies
considerably
Fill in the missing value manually: tedious +
infeasible?
Fill it in automatically with
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class:
smarter
the most probable value: inference-based such as Bayesian
formula or decision tree
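A minimal pandas sketch of the automatic filling strategies above (the table, column names, and values are hypothetical; the inference-based option would require a trained model and is not shown):

import pandas as pd

# Hypothetical table with a class column and missing income values
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# 1. Fill with a global constant (or a label such as "unknown" for categorical data)
df["income_const"] = df["income"].fillna(-1)

# 2. Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Fill with the attribute mean of samples belonging to the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean"))

print(df)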
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming conventions
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
Regression
[Figure: noisy data smoothed by fitting a regression line, e.g., y = x + 1]
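A minimal NumPy sketch of this kind of smoothing: a straight line close to y = x + 1 is fitted to made-up noisy points, and each value is replaced by the fitted value.

import numpy as np

# Hypothetical noisy observations of y as a function of x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.8, 4.2, 4.9, 6.1])   # roughly y = x + 1 plus noise

# Fit the line y = w*x + b by least squares
w, b = np.polyfit(x, y, deg=1)

# Smoothing: replace each noisy value by the value predicted by the line
y_smoothed = w * x + b
print(f"fitted line: y = {w:.2f}x + {b:.2f}")
print(y_smoothed)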
Cluster Analysis
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
Data auditing: by analyzing data to discover rules and relationships to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter’s Wheel)
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration
Data Transformation: Normalization
Min-max normalization to [new_minA, new_maxA]:
v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]; then an income of $73,600 maps to (73,600 - 12,000) / (98,000 - 12,000) = 0.716
Z-score normalization (μA: mean, σA: standard deviation of attribute A):
v' = (v - μA) / σA
Ex. With μA = 54,000 and σA = 16,000, an income of $73,600 maps to (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
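A short NumPy sketch of the three normalization methods (the income values are illustrative; note that np.std here is the population standard deviation):

import numpy as np

income = np.array([12000, 30000, 54000, 73600, 98000], dtype=float)  # hypothetical values

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min

# Z-score normalization
zscore = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j for the smallest j such that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j

print(minmax, zscore, decimal_scaled, sep="\n")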
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the
complete data set
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produce the same (or almost the
same) analytical results
Data reduction strategies
Dimensionality reduction — e.g., remove unimportant attributes
Numerosity reduction (some simply call it: Data Reduction)
Data cube aggregation
Data compression
Regression
Discretization (and concept hierarchy generation)
Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal component analysis
Singular value decomposition
Supervised and nonlinear techniques (e.g., feature selection)
[Figure: principal components of data plotted in the (x1, x2) plane]
Principal Component Analysis (Steps)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
Normalize input data: each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data vector is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing “significance” or
strength
Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
Works for numeric data only
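A minimal NumPy sketch of these steps on randomly generated data: center the attributes, take the eigenvectors of the covariance matrix as principal components, sort them by variance, and keep only the k strongest.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # hypothetical data: 200 records, 5 numeric attributes
X[:, 1] = 0.8 * X[:, 0] + 0.1 * X[:, 1]  # add correlation so a few components dominate

# 1. Normalize input data (here: center each attribute)
Xc = X - X.mean(axis=0)

# 2. Principal components = eigenvectors of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing "significance" (variance explained)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep only the k strongest components for the reduced representation
k = 2
X_reduced = Xc @ eigvecs[:, :k]          # 200 x 2 instead of 200 x 5
print(eigvals)
print(X_reduced.shape)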
Redundant features
duplicate much or all of the information contained in one or more other attributes
E.g., purchase price of a product and the amount of sales tax paid
Irrelevant features
contain no information that is useful for the data mining task at
hand
E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Parametric Data Reduction: Regression and Log-Linear Models
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
Using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
The multi-way table of joint probabilities is approximated by a
product of lower-order tables
Probability: p(a, b, c, d) = αab βac χad δbcd
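A short NumPy sketch of estimating these regression coefficients by least squares (the data is randomly generated; the log-linear model is not shown):

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: Y depends on two predictors X1 and X2, plus noise
X1 = rng.uniform(0, 10, size=100)
X2 = rng.uniform(0, 5, size=100)
Y = 2.0 + 0.5 * X1 - 1.5 * X2 + rng.normal(scale=0.3, size=100)

# Simple linear regression Y = w*X1 + b (least squares)
w, b = np.polyfit(X1, Y, deg=1)

# Multiple regression Y = b0 + b1*X1 + b2*X2 via least squares on the design matrix
A = np.column_stack([np.ones_like(X1), X1, X2])
b0, b1, b2 = np.linalg.lstsq(A, Y, rcond=None)[0]

print(f"simple:   Y = {w:.2f}*X1 + {b:.2f}")
print(f"multiple: Y = {b0:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")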
Data Reduction: Wavelet Transformation (Haar-2, Daubechies-4)
DWT for Image Compression
Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to solve the task
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
[Figure: original data vs. its lossy compressed approximation]
Data Reduction: Histograms
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
MaxDiff: set bucket boundaries between the pairs of adjacent values having the β-1 largest differences
[Figure: equal-width histogram over values from 10,000 to 90,000]
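A small NumPy sketch of equal-width and equal-frequency bucketing over hypothetical prices (V-optimal and MaxDiff are not shown):

import numpy as np

prices = np.random.default_rng(2).integers(10_000, 90_000, size=1_000)  # hypothetical prices

# Equal-width: buckets of equal range
width_counts, width_edges = np.histogram(prices, bins=8)

# Equal-frequency (equal-depth): bucket edges at quantiles, so each bucket
# holds roughly the same number of values
depth_edges = np.quantile(prices, np.linspace(0, 1, 9))
depth_counts, _ = np.histogram(prices, bins=depth_edges)

print(width_edges, width_counts)
print(depth_edges, depth_counts)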
Data Reduction Method: Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in
multidimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 7
Data Reduction Method: Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor performance in the
presence of skew
Develop adaptive sampling methods, e.g., stratified sampling:
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
Used in conjunction with skewed data
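A minimal pandas sketch of SRSWOR, SRSWR, and stratified sampling over a hypothetical skewed data set (column names and class proportions are made up):

import pandas as pd

df = pd.DataFrame({
    "cls":   ["young"] * 70 + ["middle"] * 25 + ["senior"] * 5,  # skewed class distribution
    "value": range(100),
})

# Simple random sampling without replacement (SRSWOR)
srswor = df.sample(n=10, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR)
srswr = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw ~20% from every class, so the rare class is still represented
stratified = df.groupby("cls", group_keys=False).apply(
    lambda g: g.sample(frac=0.2, random_state=0))

print(stratified["cls"].value_counts())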
Sampling: With or without Replacement
[Figure: SRSWOR and SRSWR samples drawn from the raw data]
Sampling: Cluster or Stratified Sampling
Data Reduction: Discretization
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic rank
Continuous — real numbers, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept
Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing
the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level concepts
(such as numeric values for age) by higher level concepts (such as young,
middle-aged, or senior)
Discretization and Concept Hierarchy
Generation for Numeric Data
Typical methods: All the methods can be applied recursively
Binning (covered above)
Discretization Without Using Class Labels
The process is recursively applied to the partitions obtained until some stopping criterion is met
Such a boundary may reduce data size and improve classification accuracy
[Figure: the same data partitioned by equal-width binning, equal-frequency binning, and K-means]
Interval Merge by χ2 Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: Find the best neighboring intervals and merge them to form
larger intervals recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
Initially, each distinct value of a numerical attr. A is considered to be one
interval
χ2 tests are performed for every pair of adjacent intervals
Adjacent intervals with the least χ2 values are merged together, since low χ2 values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping
criterion is met (such as significance level, max-interval, max inconsistency,
etc.)
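A compact, simplified sketch of this merge loop (the attribute values, class labels, and stopping criterion are hypothetical; the χ2 statistic is computed directly from the contingency table of each pair of adjacent intervals):

import numpy as np

# Hypothetical attribute values and their class labels
values = np.array([1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59])
labels = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "a", "a", "a"])
classes = sorted(set(labels))

# Initially, each distinct value is its own interval
intervals = [[v] for v in sorted(set(values))]

def class_counts(interval):
    mask = np.isin(values, interval)
    return np.array([np.sum(labels[mask] == c) for c in classes], dtype=float)

def chi2(left, right):
    obs = np.vstack([class_counts(left), class_counts(right)])
    expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    nz = expected > 0
    return float(((obs[nz] - expected[nz]) ** 2 / expected[nz]).sum())

max_intervals = 6  # stopping criterion (max-interval)
while len(intervals) > max_intervals:
    scores = [chi2(a, b) for a, b in zip(intervals, intervals[1:])]
    i = int(np.argmin(scores))                              # pair with the lowest chi-square
    intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]  # merge: similar class distributions

print(intervals)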
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
(-$1,000 to $2,000)
Step 3:
(-$400 to $5,000)
Step 4:
Specification of a hierarchy for a set of values by explicit
data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: the largest number of distinct values (lowest level of the hierarchy)
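A small pandas sketch of this heuristic (the address table is made up, so the distinct-value counts differ from the slide): the attribute with the fewest distinct values is placed at the top of the hierarchy.

import pandas as pd

# Hypothetical address table
df = pd.DataFrame({
    "street":  ["Main St", "Oak Ave", "Pine Rd", "Elm St", "Lake Dr"],
    "city":    ["Urbana", "Chicago", "Urbana", "Vancouver", "Toronto"],
    "state":   ["IL", "IL", "IL", "BC", "ON"],
    "country": ["USA", "USA", "USA", "Canada", "Canada"],
})

# Count distinct values per attribute and order from most (bottom level) to fewest (top level)
counts = df.nunique().sort_values()
hierarchy = " < ".join(counts.index[::-1])   # e.g., street < city < state < country
print(counts.to_dict())
print(hierarchy)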
Mining Frequent Patterns, Association and Correlations
[Figure: customers who buy beer, customers who buy diapers, and customers who buy both]
Let supmin = 50%, confmin = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A => D (60%, 100%)
D => A (60%, 75%)
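A minimal sketch of how these support and confidence values are computed; the five transactions below are hypothetical but consistent with the counts quoted above (A:3, B:3, D:4, E:3, AD:3).

# Hypothetical transaction database
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))       # 0.6  -> {A, D} is frequent at supmin = 50%
print(confidence({"A"}, {"D"}))  # 1.0  -> A => D has 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D => A has 75% confidence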
Closed Patterns
and MaxPatterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27*10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
Generate length (k+1) candidate itemsets from length k frequent
itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be generated
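A self-contained Python sketch of this level-wise loop (the join and prune details follow on the next slides; the four-transaction database below matches the worked example, with min_support given as a fraction):

from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: L_k -> C_(k+1) -> L_(k+1) until no frequent set remains."""
    n = len(transactions)

    def frequent(candidates):
        return {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support * n}

    items = {frozenset([i]) for t in transactions for i in t}
    L, all_frequent = frequent(items), set()
    while L:
        all_frequent |= L
        # join step: build (k+1)-candidates from pairs of frequent k-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == len(a) + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, len(c) - 1))}
        L = frequent(candidates)   # scan the DB to count the surviving candidates
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_support=0.5))   # supmin = 2 out of 4 transactions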
The Apriori Algorithm—An Example
supmin = 2
Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E
[Figure: the 1st scan of TDB yields C1 and L1; C2 is generated from L1, counted in a 2nd scan to give L2, and so on]
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
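A small sketch of the join and prune steps that reproduces this example; itemsets are kept as sorted tuples so the self-join can match on the first k-1 items.

from itertools import combinations

def apriori_gen(Lk):
    """Self-join L_k with itself on the first k-1 items, then prune."""
    Lk = sorted(tuple(sorted(s)) for s in Lk)
    k = len(Lk[0])
    candidates = set()
    for a in Lk:
        for b in Lk:
            if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:      # join step
                c = a + (b[k - 1],)
                # prune step: drop c if any k-subset of c is not in L_k
                if all(s in Lk for s in combinations(c, k)):
                    candidates.add(c)
    return candidates

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')} -- acde is pruned because ade is not in L3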
How to Generate Candidates?
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
Example: Counting Supports of
Candidates
Subset function
Transaction: 1 2 3 5 6
[Figure: candidate 3-itemsets stored in a hash tree with hash buckets 1,4,7 / 2,5,8 / 3,6,9; the subset function walks the transaction down the tree to find the candidates it contains]
Efficient Implementation of Apriori in SQL
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Scan 2: consolidate global frequent patterns
ab is not a candidate 2-itemset if the sum of count of {ab, ad,
ae} is below support threshold
[Figure: itemset lattice, from {} through the 1-itemsets A, B, C, D to the 2-itemsets, evaluated against the transactions]