DM-Unit-I Introduction To Association-1
Introduction
Course Outline:
• Introduction: KDD Process
• Data Preprocessing
• Association Rule Mining
• Classification
• Clustering and Anomaly Detection
• Regression
• Case Studies
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amounts of data.
– Data mining is the process of extracting knowledge or insights from large amounts of
data using various statistical and computational techniques. The data can be structured,
semi-structured or unstructured, and can be stored in various forms such as databases,
data warehouses, and data lakes.
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging, information harvesting, business intelligence,
etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
Data Mining: Confluence of Multiple Disciplines
Why Not Traditional Data Analysis?
• Tremendous amount of data
– Algorithms must be highly scalable to handle terabytes of data
• High-dimensionality of data
– Microarray data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
Data Mining: On What Kinds of Data?
• Data warehouse
• Transactional Data
KDD (Knowledge Discovery from Data)
Steps in KDD
Architecture of DM System
Data Mining Functionalities
• Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
• In general, such tasks can be classified into two categories: descriptive and predictive.
• Descriptive mining tasks characterize properties of the data in a target data set.
• Predictive mining tasks perform induction on the current data in order to make predictions.
1. Class/Concept Description: Characterization and Discrimination
2. Associations, and correlations
3. Classification and regression
4. Clustering analysis
5. Outlier analysis
1. Class/Concept Description: Characterization and Discrimination
• It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
• These descriptions can be derived using (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms; (2) data discrimination, by comparing the target class with one or a set of comparative classes (often called the contrasting classes); or (3) both data characterization and discrimination.
• Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a query.
2. Associations and Correlations (Frequent patterns)
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%] // single-dimensional association rule
age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%] // multidimensional association rule
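For concreteness, here is a minimal Python sketch of how a rule's support and confidence are computed; the transactions below are invented for illustration, not taken from the slides:

transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "monitor"},
    {"software"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # confidence(lhs => rhs) = support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"computer", "software"}, transactions))       # 0.5
print(confidence({"computer"}, {"software"}, transactions))  # 0.666...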
Classification and prediction
Clustering analysis
• Clustering analyzes data objects without
consulting class labels.
• Clustering can be used to generate class
labels for a group of data.
• The objects are clustered or grouped
based on the principle of maximizing the
intraclass similarity and minimizing the
interclass similarity.
• That is, clusters of objects are formed so
that objects within a cluster have high
similarity in comparison to one another,
but are rather dissimilar to objects in other
clusters
Outlier analysis
• A data set may contain objects that do not comply with
the general behaviour or model of the data. These data
objects are outliers.
• Many data mining methods discard outliers as noise or
exceptions. However, in some applications (e.g., fraud
detection) the rare events can be more interesting than
the more regularly occurring ones.
• The analysis of outlier data is referred to as outlier
analysis or anomaly mining.
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Architecture: Typical Data Mining System
(Figure: a Graphical User Interface sits on top of a Pattern Evaluation module and the Data Mining Engine, which consult a Knowledge Base.)
Classification of Data Mining System
End of Introduction
Data Preprocessing
• Data preprocessing is an important step in data mining. In this process, raw data is converted into an understandable format and made ready for further analysis. The motive is to improve data quality and make it suitable for the task at hand.
Tasks in Data Preprocessing
1. Data cleaning
Data cleaning helps us remove inaccurate, incomplete, and incorrect data from the dataset. Some techniques used in data cleaning are:
1.1 Handling missing values
• Ignore the tuple, if the tuple contains several attributes with missing values.
• Fill in the missing value manually.
• Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as a label like “Unknown”.
• Use the attribute mean to fill in the missing value.
• Use the most probable value to fill in the missing value.
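As a hedged illustration of these strategies, a short pandas sketch (the column names and values are invented):

import numpy as np
import pandas as pd

# Illustrative data frame with missing values (hypothetical columns).
df = pd.DataFrame({"age": [23.0, 25.0, np.nan, 31.0],
                   "city": ["Pune", None, "Delhi", "Pune"]})

df_drop = df.dropna()                            # ignore (drop) tuples with missing values
df["city"] = df["city"].fillna("Unknown")        # fill with a global constant
df["age"] = df["age"].fillna(df["age"].mean())   # fill with the attribute mean
print(df)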
1.2 Noisy Data
• Noisy data are data that cannot be interpreted by machines; they contain unnecessary, faulty values.
• Binning − This method handles noisy data to make it smooth. The data is divided into equal-sized bins, and a smoothing method is applied within each bin (see the sketch after this list): smoothing by bin means (bin values are replaced by the bin's mean), smoothing by bin medians (bin values are replaced by the bin's median), or smoothing by bin boundaries (each bin value is replaced by the closest of the bin's minimum and maximum values).
• Regression − Regression functions are used to smooth the data. Regression can be linear (one independent variable) or multiple (several independent variables).
• Clustering − Used for grouping similar data into clusters and for finding outliers.
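A minimal sketch of smoothing by bin means over equal-frequency (equal-depth) bins; the nine values are illustrative:

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3
size = len(data) // n_bins
# Partition the sorted data into equal-frequency bins.
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]
# Replace every value in a bin by the bin mean.
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(smoothed)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]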
1.3 Inconsistent Data
DATA INTEGRATION
• Data integration is the method of merging data derived from different data sources into a consistent dataset.
• During data integration, there are certain issues to be considered, namely:
  1. Schema integration and object matching (the entity identification problem), e.g., CustomerId / custID / CID
  2. Redundancy (detected with the correlation coefficient; see the sketch below)
  3. Tuple duplication
  4. Detection and resolution of data value conflicts (e.g., differing units)
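A hedged sketch of redundancy detection via the correlation coefficient (the attribute values are invented):

import numpy as np

# Two numeric attributes drawn, hypothetically, from different sources.
a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.1, 2.0, 2.9, 4.2, 5.0])
r = np.corrcoef(a, b)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))            # ≈ 0.997: b is largely redundant given a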
Data Transformation
• Smoothing
• Aggregation
• Generalization of data
• Normalization
  1. Min – Max Normalization
  2. Z – Score Normalization
  3. Decimal Scaling Normalization
• Attribute Construction
Min – Max Normalization
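The slide's formula graphic is not reproduced; the standard min-max rule maps a value $v$ of attribute $A$ from $[\min_A, \max_A]$ into a new range $[\mathit{new\_min}_A, \mathit{new\_max}_A]$:

$v' = \dfrac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$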
Z – Score Normalization
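Again the graphic is omitted; the standard z-score rule uses the mean $\bar{A}$ and standard deviation $\sigma_A$ of attribute $A$:

$v' = \dfrac{v - \bar{A}}{\sigma_A}$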
Decimal Scaling Normalization
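The standard decimal-scaling rule (graphic omitted) is $v' = v / 10^{j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$. A minimal Python sketch of all three normalizations follows; the value list is invented:

values = [200.0, 300.0, 400.0, 600.0, 1000.0]

# Min-max normalization into [0, 1]
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_score = [(v - mean) / std for v in values]

# Decimal scaling: j = number of digits in the largest absolute value
j = len(str(int(max(abs(v) for v in values))))
decimal_scaled = [v / 10 ** j for v in values]

print(min_max)         # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaled)  # [0.02, 0.03, 0.04, 0.06, 0.1]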
Data Reduction
• Data cube aggregation, which applies aggregation operations during data cube construction.
• Attribute subset selection, in which obsolete, weakly relevant, or redundant attributes are detected and removed.
• Dimensionality reduction, where encoding schemes are used to reduce the size of the data to a minimum.
• Numerosity reduction, where the data is replaced by an alternative, smaller representation that estimates or summarizes it.
• Discretization and concept hierarchy generation, where raw attribute values are replaced by ranges or higher conceptual levels.
Attribute Subset Selection
Data Mining
Data Preprocessing
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Example: a record table with attributes Tid, Refund, Marital Status, Taxable Income, and Cheat

Attribute types:
• Interval: differences between values are meaningful, i.e., a unit of measurement exists (+, −)
  – Examples: calendar dates, temperature in Celsius or Fahrenheit
  – Operations: mean, standard deviation, Pearson's correlation, t and F tests
• Ratio: both differences and ratios are meaningful (*, /)
  – Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  – Operations: geometric mean, harmonic mean, percent variation
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a
finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes (e.g., Tid, Refund, Marital Status, Taxable Income, Cheat)

Document Data
• Each document is represented as a term vector: each term is an attribute, and its value is the number of times the term appears in the document

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2       0       2
Document 2     0     7     0     2      1     0     0     3       0       0
Document 3     0     1     0     0      1     2     2     0       3       0
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
• Examples: Facebook graph and HTML Links
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data
Noise
• Noise refers to modification of original values
  – Examples: distortion of a person's voice when talking on a poor phone, and “snow” on a television screen

Duplicate Data
• A data set may include data objects that are duplicates of one another
  – Example: same person with multiple email addresses
• Data cleaning
  – Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a
single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
Sampling
• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final
data analysis.
• Stratified sampling
– Split the data into several partitions; then draw random samples from each
partition
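A hedged pandas sketch of stratified sampling (the class column, values, and 50% fraction are all invented):

import pandas as pd

df = pd.DataFrame({"cls": ["a"] * 8 + ["b"] * 2, "x": range(10)})
# Draw the same fraction from every partition (stratum).
sample = df.groupby("cls", group_keys=False).apply(lambda g: g.sample(frac=0.5, random_state=0))
print(sample["cls"].value_counts())   # 4 rows from 'a', 1 from 'b'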
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

Cosine Similarity (worked example with document vectors d1 and d2):
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 ≈ 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.315
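A quick numpy check of the computation above:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315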
(Figure: scatter plots showing the similarity from −1 to 1.)
End of Data Preprocessing
Data Mining
Association Rules
Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items in
the transaction
Market-Basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
• Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
(Figure: the itemset lattice over items {A, B, C, D, E}, from the null set at the top, through the 1-itemsets A..E, the 2-itemsets AB..DE, the 3-itemsets ABC..CDE, down to ABCDE. Once an itemset is found to be infrequent, all of its supersets in the lattice are pruned.)
Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets; no need to generate candidates involving Coke or Eggs):
Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent
itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that
are frequent
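A compact Python sketch of this method (an illustrative implementation, not the textbook's optimized hash-tree version; it reuses the market-basket transactions above, with minsup as an absolute count):

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 3

def apriori(transactions, minsup):
    # Step 1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    all_frequent, k = {}, 1
    while freq:
        for s in freq:  # record support counts of the frequent k-itemsets
            all_frequent[s] = sum(s <= t for t in transactions)
        k += 1
        # Generate length-k candidates by joining frequent (k-1)-itemsets...
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # ...prune candidates containing an infrequent (k-1)-subset...
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # ...and count support by scanning the DB, keeping frequent candidates.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= minsup}
    return all_frequent

for itemset, count in sorted(apriori(transactions, minsup).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# Prints the four frequent items and the four frequent pairs; with these five
# transactions {Bread, Milk, Diaper} has support 2, so it is pruned at minsup = 3.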
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may
also increase
• Size of database
  – Apriori makes multiple passes, so the run time of the algorithm increases with the number of transactions
• Average transaction width
– This may increase max length of frequent itemsets and traversals of hash tree
(number of subsets in a transaction increases with its width)
Rule Generation
• How to efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
  – But confidence of rules generated from the same itemset does have an anti-monotone property
  – e.g., for L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
• Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
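To make the rule-generation step concrete, here is a hedged brute-force sketch that checks every binary partition of one frequent itemset against minconf; it reuses the transactions list from the Apriori sketch and omits the level-wise pruning described above:

from itertools import combinations

def support_count(itemset, transactions):
    # Number of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions)

def rules_from_itemset(itemset, transactions, minconf):
    items = frozenset(itemset)
    rules = []
    # Every non-empty proper subset of the itemset can be a rule antecedent.
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(items, r)):
            conf = support_count(items, transactions) / support_count(lhs, transactions)
            if conf >= minconf:
                rules.append((sorted(lhs), sorted(items - lhs), round(conf, 2)))
    return rules

for lhs, rhs, conf in rules_from_itemset({"Bread", "Milk", "Diaper"}, transactions, minconf=0.6):
    print(lhs, "->", rhs, "conf:", conf)
# Only the rules with two-item antecedents survive, e.g.
# ['Bread', 'Milk'] -> ['Diaper'] conf: 0.67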
Rule Generation for Apriori Algorithm
(Figure: the lattice of rules derived from the frequent itemset ABCD, from ABCD ⇒ { } at the top, through BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D, down to D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD. If a rule is found to have low confidence, all rules below it in the lattice are pruned.)
Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  – Example: join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
Subjective Interestingness Measures
• Subjective measure:
– Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin)
• A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)
Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)
(Figure: matching user expectations against extracted patterns separates expected patterns from unexpected patterns.)
• Need to combine expectation of users with evidence from data (i.e., extracted patterns)
End of Association Rules