Csb4318 DWDM Unit II PPT Word


Unit – II Data Mining

 Motivation: Why data mining?


 What is data mining?
 Data Mining: On what kind of data?
 Data mining functionality
 Major issues in data mining

Why Data Mining?
 The explosive growth of data: from terabytes to petabytes
 Data collection and data availability
  Automated data collection tools, database systems, the Web, a computerized society
 Major sources of abundant data
  Business: Web, e-commerce, transactions, stocks, …
  Science: remote sensing, bioinformatics, scientific simulation, …
  Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 "Necessity is the mother of invention"—Data mining: automated analysis of massive data sets

What Is Data Mining?

 Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer? Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything "data mining"?
  Simple search and query processing
  (Deductive) expert systems
Knowledge Discovery (KDD) Process

KDD Process: Several Key Steps
 Learning the application domain
  Relevant prior knowledge and the goals of the application
 Creating a target data set: data selection
 Data cleaning and preprocessing (may take 60% of the effort!)
 Data reduction and transformation
  Find useful features; dimensionality/variable reduction; invariant representation
 Choosing the data mining function(s)
  Summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
  Visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge

Why Not Traditional Data Analysis?

 Tremendous amount of data
  Algorithms must be highly scalable to handle data on the order of terabytes
 High dimensionality of data
  Micro-array data may have tens of thousands of dimensions
 High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social networks, and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text, and Web data
  Software programs, scientific simulations
 New and sophisticated applications

Multi-Dimensional View of Data Mining

 Data to be mined
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
 Knowledge to be mined
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels
 Techniques utilized
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
 Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes

 General functionality
  Descriptive data mining
  Predictive data mining
 Different views lead to different classifications
  Data view: kinds of data to be mined
  Knowledge view: kinds of knowledge to be discovered
  Method view: kinds of techniques utilized
  Application view: kinds of applications adapted
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
  Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
  Data streams and sensor data
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structured data, graphs, social networks, and multi-linked data
  Object-relational databases
  Heterogeneous databases and legacy databases
  Spatial data and spatiotemporal data
  Multimedia databases
  Text databases
  The World Wide Web
Data Mining: On What Kinds of Data?
 The new database applications include spatial data (such as maps), engineering data (such as designs of buildings, system components, or integrated circuits), hypertext and multimedia data (including text, image, video, and audio data), time-related data (such as historical records or stock exchange data), stream data (such as video surveillance and sensor data), and the World Wide Web (a huge, widely distributed information repository made available by the Internet).

 These applications require efficient data structures and scalable methods for handling complex object structures; variable-length records; semi-structured or unstructured data; text, spatiotemporal, and multimedia data; and database schemas with complex structures and dynamic changes.

 For these reasons, advanced database systems and specific application-oriented database systems have been developed.

Relational Database System

 A database is a collection of interrelated data, together with a set of software programs to access and manage the data.

 A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes.

 Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.

 An entity-relationship (ER) model is often constructed for a relational database; it represents the database as a set of entities and their relationships.

 Relational data can be accessed by executing SQL queries.

 A relational query is transformed into a set of relational operations, such as join, selection, and projection, and is then optimized for efficient processing.

Relational Database System

 Relational queries:
  Show me a list of all items that were sold in the last quarter.
  Show me the total sales of the last month, grouped by branch (aggregate functions such as sum, avg, min, max).

 A mining system can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.

 A mining system can detect deviations, such as items whose sales are far from those expected in comparison with the previous year.

Data Warehouse Data
Data warehouse
 Historical data analysis facilitates decision making.
 Integrated, subject-oriented, non-volatile, and time-referenced.
 Multidimensional analysis of data.
 Data are organized in a separate schema architecture.

Data mining
 Can also analyze transactional, spatial, textual, and multimedia data that are difficult to model with current multidimensional technology.
Transaction Database
 Consists of a file where each record represents a transaction.

 A transaction typically includes a unique transaction identity number and a list of the items making up the transaction; it may also include other information such as the date of the transaction, the customer ID, the salesperson ID, and the branch at which the sale occurred.

 Most relational DBMSs do not support nested relational structures, whereas a transactional database stores a list of item IDs for each transaction.

 A simple query such as "Show me all items purchased by Kathir" can be answered directly on a transactional database.
Transaction Database
Data Mining
 A regular data retrieval system is not able to answer queries like "Which items sold well together?"

 Market basket analysis would enable you to bundle groups of items together as a strategy for maximizing sales.

 If printers are commonly purchased together with computers, you could offer an expensive model of printer at a discount to customers buying selected computers, in the hope of selling more of the expensive printers.
Spatial and Spatiotemporal Database

 Contains spatial-related information. Examples include:
  Geographic (map) databases, very-large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases.

 Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps.

 Example: a 2-D satellite image may be represented as raster data, where each pixel registers the rainfall in a given area.

 Maps may be represented in vector format, where roads, bridges, buildings, and lakes are represented as unions or overlays of basic geometric constructs, such as points, polygons, and the networks formed by these components.

Spatial and Spatiotemporal Database
 Geographic databases have numerous applications, ranging from forestry and ecology planning to providing public-service information regarding the location of telephone and electric cables, pipes, and sewage systems.

 They are also used in vehicle navigation and dispatching systems. For example, a taxi dispatching system could store a city map with information regarding one-way streets, suggested routes, and the locations of restaurants and hospitals, as well as the current location of each driver.

What Kind of Mining on Spatial Data?

 Discover patterns describing the characteristics of houses located near a specified kind of location, such as a park.
 Describe the climate of mountainous areas located at various altitudes.
 Describe the change in trend of metropolitan poverty rates based on city distances from major highways.
Spatiotemporal Database

 A spatial database that stores spatial objects that change with time is called a spatiotemporal database, from which interesting information can be mined.

 For example, we may be able to group the trends of moving objects and identify some strangely moving vehicles, or distinguish a bioterrorist attack from a normal outbreak of flu based on the geographic spread of a disease over time.
Text Database and Multimedia Database

 Text databases are databases that contain word descriptions of objects. These are usually not simple keywords but long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, and summary reports or notes.

 What can data mining on a text database uncover?
  It can discover general and concise descriptions of the text documents, content associations, as well as the clustering behavior of the text documents.

 To do this, standard data mining methods need to be integrated with information retrieval techniques.
Multimedia Database

 Multimedia databases store image, audio, and video data.

 They are used in applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user interfaces that recognize spoken commands.

 Specialized storage and search techniques need to be integrated with standard data mining methods. Because video and audio data require real-time retrieval at a steady and predetermined rate in order to avoid picture or sound gaps and system buffer overflows, such data are referred to as continuous-media data.

 Mining may involve the construction of multimedia data cubes, the extraction of multiple features from multimedia data, and similarity-based pattern matching.
Heterogeneous and Legacy Databases

 A heterogeneous database consists of a set of interconnected, autonomous component databases. The components communicate in order to exchange information and answer queries.

 Heterogeneous data are not easily integrated.

 A legacy database is a group of heterogeneous databases that combines different kinds of data systems. The heterogeneous databases in a legacy database may be connected by intra- or inter-computer networks.

 Information exchange across such databases is difficult because it would require precise transformation rules from one representation to another, considering diverse semantics.

 Example (exchanging grade data between schools): data mining provides an interesting solution to the information exchange problem by performing statistical data distribution and correlation analysis and transforming the given data into a higher, more generalized conceptual level (such as fair, good, excellent), from which information exchange can then more easily be performed.

Data Streams
 Many applications involve the generation and analysis of a new kind of data, where data flow in and out of an observation platform dynamically.

 Features of data streams: huge or possibly infinite volume, dynamically changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and demanding fast response time.

 Data streams include various kinds of scientific and engineering data, time-series data, power-supply data, network traffic, stock exchange and telecommunications data, video surveillance, and weather or environmental monitoring data.

 Most data streams are not stored in any kind of data repository; efficient management and analysis of stream data is therefore a challenging task.

Data Streams
 A typical data model for stream data is the continuous query model, in which predefined queries constantly evaluate incoming streams, collect aggregate data, report the current status of the data streams, and respond to their changes.

 Mining data streams involves the efficient discovery of general patterns and dynamic changes within stream data.

 Example: detect an intrusion of a computer network based on anomalies in the message flow, which may be discovered by clustering data streams, dynamic construction of stream models, or comparing the current frequent patterns with those at a certain previous time.

World Wide Web
 The WWW and its associated information services, such as Yahoo! and Google, provide rich, worldwide online information services, where data objects are linked together to facilitate interactive access.

 Understanding user access patterns will not only help improve system design but also lead to better marketing decisions.

World Wide Web
 Capturing user access patterns in such distributed information environments is called Web usage mining.

 WWW data are highly unstructured and lack a predefined schema or patterns.

 Web services that provide keyword-based searches without understanding the context behind the Web pages can offer only limited help to users.
World Wide Web
 A Web search based on a single keyword may return hundreds of Web page pointers containing the keyword, but most of the pointers will be only weakly related to what the user wants to find.

 Data mining can often provide more help than such Web services.

 Web page analysis based on the linkages among Web pages can help rank Web pages based on their importance, influence, and topics.
World Wide Web
 Automated Web page clustering and classification help group and arrange Web pages in a multidimensional manner based on their content.

 Web community analysis helps identify hidden Web social networks and communities and observe their evolution.

Object-Relational Database
 Constructed based on the object-relational data model.

 This model extends the relational model by providing rich data types for handling complex objects and object orientation.

 Most sophisticated database applications need to handle complex objects and structures, so this model has become very popular in industry and applications.

 Conceptually, the object-relational data model inherits the essential concepts of object-oriented databases, where each entity is considered an object.

 Data and code relating to an object are encapsulated into a single unit. Each object is associated with a set of variables that describe the object, a set of messages that the object can use to communicate with other objects or with the rest of the database system, and a set of methods, where each method holds the code to implement a message.

Object-Relational Database

 Objects that share a common set of properties can be grouped into an object class; each object is an instance of its class.

 Example: an employee class can contain variables like name, address, and birth date. A subclass of this class, salesperson, inherits all the variables of its superclass.

 Data mining techniques need to be developed for handling complex object structures, complex data types, class and subclass hierarchies, property inheritance, and methods.

Temporal, Sequence, and Time-Series Databases
 A temporal database stores relational data that include time-related attributes. These attributes may involve several timestamps, each having different semantics.

 Data mining techniques can be used to find the characteristics of object evolution or the trend of changes for objects in the database. Such information is useful in decision making and strategy planning.

 Example: mining of bank data may aid in the scheduling of bank tellers according to the volume of customer traffic.

 Stock exchange data can be mined to discover trends that could help you plan investment strategies (e.g., when is the best time to purchase AllElectronics stock?). This analysis requires defining multiple granularities of time.
Temporal, Sequence, and Time-Series Databases

 A sequence database stores sequences of ordered events, with or without a concrete notion of time (e.g., customer shopping sequences).

 A time-series database stores sequences of values or events obtained over repeated measurements of time (hourly, daily, weekly).

 Examples: data collected from the stock exchange, inventory control, and the observation of natural phenomena (like temperature and wind).

 Data mining techniques can be used to find the characteristics of object evolution or the trend of changes for objects in the database.

Data Mining Functionalities

 Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.

 They can be classified as descriptive mining and predictive mining.

 Descriptive mining tasks characterize the general properties of the data.

 Predictive mining tasks perform inference on the current data in order to make predictions.

 A data mining system should mine multiple kinds of patterns to accommodate different user expectations.

 It should also be able to discover patterns at various granularities.

Data Mining Functionalities

 Class or concept description: useful to describe individual classes and concepts in summarized, concise, and yet precise terms.
 Characterization and discrimination

 Data characterization is a summarization of the general characteristics or features of a target class of data.
  Example: study the characteristics of software products whose sales increased by 10% in the last year.

 Data discrimination is a comparison of the general features of target-class data objects with the general features of objects from one or a set of contrasting (comparative) classes.
  Example: the user may wish to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.

Data Mining Functionalities
 Frequent patterns, association, correlation vs. causality
  A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread.

  buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]

  A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.

  A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together.

  age(X, "20…29") ∧ income(X, "20K…29K") ⇒ buys(X, "CD player") [support = 2%, confidence = 60%]

Data Mining Functionalities

Classification and prediction
 Construct models (functions) that describe and distinguish classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown.

 The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

 Example: determine the credit risk of a credit card applicant.
  An applicant is accepted if he has a good credit risk, or rejected if he has a bad credit risk, based on income, employment status, savings, age, and home ownership.

 Prediction: predict some unknown or missing numerical values.
 The mining process constructs a model to identify a person as belonging to a class.
 The derived model may be represented as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.

Data Mining Functionalities (2)

 Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity and minimizing inter-class similarity
  Banking: identify groups of customers
 Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection and rare-events analysis
  Example: detecting purchases of large amounts compared to regular purchases
 Trend and evolution analysis
  Describes and models regularities or trends for objects whose behavior changes over time (several years of time-series data help predict the trend)
  Trend and deviation: e.g., regression analysis
  Sequential pattern mining, periodicity analysis, similarity-based analysis

Major Issues in Data Mining

 Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed, and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion
 User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction
 Applications and social impacts
  Domain-specific data mining and invisible data mining
  Protection of data security, integrity, and privacy

Are All the "Discovered" Patterns Interesting?

 Data mining may generate thousands of patterns: not all of them are interesting
 Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
 Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?

 Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Association vs. classification vs. clustering
 Search for only the interesting patterns: an optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
   First generate all the patterns and then filter out the uninteresting ones
   Generate only the interesting patterns—mining query optimization

Why a Data Mining Query Language?

 Automated vs. query-driven?
  Finding all the patterns autonomously in a database is unrealistic, because the patterns could be too many and mostly uninteresting
 Data mining should be an interactive process
  The user directs what is to be mined
 Users must be provided with a set of primitives to be used to communicate with the data mining system
 Incorporating these primitives in a data mining query language
  More flexible user interaction
  Foundation for the design of graphical user interfaces
  Standardization of the data mining industry and practice

Primitives of a Data Mining Task

 Task-relevant data
  Database or data warehouse name
  Database tables or data warehouse cubes
  Conditions for data selection
  Relevant attributes or dimensions
  Data grouping criteria
 Type of knowledge to be mined
  Characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks
 Background knowledge
 Pattern interestingness measures
 Visualization/presentation of discovered patterns
Primitive 3: Background Knowledge

 A typical kind of background knowledge: concept hierarchies
 Schema hierarchy
  E.g., street < city < province_or_state < country
 Set-grouping hierarchy
  E.g., {20-39} = young, {40-59} = middle_aged
 Operation-derived hierarchy
  E.g., an e-mail address yields login-name < department < university < country
 Rule-based hierarchy
  low_profit_margin(X) ⇐ price(X, P1) ∧ cost(X, P2) ∧ ((P1 − P2) < $50)
Primitive 4: Pattern Interestingness Measure

 Simplicity
  e.g., (association) rule length, (decision) tree size
 Certainty
  e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
 Utility
  potential usefulness, e.g., support (association), noise threshold (description)
 Novelty
  not previously known, surprising (used to remove redundant rules, e.g., the Illinois vs. Champaign rule implication support ratio)
Primitive 5: Presentation of Discovered Patterns

 Different backgrounds/usages may require different forms of representation
  E.g., rules, tables, crosstabs, pie/bar charts, etc.
 Concept hierarchies are also important
  Discovered knowledge might be more understandable when represented at a high level of abstraction
  Interactive drill up/down, pivoting, slicing, and dicing provide different perspectives on the data
 Different kinds of knowledge require different representations: association, classification, clustering, etc.
Integration of Data Mining and Data Warehousing

 Data mining systems, DBMSs, and data warehouse systems coupling
  No coupling, loose coupling, semi-tight coupling, tight coupling
 On-line analytical mining of data
  Integration of mining and OLAP technologies
 Interactive mining of multi-level knowledge
  Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
 Integration of multiple mining functions
  E.g., characterized classification, first clustering and then association

Coupling Data Mining with DB/DW Systems

 No coupling—flat file processing, not recommended
 Loose coupling
  Fetching data from a DB/DW
 Semi-tight coupling—enhanced DM performance
  Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
 Tight coupling—a uniform information processing environment
  DM is smoothly integrated into a DB/DW system; a mining query is optimized based on mining query analysis, indexing, query processing methods, etc.

Architecture: Typical Data Mining System

 Graphical user interface
 Pattern evaluation
 Data mining engine (interacting with a knowledge base)
 Database or data warehouse server (data cleaning, integration, and selection)
 Data sources: databases, data warehouse, the World Wide Web, and other information repositories
Data Preprocessing

 Data cleaning
 Data integration and transformation
 Data reduction
 Summary

Major Tasks in Data Preprocessing


 Data cleaning

 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the
same or similar analytical results
 Data discretization: part of data reduction, of particular
importance for numerical data

Data Cleaning
 No quality data, no quality mining results!
  Quality decisions must be based on quality data
   e.g., duplicate or missing data may cause incorrect or even misleading statistics
  "Data cleaning is the number one problem in data warehousing"—DCI survey
  Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
 Data cleaning tasks
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
  Resolve redundancy caused by data integration

Data in the Real World Is Dirty


 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
 e.g., occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or
names, e.g.,
 Age=“42” Birthday=“03/07/1997”
 Was rating “1,2,3”, now rating “A, B, C”
 discrepancy between duplicate records

Why Is Data Dirty?


 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was collected and
when it is analyzed.
 Human/hardware/software problems

 Noisy data (incorrect values) may come from


 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)

 Duplicate records also need data cleaning


Multi-Dimensional Measure of Data Quality
 A well-accepted multidimensional view:
  Accuracy
  Completeness
  Consistency
  Timeliness
  Believability
  Value added
  Interpretability
  Accessibility
 Broad categories:
  Intrinsic, contextual, representational, and accessibility
Missing Data
 Data is not always available
  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
  equipment malfunction
  inconsistency with other recorded data, leading to deletion
  data not entered due to misunderstanding
  certain data not being considered important at the time of entry
  failure to register the history or changes of the data
 Missing data may need to be inferred
How to Handle Missing Data?

 Ignore the tuple: usually done when the class label is missing (when doing classification)—not effective when the percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
  a global constant: e.g., "unknown", a new class?!
  the attribute mean
  the attribute mean for all samples belonging to the same class: smarter
  the most probable value: inference-based, such as a Bayesian formula or a decision tree
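A minimal pandas sketch of three of these strategies, assuming a hypothetical DataFrame with a numeric income column and a class column (the column names and values are illustrative, not from the slides):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30_000, np.nan, 52_000, np.nan, 48_000],
})

# 1. Fill with a global constant (creates an artificial "unknown" value).
filled_const = df["income"].fillna(-1)

# 2. Fill with the overall attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# 3. Fill with the attribute mean of samples in the same class (smarter).
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled_class_mean.tolist())
```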

Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which requires data cleaning
 duplicate records
 incomplete data
 inconsistent data

How to Handle Noisy Data?


 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with
possible outliers)

Simple Discretization Methods: Binning

 Equal-width (distance) partitioning
  Divides the range into N intervals of equal size: uniform grid
  If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
  The most straightforward, but outliers may dominate the presentation
  Skewed data is not handled well
 Equal-depth (frequency) partitioning
  Divides the range into N intervals, each containing approximately the same number of samples
  Good data scaling
  Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
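A small Python sketch of this smoothing, written from scratch only for illustration (no particular library API is assumed):

```python
def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins bins (assumes len(values) is divisible by n_bins)."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (bin min or bin max)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```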

Regression
 Data can be smoothed by fitting it to a function, e.g., the line y = x + 1.

Cluster Analysis
Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter's Wheel)
Data Integration
 Data integration:
  Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
  Integrate metadata from different sources
 Entity identification problem:
  Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
  For the same real-world entity, attribute values from different sources are different
  Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration

 Redundant data occur often when integrating multiple databases
  Object identification: the same attribute or object may have different names in different databases
  Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis
 Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numerical Data)

 Correlation coefficient (also called Pearson's product-moment coefficient):

   r_{p,q} = Σ(p − p̄)(q − q̄) / ((n − 1) σ_p σ_q) = (Σ(pq) − n p̄ q̄) / ((n − 1) σ_p σ_q)

  where n is the number of tuples, p̄ and q̄ are the respective means of p and q, σ_p and σ_q are the respective standard deviations of p and q, and Σ(pq) is the sum of the pq cross-products.

 If r_{p,q} > 0, p and q are positively correlated (p's values increase as q's do). The higher the value, the stronger the correlation.
 r_{p,q} = 0: independent; r_{p,q} < 0: negatively correlated.

Correlation (viewed as a linear relationship)
 Correlation measures the linear relationship between objects
 To compute correlation, we standardize the data objects p and q, and then take their dot product:

   p'_k = (p_k − mean(p)) / std(p)
   q'_k = (q_k − mean(q)) / std(q)
   correlation(p, q) = p' · q'
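A quick numpy sketch of both views of this computation (the data vectors here are made up for illustration):

```python
import numpy as np

p = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
q = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(p)

# Pearson correlation from the definition.
r = ((p - p.mean()) * (q - q.mean())).sum() / ((n - 1) * p.std(ddof=1) * q.std(ddof=1))

# Same value via standardization + dot product (divided by n - 1 to match).
p_std = (p - p.mean()) / p.std(ddof=1)
q_std = (q - q.mean()) / q.std(ddof=1)
r_dot = np.dot(p_std, q_std) / (n - 1)

print(r, r_dot, np.corrcoef(p, q)[0, 1])  # all three agree
```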
Data Transformation
 A function that maps the entire set of values of a given
attribute to a new set of replacement values s.t. each
old value can be identified with one of the new values
 Methods
 Smoothing: Remove noise from data
 Aggregation: Summarization, data cube construction
 Generalization: Concept hierarchy climbing
 Normalization: Scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones
Data Transformation: Normalization
 Min-max normalization: to [new_min_A, new_max_A]

   v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

  Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

 Z-score normalization (μ: mean, σ: standard deviation):

   v' = (v − μ_A) / σ_A

  Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225

 Normalization by decimal scaling:

   v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
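The three normalizations as a short numpy sketch (the income values reuse the example figures above):

```python
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0].
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0

# Z-score normalization (using the stated mean and standard deviation).
mu, sigma = 54_000.0, 16_000.0
zscore = (income - mu) / sigma

# Decimal scaling: divide by the smallest power of 10 that brings |v'| below 1.
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j

print(minmax.round(3))     # [0.    0.488 0.716 1.   ]
print(zscore.round(3))     # [-2.625  0.     1.225  2.75 ]
print(decimal_scaled)      # [0.12  0.54  0.736 0.98 ]
```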
Data Reduction Strategies
 Why data reduction?
  A database/data warehouse may store terabytes of data
  Complex data analysis/mining may take a very long time to run on the complete data set
 Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
 Data reduction strategies
  Dimensionality reduction — e.g., remove unimportant attributes
  Numerosity reduction (some simply call it data reduction)
   Data cube aggregation
   Data compression
   Regression
  Discretization (and concept hierarchy generation)

Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Principal component analysis

 Singular value decomposition
 Supervised and nonlinear techniques (e.g., feature selection)

Dimensionality Reduction: Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of
variation in data
 Find the eigenvectors of the covariance matrix, and these
eigenvectors define the new space

Principal Component Analysis (Steps)
 Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  Normalize the input data: each attribute falls within the same range
  Compute k orthonormal (unit) vectors, i.e., the principal components
  Each input data vector is a linear combination of the k principal component vectors
  The principal components are sorted in order of decreasing "significance" or strength
  Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
 Works for numeric data only
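A bare-bones numpy sketch of these steps on an invented 2-D data set (eigen-decomposition of the covariance matrix; in practice one would typically call a library such as scikit-learn):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated data

# 1. Center (normalize) the data.
Xc = X - X.mean(axis=0)

# 2. Eigenvectors of the covariance matrix are the principal components.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing variance and keep the strongest k.
order = np.argsort(eigvals)[::-1]
k = 1
components = eigvecs[:, order[:k]]

# 4. Project onto the k components, then reconstruct an approximation.
scores = Xc @ components
X_approx = scores @ components.T + X.mean(axis=0)
print("reconstruction error:", np.linalg.norm(X - X_approx))
```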

Feature Subset Selection

 Another way to reduce the dimensionality of data
 Redundant features
  Duplicate much or all of the information contained in one or more other attributes
  E.g., the purchase price of a product and the amount of sales tax paid
 Irrelevant features
  Contain no information that is useful for the data mining task at hand
  E.g., students' ID is often irrelevant to the task of predicting students' GPA

Heuristic Search in Feature Selection

 There are 2^d possible feature combinations of d features
 Typical heuristic feature selection methods:
  Best single features under the feature-independence assumption: choose by significance tests
  Best step-wise feature selection:
   The best single feature is picked first
   Then the next best feature conditioned on the first, ...
  Step-wise feature elimination:
   Repeatedly eliminate the worst feature
  Best combined feature selection and elimination
  Optimal branch and bound:
   Use feature elimination and backtracking
Feature Creation
 Create new attributes that can capture the important
information in a data set much more efficiently than
the original attributes
 Three general methodologies
 Feature extraction
 domain-specific
 Mapping data to new space (see: data reduction)
 E.g., Fourier transformation, wavelet transformation
 Feature construction
 Combining features
 Data discretization
Mapping Data to a New Space
 Fourier transform
 Wavelet transform
 Example: two sine waves, and two sine waves plus noise, viewed in the frequency domain.

Numerosity (Data) Reduction

 Reduce data volume by choosing alternative, smaller forms of data representation
 Parametric methods (e.g., regression)
  Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  Example: log-linear models—obtain the value at a point in m-D space as a product over appropriate marginal subspaces
 Non-parametric methods
  Do not assume models
  Major families: histograms, clustering, sampling
Parametric Data Reduction:
Regression and Log-Linear Models

 Linear regression: Data are modeled to fit a straight line


 Often uses the least-square method to fit the line

 Multiple regression: allows a response variable Y to be


modeled as a linear function of multidimensional feature
vector

 Log-linear model: approximates discrete multidimensional


probability distributions

Regression Analysis and Log-Linear Models
 Linear regression: Y = w X + b
  Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
  Use the least-squares criterion on the known values of Y1, Y2, …, X1, X2, …
 Multiple regression: Y = b0 + b1 X1 + b2 X2
  Many nonlinear functions can be transformed into the above
 Log-linear models:
  The multi-way table of joint probabilities is approximated by a product of lower-order tables
  Probability: p(a, b, c, d) ≈ α_ab β_ac γ_ad δ_bcd
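A tiny numpy sketch of fitting Y = wX + b by least squares and keeping only the two parameters (the sample points are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Least-squares estimates of the slope w and intercept b.
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()

# The reduced representation is just (w, b); the raw points can be discarded.
print(w, b)                 # roughly 1.93 and 0.23
y_hat = w * x + b           # reconstructed (approximate) values
```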
Data Reduction: Wavelet Transformation
 Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis (e.g., Haar-2, Daubechies-4 wavelets)
 Compressed approximation: store only a small fraction of the strongest wavelet coefficients
 Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
 Method:
  The length, L, must be an integer power of 2 (pad with 0's when necessary)
  Each transform has two functions: smoothing and difference
  They are applied to pairs of data, resulting in two sets of data of length L/2
  The two functions are applied recursively until the desired length is reached
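A minimal sketch of this pairwise smoothing/difference idea for the Haar case (normalization constants vary between texts; 1/√2 is assumed here):

```python
import numpy as np

def haar_dwt(signal):
    """Recursively apply pairwise smoothing (averages) and difference (details)."""
    data = np.asarray(signal, dtype=float)
    assert (len(data) & (len(data) - 1)) == 0, "length must be a power of 2"
    output = []
    while len(data) > 1:
        smooth = (data[0::2] + data[1::2]) / np.sqrt(2)   # low-pass half
        detail = (data[0::2] - data[1::2]) / np.sqrt(2)   # high-pass half
        output.append(detail)
        data = smooth
    output.append(data)          # final smoothed coefficient
    return np.concatenate(output[::-1])

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
# Keep only the strongest coefficients for a compressed approximation.
compressed = np.where(np.abs(coeffs) >= 1.0, coeffs, 0.0)
print(coeffs.round(3))
print(compressed.round(3))
```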

DWT for Image Compression
 An image is decomposed by repeatedly applying low-pass and high-pass filters; only the strongest coefficients at each level need to be stored.
Data Cube Aggregation
 The lowest level of a data cube (base cuboid)
  The aggregated data for an individual entity of interest
  E.g., a customer in a phone-calling data warehouse
 Multiple levels of aggregation in data cubes
  Further reduce the size of the data to deal with
 Reference appropriate levels
  Use the smallest representation that is enough to solve the task
 Queries regarding aggregated information should be answered using the data cube, when possible
Data Compression
 String compression
  There are extensive theories and well-tuned algorithms
  Typically lossless
  But only limited manipulation is possible without expansion
 Audio/video compression
  Typically lossy compression, with progressive refinement
  Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
 Time sequences are not audio
  Typically short and vary slowly with time

Data Compression
 Lossless compression: the original data can be fully reconstructed from the compressed data.
 Lossy compression: only an approximation of the original data can be reconstructed.
Data Reduction: Histograms
 Divide data into buckets and store the average (or sum) for each bucket
 Partitioning rules:
  Equal-width: equal bucket range
  Equal-frequency (or equal-depth)
  V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
  MaxDiff: set bucket boundaries between each pair of adjacent values among the β−1 largest differences
Data Reduction Method: Clustering
 Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
 Can be very effective if data is clustered but not if data is
“smeared”
 Can have hierarchical clustering and be stored in
multidimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms
 Cluster analysis will be studied in depth in Chapter 7

Data Reduction Method: Sampling
 Sampling: obtaining a small sample s to represent the whole data set N
 Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
 Key principle: choose a representative subset of the data
  Simple random sampling may have very poor performance in the presence of skew
  Develop adaptive sampling methods, e.g., stratified sampling
 Note: sampling may not reduce database I/Os (data is read a page at a time)
Types of Sampling
 Simple random sampling
  There is an equal probability of selecting any particular item
 Sampling without replacement
  Once an object is selected, it is removed from the population
 Sampling with replacement
  A selected object is not removed from the population
 Stratified sampling:
  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  Used in conjunction with skewed data
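A compact pandas sketch of simple random sampling (with and without replacement) and proportional stratified sampling, on an invented skewed data set:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north"] * 90 + ["south"] * 10,   # skewed strata
    "sales":  range(100),
})

# Simple random sampling, without and with replacement.
srs_wor = df.sample(n=10, replace=False, random_state=1)
srs_wr  = df.sample(n=10, replace=True,  random_state=1)

# Stratified sampling: draw ~10% from each region so small strata are represented.
stratified = df.groupby("region", group_keys=False).sample(frac=0.10, random_state=1)
print(stratified["region"].value_counts())   # 9 north, 1 south
```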

Sampling: With or without Replacement
 (Figure: raw data sampled with and without replacement.)

Sampling: Cluster or Stratified Sampling
 (Figure: raw data vs. a cluster/stratified sample.)
Data Reduction: Discretization
 Three types of attributes:
 Nominal — values from an unordered set, e.g., color, profession
 Ordinal — values from an ordered set, e.g., military or academic rank
 Continuous — real numbers, e.g., integer or real numbers

 Discretization:
 Divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization
 Prepare for further analysis

Discretization and Concept Hierarchy
 Discretization
  Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  Interval labels can then be used to replace actual data values
  Supervised vs. unsupervised
  Split (top-down) vs. merge (bottom-up)
  Discretization can be performed recursively on an attribute
 Concept hierarchy formation
  Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for Numeric Data
 Typical methods (all can be applied recursively):
  Binning (covered above)
   Top-down split, unsupervised
  Histogram analysis (covered above)
   Top-down split, unsupervised
  Clustering analysis (covered above)
   Either top-down split or bottom-up merge, unsupervised
  Entropy-based discretization: supervised, top-down split
  Interval merging by χ² analysis: unsupervised, bottom-up merge
  Segmentation by natural partitioning: top-down split, unsupervised
Discretization Using Class Labels
 Entropy-based approach
 (Figure: the same data discretized into 3 categories for both x and y vs. 5 categories for both x and y.)
Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

   I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)

 Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

   Entropy(S1) = − Σ_{i=1..m} p_i log2(p_i)

  where p_i is the probability of class i in S1
 The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
 The process is recursively applied to the partitions obtained until some stopping criterion is met
 Such a boundary may reduce data size and improve classification accuracy
Discretization Without Using Class Labels
 (Figure: the same data discretized by equal interval width, equal frequency, and K-means.)
Interval Merge by χ² Analysis
 Merging-based (bottom-up) vs. splitting-based methods
 Merge: find the best neighboring intervals and merge them to form larger intervals recursively
 ChiMerge [Kerber AAAI 1992; see also Liu et al. DMKD 2002]
  Initially, each distinct value of a numerical attribute A is considered to be one interval
  χ² tests are performed for every pair of adjacent intervals
  Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
  This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)

Segmentation by Natural Partitioning

 A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
  If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
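A rough sketch of one level of the 3-4-5 rule, written only to illustrate the counting of distinct values at the most significant digit (the boundary-rounding details differ between descriptions):

```python
def three_four_five(low, high):
    """Partition [low, high) into 3, 4, or 5 equi-width intervals per the 3-4-5 rule."""
    msd = 10 ** (len(str(int(abs(high - low)))) - 1)   # most-significant-digit position
    lo = (low // msd) * msd                            # round boundaries down/up to the msd
    hi = -((-high) // msd) * msd
    distinct = int((hi - lo) / msd)                    # distinct msd values covered
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:                                              # 1, 5, 10, ...
        n = 5
    width = (hi - lo) / n
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]

print(three_four_five(-1000, 2000))   # three intervals: (-1000, 0), (0, 1000), (1000, 2000)
```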

Example of 3-4-5 Rule

 Step 1: for the attribute profit, Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700.
 Step 2: the most significant digit is msd = 1,000, so Low is rounded to -$1,000 and High to $2,000.
 Step 3: the rounded range (-$1,000 … $2,000) covers 3 distinct values at the msd, so it is split into 3 equi-width intervals: (-$1,000 … $0], ($0 … $1,000], ($1,000 … $2,000].
 Step 4: the boundaries are adjusted to cover Min and Max: the first interval shrinks to (-$400 … $0] and a new interval ($2,000 … $5,000] is added.
 Step 5: each interval is recursively partitioned, e.g., (-$400 … $0] into four $100 sub-intervals, ($0 … $1,000] and ($1,000 … $2,000] into five $200 sub-intervals each, and ($2,000 … $5,000] into three $1,000 sub-intervals.

Concept Hierarchy Generation for Categorical Data

 Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  street < city < state < country
 Specification of a hierarchy for a set of values by explicit data grouping
  {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
  E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
  E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation

 Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  The attribute with the most distinct values is placed at the lowest level of the hierarchy
  Exceptions: e.g., weekday, month, quarter, year
 Example:
  country: 15 distinct values
  province_or_state: 365 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values
Mining Frequent Patterns,
Association and Correlations

 Basic concepts and a road map


 Efficient and scalable frequent itemset mining
methods
 Mining various kinds of association rules
 From association mining to correlation analysis
 Constraint-based association mining
 Summary
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

 Motivation: finding inherent regularities in data
  What products were often purchased together?—Beer and diapers?!
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify Web documents?

 Applications
  Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Freq. Pattern Mining Important?


 Discloses an intrinsic and important property of data sets
 Forms the foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data
 Classification: associative classification
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications

Basic Concepts: Frequent Patterns and Association Rules

 Transaction database:
  Transaction-id | Items bought
  10 | A, B, D
  20 | A, C, D
  30 | A, D, E
  40 | B, E, F
  50 | B, C, D, E, F

 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

 Let sup_min = 50%, conf_min = 50%
  Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
  Association rules: A ⇒ D (60%, 100%), D ⇒ A (60%, 75%)
Closed Patterns and Max-Patterns
 A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
 Solution: mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
 A closed pattern is a lossless compression of frequent patterns
  Reduces the number of patterns and rules

Closed Patterns and Max-Patterns
 Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
  Min_sup = 1
 What is the set of closed itemsets?
  <a1, …, a100>: 1
  <a1, …, a50>: 2
 What is the set of max-patterns?
  <a1, …, a100>: 1
 What is the set of all patterns?
  Every non-empty subset of {a1, …, a100}: 2^100 − 1 patterns—far too many to enumerate!
Scalable Methods for Mining Frequent
Patterns
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains
{beer, diaper}
 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant@VLDB’94)
 Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
 Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation-and-Test Approach

 Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested!
  (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
 Method:
  Initially, scan the DB once to get the frequent 1-itemsets
  Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  Test the candidates against the DB
  Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example (sup_min = 2)
 Database TDB:
  Tid | Items
  10 | A, C, D
  20 | B, C, E
  30 | A, B, C, E
  40 | B, E
 1st scan: C1 = {A:2, B:3, C:3, D:1, E:3} → L1 = {A:2, B:3, C:3, E:3}
 2nd scan: C2 = {AB:1, AC:2, AE:1, BC:2, BE:3, CE:2} → L2 = {AC:2, BC:2, BE:3, CE:2}
 3rd scan: C3 = {BCE} → L3 = {BCE:2}
The Apriori Algorithm

 Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
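A compact, unoptimized Python sketch of this loop (candidate generation here is a simple self-join with subset pruning; the TDB example above is reused):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    items = {frozenset([i]) for t in transactions for i in t}
    L = {c: s for c, s in count(items).items() if s >= min_support}
    all_frequent = dict(L)
    k = 1
    while L:
        # Self-join: union pairs of frequent k-itemsets that give (k+1)-itemsets.
        cands = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune candidates with an infrequent k-subset (Apriori property).
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c: s for c, s in count(cands).items() if s >= min_support}
        all_frequent.update(L)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
tdb = [frozenset(t) for t in tdb]
print(apriori(tdb, min_support=2))
```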
Important Details of Apriori
 How to generate candidates?
  Step 1: self-join Lk
  Step 2: pruning
 How to count supports of candidates?
 Example of candidate generation
  L3 = {abc, abd, acd, ace, bcd}
  Self-joining: L3 * L3
   abcd from abc and abd
   acde from acd and ace
  Pruning:
   acde is removed because ade is not in L3
  C4 = {abcd}
How to Generate Candidates?

 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
  forall itemsets c in Ck do
   forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
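The same join-and-prune step as a standalone Python sketch over sorted tuples (illustrative only; it mirrors the L3 example above):

```python
from itertools import combinations

def gen_candidates(Lk_minus_1):
    """Self-join length-(k-1) frequent itemsets, then prune by the Apriori property."""
    prev = sorted(Lk_minus_1)            # each itemset is a sorted tuple of items
    k = len(prev[0]) + 1
    joined = set()
    for p in prev:
        for q in prev:
            # Join only when the first k-2 items agree and p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # Prune candidates having any (k-1)-subset that is not frequent.
    prev_set = set(prev)
    return {c for c in joined
            if all(s in prev_set for s in combinations(c, k - 1))}

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3))    # {('a', 'b', 'c', 'd')}; acde is pruned since ade is not in L3
```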


How to Count Supports of Candidates?
 Why is counting supports of candidates a problem?
  The total number of candidates can be very huge
  One transaction may contain many candidates
 Method:
  Candidate itemsets are stored in a hash tree
  Leaf nodes of the hash tree contain lists of itemsets and counts
  Interior nodes contain hash tables
  Subset function: finds all the candidates contained in a transaction
Example: Counting Supports of Candidates
 (Figure: the subset function hashes the transaction {1 2 3 5 6} down a hash tree whose interior nodes hash items into the branches 1,4,7 / 2,5,8 / 3,6,9, locating the leaf nodes that hold the candidate 3-itemsets contained in the transaction, whose counts are then incremented.)
Efficient Implementation of Apriori in SQL

 Hard to get good performance out of pure SQL (SQL-92) based approaches alone
 Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
  Gets orders of magnitude improvement
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98

Challenges of Frequent Pattern Mining

 Challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates

 Improving Apriori: general ideas


 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates

Partition: Scan Database Only Twice

 Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  Scan 1: partition the database and find local frequent patterns
  Scan 2: consolidate global frequent patterns
 A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
DHP: Reduce the Number of Candidates

 A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  Candidates: a, b, c, d, e
  Hash entries: {ab, ad, ae}, {bd, be, de}, …
  Frequent 1-itemsets: a, b, d, e
  ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
 J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
Sampling for Frequent Patterns

 Select a sample of the original database, and mine frequent patterns within the sample using Apriori
 Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
  Example: check abcd instead of ab, ac, …, etc.
 Scan the database again to find missed frequent patterns
 H. Toivonen. Sampling large databases for association rules. In VLDB'96
DIC: Reduce Number of Scans
 Once both A and D are determined frequent, the counting of AD begins
 Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
 (Figure: the itemset lattice from {} through the 1-itemsets, 2-itemsets, …, up to ABCD; unlike Apriori, DIC starts counting longer itemsets as soon as their subsets qualify, rather than after a full pass over the transactions.)
 S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
Bottleneck of Frequent-Pattern Mining

 Multiple database scans are costly
 Mining long patterns needs many passes of scanning and generates lots of candidates
  To find the frequent itemset i1 i2 … i100
   # of scans: 100
   # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
 Bottleneck: candidate generation and test
 Can we avoid candidate generation?
