21IS503 UnitII LM5
Introduction
■ Why Data Mining?
■ What Is Data Mining?
■ A Multi-Dimensional View of Data Mining
■ What Kind of Data Can Be Mined?
■ What Kinds of Patterns Can Be Mined?
■ Summary
Why Data Mining?
■ The Explosive Growth of Data: from terabytes to petabytes
What Is Data Mining?
■ Alternative names
■ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Knowledge Discovery (KDD) Process
■ This is a view from typical database systems and data warehousing communities
■ Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process. Stages shown, from bottom to top: Databases, Data Cleaning and Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation]
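The stages of the KDD process can be illustrated with a toy pipeline. This is only a minimal sketch: the function names (`clean`, `integrate`, `select`, `mine`, `evaluate`), the record layout, and the `min_count` threshold are assumptions made for illustration, and "mining" here is just frequency counting rather than a real mining algorithm.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def select(records, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]

def mine(values):
    """Data mining (toy stand-in): count how often each value occurs."""
    return Counter(values)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {k: c for k, c in patterns.items() if c >= min_count}

# Hypothetical data from two sources.
store_a = [{"customer": 1, "item": "beer"}, {"customer": 2, "item": None}]
store_b = [{"customer": 3, "item": "beer"}, {"customer": 4, "item": "milk"}]

task_data = select(integrate(clean(store_a), clean(store_b)), "item")
interesting = evaluate(mine(task_data), min_count=2)
print(interesting)
```

Each stage consumes the output of the previous one, which mirrors the bottom-to-top flow from databases to evaluated patterns.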
Data Mining in Business Intelligence
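[Figure: business intelligence layers, with increasing potential to support business decisions toward the top. Labels shown, from top to bottom: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]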
Multi-Dimensional View of Data Mining
■ Data to be mined
■ Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social
and information networks
■ Knowledge to be mined (or: Data mining functions)
■ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
■ Descriptive vs. predictive data mining
■ Multiple/integrated functions and mining at multiple levels
■ Techniques utilized
■ Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
■ Applications adapted
■ Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
■ Database-oriented data sets and applications
■ Relational database, data warehouse, transactional database
■ Advanced data sets and advanced applications
■ Data streams and sensor data
■ Time-series data, temporal data, sequence data (incl. bio-sequences)
■ Structured data, graphs, social networks and multi-linked data
■ Object-relational databases
■ Heterogeneous databases and legacy databases
■ Spatial data and spatiotemporal data
■ Multimedia database
■ Text databases
■ The World-Wide Web
Data Mining Function: (1) Generalization
Data Mining Function: (2) Association and
Correlation Analysis
■ Frequent patterns (or frequent itemsets)
■ What items are frequently purchased together in your
Walmart?
■ Association, correlation vs. causality
■ A typical association rule
■ Diaper → Beer [0.5%, 75%] (support, confidence)
■ Are strongly associated items also strongly correlated?
■ How to mine such patterns and rules efficiently in large
datasets?
■ How to use such patterns for classification, clustering,
and other applications?
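The support and confidence of a rule such as Diaper → Beer can be computed directly from a list of transactions. The sketch below is a minimal illustration, not an efficient mining algorithm: the basket data is made up, the helper names are hypothetical, and lift is included as one common way to check whether an association also reflects a positive correlation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    antecedent_count = sum(1 for t in transactions if antecedent <= t)
    if antecedent_count == 0:
        return 0.0
    both_count = sum(1 for t in transactions if both <= t)
    return both_count / antecedent_count

def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support; > 1 suggests positive correlation."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical market baskets.
baskets = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"diaper", "beer", "bread"}),
]

rule_support = support(baskets, {"diaper", "beer"})        # 3 of 5 baskets
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})  # 3 of 4 diaper baskets
rule_lift = lift(baskets, {"diaper"}, {"beer"})
print(rule_support, rule_confidence, rule_lift)
```

Scanning every transaction for every candidate rule does not scale to large datasets, which is why the question of efficient mining is raised separately.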
Data Mining Function: (3) Classification
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
■ Outlier analysis
■ Outlier: A data object that does not comply with the general
behavior of the data
■ Noise or exception? One person’s garbage could be another
person’s treasure
■ Methods: by-product of clustering or regression analysis, …
■ Useful in fraud detection, rare events analysis
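As an illustration of outlier detection as a by-product of regression analysis, the sketch below fits a least-squares line and flags points whose residual is much larger than the typical residual. The cutoff rule (residual greater than `k` times the median absolute residual), the default `k = 3.0`, and the data are assumptions chosen for the example, not a prescribed method.

```python
from statistics import median

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_outliers(xs, ys, k=3.0):
    """Indices of points whose absolute residual exceeds k * median residual."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    cutoff = k * median(residuals)
    return [i for i, r in enumerate(residuals) if r > cutoff]

# Hypothetical data: points on y = 2x + 1, with one value replaced.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40

flagged = regression_outliers(xs, ys)
print(flagged)
```

Whether a flagged point is noise to discard or an exception worth investigating (for example, a fraudulent transaction) depends on the application.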
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
■ Sequence, trend and evolution analysis
■ Trend, time-series, and deviation analysis: e.g.,
memory cards
■ Periodicity analysis
■ Similarity-based analysis
Structure and Network Analysis
■ Graph mining
■ Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
■ Information network analysis
■ Social networks: actors (objects, nodes) and relationships (edges)
■ e.g., author networks in CS, terrorist networks
family, classmates, …
■ Links carry a lot of semantic information: Link mining
■ Web mining
■ Web is a big information network: from PageRank to Google
■ Analysis of Web information networks
■ Web community discovery, opinion mining, usage mining, …
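To make the idea of ranking nodes in a Web information network concrete, here is a minimal sketch of PageRank computed by power iteration. The four-page link graph, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration. The sketch also assumes every link target appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
print(top_page, ranks)
```

In this graph, page C is linked from three other pages and ends up with the highest score, which shows how link structure alone carries ranking information.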
Evaluation of Knowledge
■ Is all mined knowledge interesting?
■ One can mine a tremendous amount of “patterns” and knowledge
■ Some may fit only certain dimension space (time, location, …)
■ Some may not be representative, may be transient, …
■ Evaluation of mined knowledge → directly mine only
interesting knowledge?
■ Descriptive vs. predictive
■ Coverage
■ Typicality vs. novelty
■ Accuracy
■ Timeliness
■ …
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Summary
■ Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
■ A natural evolution of database technology, in great demand, with
wide applications
■ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
■ Mining can be performed on a variety of data
■ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.