DM 1


Data Mining & Warehousing

Chapter 1. Introduction

Data Warehousing/Mining

• Motivation: Why data mining?
• What is data mining?
• Data mining: On what kind of data?
• Data mining functionality
• Are all the patterns interesting?
• Classification of data mining systems
• Major issues in data mining

• 1 zettabyte = 1 trillion gigabytes.
• 5,200 GB of data for every person on Earth.
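
The conversion above can be checked with a few lines of Python. This is only a sketch: it assumes decimal (SI) units, and the population figure is an illustrative assumption that does not come from the slides.

```python
# Decimal (SI) byte units; binary units (GiB, ZiB) would give different numbers.
GIGABYTE = 10**9    # bytes
ZETTABYTE = 10**21  # bytes

gb_per_zb = ZETTABYTE // GIGABYTE
print(gb_per_zb)  # 1000000000000, i.e. one trillion

# Assumption for illustration only: roughly 7.7 billion people on Earth.
ASSUMED_POPULATION = 7_700_000_000
GB_PER_PERSON = 5_200  # per-person figure quoted on the slide

# Total volume implied by the per-person figure, in zettabytes.
total_zb = GB_PER_PERSON * GIGABYTE * ASSUMED_POPULATION / ZETTABYTE
print(round(total_zb, 2))  # 40.04
```

Under that assumed population, 5,200 GB per person corresponds to roughly 40 zettabytes in total.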

Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
– Data collection and data availability: automated data collection tools, database systems, the Web, computerized society
– Major sources of abundant data: business (Web, e-commerce, transactions, stocks, …); science (remote sensing, bioinformatics, scientific simulation, …); society and everyone (news, digital cameras, YouTube, social media, mobile devices, …)
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
• Mine the knowledge from data
Example of Data Volumes

https://www.eecis.udel.edu/~amer/Table-Kilo-Mega-Giga---YottaBytes.html

Evolution of Database Technology

• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s to 2000s:
– Data mining and data warehousing, multimedia databases, and Web databases

Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW

What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
– (Deductive) query processing
– Expert systems or small machine learning/statistical programs
Data Mining: A Knowledge Discovery in Databases (KDD) Process
– Data mining: the core of the knowledge discovery process.
[Figure: the KDD process as a pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
7 Data Mining Steps
1. Data cleaning: remove noise and inconsistent data
2. Data integration: combine multiple sources
3. Data selection: retrieve from the database the data relevant to the analysis task
4. Data transformation: data are transformed or consolidated into forms appropriate for mining (e.g., by performing summary or aggregation operations)

7 Data Mining Steps (continued)

5. Data mining: intelligent methods are applied to extract data patterns
6. Pattern evaluation: identify truly interesting patterns representing knowledge, based on some interestingness measures
7. Knowledge presentation: present mined knowledge to the user
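
The seven steps can be sketched end to end on a toy data set. This is a minimal illustration rather than a prescribed implementation: the store records, function names, and the 0.5 support threshold are all hypothetical, and pair counting stands in for the "intelligent methods" of step 5.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources; None marks a noisy/missing value.
store_a = [("t1", "bread"), ("t1", "milk"), ("t2", "bread"), ("t2", None)]
store_b = [("t3", "bread"), ("t3", "milk"), ("t4", "beer"), ("t4", "milk")]

def clean(rows):
    """Step 1: drop records with missing item values."""
    return [(tid, item) for tid, item in rows if item is not None]

def integrate(*sources):
    """Step 2: combine multiple sources into one record list."""
    return [row for source in sources for row in source]

def select(rows, relevant_items):
    """Step 3: keep only records relevant to the analysis task."""
    return [(tid, item) for tid, item in rows if item in relevant_items]

def transform(rows):
    """Step 4: consolidate records into one item set per transaction."""
    baskets = {}
    for tid, item in rows:
        baskets.setdefault(tid, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support):
    """Step 6: keep pairs whose support meets the threshold."""
    return {pair: n / n_baskets for pair, n in pair_counts.items()
            if n / n_baskets >= min_support}

def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

rows = integrate(clean(store_a), clean(store_b))
baskets = transform(select(rows, {"bread", "milk", "beer"}))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
print("\n".join(report))  # bread + milk: support 0.50
```

Each function corresponds to one numbered step, so a step can be swapped out (for example, a different mining method in step 5) without touching the rest of the pipeline.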

Data Mining: Classification Schemes

• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
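
The descriptive/predictive distinction can be illustrated on a small data set. The sketch below uses made-up numbers and a hand-written least-squares line as a stand-in for a predictive model; neither comes from the slides.

```python
from statistics import mean

# Hypothetical monthly data: advertising spend vs. units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0]
units_sold = [12.0, 14.0, 16.0, 18.0]

# Descriptive: summarize what the existing data look like.
average_units = mean(units_sold)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Predictive: fit a model to the data and infer a value not yet observed.
slope, intercept = fit_line(ad_spend, units_sold)
predicted_units = slope * 5.0 + intercept

print(average_units, predicted_units)
```

The descriptive result characterizes the data already collected, while the predictive result is an inference about an unseen case (here, a spend of 5.0).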

Why Data Mining? Potential Applications
• Database analysis and decision support
– Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
– Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Fraud detection and management
• Other applications
– Text mining (newsgroups, email, documents) and Web analysis
– Intelligent query answering

Market Analysis and Management (1)

• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interests, income level, spending habits, etc.
• Determine customer purchasing patterns over time
– Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
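
Associations between product sales are commonly quantified with support and confidence. The sketch below computes both for one candidate rule; the baskets and the rule "bread implies milk" are invented for illustration.

```python
# Hypothetical market baskets, one set of products per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of all baskets that contain `itemset`."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Fraction of baskets with `antecedent` that also contain `consequent`."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)

rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)  # 0.6 0.75
```

A rule with high confidence can then drive prediction: in this toy data, a basket containing bread also contains milk three times out of four.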
Market Analysis and Management (2)

• Customer profiling
– Data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
– Identifying the best products for different customers
– Use prediction to find what factors will attract new customers
• Provides summary information
– Various multidimensional summary reports
– Statistical summary information (data central tendency and variation)
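
Central tendency and variation can be computed directly with Python's standard `statistics` module. The purchase amounts below are hypothetical.

```python
from statistics import mean, median, stdev

# Hypothetical purchase amounts for one customer segment.
purchases = [20.0, 25.0, 30.0, 35.0, 90.0]

# Central tendency: where the typical value lies.
central = {"mean": mean(purchases), "median": median(purchases)}

# Variation: how widely the values are spread.
spread = {"range": max(purchases) - min(purchases), "stdev": stdev(purchases)}

print(central)
print(spread)
```

The gap between the mean (40.0) and the median (30.0) shows how a single large purchase pulls the mean upward, which is why summary reports often give both.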
Corporate Analysis and Risk Management
• Finance planning and asset evaluation
– Cash flow analysis and prediction
– Contingent claim analysis to evaluate assets
– Cross-sectional and time-series analysis (financial ratio, trend analysis, etc.)
• Resource planning
– Summarize and compare the resources and spending
• Competition
– Monitor competitors and market directions
– Group customers into classes and apply a class-based pricing procedure
– Set pricing strategy in a highly competitive market

Fraud Detection and Management (1)

• Applications
– Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach
– Use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances
• Examples
– Auto insurance: detect groups of people who stage accidents to collect on insurance
– Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
– Medical insurance: detect professional patients, rings of doctors, and rings of references
Fraud Detection and Management (2)
• Detecting inappropriate medical treatment
– The Australian Health Insurance Commission found that in many cases blanket screening tests were requested (saving A$1 million per year).
• Detecting telephone fraud
– Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
– British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud.
• Retail
– Analysts estimate that 38% of retail shrink is due to dishonest employees.
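
One simple way to flag calls that "deviate from an expected norm" in the telephone call model is a z-score test against a subscriber's history. This is only a sketch of the idea: the call durations and the threshold of 3 standard deviations are assumptions, and a real call model would also use destination and time of day.

```python
from statistics import mean, stdev

# Hypothetical call durations (minutes) from one subscriber's history.
history = [3.0, 4.0, 5.0, 4.0, 3.0, 5.0, 4.0, 4.0]

def deviates(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) / sigma > threshold

new_calls = [4.5, 42.0]
flags = [deviates(duration, history) for duration in new_calls]
print(flags)  # [False, True]
```

The 42-minute call is flagged because it lies far outside the spread of past durations, while the 4.5-minute call is consistent with them.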
Other Applications

• Sports
– IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Major Issues in Data Mining (1)

• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, and incremental mining methods

Major Issues in Data Mining (2)

• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge: domain-specific data mining tools, intelligent query answering, process control and decision making
– Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
– Protection of data security, integrity, and privacy

Summary

• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Classification of data mining systems
• Major issues in data mining