DM 1
Chapter 1. Introduction
Data Warehousing/Mining
• 1 zettabyte = 1 trillion gigabytes.
• 5,200 GB of data for every person on Earth.
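The unit conversion above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) prefixes, where 1 GB is 10^9 bytes and 1 ZB is 10^21 bytes; the constant names are illustrative.

```python
# Assumption: decimal (SI) prefixes, not binary (GiB/ZiB) ones.
BYTES_PER_GB = 10**9    # giga = 10^9
BYTES_PER_ZB = 10**21   # zetta = 10^21

# Number of gigabytes in one zettabyte.
gb_per_zb = BYTES_PER_ZB // BYTES_PER_GB

print(f"1 ZB = {gb_per_zb:,} GB")
```

The result is 10^12, that is, one trillion gigabytes.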
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized
society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube, social media,
mobile devices, …
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of
massive data sets
• Mine the knowledge from data
Example of Data Volumes
https://www.eecis.udel.edu/~amer/Table-Kilo-Mega-Giga---YottaBytes.html
Evolution of Database Technology
1960s:
– Data collection, database creation, IMS and network DBMS
1970s:
– Relational data model, relational DBMS implementation
1980s:
– RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
1990s—2000s:
– Data mining and data warehousing, multimedia databases, and
Web databases
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
What Is Data Mining?
Data mining (knowledge discovery in databases):
(Figure: knowledge discovery process, with stages labeled Task-relevant Data, Data Cleaning, Data Integration, Databases)
7 Data Mining Steps
1. Data cleaning – remove noise and inconsistent data
2. Data integration – combine multiple sources
3. Data selection – retrieve from the database data relevant to the analysis task
4. Data transformation – data are transformed or consolidated into forms appropriate for mining (e.g., performing summary or aggregation operations)
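The four preparation steps above can be sketched as a small pipeline. This is an illustrative sketch only: the record layout, the two sources (`store_sales`, `web_sales`), and the function names are hypothetical, and "noise" is assumed here to mean a missing or negative amount.

```python
from collections import defaultdict

# Hypothetical raw records from two sources.
store_sales = [
    {"customer": "C1", "amount": 20.0},
    {"customer": "C2", "amount": None},   # missing value
    {"customer": "C1", "amount": 35.0},
    {"customer": "C3", "amount": -5.0},   # inconsistent: negative sale
]
web_sales = [
    {"customer": "C2", "amount": 15.0},
    {"customer": "C3", "amount": 40.0},
]

def clean(records):
    """Step 1: drop records with a missing or negative amount."""
    return [r for r in records
            if r["amount"] is not None and r["amount"] >= 0]

def integrate(*named_sources):
    """Step 2: combine several sources, tagging each record with its origin."""
    combined = []
    for name, records in named_sources:
        combined.extend({**r, "source": name} for r in records)
    return combined

def select(records, customers):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Step 4: aggregate into total spending per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

integrated = integrate(("store", clean(store_sales)),
                       ("web", clean(web_sales)))
relevant = select(integrated, customers={"C1", "C3"})
totals = transform(relevant)
print(totals)
```

The output of `transform` is the kind of consolidated table a mining algorithm would then consume.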
7 Data Mining Steps (continued)
Data Mining: Classification Schemes
General functionality
– Descriptive data mining
– Predictive data mining
Different views, different classifications
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Why Data Mining? Potential Applications
Database analysis and decision support
– Market analysis and management
Target marketing, customer relationship management, market basket
analysis, cross selling, market segmentation
– Risk analysis and management
Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and management
Other Applications
– Text mining (newsgroups, email, documents) and Web analysis
– Intelligent query answering
Market Analysis and Management (1)
Customer profiling
– data mining can tell you what types of customers buy what
products (clustering or classification)
Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new customers
Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency and
variation)
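As an illustration of the statistical summary mentioned above, the sketch below reports central tendency (mean, median) and variation (population standard deviation) along one dimension. The `sales` records, the `region` dimension, and the function name `summarize_by` are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical (region, purchase amount) records.
sales = [
    ("north", 10.0), ("north", 20.0),
    ("south", 20.0), ("south", 30.0), ("south", 40.0),
]

def summarize_by(records):
    """Group amounts by key and report central tendency and variation."""
    groups = defaultdict(list)
    for key, amount in records:
        groups[key].append(amount)
    return {
        key: {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "pstdev": statistics.pstdev(values),  # population std. deviation
        }
        for key, values in groups.items()
    }

summary = summarize_by(sales)
for region, stats in summary.items():
    print(region, stats)
```

A multidimensional report would repeat the same grouping over further dimensions such as product or time period.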
Corporate Analysis and Risk Management
Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning:
– summarize and compare the resources and spending
Competition:
– monitor competitors and market directions
– group customers into classes and set a class-based pricing
procedure
– set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
– use historical data to build models of fraudulent behavior and
use data mining to help identify similar instances
Examples
– auto insurance: detect a group of people who stage accidents to
collect on insurance
– money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of
doctors, and rings of references
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
– The Australian Health Insurance Commission found that in many
cases blanket screening tests were requested (saving A$1m/yr).
Detecting telephone fraud
– Telephone call model: destination of the call, duration, time of
day or week. Analyze patterns that deviate from an expected
norm.
– British Telecom identified discrete groups of callers with
frequent intra-group calls, especially on mobile phones, and broke
up a multimillion-dollar fraud.
Retail
– Analysts estimate that 38% of retail shrink is due to dishonest
employees.
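The telephone call model above flags patterns that deviate from an expected norm. The slides do not say how the deviation is measured; the sketch below assumes a simple z-score test on call duration against a caller's own history. The function name, the threshold of 3 standard deviations, and the sample durations are all hypothetical.

```python
import statistics

def flag_deviations(history, new_calls, threshold=3.0):
    """Return the new call durations that deviate from the caller's norm.

    A call is flagged when its z-score against the historical durations
    exceeds `threshold` (an assumed cut-off, not one from the slides).
    """
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)  # sample standard deviation
    flagged = []
    for duration in new_calls:
        if sigma == 0:
            # No variation in the history: any different value is unusual.
            is_unusual = duration != mean
        else:
            is_unusual = abs(duration - mean) / sigma > threshold
        if is_unusual:
            flagged.append(duration)
    return flagged

# Hypothetical call durations in minutes.
history = [3.0, 4.0, 5.0, 4.0, 4.0]
suspicious = flag_deviations(history, new_calls=[4.5, 45.0])
print(suspicious)
```

A fuller model would score destination and time of day or week as well, as the slide lists, rather than duration alone.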
Other Applications
Sports
– IBM Advanced Scout analyzed NBA game statistics (shots
blocked, assists, and fouls) to gain competitive advantage for
the New York Knicks and Miami Heat
Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with
the help of data mining
Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer preferences
and behavior, analyze the effectiveness of Web marketing,
improve Web site organization, etc.
Major Issues in Data Mining (1)
Major Issues in Data Mining (2)
Summary