DM 1

Data Mining & Warehousing
Chapter 1. Introduction
 Motivation: Why data mining?
 What is data mining?
 Data Mining: On what kind of data?
 Data mining functionality
 Are all the patterns interesting?
 Classification of data mining systems
 Major issues in data mining
• 1 zettabyte = 1 trillion gigabytes.
• 5,200 GB of data for every person on Earth.
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube, social media, mobile devices, …
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
• Mine the knowledge from data
Example of Data Volumes
https://www.eecis.udel.edu/~amer/Table-Kilo-Mega-Giga---YottaBytes.html
Evolution of Database Technology
 1960s:
– Data collection, database creation, IMS and network DBMS
 1970s:
– Relational data model, relational DBMS implementation
 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s-2000s:
– Data mining and data warehousing, multimedia databases, and Web databases
Data Mining: On What Kind of Data?
 Relational databases
 Data warehouses
 Transactional databases
 Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
What Is Data Mining?
 Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
 Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
 What is not data mining?
– (Deductive) query processing
– Expert systems or small machine learning/statistical programs
Data Mining: A Knowledge Discovery in Databases (KDD) Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
7 Data Mining Steps
 1. Data cleaning – remove noise and inconsistent data
 2. Data integration – combine multiple sources
 3. Data selection – retrieve from the database data relevant to the analysis task
 4. Data transformation – data are transformed or consolidated into forms appropriate for mining (e.g. performing summary or aggregation operations)
 5. Data mining – intelligent methods are applied to extract data patterns
 6. Pattern evaluation – identify truly interesting patterns representing knowledge based on some interestingness measures
 7. Knowledge presentation – present mined knowledge to the user
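The seven steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two toy record lists (store_a, store_b), the function names, and the choice of item-pair counting as the "intelligent method" with a minimum-support threshold as the interestingness measure are all hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: two sources of (customer_id, item) purchase records.
store_a = [(1, "bread"), (1, "milk"), (2, "bread"), (2, None), (3, "milk")]
store_b = [(3, "bread"), (4, "Milk "), (4, "bread"), (5, "eggs")]


def clean(records):
    """Step 1: drop records with a missing item and normalize item names."""
    return [(cid, item.strip().lower()) for cid, item in records if item]


def integrate(*sources):
    """Step 2: combine multiple sources into one record list."""
    return [rec for src in sources for rec in src]


def select(records, items_of_interest):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r[1] in items_of_interest]


def transform(records):
    """Step 4: consolidate records into one item set (basket) per customer."""
    baskets = {}
    for cid, item in records:
        baskets.setdefault(cid, set()).add(item)
    return baskets


def mine(baskets):
    """Step 5: count how often each pair of items occurs in the same basket."""
    pairs = Counter()
    for items in baskets.values():
        pairs.update(combinations(sorted(items), 2))
    return pairs


def evaluate(pairs, n_baskets, min_support):
    """Step 6: keep pairs whose support reaches the assumed threshold."""
    return {p: c / n_baskets for p, c in pairs.items() if c / n_baskets >= min_support}


def present(patterns):
    """Step 7: render the mined patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]


def run_pipeline():
    records = integrate(clean(store_a), clean(store_b))
    records = select(records, {"bread", "milk"})
    baskets = transform(records)
    patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
    return present(patterns)


if __name__ == "__main__":
    print("\n".join(run_pipeline()))
```

In this sketch each source is cleaned before the sources are combined. In practice the steps are often iterated rather than run once in a fixed order.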
Data Mining: Classification Schemes
 General functionality
– Descriptive data mining
– Predictive data mining
 Different views, different classifications
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Why Data Mining? Potential Applications
 Database analysis and decision support
– Market analysis and management
 Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
– Risk analysis and management
 Forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Fraud detection and management
 Other Applications
– Text mining (newsgroups, email, documents) and Web analysis
– Intelligent query answering
Market Analysis and Management (1)
 Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
 Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time
– Conversion of a single bank account to a joint account: marriage, etc.
 Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
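The cross-market analysis bullets can be illustrated with the two standard association measures, support and confidence. The following is a minimal sketch, assuming a hypothetical list of baskets; the function names and data are illustrative only.

```python
# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


if __name__ == "__main__":
    print(support({"bread", "milk"}, transactions))
    print(confidence({"bread"}, {"milk"}, transactions))
```

A rule such as "bread implies milk" with high confidence is the kind of association information a prediction could be based on: a customer who buys bread is predicted to be likely to buy milk as well.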
Market Analysis and Management (2)
 Customer profiling
– Data mining can tell you what types of customers buy what products (clustering or classification)
 Identifying customer requirements
– Identify the best products for different customers
– Use prediction to find what factors will attract new customers
 Providing summary information
– Various multidimensional summary reports
– Statistical summary information (data central tendency and variation)
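As a small illustration of statistical summary information, the sketch below computes measures of central tendency (mean, median) and variation (sample standard deviation, range) with Python's standard statistics module. The spending figures are hypothetical.

```python
import statistics

# Hypothetical monthly spending (in dollars) for six customers.
spending = [120.0, 80.0, 100.0, 100.0, 150.0, 50.0]


def summarize(values):
    """Return basic central-tendency and variation measures for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }


if __name__ == "__main__":
    for name, value in summarize(spending).items():
        print(f"{name}: {value:.2f}")
```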
Corporate Analysis and Risk Management
 Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
 Resource planning:
– summarize and compare the resources and spending
 Competition:
– monitor competitors and market directions
– group customers into classes and apply a class-based pricing procedure
– set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
 Applications
– widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
 Approach
– use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
 Examples
– auto insurance: detect a group of people who stage accidents to collect on insurance
– money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients and rings of doctors and rings of references
Fraud Detection and Management (2)
 Detecting inappropriate medical treatment
– The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
 Detecting telephone fraud
– Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
– British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
 Retail
– Analysts estimate that 38% of retail shrink is due to dishonest employees.
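The telephone call model above looks for patterns that deviate from an expected norm. One simple way to sketch that idea, which is an assumption here and not a method described in the slides, is to flag calls whose duration lies more than a chosen number of standard deviations from the mean of a subscriber's history. The data, function name, and threshold are hypothetical.

```python
import statistics

# Hypothetical past call durations (minutes) for one subscriber.
history = [3.0, 4.0, 5.0, 4.0, 3.0, 5.0, 4.0]


def deviating_calls(history, new_calls, threshold=3.0):
    """Return the new call durations that lie more than `threshold`
    sample standard deviations away from the mean of `history`.

    `history` needs at least two values so the standard deviation is defined.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [d for d in new_calls if abs(d - mean) > threshold * stdev]


if __name__ == "__main__":
    print(deviating_calls(history, [4.5, 60.0, 2.0]))
```

A real call model would combine several attributes (destination, duration, time of day or week) rather than duration alone; this sketch only shows the deviation test on one attribute.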
Other Applications
 Sports
– IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
 Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
 Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Major Issues in Data Mining (1)
 Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
 Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)
 Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global information systems (WWW)
 Issues related to applications and social impacts
– Application of discovered knowledge
 Domain-specific data mining tools
 Intelligent query answering
 Process control and decision making
– Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
– Protection of data security, integrity, and privacy
Summary
 Data mining: discovering interesting patterns from large amounts of data
 A natural evolution of database technology, in great demand, with wide applications
 A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
 Mining can be performed in a variety of information repositories
 Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
 Classification of data mining systems
 Major issues in data mining
