01 Intro To Data Mining


Adapted from Dr. Saed Sayad's lectures


▪Why Data Mining?
▪What is Data Mining?
▪Data Mining Applications
▪Data Mining Tasks
▪Data Mining Steps
The Explosive Growth of Data: from terabytes to petabytes
Commercial viewpoint: vast amounts of data are collected daily
➢ Web data
➢ E-commerce
➢ Purchases at department/grocery stores
➢ Bank/credit card transactions

Scientific viewpoint: vast amounts of data are collected daily
➢ Remote sensors on a satellite
➢ Telescopes scanning the skies
➢ Microarrays generating gene expression data
➢ Scientific simulations generating terabytes of data
▪ Information retrieval (databases) is simply not enough anymore for decision making.
▪ We are drowning in data, but starving for knowledge!

▪ Data mining is the automated analysis of massive data sets, also known as Knowledge Discovery in Databases (KDD).
▪ It is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
▪ A multi-disciplinary field that combines Statistics, AI & Machine Learning, and Databases & Data Warehousing.
Data mining is the process of discovering interesting
patterns and knowledge from large amounts of data.
The data sources can include databases, data warehouses,
the Web, other information repositories, or data that are
streamed into the system dynamically.
▪ Improving health care and reducing costs
▪ Predicting the impact of climate change
▪ Finding alternative/green energy sources
▪ Reducing hunger and poverty by increasing agriculture production

Prediction Methods
✓ Use some variables to predict unknown or future values of other variables.
Description Methods
✓ Find human-interpretable patterns that describe the data.
Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Classification: find a model for the class attribute as a function of the values of the other attributes.
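As a first step toward such a model, the table above can be summarized per attribute value. The sketch below is illustrative only: the helper name `cheat_rate_by` and the tuple layout are assumptions, while the rows are copied from the table.

```python
from collections import defaultdict

# Rows copied from the table above:
# (Tid, Refund, Marital Status, Taxable Income in thousands, Cheat)
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
    (11, "No", "Married", 60, "No"),
    (12, "Yes", "Divorced", 220, "No"),
    (13, "No", "Single", 85, "Yes"),
    (14, "No", "Married", 75, "No"),
    (15, "No", "Single", 90, "Yes"),
]


def cheat_rate_by(rows, column):
    """Return {attribute value: (cheat count, total count)} for one column index."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        counts[row[column]][1] += 1      # every row counts toward the total
        if row[4] == "Yes":              # column 4 is the class attribute (Cheat)
            counts[row[column]][0] += 1
    return {value: tuple(pair) for value, pair in counts.items()}


by_refund = cheat_rate_by(records, 1)
by_marital = cheat_rate_by(records, 2)
for name, table in (("Refund", by_refund), ("Marital Status", by_marital)):
    for value, (cheats, total) in table.items():
        print(f"{name}={value}: {cheats}/{total} labeled Cheat=Yes")
```

Attribute values whose rows almost all share one class label are natural candidates for the tests a classifier would learn.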
Model for predicting credit worthiness

Training Set (used to learn the classifier):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Learned model (decision tree):

Employed?
  No  → Credit Worthy = No
  Yes → Education?
          Graduate → Number of years: > 3 yr → Yes, < 3 yr → No
          { High school, Undergrad } → Number of years: > 7 yrs → Yes, < 7 yrs → No

Test Set (class to be predicted by the model):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …
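The decision tree in the credit worthiness figure can be written out as a small function. This is a sketch of one reading of the figure (Employed, then Education, then number of years at the present address), not a learned model; the function name `credit_worthy` is hypothetical, and the handling of values exactly equal to a threshold is an assumption because the figure only shows "> 3 yr / < 3 yr" and "> 7 yrs / < 7 yrs".

```python
def credit_worthy(employed, education, years):
    """Apply the decision tree read off the figure.

    Assumption: a value exactly equal to a threshold (3 or 7 years) is
    treated as "not greater"; the figure does not say which branch it takes.
    """
    if employed != "Yes":
        return "No"
    # Graduates use the 3-year threshold; High School / Undergrad use 7 years.
    threshold = 3 if education == "Graduate" else 7
    return "Yes" if years > threshold else "No"


# Training rows from the figure: (Employed, Education, Years, Credit Worthy)
training_set = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

# Test rows from the figure; the class is unknown ("?") in the source.
test_set = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]

# Check that the tree reproduces every training label, then predict the test rows.
training_matches = [credit_worthy(e, edu, y) == label for e, edu, y, label in training_set]
predictions = [credit_worthy(e, edu, y) for e, edu, y in test_set]
print(training_matches, predictions)
```

The first test row sits exactly on the 7-year boundary, so its prediction depends entirely on the boundary assumption stated in the docstring.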

• Classifying credit card transactions as legitimate or fraudulent
• Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in the cyberspace
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
GOAL:
Predict fraudulent cases in credit card transactions.
APPROACH:
▪ Use credit card transactions and information about the account holder as attributes.
   ➢ When does a customer buy, what do they buy, how often do they pay on time, etc.
▪ Label past transactions as fraudulent or fair. This forms the class attribute.
▪ Regression: predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
▪ Examples:
✓ Predicting sales amounts of a new product based on advertising expenditure.
✓ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
✓ Time series prediction of stock market indices.
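For the linear case, a minimal sketch is an ordinary least-squares fit of one variable on another. The advertising and sales figures below are made-up illustrative numbers, not data from the lecture, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales amount (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for an unseen expenditure
print(slope, intercept, predicted_sales)
```

A nonlinear dependency would need a different model family; this sketch only covers the straight-line assumption.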
▪ Clustering: finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
▪ Intra-cluster distances are minimized; inter-cluster distances are maximized.
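One common way to pursue small intra-cluster distances is k-means, which the lecture does not name; it is used here only as an illustrative technique. The points and starting centers are made up, and `kmeans` is a hypothetical helper.

```python
import math


def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centers = kmeans(points, centers=[points[0], points[3]])

# Compare the largest point-to-own-center distance with the center-to-center distance.
max_intra = max(math.dist(p, centers[label]) for p, label in zip(points, labels))
inter = math.dist(centers[0], centers[1])
print(labels, max_intra, inter)
```

On this toy data the distance between the two centers is much larger than any distance from a point to its own center, which is the property the bullets above describe.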
▪ Given a set of records, each of which contains some number of items from a given collection
▪ Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
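The two discovered rules can be checked against the five transactions with the standard support and confidence measures. These measures are not named in the lecture and are introduced here as an assumption about how rule strength is usually quantified; the helper names are hypothetical.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)


def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs | rhs) / len(transactions)


def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```

Here {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 that contain both Diaper and Milk.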
▪ Market-basket analysis
   ➢ Rules are used for sales promotion, shelf management, and inventory management
▪ Telecommunication alarm diagnosis
   ➢ Rules are used to find combinations of alarms that occur together frequently in the same time period
▪ Medical Informatics
   ➢ Rules are used to find combinations of patient symptoms and test results associated with certain diseases
▪ Anomaly detection: detect significant deviations from normal behavior
▪ Applications:
   ➢ Credit card fraud detection
   ➢ Network intrusion detection
   ➢ Identifying anomalous behavior from sensor networks for monitoring and surveillance
   ➢ Detecting changes in the global forest cover
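A minimal sketch of deviation detection is a z-score test: flag values that lie far from the mean in units of standard deviation. The lecture does not prescribe this method; the sensor readings, the threshold of 2.0, and the helper name `zscore_outliers` are all illustrative assumptions.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return indices of values whose absolute z-score exceeds threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


# Hypothetical sensor readings with one obvious deviation at the end.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 15.0]
outliers = zscore_outliers(readings)
print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, this simple test can miss anomalies in small or heavily contaminated samples.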
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment
▪ Understanding the project objectives and requirements from a business perspective and then converting this knowledge into a data mining problem definition with a preliminary plan designed to achieve the objectives.
▪ A successful data mining project starts from a well-defined question or need.

Data Preparation

Data preparation involves data, datasets, databases, and ETL (Extraction, Transformation & Loading).
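The three ETL stages can be sketched with the Python standard library: extract rows from a CSV source, transform them into typed values, and load them into a database table. The CSV text, column names, and table name below are hypothetical; the three sample rows reuse values from the tax table earlier in the lecture.

```python
import csv
import io
import sqlite3

# Hypothetical raw source; in practice this would be a file or an export.
raw_csv = """tid,refund,marital_status,taxable_income
1,Yes,Single,125K
2,No,Married,100K
3,No,Single,70K
"""


def extract(text):
    """Extraction: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: convert strings such as 'Yes' and '125K' into typed values."""
    return [
        (
            int(row["tid"]),
            int(row["refund"] == "Yes"),                       # Yes/No -> 1/0
            row["marital_status"],
            int(row["taxable_income"].rstrip("K")) * 1000,     # '125K' -> 125000
        )
        for row in rows
    ]


def load(rows, connection):
    """Loading: write the cleaned rows into a database table."""
    connection.execute(
        "CREATE TABLE taxpayers ("
        "tid INTEGER PRIMARY KEY, refund INTEGER, "
        "marital_status TEXT, taxable_income INTEGER)"
    )
    connection.executemany("INSERT INTO taxpayers VALUES (?, ?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")  # in-memory database for the sketch
load(transform(extract(raw_csv)), connection)
total_income = connection.execute("SELECT SUM(taxable_income) FROM taxpayers").fetchone()[0]
print(total_income)
```

Keeping the three stages as separate functions makes it easy to swap the source or the target without touching the cleaning logic.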
