Lecture 1


Data Mining

Introduction
Recommended Books
• Data Mining: Concepts and Techniques, Second Edition, by Jiawei Han and Micheline Kamber
• Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham and S. Sridhar
• Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten and Eibe Frank
Introduction
Information Growing

Define data mining

Data mining vs. databases

Basic data mining tasks

Data mining development

Data mining issues


Information Growing
Data is growing at an exceptional rate

Users expect more sophisticated information

How?

“UNCOVER HIDDEN INFORMATION”


DATA MINING
So, what is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
  • Object is also known as record, point, case, sample, entity, or instance

Attributes (columns) and objects (rows):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Size: Number of objects
Dimensionality: Number of attributes
Sparsity: Number of populated object-attribute pairs
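
As a rough illustration of size, dimensionality and sparsity, the sketch below stores the ten records from the table as a list of dictionaries and counts each quantity. The helper name `describe` is hypothetical, and two choices are assumptions rather than part of the slide: Tid is counted as an attribute, and a value of `None` (or a missing key) is treated as "not populated".

```python
# Each dict is one object (row); each key is one attribute (column).
# Taxable income is stored in thousands (125 means 125K).
records = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": "Single", "Taxable Income": 125, "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital Status": "Married", "Taxable Income": 100, "Cheat": "No"},
    {"Tid": 3, "Refund": "No", "Marital Status": "Single", "Taxable Income": 70, "Cheat": "No"},
    {"Tid": 4, "Refund": "Yes", "Marital Status": "Married", "Taxable Income": 120, "Cheat": "No"},
    {"Tid": 5, "Refund": "No", "Marital Status": "Divorced", "Taxable Income": 95, "Cheat": "Yes"},
    {"Tid": 6, "Refund": "No", "Marital Status": "Married", "Taxable Income": 60, "Cheat": "No"},
    {"Tid": 7, "Refund": "Yes", "Marital Status": "Divorced", "Taxable Income": 220, "Cheat": "No"},
    {"Tid": 8, "Refund": "No", "Marital Status": "Single", "Taxable Income": 85, "Cheat": "Yes"},
    {"Tid": 9, "Refund": "No", "Marital Status": "Married", "Taxable Income": 75, "Cheat": "No"},
    {"Tid": 10, "Refund": "No", "Marital Status": "Single", "Taxable Income": 90, "Cheat": "Yes"},
]


def describe(dataset):
    """Return (size, dimensionality, populated_pairs) for a list of dict records."""
    size = len(dataset)
    # Every attribute name that appears in at least one record.
    attributes = {name for row in dataset for name in row}
    dimensionality = len(attributes)
    # Assumption: None or an absent key means the pair is not populated.
    populated = sum(
        1 for row in dataset for name in attributes if row.get(name) is not None
    )
    return size, dimensionality, populated


print(describe(records))  # (10, 5, 50)
```

Every object-attribute pair in this small table is populated, so the count equals size times dimensionality; real data sets usually have gaps.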
Types of Attributes
• There are different types of attributes
Categorical
• Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad), height in {tall, medium, short}
• Nominal (no order or comparison) vs Ordinal (ordered, but differences between values are not meaningful)
Numeric
• Examples: dates, temperature, time, length, value, count
• Discrete (counts) vs Continuous (temperature)
• Special case: Binary attributes (yes/no, exists/not exists)
Motivation
“Necessity is the Mother of Invention”

Data Explosion Problem


Automated data collection tools (e.g. web, sensor networks) and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.

Enterprises are currently facing a data explosion problem.


Motivation Cont…
Electronic Information: an Important Asset for Business Decisions
With the growth of electronic information, enterprises began to realize that the accumulated information can be an important asset in their business decisions.
There is potential business intelligence hidden in the large volume of data.
This intelligence can be the secret weapon on which the success of a business may depend.
Trends leading to Data Flood
More data is generated
Bank, telecom, and other business transactions
Scientific data: astronomy, biology, etc.
Web, text, and e-commerce
AT&T handles billions of calls per day
So much data that it cannot all be stored; analysis has to be done “on the fly”, on streaming data
Max Planck Inst. for Meteorology, 222 TB
Yahoo ~ 100 TB
Google (huge amount of data)
Extracting Business Intelligence (Solutions)
It is not a simple matter to discover business intelligence from a mountain of accumulated data.
What is required are techniques that allow the enterprise to extract the most valuable information.
The field of data mining provides such techniques.
These techniques can find novel (previously unknown) patterns that may assist an enterprise in understanding the business better and in forecasting.
Data Mining Definition
Data mining can be defined as finding the hidden information in a database
Fit data to a model
It is the process of discovering new patterns from large
data sets involving methods at the intersection of
artificial intelligence, machine learning, statistics and
database systems
The goal of data mining is to extract knowledge from a
data set in a human-understandable structure
Data mining is sorting through data to identify patterns
and establish relationships.
Data Mining Definition Cont…
Data mining is the process of analyzing data from different angles and perspectives and summarizing it into relevant information.
What can be the purpose of data mining?
This kind of information could be used to increase revenue, reduce costs, or both.
Data Mining Cont…
Data Mining Algorithm Characterization
Model
The purpose of the algorithm is to fit a model to the data
Preference
Some criteria must be used to prefer one model over another
Search
All algorithms require some technique to search the data
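
The three parts of this characterization can be sketched on a toy problem. In the example below the model is a line through the origin, the preference criterion is total squared error, and the search is a plain grid over candidate slopes. The data points, the function names and the grid are all assumptions made for illustration; real data mining algorithms use far richer models and smarter searches.

```python
# Hypothetical (x, y) observations, invented for illustration.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def predict(slope, x):
    """Model: a line through the origin, y = slope * x."""
    return slope * x


def squared_error(slope, data):
    """Preference: a lower total squared error means a better model."""
    return sum((y - predict(slope, x)) ** 2 for x, y in data)


def grid_search(data, candidates):
    """Search: try every candidate slope and keep the preferred one."""
    return min(candidates, key=lambda slope: squared_error(slope, data))


candidates = [i / 10 for i in range(0, 51)]  # slopes 0.0, 0.1, ..., 5.0
best_slope = grid_search(points, candidates)
print(best_slope)  # 2.0
```

Swapping any one part (a different model family, a different error measure, a different search strategy) gives a different algorithm while the overall structure stays the same.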
Data Mining Cont….
Query: (a question; an inquiry)
Queries are the primary mechanism for retrieving
information from a database and consist of questions
presented to the database in a predefined format. Many
database management systems use the Structured
Query Language (SQL) standard query format.
In data mining, the query might not be well defined or precisely stated. The data miner might not even be exactly sure of what they want to see.
Data Mining Cont….
Data:
The data accessed is usually a different version from
that of the original operational database. The data has
been cleansed and modified to better support the
mining process.
Output:
The output of the data mining query probably is not a
subset of the database. Instead it is the output of some
analysis of the contents of the database.
Database Processing vs. Data Mining Processing
Database Query

Well defined

SQL

Data Mining Query

Poorly defined

No precise query language


Database Processing vs. Data Mining Processing Cont…
Database Data
• Operational data (ready data, prepared data)
Data Mining Data
• Not operational data

Database Output
• Precise
• Subset of database
Data Mining Output
• Fuzzy
• Not a subset of database
Query Examples
Database
• Find all credit applicants with last name of Smith.
• Identify customers who have purchased more than $10,000 in the last month.
Data Mining
• Find all credit applicants who are poor credit risks. (Classification)
• Identify customers with similar buying habits. (Clustering)
• Find all items which are frequently purchased with milk. (Association rules)
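
To make the contrast concrete, the sketch below runs both kinds of query over a handful of invented market-basket transactions. `contains` behaves like a database query: it returns an exact subset of the stored data. `frequently_bought_with` behaves like the milk example: its output is a computed statistic (rule confidence), not a subset of the data. The transactions, the function names and the 0.5 threshold are assumptions made for illustration.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "cereal"},
    {"bread", "eggs"},
    {"milk", "bread", "cereal"},
]


def contains(item, baskets):
    """Database-style query: return exactly the baskets that contain `item`."""
    return [basket for basket in baskets if item in basket]


def frequently_bought_with(item, baskets, min_confidence):
    """Mining-style query: items whose confidence given `item` meets a threshold.

    confidence(item -> other) = count(item and other) / count(item)
    """
    with_item = contains(item, baskets)
    if not with_item:
        return {}
    others = set().union(*with_item) - {item}
    result = {}
    for other in sorted(others):
        confidence = sum(1 for basket in with_item if other in basket) / len(with_item)
        if confidence >= min_confidence:
            result[other] = confidence
    return result


print(len(contains("milk", transactions)))  # 4
print(frequently_bought_with("milk", transactions, min_confidence=0.5))
# {'bread': 0.75, 'cereal': 0.5}
```

The threshold is what makes the mining query "poorly defined": how frequent counts as frequent is a judgment call, whereas the database query has one correct answer.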
Why Data Mining
Credit ratings/targeted marketing:
• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
• Identify likely responders to sales promotions
Fraud detection:
• Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
• Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data mining helps extract such information
Data mining
Process of semi-automatically analyzing large
databases to find patterns that are:
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to interpret the
pattern
Also known as Knowledge Discovery in Databases
(KDD)
Applications
Banking: loan/credit card approval
• predict good customers based on old customers
Customer relationship management:
• identify those who are likely to leave for a competitor
Targeted marketing:
• identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
• from an online stream of events, identify the fraudulent ones
Manufacturing and production:
• automatically adjust knobs when process parameters change
Relationship with other fields
Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
scalability in the number of features and instances
algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
automation for handling large, heterogeneous data
Data Mining Models and Tasks
Data Mining vs Expert Systems
Expert Systems = Rule-Driven Deduction
Top-down: From known rules (expertise) and data to
decisions.
Data Mining vs Expert Systems Cont…
Data Mining = Data-Driven Induction
Bottom-up: From data about past decisions to
discovered rules (general rules induced from the data).
Difference b/w Machine Learning and Data Mining
Machine learning techniques are designed to deal with a limited amount of data, typically from artificial intelligence applications, whereas data mining techniques deal with large amounts of data held in databases.
Data Mining (Knowledge Discovery in Databases)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Difference b/w Machine Learning and Data Mining
What is not Data Mining?
(Deductive) query processing
Expert systems or small ML/statistical programs
Data Mining (Example)
Random Guessing vs. Potential Knowledge
Suppose we have to Forecast the Probability of Rain in
Islamabad city for any particular day
Without any Prior Knowledge the probability of rain
would be 50% (pure random guess).
If we had a lot of weather data, then we can extract
potential rules using Data Mining which can then
forecast the chance of rain better than random guessing.
Data Mining (Example) Cont…
Given Islamabad weather data, a rule such as the following can be extracted:
if [Temperature = ‘hot’ and Humidity = ‘high’] then there is a 66.6% chance of rain
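
A figure like this is the confidence of a rule: among the days that match the condition, the fraction on which it rained. The sketch below computes it for a small set of invented observations, chosen so that two of the three hot, humid days are rainy (2/3, presumably the fraction behind the slide's 66.6%). The data and the function name `rule_confidence` are assumptions made for illustration, not the actual Islamabad weather records.

```python
# Hypothetical daily observations (temperature, humidity, rained), invented
# so that the rule's confidence comes out as 2/3.
days = [
    ("hot", "high", True),
    ("hot", "high", True),
    ("hot", "high", False),
    ("mild", "high", False),
    ("cool", "normal", False),
    ("hot", "normal", False),
]


def rule_confidence(observations, temperature, humidity):
    """Fraction of days matching the condition on which it rained."""
    matching = [
        rained
        for t, h, rained in observations
        if t == temperature and h == humidity
    ]
    if not matching:
        return None  # the rule's condition never occurs in the data
    return sum(matching) / len(matching)


confidence = rule_confidence(days, "hot", "high")
print(f"{confidence:.1%}")  # 66.7%
```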
Basic Data Mining Tasks
Classification maps data into predefined
groups or classes
Supervised learning
Pattern recognition
Prediction
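
As one possible illustration of supervised classification, the sketch below assigns a new credit applicant to a predefined class ("good" or "poor") by copying the label of the nearest labeled example. This 1-nearest-neighbor rule is only one of many classification techniques and is not prescribed by the lecture; the training data and the function name `classify` are invented for illustration.

```python
import math

# Hypothetical labeled training data:
# (income in $1000s, debt in $1000s) -> predefined class.
training = [
    ((125, 10), "good"),
    ((100, 20), "good"),
    ((60, 45), "poor"),
    ((70, 50), "poor"),
]


def classify(point, labeled_examples):
    """Give `point` the class of its nearest labeled example (1-nearest-neighbor)."""
    _, nearest_label = min(
        labeled_examples,
        key=lambda example: math.dist(point, example[0]),
    )
    return nearest_label


print(classify((110, 15), training))  # good
print(classify((65, 40), training))   # poor
```

The learning is supervised because every training example carries a known class label that the classifier relies on.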
Basic Data Mining Tasks Cont…
Regression is used to map a data item to a real-valued prediction variable.
Clustering groups similar data together into
clusters.
Unsupervised learning
Segmentation
Partitioning
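
Clustering can be sketched with a minimal one-dimensional k-means: each value joins its nearest center, then each center moves to the mean of its cluster, and the two steps repeat. k-means is just one common clustering technique, chosen here for brevity; the spend figures, the starting centers and the function name `k_means_1d` are assumptions made for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around `centers`, moving each center to its cluster mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: keep the old center if a cluster ends up empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer, invented for illustration.
spend = [12, 15, 14, 90, 95, 100, 13, 92]
centers, clusters = k_means_1d(spend, centers=[0, 50])
print(clusters)  # [[12, 15, 14, 13], [90, 95, 100, 92]]
print(centers)   # [13.5, 94.25]
```

No class labels are supplied; the two customer segments emerge from the data alone, which is what makes the task unsupervised.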
Basic Data Mining Tasks (Cont…)
Summarization maps data into subsets with
associated simple descriptions.
Characterization (find some characteristics)
Generalization (generalize based on features)
Link Analysis uncovers relationships among data.
Affinity Analysis (or similarity analysis)
Association Rules (or relationship rules)
Sequential Analysis determines sequential patterns.
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD):
It is the process of finding useful information and
patterns in data.

Data Mining:
Use of algorithms to extract the information and
patterns derived by the KDD process.
