Data Mining Concepts and Applications: Six Factors Behind The Sudden Rise in Popularity of Data Mining

Data Mining Concepts and Applications

 Six factors behind the sudden rise in popularity of data mining
1. General recognition of the untapped value in large databases;
2. Consolidation of database records tending toward a single customer view;
3. Consolidation of databases, including the concept of an information warehouse;
4. Reduction in the cost of data storage and processing, providing for the ability to collect and accumulate data;
5. Intense competition for a customer’s attention in an increasingly saturated marketplace; and
6. The movement toward the de-massification of business practices.

 Data mining (DM)
A process that uses statistical, mathematical,
artificial intelligence and machine-learning
techniques to extract and identify useful information
and subsequent knowledge from large databases

 Major characteristics and objectives of data mining
 Data is often buried deep within very large databases,
which sometimes contain data from several years;
sometimes the data has been cleansed and consolidated
in a data warehouse
 The data mining environment is usually client/server
architecture or a Web-based architecture
 Sophisticated new tools help to remove the information
ore buried in corporate files or archival public records;
finding it involves massaging and synchronizing the
data to get the right results.
 The miner is often an end user, empowered by data drills
and other power query tools to ask ad hoc questions and
obtain answers quickly, with little or no programming skill
 Striking it rich often involves finding an unexpected
result and requires end users to think creatively
 Data mining tools are readily combined with
spreadsheets and other software development tools; the
mined data can be analyzed and processed quickly and
easily
 Parallel processing is sometimes used because of the
large amounts of data and massive search efforts

 How data mining works
 Data mining tools find patterns in data and may even
infer rules from them
 Three methods are used to identify patterns in data:
1. Simple models
2. Intermediate models
3. Complex models

 Classification
Supervised induction used to analyze the historical
data stored in a database and to automatically
generate a model that can predict future behavior
 Common tools used for classification are:
 Neural networks
 Decision trees
 If-then-else rules
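
As a minimal sketch of supervised induction, the Python below learns a small set of if-then rules from historical records. It uses a OneR-style approach (one attribute, majority class per value), which is swapped in here for illustration; the slides do not name a specific algorithm. The `history` records, field names, and function names are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(records, target):
    """Pick the single attribute whose value -> majority-class rules
    make the fewest errors on the historical (training) records."""
    best = None
    attributes = [k for k in records[0] if k != target]
    for attr in attributes:
        counts = defaultdict(Counter)
        for row in records:
            counts[row[attr]][row[target]] += 1
        # One if-then rule per attribute value: IF attr == value THEN class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in records if rules[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

def predict(model, row, default=None):
    """Apply the learned rules to a new record."""
    attr, rules, _ = model
    return rules.get(row[attr], default)

# Hypothetical historical data: did a customer respond to an offer?
history = [
    {"income": "high", "owns_home": "yes", "responded": "yes"},
    {"income": "high", "owns_home": "no", "responded": "yes"},
    {"income": "low", "owns_home": "yes", "responded": "no"},
    {"income": "low", "owns_home": "no", "responded": "no"},
    {"income": "high", "owns_home": "yes", "responded": "yes"},
]
model = one_r(history, "responded")
print("Rule attribute:", model[0], "rules:", model[1])
print("Prediction:", predict(model, {"income": "low", "owns_home": "yes"}))
```

The learned model is just a dictionary of rules, so it can be read directly as "IF income is high THEN responded = yes".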

 Clustering
Partitioning a database into segments in which the
members of a segment share similar qualities
 Association
A category of data mining algorithm that
establishes relationships among items that occur
together in a given record
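
One common way to quantify such relationships is with support and confidence for item pairs. The sketch below is an illustration under that assumption, not a method named in the slides; the `baskets` data and the `min_support` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return (lhs, rhs, support, confidence) for item pairs that occur
    together in at least `min_support` of the records."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of records containing both items
        if support < min_support:
            continue
        # Confidence of lhs -> rhs: share of lhs records that also hold rhs.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical market-basket records.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```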

 Sequence discovery
The identification of associations over time

 Visualization can be used in conjunction with data
mining to gain a clearer understanding of many
underlying relationships

 Regression is a well-known statistical technique
that is used to map data to a prediction value

 Forecasting estimates future values based on
patterns within large sets of data
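
To make the two ideas concrete, the sketch below fits a straight line by ordinary least squares and uses it to forecast the next value. It assumes a simple linear trend; the `months` and `sales` figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical monthly sales history.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted value for month 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast:.2f}")
```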

 Hypothesis-driven data mining
Begins with a proposition by the user, who then
seeks to validate the truthfulness of the proposition
 Discovery-driven data mining
Finds patterns, associations, and relationships
among the data in order to uncover facts that were
previously unknown or not even contemplated by
an organization

Data mining applications
– Marketing
– Banking
– Retailing and sales
– Manufacturing and production
– Brokerage and securities trading
– Insurance
– Computer hardware and software
– Government and defense
– Airlines
– Health care
– Broadcasting
– Police
– Homeland security
Data Mining Techniques and Tools
 Data mining tools and techniques can be classified
based on the structure of the data and the
algorithms used.
 Examples of tools used:
 Statistical methods
 Decision trees
Defined as a root followed by internal nodes. Each node
(including the root) is labeled with a question, and the arcs
associated with each node cover all possible responses
 Case-based reasoning
 Neural computing
 Intelligent agents
 Genetic algorithms
 Other tools
 Rule induction
 Data visualization
 A general algorithm for building a decision tree:
1. Create a root node and select a splitting attribute.
2. Add a branch to the root node for each split candidate
value and label it.
3. Take the following iterative steps:
a. Classify data by applying the split value.
b. If a stopping point is reached, then create a leaf node and
label it. Otherwise, build another subtree.
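
The steps above can be sketched recursively in Python. This is one possible reading, not a definitive implementation: the splitting criterion (fewest training errors) and the stopping rule (pure node or no attributes left) are assumptions, and the `rows` data and all function names are hypothetical.

```python
from collections import Counter

def majority(rows, target):
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def split_errors(rows, attr, target):
    """Training errors if each branch of `attr` predicted its majority class."""
    errors = 0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        label = majority(subset, target)
        errors += sum(1 for r in subset if r[target] != label)
    return errors

def build_tree(rows, attributes, target):
    labels = {r[target] for r in rows}
    # Stopping point (assumed): node is pure or no attributes remain -> leaf.
    if len(labels) == 1 or not attributes:
        return majority(rows, target)
    # Step 1: select a splitting attribute (assumed criterion: fewest errors).
    attr = min(attributes, key=lambda a: split_errors(rows, a, target))
    remaining = [a for a in attributes if a != attr]
    # Step 2: add one labeled branch per value of the splitting attribute.
    node = {"attribute": attr, "branches": {}}
    for value in sorted({r[attr] for r in rows}):
        # Step 3a: classify the data by applying the split value.
        subset = [r for r in rows if r[attr] == value]
        # Step 3b: recurse; a leaf comes back when a stopping point is hit.
        node["branches"][value] = build_tree(subset, remaining, target)
    return node

def classify(tree, row):
    """Follow branches until a leaf label is reached (values must be known)."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# Hypothetical loan-approval history.
rows = [
    {"credit": "good", "employed": "yes", "approve": "yes"},
    {"credit": "good", "employed": "no", "approve": "yes"},
    {"credit": "poor", "employed": "yes", "approve": "yes"},
    {"credit": "poor", "employed": "no", "approve": "no"},
]
tree = build_tree(rows, ["credit", "employed"], "approve")
print(tree)
print(classify(tree, {"credit": "poor", "employed": "no"}))
```

Production tools usually choose the split with a purity measure such as the Gini index or entropy rather than a raw error count.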
 Gini index
Used in economics to measure the diversity of the
population. The same concept can be used to
determine the ‘purity’ of a specific class as a result
of a decision to branch along a particular
attribute/variable
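
As a small illustration, the Gini impurity of a set of class labels is commonly computed as 1 minus the sum of squared class proportions, and a split is scored by the size-weighted impurity of its branches. The function names and sample labels below are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(groups):
    """Size-weighted Gini impurity of the branches produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

pure = ["yes", "yes", "yes"]
mixed = ["yes", "no"]
print(gini(pure), gini(mixed))
print(gini_of_split([["yes", "yes"], ["yes", "no"]]))
```

When comparing candidate attributes, the one whose split gives the lowest weighted impurity produces the purest branches.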
 The ID3 algorithm decision tree approach
 Entropy
Measures the extent of uncertainty or randomness in a
data set. If all the data in a subset belong to just one
class, then there is no uncertainty or randomness in
that data set, so the entropy is zero
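
A brief sketch of the calculation, assuming the usual base-2 (Shannon) entropy and the information-gain criterion that ID3 uses to pick a split. The function names and sample labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, groups):
    """Entropy of the parent minus the size-weighted entropy of the branches."""
    weighted = sum(len(g) / len(parent) * entropy(g) for g in groups)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
print(entropy(["yes", "yes", "yes"]))  # one class only: no uncertainty
print(entropy(parent))                  # evenly mixed two classes
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))
```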
 Cluster analysis for data mining
 Cluster analysis is an exploratory data analysis tool for
solving classification problems
 The object is to sort cases into groups so that the
degree of association is strong between members of
the same cluster and weak between members of
different clusters
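
One widely used way to form such groups is k-means, sketched below in plain Python. K-means is swapped in here as an example; the slides do not prescribe a method. The `points` data and the choice of starting centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its cluster; repeat until the centroids stop changing."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional cases forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
for center, members in zip(centroids, clusters):
    print(center, members)
```

The result depends on the starting centroids, so in practice the procedure is often run several times with different starts.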
 Cluster analysis results may be used to:
 Help identify a classification scheme
 Suggest statistical models to describe populations
 Indicate rules for assigning new cases to classes for
identification, targeting, and diagnostic purposes
 Find typical cases to represent classes
 Cluster analysis methods
 Statistical methods
 Optimal methods
 Neural networks
 Fuzzy logic
 Genetic algorithms
Data Mining Project Methodologies
CRISP-DM (Cross-Industry Standard Process for Data Mining)
Data Mining Project Methodologies
The acronym SEMMA stands for Sample, Explore, Modify, Model, and Assess, and
refers to the process of conducting a DM project.

Sample - this stage consists of sampling the data by extracting a portion of a
large data set big enough to contain the significant information, yet small
enough to manipulate quickly;

Explore - this stage consists of the exploration of the data by searching for
unanticipated trends and anomalies in order to gain understanding
and ideas;

Modify - this stage consists of the modification of the data by creating,
selecting, and transforming the variables to focus the model selection process;

Model - this stage consists of modeling the data by allowing the software to
search automatically for a combination of data that reliably predicts a
desired outcome;

Assess - this stage consists of assessing the data by evaluating the usefulness
and reliability of the findings from the DM process and estimating how well it
performs.

SEMMA offers an easy-to-understand process that allows organized
development and maintenance of DM projects.
Knowledge Discovery in Databases

 Knowledge discovery in databases (KDD)
A comprehensive process of using data mining
methods to find useful information and patterns in
data
 KDD process
1. Selection
2. Preprocessing
3. Transformation
4. Data mining
5. Interpretation/evaluation
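
The five steps can be pictured as a small pipeline in which each stage feeds the next. The sketch below is only an illustration: the `records` data, the field names, the age bands, and the "average spend per band" mining step are all hypothetical choices.

```python
from statistics import mean

def select(records):
    # 1. Selection: keep only the fields of interest.
    return [(r["age"], r["spend"]) for r in records]

def preprocess(rows):
    # 2. Preprocessing: drop rows with missing values.
    return [(age, spend) for age, spend in rows if age is not None and spend is not None]

def transform(rows):
    # 3. Transformation: bucket the raw age into bands.
    return [("under_40" if age < 40 else "40_plus", spend) for age, spend in rows]

def mine(rows):
    # 4. Data mining: find a simple pattern (average spend per band).
    bands = {}
    for band, spend in rows:
        bands.setdefault(band, []).append(spend)
    return {band: mean(values) for band, values in bands.items()}

def interpret(pattern):
    # 5. Interpretation/evaluation: report the band with the highest average.
    return max(pattern, key=pattern.get)

# Hypothetical customer records, one with a missing age.
records = [
    {"age": 25, "spend": 120.0},
    {"age": 31, "spend": 80.0},
    {"age": 52, "spend": 200.0},
    {"age": None, "spend": 50.0},
    {"age": 47, "spend": 160.0},
]
pattern = mine(transform(preprocess(select(records))))
print(pattern)
print("Highest-spending band:", interpret(pattern))
```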
Text Mining
 Text mining
Application of data mining to non-structured or less
structured text files. It entails the generation of
meaningful numerical indices from the
unstructured text and then processing these indices
using various data mining algorithms
 Text mining helps organizations:
 Find the “hidden” content of documents, including
additional useful relationships
 Relate documents across previously unnoticed divisions
 Group documents by common themes
 Applications of text mining
 Automatic detection of e-mail spam or phishing
through analysis of the document content
 Automatic processing of messages or e-mails to route
a message to the most appropriate party to process that
message
 Analysis of warranty claims, help desk calls/reports,
and so on to identify the most common problems and
relevant responses
 Analysis of related scientific publications in journals
to create an automated summary view of a particular
discipline
 Creation of a “relationship view” of a document
collection
 Qualitative analysis of documents to detect deception
 How to mine text
1. Eliminate commonly used words (stop-words)
2. Replace words with their stems or roots (stemming
algorithms)
3. Consider synonyms and phrases
4. Calculate the weights of the remaining terms
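
The four steps above can be sketched as follows. Several pieces are assumptions made for illustration: the tiny stop-word list, the crude suffix-stripping stemmer (not a real stemming algorithm such as Porter's), the one-entry synonym map, and TF-IDF as the weighting scheme. The sample documents are hypothetical.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of", "to", "for", "was"}  # illustrative only
SYNONYMS = {"notebook": "laptop"}  # hypothetical synonym map

def stem(word):
    """Crude stemmer: strip a few common suffixes, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]   # step 1: drop stop-words
    stems = [stem(w) for w in kept]                     # step 2: stemming
    return [SYNONYMS.get(s, s) for s in stems]          # step 3: merge synonyms

def tf_idf(documents):
    """Step 4: weight each term by term frequency * log(N / document frequency)."""
    doc_terms = [Counter(terms(d)) for d in documents]
    n = len(documents)
    df = Counter(t for counts in doc_terms for t in counts)
    return [
        {t: (c / sum(counts.values())) * math.log(n / df[t]) for t, c in counts.items()}
        for counts in doc_terms
    ]

docs = [
    "The laptop battery failed",
    "Battery charging is slow",
    "The notebook screen cracked",
]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

Terms that appear in fewer documents receive higher weights, which is what lets the resulting numerical indices separate documents by topic.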
Web Mining
 Web mining
The discovery and analysis of interesting and
useful information from the Web, about the Web,
and usually through Web-based tools
 Web content mining
The extraction of useful information from Web pages
 Web structure mining
The development of useful information from the
links included in the Web documents
 Web usage mining
The extraction of useful information from the data
generated through Web page visits, transactions, etc.
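
As a minimal sketch of Web usage mining, the code below counts page views and the most common page-to-page transitions within sessions. The click-log format (a list of session ID and page pairs in visit order) and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def usage_summary(clicks):
    """Return (page view counts, page-to-page transition counts per session)."""
    page_views = Counter(page for _, page in clicks)
    paths = defaultdict(list)
    for session, page in clicks:
        paths[session].append(page)
    transitions = Counter()
    for pages in paths.values():
        # Pair each page with the next page visited in the same session.
        transitions.update(zip(pages, pages[1:]))
    return page_views, transitions

# Hypothetical click log: (session id, page), in visit order.
clicks = [
    ("s1", "/home"), ("s1", "/products"), ("s1", "/cart"),
    ("s2", "/home"), ("s2", "/products"),
    ("s3", "/home"), ("s3", "/cart"),
]
page_views, transitions = usage_summary(clicks)
print(page_views.most_common())
print(transitions.most_common())
```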
 Uses for Web mining:
 Determine the lifetime value of clients
 Design cross-marketing strategies across products
 Evaluate promotional campaigns
 Target electronic ads and coupons at user groups
 Predict user behavior
 Present dynamic information to users
For more on data mining techniques:
 http://www.statsoft.com/textbook/data-mining-techniques/
