Data Mining
Data are typically stored in a database environment and are large in scale.
• Business intelligence is the transformation of raw data into knowledge and insight for
making better business decisions.
Related Fields
• Database: Focuses on data storage and access technology, while data mining focuses on
data analysis and knowledge discovery.
• Artificial Intelligence (AI): There are overlaps between data mining and AI (including
machine learning) techniques. However, AI techniques are not necessarily data-oriented
(e.g., expert systems).
• Statistics:
Statistical science assumes data are scarce; it focuses on numeric data and parametric
approaches (e.g., assuming the data follow a normal distribution).
Data mining assumes data are abundant; it deals with various data types and focuses
on efficient algorithms for large-scale data.
Terminology
• Dataset (relation or relational database table): A set of data with attributes in columns and
records in rows.
Business Applications
• Database marketing
• Credit evaluation
• Fraud detection
• Market segmentation
Classification
• Classification is the process of assigning each data record to one of several predefined
groups, called classes. Classification involves building a model (called a classifier), which
can be a mathematical function, a set of rules, or another representation.
• Examples of classification:
Prediction
• Prediction discovers the relationship between a set of variables, called predictor (or
input or independent) variables, and another set of variables, called target (or output or
dependent) variables in data, so that the past or current values of predictor variables can
be used to predict the future values of target variables.
• Examples of prediction:
Association
• Association refers to a relationship in which the presence of one set of items in a group
of records implies the presence of another set of items in the same group of records.
• Examples of association:
Market basket analysis (i.e., which items are typically bought together during a
customer’s visit to the store?)
Recommender system (Amazon: Customers who bought this item also bought…)
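The market basket question above can be illustrated with a minimal sketch that counts how often each pair of items appears in the same transaction. The function name pair_counts and the sample baskets are hypothetical, invented for illustration; this is simple co-occurrence counting, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicate items and fixes the order of each pair
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions, invented for illustration only
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "eggs"],
    ["bread", "milk", "eggs"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs with high counts are candidates for "bought together" patterns; on real data the counts would normally be compared against how often each item occurs on its own.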
Clustering
• Clustering is the process of grouping data records into a number of groups, called
clusters, such that records within the same cluster are more similar to each other than
to records in different clusters.
• Clustering differs from classification in that the groups are formed as a result of the
analysis rather than being predefined.
• Examples of clustering:
Market segmentation
1. Problem identification
2. Data preparation
Data reduction: sampling (in rows), feature (attribute) selection (in columns)
Select appropriate data-mining techniques and tools and use the selected techniques
and tools to build models and explore the patterns/relationships hidden in data.
Test whether the models built are valid; modify the models if necessary.
Big Data
• Examples:
Financial market: 7 billion shares change hands every day on U.S. markets (MSC p.7).
• Mean/Mode substitution:
For a missing value of a numeric attribute, replace it with the mean of the
non-missing values of that attribute.
For a missing value of a categorical attribute, replace it with the mode (most frequent
value) of the non-missing values of that attribute.
• Task-specific missing value replacement methods (we will discuss some of them later in
class)
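Mean/mode substitution can be sketched in a few lines using only the Python standard library. The function name fill_missing, the use of None to mark a missing value, and the sample columns are assumptions made for this illustration.

```python
from statistics import mean, mode


def fill_missing(values, numeric):
    """Replace None entries with the mean (numeric attribute) or the
    mode (categorical attribute) of the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate the replacement from
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


# Hypothetical columns, invented for illustration only
ages = [20, None, 30, 40]
colors = ["red", "blue", None, "red"]

filled_ages = fill_missing(ages, numeric=True)
filled_colors = fill_missing(colors, numeric=False)
print(filled_ages)
print(filled_colors)
```

Here the missing age is replaced by 30, the mean of 20, 30, and 40, and the missing color is replaced by "red", the most frequent value.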
• Transform a series of numeric data into values within range [0, 1], as below:
• Example:
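One common way to map numeric values into [0, 1] is min-max normalization, x' = (x - min) / (max - min), which is presumably the transformation meant here. The sketch below assumes that formula; the function name min_max_normalize and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero;
        # returning 0.0 for every value is an arbitrary choice for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data, invented for illustration only
scaled = min_max_normalize([10, 20, 30, 50])
print(scaled)
```

The smallest value (10) maps to 0, the largest (50) maps to 1, and 20 and 30 map to 0.25 and 0.5 respectively.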
• The validation set is the portion of data used to validate or adjust the models and to
guard against overfitting.
• The test set is the portion of data used to evaluate the performance of the models. The
test set stands in for unseen future data.
• Overfitting occurs when a model fits the training data very well or even perfectly, but
performs poorly when it is applied to unseen future data.