Unit 3 PPT (BA)


Unit 3

Introduction to Data Mining


Data mining
• Data mining is the process of extracting patterns and other
useful information from large data sets.
• Data mining is the process of sorting through large data sets to
identify patterns and relationships that can help solve business
problems through data analysis. Data mining techniques and
tools enable enterprises to predict future trends and make
more-informed business decisions.
• It’s sometimes known as knowledge discovery in data or
KDD.
• Data mining has improved corporate decision-making.
• Typical steps include defining the target dataset and forecasting
outcomes using machine learning methods.
Uses of Data Mining
• To achieve a corporate goal
• To answer business or research questions
• To contribute to problem-solving
• To aid in the accurate prediction of outcomes
• To analyze and predict trends
• To inform forecasts
• To identify gaps and mistakes in processes, such
as supply chain bottlenecks or incorrect data
entry
Data Mining Techniques
• Depending on the company’s goals for data mining, different techniques are
used to produce models that fit the desired outcomes. The models can be
used to describe current data, predict future trends or aid in finding data
anomalies.

• Descriptive model: Descriptive analytics finds patterns and relationships in
current data.

• Predictive model: Used to predict future outcomes, such as whether a loan
applicant is a good risk, or to make financial forecasts, such as upcoming sales.

• Outlier Analysis: Used to find anomalies, that is, data that doesn’t fit neatly
into patterns. Outlier analysis is especially useful in fraud detection, network
intrusion detection and criminal investigations.
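One simple way to illustrate outlier analysis is a z-score check: flag any value that lies more than a chosen number of standard deviations from the mean. The sketch below is only an illustration; the function name `find_outliers`, the threshold of 2.0, and the sample transaction amounts are all assumptions, and real fraud detection systems generally use more robust methods.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`.

    `threshold` is an illustrative choice, not a universal rule.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far from the rest.
amounts = [52, 48, 50, 47, 53, 49, 51, 500]
print(find_outliers(amounts))  # prints [500]
```

With very small samples a single extreme value also inflates the standard deviation, so the threshold usually needs tuning to the data.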
Origins of Data Mining
• Data mining has its origins in the field of statistics and
computer science, and it has evolved over the years as
technology and data collection methods have advanced.
• Today, data mining continues to evolve with the development
of more sophisticated algorithms, big data technologies, and
increased automation, making it an essential component of
data analytics and business intelligence.
• Its development has been shaped by statistics, machine learning,
databases, Knowledge Discovery in Databases (KDD), the rapid
growth of data, data warehousing, industry applications, and
research and academia.
Data mining Tasks
• Data mining encompasses various tasks and
techniques for extracting valuable insights and
patterns from large datasets. These data mining
tasks are essential for gaining insights from
large and complex datasets, and the choice of
the task and technique depends on the specific
problem and data at hand. These tasks can be
broadly categorized into the following:
• Classification
• Regression
• Clustering
• Association Rule Mining
• Anomaly Detection
• Sequential Pattern Mining
• Text Mining and Natural Language
Processing (NLP)
• Time Series Analysis
• Dimensionality Reduction
• Recommendation Systems
• Graph Mining
• Spatial Data Mining
• Ensemble Methods
OLAP (Online Analytical Processing) and
Multidimensional Analysis
Online Analytical Processing (OLAP)
• OLAP (Online Analytical Processing) and
multidimensional data analysis are techniques and
technologies used in data warehousing and business
intelligence to help users analyze and explore large sets
of data. They are particularly useful for decision support
and reporting purposes. OLAP is a category of software
tools and applications that allow users to interact with
and analyze data from various perspectives, making it
easier to perform complex and multidimensional data
analysis. OLAP systems are designed for query
performance. Key OLAP operations and concepts include:
• Slice: View a single "slice" or cross-section of data to focus on specific
attributes or dimensions.
• Dice: Select and view a subset of data based on multiple criteria or
dimensions.
• Pivot (rotate): Change the orientation of data to look at it from different
angles or dimensions.
• Drill-down: Navigate from summary-level data to more detailed data.
• Roll-up: Aggregate data to a higher-level summary.
• Measure: Apply mathematical functions to measure data, such as
calculating sums, averages, or percentages.
• Hierarchy navigation: Explore data hierarchies, such as drilling down
from years to quarters to months.
• OLAP systems typically use a multidimensional data model, which
organizes data into dimensions, facts, and measures, making it more
suitable for complex and interactive analysis.
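The slice, dice, roll-up, and drill-down operations listed above can be sketched on a tiny in-memory fact table. This is a minimal illustration in plain Python, not how an OLAP engine is implemented; the `SALES` rows, the dimension names, and the helper functions `slice_rows`, `dice`, and `roll_up` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: four dimensions and one measure ("amount").
SALES = [
    {"year": 2023, "quarter": "Q1", "region": "East", "product": "Laptop", "amount": 100},
    {"year": 2023, "quarter": "Q1", "region": "West", "product": "Laptop", "amount": 80},
    {"year": 2023, "quarter": "Q2", "region": "East", "product": "Phone", "amount": 60},
    {"year": 2023, "quarter": "Q2", "region": "West", "product": "Phone", "amount": 40},
    {"year": 2024, "quarter": "Q1", "region": "East", "product": "Laptop", "amount": 120},
    {"year": 2024, "quarter": "Q1", "region": "West", "product": "Phone", "amount": 50},
]

def slice_rows(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dimension] == value]

def dice(rows, criteria):
    """Dice: keep rows matching allowed values on several dimensions."""
    return [r for r in rows
            if all(r[dim] in allowed for dim, allowed in criteria.items())]

def roll_up(rows, dimensions, measure="amount"):
    """Roll-up: sum the measure grouped by the given dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dimensions)] += r[measure]
    return dict(totals)

# Roll-up to yearly totals, then drill down to quarters.
print(roll_up(SALES, ["year"]))
print(roll_up(SALES, ["year", "quarter"]))

# Slice on one region, dice on region and product together.
print(roll_up(slice_rows(SALES, "region", "East"), ["region"]))
print(dice(SALES, {"region": {"East"}, "product": {"Laptop"}}))
```

Drill-down here is simply a roll-up over a finer set of dimensions (year and quarter instead of year alone), which mirrors the hierarchy navigation described above.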
Multidimensional Data Analysis

• Multidimensional data analysis, sometimes referred to
as "MDA," is a broader concept that encompasses the
techniques and methods used to analyze data from
multiple dimensions. These dimensions could represent
various attributes, categories, or factors relevant to the
data being analyzed. Multidimensional data analysis is
closely associated with OLAP, as OLAP systems typically
employ multidimensional data structures to facilitate
easy exploration and analysis of data. It typically
involves the following:
• Data Cubes: Data is often organized in a cube-like
structure, where dimensions are represented along
different axes. Each cell in the cube contains aggregated
or detailed data.
• Dimension Hierarchies: Dimensions are often organized
hierarchically, allowing users to drill down or roll up to
access different levels of detail.
• Measures: Data analysis in multidimensional models
often involves calculating various measures, such as
sums, averages, and percentages.
• Interactive Exploration: Users can interact with the data
to change dimensions, apply filters, and perform
calculations to gain insights.
Basic Concepts of Association Analysis
• Association analysis and cluster analysis are two
important techniques in data mining and data
analysis used to discover patterns and relationships
in data. They are often employed to gain insights
from large datasets and make data-driven decisions.
Association analysis, often referred to as market
basket analysis, is a technique used to discover
relationships and associations between items in a
dataset. It is commonly used in retail, e-commerce,
and recommendation systems. The primary goal of
association analysis is to find frequent patterns or
associations within transactional data.
• Frequent Itemsets: Association analysis identifies sets
of items that frequently appear together in
transactions. These sets are known as frequent
itemsets.
• Support: Support measures how frequently a particular
itemset appears in the dataset. It is defined as the
proportion of transactions that contain the itemset.
• Confidence: Confidence quantifies the strength of an
association rule. For a rule A → B, it is the probability
that B is bought given that A is bought.
• Lift: Lift compares how often the items in a rule occur
together with how often they would be expected to occur
together if they were independent. A lift above 1 indicates
a positive association.
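The three measures above can be computed directly from a list of transactions. The following is a small sketch using the standard definitions (support as a proportion of transactions, confidence as support of the combined itemset divided by support of the antecedent, lift as confidence divided by support of the consequent). The basket data and function names are made up for illustration.

```python
def support(transactions, itemset):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
print(lift(baskets, {"bread"}, {"milk"}))
```

In this made-up data, bread and milk appear together in 3 of 5 baskets (support 0.6), milk appears in 3 of the 4 baskets containing bread (confidence 0.75), and the lift is about 0.94. A lift slightly below 1 suggests that buying bread does not make milk more likely than it already is.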
Cluster Analysis
• Cluster analysis, or clustering, is a technique
used to group similar data points into clusters
or segments based on certain characteristics
or features. The primary goal of cluster
analysis is to uncover natural groupings within
data. Cluster analysis has various applications,
such as customer segmentation, anomaly
detection, image segmentation, and gene
expression analysis.
• K-Means Clustering: One of the most popular clustering
methods, K-means divides data into a pre-defined number of
clusters. It assigns each data point to the cluster with the
nearest centroid (center of the cluster) based on a distance
measure (usually Euclidean distance).
• Hierarchical Clustering: This method builds a hierarchy of
clusters, often represented as a dendrogram. It can be divisive
(top-down) or agglomerative (bottom-up).
• DBSCAN (Density-Based Spatial Clustering of Applications
with Noise): DBSCAN identifies clusters based on density. It
groups together data points that are close to each other and
separates regions of low density.
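The K-means procedure described above alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. Below is a minimal sketch with Euclidean distance in plain Python. The sample points and starting centroids are assumptions chosen for illustration, and the starting centroids are passed in explicitly, whereas practical implementations usually pick them randomly or with a seeding heuristic.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Basic K-means: returns (final centroids, list of clusters)."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
final_centroids, clusters = k_means(POINTS, centroids=[(1, 1), (9, 8)])
print(final_centroids)
print(clusters)
```

The number of clusters is fixed by how many starting centroids are supplied, which reflects the "pre-defined number of clusters" requirement of K-means.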
