Data Mining & Data Warehousing
Data Mining
Data mining is the process of sorting through large data
sets to identify patterns and relationships that
help to solve business problems.
Binning:
Binning methods are applied by sorting values into buckets or bins.
Smoothing is performed by consulting the neighboring values in the same bin.
Smoothing by bin medians
Each value in a bin is replaced by the bin median.
Smoothing by bin boundaries
The minimum and maximum values in the bin are the bin boundaries. Each value is replaced by the closest boundary.
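The two smoothing methods can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-frequency bins of three values each; the price list and the function names are made up for the example and are not part of the slides.

```python
from statistics import median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_median(bins):
    """Replace every value in a bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum.

    Ties go to the minimum (an arbitrary choice for this sketch).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_frequency_bins(prices, 3)
by_median = smooth_by_median(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_median)      # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```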
Regression
A statistical technique that relates a
dependent variable to one or more
independent (explanatory)
variables.
Example: predicting the chance of winning, or
predicting a person's annual income based
on education, age, profession, etc.
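As a rough sketch of the idea, the snippet below fits a straight line by ordinary least squares with one independent variable. The data are invented and deliberately exactly linear so the result is easy to check by hand; `fit_line` is a hypothetical helper name, and real income data would need more variables and a proper statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: years of education vs. annual income in thousands.
years_of_education = [10, 12, 14, 16, 18]
annual_income = [25, 30, 35, 40, 45]

slope, intercept = fit_line(years_of_education, annual_income)
predicted = slope * 20 + intercept  # predicted income for 20 years of education

print(slope, intercept, predicted)
```

The output of regression is a continuous number (here, an income estimate), which is what distinguishes it from classification.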
Classification
Classification is the process of categorizing a given set of
data into classes.
Example:
Based on age, predicting whether a person is an adult.
if AGE > 60 then
    ADULT
else
    YOUTH
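The rule above translates directly into a small function. This is only a sketch: the threshold of 60 is taken from the example as written, and `classify_by_age` is a hypothetical name chosen for illustration.

```python
def classify_by_age(age, threshold=60):
    """Return "ADULT" when age exceeds the threshold, otherwise "YOUTH"."""
    return "ADULT" if age > threshold else "YOUTH"

labels = [classify_by_age(a) for a in (25, 60, 61)]
print(labels)  # ['YOUTH', 'YOUTH', 'ADULT']
```

Unlike regression, the output is one of a fixed set of class labels rather than a continuous value.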
Regression Vs Classification
Finding the outliers
An outlier is an object that deviates significantly
from the rest of the objects. Outliers can be caused
by measurement or execution errors.
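One common way to flag outliers is the interquartile range (IQR) rule, sketched below. The slides do not name a specific method, so the IQR rule, the multiplier 1.5, and the sample readings are all assumptions made for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 95]  # 95 looks like a measurement error
outliers = iqr_outliers(readings)
print(outliers)  # [95]
```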
Data Inconsistency
Data inconsistency occurs when the same data exists
in different formats in multiple tables.
Data Integration
When multiple heterogeneous data
sources such as databases, data cubes or
files are combined for analysis, this process is
called data integration.
This helps in improving the accuracy and speed
of the data mining process.
Tools:
Oracle Data Integrator (ODI)
SQL Server Integration Services (SSIS)
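Independently of any particular tool, the core of data integration is combining records from differently shaped sources on a shared key. The sketch below joins a CSV source with a JSON source; the field names (`customer_id`, `cust`, `total`) and the sample records are hypothetical.

```python
import csv
import io
import json

# Two heterogeneous sources describing the same customers (made-up data).
crm_csv = "customer_id,name\n1,Asha\n2,Ravi\n"
orders_json = (
    '[{"cust": 1, "total": 250.0}, {"cust": 2, "total": 99.5}, '
    '{"cust": 1, "total": 40.0}]'
)

def integrate(csv_text, json_text):
    """Join customers (CSV) with their order totals (JSON) on the customer id."""
    customers = {
        int(row["customer_id"]): {"name": row["name"], "total_spent": 0.0}
        for row in csv.DictReader(io.StringIO(csv_text))
    }
    for order in json.loads(json_text):
        customers[order["cust"]]["total_spent"] += order["total"]
    return customers

combined = integrate(crm_csv, orders_json)
print(combined)
```

Note that the two sources name the key differently (`customer_id` vs. `cust`); reconciling such differences is a large part of integration work.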
Data Reduction
Data reduction is a process that reduces the
volume of original data and represents it in a
much smaller volume.
Key requirement: maintain the integrity of the
data while reducing it.
Some strategies of data reduction are:
Dimensionality Reduction: Reducing the
number of attributes in the dataset.
Numerosity Reduction: Replacing the original
data volume by smaller forms of data
representation.
Data Compression: Storing the data in a compressed representation.
Dimensionality Reduction
Numerosity Reduction
Histogram
A histogram is a graphical representation of data
points organized into user-specified ranges.
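A histogram can be computed without any plotting library by counting how many values fall into each range. The sketch below assumes equal-width ranges of 10 and uses invented ages; `histogram` is a hypothetical helper name.

```python
def histogram(values, bin_width):
    """Count values per range; each key is the inclusive start of a range."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

ages = [3, 7, 12, 15, 18, 21, 22, 29, 35, 38]  # illustrative data
counts = histogram(ages, 10)

# Text rendering: one row per range, one '#' per data point.
for start, count in counts.items():
    print(f"{start:>2}-{start + 9:<2} {'#' * count}")
```

Storing only the four counts instead of the ten original values is also a simple form of numerosity reduction.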
Data compression
Data compression is the process of
modifying data in order to reduce its size.
Two types of data compression techniques:
Lossless Data Compression
Lossy Data Compression
Lossless data compression
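Lossless compression means the original data can be restored exactly from the compressed form. A minimal sketch using Python's standard `zlib` module (the sample bytes are made up):

```python
import zlib

original = b"data warehouse " * 100  # highly repetitive, so it compresses well
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))
print(restored == original)  # True: nothing was lost
```

Lossy techniques, by contrast, discard some information, so the restored data is only an approximation of the original.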
What Kinds of Data Can Be Mined?
Flat Files
Relational Databases
Data Warehouses
Transactional Databases
Multimedia Databases
Spatial Databases
Time Series Databases
World Wide Web (WWW)
What Kinds of Patterns Can Be Mined?
Data mining functionalities are used to specify the kind
of patterns to be found in data mining tasks.
They can be classified into two categories: descriptive and
predictive.
Components of Data Warehouse Architecture
ETL Tools
Database
Data
Access Tools
ETL Tools
The data coming from the data source layer
can come in a variety of formats.
ETL stands for Extract, Transform, Load.
The staging layer uses ETL tools to extract the
needed data from these formats and check
its quality before loading it into the data
warehouse.
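The three ETL steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not how a production ETL tool works: the source is a small CSV string, the quality check simply drops rows whose fields cannot be parsed, and an in-memory SQLite database with a hypothetical `sales` table stands in for the warehouse.

```python
import csv
import io
import sqlite3

raw_csv = "id,amount\n1,10.5\n2,not_a_number\n3,7\n"  # made-up source data

def extract(text):
    """Extract: read rows from the source format (CSV here)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types and drop rows that fail the quality check."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["id"]), float(row["amount"])))
        except ValueError:
            continue  # skip malformed rows
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_csv)), conn)
loaded = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(loaded)  # [(1, 10.5), (3, 7.0)]
```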
Database
The database is the most crucial component and
the heart of each architecture.
The warehouse is where the data is stored and
accessed.
Four types of databases are:
Relational databases (row-centered databases).
Analytics databases (developed to sustain and
manage analytics).
Data warehouse applications (software for data
management and hardware for storing data, offered
by third-party vendors).
Cloud-based databases (hosted on the cloud).
Data
Once the system cleans and organizes the data, it
stores it in the data warehouse.
The data warehouse represents the central repository
that stores metadata, summary data, and raw data
coming from each source.
Metadata is the information that defines the data. It
allows data analysts to classify, locate, and direct
queries to the required data.
Summary data is generated by the warehouse
manager. It updates as new data loads into the
warehouse. This component can include lightly or
highly summarized data.
Raw data is the actual data loaded into the
repository, which has not been processed.
Access Tools
Users interact with the gathered information
through different tools and technologies.
They can analyze the data, gather insights, and create
reports.
Some of the tools used include:
Reporting tools.
Reporting tools include visualizations such as graphs and
charts showing how data changes over time.
OLAP tools.
Online analytical processing tools which allow users to
analyze multidimensional data from multiple perspectives.
Data mining tools.
Data mining tools examine data sets to find patterns within the
warehouse and the correlations between them.
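To give a feel for what analyzing multidimensional data "from multiple perspectives" means, the sketch below aggregates the same small fact table along different dimensions, similar in spirit to an OLAP roll-up. The sales records and the `roll_up` helper are invented for illustration; real OLAP tools operate on data cubes at far larger scale.

```python
from collections import defaultdict

# Made-up sales facts with three dimensions and one measure (amount).
sales = [
    {"region": "North", "product": "Laptop", "quarter": "Q1", "amount": 120},
    {"region": "North", "product": "Phone", "quarter": "Q1", "amount": 80},
    {"region": "South", "product": "Laptop", "quarter": "Q1", "amount": 100},
    {"region": "South", "product": "Laptop", "quarter": "Q2", "amount": 150},
]

def roll_up(facts, *dimensions):
    """Sum the amount for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact["amount"]
    return dict(totals)

by_region = roll_up(sales, "region")
by_product = roll_up(sales, "product")
by_region_quarter = roll_up(sales, "region", "quarter")

print(by_region)          # {('North',): 200, ('South',): 250}
print(by_product)         # {('Laptop',): 370, ('Phone',): 80}
print(by_region_quarter)
```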
Data Marts
Data marts allow you to serve multiple
groups within the system by segmenting the
data in the warehouse into categories.
A data mart partitions the data, producing a subset
for a particular user group.
Ex: Sales group, finance group, account group.
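Conceptually, building data marts amounts to partitioning warehouse rows by the group they serve. A minimal sketch, assuming each row carries a hypothetical `dept` field and using invented figures:

```python
from collections import defaultdict

# Made-up warehouse rows; "dept" identifies the user group each row serves.
warehouse = [
    {"dept": "sales", "metric": "orders", "value": 320},
    {"dept": "finance", "metric": "revenue", "value": 91000},
    {"dept": "sales", "metric": "returns", "value": 12},
    {"dept": "accounts", "metric": "open_invoices", "value": 45},
]

def build_data_marts(rows, key="dept"):
    """Partition warehouse rows into one list (mart) per user group."""
    marts = defaultdict(list)
    for row in rows:
        marts[row[key]].append(row)
    return dict(marts)

marts = build_data_marts(warehouse)
print(sorted(marts))        # ['accounts', 'finance', 'sales']
print(len(marts["sales"]))  # 2
```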
Properties of Data Warehouse Architectures
1. Separation:
Analytical and transactional processing should be kept
apart as much as possible.
2. Scalability:
Hardware and software architectures should be easy to
upgrade as the data volume grows.
3. Extensibility:
The architecture should be able to accommodate new
operations and technologies without redesigning the
whole system.
4. Security:
Monitoring access is necessary because of the
strategic data stored in the data warehouse.
5. Administerability:
Data Warehouse management should not be complicated.