Data Mining Unit 1


Data Mining and Warehouse

Unit-I
• What is Data Warehouse?
• A Multidimensional Data Model
• Data Warehouse Architecture
• Data Warehouse Implementation
• Data Cube Technology
• From data warehousing to data mining
• Data mining functionalities
• Data Cleaning, Data Integration
• Data Transformation, Data Reduction
What is a Data Warehouse?

• According to William H. Inmon,

  “A data warehouse is a subject-oriented, integrated, time-variant,
  and nonvolatile collection of data in support of management’s
  decision-making process.”
What is a Data Warehouse?

• Subject-oriented: A data warehouse is organized around major
  subjects such as customer, supplier, product, and sales.

• Integrated: A data warehouse is constructed by integrating multiple
  heterogeneous sources, such as relational databases, flat files, and
  online transaction records.

• Time-variant: Data are stored to provide information from a
  historical perspective (e.g., the past 5-10 years).

• Non-volatile: A data warehouse is always a physically separate store
  of data transformed from the application data found in the
  operational environment.
Difference between Operational database system
and Data warehouse
• The major task of online operational database systems is to perform
  online transaction and query processing. These systems are called
  Online Transaction Processing (OLTP) systems.

• They cover most of the day-to-day operations of an organization,
  such as purchasing, inventory, manufacturing, banking, payroll,
  registration, and accounting.
Difference between Operational database system
and Data warehouse
• Data warehouse systems serve users or knowledge workers in the role
  of data analysis and decision making. Such systems can organize and
  present data in various formats in order to accommodate the diverse
  needs of different users. These systems are known as Online
  Analytical Processing (OLAP) systems.
Comparison of OLTP and OLAP Systems
Feature              OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day-to-day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date;          historical; summarized,
                     detailed, flat relational;    multidimensional;
                     isolated                      integrated, consolidated
usage                repetitive                    ad hoc
access               read/write;                   lots of scans
                     index/hash on primary key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100 MB to GB                  100 GB to TB
metric               transaction throughput        query throughput, response time
Data Cube: A Multidimensional Data Model

• What is a data cube?

  “A data cube allows data to be modelled and viewed in multiple
  dimensions.” It is defined by dimensions and facts.

 A multidimensional data model is typically organized around a
  central theme, such as sales. This theme is represented by a fact
  table.
 Each dimension may have a table associated with it, called a
  dimension table.
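A minimal sketch of this multidimensional view in Python with pandas; the tiny sales table below is made up purely for illustration:

```python
# Sketch: view a small fact table along two dimensions at a time.
import pandas as pd

# Fact table: one row per (quarter, item, location) with a sales measure.
fact = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["phone", "tv", "phone", "tv"],
    "location": ["Pune", "Pune", "Mumbai", "Pune"],
    "sales":    [605, 825, 680, 952],
})

# A 2-D view of the cube: quarters by items, aggregating the measure.
cube_2d = fact.pivot_table(index="quarter", columns="item",
                           values="sales", aggfunc="sum")
print(cube_2d)
```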
Data Cube: A Multidimensional Data Model
[Figure: a 3-D data cube representation of the data in Table 4.3]
[Figure: a 4-D data cube representation of sales data]
Lattice of cuboids

• Base cuboid: The cuboid that holds the lowest level of
  summarization is called the base cuboid.
• Apex cuboid: The 0-D cuboid, which holds the highest level of
  summarization, is called the apex cuboid.
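Since every subset of the n dimensions defines one cuboid, the lattice contains 2^n cuboids in total. A minimal sketch that enumerates them (the dimension names are illustrative):

```python
# Sketch: enumerate the lattice of cuboids for three dimensions.
from itertools import combinations

dimensions = ["time", "item", "location"]

for k in range(len(dimensions), -1, -1):     # from base cuboid down to apex
    for cuboid in combinations(dimensions, k):
        label = ", ".join(cuboid) if cuboid else "all (apex cuboid)"
        print(f"{k}-D cuboid: {label}")
```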
Stars, Snowflakes, and Fact Constellations : Schemas for Multidimensional Data Model

The most popular data model for a data warehouse is a multidimensional
model, which can exist in the form of a star schema, a snowflake
schema, or a fact constellation schema.

• Star schema: The data warehouse contains
  1) a large central table (fact table) containing the bulk of the
     data, with no redundancy, and
  2) a set of smaller attendant tables (dimension tables), one for
     each dimension (see the sketch below).
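A hedged sketch of a star schema as pandas tables; all table and column names (fact_sales, dim_item, item_key, and so on) are illustrative assumptions, not from the slides:

```python
# Sketch: a fact table keyed to dimension tables, queried by a star join.
import pandas as pd

dim_item = pd.DataFrame({"item_key": [1, 2],
                         "item_name": ["phone", "tv"],
                         "brand": ["A", "B"]})
dim_time = pd.DataFrame({"time_key": [10, 11],
                         "quarter": ["Q1", "Q2"]})
fact_sales = pd.DataFrame({"time_key": [10, 10, 11],
                           "item_key": [1, 2, 1],
                           "units_sold": [5, 3, 7]})

# A typical star join: the fact table joined to its dimension tables.
result = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_time, on="time_key")
          .groupby(["quarter", "brand"])["units_sold"].sum())
print(result)
```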
Stars, Snowflakes, and Fact Constellations : Schemas for Multidimensional Data Model

• Snowflake schema:
  1) A snowflake schema is a variant of the star schema model,
  2) where some dimensions are normalized, thereby further splitting
     the data into additional tables.
  3) The resulting schema graph forms a shape similar to a snowflake.
Stars, Snowflakes, and Fact Constellations : Schemas for Multidimensional Data Model

• Fact Constellation:
1) Sophisticated applications may require multiple fact tables to share
dimension tables.
2) It can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
Data Warehouse Architecture
(A Three-tier data warehousing architecture)
• Back-end tools and utilities are used to feed data into
the bottom tier from operational databases or other
external sources.
 Extract
 Clean
 Transform
 Load
 Refresh
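A hedged sketch of this back-end flow in Python; the file names, column names, and code mapping are assumptions for illustration, not part of any real warehouse toolchain:

```python
# Sketch: extract -> clean -> transform -> load over a CSV export.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: read raw records from an operational export (CSV here)."""
    return pd.read_csv(path)

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean: drop exact duplicates and rows missing the key measure."""
    return df.drop_duplicates().dropna(subset=["amount"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: conform codes to the warehouse's conventions."""
    df = df.copy()
    df["pay_type"] = df["pay_type"].map({"H": 1, "S": 2}).fillna(df["pay_type"])
    return df

def load(df: pd.DataFrame, path: str) -> None:
    """Load: write the refreshed rows to the warehouse staging area."""
    df.to_csv(path, index=False)

# Example (file names are hypothetical):
# load(transform(clean(extract("operational_export.csv"))), "warehouse_stage.csv")
```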

1. The bottom tier is a warehouse database server is always a


relational database system.

2. The middle tier is an OLAP server that is typically implements


using either (1) a relational OLAP (ROLAP) or
(2) a multidimensional OLAP (MOLAP) model
(i.e. a special-purpose server that directly implements multidimensional data and
operations)

3. The top-tier is a front-end client layer, which contains query


and reporting tools, analysis tools, and /or data mining tools
( e.g., trend analysis, prediction, and so on)
Data Warehouse models:
(Enterprise Warehouse, Data Mart, and Virtual Warehouse)

• Enterprise Warehouse:
   It collects all of the information about subjects spanning the
    entire organization. It provides corporate-wide data integration.
   It contains detailed data as well as summarized data, and ranges
    in size from hundreds of gigabytes to terabytes or beyond.

• Data Mart:
   It contains a subset of corporate-wide data that is of value to a
    specific group of users. The scope is confined to specific
    selected subjects (e.g., a marketing data mart may confine its
    subjects to customer, item, and sales).
   It is implemented on low-cost departmental servers that are
    Unix/Linux or Windows based.
   The implementation cycle of a data mart is more likely to be
    measured in weeks rather than months or years.

• Virtual Warehouse:
   It is a set of views over operational databases.
   For efficient query processing, only some of the possible summary
    views may be materialized.
   A virtual warehouse is easy to build but requires excess capacity
    on operational database servers.
Metadata Repository
• Metadata are data about data.
• A metadata repository should contain the following:
• A description of the structure of the data warehouse.
• Operational metadata (e.g., the history of migrated data and the
  sequence of transformations applied to it).
• The algorithms used for summarization (including measure and
  dimension definition algorithms).
• Mapping from the operational environment to the data warehouse
  (including extraction, cleaning, and transformation rules and
  defaults).
• Business metadata (including business terms and definitions, data
  ownership information, and charging policies).
Typical OLAP Operations
• Roll-up (also called the drill-up operation): performs aggregation
  on a data cube, either by climbing up a concept hierarchy for a
  dimension or by dimension reduction.
• Drill-down: the reverse of roll-up. It navigates from less detailed
  data to more detailed data.
• Slice and dice: slice performs a selection on one dimension of the
  cube; dice defines a subcube by performing a selection on two or
  more dimensions.
• Pivot (rotate): a visualization operation that rotates the data
  axes to provide an alternative presentation of the data.
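A minimal sketch of roll-up, drill-down, and slice using pandas grouping and selection; the sales figures and dimension values below are made up:

```python
# Sketch: three OLAP operations over a small sales table.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "city":    ["Pune", "Mumbai", "Pune", "Pune"],
    "amount":  [100, 150, 120, 130],
})

# Roll-up: climb the time hierarchy from quarter up to year.
rollup = sales.groupby("year")["amount"].sum()

# Drill-down: navigate back to the more detailed (year, quarter) level.
drilldown = sales.groupby(["year", "quarter"])["amount"].sum()

# Slice: select a single value on one dimension (city = Pune).
slice_pune = sales[sales["city"] == "Pune"]

print(rollup, drilldown, slice_pune, sep="\n\n")
```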
Data Warehouse Implementation
Materialization

• No materialization: Do not precompute any of the “nonbase” cuboids.

• Full materialization: Precompute all of the cuboids. The resulting
  lattice of computed cuboids is referred to as the full cube. This
  choice typically requires huge amounts of memory space in order to
  store all of the precomputed cuboids.

• Partial materialization: Selectively compute a proper subset of the
  whole set of possible cuboids. The partial materialization of
  cuboids or subcubes should consider three factors:
  (1) identify the subset of cuboids or subcubes to materialize;
  (2) exploit the materialized cuboids or subcubes during query
      processing; and
  (3) efficiently update the materialized cuboids or subcubes during
      load and refresh.
Indexing OLAP Data: Bitmap Index and Join Index
• To facilitate efficient data accessing, most data warehouse systems
  support index structures and materialized views.
Indexing OLAP Data
Bitmap Indexing
• It is especially useful for low-cardinality domains because
  comparison, join, and aggregation operations are then reduced to
  bit arithmetic, which substantially reduces the processing time.
• Bitmap indexing leads to significant reductions in space and
  input/output (I/O) since a string of characters can be represented
  by a single bit.
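A minimal sketch of the idea in Python, assuming a low-cardinality pay_type column; each distinct value gets one bit vector, and selections reduce to bit arithmetic:

```python
# Sketch: build a bitmap index over a low-cardinality attribute.
records = ["H", "S", "H", "H", "S"]           # e.g. a pay_type column

bitmaps = {}
for pos, value in enumerate(records):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << pos)   # set this row's bit

# The predicate pay_type = 'H' is answered from the bit vector alone;
# AND/OR of predicates become bitwise & and | on the vectors.
matches = bitmaps["H"]
rows = [pos for pos in range(len(records)) if matches & (1 << pos)]
print(rows)                                   # -> [0, 2, 3]
```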
Join indexing method
• The join indexing method gained popularity from its use in
  relational database query processing.
• Join indexing registers the joinable rows of two relations from a
  relational database.
• For example, if two relations R(RID, A) and S(B, SID) join on the
  attributes A and B, then the join index record contains the pair
  (RID, SID), where RID and SID are record identifiers from the R and
  S relations, respectively.
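A minimal sketch in Python; the relations R and S and their values are made up, and the join index is simply the precomputed list of matching (RID, SID) pairs:

```python
# Sketch: a join index precomputes the joinable row pairs of R and S.
R = {"r1": "c1", "r2": "c2", "r3": "c1"}      # RID -> value of attribute A
S = {"s1": "c1", "s2": "c3"}                  # SID -> value of attribute B

join_index = [(rid, sid)
              for rid, a in R.items()
              for sid, b in S.items()
              if a == b]                       # rows joinable on A = B
print(join_index)                              # -> [('r1', 's1'), ('r3', 's1')]
```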
Data Pre-processing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Pre-processing

• Data Cleaning:
  Data cleaning routines work to “clean” the data by filling in
  missing values, smoothing noisy data, identifying or removing
  outliers, and resolving inconsistencies.
• Missing values (methods for filling in missing values; see the
  sketch after this list):
  1. Ignore the tuple.
  2. Fill in the missing value manually.
  3. Use a global constant to fill in the missing value.
  4. Use the attribute mean or median to fill in the missing value.
  5. Use the attribute mean or median for all samples belonging to
     the same class as the given tuple.
  6. Use the most probable value to fill in the missing value.
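A sketch of methods 1, 3, and 4 from the list above, using pandas on a made-up table with one missing income value:

```python
# Sketch: three ways of handling a missing value.
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 41, 29],
                   "income": [30000, None, 52000, 41000]})

dropped  = df.dropna()                                  # 1. ignore the tuple
constant = df.fillna({"income": -1})                    # 3. global constant
by_mean  = df.fillna({"income": df["income"].mean()})   # 4. attribute mean
print(by_mean)
```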
Data Pre-processing

• What is noisy data?
  “Noise is a random error or variance in a measured variable.”
• Binning method: Binning methods smooth a sorted data value by
  consulting its “neighbourhood”, that is, the values around it.
Binning Method
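The worked binning example from this slide's figure is not reproduced; below is a minimal sketch of smoothing by bin means over made-up sorted values, using equal-frequency bins of size 3:

```python
# Sketch: partition sorted data into equal-frequency bins and replace
# each value by its bin's mean (smoothing by bin means).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([round(mean, 1)] * len(bin_vals))

print(smoothed)   # each value replaced by its bin mean
```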
• Regression:
  Data smoothing can also be done by regression, a technique that
  conforms data values to a function. Linear regression involves
  finding the “best” line to fit two attributes (or variables) so
  that one attribute can be used to predict the other (a minimal
  example follows this list).

• Outlier analysis:
  Outliers may be detected by clustering, for example, where similar
  values are organized into groups, or “clusters”. Values that fall
  outside of the set of clusters may be considered outliers.
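A minimal sketch of regression-based smoothing with numpy's least-squares line fit; the x and y values are made up:

```python
# Sketch: fit the "best" straight line to two attributes, then conform
# the values to that line.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])      # roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)    # least-squares line fit
smoothed = slope * x + intercept              # values conformed to the line
print(slope, intercept)
print(smoothed)
```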
Data Integration
• Careful integration can help reduce and avoid redundancies and
  inconsistencies in the resulting data set.
• The semantic heterogeneity and structure of data pose great
  challenges in data integration.
• Entity identification problem: how can the data analyst or the
  computer be sure that customer_id in one database and cust_number
  in another refer to the same attribute?
• Data value conflicts: for the same real-world entity, attribute
  values from different sources may differ (e.g., the data codes for
  pay_type in one database may be “H” and “S”, but 1 and 2 in
  another).
• For example, in one system a discount may be applied to the order,
  whereas in another system it is applied to each individual line
  item within the order. If this is not caught before integration,
  items in the target system may be improperly discounted.
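A minimal sketch of resolving such conflicts before merging two sources, assuming the attribute and code names from the examples above (customer_id / cust_number, and pay_type codes “H”/“S” versus 1/2):

```python
# Sketch: conform codes and key names across two sources, then merge.
import pandas as pd

source_a = pd.DataFrame({"customer_id": [1, 2], "pay_type": ["H", "S"]})
source_b = pd.DataFrame({"cust_number": [3, 4], "pay_type": [1, 2]})

# Conform source A to source B's coding, and the key names to one name.
source_a["pay_type"] = source_a["pay_type"].map({"H": 1, "S": 2})
source_a = source_a.rename(columns={"customer_id": "cust_number"})

integrated = pd.concat([source_a, source_b], ignore_index=True)
print(integrated)
```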
Data Reduction

• Data reduction strategies include dimensionality reduction,
  numerosity reduction, and data compression.
• Dimensionality reduction is the process of reducing the number of
  random variables or attributes under consideration. Methods include
  wavelet transforms and principal components analysis, which
  transform or project the original data into a smaller space (see
  the sketch below).
• Numerosity reduction techniques replace the original data volume by
  alternative, smaller forms of data representation. For parametric
  methods, a model is used to estimate the data, so that typically
  only the data parameters need to be stored, instead of the actual
  data.
• In data compression, transformations are applied so as to obtain a
  reduced or “compressed” representation of the original data.
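A minimal sketch of principal components analysis implemented directly with numpy's SVD on made-up data; real projects would more likely use a library implementation:

```python
# Sketch: project 5-attribute data onto its top 2 principal components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 tuples, 5 attributes

X_centered = X - X.mean(axis=0)               # PCA requires centered data
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                                         # keep 2 principal components
X_reduced = X_centered @ Vt[:k].T             # project into a smaller space
print(X.shape, "->", X_reduced.shape)         # (100, 5) -> (100, 2)
```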
Data Transformation

1. Smoothing, which works to remove noise from the data. Techniques
   include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new
   attributes are constructed and added from the given set of
   attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied
   to the data. For example, the daily sales data may be aggregated
   so as to compute monthly and annual total amounts. This step is
   typically used in constructing a data cube for data analysis at
   multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall
   within a smaller range, such as 0.0 to 1.0 (see the sketch below).
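A minimal sketch of min-max normalization to the range [0.0, 1.0], with made-up values:

```python
# Sketch: rescale an attribute so its values fall within [0.0, 1.0].
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # -> [0.0, 0.125, 0.25, 0.5, 1.0]
```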
Data Transformation

5. Discretization, where the raw values of a numeric attribute (e.g.,
   age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or
   conceptual labels (e.g., youth, adult, senior). The labels, in
   turn, can be recursively organized into higher-level concepts,
   resulting in a concept hierarchy for the numeric attribute.
   Figure 3.12 shows a concept hierarchy for the attribute price.
   More than one concept hierarchy can be defined for the same
   attribute to accommodate the needs of various users.

6. Concept hierarchy generation for nominal data, where attributes
   such as street can be generalized to higher-level concepts, like
   city or country. Many hierarchies for nominal attributes are
   implicit within the database schema and can be automatically
   defined at the schema definition level.
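A minimal sketch of discretizing age into interval labels and conceptual labels with pandas.cut; the bin boundaries are illustrative assumptions:

```python
# Sketch: replace raw ages by interval labels, then conceptual labels.
import pandas as pd

ages = pd.Series([5, 16, 23, 45, 70])

intervals = pd.cut(ages, bins=[0, 10, 20, 64, 120])        # interval labels
concepts  = pd.cut(ages, bins=[0, 20, 64, 120],
                   labels=["youth", "adult", "senior"])    # conceptual labels
print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```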
Practical Assignment 1:
Linear Regression
• Load the regression11.arff file.
• Click on the Choose button, then expand functions.
• Select LinearRegression.
• Select "Use training set" with selling price as the class attribute.
• Click Start.
• Observe the analysis.
Practical Assignment 2:
Decision Tree Classification (J48)
• Load the iris.arff file.
• Click on the Choose button, then expand trees.
• Select J48.
• Select "Percentage split" (66%).
• Click Start.
• Right-click the result and choose "Visualize tree".
• Observe the analysis.
