Data Mining Unit 1
Unit-I
• What is Data Warehouse?
• A Multidimensional Data Model
• Data Warehouse Architecture
• Data Warehouse Implementation
• Data cube Technology
• From data warehousing to data mining
• Data mining functionalities
• Data Cleaning, Data Integration
• Transformation, Data Reduction
What is Data Warehouse?
• Snowflake schema:
1) A snowflake schema is a variant of the star schema model.
2) Some dimension tables are normalized, which further splits the data
into additional tables.
3) The resulting schema graph has a shape similar to a snowflake.
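The normalization step can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (sales, item, supplier, units_sold, and so on) and the sample rows are assumptions made for the example, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized part of the item dimension: supplier details get their own table.
    CREATE TABLE supplier (
        supplier_key  INTEGER PRIMARY KEY,
        supplier_type TEXT
    );
    -- The item dimension keeps only a key that points at supplier.
    CREATE TABLE item (
        item_key     INTEGER PRIMARY KEY,
        item_name    TEXT,
        supplier_key INTEGER REFERENCES supplier(supplier_key)
    );
    -- Central fact table, as in a star schema.
    CREATE TABLE sales (
        item_key   INTEGER REFERENCES item(item_key),
        units_sold INTEGER
    );
    INSERT INTO supplier VALUES (1, 'wholesale');
    INSERT INTO item VALUES (10, 'laptop', 1), (11, 'mouse', 1);
    INSERT INTO sales VALUES (10, 3), (11, 7), (10, 2);
""")

# Reaching supplier_type takes one more join than it would in a star schema,
# where supplier_type would be a column of the item table itself.
rows = conn.execute("""
    SELECT s.supplier_type, SUM(f.units_sold)
    FROM sales f
    JOIN item i     ON f.item_key = i.item_key
    JOIN supplier s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_type
""").fetchall()
print(rows)
```

The trade-off shown here is the usual one: the normalized tables avoid repeating supplier details in every item row, at the cost of an extra join at query time.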
Stars, Snowflakes, and Fact Constellations : Schemas for Multidimensional Data Model
• Fact Constellation:
1) Sophisticated applications may require multiple fact tables to share
dimension tables.
2) It can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
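A small sketch of two fact tables sharing one dimension table, again using sqlite3. The names sales, shipping, and item, and all sample values, are hypothetical and chosen only to illustrate the shape of the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension table shared by both fact tables.
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
    -- Two fact tables, each the center of its own "star".
    CREATE TABLE sales    (item_key INTEGER REFERENCES item(item_key), units_sold INTEGER);
    CREATE TABLE shipping (item_key INTEGER REFERENCES item(item_key), units_shipped INTEGER);
    INSERT INTO item VALUES (10, 'laptop'), (11, 'mouse');
    INSERT INTO sales VALUES (10, 5), (11, 7);
    INSERT INTO shipping VALUES (10, 4), (11, 7);
""")

# Each fact table is aggregated on its own, then lined up through the shared dimension.
shared = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(units_sold)    FROM sales s    WHERE s.item_key = i.item_key),
           (SELECT SUM(units_shipped) FROM shipping h WHERE h.item_key = i.item_key)
    FROM item i
    ORDER BY i.item_key
""").fetchall()
print(shared)
```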
Data Warehouse Architecture
( A Three-tier data warehousing architecture )
• Back-end tools and utilities are used to feed data into
the bottom tier from operational databases or other
external sources.
Extract
Clean
Transform
Load
Refresh
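The five functions listed above can be sketched as a tiny pipeline in plain Python. This is a toy illustration under stated assumptions: the "operational database" is a list of dicts, the warehouse is a dict keyed by id, and the field names (id, region, amount) are invented for the example.

```python
def extract(source):
    """Gather rows from a source (here, simply a list of dicts)."""
    return list(source)

def clean(rows):
    """Drop rows with a missing amount; a stand-in for real error detection."""
    return [r for r in rows if r.get("amount") is not None]

def transform(rows):
    """Convert rows to the warehouse format: upper-case region, numeric amount."""
    return [
        {"id": r["id"], "region": r["region"].strip().upper(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(warehouse, rows):
    """Insert rows into the warehouse, keyed by id."""
    for r in rows:
        warehouse[r["id"]] = r

def refresh(warehouse, source):
    """Propagate source updates by re-running the pipeline; rows overwrite by id."""
    load(warehouse, transform(clean(extract(source))))

operational_db = [
    {"id": 1, "region": " east ", "amount": "19.50"},
    {"id": 2, "region": "west", "amount": None},   # removed by clean()
]
warehouse = {}
load(warehouse, transform(clean(extract(operational_db))))

# A new row appears in the source later; refresh brings it into the warehouse.
operational_db.append({"id": 3, "region": "north", "amount": "5"})
refresh(warehouse, operational_db)
print(warehouse)
```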
• Enterprise Warehouse:
It collects all of the information about subjects spanning the entire
organization and provides corporate-wide data integration.
It contains detailed data as well as summarized data, and can range in size
from hundreds of gigabytes to terabytes or beyond.
• Data Mart:
It contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects
(e.g., a marketing data mart may confine its subjects to customer, item, and sales).
It contains detailed data as well as summarized data, and can range in size
from hundreds of gigabytes to terabytes or beyond.
It is usually implemented on low-cost departmental servers that are Unix/Linux or
Windows based.
The implementation cycle of a data mart is more likely to be measured in
weeks rather than months or years.
Data Warehouse models:
(Enterprise Warehouse, Data Mart, and Virtual Warehouse)
• Virtual Warehouse:
• Data Cleaning :
• Data cleaning routines work to “clean” the data by filling in
missing values, smoothing noisy data, identifying or removing
outliers, and resolving inconsistencies.
• Missing values : (Methods for filling missing values)
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean or median (over all samples) to fill in the missing value
5. Use the attribute mean or median for all samples belonging
to the same class as the given tuple.
6. Use the most probable value to fill in the missing value
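Methods 1, 3, 4, and 5 above can be sketched with the standard library alone. The sample data, the attribute names (class, income), and the constant "Unknown" are assumptions made for illustration. Method 2 (manual entry) and method 6 (which needs a predictive model such as regression or a decision tree) are not shown.

```python
from statistics import mean, median

rows = [
    {"class": "low",  "income": 20.0},
    {"class": "low",  "income": None},
    {"class": "high", "income": 80.0},
    {"class": "high", "income": 100.0},
    {"class": "high", "income": None},
]

def known(rows, attr):
    """Return the non-missing values of an attribute."""
    return [r[attr] for r in rows if r[attr] is not None]

def fill(rows, attr, value_for):
    """Return the attribute column, replacing missing values via value_for(row)."""
    return [r[attr] if r[attr] is not None else value_for(r) for r in rows]

# Method 1: ignore tuples that have a missing value.
ignored = [r for r in rows if r["income"] is not None]

# Method 3: fill with a global constant.
constant_filled = fill(rows, "income", lambda r: "Unknown")

# Method 4: fill with the attribute mean or median over all samples.
overall_mean = mean(known(rows, "income"))
mean_filled = fill(rows, "income", lambda r: overall_mean)
median_filled = fill(rows, "income", lambda r: median(known(rows, "income")))

# Method 5: fill with the mean of samples in the same class as the tuple.
def class_mean(cls):
    return mean(known([r for r in rows if r["class"] == cls], "income"))

class_filled = fill(rows, "income", lambda r: class_mean(r["class"]))

print(mean_filled)
print(class_filled)
```

In this toy data the class-based fill gives different values for the two missing entries, while the global mean gives both the same value, which is the practical difference between methods 4 and 5.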
Data Pre-processing
• Outlier analysis:
Outliers may be detected by clustering, for
example, where similar values are organized into
groups, or “clusters”. Values that fall
outside of the set of clusters may be considered
outliers.
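A minimal sketch of this idea for one-dimensional data. The clustering rule used here (start a new cluster whenever the gap to the previous sorted value exceeds max_gap) is a deliberately simple stand-in chosen for the example, not a method named in the slides. The parameters max_gap and min_size and the sample prices are likewise assumptions.

```python
def cluster_1d(values, max_gap):
    """Group sorted values into clusters; a gap larger than max_gap starts a new cluster."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size):
    """Treat values in clusters smaller than min_size as outliers."""
    outliers = []
    for cluster in cluster_1d(values, max_gap):
        if len(cluster) < min_size:
            outliers.extend(cluster)
    return outliers

prices = [10, 11, 12, 13, 50, 51, 52, 200]
clusters = cluster_1d(prices, max_gap=3)
outliers = find_outliers(prices, max_gap=3, min_size=2)
print(clusters)
print(outliers)
```

Here 200 ends up alone in its own cluster, so it is the only value flagged. Whether such a value is noise or a genuinely interesting case still has to be judged by the analyst.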
Data Integration
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
• The semantic heterogeneity and structure of data pose great
challenges in data integration.
• For example, how can the data analyst or the computer be sure
that customer_id in one database and cust_number in another
refer to the same attribute?
• Data values may also be coded differently (e.g., the codes for pay_type in one database may be “H”
and “S” but 1 and 2 in another).
• For example, in one system, a discount may be applied to the
order, whereas in another system it is applied to each individual
line item within the order. If this is not caught before integration,
items in the target system may be improperly discounted.
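The attribute-name and data-code mismatches described above can be resolved with explicit mapping tables before the sources are merged. The sketch below is hypothetical throughout: the records are invented, and the correspondence 1 → "H", 2 → "S" is an assumption for the example. In practice such mappings come from the metadata of each source.

```python
# Hypothetical records from two source databases.
db_a = [{"customer_id": 1, "pay_type": "H"}, {"customer_id": 2, "pay_type": "S"}]
db_b = [{"cust_number": 3, "pay_type": 1}, {"cust_number": 4, "pay_type": 2}]

# Mapping tables for source B (assumed here; normally taken from metadata).
RENAME_B = {"cust_number": "customer_id"}
PAY_TYPE_B = {1: "H", 2: "S"}

def normalize_b(record):
    """Rename B's attributes and recode pay_type to match the conventions of A."""
    out = {RENAME_B.get(name, name): value for name, value in record.items()}
    out["pay_type"] = PAY_TYPE_B[out["pay_type"]]  # raises KeyError on an unknown code
    return out

integrated = db_a + [normalize_b(r) for r in db_b]
print(integrated)
```

Letting an unknown code raise an error, rather than passing it through silently, is one way to catch a mismatch before it reaches the target system.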
Data Reduction