
UNIT-II (DWDM)

2.1 What is Data Warehouse?


"A warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision making process".
 Subject-Oriented: Data that gives information about a particular subject
instead of about a company's ongoing operations.
 Integrated: Data that is gathered into the data warehouse from a variety
of sources and merged into a coherent whole.
 Time-variant: All data in the data warehouse is identified with a
particular time period.
 Non-volatile: Data is stable in a data warehouse. More data is
added but data is never removed.
Benefits of data warehousing:
 Data warehouses are designed to perform well with aggregate queries
running on large amounts of data.
 The structure of data warehouses is easier for end users to navigate,
understand, and query against than that of relational databases, which are
primarily designed to handle large volumes of transactions.
 Data warehouses enable queries that cut across different segments of a
company's operation. E.g. production data could be compared against
inventory data even if they were originally stored in different databases
with different structures.
 Queries that would be complex in highly normalized databases can be
easier to build and maintain in data warehouses, decreasing the workload
on transaction systems.
 Data warehousing is an efficient way to manage and report on data that
comes from a variety of sources and is non-uniform and scattered
throughout a company.
2.2 Differences between Operational Database Systems and Data
Warehouses
Features of OLTP and OLAP

The major distinguishing features between OLTP and OLAP are summarized
as follows.

1. Users and system orientation: An OLTP system is customer-oriented and is
used for transaction and query processing by clerks, clients, and information
technology professionals. An OLAP system is market-oriented and is used for data
analysis by knowledge workers, including managers, executives, and analysts.

2. Data contents: An OLTP system manages current data that, typically, are too
detailed to be easily used for decision making. An OLAP system manages large
amounts of historical data, provides facilities for summarization and aggregation,
and stores and manages information at different levels of granularity. These
features make the data easier to use in informed decision making.

3. Database design: An OLTP system usually adopts an entity-relationship (ER)
data model and an application-oriented database design. An OLAP system
typically adopts either a star or a snowflake model and a subject-oriented database
design.

4. View: An OLTP system focuses mainly on the current data within an enterprise
or department, without referring to historical data or data in different
organizations. In contrast, an OLAP system often spans multiple versions of a
database schema. OLAP systems also deal with information that originates from
different organizations, integrating information from many data stores. Because of
their huge volume, OLAP data are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of
short, atomic transactions. Such a system requires concurrency control and
recovery mechanisms. However, accesses to OLAP systems are mostly read-only
operations, although many of them could be complex queries.
2.3 Data warehouse architecture: The Design of a Data Warehouse: A
Business Analysis Framework
Four different views regarding the design of a data warehouse must be
considered: the top-down view, the data source view, the data warehouse
view, and the business query view.
 The top-down view allows the selection of relevant information
necessary for the data warehouse.
 The data source view exposes the information being captured, stored
and managed by operational systems.
 The data warehouse view includes fact tables and dimension tables
 Finally, the business query view is the perspective of data in the data
warehouse from the viewpoint of the end user.

Three-tier Data warehouse architecture

The bottom tier is the warehouse database server, which is almost always a
relational database system. The middle tier is an OLAP server, which is typically
implemented using either (1) a Relational OLAP (ROLAP) model or (2) a
Multidimensional OLAP (MOLAP) model. The top tier is a client, which contains
query and reporting tools, analysis tools, and/or data mining tools (e.g., trend
analysis, prediction, and so on).
2.4 Data warehouse Models: From the architecture point of view, there are
three data warehouse models: the enterprise warehouse, the data mart, and the
virtual warehouse.

 Enterprise warehouse: An enterprise warehouse collects all of the
information about subjects spanning the entire organization. It provides
corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can
range in size from a few gigabytes to hundreds of gigabytes, terabytes, or
beyond.
 Data mart: A data mart contains a subset of corporate-wide data that is of
value to a specific group of users. The scope is connected to specific,
selected subjects. For example, a marketing data mart may connect its
subjects to customer, item, and sales. The data contained in data marts tend
to be summarized. Depending on the source of data, data marts can be
categorized into the following two classes:

(i).Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data
generated locally within a particular department or geographic area.

(ii). Dependent data marts are sourced directly from enterprise data
warehouses.

 Virtual warehouse: A virtual warehouse is a set of views over operational
databases. For efficient query processing, only some of the possible
summary views may be materialized. A virtual warehouse is easy to build
but requires excess capacity on operational database servers.
2.5 ETL (Data warehouse Back-End Tools and Utilities)
The ETL (Extract, Transform, Load) process comprises the four major processes of
the data warehouse. They are extract (data from the operational systems and bring
it to the data warehouse), transform (the data into the internal format and structure
of the data warehouse), cleanse (to make sure it is of sufficient quality to be used
for decision making), and load (the cleansed data is put into the data warehouse).
The four processes from extraction through loading are often referred to
collectively as Data Staging.
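The four staging steps can be sketched as plain functions. This is a minimal sketch, assuming hypothetical source field names (EMP_NAME, SALES) and an in-memory list standing in for the warehouse:

```python
# Minimal ETL sketch: extract -> transform -> cleanse -> load.
# Source rows and field names are hypothetical examples.

def extract(source_rows):
    # Pull only the fields relevant for decision making.
    return [{"emp": r["EMP_NAME"], "sales": r["SALES"]} for r in source_rows]

def transform(rows):
    # Convert to the warehouse's internal format (uppercase names,
    # numeric sales figures).
    return [{"emp": r["emp"].upper(), "sales": float(r["sales"])} for r in rows]

def cleanse(rows):
    # Drop rows with missing names or clearly invalid values.
    return [r for r in rows if r["emp"] and r["sales"] >= 0]

def load(rows, warehouse):
    # Append cleansed rows to the warehouse store.
    warehouse.extend(rows)

warehouse = []
source = [
    {"EMP_NAME": "smith", "SALES": "120.5"},
    {"EMP_NAME": "", "SALES": "40"},   # missing name, removed by cleansing
]
load(cleanse(transform(extract(source))), warehouse)
```

In a real system each step would read from and write to databases over a network; the pipeline shape, however, stays the same.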

EXTRACT: Some of the data elements in the operational database can reasonably be
expected to be useful in decision making, but others are of less value for that purpose.
For this reason, it is necessary to extract the relevant data from the operational database
before bringing it into the data warehouse. Many commercial tools are available to help
with the extraction process; Data Junction is one such commercial product. The user of
one of these tools typically has an easy-to-use windowed interface by which to specify
the following:

(i) Which files and tables are to be accessed in the source database?
(ii) Which fields are to be extracted from them? This is often done
internally by an SQL SELECT statement.
(iii) What are those fields to be called in the resulting database?
(iv) What is the target machine and database format of the output?
(v) On what schedule should the extraction process be repeated?

TRANSFORM: The operational databases can be based on any set of priorities,
which keep changing with the requirements. Therefore, those who develop a data
warehouse based on these databases are typically faced with inconsistency among
their data sources. The transformation process deals with rectifying any such
inconsistencies.
One of the most common transformation issues is ‘Attribute Naming
Inconsistency’. It is common for a given data element to be referred to by
different names in different databases: Employee Name may be EMP_NAME
in one database and ENAME in another. Thus one set of data names is picked and
used consistently in the data warehouse. Once all the data elements have the right
names, they must be converted to common formats. The conversion may
encompass the following:

(i) Characters must be converted from ASCII to EBCDIC or vice versa.
(ii) Mixed text may be converted to all uppercase for consistency.
(iii) Numerical data must be converted into a common format.
(iv) Data formats have to be standardized.
(v) Measurement units may have to be converted (e.g., Rs to $).
(vi) Coded data (Male/Female, M/F) must be converted into a common
format.
All these transformation activities are automated and many commercial products
are available to perform the tasks. DataMAPPER from Applied Database
Technologies is one such comprehensive tool.
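The naming and format conversions above can be sketched with simple lookup tables. The mappings and field names below are illustrative assumptions, not taken from DataMAPPER or any specific product:

```python
# Transformation sketch: unify attribute names and value formats.
# NAME_MAP resolves naming inconsistency (EMP_NAME vs ENAME);
# CODE_MAP standardizes coded data (M/F vs Male/Female).

NAME_MAP = {"EMP_NAME": "employee_name", "ENAME": "employee_name"}
CODE_MAP = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}

def transform_row(row):
    out = {}
    for field, value in row.items():
        std = NAME_MAP.get(field, field.lower())   # pick the standard name
        if std == "gender":
            value = CODE_MAP[value]                # common code format
        elif isinstance(value, str):
            value = value.upper()                  # mixed text -> uppercase
        out[std] = value
    return out

r1 = transform_row({"EMP_NAME": "alice", "GENDER": "F"})
r2 = transform_row({"ENAME": "bob", "GENDER": "Male"})
```

Both source rows end up with the same attribute names and code formats, which is exactly what the warehouse needs before loading.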

CLEANSING: Information quality is the key consideration in determining the
value of the information. The developer of the data warehouse is not usually in a
position to change the quality of its underlying historic data, though a data
warehousing project can put a spotlight on data quality issues and lead to
improvements for the future. It is, therefore, usually necessary to go through the
data entered into the data warehouse and make it as error-free as possible. This
process is known as Data Cleansing.

Data Cleansing must deal with many types of possible errors. These include
missing data and incorrect data at one source, and inconsistent data and conflicting
data when two or more sources are involved. There are several algorithms for
cleaning the data, which will be discussed in the coming lecture notes.
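Two of the error types above, missing data at one source and conflicting data across two sources, can be sketched with made-up rows:

```python
# Cleansing sketch: detect missing values in one source and
# conflicting values between two sources. All data is made up.

def find_missing(rows, required):
    # Rows where any required field is absent or empty.
    return [r for r in rows if any(r.get(f) in (None, "") for f in required)]

def find_conflicts(source_a, source_b, key, field):
    # Keys present in both sources whose field values disagree.
    b_index = {r[key]: r[field] for r in source_b}
    return [r[key] for r in source_a
            if r[key] in b_index and r[field] != b_index[r[key]]]

a = [{"id": 1, "city": "Pune"}, {"id": 2, "city": ""}]
b = [{"id": 1, "city": "Mumbai"}]

missing = find_missing(a, ["city"])       # id 2 has an empty city
conflicts = find_conflicts(a, b, "id", "city")  # id 1 disagrees across sources
```

Resolving the flagged rows (e.g., preferring the more trusted source) is where the cleansing algorithms mentioned above come in.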

LOADING: Loading often implies physical movement of the data from the
computer(s) storing the source database(s) to the one that will store the data
warehouse database, assuming it is different. This takes place immediately after the
extraction phase. The most common channel for data movement is a high-speed
communication link. Ex: Oracle Warehouse Builder is a tool from Oracle that
provides the features to perform the ETL tasks on an Oracle data warehouse.
2.6 Metadata Repository: Metadata are data about data. When used in a data
warehouse, metadata are the data that define warehouse objects. Metadata are
created for the data names and definitions of the given warehouse. Additional
metadata are created and captured for time-stamping any extracted data, the source
of the extracted data, and missing fields that have been added by data cleaning or
integration processes. A metadata repository should contain:

 A description of the structure of the data warehouse. This includes the
warehouse schema, views, dimensions, hierarchies, and derived data
definitions, as well as data mart locations and contents;
 Operational metadata, which include data lineage (history of migrated data
and the sequence of transformations applied to it), currency of data (active,
archived, or purged), and monitoring information (warehouse usage
statistics, error reports, and audit trails);
 The algorithms used for summarization, which include measure and
dimension definition algorithms, data on granularity, partitions, subject
areas, aggregation, summarization, and predefined queries and reports;
 The mapping from the operational environment to the data warehouse,
which includes source databases and their contents, gateway descriptions,
data partitions, data extraction, cleaning, transformation rules and defaults,
data refresh and purging rules, and security (user authorization and access
control).
 Data related to system performance, which include indices and profiles that
improve data access and retrieval performance, in addition to rules for the
timing and scheduling of refresh, update, and replication cycles; and
 Business metadata, which include business terms and definitions, data
ownership information, and charging policies.

2.7 A Multidimensional Data Model: Data Cube

The most popular data model for data warehouses is a multidimensional
model. This model can exist in the form of a star schema, a snowflake schema, or a
fact constellation schema. Let's have a look at each of these schema types.
 A data cube allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts
 A data cube contains dimension tables and fact tables. For sales:
o Dimension tables, such as item (item_name, brand, type), time
(day, week, month, quarter, year), branch, and location
o A fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
 An n-D base cube is called a base cuboid. The topmost 0-D cuboid, which
holds the highest level of summarization, is called the apex cuboid.
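A minimal sketch of cuboids over a 2-D sales cube, using made-up rows and the dollars_sold measure mentioned above; aggregating over all dimensions gives the base cuboid, and aggregating over none of them gives the apex cuboid:

```python
# Cuboid sketch for a 2-D cube over (item, location) with the
# measure dollars_sold. The rows are made-up figures.

facts = [
    {"item": "TV", "location": "Delhi", "dollars_sold": 100},
    {"item": "TV", "location": "Pune", "dollars_sold": 50},
    {"item": "PC", "location": "Delhi", "dollars_sold": 70},
]

def cuboid(rows, dims):
    # Aggregate dollars_sold, grouping by the chosen dimensions.
    agg = {}
    for r in rows:
        key = tuple(r[d] for d in dims)
        agg[key] = agg.get(key, 0) + r["dollars_sold"]
    return agg

base = cuboid(facts, ["item", "location"])  # 2-D base cuboid
apex = cuboid(facts, [])                    # 0-D apex cuboid: grand total
```

For an n-dimensional cube there are 2^n such cuboids, one per subset of the dimensions.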

2.7.1 Schemas for Multidimensional Databases: Stars, Snowflakes, and Fact
Constellations

Star schema: The star schema is a modeling paradigm in which the data warehouse
contains (1) a large central table (fact table), and (2) a set of smaller attendant tables
(dimension tables), one for each dimension. The schema graph resembles a starburst,
with the dimension tables displayed in a radial pattern around the central fact table.
Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form. Such
tables are easy to maintain and save storage space, because a dimension table can
become extremely large when the dimensional structure is included as redundant
columns.

Fact constellation: Sophisticated applications may require multiple fact tables to
share dimension tables. This kind of schema can be viewed as a collection of stars,
and hence is called a galaxy schema or a fact constellation.
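The schema types above differ only in how dimension tables are arranged around the central fact table. A minimal star-schema sketch, with hypothetical keys and contents:

```python
# Star-schema sketch: one central fact table whose foreign keys point
# to small dimension tables. Keys and contents are illustrative only.

dim_item = {1: {"item_name": "TV", "brand": "Acme", "type": "electronics"}}
dim_time = {10: {"day": 5, "month": "Jan", "quarter": "Q1", "year": 2024}}

# Fact table: foreign keys into each dimension plus the measure.
fact_sales = [
    {"item_key": 1, "time_key": 10, "dollars_sold": 250.0},
]

def resolve(fact):
    # Join a fact row to its dimension rows (a star-join in miniature).
    return {**dim_item[fact["item_key"]],
            **dim_time[fact["time_key"]],
            "dollars_sold": fact["dollars_sold"]}

row = resolve(fact_sales[0])
```

A snowflake schema would further split dim_item, e.g. moving brand into its own normalized table keyed from the item dimension; a fact constellation would add a second fact table (say, shipping) sharing dim_item and dim_time.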
2.8 OLAP operations on multidimensional data.

1. Roll-up: The roll-up operation performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by dimension reduction.
Figure shows the result of a roll-up operation performed on the central cube by
climbing up the concept hierarchy for location. This hierarchy was defined as the
total order street < city < province_or_state < country.

2. Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed
data to more detailed data. Drill-down can be realized by either stepping down a
concept hierarchy for a dimension or introducing additional dimensions. Figure
shows the result of a drill-down operation performed on the central cube by
stepping down a concept hierarchy for time defined as day < month < quarter <
year. Drill-down occurs by descending the time hierarchy from the level of quarter
to the more detailed level of month.
3. Slice and dice: The slice operation performs a selection on one dimension of
the given cube, resulting in a subcube. Figure shows a slice operation where the
sales data are selected from the central cube for the dimension time using the
criterion time = "Q2". The dice operation defines a subcube by performing a
selection on two or more dimensions.
4. Pivot (rotate): Pivot is a visualization operation which rotates the data axes in
view in order to provide an alternative presentation of the data. Figure shows a
pivot operation where the item and location axes in a 2-D slice are rotated.
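The operations above can be sketched over a tiny in-memory cube; the rows and the location hierarchy (city < country) are made-up examples:

```python
# OLAP operation sketches: roll-up on the location hierarchy,
# slice on one dimension, dice on two. All data is made up.

cube = [
    {"time": "Q1", "city": "Delhi", "country": "India", "sales": 100},
    {"time": "Q2", "city": "Pune", "country": "India", "sales": 50},
    {"time": "Q1", "city": "Oslo", "country": "Norway", "sales": 30},
]

def roll_up(rows, level):
    # Climb the location hierarchy (e.g., city -> country) and aggregate.
    agg = {}
    for r in rows:
        agg[r[level]] = agg.get(r[level], 0) + r["sales"]
    return agg

def slice_op(rows, dim, value):
    # Select on one dimension, yielding a subcube.
    return [r for r in rows if r[dim] == value]

def dice(rows, criteria):
    # Select on two or more dimensions at once.
    return [r for r in rows if all(r[d] == v for d, v in criteria.items())]

by_country = roll_up(cube, "country")            # roll-up: city -> country
q1 = slice_op(cube, "time", "Q1")                # slice: time = "Q1"
diced = dice(cube, {"time": "Q1", "country": "India"})
```

Drill-down is the inverse navigation (country back to city), and pivot only changes which axes are displayed, so neither changes the stored data.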
