
Unit – 3
Data Warehouse Modelling and Online Analytical Processing II
Business Analysis Framework :
Business analysts use the information in the data warehouse to measure performance and make critical adjustments in order to stay ahead of other businesses in the market.
Having a data warehouse offers the following advantages −
• Since a data warehouse can gather information quickly and efficiently, it can
enhance business productivity.
• A data warehouse provides a consistent view of customers and items and hence helps us manage customer relationships.
• A data warehouse also helps in bringing down costs by tracking trends and patterns over a long period in a consistent and reliable manner.
Data Warehouse Design :
• Good Business Intelligence (BI) allows your organization to query data obtained from trusted sources and use the answers to gain a competitive edge in your industry.
• The first step to achieving effective BI is a well-designed warehouse.
• Data warehouse design is the process of building a solution to integrate data
from multiple sources that support analytical reporting and data analysis.
• A poorly designed data warehouse can result in acquiring and using inaccurate source data that negatively affects the productivity and growth of your organization.
• Data warehouse design and usage deals with how data warehousing can be used for information processing, analytical processing, and data mining.
• To design an effective and efficient data warehouse, we need to understand
and analyze the business needs and construct a business analysis framework.
• Different stakeholders have different views regarding the design of a data warehouse.
• These views are as follows −
The top-down view − This view allows the selection of relevant information
needed for a data warehouse.
The data source view − This view presents the information being captured,
stored, and managed by the operational system.
The data warehouse view − This view includes the fact tables and dimension
tables. It represents the information stored inside the data warehouse.
The business query view − It is the view of the data from the viewpoint of the
end-user.
Important steps in Data warehouse design :

Requirements Gathering :
• Gathering requirements is step one of the data warehouse design process.
• The goal of the requirements gathering phase is to determine the criteria for a
successful implementation of the data warehouse.
• An organization's long-term business strategy should be just as important as
the current business and technical requirements.
• User analysis and reporting requirements must be identified.
• Hardware, development, testing, implementation, and user training are also
important.
• Once the business and technical strategy has been decided, the next step is to address how the organization will back up the data warehouse and how it will recover if the system fails.
• Developing a disaster recovery plan while gathering requirements ensures that the organization is prepared to respond quickly to direct and indirect threats to the data warehouse.
Physical Environment Setup :
• Once the business requirements are set, the next step is to determine the
physical environment for the data warehouse.
• There should be separate physical application and database servers as well as
separate ETL/ELT, OLAP, data cube, and reporting processes set up for
development, testing, and production.
• Building separate physical environments ensures that all changes can be tested in development and testing before being moved to production.
• Testing can be done without halting the production environment.
• If a data integrity problem occurs, the IT staff can investigate the issue without
negatively impacting the production environment.
Data Modeling :
• Once requirements gathering and physical environments have been defined, the
next step is to define how data structures will be accessed, connected, processed,
and stored in the data warehouse. This process is known as data modeling.
• During this phase of data warehouse design, it is important to identify data
sources. Knowing where the original data resides and just as importantly, the
availability of that data, is crucial to the success of the project.
• Once the data sources have been identified, the data warehouse team can begin
building the logical and physical structures based on established requirements.
• The ETL process occurs after identifying the data sources.
OLAP Cube Design :
• On-Line Analytical Processing (OLAP) provides the infrastructure for ad-hoc user
query and multi-dimensional analysis.
• OLAP design specification should come from users who will query the data.
• Documentation specifying the OLAP cube dimensions and measures should be
obtained during the beginning of data warehouse design process.
• The three critical elements of OLAP design include :-
• Grouping measures - the numerical values you want to analyze, such as revenue, number of customers, how many products customers purchase, or average purchase amount.
• Dimensions - the perspectives along which measures are analyzed, such as geographic region, month, or quarter.
• Granularity - the lowest level of detail that you want to include in the OLAP dataset.
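As a minimal sketch of how these three elements might translate into a star-schema fact table, consider the following SQL; all table and column names here are hypothetical:

    -- Measures (revenue, units_sold) are numeric columns in the fact table;
    -- dimension keys (region_id, month_id) reference dimension tables;
    -- the granularity is one row per region per month.
    CREATE TABLE sales_fact (
        region_id   INT NOT NULL,      -- dimension: geographic region
        month_id    INT NOT NULL,      -- dimension: calendar month
        revenue     DECIMAL(12,2),     -- measure
        units_sold  INT                -- measure
    );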
Front End Development :
• Once business requirements have been captured, the physical environment is complete, the data model has been decided, and the ETL process has been documented, the next step is to work on how users will access the data warehouse.
• Front end development is how users will access the data for analysis and run
reports.
• There are many options available, including building your front end in-house or purchasing an off-the-shelf product.
• To ensure the best experience for end users we have to provide -
o Secure access to the data from any device (desktop, laptop, tablet, or phone) should be the primary consideration.
o The tool should allow your development team to modify the backend
structure as enterprise level reporting requirements change.
o It should also provide a Graphical User Interface (GUI) that enables
users to customize their reports as needed.
o The OLAP engine and data can be the best in class, but if users are not
able to use the data, the data warehouse becomes an expensive and
useless data repository.
Report Development :
• For most end users, the only contact they have with the data warehouse is
through the reports they generate.
• Users’ ability to select their report criteria quickly and efficiently is an essential
feature for data warehouse report generation.
• Along with receiving reports through a secure web interface, users may want
or need reports sent as an email attachment, or spreadsheet.
• Controlling the flow and visibility of data is another aspect of report
development.
• Developing user groups with access to specific data segments should provide
data security and control.
• Reporting requirements will change after the initial implementation.
• A well-designed data warehouse should be able to handle the new reporting
requests with little to no data warehouse system modification.
Testing :
• Once the data warehouse system has been developed according to
business requirements, the next step is to test it.
• Testing, or quality assurance, is a step that should not be skipped because
it will allow the data warehouse team to expose and solve issues before
the initial rollout.
• Failing to complete the testing phase could lead to implementation delays
or termination of the data warehouse project.
Deployment :
• Time to go live.
• Deciding to make the system available to everyone at once or perform a
staggered release, will depend on the number of end users and how they will
access the data warehouse system.
• Another important aspect of any system implementation is end-user training.
• The data warehouse team and developers may think the GUI is good, but if the actual end users find the tool difficult to use, or do not understand the benefits of using the data warehouse for reporting and analysis, it is of no use.
Building a data warehouse requires three types of skills :
Business skills :
• Understanding how operational systems store and manage data
• Building extractors that can transfer data from operational systems to the data warehouse
• Building refresh software that can keep the data warehouse up to date
Technology skills :
• Making assessments from quantitative information
• Deriving facts from historical information
• Discovering patterns and trends and looking for anomalies
Program management skills :
• Interfacing with many technologies, vendors, and end users to deliver results in a cost-effective and timely manner
Data warehouse usage for information processing :

• Data warehouses and data marts are used in a wide range of applications.
• Business executives use the data in data warehouses and data marts to perform
data analysis and make strategic decisions.
• Business users need to know what exists in the data warehouse (through
metadata), how to access the contents of the data warehouse, how to examine
the contents using analysis tools, and how to present the results of such
analysis.
There are three kinds of data warehouse applications −
Information processing includes −
a) Querying
b) Statistical analysis
c) Reporting using crosstabs, tables, charts, and graphs
Analytical processing includes −
a) OLAP operations: roll-up, drill-down, slice and dice (see the roll-up query sketch after this list)
b) Operations on historical data in both summarized and detailed forms
Data mining includes −
a) Finding hidden patterns and associations
b) Constructing analytical models
c) Classification and prediction
d) Presenting mining results using visualization tools
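As a concrete illustration of an analytical-processing query, the sketch below uses the standard SQL:1999 ROLLUP grouping (supported by Oracle, SQL Server, and PostgreSQL) to compute roll-up aggregates along a location hierarchy; the table and column names are hypothetical:

    -- Rolls up along the location hierarchy:
    -- (country, province_or_state, city) -> (country, province_or_state)
    -- -> (country) -> grand total.
    SELECT country, province_or_state, city,
           SUM(dollars_sold) AS total_sales
    FROM   sales
    GROUP BY ROLLUP (country, province_or_state, city);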
From Online Analytical Processing to Multidimensional Data Mining :
• The data mining field has conducted substantial research regarding mining on
various data types, including relational data, data from data warehouses,
transaction data, time-series data, spatial data, text data, and flat files.
• Multidimensional data mining (also known as exploratory multidimensional data
mining, online analytical mining, or OLAM) integrates OLAP with data mining to
uncover knowledge in multidimensional databases.
Multidimensional data mining is particularly important for the following reasons:

High quality of data in data warehouses:


• Most data mining tools need to work on integrated, consistent, and
cleaned data, which requires costly data cleaning, data integration, and
data transformation as preprocessing steps.
• A data warehouse constructed by such preprocessing serves as a
valuable source of high-quality data for OLAP as well as for data mining.
Available information processing infrastructure surrounding data warehouses:
• Information processing and data analysis infrastructures are systematically constructed surrounding data warehouses.
• These include accessing, integration, consolidation, and transformation of multiple heterogeneous databases; ODBC/OLE DB connections; Web accessing and service facilities; and reporting and OLAP analysis tools.
• It is good to make the best use of the available infrastructures rather than
constructing everything from scratch.
OLAP-based exploration of multidimensional data:
• Effective data mining needs exploratory data analysis.
• Multidimensional data mining provides facilities for mining on different subsets
of data and at varying levels of abstraction—by drilling, pivoting, filtering,
dicing, and slicing on a data cube and/or intermediate data mining results.
• This, together with data/knowledge visualization tools, greatly enhances the power and flexibility of data mining.

Online selection of data mining functions:


• Users may not always know the specific kinds of knowledge they want to mine.
• By integrating OLAP with various data mining functions, multidimensional data
mining provides users with the flexibility to select desired data mining functions
and swap data mining tasks dynamically.
Data Warehouse Implementation :
• Data warehouses contain huge volumes of data.
• OLAP servers demand that decision support queries be answered
quickly. Therefore, it is crucial for data warehouse systems to support
highly efficient cube computation techniques, access methods, and
query processing techniques.
• For better performance, data warehouse systems use several methods for efficient implementation:
o Computing data cubes efficiently.
o Indexing OLAP data, using either bitmap or join indices.
o Processing OLAP queries efficiently.
o Choosing among various types of warehouse servers for OLAP processing.
Efficient Data Cube Computation:
• The important thing in multidimensional data analysis is the efficient
computation of aggregations across many sets of dimensions.
• In SQL terms, these aggregations are referred to as group-by’s.
• Each group-by can be represented by a cuboid, where the set of group-by’s
forms a lattice of cuboids defining a data cube.
The compute cube Operator and the Dimensionality :
• One approach to cube computation extends SQL so as to include a compute
cube operator.
• The compute cube operator computes aggregates over all subsets of the
dimensions specified in the operation.
• This can require excessive storage space, especially for large numbers of
dimensions.
Example :
• A data cube is a lattice of cuboids.
• Suppose that you want to create a data cube for AllElectronics
sales that contains: city, item, year, and sales in dollars.
• You will be able to analyze the data, with queries such as :
o “Compute the sum of sales, grouping by city and item.”
o “Compute the sum of sales, grouping by city.”
o “Compute the sum of sales, grouping by item.”
• What is the total number of cuboids, or group-by’s, that can be computed for this
data cube?
• Taking the three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars as the measure, the total number of cuboids, or group-by's, that can be computed is 2^3 = 8.
• The possible group-by’s are as follows:
{(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()
} , where () means that the group-by is empty (i.e., the dimensions are not
grouped).
• These group-by’s form a lattice of cuboids for the data cube, as shown in Figure :
Fig. : Lattice of cuboids, making up a 3-D data cube. Each cuboid represents a different group-by.
The base cuboid contains city, item, and year dimensions.
• An SQL query containing no group-by (e.g., “compute the sum of total sales”) is a zero
dimensional operation.
• An SQL query containing one group-by (e.g., “compute the sum of sales, group-by city”) is a
one-dimensional operation.
• A cube operator on n dimensions is equivalent to a collection of group-by statements, one
for each subset of the n dimensions.
• Therefore, the cube operator is the n-dimensional generalization of the group-by operator.
• Similar to the SQL syntax, the data cube in above example can be defined as
define cube sales_cube [city, item, year]: sum(sales_in_dollars)
• For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid.
• A statement such as
compute cube sales_cube
• would instruct the system to compute the sales aggregate cuboids for all eight subsets of
the set { city, item, year} , including the empty subset.
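In standard SQL:1999 (supported by Oracle, SQL Server, and PostgreSQL, among others), the same computation can be expressed with the CUBE grouping; a minimal sketch, with hypothetical table and column names:

    -- Computes all 2^3 = 8 group-by's of (city, item, year) in one
    -- statement, including the empty grouping (the grand total).
    SELECT city, item, year,
           SUM(sales_in_dollars) AS total_sales
    FROM   sales
    GROUP BY CUBE (city, item, year);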
Curse of dimensionality
• Online analytical processing may need to access different cuboids for different
queries. Therefore, it is a good idea to compute in advance all or at least some of
the cuboids in a data cube.
• Precomputation leads to fast response time and avoids some redundant computation. A major challenge related to this precomputation is that the required storage space may explode if all the cuboids in a data cube are precomputed, especially when the cube has many dimensions.
• The storage requirements are even more excessive when many of the dimensions
have associated concept hierarchies, each with multiple levels. This problem is
referred to as the curse of dimensionality.
• Curse of Dimensionality refers to a set of problems that arise when working with
high-dimensional data.
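The size of this explosion can be quantified. If dimension i has an associated concept hierarchy with L_i levels, the total number of cuboids that can be generated for an n-dimensional cube is

    T = \prod_{i=1}^{n} (L_i + 1)

where the extra 1 per dimension accounts for the virtual top level "all". For example, a cube with 10 dimensions, each with 5 hierarchy levels, yields 6^10 ≈ 6 × 10^7 cuboids.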
Partial Materialization: Selected Computation of Cuboids
• It is unrealistic to precompute and materialize all of the cuboids that can possibly be
generated for a data cube (i.e., from a base cuboid).
• If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial materialization; that is, to materialize only some of the possible cuboids that can be generated.
• There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the "nonbase" cuboids. This leads to computing expensive multidimensional aggregates, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of
possible cuboids. We may compute a subset of the cube, which contains only those
cells that satisfy some user-specified criterion, such as where the tuple count of
each cell is above some threshold. We will use the term subcube to refer to the
latter case, where only some of the cells may be precomputed for various cuboids.
Partial materialization represents an interesting trade-off between storage space
and response time.
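As a sketch of partial materialization, a single frequently used cuboid can be precomputed as a materialized view (Oracle/PostgreSQL-style syntax; all names are hypothetical). The HAVING clause illustrates a user-specified criterion for a subcube, keeping only cells whose tuple count is above a threshold:

    -- Materialize only the (city, item) cuboid rather than the full cube,
    -- and keep only cells backed by at least 100 base tuples.
    CREATE MATERIALIZED VIEW sales_by_city_item AS
    SELECT city, item,
           SUM(sales_in_dollars) AS total_sales,
           COUNT(*)              AS tuple_count
    FROM   sales
    GROUP BY city, item
    HAVING COUNT(*) >= 100;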
Indexing OLAP Data: Bitmap Index and Join Index :
To facilitate efficient data accessing, most data warehouse systems support index structures and materialized views (using cuboids).

Bitmap Index :
• The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes.
• A bitmap index is built on a particular column.
• The bitmap index is an alternative representation of the record ID (RID) list.
• Each distinct value of the indexed attribute is represented by its own bit vector.
• If a given attribute’s domain consists of n values, then n bits are needed for
each entry in the bitmap index (i.e., there are n bit vectors).
• If the attribute has the value v for a given row in the data table, then the bit
representing that value is set to 1 in the corresponding row of the bitmap index.
All other bits for that row are set to 0.
Example : Bitmap indexing
o In the AllElectronics data warehouse, suppose the dimension item at the
top level has four values (representing item types): “home
entertainment,” “computer,” “phone,” and “security.”
o Each value (e.g., “computer”) is represented by a bit vector in the item
bitmap index table.
o Suppose that the cube is stored as a relation table with 100,000 rows.
Because the domain of item consists of four values, the bitmap index
table requires four bit vectors (or lists), each with 100,000 bits.
o Figure shows a base (data) table containing the dimensions item and city,
and its mapping to bitmap index tables for each of the dimensions.
Base table:
RID  item  city
R1   H     V
R2   C     V
R3   P     V
R4   S     V
R5   H     T
R6   C     T
R7   P     T
R8   S     T

item bitmap index table:
RID  H  C  P  S
R1   1  0  0  0
R2   0  1  0  0
R3   0  0  1  0
R4   0  0  0  1
R5   1  0  0  0
R6   0  1  0  0
R7   0  0  1  0
R8   0  0  0  1

city bitmap index table:
RID  V  T
R1   1  0
R2   1  0
R3   1  0
R4   1  0
R5   0  1
R6   0  1
R7   0  1
R8   0  1

Figure : Indexing OLAP data using bitmap indices.


Advantages of Bitmap Indexing :
• Bitmap indexing is advantageous compared to hash and tree indices.
• It is especially useful for low-cardinality domains because comparison, join,
and aggregation operations are then reduced to bit arithmetic.
• Reduces the processing time.
• Bitmap indexing leads to significant reductions in space and input/output
(I/O) since a string of characters can be represented by a single bit.
• For higher-cardinality domains, the method can be adapted using
compression techniques.
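In systems that support bitmap indexes natively (Oracle, for example), creating one is a single DDL statement; the sketch below assumes a sales table with a low-cardinality item column:

    -- Oracle-style bitmap index on a low-cardinality column.
    CREATE BITMAP INDEX sales_item_bix ON sales (item);

    -- Equality predicates on item can then be answered with fast
    -- bit-vector operations instead of full row scans.
    SELECT COUNT(*) FROM sales WHERE item = 'computer';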
Join indexing :
• The join indexing method gained popularity from its use in relational database
query processing.
• Traditional indexing maps the value in a given column to a list of rows having that
value. In contrast, join indexing registers the joinable rows of two relations from a
relational database.
• For example, if two relations R (RID, A) and S (B, SID) join on the attributes A and
B, then the join index record contains the pair (RID, SID), where RID and SID are
record identifiers from the R and S relations, respectively.
• The join index records can identify joinable tuples without performing costly join
operations.
• Join indexing is especially useful for maintaining the relationship between a
foreign key and its matching primary keys, from the joinable relation.
Join indexing maintains relationships between attribute values of a dimension (e.g., within a
dimension table) and the corresponding rows in the fact table.

Example : Join indexing.
Suppose we define a star schema for AllElectronics of the form "sales_star [time, item, branch, location]: dollars_sold = sum (sales_in_dollars)."
An example of a join index relationship between the sales fact table and the location and
item dimension tables is shown in the following figure.
Figure : Linkages between a sales fact table and location and item dimension tables.
For example, the “Main Street” value in the location dimension table joins
with tuples T57, T238, and T884 of the sales fact table.
Similarly, the “Sony-TV” value in the item dimension table joins with tuples
T57 and T459 of the sales fact table.
The corresponding join index tables are shown in following figure.
Figure : Join index tables based on the linkages between the sales fact table and the location and
item dimension tables shown
Advantages of join index :
• The join index records can identify joinable tuples without performing costly join operations.
• Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys, from the joinable relation.
• Join indices can span multiple dimensions to form composite join indices.
• To speed up query processing, the join indexing and bitmap indexing methods can be integrated to form bitmap join indices.
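Oracle, for example, implements this integration directly as a bitmap join index; a minimal sketch, assuming sales and item tables joined on item_id:

    -- Oracle-style bitmap join index: precomputes the join between the
    -- sales fact table and the item dimension table, so queries that
    -- filter on item_name avoid the join at query time.
    CREATE BITMAP INDEX sales_item_bjix
    ON sales (item.item_name)
    FROM sales, item
    WHERE sales.item_id = item.item_id;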
Efficient Processing of OLAP Queries :
• The purpose of materializing cuboids and constructing OLAP index structures is to
speed up query processing in data cubes.
• Given materialized views, query processing should proceed as follows:
1. Determine which operations should be performed on the available cuboids:
o This involves transforming any selection, projection, roll-up (group-by), and
drill-down operations specified in the query into corresponding SQL and/or
OLAP operations.
o For example, slicing and dicing a data cube may correspond to selection
and/or projection operations on a materialized cuboid.
2. Determine to which materialized cuboid(s) the relevant operations should be
applied:
o This involves identifying all of the materialized cuboids that may potentially be
used to answer the query, pruning the set using knowledge of “dominance”
relationships among the cuboids, estimating the costs of using the remaining
materialized cuboids, and selecting the cuboid with the least cost.
Example : OLAP query processing :
Suppose that we define a data cube for AllElectronics of the form
“sales_cube [time, item, location]: sum(sales_in_dollars).”
The dimension hierarchies used are :
“day < month < quarter < year” for time;
“item name < brand < type” for item; and
“street < city < province or state < country” for location.
Suppose that the query to be processed is on {brand, province_or_state}, with the
selection constant “year = 2010.” Also, suppose that there are four materialized
cuboids available, as follows:
cuboid 1: { year, item name, city }
cuboid 2: { year, brand, country }
cuboid 3: { year, brand, province_or_state }
cuboid 4: { item name, province_or_state }, where year = 2010
“Which of these four cuboids should be selected to process the query?”
Finer-granularity data cannot be generated from coarser-granularity data.
Therefore, cuboid 2 cannot be used, because country is a more general concept than province_or_state.
Cuboids 1, 3, and 4 can be used to process the query because:
• They have the same set or a superset of the dimensions in the query
• The selection clause in the query can imply the selection in the cuboid.
• The abstraction levels for the item and location dimensions in these cuboids are at a finer level than brand and province_or_state, respectively.
“How would the costs of each cuboid compare if used to process the query?”
• Using cuboid 1 would cost the most because both item name and city are at a
lower level than the brand and province_or_state concepts specified in the query.
• If there are not many year values associated with items in the cube, but there are
several item names for each brand, then cuboid 3 will be smaller than cuboid 4,
and thus cuboid 3 should be chosen to process the query.
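In SQL terms, assuming cuboid 3 is materialized as a hypothetical table cuboid3 (year, brand, province_or_state, total_sales), the query can be answered directly from it:

    -- The query on {brand, province_or_state} with "year = 2010" maps
    -- onto cuboid 3 with a selection followed by re-aggregation.
    SELECT brand, province_or_state,
           SUM(total_sales) AS sales_2010
    FROM   cuboid3
    WHERE  year = 2010
    GROUP BY brand, province_or_state;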
OLAP Server Architectures :
• An Online Analytical Processing (OLAP) server is based on the multidimensional data model.
• It allows managers and analysts to gain insight into the information through fast, consistent, and interactive access to it.
• Types of OLAP Servers :
We have four types of OLAP servers −
o Relational OLAP (ROLAP)
o Multidimensional OLAP (MOLAP)
o Hybrid OLAP (HOLAP)
o Specialized SQL Servers
Relational OLAP :
• ROLAP servers are placed between the relational back-end server and client front-end tools.
• To store and manage warehouse data, ROLAP uses relational or extended-relational
DBMS.
• ROLAP includes the following −
o Implementation of aggregation navigation logic.
o Optimization for each DBMS back end.
o Additional tools and services.
Multidimensional OLAP :
• MOLAP uses array-based multidimensional storage engines for multidimensional
views of data.
• With multidimensional data stores, the storage utilization may be low if the data set
is sparse.
• Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP :
• Hybrid OLAP is a combination of both ROLAP and MOLAP.
• It offers the higher scalability of ROLAP and the faster computation of MOLAP.
• HOLAP servers allow large volumes of detailed information to be stored.
• The aggregations are stored separately in a MOLAP store.
Specialized SQL Servers :
• Specialized SQL servers provide advanced query language and query processing
support for SQL queries over star and snowflake schemas in a read-only
environment.
