Slidesitba - Clase2 - OLAP
Slidesitba - Clase2 - OLAP
Slidesitba - Clase2 - OLAP
y Data Warehouses
_ Data Warehouse Architecture
_ The Multidimensional Model
_ OLAP Operations
Data Warehouses
_ Data needed for analysis: typically spread across multiple heterogeneous sources, e.g., operational
systems, web
_ These data must be integrated
_ Data warehouse: a repository of integrated data obtained from several sources for multidimensional
data analysis.
_ A collection of subject-oriented, integrated, nonvolatile, and time-varying data to support manage-
ment decisions (classic Immon definition)
• Subject-oriented: data warehouses focus on the analytical needs of different areas of an organiza-
tion; e.g., in a retail company the analysis may focus on product sales or inventory management;
• Integrated: data obtained from several operational and external systems must be joined together
• Nonvolatile: durability of data is ensured by disallowing data modification and removal; a data
warehouse gathers data from several years, (typically 5 to 10); data in operational databases is
often kept for only a short period of time, e.g., 2 to 6 months
• Time-varying indicates the possibility of retaining different values for the same information, as
well as the time when changes to these values occurred
Data Warehouses
_ No consensus about the relationship between data warehouses and data marts
• A bottom-up approach: a data warehouse is built by first building the smaller data marts and then
merging these to obtain the data warehouse
• A top-down (classic data warehouse view): data marts are obtained from the data warehouse, a
data mart just a logical view of a data warehouse
_ Several differences between OLTP and OLAP systems
Data Warehouses
_ OLTP systems users: perform predefined operations through transactional applications (e.g., payroll
or ticket reservation systems)
_ Data warehouse users employ interactive OLAP tools to perform data analysis, for example, to detect
salary inconsistencies or most frequent tourist destinations (Lines 1–2)
_ Data:
• OLTP systems: current and detailed
• Data analytics require historical, summarized data (Line 3)
_ OLTP requires R/W operations, e.g., insert new orders, modify old ones, and delete orders if customers
cancel them → small number of records accessed
_ OLAP complex aggregation queries, thus requiring read access to all the records in one or more tables
_ OLAP systems not so frequently accessed as OLTP systems
_ OLTP systems: short response time; complex OLAP queries can take long time to complete (Line 9)
_ OLTP systems: heavy concurrent accesses (Lines 10–11).
_ OLAP systems are read only → queries can be submitted and computed concurrently
_ OLTP systems are constantly updated online; OLAP systems are updated offline periodically, then:
• OLTP: highly normalized schema; OLAP: denormalized schema
_ Data Warehouses
y Data Warehouse Architecture
_ The Multidimensional Model
_ OLAP Operations
Enterprise Reporting
ETL OLAP tools
Operational data
process warehouse server
databases
Statistical
tools
External
sources Data marts
Data mining
tools
Back-End Tier
_ Performs Extraction, Transformation, and Loading
_ It is a three-step process
_ Extraction gathers data from multiple, heterogeneous data sources internal or external to the organi-
zation
_ Transformation modifies the data from the format of the data sources to the warehouse format; this
includes:
• Cleaning: Removes errors and inconsistencies in the data and converts it into a standardized format
• Integration: Reconciles data from different data sources, both at the schema and at the data level
• Aggregation: Summarizes the data obtained from data sources according granularity of the data
warehouse
_ Loading feeds the data warehouse with the transformed data, including refreshing the data warehouse,
that is, propagating updates from the data sources to the data warehouse at a specified frequency
_ Data staging area (usually called operational data store): A database where data extracted from the
sources undergoes successive modifications before being loaded into the data warehouse
Back-End Tier
_ Example: DW for smart meter analysis: energy saving, offers, etc.
_ Traditional solution (see article Hadoop and DW)
Back-End Tier
_ Example: DW for smart meter analysis: energy saving, offers, etc.
_ Hadoop-based solution (see article Hadoop and DW)
_ Raw data moved into Hadoop, processed with Pig and moved to the DW
_ Hadoop not an ETL tool, just supports ETL
OLAP Tier
_ Composed of an OLAP server, which presents business users with multidimensional data from data
warehouses or data marts
_ Products include OLAP extensions and tools allowing building, querying, and navigating cubes, anal-
ysis, and reporting
_ Not yet a standardized language for defining and manipulating data cubes
_ MDX (MultiDimensional eXpressions): query language for OLAP databases, a de facto standard for
querying OLAP systems
_ SQL extended for providing analytical capabilities: SQL/OLAP
Front-End Tier
_ Client tools that allow users to exploit the content of the data warehouse
_ OLAP tools: allow interactive exploration and manipulation of the warehouse data and formulation
of complex ad hoc queries
_ Reporting tools enable the production, delivery, and management of reports, which can be paper-
based, interactive, or web-based
_ Reports use predefined queries queries asking for specific information in a specific format, performed
on a regular basis
_ Statistical tools: used to analyze and visualize the cube data using statistical methods
_ Data-mining tools allow users to analyze data in order to discover valuable knowledge such as pat-
terns and trends, and also allow to make predictions based on current data
_ Data Warehouses
_ Data Warehouse Architecture
y The Multidimensional Model
_ OLAP Operations
) r
Köln
ity e
(C tom
Berlin
s
Lyon
Cu
Paris
Q1 measure
Time (Quarter)
values
Q2
dimensions
Q3
Q4
Produce Seafood
Beverages Condiments
Product (Category)
Quarter State
Month City
Day Customer
All all
_ Data Warehouses
_ Data Warehouse Architecture
_ The Multidimensional Model
y OLAP Operations
OLAP Operations
Köln
) r
ity e
ntr er
(C tom
Berlin
y)
ou m
Lyon Germany
(C usto
s
Cu
Paris France
C
Q1 Q1
Time (Quarter)
Time (Quarter)
Q2 Q2
Q3 Q3
Q4 Q4
Berlin
Lyon
s
Köln
) r
Cu
ity e
(C tom
Paris Berlin
Lyon
s
Jan
Cu
Paris
Time (Month)
Feb Q1
Time (Quarter)
Mar Q2
... Q3
Dec Q4
Produce Seafood Condiments Seafood
Beverages Condiments Beverages Produce
Product (Category) Product (Category)
OLAP Operations
_ Starting cube: quarterly sales (in thousands) by product category and customer cities for 2012
_ We first compute the sales quantities by country: a roll-up operation to the Country level along the
Customer dimension
_ Sales of category Seafood in France significantly higher in the first quarter
• To find out if this occurred during a particular month, we take cube back to City aggregation level,
and drill-down along Time to the Month level
_ To explore alternative visualizations, we sort products by name
_ To see the cube with the Time dimension on the x axis, we rotate the axes of the original cube, without
changing granularities → pivoting (see next slide)
_ To visualize the data only for Paris → slice operation, results in a 2-dimensional subcube, basically a
collection of time series (see next slide)
_ To obtain a 3-dimensional subcube containing only sales for the first two quarters and for the cities
Lyon and Paris, we go back to the original cube and apply a dice operation
OLAP Operations
or t
y)
eg c
Seafood
at du
Condiments
(C Pro
Produce
Q1
Time (Quarter)
Beverages
Paris Q2
Customer (City)
Lyon Q3
Berlin Q4
) r
(Quarter) City me
( to
Lyon
s
Cu
Paris
Q1
Time
Q2
Produce Seafood
Beverages Condiments
Product (Category)
Dice on City=‘Paris’ or ‘Lyon’ and Quarter=’Q1’ or ’Q2’
) r
) r
ity e
ity e
(C tom
(C tom
Berlin Berlin
Lyon Lyon
s
Cu
Cu
Paris Paris
Q1 Q1
Time (Quarter)
Time (Quarter)
Q2 Q2
Q3 Q3
Q4 Q4
Produce Seafood Produce Seafood
Beverages Condiments Beverages Condiments
Product (Category) Product (Category)
Berlin
Lyon
s
Cu
Paris
Q1 Q1 84 89 106 84
Time (Quarter)
Time (Quarter)
Q2 Q2 82 77 93 79
Q3 Q3 105 72 65 88
Q4 Q4 112 61 96 102
Produce Seafood Lyon Köln
Beverages Condiments Paris Berlin
Product (Category) Customer (City)
) r
Köln
) r
ity e
ity e
(C tom
(C tom
Berlin Berlin
s
Lyon
s
Lyon
Cu
Cu
Paris Paris
Q1 Q1
Time (Quarter)
Time (Quarter)
Q2 Q2
Q3 Q3
Q4 Q4
Produce Seafood Produce Seafood
Beverages Condiments Beverages Condiments
Product (Category) Product (Category)
Maximum sales by quarter and city Top two sales by quarter and city
Köln Köln
) r
) r
ity e
ity e
(C tom
(C tom
Berlin Berlin
Lyon Lyon
s
s
Cu
Cu
Paris Paris
Jan Jan
Time (Month)
Time (Month)
Feb Feb
Mar Mar
... ...
Dec Dec
Produce Seafood Produce Seafood
Beverages Condiments Beverages Condiments
Product (Category) Product (Category)
) r
) r
ity e
ity e
(C tom
(C tom
Berlin Berlin
Lyon Lyon
s
s
Cu
Cu
Paris Paris
Q1 Q1
Time (Quarter)
Time (Quarter) Q2 Q2
Q3 Q3
Q4 Q4
Produce Seafood Produce Seafood
Beverages Condiments Beverages Condiments
Product (Category) Product (Category)
Top 70% per quarter Rank quarter by category and city
_ We go back to the original cube to compute the quarter sales that amount to 70% of the total sales,
applying the top percent aggregation operator
ADDMEASURE(Sales, Top70Perc = TOPPERCENT(Quantity, 70)
OVER Time ALL CELLS PRECEDING)
_ Finally, we rank the quarterly sales by category and city
ADDMEASURE(Sales, Rank = RANK(Quantity) OVER Time ALL CELLS PRECEDING)
) r
ity e
Bilbao
(C tom
Köln
s
Cu
Berlin
Lyon
Paris
Q1
Time (Quarter)
Q2
Q3
Q4
Produce Seafood
Beverages Condiments
Product (Category)
Union of the original cube and a cube with data from Spain
Köln
) r
ity e
(C tom
Berlin
Lyon
s
Cu
Paris
Q1
Time (Quarter)
Q2
Q3
Q4
Produce Seafood
Beverages Condiments
Product (Category)