Slidesitba - Clase2 - OLAP

Diplomatura en Big Data
Primer Semestre 2016

Módulo 1
Data Warehousing y OLAP

Alejandro VAISMAN
Departmento de Ingenierı́a Informática
Instituto Tecnológico de Buenos Aires
[email protected]
Diplomatura en Big Data 1

Data Warehouse and Multidimensional Databases Concepts
Outline
y Data Warehouses
_ Data Warehouse Architecture
_ The Multidimensional Model
_ OLAP Operations

Data Warehouse Concepts Data Warehouses
OLAP vs. OLTP

_ Traditional database systems designed and tuned to support the day-to-day operation:
• Ensure fast, concurrent access to data
• Transaction processing and concurrency control
• Focus on online update data consistency
• Known as operational databases or online transaction processing (OLTP)
_ OLTP DB data characteristics:
• Detailed data
• Do not include historical data
• Highly normalized
• Poor performance on complex queries including joins an aggregation
_ Data analysis requires a new paradigm: online analytical processing (OLAP)
• Typical OLTP query: pending orders for customer c1
• Typical OLAP query: total sales amount by product and by customer

OLAP vs. OLTP

_ OLAP characteristics
• OLTP paradigm focused on transactions, OLAP focused on analytical queries
• Normalization not good for analytical queries, reconstructing data requires a high number of joins
• OLAP databases support a heavy query load
• OLTP indexing techniques not efficient in OLAP: oriented to access few records
∗ OLAP queries typically include aggregation
_ The need for a different database model to support OLAP was clear: led to data warehouses
_ Data warehouse: (usually) large repositories that consolidate data from different sources (internal and
external to the organization), are updated offline, follow the multidimensional data model, designed
and optimized to efficiently support OLAP queries

Data Warehouses
_ Data needed for analysis: typically spread across multiple heterogeneous sources, e.g., operational
systems, web
_ These data must be integrated
_ Data warehouse: a repository of integrated data obtained from several sources for multidimensional
data analysis.
_ A collection of subject-oriented, integrated, nonvolatile, and time-varying data to support manage-
ment decisions (classic Immon definition)
• Subject-oriented: data warehouses focus on the analytical needs of different areas of an organiza-
tion; e.g., in a retail company the analysis may focus on product sales or inventory management;
• Integrated: data obtained from several operational and external systems must be joined together
• Nonvolatile: durability of data is ensured by disallowing data modification and removal; a data
warehouse gathers data from several years, (typically 5 to 10); data in operational databases is
often kept for only a short period of time, e.g., 2 to 6 months
• Time-varying indicates the possibility of retaining different values for the same information, as
well as the time when changes to these values occurred

Data Warehouses
_ No consensus about the relationship between data warehouses and data marts
• A bottom-up approach: a data warehouse is built by first building the smaller data marts and then
merging these to obtain the data warehouse
• A top-down (classic data warehouse view): data marts are obtained from the data warehouse, a
data mart just a logical view of a data warehouse
_ Several differences between OLTP and OLAP systems
Aspect Operational databases Data warehouses

1 User type Operators, office employees Managers, executives
2 Usage Predictable, repetitive Ad hoc, nonstructured
3 Data content Current, detailed data Historical, summarized data
4 Data organization According to operational needs According to analysis needs
5 Data structures Optimized for small transactions Optimized for complex queries
6 Access frequency High From medium to low
7 Access type Read, insert, update, delete Read, append only
8 Number of records per Few Many
access
9 Response time Short Can be long
10 Concurrency level High Low
11 Lock utilization Needed Not needed
12 Update frequency High None
13 Data redundancy Low (normalized tables) High (denormalized tables)
14 Data modeling UML, ER model Multidimensional model

Data Warehouses
_ OLTP systems users: perform predefined operations through transactional applications (e.g., payroll
or ticket reservation systems)
_ Data warehouse users employ interactive OLAP tools to perform data analysis, for example, to detect
salary inconsistencies or most frequent tourist destinations (Lines 1–2)
_ Data:
• OLTP systems: current and detailed
• Data analytics require historical, summarized data (Line 3)
_ OLTP requires R/W operations, e.g., insert new orders, modify old ones, and delete orders if customers
cancel them → small number of records accessed
_ OLAP complex aggregation queries, thus requiring read access to all the records in one or more tables
_ OLAP systems not so frequently accessed as OLTP systems
_ OLTP systems: short response time; complex OLAP queries can take long time to complete (Line 9)
_ OLTP systems: heavy concurrent accesses (Lines 10–11).
_ OLAP systems are read only → queries can be submitted and computed concurrently
_ OLTP systems are constantly updated online; OLAP systems are updated offline periodically, then:
• OLTP: highly normalized schema; OLAP: denormalized schema

Data Warehouse Concepts
Outline
_ Data Warehouses
y Data Warehouse Architecture
_ OLAP Operations

Data Warehouse Concepts Data Warehouse Architecture
Data Warehouse Architecture

_ General data warehouse architecture: several tiers
_ Back-end tier composed of:
• The extraction, transformation, and loading (ETL) tools : Feed data into the data warehouse from
operational databases and internal and external data sources
• The data staging area: An intermediate database where all the data integration and transformation
processes are run prior to the loading of the data into the data warehouse
_ Data warehouse tier composed of:
• An enterprise data warehouse and/or several data marts
• A metadata repository storing information about the data warehouse and its contents
_ OLAP tier composed of:
• An OLAP server which provides a multidimensional view of the data, regardless the actual way in
which data are stored
_ Front-end tier is used for data analysis and visualization
• Contains client tools such as OLAP tools, reporting tools, statistical tools, and data-mining tools

Typical Data Warehouse Architecture

Internal Data staging Metadata
sources
OLAP tools
Enterprise Reporting
ETL OLAP tools
Operational data
process warehouse server
databases
Statistical
tools
External
sources Data marts
Data mining
tools
Data Back-end Data warehouse OLAP Front-end

sources tier tier tier tier

Back-End Tier
_ Performs Extraction, Transformation, and Loading
_ It is a three-step process
_ Extraction gathers data from multiple, heterogeneous data sources internal or external to the organi-
zation
_ Transformation modifies the data from the format of the data sources to the warehouse format; this
includes:
• Cleaning: Removes errors and inconsistencies in the data and converts it into a standardized format
• Integration: Reconciles data from different data sources, both at the schema and at the data level
• Aggregation: Summarizes the data obtained from data sources according granularity of the data
warehouse
_ Loading feeds the data warehouse with the transformed data, including refreshing the data warehouse,
that is, propagating updates from the data sources to the data warehouse at a specified frequency
_ Data staging area (usually called operational data store): A database where data extracted from the
sources undergoes successive modifications before being loaded into the data warehouse

Back-End Tier
_ Example: DW for smart meter analysis: energy saving, offers, etc.
_ Traditional solution (see article Hadoop and DW)
_ Data loaded from manual processes each quarter: 10 million readings

_ Smart meters: a measure each 5 minutes → 100 billion reads each quarter
_ A new solution required

Back-End Tier
_ Example: DW for smart meter analysis: energy saving, offers, etc.
_ Hadoop-based solution (see article Hadoop and DW)
_ Raw data moved into Hadoop, processed with Pig and moved to the DW
_ Hadoop not an ETL tool, just supports ETL

Data Warehouse Tier

_ Components:
• An enterprise data warehouse, centralized and encompassing an entire organization
• Several data marts, specialized departmental data warehouses
_ Metadata
• Business metadata describes the semantics of the data, and organizational rules, policies, and
constraints related to the data
• Technical metadata describes how data are structured and stored in a computer system, and the
applications and processes that manipulate the data
_ The metadata repository may contain information such as:
• Metadata describing the structure of the data warehouse and the data marts, at the conceptual/logical
level (facts, dimensions, hierarchies, ...) and at the physical level (indexes, partitions,...)
• Security information (user authorization and access control), and monitoring information (usage
statistics, error reports, audit trails)
• Metadata describing data sources: schemas, ownership, update frequencies, legal limitations, ac-
cess methods
• Metadata describing the ETL: data lineage, data extraction, cleaning, transformation rules, etc.

OLAP Tier
_ Composed of an OLAP server, which presents business users with multidimensional data from data
warehouses or data marts
_ Products include OLAP extensions and tools allowing building, querying, and navigating cubes, anal-
ysis, and reporting
_ Not yet a standardized language for defining and manipulating data cubes
_ MDX (MultiDimensional eXpressions): query language for OLAP databases, a de facto standard for
querying OLAP systems
_ SQL extended for providing analytical capabilities: SQL/OLAP

Front-End Tier
_ Client tools that allow users to exploit the content of the data warehouse
_ OLAP tools: allow interactive exploration and manipulation of the warehouse data and formulation
of complex ad hoc queries
_ Reporting tools enable the production, delivery, and management of reports, which can be paper-
based, interactive, or web-based
_ Reports use predefined queries queries asking for specific information in a specific format, performed
on a regular basis
_ Statistical tools: used to analyze and visualize the cube data using statistical methods
_ Data-mining tools allow users to analyze data in order to discover valuable knowledge such as pat-
terns and trends, and also allow to make predictions based on current data

Variations of the Architecture

_ Sometimes: only an enterprise data warehouse without data marts or, alternatively, an enterprise data
warehouse does not exist
_ In other situations, an OLAP server does not exist and/or the client tools directly access the data
warehouse (indicated by the arrow connecting the data warehouse tier to the front-end tier)
_ Extreme situation: neither a data warehouse nor an OLAP server → virtual data warehouse, which
defines a set of views over operational databases that are materialized for efficient access
_ Virtual data warehouse is easy to build but does not provide a real data warehouse solution (does not
contain historical data, centralized metadata, etc.)
_ Data staging area may not be needed when the data in the source systems conforms very closely to the
data in the warehouse

Outline
_ Data Warehouses
y The Multidimensional Model
_ OLAP Operations

The Multidimensional Model
_ Views data in an n-dimensional space: A data cube
_ A data cube is composed of dimensions and facts
_ Dimensions: Perspectives used to analyze the data
• Example: A three-dimensional cube for sales data with dimensions Product, Time, and Customer,
and a measure Quantity
) r
Köln
ity e
(C tom
Berlin
s
Lyon
Cu
Paris
Q1 measure
Time (Quarter)
values
Q2
dimensions
Q3
Q4
Produce Seafood
Beverages Condiments
Product (Category)
_ Attributes describe dimensions

• Product dimension may have attributes ProductNumber and UnitPrice (not shown in the figure)
_ The cells or facts of a data cube have associated numeric values called measures
_ Each cell of the data cube represents Quantity of units sold by category, quarter, and customer’s city

_ Data granularity: level of detail at which measures are represented for each dimension of the cube
• Example: sales figures aggregated to granularities Category, Quarter, and City
• We may want sales figures at a finer granularity (Month), or at a coarser granularity (Country)
_ Elements of a dimension at a certain granularity are called members
• Example: Seafood and Beverages are members of Product at granularity Category
_ A dimension instance comprises all members at all granularity levels in a dimension
_ A data cube contains several measures, e.g. Amount, indicating the total sales amount (not shown)
_ A data cube may be sparse (typical case) or dense
• Example: not all customers may have ordered products of all categories during all quarters
_ Hierarchies: allow viewing data at several granularities
• Define a sequence of mappings relating lower-level, detailed concepts to higher-level ones
• The lower level is called the child and the higher level is called the parent
• The hierarchical structure of a dimension is called the dimension schema

_ Hierarchies of the Product, Time, and Customer dimensions
Product Time Customer
All All All
Category Year Continent
Product Semester Country
Quarter State
Month City
Day Customer

_ Members of a hierarchy Product → Category
All all
Category Beverages ... Seafood
Product Chai Chang ... Ikura Konbu ...

The Multidimensional Model: Measures
_ Aggregation of measures changes the abstraction level at which data in a cube are visualized
_ Measures can be:
• Additive: can be meaningfully summarized along all the dimensions, using addition
∗ The most common type of measures
• Semiadditive: can be meaningfully summarized using addition along some dimensions
∗ Example: inventory quantities, which cannot be added along the Time dimension
• Nonadditive measures cannot be meaningfully summarized using addition across any dimension
∗ Example: item price, cost per unit, and exchange rate
_ Another classification of measures:
• Distributive: defined by an aggregation function that can be computed in a distributed way
∗ Functions count, sum, minimum, and maximum are distributive, distinct count is not
∗ Example: S = {3, 3, 4, 5, 8, 4, 7, 3, 8} partitioned in subsets {3, 3, 4}, {5, 8, 4}, {7, 3, 8} gives a
result of 8, while the answer over the original set is 5
• Algebraic measures are defined by an aggregation function that can be expressed as a scalar
function of distributive ones; example: average, computed by dividing the sum by the count
• Holistic measures cannot be computed from other subaggregates (e.g., median, rank)

The Multidimensional Model: Measures
_ When defining a measure we must determine the associated aggregation functions
• For example, a semiadditive measure representing inventory quantities can be aggregated using
average along the Time dimension, and using addition along other dimensions
_ Summarizability refers to the correct aggregation of cube measures along dimension hierarchies
_ Summarizability conditions:
• Disjointness of instances: the grouping of instances in a level with respect to their parent in the
next level must result in disjoint subsets
• Completeness: all instances are included in the hierarchy and each instance is related to one parent
in the next level
• Correctness: refers to the correct use of the aggregation functions (more on this next)

Outline
_ Data Warehouses
y OLAP Operations

Data Warehouse Concepts OLAP Operations
OLAP Operations
Köln
) r
ity e
ntr er
(C tom
Berlin
y)
ou m
Lyon Germany
(C usto
s
Cu
Paris France
C
Q1 Q1
Time (Quarter)
Time (Quarter)
Q2 Q2
Q3 Q3
Q4 Q4
Produce Seafood Produce Seafood

Beverages Condiments Beverages Condiments
Product (Category) Product (Category)
Original cube Roll-up to the Country level

Köln
) r
ity e
(C tom
Berlin
Lyon
s
Köln
) r
Cu
ity e
(C tom
Paris Berlin
Lyon
s
Jan
Cu
Paris
Time (Month)
Feb Q1
Time (Quarter)
Mar Q2
... Q3
Dec Q4
Produce Seafood Condiments Seafood
Beverages Condiments Beverages Produce
Drill-down to the Month level Sort product by name

OLAP Operations
_ Starting cube: quarterly sales (in thousands) by product category and customer cities for 2012
_ We first compute the sales quantities by country: a roll-up operation to the Country level along the
Customer dimension
_ Sales of category Seafood in France significantly higher in the first quarter
• To find out if this occurred during a particular month, we take cube back to City aggregation level,
and drill-down along Time to the Month level
_ To explore alternative visualizations, we sort products by name
_ To see the cube with the Time dimension on the x axis, we rotate the axes of the original cube, without
changing granularities → pivoting (see next slide)
_ To visualize the data only for Paris → slice operation, results in a 2-dimensional subcube, basically a
collection of time series (see next slide)
_ To obtain a 3-dimensional subcube containing only sales for the first two quarters and for the cities
Lyon and Paris, we go back to the original cube and apply a dice operation

OLAP Operations
or t
y)
eg c
Seafood
at du
Condiments
(C Pro
Produce
Q1
Time (Quarter)
Beverages
Paris Q2
Customer (City)
Lyon Q3
Berlin Q4
Köln Produce Seafood

Q1 Q2 Q3 Q4 Product (Category)
Time (Quarter)
Pivot Slice on City=’Paris’
) r
(Quarter) City me
( to
Lyon
s
Cu
Paris
Q1
Time
Q2
Produce Seafood
Product (Category)
Dice on City=‘Paris’ or ‘Lyon’ and Quarter=’Q1’ or ’Q2’

Advanced OLAP Operations

Köln Köln
) r
) r
ity e
ity e
(C tom
(C tom
Berlin Berlin
Lyon Lyon
s
Cu
Cu
Paris Paris
Q1 Q1
Time (Quarter)
Time (Quarter)
Q2 Q2
Q3 Q3
Q4 Q4
Cube for 2011 Drill-across operator

Köln
) r
ity e
(C tom
Berlin
Lyon
s
Cu
Paris
Q1 Q1 84 89 106 84
Time (Quarter)
Time (Quarter)
Q2 Q2 82 77 93 79
Q3 Q3 105 72 65 88
Q4 Q4 112 61 96 102
Produce Seafood Lyon Köln
Beverages Condiments Paris Berlin
Product (Category) Customer (City)
Percentage change Total salesSUM

byBYquarter and city
Time, Customer


Köln
) r
Köln
) r
ity e
ity e
(C tom
(C tom
Berlin Berlin
s
Lyon
s
Lyon
Cu
Cu
Paris Paris
Q1 Q1
Time (Quarter)
Time (Quarter)
Q2 Q2
Q3 Q3
Q4 Q4
Maximum sales by quarter and city Top two sales by quarter and city
Köln Köln
) r
) r
ity e
ity e
(C tom
(C tom
Berlin Berlin
Lyon Lyon
s
s
Cu
Cu
Paris Paris
Jan Jan
Time (Month)
Time (Month)
Feb Feb
Mar Mar
... ...
Dec Dec
Three-months moving average Year-to-date sum


Köln Köln
) r
) r
ity e
ity e
(C tom
(C tom
Berlin Berlin
Lyon Lyon
s
s
Cu
Cu
Paris Paris
Q1 Q1
Time (Quarter)
Time (Quarter) Q2 Q2
Q3 Q3
Q4 Q4
Top 70% per quarter Rank quarter by category and city
_ We go back to the original cube to compute the quarter sales that amount to 70% of the total sales,
applying the top percent aggregation operator
ADDMEASURE(Sales, Top70Perc = TOPPERCENT(Quantity, 70)
OVER Time ALL CELLS PRECEDING)
_ Finally, we rank the quarterly sales by category and city
ADDMEASURE(Sales, Rank = RANK(Quantity) OVER Time ALL CELLS PRECEDING)


_ Union merges two cubes having the same schema but disjoint instances.
_ Example: If CubeSpain is a cube having the same schema as the original cube but containing only
the sales to Spanish customers, we can perform:
UNION(Sales, SalesSpain)
_ Difference removes the cells in a cube that belong to another one; the two cubes must have the same
schema
_ Example: Given the cubes TopTwoSales and the original one, compute a cube with all sales measures
except the top two sales by quarter and city
DIFFERENCE(Sales, TopTwoSales)
_ Drill-through operation allows to move from data at the bottom level in a cube to data in the opera-
tional systems from which the cube was derived
_ Could be used when trying to determine the reason for outlier values in a data cube


Madrid
) r
ity e
Bilbao
(C tom
Köln
s
Cu
Berlin
Lyon
Paris
Q1
Time (Quarter)
Q2
Q3
Q4
Produce Seafood
Product (Category)
Union of the original cube and a cube with data from Spain
Köln
) r
ity e
(C tom
Berlin
Lyon
s
Cu
Paris
Q1
Time (Quarter)
Q2
Q3
Q4
Produce Seafood
Product (Category)
Difference of the original cube and the original one cube

Summarizing OLAP Operations

Operator Purpose
Add measure Adds a new measure to a cube computed from other measures or dimensions.
Aggregation opera- Aggregates the cells of a cube, possibly after performing a grouping of cells.
tors
Dice Keeps the cells that satisfy a Boolean condition over dimension levels, attributes, and
measures.
Difference Removes the cells of a cube that are in another cube. Both cubes must have the same
schema.
Drill-across Merges two cubes that have the same schema and instances using a join condition.
Drill-down Disaggregates measures along a dimension hierarchy to obtain data at a finer granular-
ity. It is the opposite of the roll-up operation.
Drill-through Shows data in the operational systems from which the cube was derived. This operation
does not formally belong to the OLAP algebra since the result is not a cube.
Drop measure Removes one or several measures from a cube.
Pivot Rotates the axes of a cube to provide an alternative presentation of its data.
Recursive roll-up Performs an iteration of roll-ups over a recursive hierarchy until the top level is reached.
Rename Renames one or several schema elements of a cube.
Roll-up Aggregates measures along a dimension hierarchy to obtain data at a coarser granular-
ity. It is the opposite of the drill-down operation.
Roll-up* Shorthand notation for a sequence of roll-up operations.
Slice Removes a dimension by fixing a single value in a level of the dimension.
Sort Orders the members of a dimension according to an expression.
Union Combines the cells of two cubes that have the same schema but disjoint members.

Slidesitba - Clase2 - OLAP

Uploaded by

Copyright:

Available Formats

Slidesitba - Clase2 - OLAP

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Slidesitba - Clase2 - OLAP

Uploaded by

Copyright:

Available Formats

Diplomatura en Big Data

Primer Semestre 2016

Data Warehousing y OLAP

Diplomatura en Big Data 1

Diplomatura en Big Data 2

OLAP vs. OLTP

Diplomatura en Big Data 3

OLAP vs. OLTP

Diplomatura en Big Data 4

Diplomatura en Big Data 5

Aspect Operational databases Data warehouses

Diplomatura en Big Data 6

Diplomatura en Big Data 7

Diplomatura en Big Data 8

Data Warehouse Architecture

Diplomatura en Big Data 9

Typical Data Warehouse Architecture

Data Back-end Data warehouse OLAP Front-end

Diplomatura en Big Data 10

Diplomatura en Big Data 11

_ Data loaded from manual processes each quarter: 10 million readings

Diplomatura en Big Data 12

Diplomatura en Big Data 13

Data Warehouse Tier

Diplomatura en Big Data 14

Diplomatura en Big Data 15

Diplomatura en Big Data 16

Variations of the Architecture

Diplomatura en Big Data 17

Diplomatura en Big Data 18

_ Attributes describe dimensions

Diplomatura en Big Data 19

Diplomatura en Big Data 20

Category Year Continent

Product Semester Country

Diplomatura en Big Data 21

Category Beverages ... Seafood

Product Chai Chang ... Ikura Konbu ...

Diplomatura en Big Data 22

Diplomatura en Big Data 23

Diplomatura en Big Data 24

Diplomatura en Big Data 25

Produce Seafood Produce Seafood

Original cube Roll-up to the Country level

Drill-down to the Month level Sort product by name

Diplomatura en Big Data 26

Diplomatura en Big Data 27

Köln Produce Seafood

Diplomatura en Big Data 28

Advanced OLAP Operations

Cube for 2011 Drill-across operator

Percentage change Total salesSUM

Diplomatura en Big Data 29

Advanced OLAP Operations

Three-months moving average Year-to-date sum

Diplomatura en Big Data 30

Advanced OLAP Operations

Diplomatura en Big Data 31

Advanced OLAP Operations