Slidesitba - Clase2 - OLAP

Download as pdf or txt
Download as pdf or txt
You are on page 1of 34

Diplomatura en Big Data

Primer Semestre 2016


Módulo 1

Data Warehousing y OLAP


Alejandro VAISMAN
Departmento de Ingenierı́a Informática
Instituto Tecnológico de Buenos Aires
[email protected]

Diplomatura en Big Data 1


Data Warehouse and Multidimensional Databases Concepts
Outline

y Data Warehouses
_ Data Warehouse Architecture
_ The Multidimensional Model
_ OLAP Operations

Diplomatura en Big Data 2


Data Warehouse Concepts Data Warehouses

OLAP vs. OLTP


_ Traditional database systems designed and tuned to support the day-to-day operation:
• Ensure fast, concurrent access to data
• Transaction processing and concurrency control
• Focus on online update data consistency
• Known as operational databases or online transaction processing (OLTP)
_ OLTP DB data characteristics:
• Detailed data
• Do not include historical data
• Highly normalized
• Poor performance on complex queries including joins an aggregation
_ Data analysis requires a new paradigm: online analytical processing (OLAP)
• Typical OLTP query: pending orders for customer c1
• Typical OLAP query: total sales amount by product and by customer

Diplomatura en Big Data 3


Data Warehouse Concepts Data Warehouses

OLAP vs. OLTP


_ OLAP characteristics
• OLTP paradigm focused on transactions, OLAP focused on analytical queries
• Normalization not good for analytical queries, reconstructing data requires a high number of joins
• OLAP databases support a heavy query load
• OLTP indexing techniques not efficient in OLAP: oriented to access few records
∗ OLAP queries typically include aggregation
_ The need for a different database model to support OLAP was clear: led to data warehouses
_ Data warehouse: (usually) large repositories that consolidate data from different sources (internal and
external to the organization), are updated offline, follow the multidimensional data model, designed
and optimized to efficiently support OLAP queries

Diplomatura en Big Data 4


Data Warehouse Concepts Data Warehouses

Data Warehouses
_ Data needed for analysis: typically spread across multiple heterogeneous sources, e.g., operational
systems, web
_ These data must be integrated
_ Data warehouse: a repository of integrated data obtained from several sources for multidimensional
data analysis.
_ A collection of subject-oriented, integrated, nonvolatile, and time-varying data to support manage-
ment decisions (classic Immon definition)
• Subject-oriented: data warehouses focus on the analytical needs of different areas of an organiza-
tion; e.g., in a retail company the analysis may focus on product sales or inventory management;
• Integrated: data obtained from several operational and external systems must be joined together
• Nonvolatile: durability of data is ensured by disallowing data modification and removal; a data
warehouse gathers data from several years, (typically 5 to 10); data in operational databases is
often kept for only a short period of time, e.g., 2 to 6 months
• Time-varying indicates the possibility of retaining different values for the same information, as
well as the time when changes to these values occurred

Diplomatura en Big Data 5


Data Warehouse Concepts Data Warehouses

Data Warehouses
_ No consensus about the relationship between data warehouses and data marts
• A bottom-up approach: a data warehouse is built by first building the smaller data marts and then
merging these to obtain the data warehouse
• A top-down (classic data warehouse view): data marts are obtained from the data warehouse, a
data mart just a logical view of a data warehouse
_ Several differences between OLTP and OLAP systems

Aspect Operational databases Data warehouses


1 User type Operators, office employees Managers, executives
2 Usage Predictable, repetitive Ad hoc, nonstructured
3 Data content Current, detailed data Historical, summarized data
4 Data organization According to operational needs According to analysis needs
5 Data structures Optimized for small transactions Optimized for complex queries
6 Access frequency High From medium to low
7 Access type Read, insert, update, delete Read, append only
8 Number of records per Few Many
access
9 Response time Short Can be long
10 Concurrency level High Low
11 Lock utilization Needed Not needed
12 Update frequency High None
13 Data redundancy Low (normalized tables) High (denormalized tables)
14 Data modeling UML, ER model Multidimensional model

Diplomatura en Big Data 6


Data Warehouse Concepts Data Warehouses

Data Warehouses
_ OLTP systems users: perform predefined operations through transactional applications (e.g., payroll
or ticket reservation systems)
_ Data warehouse users employ interactive OLAP tools to perform data analysis, for example, to detect
salary inconsistencies or most frequent tourist destinations (Lines 1–2)
_ Data:
• OLTP systems: current and detailed
• Data analytics require historical, summarized data (Line 3)
_ OLTP requires R/W operations, e.g., insert new orders, modify old ones, and delete orders if customers
cancel them → small number of records accessed
_ OLAP complex aggregation queries, thus requiring read access to all the records in one or more tables
_ OLAP systems not so frequently accessed as OLTP systems
_ OLTP systems: short response time; complex OLAP queries can take long time to complete (Line 9)
_ OLTP systems: heavy concurrent accesses (Lines 10–11).
_ OLAP systems are read only → queries can be submitted and computed concurrently
_ OLTP systems are constantly updated online; OLAP systems are updated offline periodically, then:
• OLTP: highly normalized schema; OLAP: denormalized schema

Diplomatura en Big Data 7


Data Warehouse Concepts
Outline

_ Data Warehouses
y Data Warehouse Architecture
_ The Multidimensional Model
_ OLAP Operations

Diplomatura en Big Data 8


Data Warehouse Concepts Data Warehouse Architecture

Data Warehouse Architecture


_ General data warehouse architecture: several tiers
_ Back-end tier composed of:
• The extraction, transformation, and loading (ETL) tools : Feed data into the data warehouse from
operational databases and internal and external data sources
• The data staging area: An intermediate database where all the data integration and transformation
processes are run prior to the loading of the data into the data warehouse
_ Data warehouse tier composed of:
• An enterprise data warehouse and/or several data marts
• A metadata repository storing information about the data warehouse and its contents
_ OLAP tier composed of:
• An OLAP server which provides a multidimensional view of the data, regardless the actual way in
which data are stored
_ Front-end tier is used for data analysis and visualization
• Contains client tools such as OLAP tools, reporting tools, statistical tools, and data-mining tools

Diplomatura en Big Data 9


Data Warehouse Concepts Data Warehouse Architecture

Typical Data Warehouse Architecture


Internal Data staging Metadata
sources
OLAP tools

Enterprise Reporting
ETL OLAP tools
Operational data
process warehouse server
databases

Statistical
tools

External
sources Data marts
Data mining
tools

Data Back-end Data warehouse OLAP Front-end


sources tier tier tier tier

Diplomatura en Big Data 10


Data Warehouse Concepts Data Warehouse Architecture

Back-End Tier
_ Performs Extraction, Transformation, and Loading
_ It is a three-step process
_ Extraction gathers data from multiple, heterogeneous data sources internal or external to the organi-
zation
_ Transformation modifies the data from the format of the data sources to the warehouse format; this
includes:
• Cleaning: Removes errors and inconsistencies in the data and converts it into a standardized format
• Integration: Reconciles data from different data sources, both at the schema and at the data level
• Aggregation: Summarizes the data obtained from data sources according granularity of the data
warehouse
_ Loading feeds the data warehouse with the transformed data, including refreshing the data warehouse,
that is, propagating updates from the data sources to the data warehouse at a specified frequency
_ Data staging area (usually called operational data store): A database where data extracted from the
sources undergoes successive modifications before being loaded into the data warehouse

Diplomatura en Big Data 11


Data Warehouse Concepts Data Warehouse Architecture

Back-End Tier
_ Example: DW for smart meter analysis: energy saving, offers, etc.
_ Traditional solution (see article Hadoop and DW)

_ Data loaded from manual processes each quarter: 10 million readings


_ Smart meters: a measure each 5 minutes → 100 billion reads each quarter
_ A new solution required

Diplomatura en Big Data 12


Data Warehouse Concepts Data Warehouse Architecture

Back-End Tier
_ Example: DW for smart meter analysis: energy saving, offers, etc.
_ Hadoop-based solution (see article Hadoop and DW)

_ Raw data moved into Hadoop, processed with Pig and moved to the DW
_ Hadoop not an ETL tool, just supports ETL

Diplomatura en Big Data 13


Data Warehouse Concepts Data Warehouse Architecture

Data Warehouse Tier


_ Components:
• An enterprise data warehouse, centralized and encompassing an entire organization
• Several data marts, specialized departmental data warehouses
_ Metadata
• Business metadata describes the semantics of the data, and organizational rules, policies, and
constraints related to the data
• Technical metadata describes how data are structured and stored in a computer system, and the
applications and processes that manipulate the data
_ The metadata repository may contain information such as:
• Metadata describing the structure of the data warehouse and the data marts, at the conceptual/logical
level (facts, dimensions, hierarchies, ...) and at the physical level (indexes, partitions,...)
• Security information (user authorization and access control), and monitoring information (usage
statistics, error reports, audit trails)
• Metadata describing data sources: schemas, ownership, update frequencies, legal limitations, ac-
cess methods
• Metadata describing the ETL: data lineage, data extraction, cleaning, transformation rules, etc.

Diplomatura en Big Data 14


Data Warehouse Concepts Data Warehouse Architecture

OLAP Tier
_ Composed of an OLAP server, which presents business users with multidimensional data from data
warehouses or data marts
_ Products include OLAP extensions and tools allowing building, querying, and navigating cubes, anal-
ysis, and reporting
_ Not yet a standardized language for defining and manipulating data cubes
_ MDX (MultiDimensional eXpressions): query language for OLAP databases, a de facto standard for
querying OLAP systems
_ SQL extended for providing analytical capabilities: SQL/OLAP

Diplomatura en Big Data 15


Data Warehouse Concepts Data Warehouse Architecture

Front-End Tier
_ Client tools that allow users to exploit the content of the data warehouse
_ OLAP tools: allow interactive exploration and manipulation of the warehouse data and formulation
of complex ad hoc queries
_ Reporting tools enable the production, delivery, and management of reports, which can be paper-
based, interactive, or web-based
_ Reports use predefined queries queries asking for specific information in a specific format, performed
on a regular basis
_ Statistical tools: used to analyze and visualize the cube data using statistical methods
_ Data-mining tools allow users to analyze data in order to discover valuable knowledge such as pat-
terns and trends, and also allow to make predictions based on current data

Diplomatura en Big Data 16


Data Warehouse Concepts Data Warehouse Architecture

Variations of the Architecture


_ Sometimes: only an enterprise data warehouse without data marts or, alternatively, an enterprise data
warehouse does not exist
_ In other situations, an OLAP server does not exist and/or the client tools directly access the data
warehouse (indicated by the arrow connecting the data warehouse tier to the front-end tier)
_ Extreme situation: neither a data warehouse nor an OLAP server → virtual data warehouse, which
defines a set of views over operational databases that are materialized for efficient access
_ Virtual data warehouse is easy to build but does not provide a real data warehouse solution (does not
contain historical data, centralized metadata, etc.)
_ Data staging area may not be needed when the data in the source systems conforms very closely to the
data in the warehouse

Diplomatura en Big Data 17


Data Warehouse Concepts
Outline

_ Data Warehouses
_ Data Warehouse Architecture
y The Multidimensional Model
_ OLAP Operations

Diplomatura en Big Data 18


The Multidimensional Model
_ Views data in an n-dimensional space: A data cube
_ A data cube is composed of dimensions and facts
_ Dimensions: Perspectives used to analyze the data
• Example: A three-dimensional cube for sales data with dimensions Product, Time, and Customer,
and a measure Quantity

) r
Köln

ity e
(C tom
Berlin

s
Lyon

Cu
Paris
Q1 measure

Time (Quarter)
values
Q2
dimensions
Q3

Q4
Produce Seafood
Beverages Condiments
Product (Category)

_ Attributes describe dimensions


• Product dimension may have attributes ProductNumber and UnitPrice (not shown in the figure)
_ The cells or facts of a data cube have associated numeric values called measures
_ Each cell of the data cube represents Quantity of units sold by category, quarter, and customer’s city

Diplomatura en Big Data 19


The Multidimensional Model
_ Data granularity: level of detail at which measures are represented for each dimension of the cube
• Example: sales figures aggregated to granularities Category, Quarter, and City
• We may want sales figures at a finer granularity (Month), or at a coarser granularity (Country)
_ Elements of a dimension at a certain granularity are called members
• Example: Seafood and Beverages are members of Product at granularity Category
_ A dimension instance comprises all members at all granularity levels in a dimension
_ A data cube contains several measures, e.g. Amount, indicating the total sales amount (not shown)
_ A data cube may be sparse (typical case) or dense
• Example: not all customers may have ordered products of all categories during all quarters
_ Hierarchies: allow viewing data at several granularities
• Define a sequence of mappings relating lower-level, detailed concepts to higher-level ones
• The lower level is called the child and the higher level is called the parent
• The hierarchical structure of a dimension is called the dimension schema

Diplomatura en Big Data 20


The Multidimensional Model
_ Hierarchies of the Product, Time, and Customer dimensions
Product Time Customer
All All All

Category Year Continent

Product Semester Country

Quarter State

Month City

Day Customer

Diplomatura en Big Data 21


The Multidimensional Model
_ Members of a hierarchy Product → Category

All all

Category Beverages ... Seafood

Product Chai Chang ... Ikura Konbu ...

Diplomatura en Big Data 22


The Multidimensional Model: Measures
_ Aggregation of measures changes the abstraction level at which data in a cube are visualized
_ Measures can be:
• Additive: can be meaningfully summarized along all the dimensions, using addition
∗ The most common type of measures
• Semiadditive: can be meaningfully summarized using addition along some dimensions
∗ Example: inventory quantities, which cannot be added along the Time dimension
• Nonadditive measures cannot be meaningfully summarized using addition across any dimension
∗ Example: item price, cost per unit, and exchange rate
_ Another classification of measures:
• Distributive: defined by an aggregation function that can be computed in a distributed way
∗ Functions count, sum, minimum, and maximum are distributive, distinct count is not
∗ Example: S = {3, 3, 4, 5, 8, 4, 7, 3, 8} partitioned in subsets {3, 3, 4}, {5, 8, 4}, {7, 3, 8} gives a
result of 8, while the answer over the original set is 5
• Algebraic measures are defined by an aggregation function that can be expressed as a scalar
function of distributive ones; example: average, computed by dividing the sum by the count
• Holistic measures cannot be computed from other subaggregates (e.g., median, rank)

Diplomatura en Big Data 23


The Multidimensional Model: Measures
_ When defining a measure we must determine the associated aggregation functions
• For example, a semiadditive measure representing inventory quantities can be aggregated using
average along the Time dimension, and using addition along other dimensions
_ Summarizability refers to the correct aggregation of cube measures along dimension hierarchies
_ Summarizability conditions:
• Disjointness of instances: the grouping of instances in a level with respect to their parent in the
next level must result in disjoint subsets
• Completeness: all instances are included in the hierarchy and each instance is related to one parent
in the next level
• Correctness: refers to the correct use of the aggregation functions (more on this next)

Diplomatura en Big Data 24


Data Warehouse Concepts
Outline

_ Data Warehouses
_ Data Warehouse Architecture
_ The Multidimensional Model
y OLAP Operations

Diplomatura en Big Data 25


Data Warehouse Concepts OLAP Operations

OLAP Operations
Köln

) r
ity e

ntr er
(C tom
Berlin

y)
ou m
Lyon Germany

(C usto
s
Cu
Paris France

C
Q1 Q1

Time (Quarter)
Time (Quarter)
Q2 Q2

Q3 Q3

Q4 Q4

Produce Seafood Produce Seafood


Beverages Condiments Beverages Condiments
Product (Category) Product (Category)

Original cube Roll-up to the Country level


Köln
) r
ity e
(C tom

Berlin
Lyon
s

Köln

) r
Cu

ity e
(C tom
Paris Berlin
Lyon

s
Jan

Cu
Paris
Time (Month)

Feb Q1

Time (Quarter)
Mar Q2

... Q3

Dec Q4
Produce Seafood Condiments Seafood
Beverages Condiments Beverages Produce
Product (Category) Product (Category)

Drill-down to the Month level Sort product by name

Diplomatura en Big Data 26


Data Warehouse Concepts OLAP Operations

OLAP Operations
_ Starting cube: quarterly sales (in thousands) by product category and customer cities for 2012
_ We first compute the sales quantities by country: a roll-up operation to the Country level along the
Customer dimension
_ Sales of category Seafood in France significantly higher in the first quarter
• To find out if this occurred during a particular month, we take cube back to City aggregation level,
and drill-down along Time to the Month level
_ To explore alternative visualizations, we sort products by name
_ To see the cube with the Time dimension on the x axis, we rotate the axes of the original cube, without
changing granularities → pivoting (see next slide)
_ To visualize the data only for Paris → slice operation, results in a 2-dimensional subcube, basically a
collection of time series (see next slide)
_ To obtain a 3-dimensional subcube containing only sales for the first two quarters and for the cities
Lyon and Paris, we go back to the original cube and apply a dice operation

Diplomatura en Big Data 27


Data Warehouse Concepts OLAP Operations

OLAP Operations

or t
y)
eg c
Seafood

at du
Condiments

(C Pro
Produce
Q1

Time (Quarter)
Beverages
Paris Q2

Customer (City)
Lyon Q3

Berlin Q4

Köln Produce Seafood


Beverages Condiments
Q1 Q2 Q3 Q4 Product (Category)
Time (Quarter)
Pivot Slice on City=’Paris’

) r
(Quarter) City me
( to
Lyon
s
Cu

Paris
Q1
Time

Q2
Produce Seafood
Beverages Condiments
Product (Category)
Dice on City=‘Paris’ or ‘Lyon’ and Quarter=’Q1’ or ’Q2’

Diplomatura en Big Data 28


Data Warehouse Concepts OLAP Operations

Advanced OLAP Operations


Köln Köln

) r

) r
ity e

ity e
(C tom

(C tom
Berlin Berlin
Lyon Lyon

s
Cu

Cu
Paris Paris
Q1 Q1

Time (Quarter)

Time (Quarter)
Q2 Q2

Q3 Q3

Q4 Q4
Produce Seafood Produce Seafood
Beverages Condiments Beverages Condiments
Product (Category) Product (Category)

Cube for 2011 Drill-across operator


Köln
) r
ity e
(C tom

Berlin
Lyon
s
Cu

Paris
Q1 Q1 84 89 106 84

Time (Quarter)
Time (Quarter)

Q2 Q2 82 77 93 79

Q3 Q3 105 72 65 88

Q4 Q4 112 61 96 102
Produce Seafood Lyon Köln
Beverages Condiments Paris Berlin
Product (Category) Customer (City)

Percentage change Total salesSUM


byBYquarter and city
Time, Customer

Diplomatura en Big Data 29


Data Warehouse Concepts OLAP Operations

Advanced OLAP Operations


Köln

) r
Köln

) r
ity e

ity e
(C tom

(C tom
Berlin Berlin

s
Lyon

s
Lyon

Cu

Cu
Paris Paris
Q1 Q1

Time (Quarter)

Time (Quarter)
Q2 Q2

Q3 Q3

Q4 Q4
Produce Seafood Produce Seafood
Beverages Condiments Beverages Condiments
Product (Category) Product (Category)

Maximum sales by quarter and city Top two sales by quarter and city
Köln Köln
) r

) r
ity e

ity e
(C tom

(C tom
Berlin Berlin
Lyon Lyon
s

s
Cu

Cu
Paris Paris
Jan Jan
Time (Month)

Time (Month)
Feb Feb

Mar Mar

... ...

Dec Dec
Produce Seafood Produce Seafood
Beverages Condiments Beverages Condiments
Product (Category) Product (Category)

Three-months moving average Year-to-date sum

Diplomatura en Big Data 30


Data Warehouse Concepts OLAP Operations

Advanced OLAP Operations


Köln Köln

) r
) r

ity e
ity e

(C tom
(C tom
Berlin Berlin
Lyon Lyon

s
s

Cu
Cu
Paris Paris
Q1 Q1

Time (Quarter)
Time (Quarter) Q2 Q2

Q3 Q3

Q4 Q4
Produce Seafood Produce Seafood
Beverages Condiments Beverages Condiments
Product (Category) Product (Category)
Top 70% per quarter Rank quarter by category and city

_ We go back to the original cube to compute the quarter sales that amount to 70% of the total sales,
applying the top percent aggregation operator
ADDMEASURE(Sales, Top70Perc = TOPPERCENT(Quantity, 70)
OVER Time ALL CELLS PRECEDING)
_ Finally, we rank the quarterly sales by category and city
ADDMEASURE(Sales, Rank = RANK(Quantity) OVER Time ALL CELLS PRECEDING)

Diplomatura en Big Data 31


Data Warehouse Concepts OLAP Operations

Advanced OLAP Operations


_ Union merges two cubes having the same schema but disjoint instances.
_ Example: If CubeSpain is a cube having the same schema as the original cube but containing only
the sales to Spanish customers, we can perform:
UNION(Sales, SalesSpain)
_ Difference removes the cells in a cube that belong to another one; the two cubes must have the same
schema
_ Example: Given the cubes TopTwoSales and the original one, compute a cube with all sales measures
except the top two sales by quarter and city
DIFFERENCE(Sales, TopTwoSales)
_ Drill-through operation allows to move from data at the bottom level in a cube to data in the opera-
tional systems from which the cube was derived
_ Could be used when trying to determine the reason for outlier values in a data cube

Diplomatura en Big Data 32


Data Warehouse Concepts OLAP Operations

Advanced OLAP Operations


Madrid

) r
ity e
Bilbao

(C tom
Köln

s
Cu
Berlin
Lyon
Paris
Q1

Time (Quarter)
Q2

Q3

Q4
Produce Seafood
Beverages Condiments
Product (Category)

Union of the original cube and a cube with data from Spain
Köln

) r
ity e
(C tom
Berlin
Lyon
s
Cu

Paris
Q1
Time (Quarter)

Q2

Q3

Q4
Produce Seafood
Beverages Condiments
Product (Category)

Difference of the original cube and the original one cube

Diplomatura en Big Data 33


Data Warehouse Concepts OLAP Operations

Summarizing OLAP Operations


Operator Purpose
Add measure Adds a new measure to a cube computed from other measures or dimensions.
Aggregation opera- Aggregates the cells of a cube, possibly after performing a grouping of cells.
tors
Dice Keeps the cells that satisfy a Boolean condition over dimension levels, attributes, and
measures.
Difference Removes the cells of a cube that are in another cube. Both cubes must have the same
schema.
Drill-across Merges two cubes that have the same schema and instances using a join condition.
Drill-down Disaggregates measures along a dimension hierarchy to obtain data at a finer granular-
ity. It is the opposite of the roll-up operation.
Drill-through Shows data in the operational systems from which the cube was derived. This operation
does not formally belong to the OLAP algebra since the result is not a cube.
Drop measure Removes one or several measures from a cube.
Pivot Rotates the axes of a cube to provide an alternative presentation of its data.
Recursive roll-up Performs an iteration of roll-ups over a recursive hierarchy until the top level is reached.
Rename Renames one or several schema elements of a cube.
Roll-up Aggregates measures along a dimension hierarchy to obtain data at a coarser granular-
ity. It is the opposite of the drill-down operation.
Roll-up* Shorthand notation for a sequence of roll-up operations.
Slice Removes a dimension by fixing a single value in a level of the dimension.
Sort Orders the members of a dimension according to an expression.
Union Combines the cells of two cubes that have the same schema but disjoint members.

Diplomatura en Big Data 34

You might also like