DW Slides
for BS(CS)
Data Marts
Operational Sources (OLTPs)
Operational computer systems provide the information needed to run day-to-day operations and to answer routine daily questions
Also called online transaction processing (OLTP) systems
Data is read or manipulated with each transaction
Transactions/queries are simple and easy to write
Usually for middle management
Examples
Sales systems
HRM Applications
General-Ledger
Payroll
Etc.
Typical decision queries
Data sets are mounting everywhere, but they are not directly useful for decision support
Strategic Information = Enterprise-wide essential data
People need timely strategic information to make appropriate decisions and gain competitive advantage
Decision-making requires complex questions asked of integrated data.
Decision makers want to know:
Where to build a new oil warehouse?
What are the total sales by month/quarter/year for each office?
Is there any relation between promotion campaigns and sales growth?
Can OLTP answer all such questions efficiently?
Characteristics of strategic information
Integrated
Must have a single, enterprise-wide view
Data Integrity
Information must be accurate and must conform to business rules
Accessible
Easily accessible, with intuitive access paths and responsive performance for analysis
Credible
Every business factor must have one and only one value
Timely
Information must be available within the stipulated time frame
Information Crisis
History of Decision Support Systems (DSS)
Ad-Hoc Reports: IT created a small program for each required report.
Special Extract Programs: A collection of programs written for anticipated reports.
Small Applications: Small programs with the option of taking parameters as input from the user.
IT centers: Special facilities where users could go to ask for pre-generated reports; the IT staff there helped users obtain information.
Decision Support Systems: Much more sophisticated systems: menu-driven, online processing, printable reports, but still built on extracted files.
Executive Information Systems: Simple reports on the user's desktop, but again only preprogrammed reports.
Failure of old DSS
Inability to provide strategic information
IT receives too many ad hoc requests, creating a large overload
Requests are not only numerous, they change over time
More understanding demands more reports
Users are caught in a spiral of reports
Users have to depend on IT for information
Could not provide adequate performance; too slow
Strategic information has to be flexible (not strictly predefined) and conducive to guided analysis
Failure of old DSS (Cont.)
A typical spiral of user needs and IT efforts
OLTP vs. DSS
OLTP (Get the data in)
DSS (Get the information out)
OLTP vs. DSS (cont.)
Trait | OLTP | DSS
User | Middle management | Executives, decision-makers
Function | For day-to-day operations | For analysis & decision support
DB (modeling) | E-R based, after normalization | Star-oriented schemas
Data | Current, isolated | Archived, derived, summarized
Unit of work | Transactions | Complex query
Access type | DML, read | Read
Access frequency | Very high | Medium to low
Expectations of new solution
DW meets expectations
Definition of DW
Inmon's definition:
“A DW is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of decision-making”.
Kelly's definition:
“Separately available, integrated, time-stamped, subject-oriented, non-volatile, accessible”
Four properties of DW
Subject-oriented
In operational sources, data is organized by applications or business processes.
In a DW, subject is the method of organization
Subjects vary with the enterprise
These are the critical factors that affect performance
Example of a manufacturing company:
Sales
Shipment
Inventory, etc.
Integrated Data
Data comes from several applications
Problems of integration come into play
File layouts, encoding, field names, schemas, and data heterogeneity are the issues
Bank example of variation: naming conventions, attributes for a data item, account number, account type, size, currency
In addition to internal sources, there are external data sources
Data shared by external companies
Websites
Others
Inconsistencies must be removed
Hence the process of extraction, transformation & loading (ETL)
Time variant
Operational data holds current values
Comparative analysis is one of the best techniques for business performance evaluation
Time is a critical factor for comparative analysis
Every data structure in the DW contains a time element
In order to promote a product in a certain area, an analyst has to know its current and historical values
The advantages are:
Allows for analysis of the past
Relates information to the present
Enables forecasts for the future
Non-volatile
Data from operational systems is moved into the DW at specific intervals
Data is persistent / not removed, i.e., non-volatile
Individual business transactions do not update the DW
Data in the DW is not deleted
Nor is data changed by individual transactions
Architecture of DW
Components
Major components
Source data component
Data staging component
Data storage component
Information delivery component
Metadata component
Management and control component
1. Source Data Component
Source data can be grouped into 4 components
Production data
Comes from the operational systems of the enterprise
Segments are selected from it depending on requirements
Narrow scope, e.g., order details
Internal data
Private datasheets, documents, customer profiles, etc.
E.g., customer profiles for a specific offering
Special strategies are needed to transform such data (e.g., text documents) for the DW
Archived data
Operational systems archive old data, perhaps in a separate archival database that is still online, in flat files, or, for even older data, on tapes.
The DW keeps snapshots of this historical data
External data
Executives depend on external sources
E.g., market data of competitors, market indicators.
For example, a car rental business would like information about the production schedules of car manufacturers in order to make strategic decisions
But integration requires that external data conform to your own data
2. Data Staging Component
After data is extracted, it has to be prepared: changed, converted, and made ready in a suitable format
Three major functions make the data ready (a minimal sketch follows):
Extract
Transform
Load
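The following is a minimal Python sketch, not from the slides, of the three staging functions run in sequence; the source format, cleaning rules, and names are hypothetical.

```python
# Minimal sketch of the three staging functions. The "source" is a
# list of raw CSV-like strings; rules and names are hypothetical.

raw_source = ["101,ali ,LHR", "102, sara,KHI"]

def extract(lines):
    # Pull raw fields out of the source records.
    return [line.split(",") for line in lines]

def transform(rows):
    # Clean and standardize: trim whitespace, title-case names,
    # expand city codes to full names.
    cities = {"LHR": "Lahore", "KHI": "Karachi"}
    return [(int(cid.strip()), name.strip().title(), cities[city.strip()])
            for cid, name, city in rows]

def load(target, rows):
    # Move the prepared rows into the warehouse table (a dict here).
    for cid, name, city in rows:
        target[cid] = {"name": name, "city": city}

warehouse = {}
load(warehouse, transform(extract(raw_source)))
print(warehouse)  # {101: {'name': 'Ali', 'city': 'Lahore'}, ...}
```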
3. Data Storage Component
A separate repository
Data structured for efficient processing
Redundancy is increased
Updated at specific intervals
Read-only
4. Information Delivery Component
Depending on the type of users, different methods and means of information delivery are used.
Meta-data component
Meta-data is information about the data in the data warehouse
Three kinds of meta-data:
Operational: contains information about the mappings between the source systems and the data warehouse, e.g., field lengths, data types
Extraction and transformation: describes how data is extracted from the sources and transformed on its way into the warehouse
End-user: serves as a navigational map that helps end users find information in the warehouse
Management and Control Component
It coordinates all the activities and services within the data warehouse
Difference between Data Warehouse and Data Marts
The terms are sometimes used synonymously; strictly speaking, however, they differ in scale
A data warehouse is enterprise-wide.
Data marts are local, at the departmental level, or targeted at a particular group of users.
The main topic of discussion is whether to build the data warehouse first (top-down approach) or the data marts first (bottom-up approach). Both approaches have their advantages and disadvantages.
Top-down approach
Advantages:
Single, enterprise-wide, integrated data
Disadvantages:
Takes longer to build
High chances of failure
Requires experienced professionals
Senior management is not likely to see results immediately
Bottom-up approach
Advantages:
Faster and easier implementation
Faster return on investment and proof of concept
Less risk of failure
Inherently incremental (can prioritize which data mart to build first)
Disadvantages:
Narrow view of data in each data mart
Redundancy spread across data marts
Can cause inconsistent data
Practical approach proposed by Ralph Kimball
A compromise between the top-down and bottom-up approaches
First define the requirements of the enterprise, then build the data marts
DW Design
Designing DW
Background (ER Modeling)
For ER modeling, entities are collected from the environment
Each entity acts as a table
Reasons for its success:
Schemas are normalized after ER design, since normalization removes redundancy (to handle update/delete anomalies)
But the number of tables increases
Useful for fast access to small amounts of data
ER Drawbacks for DW / Need of Dimensional Modeling
ER models are hard to remember, due to the increased number of tables
Complex for queries involving multiple tables (table joins)
Ideally holds no calculated attributes
A DW does not need to update data the way an OLTP system does, so there is no need for normalization
An efficient indexing scheme is needed to avoid scanning all the data
Dimensional Modeling
Dimensional modeling focuses on subject orientation: the critical factors of the business
Critical factors are stored in facts
Redundancy is not a problem; it helps achieve efficiency
A logical design technique for high performance
It is the modeling technique used for DW storage
Dimensional Modeling (cont.)
Two important concepts
Fact
Dimension
Dimensional Modeling (cont.)
Facts are stored in a fact table
Dimensions are represented by dimension tables
Each fact is surrounded by dimension tables
This looks like a star, hence the name Star Schema
Example
FACT: time_key (FK), store_key (FK), clerk_key (FK), product_key (FK), customer_key (FK), promotion_key (FK), dollars_sold, units_sold, dollars_cost
Dimension tables surrounding the fact table:
PRODUCT: product_key (PK), SKU, description, brand, category
TIME: time_key (PK), SQL_date, day_of_week, month
STORE: store_key (PK), store_ID, store_name, address, district, floor_type
CUSTOMER: customer_key (PK), customer_name, purchase_profile, credit_profile, address
CLERK: clerk_key (PK), clerk_id, clerk_name, clerk_grade
PROMOTION: promotion_key (PK), promotion_name, price_type, ad_type
Inside Dimensional Modeling
Inside a dimension table:
Key attribute of the dimension table, for identification
Large number of columns (a wide table)
Non-calculated, textual attributes
Attributes that are not directly related (e.g., brand and package size)
Un-normalized in a Star schema
Drill-down and roll-up are two ways of exploiting dimensions
Can have multiple hierarchies (e.g., product category for marketing and product category for accounting)
Relatively small number of records
Inside Dimensional Modeling
A fact table has two types of attributes:
Key attributes, for connections
Facts
Inside a fact table:
Concatenated key
Grain, or level of data, identified
Large number of records
Limited attributes
Sparse data set
Degenerate dimensions (e.g., order number, useful for questions such as average products per order)
Fact-less fact table
Star Schema Keys
Surrogate keys in dimension tables
Replacement of the primary key
System-generated
Foreign keys in fact tables
Collection of the primary keys of the dimension tables
Primary key of a fact table
Collection of the PKs that came from the dimension tables
May include a degenerate dimension
May be a system-generated surrogate key
A minimal sketch of surrogate key handling follows.
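A minimal Python sketch, not from the slides, of how an ETL process might assign system-generated surrogate keys to dimension rows; the class and key names are hypothetical.

```python
# Minimal sketch: assigning system-generated surrogate keys to
# dimension rows. Source key names are hypothetical.

class DimensionKeyGenerator:
    """Maps operational (source) keys to system-generated surrogate keys."""

    def __init__(self):
        self.key_map = {}      # source key -> surrogate key
        self.next_key = 1      # simple counter acts as the key generator

    def surrogate_for(self, source_key):
        # Reuse the surrogate key if this source row was seen before,
        # otherwise generate a new one.
        if source_key not in self.key_map:
            self.key_map[source_key] = self.next_key
            self.next_key += 1
        return self.key_map[source_key]

gen = DimensionKeyGenerator()
print(gen.surrogate_for("CUST-0042"))  # 1 (newly assigned)
print(gen.surrogate_for("CUST-0042"))  # 1 (same customer, same key)
print(gen.surrogate_for("CUST-0099"))  # 2
```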
Advantages of Star Schema
Easy for users to understand
Optimized for navigation (fewer joins, hence fast)
Most suitable for query processing (drill-down, roll-up)
Special techniques for joins and indexing allow further query optimization
Normalization [1]
1st Normal Form [2]
“A relation is in first normal form if and only if every attribute is single-valued for each tuple”
(Table with columns STU_ID, STU_NAME, MAJOR, CREDITS, CATEGORY)
1st Normal Form (Cont.)
(Table with columns STU_ID, STU_NAME, MAJOR, CREDITS, CATEGORY)
Another Example (composite key: SID, Course) [1]
1st Normal Form Anomalies [1]
Update anomaly: We need to update all six rows for the student with ID=1 if we want to change his location from Islamabad to Karachi
Delete anomaly: Deleting the information about a student who has graduated will remove all of his information from the database
Insert anomaly: To insert information about a student, that student must be registered in a course
Solution: 2nd Normal Form dependencies [1]
SID → campus, degree
campus → degree
(SID, Course) → marks
Example in 2nd Normal Form [1]
Anomalies [1]
Solution: 3rd Normal Form
campus → degree (the transitive dependency, moved into its own relation)
Example in 3rd Normal Form [1]
Denormalization [1]
Five techniques to denormalize relations [1]
Collapsing tables
Pre-joining
Splitting tables (horizontal, vertical)
Adding redundant columns
Derived attributes
Collapsing tables (one-to-one) [1]
Splitting tables [1]
Redundant columns [1]
Derived Attributes
Updates to Dimension Tables
Updates to Dimension Tables (Cont.)
Updates to Dimension Tables (Cont.)
Updates to Dimension Tables (Cont.)
Proposed solution:
Updates to Dimension Tables (Cont.)
Updates to Dimension Tables (Cont.)
Solution: Add a new column of attribute
Updates to Dimension Tables (Cont.)
Rapidly Changing Dimension
Rapidly Changing Dimension (Cont.)
“For example, an important attribute for customers might
be their account status (good, late, very late, in arrears,
suspended), and the history of their account status” [4]
“If this attribute is kept in the customer dimension table
and a type 2 change is made each time a customer's
status changes, an entire row is added only to track this
one attribute” [4]
“The solution is to create a separate account_status
dimension with five members to represent the account
states” [4] and join this new table or dimension to the
fact table.
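As a complement to the type 2 changes mentioned in the quote above, here is a minimal Python sketch, not from the slides, of a type 2 update: a changed attribute closes the current dimension row and adds a new one. The row layout and names are hypothetical.

```python
from datetime import date

# Minimal sketch of a type 2 slowly changing dimension update:
# instead of overwriting, close the current row and append a new one.
# Row layout and names are hypothetical.

customer_dim = [
    # surrogate key, source key, status, valid_from, valid_to (None = current)
    {"sk": 1, "source_key": "CUST-0042", "status": "good",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def apply_type2_change(dim, source_key, new_status, change_date):
    # Close the currently valid row for this source key...
    for row in dim:
        if row["source_key"] == source_key and row["valid_to"] is None:
            row["valid_to"] = change_date
    # ...and add a new row carrying the new attribute value.
    dim.append({"sk": len(dim) + 1, "source_key": source_key,
                "status": new_status, "valid_from": change_date,
                "valid_to": None})

apply_type2_change(customer_dim, "CUST-0042", "late", date(2021, 6, 1))
for row in customer_dim:
    print(row)
```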
Example
Junk Dimensions
Sometimes there are informative flags and texts in the source system, e.g., yes/no flags, textual codes, etc.
If such flags are important, then give them their own dimension to save storage space
However, use caution: if you have a large number of combinations, then make separate dimensions.
For example, with 5 attributes of 100 distinct values each, the total number of rows would be 100^5. In this case, make a separate dimension for each attribute.
A sketch of building a junk dimension follows.
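A minimal Python sketch, not from the slides, of building a junk dimension from the combinations of low-cardinality flags that actually occur in the data; the flag names are hypothetical.

```python
# Minimal sketch: build a junk dimension from low-cardinality flags.
# Only combinations that actually occur get a row, which keeps the
# dimension small. Flag names are hypothetical.

transactions = [
    {"gift_wrap": "Y", "coupon_used": "N", "payment": "cash"},
    {"gift_wrap": "N", "coupon_used": "N", "payment": "card"},
    {"gift_wrap": "Y", "coupon_used": "N", "payment": "cash"},  # repeat
]

junk_dim = {}   # flag combination -> surrogate key
fact_rows = []

for tx in transactions:
    combo = (tx["gift_wrap"], tx["coupon_used"], tx["payment"])
    if combo not in junk_dim:
        junk_dim[combo] = len(junk_dim) + 1  # assign surrogate key
    # The fact row stores only the small surrogate key, not the flags.
    fact_rows.append({"junk_key": junk_dim[combo]})

print(junk_dim)   # {('Y','N','cash'): 1, ('N','N','card'): 2}
print(fact_rows)  # [{'junk_key': 1}, {'junk_key': 2}, {'junk_key': 1}]
```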
Junk Dimension Example [3]
Junk Dimension Example (Cont.) [3]
The Snowflake Schema
Snowflaking involves normalizing the dimensions of a Star Schema
Reasons:
To save storage space
Easier maintenance of tables with low cardinality
However, some designers discourage it. According to them, it is better to sacrifice storage space than performance and ease of use.
Example 1 of Snowflake Schema
Example 2 of Snowflake Schema
Aggregate Fact Tables
Example
A way of making aggregates
Example:
Making Aggregates
Example of Aggregate Fact Table
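As an illustration of making aggregates, here is a minimal Python sketch, not from the slides, of rolling detailed fact rows up into an aggregate fact table at a coarser grain (product by month); the keys and measures are hypothetical.

```python
# Minimal sketch: build an aggregate fact table from a detailed fact
# table by summing a measure at a coarser grain (product x month).

detail_facts = [
    # (product_key, store_key, month, units_sold)
    (1, 10, "Jan", 5), (1, 11, "Jan", 3), (1, 10, "Feb", 4), (2, 10, "Jan", 7),
]

aggregate = {}
for product, store, month, units in detail_facts:
    key = (product, month)               # aggregate away the store dimension
    aggregate[key] = aggregate.get(key, 0) + units

print(aggregate)
# {(1, 'Jan'): 8, (1, 'Feb'): 4, (2, 'Jan'): 7}
```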
Families of Stars
Families of Stars (Cont.)
Transaction tables (day-to-day data) and snapshot tables (data captured at specific intervals)
Families of Stars (Cont.)
Core and custom tables
Families of Stars (Cont.)
Conformed Dimension: The attributes of a dimension must have the same meaning for all the fact tables to which the dimension is connected.
Extract, Transform, Load (ETL)
Data Extraction
Data can be extracted using third-party tools or in-house programs or scripts
Data extraction issues:
1. Identify the sources
2. Method of extraction for each source (manual, automated)
3. When and how frequently data will be extracted from each source
4. Time window
5. Sequencing of extraction processes
How data is stored in operational systems
Current value: Values continue to change as daily transactions are performed. We need to monitor these changes to maintain history for the decision-making process, e.g., bank balance, customer address, etc.
Periodic status: Sometimes the history of changes is maintained in the source system
Example
Data Extraction Method
Incremental data extraction
Immediate data extraction: involves extracting data in real time, when the transaction occurs.
Possible options:
1. Capture through transaction logs
2. Use triggers/stored procedures
3. Capture via the source application
Deferred data extraction: involves extracting data at a later time.
1. Capture on the basis of time and date stamps
2. Capture by comparing files (a minimal sketch follows)
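A minimal Python sketch, not from the slides, of deferred extraction by comparing files: yesterday's snapshot is compared with today's to find new, changed, and deleted records. The file layout is hypothetical.

```python
# Minimal sketch: deferred data extraction by comparing snapshots.
# Each snapshot maps a record key to its attribute values; in practice
# these would be read from extract files.

yesterday = {"C1": ("Alice", "Lahore"), "C2": ("Bob", "Karachi")}
today     = {"C1": ("Alice", "Islamabad"),            # changed
             "C2": ("Bob", "Karachi"),                # unchanged
             "C3": ("Carol", "Multan")}               # new

def diff_snapshots(old, new):
    inserts = {k: v for k, v in new.items() if k not in old}
    updates = {k: v for k, v in new.items()
               if k in old and old[k] != v}
    deletes = [k for k in old if k not in new]
    return inserts, updates, deletes

inserts, updates, deletes = diff_snapshots(yesterday, today)
print("insert:", inserts)   # {'C3': ('Carol', 'Multan')}
print("update:", updates)   # {'C1': ('Alice', 'Islamabad')}
print("delete:", deletes)   # []
```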
Data Transformation
Data Transformation (Cont.)
Data Loading
Determine when (time) and how (as a whole or in chunks) to load data
Four modes to load data (a minimal sketch follows the list):
1. Load: removes old data, if present; otherwise simply loads the data
2. Append: the old data is not removed; the new data is appended to it
3. Destructive Merge: if the primary key of a new record matches the primary key of an old record, update the old record
4. Constructive Merge: if the primary key of a new record matches the primary key of an old record, do not update the old record; add the new record and mark it as the superseding record
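A minimal Python sketch, not from the slides, of the four load modes over a simple key-to-record dictionary; the data shapes are hypothetical.

```python
# Minimal sketch of the four load modes. The target is a dict keyed by
# primary key; each value is a list of versions, so the constructive
# merge can keep a "superseding record" alongside the old one.

def load(target, incoming):
    target.clear()                        # remove old data entirely
    for key, rec in incoming:
        target[key] = [rec]

def append(target, incoming):
    for key, rec in incoming:             # old data stays; new rows added
        target.setdefault(key, []).append(rec)

def destructive_merge(target, incoming):
    for key, rec in incoming:             # matching key: overwrite old record
        target[key] = [rec]

def constructive_merge(target, incoming):
    for key, rec in incoming:             # matching key: keep old record,
        versions = target.setdefault(key, [])
        versions.append(rec)              # new record supersedes the old one

warehouse = {}
load(warehouse, [(1, "v1")])
constructive_merge(warehouse, [(1, "v2")])
print(warehouse)  # {1: ['v1', 'v2']} -- 'v2' is the superseding record
```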
Data Loading (cont.)
Data Loading (Cont.)
Data Refresh vs. Data Update
A full refresh reloads the whole data set after deleting the old data; data updates are used to update only the changing attributes
Data Loading (Cont.)
Loading for dimension tables: You need to define a mapping between the source system key and the system-generated key in the data warehouse (as in the surrogate key sketch earlier); otherwise you will not be able to load/update data correctly
Data Loading (Cont.)
Updates to dimension table
Data Quality Management
It is important to ensure that the data is correct in order to make the right decisions
Imagine a user of an operational system entering wrong region codes for customers.
Imagine that the relevant business has never sent an invoice using these region codes (so the errors go unnoticed). But what will happen if the data warehouse uses these codes to make decisions?
You need to put in proper time and effort to ensure data quality
Data Quality
http://www.humaninference.com/master-data-management/data-quality/data-cleansing
Google Refine: http://code.google.com/p/google-refine/
Wrangler: http://vis.stanford.edu/wrangler/
TextPipe Pro: http://www.datamystic.com/textpipe.html#.UKjm9eQ3vyo
Information Supply
Naive users
Regular users: daily users who cannot write queries and reports themselves; they need query templates and predefined reports
Power users: technically sound users who can write queries, reports, and scripts, and import and export data themselves
User classes from another perspective
High-level executives and managers: need standard pre-processed reports to make strategic decisions
Technical analysts: complex analysis, statistical analysis, drill-down, slice, dice
Business analysts: comfortable with technology but cannot write queries; can modify reports to view information from different angles to some extent
Business-oriented users: predefined GUIs, possibly with support for some ad hoc queries and standard reports
The ways of interaction
Preprocessed reports: routine reports delivered at specific intervals
Predefined queries and templates: users supply their own parameters to predefined query templates and reports with a predefined format
Limited ad hoc access: few and simple queries, developed from scratch
Complex ad hoc access: complicated queries and analysis; can serve as the basis for predefined reports and queries
Tools selection
Information delivery framework
Online Analytical Processing (OLAP)
Used for fast and complex analysis on the data warehouse
It is not a database design technique, but a category of applications
Definition from the OLAP Council:
OLAP?
OLAP (cont.)
What is the solution?
OLAP as FASMI
Further reading: “What is OLAP?” by Nigel Pendse, Principal of OLAP Solutions and co-author of OLAPReport.com
Source: http://dssresources.com/papers/features/pendse04072002.htm
Major features of OLAP
Dimensional Analysis
1. What are cubes?
2. What are hyper-cubes?
Drill-down and Roll-up
Drill through, Drill across
Slice and Dice
Pivoting
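As a concrete complement to these features, here is a minimal Python sketch, not from the slides, of slice, dice, and roll-up over a tiny cube stored as a dict keyed by (product, store, month); the data is hypothetical.

```python
# Minimal sketch of OLAP-style operations on a tiny 3-D cube held as a
# dict keyed by (product, store, month) -> sales. Data is hypothetical.

cube = {
    ("tea",    "Lahore",  "Jan"): 10, ("tea",    "Lahore",  "Feb"): 12,
    ("tea",    "Karachi", "Jan"):  7, ("coffee", "Lahore",  "Jan"):  5,
    ("coffee", "Karachi", "Feb"):  9,
}

def slice_(cube, month):
    """Slice: fix one dimension (month), yielding a 2-D view."""
    return {(p, s): v for (p, s, m), v in cube.items() if m == month}

def dice(cube, products, stores):
    """Dice: keep a sub-cube over chosen members of two dimensions."""
    return {k: v for k, v in cube.items()
            if k[0] in products and k[1] in stores}

def roll_up(cube, axis):
    """Roll-up: aggregate away all dimensions except `axis` (0, 1, or 2)."""
    totals = {}
    for key, v in cube.items():
        totals[key[axis]] = totals.get(key[axis], 0) + v
    return totals

print(slice_(cube, "Jan"))              # 2-D slice for January
print(dice(cube, {"tea"}, {"Lahore"}))  # sub-cube
print(roll_up(cube, axis=0))            # total sales per product
```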
Dimensional Analysis
OLAP supports multi-dimensional analysis
Cube: has three dimensions (X-axis, Y-axis, Z-axis)
The same view on a spreadsheet
Multidimensional Analysis (Spreadsheet view) / Hypercube
An example of a hypercube with 4 dimensions
Drill-down and Roll-up [5]
Drill through, Drill across
Slicing [5]
Dicing [5]
Pivoting [5]
A SQL server view for cube [6]
OLAP Implementations [1]
Relational OLAP (ROLAP) [1]
How relations/tables are organized on disk [11]
File Organizations
Relations are stored in logical structures called files; the way files are organized on disk is called the file organization
Usually files are stored as a sequence of records
A record is a tuple in a relational database
Files are read in terms of physical blocks on disk
There can be many records per block, or many blocks per record; a one-to-one relationship is possible but very rare
File organization simply refers to the way records are stored in terms of blocks, and the way blocks are placed on the storage medium and interlinked
How relations/tables are organized on disk [11]
Types of File Organizations
1. Unsorted (pile file organization)
2. Sorted
3. Hashing
How relations/tables are organized on disk [11]
Records: A record is basically a tuple in a table/relation
A file is a sequence of records
Records can be of fixed length or variable length
A record comprises a sequence of fields (columns or attributes)
How relations/tables are organized on disk [11]
Blocks: The physical units of storage on storage devices (for example, sectors on hard disks)
Usually (but not necessarily; it depends on the file system structure) a block stores records of a single file
Blocks are of fixed length, based on the physical characteristics of the storage/computing device and the operating system
A storage device is either defragmented or fragmented, depending on whether contiguous sets of records lie in contiguous blocks
How relations/tables are organized on disk [11]
Unspanned records
How relations/tables are organized on disk [11]
Spanned records
If the record size is bigger than the block size, then we also need to use spanned records: a record that does not fit is continued in another block, which stores the remaining part
How relations/tables are organized on disk [11]
Unordered or Pile File organization
The simplest file organization
Records are inserted in the order of their arrival
Usually requires secondary files for efficient search
Insertion: very easy
Search: very expensive, because we must search linearly
Unordered or Pile File organization [11]
Record deletion causes fragmentation and is inefficient in terms of space
Unordered or Pile File organization [11]
Record update is also problematic in the case of variable-length records
Sorted Files [11]
A file organization where records are physically sorted on the value of some field, called the ordering field
The ordering field should be a key field (unique for each record) and belong to an ordinal domain (where values can be compared)
Insertion and deletion: both expensive
Updating may involve physical migration of the record, if the ordering field is modified
Searching: efficient, based on binary search (a minimal sketch follows)
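A minimal Python sketch, not from the slides, of binary search over a file of records sorted on an ordering field; a list of (key, record) pairs stands in for the sorted blocks on disk.

```python
# Minimal sketch: binary search on a sorted file. The list of
# (ordering_field, record) pairs stands in for sorted blocks on disk.

sorted_file = [(3, "rec-a"), (7, "rec-b"), (12, "rec-c"), (20, "rec-d")]

def binary_search(records, key):
    lo, hi = 0, len(records) - 1
    while lo <= hi:
        mid = (lo + hi) // 2            # probe the middle record
        if records[mid][0] == key:
            return records[mid][1]
        elif records[mid][0] < key:
            lo = mid + 1                # discard the lower half
        else:
            hi = mid - 1                # discard the upper half
    return None                         # key not present

print(binary_search(sorted_file, 12))   # rec-c
print(binary_search(sorted_file, 5))    # None
```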
Sorted Files [11]
Benefits of sorted files [11]
Hashing Techniques [11]
Hashing Techniques [11]
Overflow in Hashing Techniques [11]
A bucket overflows when it receives more records than it can hold
Three techniques to overcome overflow (a sketch of chaining follows the list):
1. Open addressing: use the next available bucket
2. Chaining: pointers to extra buckets, called overflow buckets
3. Rehashing: use a second hash function to see whether it can hash without overflow
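A minimal Python sketch, not from the slides, of technique 2, chaining: each primary bucket holds a fixed number of records, and extra records go to a chained overflow bucket. Bucket sizes and keys are hypothetical.

```python
# Minimal sketch: hashing with overflow handled by chaining.
# Each bucket holds at most BUCKET_CAPACITY records; extra records go
# into a chained overflow list attached to the bucket.

NUM_BUCKETS = 4
BUCKET_CAPACITY = 2

# Each bucket: {"records": [...], "overflow": [...]}
buckets = [{"records": [], "overflow": []} for _ in range(NUM_BUCKETS)]

def insert(key, record):
    b = buckets[hash(key) % NUM_BUCKETS]     # hash function picks the bucket
    if len(b["records"]) < BUCKET_CAPACITY:
        b["records"].append((key, record))
    else:
        b["overflow"].append((key, record))  # bucket full: chain the record

def lookup(key):
    b = buckets[hash(key) % NUM_BUCKETS]
    for k, rec in b["records"] + b["overflow"]:
        if k == key:
            return rec
    return None

for i in range(10):
    insert(i, f"rec-{i}")
print(lookup(8))  # rec-8, found via the overflow chain of its bucket
```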
Handling overflow through chaining
[11]
Indexing [11]
Primary Index [11]
Sparse Index [1]
The index keeps one entry per block of data items
It takes less space, but at the expense of some efficiency in terms of time
Dense Indexing [1]
For each record, store the key of that record and a pointer to where the record is actually placed on disk
If the index fits in memory, a lookup requires one I/O operation; if not, performance will degrade
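A minimal Python sketch, not from the slides, contrasting a dense index (one entry per record) with a sparse index (one entry per block); the record layout and block size are hypothetical.

```python
# Minimal sketch: dense vs. sparse index over a sorted file split into
# fixed-size blocks. Keys and block size are hypothetical.

BLOCK_SIZE = 3
records = [(k, f"rec-{k}") for k in [2, 5, 8, 11, 14, 17, 20]]
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Dense index: every record key -> (block number, offset within block).
dense = {k: (bno, off)
         for bno, blk in enumerate(blocks)
         for off, (k, _) in enumerate(blk)}

# Sparse index: only the first key of each block -> block number.
sparse = [(blk[0][0], bno) for bno, blk in enumerate(blocks)]

def lookup_sparse(key):
    # Find the last block whose first key <= key, then scan that block.
    target = 0
    for first_key, bno in sparse:
        if first_key <= key:
            target = bno
    for k, rec in blocks[target]:
        if k == key:
            return rec
    return None

print(dense[11])          # direct hit: (block number, offset)
print(lookup_sparse(11))  # rec-11, after scanning one block
```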
Primary index [11]
Clustering Index [11]
Secondary Index [1]
Secondary Index on key field (dense index) [11]
Secondary Index on non-key field [11]
Three techniques to handle duplication:
1. Duplicate index entries
2. Variable-length records
3. Extra redirection levels
Secondary Index on non-key field [11]
Duplicate index entries
Index entries are repeated for each duplicate occurrence of the non-key attribute
Binary search becomes more complicated: the mid-point of a search may have duplicate entries on either side
Insertion of records may require restructuring the index table
Secondary Index on non-key field [11]
Variable-length records: use variable-length records for the index table in order to accommodate duplicate key entries
Extra Redirection Levels [11]
Multilevel Indexing [1]
It uses a little more space, but it increases efficiency in terms of time
It is good for queries posing point conditions or range conditions
Example of B-Tree [1]
B-Tree limitations [1]
Join Indexes
Recommended Indexes for Data Warehouse
For primary keys, foreign keys, and frequently accessed metrics in fact tables, you can use a B-Tree index (many distinct values)
For primary keys in dimensions, use a B-Tree.
For other attributes in dimensions that are repetitive, use a Bitmap index
Join Techniques [1]
Join Techniques (cont.) [1]
Join Techniques (cont.) [1]
Join Techniques (cont.) [1]
Sort-merge join
- Sort both relations on the join attributes, then find matching rows
- Both relations are scanned once
- However, there is the additional cost of sorting
A minimal sketch follows.
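A minimal Python sketch, not from the slides, of a sort-merge join of two relations on a single join attribute; the relations are lists of tuples with a hypothetical layout and unique join keys on each side, to keep the merge simple.

```python
# Minimal sketch: sort-merge join on the first column of each relation.
# Assumes unique join keys on each side.

r = [(3, "r3"), (1, "r1"), (2, "r2")]
s = [(2, "s2"), (4, "s4"), (1, "s1")]

def sort_merge_join(r, s):
    r = sorted(r)                 # sort both inputs on the join attribute
    s = sorted(s)
    i = j = 0
    out = []
    while i < len(r) and j < len(s):   # single scan over each relation
        if r[i][0] == s[j][0]:
            out.append((r[i][0], r[i][1], s[j][1]))
            i += 1
            j += 1
        elif r[i][0] < s[j][0]:
            i += 1
        else:
            j += 1
    return out

print(sort_merge_join(r, s))  # [(1, 'r1', 's1'), (2, 'r2', 's2')]
```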
Join Techniques (cont.) [1]
Sort merge join
Hash join [1]
- Suitable for large databases
- The choice of which table gets hashed first plays a pivotal role in the overall performance of the join operation, and is left to the optimizer
- The optimizer uses the smaller of the two tables or data sources (say Table_A) to build a hash table in main memory on the join key used in the WHERE clause
- It then scans the larger table (say Table_B) and probes the hash table to find the joined rows
A minimal sketch follows.
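A minimal Python sketch, not from the slides, of a hash join: build an in-memory hash table on the smaller relation's join key, then probe it while scanning the larger relation. The table contents are hypothetical.

```python
# Minimal sketch: hash join. Build an in-memory hash table on the
# smaller input's join key, then probe it while scanning the larger one.

table_a = [(1, "a1"), (2, "a2")]                        # smaller: build side
table_b = [(2, "b2"), (3, "b3"), (1, "b1"), (2, "bx")]  # larger: probe side

def hash_join(build, probe):
    hash_table = {}
    for key, payload in build:          # build phase over the smaller table
        hash_table.setdefault(key, []).append(payload)
    out = []
    for key, payload in probe:          # probe phase over the larger table
        for match in hash_table.get(key, []):
            out.append((key, match, payload))
    return out

print(hash_join(table_a, table_b))
# [(2, 'a2', 'b2'), (1, 'a1', 'b1'), (2, 'a2', 'bx')]
```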
Hash Join [12]
Questions?
References
[1] Abdullah, A.: “Data Warehousing Handouts”, Virtual University of Pakistan
[2] Ricardo, C. M.: “Database Systems: Principles, Design and Implementation”, Macmillan Coll Div.
[3] Junk Dimension, http://www.1keydata.com/datawarehousing/junk-dimension.html
[4] Advanced Topics of Dimensional Modeling, https://mis.uhcl.edu/rob/Course/DW/Lectures/Advanced%20Dimensional%20Modeling.pdf
[5] http://en.wikipedia.org/wiki/OLAP_cube
[6] http://www.mssqltips.com/sqlservertutorial/2011/processing-dimensions-and-cube/
[7] http://dev.mysql.com/doc/refman/5.1/en/partitioning-hash.html
[8] Karen Corral, et al. (2006) The impact of alternative diagrams on the accuracy of recall: A comparison of star-schema diagrams and entity-relationship diagrams, Decision Support Systems, 42(1), 450-468.
[9] http://www.ibm.com/developerworks/data/library/techarticle/dm-0909infospherevirtualcubes/