Unit-3 - I MGN 343

Download as pdf or txt
Download as pdf or txt
You are on page 1of 61

DATA

WAREHOUSING
Learning Objectives
• Understand the basic definitions and concepts
of data warehouses
• Understand data warehousing architectures
• Describe the processes used in developing and
managing data warehouses
• Explain data warehousing operations
• Explain the role of data warehouses in decision
support
Learning Objectives
• Explain data integration and the
extraction, transformation, and load (ETL)
processes
• Describe real-time (active) data
warehousing
• Understand data warehouse
administration and security issues
Main Data Warehousing
Topics
• DW definition
• Characteristics of DW
• Data Marts
• ODS, EDW, Metadata
• DW Framework
• DW Architecture & ETL Process
• DW Development
• DW Issues
Data Warehousing
Definitions and Concepts
• Data warehouse
A physical repository where relational data
are specially organized to provide
enterprise-wide, cleansed data in a
standardized format
What is a Data Warehouse?
• “The data warehouse is a collection of
integrated, subject-oriented databases
designed to support DSS functions,
where each unit of data is non-volatile
and relevant to some moment in time”
Data Warehousing
Definitions and Concepts
• Characteristics of data warehousing
– Subject oriented (Sales, products, or customers)
– Integrated
– Time variant (time series)
– Nonvolatile
– Web based
– Relational/multidimensional
– Client/server
– Real-time
– Include metadata
A Historical Perspective to
Data Warehousing
ü Mainframe computers ü Centralized data storage ü Big Data analytics
ü Simple data entry ü Data warehousing was born ü Social media analytics
ü Routine reporting ü Inmon, Building the Data Warehouse ü Text and Web Analytics
ü Primitive database structures ü Kimball, The Data Warehouse Toolkit ü Hadoop, MapReduce, NoSQL
ü Teradata incorporated ü EDW architecture design ü In-memory, in-database

1970s 1980s 1990s 2000s 2010s

ü Mini/personal computers (PCs) ü Exponentially growing data Web data


ü Business applications for PCs ü Consolidation of DW/BI industry
ü Distributer DBMS ü Data warehouse appliances emerged
ü Relational DBMS ü Business intelligence popularized
ü Teradata ships commercial DBs ü Data mining and predictive modeling
ü Business Data Warehouse coined ü Open source software
ü SaaS, PaaS, Cloud Computing
Data Warehousing
Definitions and Concepts
• Data mart
A departmental data warehouse that stores
only relevant data
• Dependent data mart
A subset that is created directly from a data
warehouse
• Independent data mart
A small data warehouse designed for a
strategic business unit or a department
Data Warehousing
Definitions and Concepts
• Operational data stores (ODS)
A type of database often used as an interim area
for a data warehouse, especially for customer
information files (like short term memory, stores
only recent information, while DE, is like long term
memory).
• Oper marts
An operational data mart. An oper mart is a small-
scale data mart typically used by a single
department or operational area in an organization
Data Warehousing
Definitions and Concepts
• Enterprise data warehouse (EDW)
A technology that provides a vehicle for
pushing data from source systems into a
data warehouse
• Metadata
Data about data. In a data warehouse,
metadata describe the contents of a data
warehouse and the manner of its use
Data Warehousing
Process Overview
No data marts option
Data Applications
Sources (Visualization)
Access
Routine
ERP Business
ETL
Reporting
Process Data mart
(Marketing)
Select

/ Middleware
Legacy Metadata Data/text
Extract mining
Data mart
(Engineering)
Transform Enterprise
POS Data warehouse
OLAP,
Integrate

API
Data mart Dashboard,
(Finance) Web
Other Load
OLTP/wEB
Replication Data mart
(...) Custom built
External
applications
data
Data warehouse is

• The actual discovery phase of a knowledge discovery process

• The stage of selecting the right data for a KD process

• A subject-oriented integrated time variant non-volatile collection of data in support of


management

• None of these
Decision support systems
(DSS) is
• A.
• A family of relational database management systems marketed by IBM

• B.
• Interactive systems that enable decision makers to use databases and
models on a computer in order to solve ill-structured problems

• C.
• It consists of nodes and branches starting from a single root node. Each
node represents a test, or decision

• D.
• None of these
Which of these is an advantage of the
database approach to data storage?

• Data independence
• Data consistency
• Data quality
• All of the above.
___________ is a data warehouse
that is limited in scope.

• A data strip
• A database
• A data box
• A data mart
A data warehouse is which of
the following?

• A. Can be updated by end users.


• B. Contains numerous naming
conventions and formats.
• C. Organized around important subject
areas.
• D. Contains only current data.
The extract process is
which of the following?
• A. reading data from one or more
database
• B. Capturing a subset of the data
contained in various operational systems
• C. Capturing few of the data contained in
various decision support systems
• D. Capturing a subset of the data
contained in various decision support
systems
Data transformation includes
which of the following?

A. A process to change data from a detailed


level to a summary level
B. A process to change data from a
summary level to a detailed level
C. Joining data from one source into
various sources of data
D. Separating data from one source into
various sources of data
What is a data mart?
• It is the collection of data in data
warehouse
• It is usually smaller and focuses on a
particular subject or department
• It is the process of extracting patterns
from large data sets
• It is the process of extracting data
Data Warehousing
Process Overview
• The major components of a data
warehousing process
– Data sources
– Data extraction
– Data loading
– Comprehensive database
– Metadata
– Middleware tools
Data Integration and the Extraction,
Transformation, and Load Process
• ETL = Extract Transform Load

Packaged Transient
application data source

Data
warehouse

Legacy
Extract Transform Cleanse Load
system

Data mart
Other internal
applications
ETL (Extract, Transform,
Load)
• Issues affecting the purchase of an ETL tool
– Data transformation tools are expensive
– Data transformation tools may have a long learning
curve
• Important criteria in selecting an ETL tool
– Ability to read from and write to an unlimited number
of data sources/architectures
– Automatic capturing and delivery of metadata
– A history of conforming to open standards
– An easy-to-use interface for the developer and the
functional user
Data Warehousing Architectures

• Three parts of the data warehouse


– Data acquisition (back-end) software that
extracts data from legacy systems and external
sources, consolidates and summarizes them,
and loads them into the data warehouse
– The data warehouse that contains the data and
associated software
– Client (front-end) software that allows users to
access and analyze data from the warehouse
Data Warehousing
Process Overview
Data Warehousing
Process Overview
Data Warehousing
Process Overview

Web pages
Application
Server

Client Web
(Web browser) Internet/ Server
Intranet/
Extranet
Data
warehouse
Data Warehousing Architectures

• Issues to consider when deciding which


architecture to use:
– Which database management system (DBMS)
should be used?
– Will parallel processing and/or partitioning be
used?
– Will data migration tools be used to load the
data warehouse?
– What tools will be used to support data retrieval
and analysis?
Data Warehousing
Process Overview
Data Warehousing
Process Overview
Data Warehousing
Architectures
Ten factors that potentially affect the architecture selection
decision:
1. Information 5. Constraints on resources
interdependence 6. Strategic view of the data
between organizational warehouse prior to
units implementation
2. Upper management’s 7. Compatibility with existing
information needs systems
3. Urgency of need for a 8. Perceived ability of the in-
data warehouse house IT staff
4. Nature of end-user tasks 9. Technical issues
10. Social/political factors
Data Warehouse
Development
Data warehouse development approaches
– Inmon Model: EDW approach (top-down)
– Kimball Model: Data mart approach
(bottom-up)
– Which model is best?
• Table 3.3 provides a comparative analysis
between EDW and Data Mart approach
• One alternative is the hosted warehouse
Additional DW Considerations
Hosted Data Warehouses
• Benefits:
– Requires minimal investment in infrastructure
– Frees up capacity on in-house systems
– Frees up cash flow
– Makes powerful solutions affordable
– Enables solutions that provide for growth
– Offers better quality equipment and software
– Provides faster connections
Additional DW Considerations
Hosted Data Warehouses
• Costs:
– If the company already have underused
internal infrastructure
– if the see outsourcing as a loss of control
– It might be unwise to outsource sensitive
applications for reasons of security and privacy
Representation of Data in
DW
• Dimensional Modeling
– A retrieval-based system that supports high-volume
query access
• Star schema
– The most commonly used and the simplest style of
dimensional modeling
– Contain a fact table surrounded by and connected to
several dimension tables
• Snowflakes schema
– An extension of star schema where the diagram
resembles a snowflake in shape
Data Warehouse Development
Multidimensionality
The ability to organize, present, and analyze data
by several dimensions, such as sales by region, by
product, by salesperson, and by time (four
dimensions)
• Multidimensional presentation
– Dimensions: products, salespeople, market
segments, business units, geographical locations,
distribution channels, country, or industry
– Measures: money, sales volume, head count,
inventory profit, actual versus forecast
– Time: daily, weekly, monthly, quarterly, or yearly
Star versus Snowflake
Schema
Star Schema Snowflake Schema
Dimension Dimension Dimension Dimension
TIME PRODUCT MONTH BRAND
Quarter Brand M_Name Brand
... ... ... Dimension Dimension ...
DATE PRODUCT
Date LineItem
Fact Table
SALES Dimension ... ... Dimension
QUARTER CATEGORY
UnitsSold
Q_Name Category
... Fact Table
... SALES ...
UnitsSold
Dimension Dimension
PEOPLE GEOGRAPHY ...
Division Country
... ... Dimension Dimension
PEOPLE STORE
Division LocID
... ... Dimension
LOCATION
State
...
Analysis of Data in DW
• OLTP vs. OLAP…
• OLTP (online transaction processing)
– Capturing and storing data from ERP, CRM, POS, …
– The main focus is on efficiency of routine tasks

• OLAP (Online analytical processing)


– Converting data into information for decision support
– Data cubes, drill-down / rollup, slice & dice, …
– Requesting ad hoc reports
– Conducting statistical and other analyses
– Developing multimedia-based applications
– …more in the book
OLAP vs. OLTP
OLAP Operations
• Slice - a subset of a multidimensional array
• Dice - a slice on more than two dimensions
• Drill Down/Up - navigating among levels of data
ranging from the most summarized (up) to the
most detailed (down)
• Roll Up - computing all of the data relationships
for one or more dimensions
• Pivot - used to change the dimensional
orientation of a report or an ad hoc query-page
display
OLAP
A 3-dimensional
OLAP cube with Sales volumes of
slicing a specific Product
operations on variable Time
and Region

Slicing Operations on a
Simple Tree-Dimensional e
Data Cube m
Ti

Product

Geography
Cells are filled
with numbers
Sales volumes of
representing a specific Region
sales volumes on variable Time
and Products

Sales volumes of
a specific Time on
variable Region
and Products
Variations of OLAP
• Multidimensional OLAP (MOLAP)
OLAP implemented via a specialized
multidimensional database (or data store) that
summarizes transactions into
multidimensional views ahead of time
• Relational OLAP (ROLAP)
The implementation of an OLAP database on
top of an existing relational database
• Database OLAP and Web OLAP (DOLAP and
WOLAP)...
DW Implementation Issues
• Identification of data sources and governance
• Data quality planning, data model design
• ETL tool selection
• Establishment of service-level agreements
• Data transport, data conversion
• Reconciliation process
• End-user support
• Political issues
• … more in the book
Successful DW
Implementation
Things to Avoid
• Starting with the wrong sponsorship chain
• Setting expectations that you cannot meet
• Engaging in politically naive behavior
• Loading the data warehouse with information
just because it is available
• Believing that data warehousing database
design is the same as transactional database
design
• Choosing a data warehouse manager who is
technology oriented rather than user oriented
• … more in the book
Failure Factors in DW
Projects
• Lack of executive sponsorship
• Unclear business objectives
• Cultural issues being ignored
– Change management
• Unrealistic expectations
• Inappropriate architecture
• Low data quality / missing information
• Loading data just because it is available
Massive DW and
Scalability
• Scalability
– The main issues pertaining to scalability:
• The amount of data in the warehouse
• How quickly the warehouse is expected to grow
• The number of concurrent users
• The complexity of user queries
– Good scalability means that queries and
other data-access functions will grow linearly
with the size of the warehouse
Real-Time/Active DW/BI
• Enabling real-time data updates for real-
time analysis and real-time decision
making is growing rapidly
– Push vs. Pull (of data)
• Concerns about real-time BI
– Not all data should be updated continuously
– Mismatch of reports generated minutes apart
– May be cost prohibitive
– May also be infeasible
Enterprise Decision Evolution
and Data Warehousing
Real-Time/Active DW at
Teradata
Traditional versus Active
DW
DW Administration and
Security
• Data warehouse administrator (DWA)
– DWA should…
• have the knowledge of high-performance software,
hardware and networking technologies
• possess solid business knowledge and insight
• be familiar with the decision-making processes so as to
suitably design/maintain the data warehouse structure
• possess excellent communications skills
• Security and privacy is a pressing issue in DW
– Safeguarding the most valuable assets
– Government regulations (HIPAA, etc.)
– Must be explicitly planned and executed
The Future of DW
• Sourcing…
– Web, social media, and Big Data
– Open source software
– SaaS (software as a service)
– Cloud computing
• Infrastructure…
– Columnar
– Real-time DW
– Data warehouse appliances
– Data management practices/technologies
– In-database & In-memory processing New DBMS
– Advanced analytics
– …
End of the Chapter

• Questions, comments

You might also like