Materi Data Warehouse

Download as pdf or txt
Download as pdf or txt
You are on page 1of 51

DATA

WAREHOUSING
DATA WAREHOUSING
• Data for Decision Support
• OLAP & Data Warehousing
• Data Flows in the Data Warehouse
• Data in the Data Warehouse
DATA FOR DECISION
SUPPORT

Traditional Data Management


DATA FOR DECISION
SUPPORT
▪ Data Types:
- Data
- Information
- Knowledge

▪ Data Origin
- Internal
- External
- Personal
DATA FOR DECISION
SUPPORT
▪ Databases
- Relational
- Hierarchical & Network
- Object-Oriented
- Multimedia & Document Systems

▪ Data Problems
- Incorrect data
- Data not timely
- Data not measured or indexed properly
- Needed data do not exist
OLAP & DATA
WAREHOUSING

Databases for Decision Support


In the late 1980s, an interesting intersection of ideas &
technologies...

▪ DSS: data modeling & analysis tools


▪ Databases: tables & query languages
▪ EIS: consolidation, drill-down & data cubes
▪ User Interfaces: focus on end user

and a need for more extensive access to


information for decision support purposes!
DATA ANALYSIS
LIMITATIONS
▪ Organizations have been collecting data for many years.
However, much of that data is hardly used in
sophisticated data analysis for decision support
purposes.

▪ TPS databases are not structured to support analysis.


▪ SQL has difficulty answering common business questions.
▪ Complex queries may take days to answer.
▪ Some questions may take several queries to answer.

▪ Sales ratio of promoted vs. unpromoted products?


▪ Quantity shipped to discount stores vs. department stores?
DATA ANALYSIS
LIMITATIONS
▪ Consider a typical business analysis problem:
▪ Find the share of total sales represented by each product in different
markets, categories, and periods, compared with the same period a
year ago.
▪ To do so, you would calculate the percentage each number
is of the total of its column, a simple and common concept.
▪ However, in a classic relational database these calculations
and display would require definition of a separate view,
requiring complex SQL commands.
THE PROBLEM: OLTP
▪ OLTP = On-Line Transaction Processing
▪ OLTP is the basic architecture underlying
relational database-based transaction processing
systems.
▪ Data is broken up into relational tables, with
relationships described using foreign keys.
▪ OLTP is optimized for the entry and retrieval of
individual pieces of fact.
▪ OLTP was not designed to support large-scale
data analysis.
THE SOLUTION: OLAP
▪ OLAP = On-Line Analytical Processing

Analytical Processing
▪ A database architecture
not designed for optimal
data processing but for
optimal data analysis!
▪ The differences in data
processing and data
analysis suggest that
this would be a
separate database
system.
Source: OlapReport.com
THE DATA WAREHOUSE
▪ Most applications of OLAP are not known as such, but
rather as Data Warehouses.
▪ A data archive in which the data has been structured and formatted
for analysis purposes.
▪ A database system separate from the transaction processing system
so as not to interfere with the TPS performance.
▪ The current foundation for Decision Support.
THE DATA WAREHOUSE
▪ A managed database in which the data is:
▪ Subject-oriented:
▪ Designed for decision-support rather than for TPS.
▪ Integrated:
▪ Contains data from different legacy & DB-systems.
▪ Nonvolatile:
▪ New data is integrated with existing data, rather than replacing it.
THE DATA WAREHOUSE
▪ Time-variant:
▪ Contains informational data (historic data along a time-axis) rather
than operational data (momentary snapshot).

▪ Aggregated:
▪ Data exists at different levels of granularity

to support management needs

William Inmon, 1990, 1992


THE DATA WAREHOUSE
▪ "A data warehouse is a copy of transaction
data specifically structured for query and
analysis“
(Kimball, 1996)
THE DATA WAREHOUSE
▪ Additional Data Warehouse characteristics:
▪ Web based
▪ Relational/multidimensional
▪ Client/server
▪ Real-time
▪ Include metadata
WHY DO WE NEED A DATA
WAREHOUSE?
▪ Business Intelligence
▪ One version of the truth – consistency
▪ Cleaner data – scrubbed
▪ Security
▪ Accessibility
▪ Historical data
▪ Many applications
▪ mining, strategic, tactical, performance
DATA WAREHOUSING &
LEGACY SYSTEMS
▪ The Data Warehouse:
▪ Allows existing legacy systems to continue in operation.
▪ Consolidates inconsistent data from the
various legacy systems into a coherent set.
▪ Reaps benefits from vital information about
current operations.
▪ Provides fertile ground to architect new
operational systems.
DATA WAREHOUSE USAGE
▪ Data mining software - based on OLAP – is
multidimensional query and analysis software which can
actively hunt for patterns and add value to the data it is
searching.
THE DATA MART
▪ A Data Mart is a subset of the Data Warehouse
▪ Most users only need a subset of DW data
▪ Makes it easier for users to navigate
▪ Replicated Data Marts are fed DW data only
▪ Independent Data Marts maintain their own
databases
DATA WAREHOUSE TYPES
GARDNER (1998)

Enterprise Data Ware House Independent Data Mart Dependant Data Mart

21
OPERATIONAL UNITS
▪ Operational data stores (ODS):
▪ A type of database often used as an interim area for a data
warehouse. It includes short-term data from multiple sources in a
near real-time manner to support short-term decision making

▪ Open marts:
▪ An operational data mart, which allows for multi-dimensional data
analysis of ODS data.
DATA FLOWS IN THE DATA
WAREHOUSE

Putting the Data Warehouse together


DATA FLOWS IN A DATA
WAREHOUSE
▪ Inflow - bring data into data warehouse.
▪ Upflow - add value to data through summarizing,
packaging, and distributing.
▪ Downflow - archive data with low business value.
▪ Outflow - make data available to users through
access and delivery.
▪ Metaflow - describe data and data flows to
constantly refine and improve the data
warehouse.
DATA FLOWS IN A DATA
WAREHOUSE
▪ Inflow
▪ Legacy data is cleaned up and fed into the data warehouse.
INFLOW ISN’T EASY...
▪ Because of incorrect, contaminated, or just plain missing
data, the migration phase to a data warehouse can dominate
the entire project!
▪ Integrating data about a single entity from
multiple sources is often difficult.
▪ Automated tools are available to help out:
▪ Data Migration tools for data extraction
▪ Data Scrubbing tools for cleaning up data.
▪ Data Auditing tools for patterns and structures contained within the
data.
INFLOW ISN’T EASY...
INFLOW BY ANY OTHER
NAME…
▪ Extraction, transformation, and load (ETL)
▪ A data warehousing process that consists of:
▪ Extraction - reading data from a database
▪ Transformation - converting the extracted data from its previous form into
the form in which it needs to be so that it can be placed into a data warehouse
or simply another database
▪ Load - putting the data into the data
warehouse
INFLOW BY ANY OTHER
NAME…

The ETL Process


INFLOW BY ANY OTHER
NAME…
▪ Enterprise application integration (EAI)
▪ A technology that provides a vehicle for
pushing data from source systems into a data
warehouse

▪ Enterprise information integration (EII)


▪ A tool space that provides real-time data
integration from various sources, such as
relational databases, Web services, and
multidimensional databases
DATA FLOWS IN A DATA
WAREHOUSE
▪ Upflow
▪ Highly detailed data
is aggregated,
summarized, and
otherwise processed
in the warehouse to
make it easier for
users to get quick
responses to their
queries
DATA FLOWS IN A DATA
WAREHOUSE
▪ Outflow
▪ Users run
queries to get
the information
they need. In a
correctly
structured data
warehouse,
most queries will
go against
highly
summarized
data.
DATA FLOWS IN A DATA
WAREHOUSE
▪ Downflow
▪ With new data
constantly being
added, move
marginally useful old
data to off to an
archive where it can
be accessed if
needed.
DATA FLOWS IN A DATA
WAREHOUSE
▪ The reason for downflow: data loses business
value over time!
DATA FLOWS IN A DATA
WAREHOUSE
DATA FLOWS IN A DATA
WAREHOUSE
▪ Metaflow
▪ Users & IS need to know about the data in the warehouse: where it
is located, what it represents, where it came from, etc. This
information is metadata - data about the data.
DATA WAREHOUSE
ARCHITECTURES
DATA WAREHOUSE
ARCHITECTURES
Common Architecture of a Three-Tier Data Warehouse
DATA WAREHOUSE
ARCHITECTURES
INMON’S DATA
WAREHOUSING TARGET
ARCHITECTURE
REAL-TIME DATA
WAREHOUSE
▪ Real-time or Active Data Warehouses load operational
data as soon as it becomes available
▪ Immediate data analysis
▪ View of organization’s changes over time
▪ Support for real-time decision making
DATA IN THE DATA
WAREHOUSE

The Star Schema


THE STAR SCHEMA
▪ A currently popular data architecture for data warehouses
is the Star Schema.
▪ Simple structure with relatively few tables.
▪ Data is not fully normalized as in OLTP.
▪ Star design is easily understood by analysts and end users,
especially those without much database familiarity.
▪ The star design eases data analysis: more analysis possibilities and
simpler queries.
THE STAR SCHEMA TABLE
TYPES
▪ Fact Tables (Major Tables): large tables which contain the
quantitative or factual business data.
▪ Example: a fact table might contain sales revenue for company
products for each customer in each geographical market over a
period of time.
▪ Dimension Tables (Minor Tables): smaller tables which
contain descriptive data that reflect the dimensions of a
business.
▪ Example: dimension tables which define the
customers, products, geographic markets, and time
periods used in the fact table.
THE STAR SCHEMA

▪ Sales database with simple star design:


NINE DECISIONS IN DATA
DESIGN
▪ Choose the subject matter
▪ Decide what a fact table represents
▪ Identify and conform the dimensions
▪ Choose the facts
▪ Store pre-calculations in the fact table
▪ Round out the dimension tables
▪ Choose the duration of the database
▪ The need to track slowly changing dimension
▪ Decide the query priorities and the query modes
ESTIMATING DATA
WAREHOUSE SIZE
ESTIMATING DATA
WAREHOUSE SIZE
▪ Time dimension: 2 years X 365 days = 730 days
▪ Store dimension: 300 stores, reporting sales each day
▪ Product dimension: 30,000 products in each store, of
which 3,000 sell each day in a given store
▪ Promotion dimension: a sold item appears in only one
promotion condition in a store on a day
▪ Number of base fact records = 730 X 300 X 3000 X 1 =
657 million records
▪ Number of key fields = 4; Number of fact fields = 4; Total fields
=8
▪ Base fact table size = 657 million X 8 fields X 4 bytes =
21 GB
DATA WAREHOUSE PROS &
CONS
▪ Direct benefits of a data warehouse
▪ Allows end users to perform extensive analysis
▪ Allows a consolidated view of corporate data
▪ Better and more timely information A
▪ Enhanced system performance
▪ Simplification of data access
DATA WAREHOUSE PROS &
CONS
▪ Indirect benefits result from end users using these
direct benefits
▪ Enhance business knowledge
▪ Present competitive advantage
▪ Enhance customer service and satisfaction
▪ Facilitate decision making
▪ Help in reforming business processes
DATA WAREHOUSE PROS &
CONS
▪ Data warehousing implementation issues
▪ Implementing a data warehouse is generally a massive effort
that must be planned and executed according to established
methods
▪ Data warehouses are expensive and time-consuming to
develop
▪ Benefits might not be immediately obvious
▪ Management support and championing is vital to success

You might also like