Data Warehousing Dr. L. Rajya Lakshmi

Download as pdf or txt
Download as pdf or txt
You are on page 1of 16

Data Warehousing

Dr. L. Rajya Lakshmi


Multiple Hierarchies
Extraction, transformation and loading
• What is the factor that differentiates between the data in operational
systems and the data in a data warehouse?
• The set of functions that fall under extraction, transformation and loading
• Extraction, transformation and loading (ETL) reshape the data
• In which area(s) of DW environment do these ETL functions fall?
• ETL encompasses data acquisition and data storage
• ETL has all backend processes
• Functions and procedures for extraction, transformation, and loading
• Which data has to be extracted?
• The user requirements drive the data extraction process
• Requirement based selection process
Extraction transformation loading
• The extent and complexity of backend processes differ from one data
warehouse to another
• What decides the complexity?
• The disparateness of operational systems, their platforms, network protocols,
operating systems, etc.
• The quality of the data
• Feasibility of loading methods
ETL: important and challenging
• Data to strategic information
• Extract the data
• Not just dump it, transform it into the format suitable for conversion into
information
• Move the data to the repository
• Now, we can provide strategic information to the users
ETL: important and challenging
• Users want to analyse product sales by store by month
• Assume that the sales data is gathered by several applications in the
organization
• There exists a product master list
• Each sales transaction happen at a store, then?
• Sales units and sales cost in fact table;
• The product dimension, the store dimension, and the time dimension
• Extract the data, reconcile representations, and transform all product sales,
load the data into fact and dimension tables, and provide the information to
users
• All three functions should be performed in the sequence
ETL: important and challenging
• What makes ETL challenging?
• Disparities among the source systems
• Reasons for difficulties in ETL functions:
• Source systems are diverse and disparate
• Multiple platforms and different operating systems
• Older legacy applications on obsolete database technologies
• History is important for data warehouse, but changes in values are not
preserved in operational systems
• Lack of consistency (salary data in payroll systems)
• Lack of means to resolve mismatches
• Data representations are cryptic, ambiguous and not meaningful for data
warehouse users
• Data quality is dubious
ETL: time consuming and arduous
• 50 to 70% of effort on ETL functions
• Why time consuming and arduous?
• Extract big amounts of data; complex source systems makes this task difficult
• Metadata must contain information on every database and data structures
(very detailed information like database size and volatility)
• Time window for data extraction
• Mechanisms for capturing changes
• Large number of transformation methods (reformat internal data, re-
sequence data, apply various forms of conversion techniques, missing value
handling, aggregates, etc.) on big quantities of data
• Initial load populates millions of rows; creating and managing load images for
such large numbers is not an easy task (two or more weeks to complete initial
loading)
• Incremental changes: Extraction methods for individual sources and time
windows for incremental loads
ETL: requirements and steps
• Functional steps for initial and incremental data loads:
• Triggering for incremental changes
• Filtering for refreshes and incremental loads
• Extraction
• Transformation
• Integration
• Cleansing
• Loading/applying data
ETL: requirements and steps
ETL: requirements and steps
• The activities and tasks that compose the ETL process are:
• Aggregates planning
• Transformation and cleansing rules
• Data extraction rules establishment
• Source to target data element mapping
• Integrate internal and external data sources
• Decide the target data to be in the warehouse
• Combine several source data structures into a single row
• Split a single data structure into multiple data structures
• Read dictionaries and catalogues
• Read data from a variety of files
• Populating aggregates or summary fact tables with aggregates
• Transform data, change cryptic values, etc.
ETL: key factors
• Primary reason for complexity is diversity of data sources
• Begin with a complete inventory of source systems
• Mass refreshes (initial or periodic) take too long (find a proper time)
• Incremental refreshes (the best method for each source, impact less
extracts and loads)
Data extraction
• Two factors differentiate data extraction of a data warehouse and
operational systems
• Can use third party tools (expensive and records metadata), in-house
tools (increases cost of maintenance and hard to maintain)
• Data extraction issues:
• Identify source applications and structures
• Define extraction process (manual or tool based)
• Define data extraction frequency (daily, weekly, monthly, etc.)
• Time window for data sources
• Job sequencing
• Exception handling
Source identification
• Identify all proper data sources
• Examination and verification that the identified sources will add the
necessary value
• Consider a data mart that has to provide strategic information on
fulfilment of orders
• Historical data about fulfilled and pending orders, channels, order statuses
• Fact table: total order amount, discounts, commissions, expected delivery
time, actual delivery time, dates associated with different phases
• Dimensions: product, order disposition, delivery channel, and customers
• Determine whether you have source systems to provide the data
• Establish correct data elements for each target data element
• Verify the identified sources are the right ones
Source identification
Data extraction techniques
• Source data is temporal or time dependent (Current values, subjected
to constant changes)
• How do we capture the history?
• Data in operational systems:
• Two ways (current values and periodic status)
• Current values:
• Most attributes fall under this category
• Transient and current values
• Remains constant until a transaction changes it
• Periodic status
• The value is preserved as status every time its value changes (store reference effective
dates), events
• Ex: policy, claims
• Extraction of history is easier (preserved in the operational systems)

You might also like