The document discusses data extraction, transformation, and loading (ETL) processes in data warehousing. It describes how ETL involves acquiring data from various sources, transforming it to fit the data warehouse needs, and loading it into fact and dimension tables. The document notes that ETL can be time-consuming due to the diversity of data sources and complex transformations required to clean, standardize, and consolidate data from different operational systems into a single format suitable for analysis. It also outlines some of the key steps and considerations in ETL design and implementation, such as identifying appropriate source systems, defining extraction frequencies, and techniques for capturing historical data.
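A minimal sketch of the ETL flow described in the summary — extracting records from disparate sources, reconciling their representations against a product master list, and loading fact and dimension tables. The source systems, field names, and reconciliation rule here are hypothetical, chosen only to illustrate the three steps in sequence:

```python
# Minimal ETL sketch with two hypothetical in-memory source systems that
# represent the same sales data in different formats.

# -- Extract: pull raw records from each source
source_a = [{"prod": "P-001", "store": "S1", "date": "2024-01-15", "units": 3, "cost": 30.0}]
source_b = [{"product_code": "p001", "shop": "S2", "day": "2024-01-16", "qty": 5, "amount": 55.0}]

# Hypothetical product master list used to standardize product identifiers
product_master = {"P-001": "Widget"}

def normalize_a(r):
    # Source A already uses the master code format
    return {"product": r["prod"], "store": r["store"], "date": r["date"],
            "units": r["units"], "cost": r["cost"]}

def normalize_b(r):
    # Reconcile source B's cryptic code ("p001") to the master format ("P-001")
    code = r["product_code"].upper().replace("P", "P-", 1)
    return {"product": code, "store": r["shop"], "date": r["day"],
            "units": r["qty"], "cost": r["amount"]}

# -- Transform: bring both sources into one standard row format
rows = [normalize_a(r) for r in source_a] + [normalize_b(r) for r in source_b]

# -- Load: populate the dimension table (surrogate keys by position)
#    and the fact table that references it
product_dim = {code: {"key": i, "name": name}
               for i, (code, name) in enumerate(product_master.items())}
fact_sales = [{"product_key": product_dim[r["product"]]["key"],
               "store": r["store"], "date": r["date"],
               "units": r["units"], "cost": r["cost"]} for r in rows]
```

The point of the sketch is the ordering: reconciliation and transformation must happen before the load, because the fact table can only reference products that resolve to a single, standardized dimension key.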
Extraction, transformation and loading
• What differentiates the data in operational systems from the data in a data warehouse?
• The set of functions that fall under extraction, transformation and loading
• Extraction, transformation and loading (ETL) reshape the data
• In which area(s) of the DW environment do the ETL functions fall?
• ETL encompasses data acquisition and data storage
• ETL covers all back-end processes
• Functions and procedures for extraction, transformation, and loading
• Which data has to be extracted?
• User requirements drive the data extraction process
• Requirement-based selection process

Extraction, transformation, loading
• The extent and complexity of back-end processes differ from one data warehouse to another
• What decides the complexity?
• The disparateness of operational systems: their platforms, network protocols, operating systems, etc.
• The quality of the data
• The feasibility of loading methods

ETL: important and challenging
• Turning data into strategic information:
• Extract the data
• Do not just dump it; transform it into a format suitable for conversion into information
• Move the data to the repository
• Only then can we provide strategic information to the users

ETL: important and challenging
• Suppose users want to analyse product sales by store by month
• Assume the sales data is gathered by several applications in the organization, and a product master list exists
• Each sales transaction happens at a store
• Sales units and sales cost go into the fact table; product, store, and time form the dimensions
• Extract the data, reconcile the representations, transform all product sales, load the data into the fact and dimension tables, and provide the information to the users
• All three functions must be performed in sequence

ETL: important and challenging
• What makes ETL challenging?
• Disparities among the source systems
• Reasons for difficulties in ETL functions:
• Source systems are diverse and disparate
• Multiple platforms and different operating systems
• Older legacy applications built on obsolete database technologies
• History is important for a data warehouse, but changes in values are not preserved in operational systems
• Lack of consistency (e.g. salary data across payroll systems)
• Lack of means to resolve mismatches
• Data representations are cryptic, ambiguous, and not meaningful to data warehouse users
• Data quality is dubious

ETL: time-consuming and arduous
• 50 to 70% of the total effort goes into ETL functions
• Why time-consuming and arduous?
• Big amounts of data must be extracted; complex source systems make this task difficult
• Metadata must contain information on every database and data structure (very detailed information, such as database size and volatility)
• A time window must be found for data extraction
• Mechanisms are needed for capturing changes
• A large number of transformation methods (reformat internal data, re-sequence data, apply various forms of conversion techniques, handle missing values, build aggregates, etc.) must be applied to big quantities of data
• The initial load populates millions of rows; creating and managing load images for such large numbers is not an easy task (the initial load can take two or more weeks to complete)
• Incremental changes: extraction methods for individual sources and time windows for incremental loads

ETL: requirements and steps
• Functional steps for initial and incremental data loads:
• Triggering for incremental changes
• Filtering for refreshes and incremental loads
• Extraction
• Transformation
• Integration
• Cleansing
• Loading/applying the data

ETL: requirements and steps
• The activities and tasks that compose the ETL process are:
• Planning aggregates
• Establishing transformation and cleansing rules
• Establishing data extraction rules
• Mapping source data elements to target data elements
• Integrating internal and external data sources
• Deciding the target data to be held in the warehouse
• Combining several source data structures into a single row
• Splitting a single data structure into multiple data structures
• Reading dictionaries and catalogues
• Reading data from a variety of files
• Populating aggregate or summary fact tables
• Transforming data, changing cryptic values, etc.

ETL: key factors
• The primary reason for complexity is the diversity of data sources
• Begin with a complete inventory of the source systems
• Mass refreshes (initial or periodic) take too long (find a proper time window)
• Incremental refreshes (choose the best method for each source; extracts and loads have less impact)

Data extraction
• Two factors differentiate data extraction for a data warehouse from that for operational systems
• Third-party tools can be used (expensive, but they record metadata); in-house tools increase the cost of maintenance and are hard to maintain
• Data extraction issues:
• Identify the source applications and source structures
• Define the extraction process (manual or tool-based)
• Define the data extraction frequency (daily, weekly, monthly, etc.)
• Time window for data sources
• Job sequencing
• Exception handling

Source identification
• Identify all the proper data sources
• Examine and verify that the identified sources will add the necessary value
• Consider a data mart that has to provide strategic information on the fulfilment of orders
• Historical data about fulfilled and pending orders, channels, and order statuses
• Fact table: total order amount, discounts, commissions, expected delivery time, actual delivery time, dates associated with the different phases
• Dimensions: product, order disposition, delivery channel, and customer
• Determine whether you have source systems that can provide the data
• Establish the correct source data elements for each target data element
• Verify that the identified sources are the right ones

Data extraction techniques
• Source data is temporal or time-dependent (current values, subject to constant change)
• How do we capture the history?
• Data in operational systems is held in two ways: current values and periodic status
• Current values:
• Most attributes fall under this category
• Transient, current values
• A value remains constant until a transaction changes it
• Periodic status:
• The value is preserved as a status every time it changes (effective dates are stored for reference), as events
• Examples: insurance policies, claims
• Extraction of history is easier (it is preserved in the operational systems)
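The periodic-status technique described above — preserving each value change as a new row with its effective date rather than overwriting the current value — can be sketched as follows. The policy table, field names, and statuses are illustrative assumptions, not part of any particular system:

```python
from datetime import date

# History table for a hypothetical insurance-policy status attribute.
# Each change is preserved as a new row with its effective date
# (periodic status), so history survives in the operational system.
policy_status_history = []

def record_status(policy_id, status, effective):
    """Append a new status row instead of updating in place."""
    policy_status_history.append(
        {"policy_id": policy_id, "status": status, "effective_date": effective})

def status_as_of(policy_id, as_of):
    """Return the status in force on a given date, or None if none yet."""
    rows = [r for r in policy_status_history
            if r["policy_id"] == policy_id and r["effective_date"] <= as_of]
    return max(rows, key=lambda r: r["effective_date"])["status"] if rows else None

record_status("POL-1", "active", date(2023, 1, 1))
record_status("POL-1", "lapsed", date(2023, 6, 1))
record_status("POL-1", "reinstated", date(2023, 9, 1))
```

Because every change is kept with its effective date, the ETL process can reconstruct the state of a policy at any point in time — which is exactly why extraction of history is easier for periodic-status data than for current-value attributes, where each change destroys the previous value.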