How To Design DWH
The data warehouse and the data marts serve different objectives, and their design reflects these differences.

Data Warehouse Design

The data warehouse is optimized for load and data extraction performance, flexibility, and data maintenance. The flexibility objective requires that the data be an unbiased representation so that each business user can extract the portion of the data he or she needs and apply the assumptions germane to his or her analysis. The model is substantially based on the business rules; hence, it is developed by transforming the business data model as needed to support these objectives.

Step 1: Select the data of interest. During the development of a data warehouse, the analyst needs to determine the data that will be needed to support strategic analysis. One question to avoid asking a user is "What data do you need?" The answer to this question is obvious -- all of it. Incorporating unneeded data into the data warehouse increases the development time, adds a burden to the regular load cycle, and wastes storage space by retaining history for data elements that are never used. The first step of the data warehouse design is therefore determining the elements that are needed. Sometimes the answer is not clear-cut, and we must then ask whether there is a reasonable chance that the data element will be needed to support strategic analysis. There is an important reason that this is the first step in the process: it defines our scope. We have many other steps to traverse, and by eliminating the data we don't need, we avoid wasting time incorporating data elements that would be discarded later.

Step 2: Add time to the key. The data warehouse is a time-variant data store. Unlike the operational systems, which often overwrite the historical view of the data with the current view, the data warehouse maintains the history. The time dimension is needed to distinguish multiple instances of the same item. This is the second step in the process because it has the greatest impact on the data model, and the data model is the foundation of our design. Introducing the historical perspective means that attributes (e.g., last name) could have multiple values over time, and we need to retain each of them. To do this, we need to make each instance unique. Further, the business rules change when we opt for the historical perspective. For the operational system, we need to know the department to which a person is assigned, and the person can only be assigned to one department at a time. In the data warehouse, we have history, and we must therefore deal with a person moving from one department to another.
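To make Step 2 concrete, here is a minimal sketch, assuming a hypothetical employee-to-department assignment table in SQLite; the table and column names are illustrative, not taken from the text. Adding an effective date to the key lets the warehouse retain every assignment a person has had:

import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table: the effective date is part of the key,
# so multiple historical assignments of the same employee can coexist.
conn.execute("""
    CREATE TABLE employee_department (
        employee_id    INTEGER NOT NULL,
        effective_date TEXT    NOT NULL,   -- time added to the key
        department     TEXT    NOT NULL,
        last_name      TEXT    NOT NULL,
        PRIMARY KEY (employee_id, effective_date)
    )
""")

# The same employee moves from one department to another; both rows are retained.
conn.executemany(
    "INSERT INTO employee_department VALUES (?, ?, ?, ?)",
    [
        (1001, "2023-01-01", "Marketing", "Smith"),
        (1001, "2024-06-15", "Finance",   "Smith"),
    ],
)

for row in conn.execute(
    "SELECT * FROM employee_department WHERE employee_id = 1001 ORDER BY effective_date"
):
    print(row)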
Step 3: Add derived data. Some business measures are used repeatedly. To ensure consistency, these terms should be standardized, and the formulas for deriving them should be determined. (This is one of the stewardship functions.)

Step 4: Determine the level of granularity. The fourth step entails determining the level of granularity, or level of detail, in the data warehouse. A retailer that is trying to determine which products sell together needs to capture each sales transaction. A retailer that is only concerned about product movement may be satisfied with a daily summary of product sales. These are very different granularity levels, and there is a substantial difference in the cost of developing each of these warehouses. Upon completion of the first four steps, the data warehouse design should meet the business needs. The warehouse will have the needed data and will store it in a way that gives the business users the flexibility to meet their needs.

Step 5: Summarize data. The warehouse is used for strategic analyses, and often these entail data summaries. By summarizing the data, we are in a better position to ensure consistency and reduce the need for calculating the same summaries for delivery to multiple data marts. Data summaries do not necessarily reduce storage costs -- they may actually increase them. The impact on storage depends on whether we still need the detailed data. If all of our analysis is on the monthly sales figure, for example, then we may not need the details. If, however, we want to delve into exceptions, we still need the details to support the analysis. In such an instance, we retain the details and add another table with the summary (see the sketch below).

Step 6: Merge tables. Sometimes related entities can be compressed into one. For example, if we have a customer file and want to relate each customer to a metropolitan statistical area (MSA), instead of relating the customer entity to an MSA entity, we could simply add the MSA as a foreign key (FK) attribute of customer, as shown in Figure 4.
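Returning to Step 5, the following minimal sketch (assuming hypothetical daily_sales detail and monthly_sales summary tables in SQLite) shows the detail being retained while a separate summary table is added alongside it:

import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical detail table: one row per product per day.
conn.execute("""
    CREATE TABLE daily_sales (
        product_id INTEGER NOT NULL,
        sale_date  TEXT    NOT NULL,
        amount     REAL    NOT NULL,
        PRIMARY KEY (product_id, sale_date)
    )
""")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [(1, "2024-01-05", 100.0), (1, "2024-01-20", 50.0), (2, "2024-01-07", 75.0)],
)

# The detail is retained; a separate summary table is added for monthly analysis.
conn.execute("""
    CREATE TABLE monthly_sales AS
    SELECT product_id,
           substr(sale_date, 1, 7) AS sale_month,
           SUM(amount)             AS total_amount
    FROM daily_sales
    GROUP BY product_id, substr(sale_date, 1, 7)
""")

for row in conn.execute("SELECT * FROM monthly_sales ORDER BY product_id"):
    print(row)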
Step 7: Create arrays. Sometimes arrays are appropriate in the data warehouse, and this step creates them. For accounts receivable analysis, for example, we need a few buckets of information (e.g., current debt, debt aged 1-30 days, debt aged 31-60 days, etc.). We update all the buckets at one time, and we use all of them together. In cases such as this, an array may be appropriate within the warehouse (see the sketch below). The criteria for using arrays are that we have a small number of occurrences of the data, that it is always the same number of occurrences, that the data is all available for insertion at the same time, and that the data is used together.

Step 8: Segregate data. The last step is segregating data based on stability and usage. The data warehouse contains history, and updates are made by adding records with data changes. If there is a significant difference in data volatility, it often makes sense to partition the data, with the rapidly changing data in one table and the slowly changing data in another.

These eight steps transform the business data model into a data model for the data warehouse. Additional adjustments may be made to improve performance.
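To illustrate Step 7, here is a minimal sketch, assuming a hypothetical accounts receivable table in SQLite; the bucket columns form a small, fixed-size array that is loaded and used together:

import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical accounts receivable table: the aging buckets are a fixed set
# of columns that are inserted at the same time and analyzed together.
conn.execute("""
    CREATE TABLE accounts_receivable (
        customer_id   INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,   -- time as part of the key
        current_debt  REAL    NOT NULL,
        debt_1_30     REAL    NOT NULL,
        debt_31_60    REAL    NOT NULL,
        debt_61_90    REAL    NOT NULL,
        PRIMARY KEY (customer_id, snapshot_date)
    )
""")

# All buckets are available at the same time and inserted in a single row.
conn.execute(
    "INSERT INTO accounts_receivable VALUES (?, ?, ?, ?, ?, ?)",
    (501, "2024-03-31", 1200.0, 300.0, 150.0, 0.0),
)

print(conn.execute("SELECT * FROM accounts_receivable").fetchone())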
This is the major denormalization step in the process. By combining the three levels of the hierarchy into a single table, each piece of data for the product line is repeated for each product.

Step 3: Add attributes to the dimensions. The dimensions represent the business constraints. Often, users will also want information about the dimensions themselves. For a product, for example, users may want to know the weight, color, and size. These attributes are added to the dimension table in this step.

Step 4: Ensure that the dimensions have good keys. The key of the dimension table usually becomes part of the key of the fact table. To perform this role efficiently, it needs to obey the rules of good keys,15 and it needs to be relatively short. Since we are pulling data from the data warehouse, the first criterion is usually already met. If the key is too long, it may be advisable to replace it with a system-generated key.

Step 5: Normalize dimensions. In the second step, we created denormalized dimension tables. In this step, we examine those tables and consider normalizing them. In general, we should not normalize dimensions merely to save space. The space savings will typically be outweighed by increases in query complexity.

Step 6: Factor refreshment for the dimensions. History in the marts is maintained by including a date-oriented dimension. Data in some of the other dimensions may also change over time, and if we are interested in that history, we may need to create what is known as a slowly changing dimension (see the sketch following Table 1).

Step 7: Prototype and refine. The last step of the process recognizes that, despite our best efforts, new needs will be identified as people start using the data marts. This step allocates time for the business users to exercise the data mart, with the development team incorporating appropriate modifications.

These seven steps create a star schema that can be used by the business community if they are equipped with the appropriate tools. Additional measures may also be undertaken to further improve performance.

Model Comparison

The data models for the data warehouse and data marts have both similarities and differences, as shown in Table 1. Both reflect integrated strategic data, include derived data, and typically include an element of time. They differ in the degree of normalization and in the organizing philosophy, with the data warehouse being organized for stability and usage and the data marts being organized for ease of use and response time.
Table 1 -- Model comparison (partially recoverable): the business model is normalized, reflects an enterprise perspective, and contains no derived data; the data warehouse model is somewhat denormalized; the data mart model is highly denormalized.
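To make Steps 4 and 6 of the data mart design more concrete, the following minimal sketch (using SQLite and a hypothetical product dimension; the table and column names are illustrative) shows a system-generated surrogate key on the dimension and a simple slowly changing dimension in which a changed attribute is handled by expiring the old row and adding a new one:

import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical product dimension with a system-generated surrogate key (Step 4)
# and effective dates to track history (Step 6, a simple slowly changing dimension).
conn.execute("""
    CREATE TABLE dim_product (
        product_key    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        product_id     TEXT NOT NULL,                      -- business key from the warehouse
        product_color  TEXT NOT NULL,
        effective_date TEXT NOT NULL,
        expiry_date    TEXT                                -- NULL = current row
    )
""")

# Fact table rows reference the short surrogate key, not the longer business key.
conn.execute("""
    CREATE TABLE fact_sales (
        product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
        sale_date   TEXT    NOT NULL,
        amount      REAL    NOT NULL
    )
""")

def change_product_color(product_id, new_color, change_date):
    """Expire the current dimension row and add a new one (a type 2 change)."""
    conn.execute(
        "UPDATE dim_product SET expiry_date = ? "
        "WHERE product_id = ? AND expiry_date IS NULL",
        (change_date, product_id),
    )
    conn.execute(
        "INSERT INTO dim_product (product_id, product_color, effective_date) "
        "VALUES (?, ?, ?)",
        (product_id, new_color, change_date),
    )

conn.execute(
    "INSERT INTO dim_product (product_id, product_color, effective_date) "
    "VALUES ('SKU-123', 'blue', '2024-01-01')"
)
change_product_color("SKU-123", "red", "2024-07-01")

for row in conn.execute("SELECT * FROM dim_product ORDER BY product_key"):
    print(row)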
DATA ACQUISITION
Data acquisition is the complex set of processes by which data is moved from the operational systems to the data warehouse. It consists of several discrete processes: capture, cleansing, integration, transformation, and loading (see Figure 5).
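As a structural sketch only, the five processes can be viewed as stages of a pipeline; the function names and stub bodies below are hypothetical placeholders, and each stage corresponds to a process described in the sections that follow:

# A minimal end-to-end sketch of the data acquisition pipeline; the stage
# functions are hypothetical placeholders, not an actual implementation.

def capture(sources):
    # Pull raw records from the chosen source systems.
    return [rec for src in sources for rec in src]

def cleanse(record):
    # Apply validation rules and corrections to a single record.
    return record

def integrate(records):
    # Merge duplicate instances into a single enterprise view.
    return records

def transform(record):
    # Map source coding structures and formats to the warehouse standard.
    return record

def load(records):
    # Insert the prepared records into the data warehouse.
    print(f"loaded {len(records)} records")

def acquire(sources):
    records = capture(sources)
    records = [cleanse(r) for r in records]
    records = integrate(records)
    records = [transform(r) for r in records]
    load(records)

acquire([[{"customer_id": 1}], [{"customer_id": 2}]])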
Data Capture

The first step of this process is capture. During the capture process, we determine which systems will be used for which data, understand those systems, and extract the data from them. Before we can pull data out of a source system, we must choose the system to be used. Sometimes the decision is easy -- there's only one universally accepted major data source. More often, however, we must choose from among several candidates. In making this selection, the following criteria should be considered:

Point of data origin. Traditionally, the best source of data is the system in which it is initially entered. Sales information needs to be accurate in the point-of-sale system, and it could be considered the best source of the data. As the data flows through the rest of the operational environment, changes may take place, and these changes could affect the validity of the data of interest.

Completeness of the data. If we need data that originates in several systems, an easier path may be to pick one source into which that data is collected during normal operational processing. Selecting this source simplifies both the data capture and the data integration, since the data from those several systems is already integrated there.
System reliability. Some systems are considered to be more reliable than others. Systems that were implemented more recently typically reflect more stringent validation rules, and these may provide a better source of data for the data warehouse.

Data currency. If we want customer contact information, we may choose to use the billing system rather than the order entry system. The billing system is more likely to receive corrections, since it is a source of revenue for the company.

Documentation availability. Some systems are better documented than others. To the extent that a system is well documented, our data capture activities (particularly our source system analysis activities) are simplified.

Accessibility. The location and technology of the source system may also affect our choice. Data that is locally available is often easier to obtain than data that is remote and managed by a different computer center. Similarly, data in a current technology is easier to gather than data stored in a technology no longer commonly used in the company.

These are all rational reasons for selecting a particular source system. There is another factor that should also be considered: politics. Selection of the source system may be affected by the faith the users will have in the data, and some users may have preconceived notions concerning the viability of some of the source systems.

Source System Analysis

The process used for understanding the operational systems is source system analysis. Although the starting point is looking at the names of the data elements, the analysis needs to go much deeper. The name of a data element only implies something about what is in the field. The analyst should either locate the definition of the field in the system documentation (if any is available) or create it based on information gleaned from the system users and maintainers. Once the definition is determined, the analyst needs to examine the data in the field to ensure that all of it conforms to that definition. Several conditions often lead to exceptions. For example:
A field that was included in the system may no longer be needed, and when a new field was needed, the programmer reused the existing field without changing the field name (or the documentation).

The original field only applies in some cases (e.g., residential customers), and the programmer used the field to mean something else for the other cases (e.g., commercial customers).

The original field (e.g., work order number) did not apply to a particular group, and the programmer used the field to mean something else (e.g., vehicle number).
Once the definition of the field is known, the analyst needs to examine the quality of the data with respect to its accuracy and completeness. The accuracy examination entails looking at each individual field and examining field dependencies. For example, if one field indicates an insurance claim for pregnancy, the gender field should be "female."
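A minimal sketch of such a dependency check follows; the record layout and the specific rule are illustrative assumptions chosen to match the pregnancy/gender example:

# A minimal sketch of a field-dependency check during source system analysis;
# the record layout and the rule are illustrative assumptions.
claims = [
    {"claim_id": 1, "diagnosis": "pregnancy", "gender": "female"},
    {"claim_id": 2, "diagnosis": "pregnancy", "gender": "male"},    # violates the dependency
    {"claim_id": 3, "diagnosis": "fracture",  "gender": "male"},
]

def check_pregnancy_gender(record):
    """If the claim is for pregnancy, the gender field should be 'female'."""
    return record["diagnosis"] != "pregnancy" or record["gender"] == "female"

violations = [c["claim_id"] for c in claims if not check_pregnancy_gender(c)]
print("claims violating the dependency:", violations)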
Because we can examine the entire data set, we also have the luxury of looking at the demographics of the data. Although 11/11/11 is a valid birth date that would pass system edit checks, if we find that 30% of the people in the data set have that birthday, we may become suspicious about the data's accuracy. Understanding the quality is extremely important. We use it both to set the quality expectations and to determine the cleansing requirements of the data acquisition process. The completeness examination determines whether or not the field has a data value when it is needed. As with data accuracy, this is a reflection of the quality of the data, and we use it to set expectations and determine the data acquisition processing requirements.

Data Extraction

For the initial load of the data warehouse, we need to look at all the data in the appropriate source systems. After the initial load, our processing time and cost are significantly reduced if we can readily identify the data that has changed. We can then restrict our processes to that data. There are six basic methods for capturing data changes.16

1. Use source system time-stamps. Operational systems often have time-stamps in each record to indicate the last time it was updated. When such time-stamps are available, our extract programs can select only the data that has changed since our last extract (as sketched below). If the source system performs physical record deletions, this approach cannot be used, since there won't be a record showing the deletion.

2. Read the database management system (DBMS) log. The DBMS log maintains information about changes to the database. With some of the tools available today, we can read this log to detect the data changes.

3. Modify the operational system. When the operational system does not have a way of marking changes to the data, one approach is to change that system. This option is often difficult to justify, since changes to the operational system may be costly, and each change introduces a risk with respect to that system's reliability. If this option is selected, it should be done as a separate project and not be absorbed into the data warehouse development effort.

4. Compare before and after images of the database. For older batch systems, backup files can be captured and compared using utilities such as Comparex, Superc, or Syncsort. The comparison process itself may be slow. Once the changed records are identified, they can be extracted for further processing.

5. Create snapshots of the operational systems and use them in the load process. This technique rarely applies to the data warehouse, since it entails rebuilding the history.

6. Use database triggers in the operational system. If the operational system employs a modern relational database management system, triggers can be used to update a table with the changes. The major drawback of this technique is that it places an additional burden on each transaction within the operational system.
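Here is a minimal sketch of the first method, assuming a hypothetical source table with a last_updated time-stamp column and a high-water mark saved from the previous extract run:

import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table with a last_updated time-stamp on each record.
conn.execute("""
    CREATE TABLE src_customer (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        last_updated TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO src_customer VALUES (?, ?, ?)",
    [
        (1, "Acme Corp", "2024-05-01 09:00:00"),
        (2, "Baker Ltd", "2024-06-10 14:30:00"),
        (3, "Chase Inc", "2024-06-12 08:15:00"),
    ],
)

# High-water mark recorded at the end of the previous extract run.
last_extract_time = "2024-06-01 00:00:00"

# Select only the records that changed since the last extract.
changed = conn.execute(
    "SELECT customer_id, name, last_updated FROM src_customer "
    "WHERE last_updated > ? ORDER BY last_updated",
    (last_extract_time,),
).fetchall()

print("changed records:", changed)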
Applying a changed data capture technique can improve the efficiency of the data capture process, but it is not always practical. It is, however, something that needs to be researched in designing the data capture logic.

Cleansing

We analyzed the quality of the source systems during source system analysis. Now we need to do something about it. During the cleansing process, we set the quality expectations and then incorporate the steps needed to meet those expectations. In setting the quality expectations, we need to balance the ideal situation (i.e., perfection) with the cost of attaining it. The data warehouse is designed to support strategic analysis, and data perfection is often unnecessary. This is another important role for the data steward. Once the data quality expectations are set, we need to use data cleansing tools or develop algorithms to attain that quality level. One aspect of quality that is not specific to the individual systems deals with data integration; this will be addressed in the next section. We have four fundamental choices in dealing with errors that we encounter (see the sketch below):

1. Reject the record. The error in the source data may be so severe that we need to reject the entire record. For example, we may receive a record that is missing critical data or for which critical data does not pass our validation rules.

2. Accept the error. Sometimes we can detect an error in the source system but determine that, given our quality expectations, it is within our tolerance levels. When this occurs, we may accept the record. Depending on the type of error and our interest in tracking it, we may also issue an alert about the error.

3. Set a default value. We may be receiving data from multiple customer systems, some of which may not have a value for a field such as customer type. If we know that most of the customers in that system are of a particular type, we may be willing to accept the error of misclassifying a few customers for the increased value of inserting a correct customer type for most customers.

4. Correct the error. In the previous example, there may be other data that can be used to identify the customer type. For example, if we know that small businesses rarely spend more than $1 million with us and large businesses almost always spend more than $10 million, we may insert a customer type code based on the business volume. As with the default value, the data may not be perfect, but it provides a more accurate picture of our customer base.

If either of the last two options is selected, we will have a mismatch between the data in the operational sources and the data in the warehouse. Having this difference is not necessarily bad -- what's important is that we recognize that the difference exists. If we correct an error, we also need to recognize that nothing has changed in the business process and source system that permitted the error to exist, or in the source system data that still contains the error.
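The following minimal sketch shows how these four choices might be applied to incoming customer records; the field names, thresholds, and classification rule are illustrative assumptions, not rules prescribed by the text:

# A minimal sketch of applying the four error-handling choices to hypothetical
# customer records; field names and thresholds are illustrative assumptions.

def cleanse_customer(record):
    """Return (action, cleansed_record_or_None)."""
    # 1. Reject the record: critical data (the customer id) is missing.
    if not record.get("customer_id"):
        return "reject", None

    # 4. Correct the error: derive a missing customer type from business volume.
    if record.get("customer_type") is None and record.get("annual_volume") is not None:
        record = dict(record)
        record["customer_type"] = "large" if record["annual_volume"] > 10_000_000 else "small"
        return "corrected", record

    # 3. Set a default value: no volume available, assume the predominant type.
    if record.get("customer_type") is None:
        record = dict(record)
        record["customer_type"] = "small"
        return "defaulted", record

    # 2. Accept the record (and any tolerable error it carries).
    return "accepted", record

incoming = [
    {"customer_id": None, "customer_type": None,    "annual_volume": 5000},
    {"customer_id": 7,    "customer_type": None,    "annual_volume": 25_000_000},
    {"customer_id": 8,    "customer_type": None,    "annual_volume": None},
    {"customer_id": 9,    "customer_type": "small", "annual_volume": 40_000},
]

for rec in incoming:
    print(cleanse_customer(rec))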
Actually, there is also a fifth option. After we evaluate the implications of exercising one of the four options above, we may decide that none of them is acceptable. When this happens, we will need to reexamine and potentially change our quality expectations. When the quality expectations change, we may also need to reevaluate whether or not the data warehouse can still meet its intended objectives. The data cleansing process may consist of a combination of automated and manual processes. For example, the validation checks may detect an error that we determined falls into the fourth category. If we can't create an algorithm to fix the error condition, we may be faced with the prospect of suspending the record and waiting until a person looks at it and makes the correction. A cleansing process that depends on manual intervention requires a strong business commitment to make the corrections. Otherwise, the data will sit in the suspense files for a long time and will not be available to support strategic analysis.

Integration

Data integration merges data from multiple sources into a single, enterprise-oriented view. First, we must recognize that duplicate instances of the same item exist. Once we recognize that, we need to merge the information from these multiple instances. With customer data, for example, the same customer may exist in multiple files. We could be faced with three customer records, as shown in Figure 6. On the surface, these may or may not be instances of the same customer. Our first challenge in data integration is to determine this.
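A minimal sketch of this first challenge follows, assuming hypothetical customer records from two source systems and a deliberately crude matching rule (normalized surname plus postal code); in practice, the scrubbing and matching tools mentioned in the next paragraph would do this work:

# A minimal sketch of detecting duplicate customer instances across sources;
# the record layouts and the matching rule are illustrative assumptions.
from collections import defaultdict

source_a = [
    {"src": "billing", "src_id": "B-100", "name": "J. Smith",  "postal_code": "30301"},
    {"src": "billing", "src_id": "B-101", "name": "ACME Corp", "postal_code": "10001"},
]
source_b = [
    {"src": "orders", "src_id": "O-777", "name": "John Smith", "postal_code": "30301"},
]

def match_key(record):
    """Crude match key: last word of the name, lowercased, plus postal code."""
    return (record["name"].split()[-1].lower().strip("."), record["postal_code"])

candidates = defaultdict(list)
for record in source_a + source_b:
    candidates[match_key(record)].append((record["src"], record["src_id"]))

for key, members in candidates.items():
    if len(members) > 1:
        print("possible duplicate customer", key, "->", members)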
If we're implementing a data warehouse to support CRM, this step is crucial. We need to know the value of each of our customers, all the products they own, and all the interactions we've had with them. Only then can we devise a set of actions that will benefit the customer and be profitable. Fortunately, for customer data, there is a variety of data scrubbing, data matching, and data householding tools available to help with data integration. Within the data acquisition process, we may need to create a table that relates the customer identifier in the source system to the customer identifier in the data warehouse. Once the customers are integrated, we can use this table to relate the customer in the source system to the data warehouse instance. The second part of the integration process entails merging the information from the multiple sources together. This requires an element-by-element decision by the business community and the implementation of the logic by the IT community. Resolving the conflicts that often arise in this process is another role of the data steward.

Transformation

The coding structures may differ among the source systems, and these need to be transformed into a single structure for the data warehouse. Also, the physical representation of the data may differ, and again, a single approach is needed. These are two examples of data transformation. In the first instance, the business community often needs to be involved; the second instance is a technical decision, as long as the business needs can be met.

Loading

The last step of the data acquisition process is the load. During this step, the data is physically moved into the data warehouse and becomes available for subsequent dissemination to the data marts. The data warehouse load is a batch process and, with rare exception, consists of record insertions. Because the data warehouse retains history, each time changed data is brought in, it is inserted as a new record alongside the existing one rather than overwriting it. Some factors to consider in designing the load process include the use of a staging area to prepare the data for the load, making a backup copy of the data being loaded, determining the sequence in which each of the sources needs to be loaded, and, within that, determining the sequence in which the data itself needs to be loaded.
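To close, here is a minimal sketch of a code-structure transformation followed by an insert-only load, assuming a hypothetical gender-code mapping and a time-keyed customer_history table; the names and codes are illustrative:

import sqlite3

# Hypothetical mapping of source-system gender codes to a single warehouse standard.
GENDER_CODES = {"M": "male", "F": "female", "1": "male", "2": "female"}

def transform(record):
    """Map source coding structures to the warehouse standard."""
    record = dict(record)
    record["gender"] = GENDER_CODES.get(record["gender"], "unknown")
    return record

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_history (
        customer_id INTEGER NOT NULL,
        load_date   TEXT    NOT NULL,   -- time as part of the key preserves history
        gender      TEXT    NOT NULL,
        PRIMARY KEY (customer_id, load_date)
    )
""")

def load(records, load_date):
    """Insert-only batch load: changed data is added as new rows, never overwritten."""
    conn.executemany(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        [(r["customer_id"], load_date, r["gender"]) for r in records],
    )

staged = [{"customer_id": 1, "gender": "F"}, {"customer_id": 2, "gender": "1"}]
load([transform(r) for r in staged], "2024-07-01")

for row in conn.execute("SELECT * FROM customer_history"):
    print(row)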