Unit 9 - Data Warehousing
Unit 9 - Data Warehousing
Unit 9 - Data Warehousing
Data Warehousing
The data warehouse is the heart of the architected environment, and is the foundation of all DSS
processing. The job of the DSS analyst in the data warehouse environment is massively easier than
in the classical legacy environment because there is a single integrated source of data (the data
warehouse) and because the granular data in the data warehouse is easily accessible.
A data warehouse is a large database built from the operational database that organizes all the data
available in an organization, makes it accessible & usable for the all kinds of data analysis and
also allows to create a lots of reports by the use of mining tools.
Subject Oriented
Integrated
Arjun Lamichhane 1
Data Mining and Data Warehousing Unit 9: Data Warehousing
image. Figure 2 illustrates the integration that occurs when data passes from the application-
oriented operational environment to the data warehouse.
Nonvolatile
Time Variant
Time variance implies that every unit of data in the data warehouse is accurate as of some one
moment in time. In some cases, a record is time stamped. In other cases, a record has a date of
transaction. Different environments have different time horizons. A time horizon is the parameters
of time represented in an environment. The collective time horizon for the data found inside a data
warehouse is significantly longer than that of operational systems. A 60-to-90-day time horizon is
normal for operational systems; a 5-to-10-year time horizon is normal for the data warehouse. As
a result of this difference in time horizons, the data warehouse contains much more history than
any other environment
Arjun Lamichhane 2
Data Mining and Data Warehousing Unit 9: Data Warehousing
Data Mart
A data mart contains a subset of corporate-wide data that is of value to a specific group of users.
The scope is confined to specific selected subjects. For example, a marketing data mart may
confine its subjects to customer, item, and sales. The data contained in data marts tend to be
summarized. Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX or Windows based. The implementation cycle of a data mart is more likely to be
measured in weeks rather than months or years. However, it may involve complex integration in
the long run if its design and planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or dependent.
Independent data marts are sourced from data captured from one or more operational systems or
external information providers, or from data generated locally within a particular department or
geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
Arjun Lamichhane 3
Data Mining and Data Warehousing Unit 9: Data Warehousing
Extract
Transform
Load
Load the data into temporary data source first and then perform simple transformation into
structure similar to one in data warehouse
Consume as little resource as possible
Warehouse Manager (Data Manager): It is the system component that performs analysis of data
to ensure consistency. The data from various sources and temporary storage are merged into data
warehouse by the warehouse manager. The job of backing-up and archiving data as well as creation
of index is performed by this manager.
Arjun Lamichhane 4
Data Mining and Data Warehousing Unit 9: Data Warehousing
Query Manager: Performs all the operations necessary to support the query management process
by directing queries to the appropriate tables. They present the data to the user in a form they
understand. In some cases it also stores query profiles to allow the warehouse manager to
determine which indexes and aggregations are appropriate.
A data warehouse requires a concise, subject-oriented schema that facilitates on-line data analysis.
The most popular data model for a data warehouse is a multidimensional model. Such a model can
exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
2. Hardware integration: Once the hardware and software have been selected, they need to be
put together by integrating the servers, the storage devices and the client software tools.
3. Modeling: Modeling is a major step that involves designing the warehouse schema and views.
This may involve using a modeling tool if the data warehouse is complex.
4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is
required. This involves designing the physical data warehouse organization, data placement, data
partitioning, deciding on access methods and indexing.
Arjun Lamichhane 5
Data Mining and Data Warehousing Unit 9: Data Warehousing
5. Sources: The data for the data warehouse is likely to come from a number of data sources. This
step involves identifying and connecting the sources using gateways, ODBC drives or other
wrappers.
6. ETL: The data from the source systems will need to go through an ETL process. The step of
designing and implementing the ETL process may involve identifying a suitable ETL tool vendor
and purchasing and implementing the tool. This may include customizing the tool to suit the needs
of the enterprise.
7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools
will be required. Once everything is working satisfactorily, the ETL tools may be used in
populating the warehouse given the schema and view definitions.
8. User applications: For the data warehouse to be useful there must be end-user applications.
This step involves designing and implementing applications required by the end users.
9. Roll-out the warehouse and applications: Once the data warehouse has been populated and
the end-user applications tested, the warehouse system and the applications may be rolled out for
the user community to use.
Implementation Guidelines
1. Build incrementally: Data warehouses must be built incrementally. Generally it is
recommended that a data mart may first be built with one particular project in mind and once it is
implemented a number of other sections of the enterprise may also wish to implement similar
systems. An enterprise data warehouse can then be implemented in an iterative manner allowing
all data marts to extract information from the data warehouse. Data warehouse modeling itself is
an iterative methodology as users become familiar with the technology and are then able to
understand and express their requirements more clearly.
2. Need a leader: A data warehouse project must have a leader who is willing to carry out
considerable research into expected costs and benefits of the project. Data warehousing projects
require inputs from many units in an enterprise and therefore need to be driven by someone who
is capable of interaction with people in the enterprise and can actively persuade colleagues.
Without the cooperation of other units, the data model for the warehouse and the data required to
Arjun Lamichhane 6
Data Mining and Data Warehousing Unit 9: Data Warehousing
populate the warehouse may be more complicated than they need to be. Studies have shown that
having a champion can help adoption and success of data warehousing projects.
3. Senior management support: A data warehouse project must be fully supported by the senior
management. Given the resource intensive nature of such projects and the time they can take to
implement, a warehouse project calls for a sustained commitment from senior management. This
can sometimes be difficult since it may be hard to quantify the benefits of data warehouse
technology and the managers may consider it a cost without any explicit return on investment.
Data warehousing project studies show that top management support is essential for the success
of a data warehousing project.
4. Ensure quality: Only data that has been cleaned and is of a quality that is understood by the
organization should be loaded in the data warehouse. The data quality in the source systems is not
always high and often little effort is made to improve data quality in the source systems. Improved
data quality, when recognized by senior managers and stakeholders, is likely to lead to improved
Support for a data warehouse project.
5. Corporate strategy: A data warehouse project must fit with corporate strategy and business
objectives. The objectives of the project must be clearly defined before the start of the project.
Given the importance of senior management support for a data warehousing project, the fitness of
the project with the corporate strategy is essential.
6. Business plan: The financial costs (hardware, software, and HR), expected benefits and a
project plan (including an ETL plan) for a data warehouse project must be clearly outlined and
understood by all stakeholders. Without such understanding, rumors about expenditure and
benefits can become the only source of information, undermining the project.
7. Training: A data warehouse project must not overlook data warehouse training requirements.
For a data warehouse project to be successful, the users must be trained to use the warehouse and
to understand its capabilities. Training of users and professional development of the project team
may also be required since data warehousing is a complex task and the skills of the project team
are critical to the success of the project.
8. Adaptability: The project should build in adaptability so that changes may be made to the data
warehouse if and when required. Like any system, a data warehouse will need to change, as needs
Arjun Lamichhane 7
Data Mining and Data Warehousing Unit 9: Data Warehousing
of an enterprise change. Furthermore, once the data warehouse is operational, new applications
using the data warehouse are almost certain to be proposed. The system should be able to support
such new applications.
9. Joint management: The project must be managed by both IT and business professionals in the
enterprise. To ensure good communication with the stakeholders and that the project is focused on
assisting the enterprise’s business, business professionals must be involved in the project along
with technical professionals.
References
[1] J. Han and K. Micheline, Data Mining: Concepts and Techniques, San Francisco: Elsevier
Inc., 2006.
[2] W. Inmon, Building the Data Warehouse, New York: John Wiley & Sons, Inc., 2002.
Arjun Lamichhane 8