
DATA WAREHOUSE CONCEPTS

PURPOSE OF DATA WAREHOUSE:

A Data Warehouse is not an individual repository product. Rather, it is an overall strategy, or process,
for building decision support systems and a knowledge-based application architecture and
environment that supports both everyday tactical decision making and long-term business strategizing.
The Data Warehouse environment positions a business to utilize an enterprise-wide data store to link
information from diverse sources and make the information accessible for a variety of user purposes,
most notably, strategic analysis. Business analysts must be able to use the Warehouse for such strategic
purposes as trend identification, forecasting, competitive analysis, and targeted market research.

Data Warehouses and Data Warehouse applications are designed primarily to support executives,
senior managers, and business analysts in making complex business decisions. Data Warehouse
applications provide the business community with access to accurate, consolidated information from
various internal and external sources.

The big question is: why do we have to create a Data Warehouse?

The reasons are as follows.
As a starter, let's say You want to know the monthly variation in a 3-month running average of Your
customer balances over the last twelve months, grouped by product, channel and customer segment. Let's
see why You need a data warehouse for this purpose.

Keeping Analysis/Reporting and Production Separate:

If You run the above-said query on Your production systems, You will find that it locks Your tables and
eats up most of Your resources, as it accesses a lot of data and performs a lot of calculations. This
brings production work to a virtual halt. Now imagine hundreds of such queries running at the same
time on Your production systems.

Reporting and analysis work typically accesses data across the database tables, whereas production work
typically accesses a specific customer, product or channel record at a point in time. That is why it is
important to have the information generation work done on an offline platform (aka the Data
Warehouse). One purpose of the Data Warehouse is to keep analysis/reporting (non-production use of data)
separate from production data.
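
For illustration, a query of the kind described above might look roughly like the sketch below. This is
Oracle-flavored SQL, and the table and column names (customer_balance, balance_amount, etc.) are
assumptions made for the sketch, not taken from any particular system.

-- Illustrative only: 3-month running average of monthly customer balances over
-- the last twelve months, grouped by product, channel and customer segment.
WITH monthly AS (
    SELECT product,
           channel,
           customer_segment,
           TRUNC(balance_date, 'MM')  AS balance_month,
           SUM(balance_amount)        AS total_balance
    FROM   customer_balance
    WHERE  balance_date >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -12)
    GROUP  BY product, channel, customer_segment, TRUNC(balance_date, 'MM')
)
SELECT product, channel, customer_segment, balance_month, total_balance,
       AVG(total_balance) OVER (
           PARTITION BY product, channel, customer_segment
           ORDER BY balance_month
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW   -- 3-month running window
       ) AS running_avg_3m
FROM   monthly
ORDER  BY product, channel, customer_segment, balance_month;

A query like this scans and aggregates a year of history in one pass, which is exactly the kind of load a
production system is not tuned for.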

DW purpose for Data Consistency and Quality:

Organizations typically have tens of important systems from which their information comes. Each of
these systems may carry the information in different formats and may also hold out-of-sync information
(different customer ID formats, mismatches in supplier statuses). By bringing the data from these
disparate sources to a common place, one can effectively bring uniformity and consistency to the data
(refer to Cleansing and Data Transformation).
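
As a hedged illustration of such cleansing during staging, the sketch below standardizes customer IDs
coming from two source systems into one format. The system names, ID formats and table/column names
are assumptions for the sketch.

-- Unify customer IDs from two assumed source systems into a common staging table.
INSERT INTO stg_customer (customer_id, source_system, customer_name)
SELECT LPAD(TRIM(cust_no), 10, '0')          AS customer_id,   -- pad numeric IDs to a common 10-character format
       'CRM'                                 AS source_system,
       UPPER(TRIM(cust_name))                AS customer_name
FROM   crm_customer
UNION ALL
SELECT UPPER(REPLACE(customer_ref, '-', '')) AS customer_id,   -- strip dashes from 'abc-123' style references
       'BILLING'                             AS source_system,
       UPPER(TRIM(name))                     AS customer_name
FROM   billing_customer;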

High Response Time- Production Databases are tuned to expected transaction load:

Even if You run the above-said query on an offline database whose design is the same as that of
production, it will still take a lot of time. This is because production databases are built to cater to
production work. In production systems there is some expected intensity for each kind of action, so the
indexing, normalization and other design considerations are made for a given transaction load. The
Data Warehouse, however, has to be ready for fairly unpredictable loads and types of queries, which
demands a high degree of flexibility and quick response time.

Data Warehouse objective of providing an adaptive and flexible source of information.

It is easier for users to define the production work and functionality they want, but difficult to define
the analysis they need. Analysis needs keep changing, and the Data Warehouse has the capability to
adapt quickly to the changing requirements. Please refer to 'Dimension Modeling'.
Establish the foundation for Decision Support
The decisioning process of an organization involves analysis, data mining, forecasting, decision
modeling, etc. Having a common point that can provide consistent, quality data with a high response
time is the core enabler for making fast and informed decisions.

DW – ARCHITECTURE
Data Warehouse Architecture: with a Staging Area and Data Marts

Although the architecture in Figure 1-3 is quite common, You may want to customize Your
warehouse's architecture for different groups within Your organization. You can do this by adding data
marts, which are systems designed for a particular line of business. Figure 1-4 illustrates an example
where purchasing, sales, and inventories are separated. In this example, a financial analyst might want
to analyze historical data for purchases and sales or mine historical data to make predictions about
customer behavior.
Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts

DIMENSIONS AND FACTS


A fact table typically has two types of columns: those that contain facts and those that are foreign keys
to dimension tables. Facts may be composed of measures and degenerate dimensions. The primary key of a
fact table is usually a composite key made up of all of its foreign keys.
Dimension tables, also known as lookup or reference tables, contain the relatively static data in the
warehouse. Dimension tables store the information You normally use to constrain queries. Dimension
tables are usually textual and descriptive, and You can use them as the row headers of the result set.
Examples are customers or products.
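
A minimal star-schema sketch is given below, assuming Oracle-flavored SQL; all table and column names
are illustrative, not taken from the text. It shows a dimension table, a fact table whose primary key is
composed of its foreign keys (plus a degenerate dimension), and a query that uses a dimension attribute
both to constrain the result and as its row header.

-- Illustrative product dimension
CREATE TABLE dim_product (
    product_key   NUMBER        PRIMARY KEY,   -- surrogate key
    product_code  VARCHAR2(20),                -- natural key from the source system
    product_name  VARCHAR2(100),
    category      VARCHAR2(50)
);

-- Illustrative sales fact: measures + foreign keys + a degenerate dimension
CREATE TABLE fact_sales (
    product_key   NUMBER REFERENCES dim_product (product_key),
    customer_key  NUMBER,                      -- FK to a customer dimension (not shown)
    date_key      NUMBER,                      -- FK to a date dimension (not shown)
    invoice_no    VARCHAR2(20),                -- degenerate dimension kept on the fact
    sales_amount  NUMBER(12,2),                -- measure
    quantity      NUMBER,                      -- measure
    CONSTRAINT pk_fact_sales
        PRIMARY KEY (product_key, customer_key, date_key, invoice_no)
);

-- Dimension attributes constrain the query and act as row headers of the result set.
SELECT d.category, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_product d ON d.product_key = f.product_key
WHERE  d.category = 'ELECTRONICS'
GROUP  BY d.category;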

CHARACTERISTICS OF DATA WAREHOUSE

The fundamental characteristics of a data warehouse are:

• Subject Oriented: A data warehouse is organized around high-level business groupings called
subjects. They do not have the same atomic entity focus as OLTP systems.
• Integrated: The data in the warehouse must be integrated and consistent. That is, if two different
source systems store conflicting data about entities, or attributes of an entity, the differences
need to be resolved during the process of transforming the source data and loading it into the
data warehouse.
• Time Variant: One of the key characteristics distinguishing warehouses from operational
environments is the currency of the data. Operational systems require real-time views of the
data. Data warehouse applications generally deal with longer term, historical data. They can
also provide access to a greater volume of more detailed information, as required, over the
longer time period.
• Non-Volatile: The contents of OLTP systems are, by their nature, continuously changing. Inserts,
deletes, and updates form the basis of a large volume of business transactions that result in a
very volatile set of data. By contrast, data warehouses are static. The data in the warehouse is
read-only; updates or refreshes of the data occur on a periodic incremental or full-refresh basis
(see the sketch after this list).
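
The contrast between the time-variant, non-volatile warehouse and a volatile OLTP system can be
sketched as follows; the table and column names are assumptions for illustration.

-- OLTP: volatile, keeps only the current value
UPDATE account SET balance = balance - 100 WHERE account_id = 12345;

-- Data warehouse: periodic, read-only, dated snapshots preserve history
INSERT INTO dw_account_balance_snapshot (snapshot_date, account_id, balance)
SELECT TRUNC(SYSDATE), account_id, balance
FROM   account;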

SURROGATE KEY

A Surrogate Key is an artificial key created in the data warehouse to generate a unique number for each
record. The surrogate key works on behalf of the natural (primary) key, and its main purpose is to
maintain historical records.
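
A hedged sketch of how a surrogate key supports history is given below, using a Type 2 slowly changing
dimension as the example; the sequence, table and column names are assumptions.

-- Surrogate keys generated by the warehouse, independent of the source's natural key
CREATE SEQUENCE seq_customer_key;

CREATE TABLE dim_customer (
    customer_key   NUMBER PRIMARY KEY,   -- surrogate key
    customer_id    VARCHAR2(20),         -- natural key from the source system
    customer_city  VARCHAR2(50),
    effective_from DATE,
    effective_to   DATE                  -- NULL for the current version of the record
);

-- When the customer's city changes, close the current row and insert a new row
-- with a new surrogate key, so the historical record is preserved.
UPDATE dim_customer
SET    effective_to = SYSDATE
WHERE  customer_id = 'C1001' AND effective_to IS NULL;

INSERT INTO dim_customer (customer_key, customer_id, customer_city, effective_from, effective_to)
VALUES (seq_customer_key.NEXTVAL, 'C1001', 'PUNE', SYSDATE, NULL);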

SESSION LOG FILE

Informatica server log: The Informatica server creates a log for all status and error messages. It also
creates an error log for error messages. These files are created in the Informatica home directory.

Session log file: The Informatica server creates a session log file for each session. It writes information
about the session into the log file, such as the initialization process, creation of SQL commands for the
reader and writer threads, errors encountered, and the load summary. The amount of detail in the session
log file depends on the tracing level that You set.

Session detail file: This file contains load statistics for each target in the mapping. Session details
include information such as the table name and the number of rows written or rejected. You can view this
file by double-clicking the session in the Monitor window.

Performance detail file: This file contains information known as session performance details, which
helps You see where performance can be improved. To generate this file, select the performance detail
option in the session property sheet.
Reject file: This file contains the rows of data that the writer does not write to targets.

Control file: The Informatica server creates a control file and a target file when You run a session that
uses the external loader. The control file contains information about the target flat file, such as the
data format and loading instructions for the external loader.

Post-session email: Post-session email allows You to automatically communicate information about a
session run to designated recipients. You can create two different messages: one if the session completes
successfully, the other if the session fails.

Indicator file: If You use a flat file as a target, You can configure the Informatica server to create an
indicator file. For each target row, the indicator file contains a number to indicate whether the row was
marked for insert, update, delete or reject.

Output file: If the session writes to a target file, the Informatica server creates the target file based on
the file properties entered in the session property sheet.

Cache files: When the Informatica server creates a memory cache, it also creates cache files. The
Informatica server creates index and data cache files for the following transformations:

Aggregator transformation
Joiner transformation
Rank transformation
Lookup transformation

TARGET OVERRIDES
Target Update (by using the Update Strategy transformation) works only when we have a primary key in
the target; the update happens based on this key. If we don't have a primary key, the update will not
work.
To overcome this, if our target does not have any key column, or if we want to update the target based
on a column other than the primary key column, then we need to use the Target Update Override option
available in the target properties.
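
A hedged example of such an override follows, using the UPDATE ... :TU.<port> pattern of the Target
Update Override property; the table and column names are illustrative, and the point is simply that the
WHERE clause uses a non-key column.

-- Illustrative Target Update Override: update the target based on CUSTOMER_EMAIL,
-- a non-key column. :TU.<port> refers to the value the mapping supplies for that
-- target port. Table and column names are assumptions.
UPDATE T_CUSTOMER
SET    CUSTOMER_NAME  = :TU.CUSTOMER_NAME,
       CUSTOMER_PHONE = :TU.CUSTOMER_PHONE
WHERE  CUSTOMER_EMAIL = :TU.CUSTOMER_EMAIL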

MAPPING OPTIMIZATION
When to optimize mappings
-------------------------
The best time in the development cycle is after system testing. Focus on mapping-level optimization
only after optimizing the target and source databases.

Use Session Log to identify if the source, target or transformations are the performance bottleneck
----------------------------------------------------------------------------------------------
The session log contains thread summary records:

MASTER> PETL_24018 Thread [READER_1_1_1] created for the read stage of partition point [SQ_test_all_text_data] has completed: Total Run Time = [11.703201] secs, Total Idle Time = [9.560945] secs, Busy Percentage = [18.304876].
MASTER> PETL_24019 Thread [TRANSF_1_1_1_1] created for the transformation stage of partition point [SQ_test_all_text_data] has completed: Total Run Time = [11.764368] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000].
MASTER> PETL_24022 Thread [WRITER_1_1_1] created for the write stage of partition point(s) [test_all_text_data1] has completed: Total Run Time = [11.778229] secs, Total Idle Time = [8.889816] secs, Busy Percentage = [24.523321].

If one thread has a busy percentage close to 100% and the others have significantly lower values, the
thread with the high busy percentage is the bottleneck. Busy percentage is (Total Run Time - Total Idle
Time) / Total Run Time; for the reader thread above, (11.703201 - 9.560945) / 11.703201 ≈ 18.3%. In the
example above, the session is transformation bound.

Identifying Target Bottlenecks
------------------------------
The most common performance bottleneck occurs when the Informatica Server writes to a target
database. You can identify target bottlenecks by configuring the session to write to a flat file target. If
the session performance increases significantly when you write to a flat file, you have
a target bottleneck.

Consider performing the following tasks to increase performance:


* Drop indexes and key constraints (see the sketch after this list).
* Increase checkpoint intervals.
* Use bulk loading.
* Use external loading.
* Increase database network packet size.
* Optimize target databases.
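
A hedged sketch of the first item above: drop a non-unique index on the target before the bulk load and
rebuild it afterwards. The index and table names are illustrative.

-- Before the load
DROP INDEX idx_fact_sales_cust;

-- ... run the session, ideally with bulk or external loading enabled ...

-- After the load
CREATE INDEX idx_fact_sales_cust ON fact_sales (customer_key);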

Identifying Source Bottlenecks
------------------------------
If the session reads from a relational source, you can use a Filter transformation, a read test mapping, or a
database query to identify source bottlenecks:
* Filter transformation - measure the time taken to process a given amount of data, then add an always-
false Filter transformation in the mapping after each Source Qualifier so that no data is processed past
the filter. You have a source bottleneck if the new session runs in about the same time.
* Read test session - compare the time taken to process a given set of data using the original session with
the time for a session based on a copy of the mapping in which all transformations after the Source
Qualifier are removed and the Source Qualifiers are connected to file targets. You have a source
bottleneck if the new session runs in about the same time.
* Database query - extract the query from the session log and run it in a query tool. Measure the time
taken to return the first row and the time to return all rows. If there is a significant difference between
the two, you can use an optimizer hint to eliminate the source bottleneck (as sketched below).
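
A hedged sketch of the last technique: rerun the extracted query with an optimizer hint when the all-rows
time is much worse than the first-row time. The hint and the table/column names are illustrative Oracle
syntax, not taken from the text.

-- Retry the extracted source query with an illustrative hint
SELECT /*+ FULL(s) PARALLEL(s, 4) */
       s.customer_key, s.invoice_key, s.amount, s.load_date
FROM   invoice_source s
WHERE  s.load_date >= TO_DATE('01/01/2010', 'MM/DD/YYYY');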

Consider performing the following tasks to increase performance:


* Optimize the query.
* Use conditional filters.
* Increase database network packet size.
* Connect to Oracle databases using IPC protocol.

INCREMENTAL AGGREGATION
Whenever a session is created for a mapping with an Aggregator transformation, the session option for
Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it
passes new source data through the mapping and uses historical cache data to perform the new
aggregation calculations incrementally.
Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate
calculations in a session. If the source changes incrementally and we can capture the changes, then we
can configure the session to process those changes. This allows the Integration Service to update the
target incrementally, rather than forcing it to delete the previously loaded data, process the entire source,
and recalculate the same data each time the session runs.
When the session runs with incremental aggregation enabled for the first time, say in the 1st week of Jan,
we use the entire source. This allows the Integration Service to read and store the necessary aggregate
data information. In the 2nd week of Jan, when we run the session again, we filter out the CDC records
from the source, i.e. the records loaded after the initial load. The Integration Service then processes this
new data and updates the target accordingly.
Use incremental aggregation when the changes do not significantly change the target. If processing the
incrementally changed source alters more than half the existing target, the session may not benefit from
incremental aggregation; in that case, drop and recreate the target with the entire source data and
recalculate the aggregations. Incremental aggregation may be helpful, for example, when we need to load
data into monthly facts on a weekly basis.
Let us see a sample mapping to implement incremental aggregation:

Look at the Source Qualifier query to fetch the CDC part, using a BATCH_LOAD_CONTROL table that
saves the last successful load date for the particular mapping (a hedged sketch follows).
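
The override could look roughly like the sketch below. BATCH_LOAD_CONTROL comes from the text
above, but its columns and the source table name are assumptions for illustration.

-- Fetch only the CDC rows loaded after the last successful run of this mapping
SELECT src.customer_key,
       src.invoice_key,
       src.amount,
       src.load_date
FROM   invoice_source src
WHERE  src.load_date > (SELECT ctl.last_load_date
                        FROM   batch_load_control ctl
                        WHERE  ctl.mapping_name = 'm_load_monthly_fact');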
Look at the ports tab of Expression transformation.

Look at the ports tab of Aggregator Transformation.


Now the most important part: the session properties configuration to implement incremental aggregation.

If we want to reinitialize the aggregate cache, say during the first week of every month, we configure
another session identical to the previous one, the only change being that the Reinitialize aggregate cache
property is checked.
Now have a look at the source table data:
CUSTOMER_KEY  INVOICE_KEY  AMOUNT  LOAD_DATE
1111          5001         100     01/01/2010
2222          5002         250     01/01/2010
3333          5003         300     01/01/2010
1111          6007         200     07/01/2010
1111          6008         150     07/01/2010
2222          6009         250     07/01/2010
4444          1234         350     07/01/2010
5555          6157         500     07/01/2010

After the first load in the 1st week of Jan 2010, the data in the target is as follows:

CUSTOMER_KEY  INVOICE_KEY  MON_KEY  AMOUNT
1111          5001         201001   100
2222          5002         201001   250
3333          5003         201001   300

Now, during the 2nd week's load, the session processes only the incremental data in the source, i.e. those
records having a load date greater than the last session run date. After the 2nd week's load, incremental
aggregation of the incremental source data with the aggregate cache file data updates the target table
with the following dataset:

CUSTOMER_KEY  INVOICE_KEY  MON_KEY  AMOUNT  Remarks/Operation
1111          6008         201001   450     The cache file is updated after aggregation
2222          6009         201001   500     The cache file is updated after aggregation
3333          5003         201001   300     The cache file remains the same as before
4444          1234         201001   350     New group row inserted in the cache file
5555          6157         201001   500     New group row inserted in the cache file

The first time we run an incremental aggregation session, the Integration Service processes the entire
source. At the end of the session, the Integration Service stores aggregate data for that session run in
two files, the index file and the data file. The Integration Service creates the files in the cache directory
specified in the Aggregator transformation properties. Each subsequent time we run the session with
incremental aggregation, we use the incremental source changes in the session. For each input record,
the Integration Service checks historical information in the index file for a corresponding group. If it
finds a corresponding group, the Integration Service performs the aggregate operation incrementally,
using the aggregate data for that group, and saves the incremental change. If it does not find a
corresponding group, the Integration Service creates a new group and saves the record data.

When writing to the target, the Integration Service applies the changes to the existing target. It saves
modified aggregate data in the index and data files to be used as historical data the next time you run
the session.
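
Conceptually, the effect on the target of applying the incremental aggregates is similar to the SQL
MERGE sketched below. This is only an illustration: PowerCenter applies the changes through its writer
using the index and data cache files, not by issuing this statement, and the table and column names
merely follow the example data above.

-- Conceptual sketch of what incremental aggregation effectively does to the target
MERGE INTO t_monthly_fact tgt
USING (
    SELECT customer_key,
           TO_CHAR(load_date, 'YYYYMM') AS mon_key,
           MAX(invoice_key)             AS invoice_key,  -- latest invoice per group, as in the example
           SUM(amount)                  AS amount
    FROM   stg_invoice_cdc                               -- only the incremental (CDC) rows
    GROUP  BY customer_key, TO_CHAR(load_date, 'YYYYMM')
) src
ON (tgt.customer_key = src.customer_key AND tgt.mon_key = src.mon_key)
WHEN MATCHED THEN UPDATE SET
     tgt.amount      = tgt.amount + src.amount,          -- aggregate incrementally with the existing group
     tgt.invoice_key = src.invoice_key
WHEN NOT MATCHED THEN INSERT (customer_key, invoice_key, mon_key, amount)
     VALUES (src.customer_key, src.invoice_key, src.mon_key, src.amount);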
Each subsequent time we run a session with incremental aggregation, the Integration Service creates a
backup of the incremental aggregation files. The cache directory for the Aggregator transformation
must contain enough disk space for two sets of the files.

The Integration Service creates new aggregate data, instead of using historical data, when we configure
the session to reinitialize the aggregate cache, when we delete the cache files, and so on. When the
Integration Service rebuilds the incremental aggregation files, the data in the previous files is lost.
Note: To protect the incremental aggregation files from file corruption or disk failure,
periodically back up the files.
