Complete Notes
A Data Warehouse is not an individual repository product. Rather, it is an overall strategy, or process,
for building decision support systems and a knowledge-based applications architecture and
environment that supports both everyday tactical decision making and long-term business strategizing.
The Data Warehouse environment positions a business to utilize an enterprise-wide data store to link
information from diverse sources and make the information accessible for a variety of user purposes,
most notably, strategic analysis. Business analysts must be able to use the Warehouse for such strategic
purposes as trend identification, forecasting, competitive analysis, and targeted market research.
Data Warehouses and Data Warehouse applications are designed primarily to support executives,
senior managers, and business analysts in making complex business decisions. Data Warehouse
applications provide the business community with access to accurate, consolidated information from
various internal and external sources.
If you run a typical analytical query (see the example below) on your production systems, you will find that it locks many of your tables and consumes most of your resources, because it accesses a large volume of data and performs a lot of calculations. This can bring production work to a virtual halt. Now imagine hundreds of such queries running at the same time on your production systems.
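For illustration, a hypothetical analytical query of this kind (the table and column names are invented for the example) might be:

    -- Yearly sales analysis: scans and aggregates a full year of rows across three large tables
    SELECT c.region,
           p.product_category,
           SUM(s.sale_amount) AS total_sales
    FROM   sales s
           JOIN customers c ON c.customer_id = s.customer_id
           JOIN products  p ON p.product_id  = s.product_id
    WHERE  s.sale_date BETWEEN DATE '2009-01-01' AND DATE '2009-12-31'
    GROUP  BY c.region, p.product_category
    ORDER  BY total_sales DESC;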
Reporting and analysis work typically accesses data across many database tables, whereas production work typically accesses a specific customer, product, or channel record at a point in time. That is why it is important to perform information-generation work on an offline platform (the Data Warehouse). The purpose of a Data Warehouse is to keep analysis and reporting (non-production use of data) separate from production data.
Organizations are riddled with tens of important systems from which their information comes. Each of these systems may carry the information in different formats and may also hold out-of-sync information (different customer ID formats, mismatches in supplier statuses). By bringing the data from these disparate sources into a common place, one can effectively bring uniformity and consistency to the data, as sketched below (refer to Cleansing and Data Transformation).
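As a minimal sketch (the source table names, ID prefixes, and formats here are assumptions, not from an actual system), conforming customer IDs from two source systems during transformation could look like:

    -- Standardize two different customer ID formats into one consistent warehouse format
    SELECT 'CRM-'  || LPAD(crm_cust_id, 10, '0')  AS customer_id,
           UPPER(TRIM(cust_name))                 AS customer_name
    FROM   crm_customers
    UNION ALL
    SELECT 'BILL-' || LPAD(bill_cust_no, 10, '0') AS customer_id,
           UPPER(TRIM(cust_name))                 AS customer_name
    FROM   billing_customers;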
High Response Time - Production databases are tuned to the expected transaction load:
Even if you run such an analytical query on an offline database whose design is the same as production, it will still take a long time, because production databases are built to cater to production work. In production systems there is an expected intensity for each kind of action, so indexing, normalization, and other design decisions are made for a given transaction load. The Data Warehouse, however, has to be ready for fairly unpredictable loads and types of queries, which demands a high degree of flexibility and quick response time.
It is easier for users to define the production work and functionality they want, but difficult to define the analysis they need. Analysis needs keep changing, and the Data Warehouse must have the capability to adapt quickly to these changing requirements. Please refer to 'Dimension Modeling'.
Establish the foundation for Decision Support
The decisioning process of an organization involves analysis, data mining, forecasting, decision modeling, and so on. A common point that can provide consistent, quality data with quick response time is the core enabler for making fast and informed decisions.
DW – ARCHITECTURE
Data Warehouse Architecture: with a Staging Area and Data Marts
Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales or mine historical data to make predictions about customer behavior.
Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts
• Subject Oriented: A data warehouse is organized around high-level business groupings called
subjects. They do not have the same atomic entity focus as OLTP systems.
• Integrated: The data in the warehouse must be integrated and consistent. That is, if two different
source systems store conflicting data about entities, or attributes of an entity, the differences
need to be resolved during the process of transforming the source data and loading it into the
data warehouse.
• Time Variant: One of the key characteristics distinguishing warehouses from operational
environments is the currency of the data. Operational systems require real-time views of the
data. Data warehouse applications generally deal with longer term, historical data. They can
also provide access to a greater volume of more detailed information, as required, over the
longer time period.
• Non-Volatile: The content of OLTP systems is, by its nature, continuously changing. Inserts, deletes, and updates form the basis of a large volume of business transactions that result in a very volatile set of data. By contrast, data warehouses are static: the data in the warehouse is read-only, and updates or refreshes of the data occur on a periodic incremental or full-refresh basis.
SURROGATE KEY
A surrogate key is an artificial key created in the data warehouse to generate a unique number for each record. The surrogate key works on behalf of the primary key, and its main purpose is to maintain historical records, as the sketch below illustrates.
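A minimal sketch of a dimension table carrying a surrogate key (the table and column names are hypothetical):

    CREATE TABLE dim_customer (
        customer_sk     INTEGER PRIMARY KEY,  -- surrogate key generated inside the warehouse
        customer_id     VARCHAR(20),          -- natural (business) key from the source system
        customer_name   VARCHAR(100),
        effective_date  DATE,                 -- start date of this version of the record
        expiry_date     DATE                  -- end date of this version; allows history to be kept
    );

Because the surrogate key, not the source system's key, identifies each row, several historical versions of the same source customer can coexist in the dimension.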
Informatica server log: The Informatica server creates a log for all status and error messages. It also creates an error log for error messages. These files are created in the Informatica home directory.
Session log file: The Informatica server creates a session log file for each session. It writes information about the session into the log file, such as the initialization process, the creation of SQL commands for the reader and writer threads, errors encountered, and the load summary. The amount of detail in the session log file depends on the tracing level that you set.
Session detail file: This file contains load statistics for each target in the mapping. Session details include information such as the table name and the number of rows written or rejected. You can view this file by double-clicking the session in the Monitor window.
Performance detail file: This file contains information known as session performance details, which helps you identify where performance can be improved. To generate this file, select the performance detail option in the session property sheet.
Reject file: This file contains the rows of data that the writer does not write to targets.
Control file: The Informatica server creates a control file and a target file when you run a session that uses the external loader. The control file contains information about the target flat file, such as the data format and loading instructions for the external loader.
Post-session email: Post-session email allows you to automatically communicate information about a session run to designated recipients. You can create two different messages: one if the session completes successfully, the other if the session fails.
Indicator file: If you use a flat file as a target, you can configure the Informatica server to create an indicator file. For each target row, the indicator file contains a number to indicate whether the row was marked for insert, update, delete, or reject.
Output file: If a session writes to a target file, the Informatica server creates the target file based on the file properties entered in the session property sheet.
Cache files: When the Informatica server creates a memory cache, it also creates cache files. The Informatica server creates index and data cache files for the following transformations:
Aggregator transformation
Joiner transformation
Rank transformation
Lookup transformation
TARGET OVERRIDES
Target update (by using an Update Strategy transformation) will work only when we have a primary key in the target; the update happens based on this key. If we do not have a primary key, the update will not work.
To overcome this, if our target has no key column, or if we want to update the target based on a column other than the primary key column, we need to use the Target Update Override option available in the target properties, as sketched below.
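A minimal sketch of such a Target Update Override (the table and column names are hypothetical; :TU refers to the ports of the target definition):

    UPDATE T_CUSTOMER
    SET    CUST_NAME   = :TU.CUST_NAME,
           CUST_STATUS = :TU.CUST_STATUS
    -- the update is keyed on a non-primary-key column instead of the primary key
    WHERE  CUST_EMAIL  = :TU.CUST_EMAIL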
MAPPING OPTIMIZATION
When to optimize mappings
-------------------------
The best time in the development cycle is after system testing. Focus on mapping-level optimization
only after optimizing the target and source databases.
Use Session Log to identify if the source, target or transformations are the performance bottleneck
----------------------------------------------------------------------------------------------
The session log contains thread summary records. If one thread has a busy percentage close to 100% while the others have significantly lower values, the thread with the high busy percentage is the bottleneck; for example, if the transformation thread is the one close to 100% busy, the session is transformation bound.
INCREMENTAL AGGREGATION
Whenever a session is created for a mapping with an Aggregator transformation, the Incremental Aggregation session option can be enabled. When PowerCenter performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform the new aggregation calculations incrementally.
Using incremental aggregation, we apply the captured changes in the source data (the CDC part) to the aggregate calculations in a session. If the source changes incrementally and we can capture those changes, then we can configure the session to process only those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to delete the previously loaded data, process the entire source, and recalculate the same data each time the session runs.
When the session runs with incremental aggregation enabled for the first time, say in the 1st week of January, we use the entire source. This allows the Integration Service to read and store the necessary aggregate data information. In the 2nd week of January, when we run the session again, we filter out the CDC records from the source, i.e. the records loaded after the initial load. The Integration Service then processes this new data and updates the target accordingly; the aggregation being maintained is sketched below.
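Conceptually (the table and column names below are illustrative only), the aggregate that the session maintains is equivalent to:

    -- Monthly amount per customer, derived from invoice-level source rows (simplified sketch)
    SELECT customer_key,
           TO_CHAR(load_date, 'YYYYMM') AS mon_key,
           SUM(amount)                  AS amount
    FROM   source_invoices
    GROUP  BY customer_key, TO_CHAR(load_date, 'YYYYMM');

With incremental aggregation, only the new source rows are read each week and their amounts are added to the cached group totals, instead of re-reading the whole table.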
Use incremental aggregation when the changes do not significantly change the target. If processing the incrementally changed source alters more than half of the existing target, the session may not benefit from incremental aggregation; in that case, drop and recreate the target with the entire source data and recalculate the same aggregations. Incremental aggregation may be helpful, for example, when we need to load a monthly fact table on a weekly basis.
Let us see a sample mapping to implement incremental aggregation:
Look at the Source Qualifier query, which fetches the CDC part using a BATCH_LOAD_CONTROL table that saves the last successful load date for the particular mapping; a sketch of such an override follows.
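A sketch of such a Source Qualifier SQL override (the column names of BATCH_LOAD_CONTROL and the mapping name are assumptions for illustration):

    SELECT s.customer_key,
           s.invoice_key,
           s.amount,
           s.load_date
    FROM   source_invoices s
    WHERE  s.load_date > (SELECT MAX(last_load_date)        -- only rows loaded after the
                          FROM   batch_load_control         -- last successful session run
                          WHERE  mapping_name = 'm_agg_incremental_load')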
Look at the ports tab of the Expression transformation. If we want to reinitialize the aggregate cache, say during the first week of every month, we configure another session identical to the previous one, with the only change being that the 'Reinitialize aggregate cache' property is checked.
Now have a look at the source table data:
(Source table columns: CUSTOMER_KEY, INVOICE_KEY, AMOUNT, LOAD_DATE)
After the first load in the 1st week of Jan 2010, the data in the target is as follows:
(Target table columns: CUSTOMER_KEY, INVOICE_KEY, MON_KEY, AMOUNT)
During the 2nd week load, the session processes only the incremental data in the source, i.e. those records having a load date greater than the last session run date. After the 2nd week's load, incremental aggregation of the incremental source data with the aggregate cache file data updates the target table with the following dataset:
CUSTOMER_KEY  INVOICE_KEY  MON_KEY  AMOUNT  REMARKS
1111          6008         201001   450     Cache file updated after aggregation
2222          6009         201001   500     Cache file updated after aggregation
3333          5003         201001   300     Cache file remains the same as before
4444          1234         201001   350     New group row inserted in cache file
The first time we run an incremental aggregation session, the Integration Service processes the entire
source. At the end of the session, the Integration Service stores aggregate data for that session run in
two files, the index file and the data file. The Integration Service creates the files in the cache directory
specified in the Aggregator transformation properties. Each subsequent time we run the session with incremental aggregation, it uses only the incremental source changes in the session. For each input record,
the Integration Service checks historical information in the index file for a corresponding group. If it
finds a corresponding group, the Integration Service performs the aggregate operation incrementally,
using the aggregate data for that group, and saves the incremental change. If it does not find a
corresponding group, the Integration Service creates a new group and saves the record data.
When writing to the target, the Integration Service applies the changes to the existing target. It saves
modified aggregate data in the index and data files to be used as historical data the next time you run
the session.
Each subsequent time we run a session with incremental aggregation, the Integration Service creates a
backup of the incremental aggregation files. The cache directory for the Aggregator transformation
must contain enough disk space for two sets of the files.
The Integration Service creates new aggregate data, instead of using historical data, when we configure the session to reinitialize the aggregate cache, when we delete the cache files, and so on.
When the Integration Service rebuilds incremental aggregation files, the data in the previous files is
lost.
Note: To protect the incremental aggregation files from file corruption or disk failure,
periodically back up the files.