Informatica Best Practices
You can set sessions to retain log files according to the following options:
1. Session Runs. The Integration Service creates a designated number of session log files. Configure
the number of session logs in the Save Session Log for These Runs option. The Integration
Service does not archive binary logs.
2. Session Time Stamp. The Integration Service creates a log for all sessions, appending a time
stamp to each log. When you save a session log by time stamp, the Integration Service archives
the binary logs and text log files.
To retain multiple session and workflow log files in the session log folder, edit the default session configuration as follows:
1. Open the Workflow Manager.
2. Select Tasks > Session Configuration > Edit.
3. Click the Properties tab.
4. For Save Session log by, select Session runs or Session timestamp from the list.
5. Optional. If you selected Session runs in step 4, enter a numeric value for the Save session log for these runs attribute.
Note: These settings take effect for new sessions you create. For existing sessions you will get the Revert option; if you click Revert, the session picks up your modified configuration settings.
$PMSessionLogCount Integration Service Variable
You can also use the $PMSessionLogCount service variable to create the configured number of session
logs for the Integration Service.
1. Go to the Administration Console.
2. Go to the Integration Service > Properties > General Properties.
3. Set the value of $PMSessionLogCount. For example, if it is set to 2, a history of 3 (2+1) log files is kept.
4. Open the Workflow Manager.
5. Select Tasks > Session Configuration > Edit.
6. Select the Properties tab.
7. Set the Save session log for these runs attribute to the $PMSessionLogCount variable configured in the Administration Console.
HOW TO: Rename an Informatica Repository
There may be cases when you would like to rename an Informatica repository; an effective way of doing so is as follows.
Make sure that none of the developers are connected to the Informatica repository while performing the following operations.
Step 1: Using the Repository Server Administration Console, create a repository backup with the file name "BI_Informatica_Rep" and skip workflow and session logs (this saves time and space), as shown in Figure 1.
Figure 1: Repository Backup - Advanced Options
Step 2: Validate that the backup has been created in X:\Infa_714\RepositoryServer\bin\Backup.
Step 3: Create a copy of the license file, NewName.lic, from OldName.lic; the license file can be found in X:\Infa_714\RepositoryServer\bin.
Step 4: Using the Repository Server Administration Console, drop the existing repository.
Step 5: Using the Repository Server Administration Console, create a new repository by restoring the backup file "BI_Informatica_Rep", as shown in Figure 2.
Figure 2: Create New Repository using Restore
Informatica Performance Improvement Tips
We often come across situations where the Data Transformation Manager (DTM) takes more time to read from the source or to write to the target. The following standards/guidelines can improve overall performance.
Use a single Source Qualifier to join source tables that reside in the same schema.
Make use of the Source Qualifier "Filter" property if the source type is relational.
If subsequent sessions perform a lookup on the same table, use a persistent cache in the first session; the data remains in the cache and is available to the subsequent sessions.
Use flags as integers, as integer comparison is faster than string comparison.
Use the table with the smaller number of records as the master table for joins.
While reading from flat files, define the appropriate data types instead of reading everything as strings and converting later.
Connect only the ports that are required to subsequent transformations; otherwise, check whether those ports can be removed.
Suppress the generated ORDER BY by placing '--' at the end of the SQL override in Lookup transformations (see the example after this list).
Minimize the number of Update Strategy transformations.
Group by simple columns in transformations such as Aggregator and Source Qualifier.
Use a Router transformation in place of multiple Filter transformations.
Turn off verbose logging when moving mappings to UAT/production environments.
For large volumes of data, drop indexes before loading and recreate them after the load.
For large volumes of records, use bulk load and increase the commit interval to a higher value.
Set the commit type to 'Target' in the sessions.
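As a brief illustration of the ORDER BY tip above (a sketch only; the table and column names are hypothetical), a Lookup SQL override could end with two comment characters:
SELECT CUSTOMER_ID, CUSTOMER_NAME
FROM CUSTOMER_DIM
ORDER BY CUSTOMER_ID --
The Integration Service appends its own ORDER BY clause to the end of the override, and the trailing '--' turns that generated clause into a comment, leaving only the explicit ORDER BY on the condition port.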
We can leverage the metadata collected in the Informatica repository for many interesting analyses; a few scenarios where the repository metadata can be leveraged are given below. These SQL queries can be executed against an Oracle repository database with no changes and require only minor modification for other databases.
Failed Sessions
The following query lists the failed sessions. To make it work for the last 'n' days, replace SYSDATE - 1 with SYSDATE - n.
QUERY:
SELECT SUBJECT_AREA AS FOLDER_NAME,
SESSION_NAME,
LAST_ERROR AS ERROR_MESSAGE,
DECODE (RUN_STATUS_CODE, 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS STATUS,
ACTUAL_START AS START_TIME,
SESSION_TIMESTAMP
FROM REP_SESS_LOG
WHERE RUN_STATUS_CODE != 1
AND TRUNC(ACTUAL_START) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
Long running Sessions
The following query lists long running sessions (those that ran for more than 10 minutes). To make it work for the last 'n' days, replace SYSDATE - 1 with SYSDATE - n.
QUERY:
SELECT SUBJECT_AREA AS FOLDER_NAME,
SESSION_NAME,
SUCCESSFUL_SOURCE_ROWS AS SOURCE_ROWS,
SUCCESSFUL_ROWS AS TARGET_ROWS,
ACTUAL_START AS START_TIME,
SESSION_TIMESTAMP
FROM REP_SESS_LOG
WHERE RUN_STATUS_CODE = 1
AND TRUNC(ACTUAL_START) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
AND (SESSION_TIMESTAMP - ACTUAL_START) > (10/(24*60))
ORDER BY SESSION_TIMESTAMP
Invalid Tasks
The following query lists the folder name, task type, task name, version number, and last-saved time for all invalid tasks.
QUERY:
SELECT SUBJECT_AREA AS FOLDER_NAME,
DECODE(IS_REUSABLE, 1, 'Reusable', ' ') || ' ' || TASK_TYPE_NAME AS TASK_TYPE,
TASK_NAME AS OBJECT_NAME,
VERSION_NUMBER,
LAST_SAVED
FROM REP_ALL_TASKS
WHERE IS_VALID=0
AND IS_ENABLED=1
ORDER BY SUBJECT_AREA,TASK_NAME
An Aggregator transformation groups records by a key, but retrieving the values of a particular column as individual output columns is a challenge; hence a component, 'Flattener', was designed based on an Expression transformation.
Flattener is a reusable component, a mapplet that performs the function of flattening records.
Flattener consists of an Expression and a Filter transformation. The Expression is used to club each incoming record based on certain logic; the decision to write the record to the target is taken using the Filter transformation.
Design:
The mapplet can receive up to five inputs, of the following data types:
i_Col1 (string), Customer
i_Col2 (string), Product
i_Col3 (decimal), Cost
i_Col4 (decimal) and
i_Col5 (date/time)
The port names are kept generic so that different data types are accepted and the mapplet can be used in any scenario where there is a need for flattening records.
The mapplet gives out 15 sets of 5 output ports, in the following manner:
o_F1_1 (string), Customer
o_F2_1 (string), Product1
o_F3_1 (decimal), Cost1
o_F4_1 (decimal) and
o_F5_1 (date/time)
o_F1_2 (string), Customer
o_F2_2 (string), Product2
o_F3_2 (decimal), Cost2
o_F4_2 (decimal) and
o_F5_2 (date/time)
… … and so on
The output record will have repetitive sets of 5 columns each (each set refers to one incoming row). Based on the requirement, the number of occurrences of these sets can be increased, and only the required fields need to be used or mapped. For the above example we use just two strings and one decimal, mapping Customer, Product, and Cost.
The mapplet receives records from its parent mapping. The Expression initially saves each incoming value to a variable and compares it with its counterpart that came in earlier and is held in its cache, for as long as the condition to flatten is satisfied.
A. Described below is a scenario where the requirement is to use a stored procedure that returns a cursor as a source. By and large, PowerCenter does not support a stored procedure that returns a cursor as a source. The workaround is as follows:
1. Create the procedure that will load the data into a new table (see the sketch after these steps).
2. Create the table that will receive the data from the procedure (also shown in the sketch below).
3. Add a Stored Procedure transformation to the PowerCenter mapping. This transformation will execute the new procedure, called LOAD in this example.
4. Set the stored procedure type to Source Pre-load, so that it is executed before the source table is read.
5. Import the EMP table as a Source Definition. This table will be populated by the new stored procedure.
If the original stored procedure is used by the customer application and you cannot change its source code, you can create a new stored procedure that calls the original one (without inserting into a table) and performs the insert into the new table by looping over the returned cursor.
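A hedged sketch of steps 1 and 2, assuming an Oracle database; GET_EMP_CURSOR is a hypothetical name for the original procedure that returns a REF CURSOR, and the columns are illustrative:
CREATE TABLE EMP
(
    EMPNO NUMBER,
    ENAME VARCHAR2(30),
    SAL   NUMBER
);

CREATE OR REPLACE PROCEDURE LOAD
AS
    l_cur   SYS_REFCURSOR;
    l_empno EMP.EMPNO%TYPE;
    l_ename EMP.ENAME%TYPE;
    l_sal   EMP.SAL%TYPE;
BEGIN
    -- Clear the rows left over from the previous run.
    DELETE FROM EMP;
    -- Call the original cursor-returning procedure and loop over its cursor,
    -- inserting each row into the table that PowerCenter reads as a source.
    GET_EMP_CURSOR(l_cur);
    LOOP
        FETCH l_cur INTO l_empno, l_ename, l_sal;
        EXIT WHEN l_cur%NOTFOUND;
        INSERT INTO EMP (EMPNO, ENAME, SAL) VALUES (l_empno, l_ename, l_sal);
    END LOOP;
    CLOSE l_cur;
    COMMIT;
END;
/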
B. Given below is a situation where you want to pass a mapping variable to a Stored Procedure transformation (it can be either connected or unconnected).
The parameters that are passed to a connected Stored Procedure transformation have to be linked from another transformation.
To pass a mapping variable to a connected Stored Procedure transformation, create an output port in an upstream Expression transformation and set its expression to $$mapping_variable. This sets the value of the output port to the mapping variable. Then link this output port to the corresponding input port of the Stored Procedure transformation.
For unconnected Stored Procedure transformations, you can use the mapping variable directly in the expression that calls the stored procedure.
If, however, you attempt to use a mapping variable to store the output value of the stored procedure, the session fails with the error below.
“TE_7002 Transformation Parse Fatal Error; transformation stopped: invalid function reference. Failed
to Initialize Server Transformation.”
To resolve the issue replace the mapping variable with the PROC_RESULT system variable.
Example:
Incorrect, using a mapping variable:
:SP.PROCEDURE(FIELD1, $$mapping_variable)
Correct, using the PROC_RESULT system variable:
:SP.PROCEDURE(FIELD1,PROC_RESULT)
Or
:SP.PROCEDURE($$mapping_variable,PROC_RESULT)
The PROC_RESULT system variable assigns the stored procedure output to the port with this expression.
The Integration Service process generates output files when we run workflows and sessions. By default,
the Integration Service logs status and error messages to log event files.
Log event files are binary files that the Log Manager uses to display log events. When we run each
session, the Integration Service also creates a reject file. Depending on transformation cache settings
and target types, the Integration Service may create additional files as well.
Output Files
The Integration Service creates the following output files:
Session Details/Logs:
When we run a session, the Integration Service creates a session log file with the load statistics, table names, error information, threads created, etc., based on the tracing level set in the session properties.
We can monitor session details in the session run properties while the session is running, has failed, or has succeeded.
Performance Details:
The Integration Service process generates performance details for session runs.
Through the performance details file we can determine where session performance can be
improved.
Performance details provide transformation-by-transformation information on the flow of data
through the session.
Reject Files:
By default, the Integration Service process creates a reject file for each target in the session. The
reject file contains rows of data that the writer does not write to targets.
The writer may reject a row in the following circumstances:
o It is flagged for reject by an Update Strategy or Custom transformation.
o It violates a database constraint, such as a primary key constraint.
o A field in the row was truncated or overflowed.
o The target database is configured to reject truncated or overflowed data.
Note: By default, the Integration Service process saves the reject file in the directory entered for the service process variable $PMBadFileDir in the Workflow Manager and names the reject file target_table_name.bad. We can view this file name at the session level.
Row Error Logs:
When we configure a session, we can choose to log row errors in a central location.
When a row error occurs, the Integration Service process logs error information that allows us to determine the cause and source of the error.
The Integration Service process logs information such as source name, row ID, current row data, transformation, timestamp, error code, error message, repository name, folder name, session name, and mapping information.
If we enable flat file logging, by default the Integration Service process saves the file in the directory entered for the service process variable $PMBadFileDir in the Workflow Manager.
Recovery Tables and Files:
The Integration Service process creates recovery tables on the target database system when it runs a session enabled for recovery.
When you run a session in recovery mode, the Integration Service process uses information in
the recovery tables to complete the session.
When the Integration Service process performs recovery, it restores the state of operations to
recover the workflow from the point of interruption.
The workflow state of operations includes information such as active service requests,
completed and running status, workflow variable values, running workflows and sessions, and
workflow schedules.
Control File:
When we run a session that uses an external loader, the Integration Service process creates a
control file and a target flat file.
The control file contains information about the target flat file such as data format and loading
instructions for the external loader.
The control file has an extension of .ctl. The Integration Service process creates the control file
and the target flat file in the Integration Service variable directory, $PMTargetFileDir, by default.
Email:
We can compose and send email messages by creating an Email task in the Workflow Designer or the Task Developer; the Email task can be placed in a workflow or associated with a session.
The Email task allows us to automatically communicate information about a workflow or session run to designated recipients.
Email tasks in the workflow send email depending on the conditional links connected to the
task. For post-session email, we can create two different messages, one to be sent if the session
completes successfully, the other if the session fails.
We can also use variables to generate information about the session name, status, and total
rows loaded.
Indicator File:
If we use a flat file as a target, we can configure the Integration Service to create an indicator file
for target row type information.
For each target row, the indicator file contains a number to indicate whether the row was
marked for insert, update, delete, or reject.
The Integration Service process names this file target_name.ind and stores it in the Integration
Service variable directory, $PMTargetFileDir, by default.
Cache Files:
When the Integration Service process creates memory cache, it also creates cache files. The
Integration Service process creates cache files for the following mapping objects:
o Aggregator transformation
o Joiner transformation
o Rank transformation
o Lookup transformation
o Sorter transformation
o XML target
By default, the DTM creates the index and data files for Aggregator, Rank, Joiner, and Lookup
transformations and XML targets in the directory configured for the $PMCacheDir service
process variable.
• The second workflow will read and process the XML files and populate the staging tables. Details of the workflow are:
o A session will read the XML file names using an indirect file and load the data into the staging tables.
o A command task will move the XML file just processed into an archive folder using a batch script.
o A command task will execute a batch script which will:
Check whether there are any more XML files to be processed.
If yes, it will trigger the first workflow again. This ensures all XML files are processed and loaded into the staging tables.
If no, the process will complete.
Category: Session Properties
Techniques:
o While loading staging tables for FULL loads, the "Truncate target table" option should be checked. Based on the target database and whether a primary key is defined, the Integration Service fires a TRUNCATE or DELETE statement:
Database   Primary Key Defined     No Primary Key
DB2        TRUNCATE                TRUNCATE
INFORMIX   DELETE                  DELETE
ODBC       DELETE                  DELETE
ORACLE     DELETE UNRECOVERABLE    TRUNCATE
MSSQL      DELETE                  TRUNCATE
SYBASE     TRUNCATE                TRUNCATE
o The "Commit interval" property (default value: 10,000) should be increased for volumes of more than 1 million records. The database rollback segment size should also be increased when raising the commit interval.
o Insert/Update/Delete options should be set as determined by the target population method.
The following are generally accepted best practices for Informatica PowerCenter ETL development and, if implemented, can significantly improve overall performance.
1. Quantitative testing
2. Qualitative testing
Steps:
Once the session has succeeded, right-click the session and go to the statistics tab. There you can see how many source rows were applied, how many rows were loaded into the targets, and how many rows were rejected. This is called quantitative testing.
Once the rows are successfully loaded, we go for qualitative testing.
Steps:
1. Take the DATM (the document where all business rules are mapped to the corresponding source columns) and check whether the data has been loaded into the target table according to the DATM. If any data is not loaded according to the DATM, check the code and rectify it.
While using a static cache in a Lookup, we can use all operators (=, <, >, etc.) when giving the condition in the condition tab.
Connected lookup: receives input values directly from the pipeline.
Unconnected lookup: receives input values from the result of a :LKP expression in another transformation.
Stop vs Abort
Stop: If the session you want to stop is part of a batch, you must stop the batch; if the batch is part of a nested batch, stop the outermost batch.
Abort: You can issue the abort command; it is similar to the stop command except that it has a 60-second timeout. If the server cannot finish processing and committing data within 60 seconds, it kills the DTM process and terminates the session.
If you find your box running slower and slower over time, or not having enough memory
to allocate new sessions, then I suggest that ABORT not be used.
So then the question is: When I ask for a STOP, it takes forever. How do I get the session
to stop fast?
Well, first things first: STOP is a REQUEST to stop. It fires a request (equivalent to a Ctrl-C in SQL*Plus) to the source database and waits for the source database to clean up. The bigger the data in the source query, the more time it takes to "roll back" the source query to maintain transaction consistency in the source database (i.e., joins of huge tables, big GROUP BYs, big ORDER BYs).
It then cleans up the buffers in memory by releasing the data (without writing to the
target) but it WILL run the data all the way through to the target buffers, never sending it
to the target DB. The bigger the session memory allocations, the longer it takes to clean
up.
Then it fires a request to stop against the target DB, and waits for the target to roll-back.
The higher the commit point, the more data the target DB has to "roll-back".
If you use abort, be aware that you are choosing to "LOSE" memory on the server on which Informatica is running (except on AIX).
If you use ABORT and you then re-start the session, chances are, not only have you lost
memory - but now you have TWO competing queries on the source system after the same
data, and you've locked out any hope of performance in the source database. You're
competing for resources with a defunct query that's STILL rolling back.
A)
1) Size of data cache = bytes required for variable columns + bytes required for output columns.
B) If you provide data sorted on the group-by ports, aggregation will be faster; so sort the ports used in the Aggregator's group by in an upstream Sorter transformation.
C) We can use incremental aggregation if we expect no change to the data that has already been aggregated.
When we create the target as a flat file and the source as Oracle, how can we specify the first row as column names in the flat file?
Use a pre-SQL statement... but this is a hard-coding method; if you change the column names or put extra columns in the flat file, you will have to change the insert statement.
You can also achieve this by changing the setting in the Informatica Repository Manager to display the column headings. The only disadvantage of this is that it will be applied to all the files generated by this server.
In version 8.2 of Informatica, the session properties include an option called "Header Options". Use that to get the field names as the first row in the target flat file.
When importing a flat file into the Target Designer, a flat file import wizard appears. It has an option, 'Import field names from first line'. Just check this option so the Integration Service treats the first row's values as column names.
Can anyone explain error handling in Informatica with examples, so that it will be easy to explain in an interview?
There is a file called the bad file, which generally has the format *.bad and contains the records rejected by the Informatica server. There are two indicators: one for the type of row and the other for the type of column. The row indicator signifies what operation was going to take place (i.e., insertion, deletion, update, etc.). The column indicators contain information regarding why the column has been rejected (such as violation of a not-null constraint, value error, overflow, etc.). If one rectifies the errors in the data present in the bad file and then reloads the data into the target, the table will contain only valid data.
You can use nested IIF statements to test multiple conditions. The following example tests
for various conditions and returns 0 if sales is zero or negative:
IIF( SALES > 0, IIF( SALES < 50, SALARY1, IIF( SALES < 100, SALARY2, IIF( SALES <
200, SALARY3, BONUS))), 0 )
You can use DECODE instead of IIF in many cases; DECODE may improve readability. The following shows how you can use DECODE instead of IIF.
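A hedged sketch of the equivalent DECODE expression, following the IIF example above (the boundary values are illustrative and assume whole-number sales):
DECODE( TRUE,
        SALES > 0 AND SALES < 50, SALARY1,
        SALES > 49 AND SALES < 100, SALARY2,
        SALES > 99 AND SALES < 200, SALARY3,
        SALES > 199, BONUS,
        0 )
DECODE(TRUE, ...) returns the result associated with the first condition that evaluates to TRUE, and the final 0 acts as the default for zero or negative sales.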
The DECODE function can also be used in an SQL statement, whereas an IIF statement cannot be used within SQL.
Answers
It overrides the default UPDATE statement in the target properties.
Target update override is similar to a Source Qualifier override; it is useful for updating the target without using an Update Strategy transformation.
It also helps when we do not have primary keys defined at the database level and still need to update the target from Informatica: we define keys at the Informatica level and use the update override in the target properties. This way we can update the table.
The Integration Service updates target tables based on key values. However, you can override the default UPDATE statement for each target in a mapping. You might want to update the target based on non-key columns.
When the Integration Service executes SQL against a source, target, or lookup database, it searches the reserved words file stored in the Integration Service installation directory and encloses matching reserved words in quotes. If you use a target update override, you must manually put all reserved words in quotes.
For a mapping without an Update Strategy transformation or a Custom transformation with the update strategy property enabled, configure the session to mark source rows as update. The Target Update option only affects source rows marked as update. The Integration Service processes all rows marked as insert, delete, or reject normally.
When you configure the session, mark source rows as data-driven. The Target Update Override only affects source rows marked as update by the Update Strategy or Custom transformation.
For example, a mapping passes the total sales for each salesperson to the T_SALES table. The Designer generates a default UPDATE statement for the target.
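A sketch of what the generated default statement typically looks like for this example (the column names are illustrative):
UPDATE T_SALES
SET EMP_NAME = :TU.EMP_NAME,
    DATE_SHIPPED = :TU.DATE_SHIPPED,
    TOTAL_SALES = :TU.TOTAL_SALES
WHERE EMP_ID = :TU.EMP_ID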
Because the target ports must match the target column names, the update statement includes the keyword :TU to specify the ports in the target transformation.
If you modify the UPDATE portion of the statement, be sure to use :TU to specify ports.
For example, you might want to update records for employees named Mike Smith only. To do this, you edit the WHERE clause.
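A sketch of the edited override for that case (again, the column names and the literal value are illustrative):
UPDATE T_SALES
SET DATE_SHIPPED = :TU.DATE_SHIPPED,
    TOTAL_SALES = :TU.TOTAL_SALES
WHERE :TU.EMP_NAME = EMP_NAME AND EMP_NAME = 'MIKE SMITH'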
Target update (by using an Update Strategy) will work only when we have a primary key in the target; the update happens based on this key. If we do not have a primary key, the update will not work.
To overcome this, if our target has no key column, or if we want to update the target based on a column other than the primary key column, we need to go for the Target Update Override option available in the target properties.
To configure a Joiner transformation to emulate a Lookup transformation in a PowerCenter mapping, do the following:
1. Create a Source Definition based on the database table used for the lookup.
2. Add this Source Definition to the mapping.
3. Connect this Source Definition to the Joiner transformation as the master source.
4. Connect the original source table to the Joiner as the detail source.
5. Change the Join Type to Master Outer Join.
6. Configure the join condition.
The join condition must use the same ports as the lookup condition.
Limitations
A Joiner can be used to emulate the lookup, but with certain limitations:
A Lookup always returns one value when it can find data, and you can choose to get the first or the last value using the lookup property Lookup Policy on Multiple Match. However, as the Joiner does not have that property, you will have to ensure that all key values occur only once; if this is not ensured, you will be allowing duplicates of your input data to flow into the mapping. One way to enforce that uniqueness is sketched below.
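As a sketch (the table and column names are hypothetical), the master Source Qualifier SQL override could aggregate so that each key appears only once:
SELECT CUST_ID,
       MAX(CUST_NAME) AS CUST_NAME,
       MAX(CUST_SEGMENT) AS CUST_SEGMENT
FROM   CUSTOMER_DIM
GROUP BY CUST_ID
MAX() simply picks a single deterministic value per key, roughly mimicking the single-row behaviour of the lookup policy.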
Scenario
Very often you might have to compare incoming data to previously loaded data in the warehouse. Often the target table is loaded into a lookup, comparisons are made, and actions are taken. This is not always the most efficient way: searching for a value in a cache with millions of rows takes longer than searching in a cache with a few hundred rows, which can result in poor performance. You might want to keep the caches small; however, a data warehouse only gets bigger and performance goes down. The input data often has roughly the same size for each delivery, as deltas are usually created by some smart mechanism. So how would you cache your input data instead of the target data and make comparisons that way? How can you make the cache smaller and yet increase performance?
You can put your input data in a lookup cache and read all rows from the warehouse table as a source, then make comparisons and act. However, this cannot detect new data, because new data will never be looked up: it does not exist in the warehouse table.
One way to address this issue is to use the Joiner transformation. The Lookup transformation is a specialized Joiner transformation; when you use the outer join option, the Joiner can operate much like a Lookup, and if you use the sorted input option, a very small cache is created.
SCD 123
http://www.dwforum.net/data_warehousing_concepts/scd_12_129.0.html
The SCD2 concept is that if a source record finds a match in the target, it is still inserted into the target as a new record. So with SCD2 there will be multiple records in the target for the same business key; i.e., for a single Empno there can be more than one record, which is how history is maintained.
The points are as below:
1) When the matched record is inserted into the target dimension table, it keeps the source key value (say, Empno), but the target table's primary key is the surrogate key, which is not built on the employee key. That is, in the target table the employee number may be repeated, but the surrogate key will be unique.
2) The Lookup transformation condition is always Src.empno = Tgt.empno. When a single source record matches more than one record in the target, you can choose First, Last, or Any if you use a static Lookup, so a static Lookup is easier. In the case of a dynamic Lookup, if a single source record finds more than one corresponding record in the target, the session will fail. So you write an SQL override in the dynamic Lookup that extracts only one matching record (the last record) from the target; then it will work, otherwise not.
3) In a dynamic Lookup, if NewLookupRow is 0 you can drop the row; if it is 2, the row will be inserted, but use a new Sequence Generator with a start value somewhat higher than the previous Sequence Generator's last value so that the two do not conflict.
4) Override the SQL query in the dynamic Lookup so that in every run, for each source record that finds a match in the target, it always returns the last matched record.
Then it works fine. But the better solution is a static Lookup for SCD2.
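A hedged sketch of the kind of lookup SQL override points 2 and 4 describe (the table, column, and date names are hypothetical), returning only the most recent target record per employee:
SELECT A.EMPNO, A.ENAME, A.SAL, A.EFF_DATE
FROM   EMP_DIM A
WHERE  A.EFF_DATE = (SELECT MAX(B.EFF_DATE)
                     FROM   EMP_DIM B
                     WHERE  B.EMPNO = A.EMPNO)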
SQL Override
Target update override (done in the Designer on the target instance; it can also be viewed in the session's target options):
Update override is an option available on the TARGET instance.
===============================================
Update override with a non-primary-key field in the WHERE condition:
Example: target table Target1 has A1, B1 defined as the primary key (local to Informatica, in the Designer).
UPDATE OVERRIDE SQL ON TARGET TABLE:
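A hedged sketch of such an override, where C1 is a hypothetical non-key column used in the WHERE clause and D1 a hypothetical data column being updated:
UPDATE Target1
SET A1 = :TU.A1,
    B1 = :TU.B1,
    D1 = :TU.D1
WHERE C1 = :TU.C1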