Best Practices
Work in Progress
Version 1.0
1 General Overview
Introduction
2 General Best Practices
Data handling with files
Intermediate Data handling
Exception Handling Guidelines
General Error Handling
Data Handling
Restartability and Recovery
3 ETL Tool Ab Initio
4 General Guidelines for Development of a Graph
PART - I
Handling disparate data sources is a problem faced across the IT industry. Dealing
with such complications, often identified in the initial stages of a project as red threats,
has become inevitable.
This document states useful principles, guidelines and best practices for the
successful implementation of a project, resulting in efficient, maintainable and robust
systems for ETL and EAI architecture.
Having a common development standard ultimately leads to greater consistency
in design and coding patterns, introduces best practices and consistent coding and
naming standards, reduces ongoing maintenance costs, improves the readability of
software so that developers can understand new code more quickly, and integrates
controls into the data movement process to ensure data quality and integrity.
This document can be used to identify the points where project-specific standards
need to be developed. It does not provide naming conventions, but it indicates where
such conventions should fit in.
The document is divided into two parts. Part I discusses general best practices and
Part II discusses tool-specific best practices, focusing on the ETL and EAI tool Ab Initio.
During the design of the system, data file transmission software and standards
should be decided for all internal and external file transmissions.
The file retention and purging strategy must be agreed upon before the load
process is designed.
Native database connections should be used for relational databases such as Oracle
or UDB; they are much faster than third-party drivers.
The parallelism strategy and the level of parallelism must be decided at design
time.
Any source files or intermediate files that are being used or created should be
compressed after the process is done. The strategy must be planned during the
design phase.
Very large data files should be split into multiple files and loaded in parallel.
During the design process the level of parallelism must be decided.
If data is taken from disparate sources, the character set must be converted to the
native operating system character set before the source is processed. The place for
this conversion needs to be identified during the design phase.
Rejected records and records that cannot be processed must be captured in a
separate file and stored in a separate area. The process for dealing with these files
must be decided during design.
At the end of the transaction process, a check process must run to ensure that the
number of records produced equals the number of records consumed (a minimal
reconciliation sketch follows this list).
Use the checkpoint feature of the ETL tool to enable restart-ability of the graph.
Use the parallelism feature of the ETL tool as much as possible to attain
maximum throughput.
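To make the reconciliation check concrete, the following is a minimal sketch in Python. The file names and the newline-delimited record format are assumptions for illustration; the same check could equally be driven off the ETL tool's own record counts.

```python
# Minimal produced-vs-consumed reconciliation sketch (illustrative file names).
def count_records(path: str) -> int:
    """Count newline-delimited records in a flat file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

def reconcile(source_path: str, target_path: str, reject_path: str) -> None:
    consumed = count_records(source_path)
    produced = count_records(target_path)
    rejected = count_records(reject_path)
    # Every consumed record must land either in the target or in the reject file.
    if consumed != produced + rejected:
        raise RuntimeError(
            f"Reconciliation failed: consumed={consumed}, "
            f"produced={produced}, rejected={rejected}"
        )

if __name__ == "__main__":
    reconcile("source.dat", "target.dat", "rejects.dat")
```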
Transaction processing is the most crucial part of the ETL or EAI
architecture. Apart from implementing the business rules, most project designs fail to
address some important data quality issues.
NULL values, blanks and other invalid values must be given a default value if the
correct value cannot be derived from any other available data field; otherwise NULL
processing may result in data errors and unnecessary validations in the source code.
NULL values, blanks and invalid values should preferably be replaced at the starting
points of data processing, and the detailed design documents must specify the default
values (a cleansing sketch follows this list).
Incoming static values have to be translated in accordance with the reference data
sources.
Trailing blanks, control characters and escape sequences must be removed from
character fields.
It is also necessary to store the source data, so bad records could be traced back if
needed.
The data that is being processed and being published must be in a standard
format.
Data elements that represent various code values should be brought to an agreed
upon uniform value base. For example, codes representing data indicators should be
represented uniformly as Y/N, 0/1 or T/F.
Character fields must be consistently in either upper case or lower case only.
De-duplication (DEDUP) of records is mandatory.
Details of the task performed by each object should be entered in the comments or
description box of every object in the graph; this serves as in-line documentation for
the code.
Any process that throws an error must stop immediately. “Abort on First
Reject” must be used throughout the graph.
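A minimal sketch of these cleansing rules in Python follows. The field names, default value and indicator mapping are illustrative assumptions; in Ab Initio the equivalent logic would normally live in a Reformat transform.

```python
# Illustrative cleansing rules: default NULLs/blanks, trim, normalize case,
# and map indicator codes to an agreed uniform value base.
DEFAULT_NAME = "UNKNOWN"          # assumed default from the design document
INDICATOR_MAP = {"Y": "Y", "YES": "Y", "1": "Y", "T": "Y",
                 "N": "N", "NO": "N", "0": "N", "F": "N"}

def cleanse(record: dict) -> dict:
    name = (record.get("customer_name") or "").strip()
    return {
        # Default NULLs/blanks instead of letting them flow downstream.
        "customer_name": name.upper() if name else DEFAULT_NAME,
        # Bring indicator codes to a single uniform representation.
        "active_ind": INDICATOR_MAP.get(
            str(record.get("active_ind", "")).strip().upper(), "N"),
    }

print(cleanse({"customer_name": "  smith ", "active_ind": "yes"}))
# -> {'customer_name': 'SMITH', 'active_ind': 'Y'}
```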
Intermediate Data handling
ETL or EAI processing can consume large amounts of temporary disk space.
As such, capacity planners must include temporary disk space in any overall capacity-
planning estimate. Accurate space estimates will prevent ETL processes from failing due
to temporary disk shortages (a back-of-the-envelope sizing sketch follows the list below).
In ETL and EAI systems, temporary space is typically used in the following
places:
Disk space for sorting
ETL tool space for lookup cache files or joins.
Space for performing aggregations.
Intermediate steps in graphs.
Checkpoints and phasing in graphs.
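The following back-of-the-envelope sketch illustrates how such an estimate might be made. The record volume, record width, sort factor and number of checkpointed phases are all illustrative assumptions to be replaced with figures from the design documents; they are not Ab Initio-published sizing rules.

```python
# Rough scratch-space estimate for a sorted, checkpointed flow (all figures assumed).
records = 50_000_000          # expected record volume
avg_record_bytes = 400        # average record width
sort_factor = 2.0             # sorts typically need input-sized spill plus merge space
checkpoint_copies = 2         # phases/checkpoints that land intermediate data

data_gb = records * avg_record_bytes / 1024**3
temp_gb = data_gb * sort_factor + data_gb * checkpoint_copies
print(f"data ~{data_gb:.1f} GB, plan ~{temp_gb:.1f} GB of temporary space")
```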
Non-fatal errors will not cause the process to stop; it will continue
processing data and run to completion.
Data Handling
Invalid data handling deals with data exceptions: non-fatal errors that result from bad
data. They are caused by invalid data, NULL values, invalid data types, corrupt data,
inconsistent data or violations of business rules.
Where such exceptions cannot be corrected in-line, a standard exception handling
procedure must be used to process the rejected records, and that procedure must be
consistent throughout. Providing standard exception routines speeds up development
because the components can be reused (a minimal reject-handler sketch follows).
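As an illustration only, a minimal reusable reject handler in Python is sketched below. The file paths, JSON reject format and the emp_id validation rule are assumptions, not part of any standard.

```python
# Minimal sketch of a reusable reject handler: every bad record is written to a
# reject area together with the reason, so it can be traced back to the source.
import datetime
import json

def write_reject(reject_path: str, record: dict, reason: str) -> None:
    entry = {
        "rejected_at": datetime.datetime.now().isoformat(),
        "reason": reason,
        "record": record,
    }
    with open(reject_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def process(record: dict, reject_path: str):
    # Non-fatal data exception: capture the record and its reason, keep running.
    if record.get("emp_id") is None:
        write_reject(reject_path, record, "NULL emp_id")
        return None
    return record
```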
The application being developed must adhere to common development principles,
in which restartability and recovery play an important part.
Following are some of the important points:
Every job should be restartable and recoverable.
Jobs must not redundantly process data.
Logs must be generated at run time so that the job can be restarted from the
point of the latest failure.
Data must be landed to disk at resource-intensive points for better
performance and recoverability.
If a restart is not possible, the process must at least be able to use the log files
to skip records that have already been committed (a minimal sketch follows this list).
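The following is a minimal sketch of skipping already-committed records on restart, driven by a run log that stores the last committed record number. The file paths and the line-numbered commit scheme are illustrative assumptions.

```python
import os

LOG_PATH = "run_state.log"   # assumed run log holding the last committed record number

def last_committed() -> int:
    if not os.path.exists(LOG_PATH):
        return 0
    with open(LOG_PATH) as f:
        return int(f.read().strip() or 0)

def commit(record_no: int) -> None:
    with open(LOG_PATH, "w") as f:
        f.write(str(record_no))

def load(source_path: str, batch_size: int = 1000) -> None:
    start = last_committed()
    processed = start
    with open(source_path) as src:
        for i, line in enumerate(src, start=1):
            if i <= start:
                continue              # already committed in a previous run: skip
            # ... load the record into the target here ...
            processed = i
            if processed % batch_size == 0:
                commit(processed)     # checkpoint after each committed batch
    commit(processed)
```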
PART - II
Co>Operating System: the data processing engine and the foundation for all Ab
Initio applications, providing distributed and parallel execution, platform-
independent data transport, checkpointing and process monitoring.
MDG and EME are optional products for any development. There are many more
Ab Initio products that are not covered in this document.
Real-time integration is a new requirement for the data warehouse environment. It
keeps the analytical environment up to date by integrating operational transaction
data directly into the ETL processes. Ab Initio provides connectors to real-time message
queue systems such as MQSeries and JMS, connectors to Internet protocols such as
HTTP and TCP to provide subscription/publication services,
and built-in continuous flow components to provide real-time processing.
Message-based middleware brings additional benefits such as guaranteed
asynchronous transactions with ACID properties, seamless integration of heterogeneous
systems, and unified access points and mechanisms.
4 General Guidelines for Development of a Graph
Following are the general development guidelines.
All graphs should contain clear labels that can be used to comment the
graphs.
o One label should specify the title of the graph.
o One label should specify the detailed descriptions of the functionality
of the graph
o One label should specify the creation dates and author of the graph and
revision history.
o Use these labels anywhere necessary to explain the process
Graph naming standards must be established. For example, the naming
standard of the graph could be
<project_name>.<module_name>.<process_name>.<short_process
_description>.mp
Example: CICS.premium_processing.cleanse_records.mp
The graph name may not exceed 30 characters and must always be in lower
case (a minimal name-check sketch follows the labeling rules below).
All transformations, source and target file URLs, layouts and DML URLs must
be passed as parameters.
The DMLs in the graphs must be passed as URLs.
The graph should not refer to the same DML twice by specifying its path; the
DML should always be propagated.
Example:
Sort- Sorting for dedup – Keys: EMP_ID – Sort on Ascending Order
The data file names will be used to create the labels using the following rules:
The label name will be lower case.
The complete file name in lower case will be provided as the “Component
Information”.
If the file name includes a datestamp such as “02012004”, the actual date will
be replaced with “.new”.
If the file is attached to any port other than “out” or “in”, the label will end
with the name of the port preceded by a dot.
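As an illustration only, a minimal Python check of the graph naming convention described above is sketched below. The regular expression, the helper name and the example graph name are assumptions, not part of any Ab Initio standard.

```python
import re

# Illustrative check of <project>.<module>.<process>.<short_description>.mp,
# all lower case and at most 30 characters; the pattern itself is an assumption.
GRAPH_NAME_RE = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+){3}\.mp$")

def is_valid_graph_name(name: str) -> bool:
    return bool(GRAPH_NAME_RE.match(name)) and len(name) <= 30

print(is_valid_graph_name("claims.load.cleanse.dedup.mp"))   # hypothetical name -> True
```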
Lookup File
Lookup files are similar to lookup tables. Every lookup file has a key and the
lookup will be performed using the key.
Please note: Lookup files are read into memory and can be used to improve the
performance of a graph. At the same time, they should not be used
indiscriminately, since they stay in memory for the duration of the execution. As a
general rule, the total size of all lookup files in a graph should not be more than
300MB.
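For illustration, the sketch below shows in plain Python (not Ab Initio DML) why lookup files trade memory for speed: the keyed file is read into memory once, and every record of the main stream is then enriched with a constant-time lookup. The file name and field names are assumptions.

```python
import csv

# A lookup file behaves like a keyed table held in memory: one pass builds the
# index, then lookups are constant-time while the main stream is processed.
def load_lookup(path: str, key_field: str) -> dict:
    with open(path, newline="") as f:
        return {row[key_field]: row for row in csv.DictReader(f)}

country_by_code = load_lookup("country_codes.csv", "code")   # stays in memory

def enrich(record: dict) -> dict:
    match = country_by_code.get(record["country_code"], {})
    record["country_name"] = match.get("name", "UNKNOWN")
    return record
```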
Filter by Expression
This component is used to select records from a data stream by specifying the
selection criteria as an expression.
Reformat
This is the basic transformation component used to manipulate a data stream in
Ab Initio. A reformat can have one or more outputs.
Join
The Join component is used to match records between two or more data streams
utilizing a specified key. A Join belongs to one of the following types: inner, full
outer, and explicit.
Please note: A join can either use sorted data or perform in memory joins. The
following rules should be applied to determine the type:
If the data coming in is sorted, use a sorted join.
If all the data files (except one) are smaller than 300MB in total size, use
the in-memory join with the largest file specified as the driver, else use the
sorted join.
As a best practice, do not use the “select” parameter; perform filters utilizing the
Filter By Expression component. Be advised: if the total size of the intermediate
results the Join holds in memory exceeds the number of bytes specified in
the max-core parameter, temporary files are written to disk. If max-core is set
too low, the component executes more slowly than expected. Conversely, if
max-core is set too high, the Join may utilize too many machine resources and slow
performance. (A conceptual sketch of the two join strategies follows.)
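For illustration only, the conceptual Python sketch below contrasts the two strategies named above; it is not Ab Initio code, and the record shapes are assumptions. The in-memory join is fast when the non-driver inputs fit in memory, while the sorted join keeps memory use flat at the cost of requiring sorted inputs.

```python
# In-memory join: index the smaller input in memory, stream the larger (driver) input.
def in_memory_join(driver, small, key):
    index = {}
    for rec in small:
        index.setdefault(rec[key], []).append(rec)   # small side held in memory
    for rec in driver:
        for match in index.get(rec[key], []):
            yield {**match, **rec}

# Sorted (merge) join: both inputs already sorted on the key; constant memory,
# simplified here to one match per key for brevity.
def sorted_join(left, right, key):
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[key] < r[key]:
            l = next(left, None)
        elif l[key] > r[key]:
            r = next(right, None)
        else:
            yield {**l, **r}
            l, r = next(left, None), next(right, None)
```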
Rollup
The Rollup component is used to summarize a data stream using a specified key
in Ab Initio.
Please note: A Rollup can either use sorted data or perform in-memory aggregation. The
following rules should be applied to determine the type:
If the data coming in is sorted, use a sorted rollup.
If the data file is smaller than 250MB, use the in-memory rollup, else use the
sorted rollup.
Similar to the Join component, if the total size of the intermediate results the
Rollup holds in memory exceeds the number of bytes specified in the max-core
parameter, temporary files are written to disk. If max-core is set too low, the
component executes more slowly than expected. Conversely, if max-core is
set too high, the Rollup may utilize too many machine resources and slow
performance. (A conceptual sketch of the two rollup modes follows.)
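A conceptual Python sketch of the two rollup modes is given below; it is not Ab Initio code, and the key field and sum aggregation are assumed examples.

```python
from collections import defaultdict
from itertools import groupby

# In-memory rollup: the whole keyed state is held in memory until end of input.
def in_memory_rollup(records, key, amount_field):
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[amount_field]
    return dict(totals)

# Sorted rollup: input already sorted on the key, so each group is summarized
# and emitted as soon as the key changes, using constant memory.
def sorted_rollup(sorted_records, key, amount_field):
    for k, group in groupby(sorted_records, lambda r: r[key]):
        yield {key: k, "total": sum(r[amount_field] for r in group)}
```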
Aggregate
The Aggregate component is used to summarize groups of data records by
specified keys.
Normalize
The Normalize component is used to generate multiple data records from each
input data record.
Please note: As a best practice do not use the ‘select’ parameter. Perform filters
utilizing the Filter By Expression component.
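A minimal Python sketch of normalization, i.e. generating several output records from one input record, is shown below for illustration; the field names and the repeating phone-number group are assumptions, not Ab Initio code.

```python
# One input record with a repeating group of phone numbers is normalized
# into one output record per phone number.
def normalize(record: dict):
    for phone in record.get("phones", []):
        yield {"customer_id": record["customer_id"], "phone": phone}

for out in normalize({"customer_id": 42, "phones": ["555-0100", "555-0101"]}):
    print(out)
```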
Sort
The sort component is used to sort a data stream using a specified key in Ab
Initio.
Please note: As a general principle, the max core setting for sorts should be
150MB per CPU. Please be prudent in setting the value if there are many sorts in
the same phase of the graph.
Dedup Sorted
The Dedup Sorted component is used to remove duplicates from a sorted data
stream.
Please note: As a best practice, the dup port should be captured. If the duplicates
are not processed separately they should be written to a file.
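For illustration, a minimal Python sketch of dedup over a sorted stream is given below; keeping the first record per key and routing the rest to a separate "dup" stream are assumed choices, not Ab Initio code.

```python
# Input must already be sorted on the key; the first record of each key is
# kept, all others are routed to the duplicate stream for separate handling.
def dedup_sorted(sorted_records, key):
    kept, dups = [], []
    previous = object()                 # sentinel that matches no key
    for rec in sorted_records:
        if rec[key] != previous:
            kept.append(rec)
            previous = rec[key]
        else:
            dups.append(rec)
    return kept, dups
```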
Input Table/Output Table/Update DB Table
Ab Initio has a suite of database components for performing database operations.
These include Input Table, Output Table, and Update DB Table components,
which are used to unload, load, or update/insert a table respectively.
Please note: Always land the data to a file before performing any database
operation, as this aids restartability. All log, error, and reject ports must be
connected to a file if the reject threshold is greater than zero (0). For Update DB
Table components, be advised that SQL statements must not end with a “;”.
The naming standard for the Partition and De-partition components are as
follows:
The general component naming standards apply.
Wherever possible, the name of the data stream should also be included as
part of the “Component Information”.
The key used to perform the sort must be specified as the “Component
Information”. The key must be prefaced with the word “by.”
If an expression is used, it is written out and used as the “Component
Information”.
Run Program
The Run Program component is used as a wrapper for non-Ab Initio executables
in Ab Initio.
Please note: The layout of the component must exist within the run directory.
Run SQL
The Run SQL component is used as a wrapper to SQL scripts in Ab Initio.
Please note: The layout of the component must exist within the run directory.