Best Practices


ETL Standards and Guidelines

Work in Progress

Version 1.0
1 General Overview
Introduction
2 General Best Practices
Data handling with files
Intermediate Data handling
Exception Handling Guidelines
General Error Handling
Data Handling
Restartability and Recovery
3 ETL Tool Ab Initio
4 General Guidelines for Development of a Graph
PART - I

General Best Practices in ETL


1 General Overview
Introduction

Handling data from disparate sources is a problem commonly faced by the IT industry.
Dealing with such complications, often identified as red (high-severity) risks in the
initial stages of a project, has become inevitable.
This document states useful principles, guidelines, and best practices for the
successful implementation of a project, resulting in efficient, maintainable, and
robust ETL and EAI systems.
Having a common development standard will ultimately lead to greater
consistency in design and coding patterns, introduce best practices and consistency in
coding and naming standards, reduce ongoing maintenance costs, and improve the
readability of software, making it easier for developers to understand new code quickly,
and it will integrate controls into the data movement process to ensure data quality
and integrity.
This document can be used to identify the points where project-specific standards
need to be developed. It does not provide naming conventions, but it indicates where
such conventions should fit in.
The document is divided into two parts. The first part discusses general best
practices, and the second part discusses tool-specific best practices, focusing on the
ETL and EAI tool Ab Initio.

2 General Best Practices


Data handling with files

 During the design of the system, data file transmission software and standards
should be decided for all internal and external file transmissions.
 The file retention and purging strategy must be agreed upon before the load
process is designed.
 Native database connections should be used for relational databases such as Oracle or
UDB; they are much faster than third-party drivers.
 The parallelism strategy and the level of parallelism must be decided at design
time.
 Any source files or intermediate files that are being used or created should be
compressed after the process is done. The strategy must be planned during the
design phase.
 Very large data files should be split into multiple files and loaded in parallel.
During the design process the level of parallelism must be decided.
 If data is taken from disparate sources, the character set must be converted to
the native operating system character set before the source is processed. The place
for this conversion needs to be identified during the design phase.
 Rejected records and records that cannot be processed must be captured in a
separate file and stored in a separate area. The process for dealing with these files
must be decided during the design phase.
 At the end of the transaction process, a check process must run to ensure that the
number of records produced equals the number of records consumed (see the sketch
after this list).
 Use the checkpoint feature of the ETL tool to enable restart-ability of the graph.
 Use the parallelism feature of the ETL tool as much as possible to attain
maximum throughput.
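
A minimal sketch of such a record count check, written here in Python purely for
illustration (the file names are assumptions, and the check could equally be built with
the ETL tool itself):

    # End-of-process record count reconciliation (illustrative sketch).
    # Assumes newline-delimited record files; paths are examples only.
    import sys

    def count_records(path):
        """Count newline-delimited records in a file."""
        with open(path, "rb") as f:
            return sum(1 for _ in f)

    def reconcile(source_path, target_path, reject_path):
        consumed = count_records(source_path)
        produced = count_records(target_path)
        rejected = count_records(reject_path)
        # Every record consumed must be accounted for as produced or rejected.
        if consumed != produced + rejected:
            print("RECONCILIATION FAILED: consumed=%d produced=%d rejected=%d"
                  % (consumed, produced, rejected), file=sys.stderr)
            sys.exit(1)
        print("Reconciliation OK: %d records accounted for." % consumed)

    if __name__ == "__main__":
        reconcile("input/source.dat", "output/target.dat", "reject/target.rej")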

Data in Transaction Processing:

Transaction processing is the most crucial part of an ETL or EAI
architecture. Apart from the implementation of business rules, most project designs
fail to address some important data quality issues.

 NULL values, blanks, and other invalid values must have a default
value if a valid value cannot be derived from another available data field. NULL
processing may otherwise result in data errors and unnecessary validations in the source
code. NULL values, blanks, and invalid values should preferably be replaced at the
starting points of data processing. The detailed design documents must specify the
default values (see the sketch after this list).
 Incoming static values have to be translated in accordance with the reference data
sources.
 Trailing blanks, Control Characters, Escape sequences must be removed from
character fields.
 It is also necessary to store the source data so that bad records can be traced back
if needed.
 The data that is being processed and being published must be in a standard
format.
 Data elements that represent various code values should be brought to an agreed-upon
uniform value base. For example, codes representing indicators should be represented
uniformly as Y/N, 0/1, or T/F.
 Character fields must be consistently in either upper or lower case.
 Deduplication (DEDUP) of records is mandatory.
 Details of the task being performed by the object should be entered in the
comments or description box of each and every object in the graph. This would
serve as in-line documentation for the code.
 Any process that throws an error must stop immediately. “Abort on First
Reject” must be used throughout the graph.
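
As an illustration of the rules above (not a prescribed implementation), the following
Python sketch applies default values for NULLs and blanks, strips trailing blanks and
control characters, normalizes case and indicator codes, and removes duplicates. The
field names and default values are assumptions; in a real project they come from the
detailed design documents:

    # Illustrative record-cleansing sketch. Field names and defaults are
    # assumptions for the example only.
    import re

    DEFAULTS = {"customer_name": "UNKNOWN", "status_ind": "N"}   # from design docs
    CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")

    def cleanse_field(name, value):
        value = CONTROL_CHARS.sub("", value).rstrip()    # control chars, trailing blanks
        if value == "" or value.upper() == "NULL":
            value = DEFAULTS.get(name, "")                # default for NULL/blank
        if name.endswith("_ind"):
            value = "Y" if value.upper() in ("Y", "1", "T") else "N"  # uniform indicators
        return value.upper()                              # one consistent case

    def cleanse(records):
        seen = set()
        for rec in records:
            cleaned = {k: cleanse_field(k, v) for k, v in rec.items()}
            key = tuple(sorted(cleaned.items()))
            if key not in seen:                           # mandatory dedup
                seen.add(key)
                yield cleaned
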
Intermediate Data handling
ETL or EAI processing can consume large amounts of temporary disk space.
As such, capacity planners must include temporary disk space in any overall capacity-
planning estimate. Accurate space estimates will prevent ETL processes from failing due
to temporary disk shortages.
In EAI systems, the following are the places where temporary space is
used:
 Disk space for sorting
 ETL tool space for lookup cache files or joins.
 Space for performing aggregations.
 Intermediate steps in graphs.
 Checkpoints and phasing in graphs.

The following are the guidelines for handling temporary files:


 Avoid sorts or other operations that spill to the hard disk while data is being
written to disk.
 Keep the temporary space isolated from the permanent disk space, either as a
separate logical space or on separate physical devices. This allows better control over
temporary files. The same rule applies to the database.
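
A simple pre-flight check of the dedicated temporary filesystem can catch shortages
before a run starts. The sketch below is illustrative only; the path and the
required-space estimate are assumptions that would come from capacity planning:

    # Pre-flight check: verify the isolated temporary filesystem has enough
    # free space before the ETL process starts.
    import shutil
    import sys

    TEMP_DIR = "/etl/temp"              # isolated from permanent data space
    REQUIRED_BYTES = 50 * 1024**3       # estimate for sorts, joins, aggregations

    usage = shutil.disk_usage(TEMP_DIR)
    if usage.free < REQUIRED_BYTES:
        print("Insufficient temp space on %s: %.1f GB free, %.1f GB required"
              % (TEMP_DIR, usage.free / 1024**3, REQUIRED_BYTES / 1024**3),
              file=sys.stderr)
        sys.exit(1)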

Exception Handling Guidelines


ETL processes are designed to run to completion successfully, but a process may
encounter various conditions that cause either a fatal or a non-fatal error. A
fatal error will cause the process to abort, while a process can continue after a non-fatal
error. Error handling should be addressed as part of the business requirements. ETL
developers must have a clear understanding of the types of conditions that are acceptable
and unacceptable.

General Error Handling


Fatal and Non-Fatal Errors:
A fatal error is the result of conditions such as a lack of disk space or a
power failure, i.e., factors that are not related to the business logic.
A non-fatal error is the result of business conditions or validation
failures.
When a fatal error is encountered, the process will stop. Often this
will happen after inserts or updates have already taken place. The process
must be designed to be restartable (see the section Restartability and
Recovery) after the error is corrected. The steps necessary when
encountering a fatal error are:
 Log the error.
 Notify the process owner.
 Analyze the cause of the error.
 Correct the error condition.
 Restart the process from the point of failure.

Non-fatal errors will not cause the process to stop; it will continue
processing data and run to completion. The steps necessary when encountering a
non-fatal error are:

 Write the record to the error log.
 Log the error and the error code.
 At the end of the process, inform the process owner.
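
To make the distinction concrete, the following Python sketch shows one way a driver
process could treat fatal and non-fatal conditions differently. The error classes,
logger name, and notification call are illustrative assumptions, not part of any
particular tool:

    # Sketch of a driver loop distinguishing fatal and non-fatal errors.
    import logging

    log = logging.getLogger("etl")

    class FatalError(Exception):        # e.g. out of space, power/connection loss
        pass

    class NonFatalError(Exception):     # e.g. validation or business-rule failure
        pass

    def notify_owner(message):
        log.warning("NOTIFY PROCESS OWNER: %s", message)   # stand-in for mail/paging

    def run(records, process_one):
        errors = 0
        for rec in records:
            try:
                process_one(rec)
            except NonFatalError as exc:
                errors += 1
                log.error("non-fatal error %s on record %r", exc, rec)  # log, continue
            except FatalError as exc:
                log.critical("fatal error: %s", exc)    # log the error
                notify_owner(str(exc))                  # notify the process owner
                raise                                   # stop; restart after correction
        if errors:
            notify_owner("%d non-fatal errors at end of process" % errors)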

Data Handling

Invalid data handling, or data exceptions, covers non-fatal errors that result from bad
data. They are caused by invalid data, NULL values, invalid data types, corrupt
data, inconsistent data, or violations of business rules.

When data exceptions are found, processing should be in place to:


 Reject the offending data rows.
 Load the rejected records into a flat file with appropriate identification and
information regarding the time and date of the rejection.
 In a real-time environment, collect the rejected records and process them at some
point in time.
 Limit how many rejected rows are allowed in the current process.
 Provide a means to reprocess the rejected rows if necessary.

Where the above is not feasible, a standard exception handling procedure must be used
to process the rejected records, and that procedure must be consistent throughout.
Providing standard exception routines speeds up development, because the components
can be reused.
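
A minimal sketch of such a standard exception routine, assuming delimited flat files
and an illustrative reject limit:

    # Standard reject-handling sketch: rejected rows go to a flat file with a
    # timestamp and reason, and the run aborts if the reject limit is exceeded.
    # File name and limit are assumptions.
    from datetime import datetime

    class RejectWriter:
        def __init__(self, path, limit=1000):
            self.path, self.limit, self.count = path, limit, 0

        def reject(self, record, reason):
            self.count += 1
            with open(self.path, "a", encoding="utf-8") as f:
                f.write("%s|%s|%s\n" % (datetime.now().isoformat(), reason, record))
            if self.count > self.limit:
                raise RuntimeError("reject limit %d exceeded" % self.limit)

Because every rejected row is written with a timestamp and a reason, the same flat file
can be read back later to reprocess the rejects if necessary.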

Restartability and Recovery

The application being developed must adhere to common development principles,
in which restartability and recovery occupy an important part.
The following are some of the important points:
 Every job should be restartable and recoverable.
 Jobs must not redundantly process data.
 Logs must be generated at run time to ensure that the job can be restarted from
the point of the latest crash.
 Data should be landed to disk before resource-intensive steps, for better
recoverability and performance.
 If a restart is not possible, the process must, with the help of the log files, at least
be able to skip the already committed records (see the sketch after this list).
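
One simple pattern, sketched below with an assumed one-key-per-line log file, is to
record each committed unit of work so that a rerun skips what has already been
processed:

    # Restartability sketch: committed record keys are appended to a run log;
    # on restart, already-committed keys are skipped. The log path is illustrative.
    import os

    LOG_PATH = "logs/committed_keys.log"

    def load_committed():
        if not os.path.exists(LOG_PATH):
            return set()
        with open(LOG_PATH, encoding="utf-8") as f:
            return {line.strip() for line in f}

    def run(records, key_of, process_one):
        committed = load_committed()
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            for rec in records:
                key = key_of(rec)
                if key in committed:        # skip work already committed
                    continue
                process_one(rec)            # perform the unit of work
                log.write(key + "\n")       # record the commit
                log.flush()
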
PART - II

Best Practices and Naming Standards in Ab Initio


3 ETL Tool Ab Initio
Ab Initio’s core products are:

 Co>Operating System: data processing engine and the foundation for all Ab
Initio applications, allowing distributed and parallel execution, platform
independent data transport, checkpointing and process monitoring.

 Component Library: reusable metadata-driven building blocks providing key
functionality, such as sorting, data transformation, and high-speed database
loading and unloading.

 Graphic Development Environment (GDE): ‘drag and drop’ dataflow
development environment, where components are configured with point-and-click
operations and connected into executable flowcharts.

 Enterprise Meta Environment (EME): metadata management facility that
provides storage for all aspects of a data processing system, from design
information to operations data. The EME stores the application's data, including
data formats and business rules, enabling impact/dependency analysis, versioning
and audit tracking.

 Mainframe Data Gateway (MDG): manages data transport and executes
some processes on mainframe systems.

MDG and EME are optional products for any development. There are many more
Ab Initio products that are not discussed in this document.
Real-time integration is a new requirement for the data warehouse environment. It
keeps the analytical environment up to date by integrating operational transaction
data directly into ETL processes. Ab Initio provides connectors to real-time message
queue systems such as MQSeries and JMS, as well as connectors to Internet protocols
such as HTTP and TCP, to provide subscription/publication services, and built-in
continuous flow components provide real-time processing.

Message-based middleware brings additional benefits such as guaranteed
asynchronous transactions with ACID properties, seamless integration of heterogeneous
systems, and unified access points and mechanisms.
4 General Guidelines for Development of a Graph
Following are the general development guidelines.

 All graphs should contain clear labels that can be used to comment the
graphs.
o One label should specify the title of the graph.
o One label should specify the detailed descriptions of the functionality
of the graph
o One label should specify the creation dates and author of the graph and
revision history.
o Use these labels anywhere necessary to explain the process.
 Graph naming standards must be established. For example, the naming
standard of a graph could be
<project_name>.<module_name>.<process_name>.<short_process
_description>.mp
Example: CICS.premium_processing.cleanse_records.mp
 The graph name may not exceed 30 characters and must always be in lower
case.
 All the transformations, source & target file URLs, layouts, DML URLs must
be passed as parameters.
 The DMLs in the graphs must be passed as URLs.
 The graph should not refer to the same DML twice by specifying the path; DMLs
should always be propagated.

Component Specific Guidelines:

 The general component naming convention has the following structure:

<Component_Name_Abbreviation>-<Component Description in terms of
business>-<KEYS if any>-<other information>

Example:
Sort – Sorting for dedup – Keys: EMP_ID – Sort in ascending order

Input File / Output File / Intermediate File


These are some of the basic components of an Ab Initio graph. They represent the
input, output, and intermediate datasets of a graph, respectively.

The data file names will be used to create the labels using the following rules:
 The label name will be lower case
 The complete file name in lower case will be provided as the “Component
Information”
 If the file name includes a datestamp such as “02012004”, the actual date will be
replaced with “new”, preceded by a dot.
 If the file is attached to any port other than out or in, the label will end with
the name of the port, preceded by a dot (see the sketch after this list).
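
One illustrative reading of these rules, expressed as a small Python helper (the
eight-digit datestamp pattern and the port handling are assumptions):

    # Derive a component label from a data file name per the rules above:
    # lower case, datestamps replaced by "new", and a ".<port>" suffix for
    # ports other than "in"/"out". The 8-digit datestamp pattern is assumed.
    import re

    DATESTAMP = re.compile(r"\d{8}")

    def file_label(file_name, port="out"):
        label = file_name.lower()
        label = DATESTAMP.sub("new", label)   # e.g. trans.02012004.dat -> trans.new.dat
        if port not in ("in", "out"):
            label += "." + port               # e.g. ".reject"
        return label

    # file_label("TRANS.02012004.DAT", "reject") -> "trans.new.dat.reject"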

Lookup File
Lookup files are similar to lookup tables. Every lookup file has a key and the
lookup will be performed using the key.

The naming standard for the Lookup File is as follows:


 The label name will be lower case
 The complete file name in lower case will be provided as the “Component
Information”
 If the file name includes a datestamp such as “02012004”, the actual date will be
replaced with “new”, preceded by a dot.

Please note: Lookup files are read into memory and can be used to improve the
performance of a graph. At the same time, they should not be used
indiscriminately, since they stay in memory for the duration of the execution. As a
general rule, the total size of all lookup files in a graph should not be more than
300MB.
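
The 300MB rule of thumb can be verified mechanically before a run; the sketch below
simply sums the sizes of the lookup files used by a graph (the file list is an
assumption):

    # Check that the combined size of all lookup files used by a graph stays
    # within the ~300MB guideline. File paths are illustrative.
    import os

    LOOKUP_FILES = ["lkp/products.dat", "lkp/currencies.dat", "lkp/branches.dat"]
    LIMIT_BYTES = 300 * 1024**2

    total = sum(os.path.getsize(p) for p in LOOKUP_FILES)
    if total > LIMIT_BYTES:
        print("WARNING: lookup files total %.0f MB, exceeding the 300 MB guideline"
              % (total / 1024**2))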

Filter by Expression
This component is used to select records from a data stream by specifying the
selection criteria as an expression.

The naming standard for the Filter by Expression component is as follows:


 The general component naming standards apply.
 The filter criteria, written out, are used as the “Component Information”.

Example: Filter By Expression – premium_pay_date > 333 days

Reformat
This is the basic transformation component used to manipulate a data stream in
Ab Initio. A reformat can have one or more outputs.

The naming standard for the Reformat component is as follows:


 The general component naming standards apply.
 The transform written out is used as the “Component Information”.

Example: Reformat – Fill Missing Values


Please note: As a best practice do not use the ‘select’ parameter. Perform filters
utilizing the Filter By Expression component.

Join
The Join component is used to match records between two or more data streams
utilizing a specified key. A Join belongs to one of the following types: inner, full
outer, and explicit.

The naming standard for the Join component is as follows:


 The general component naming standards apply.
 The transform written out is used as the “Component Information”.
 The join type is specified as “additional information”: inner for inner;
outer for full outer, and explicit for explicit. The join type will be
prefaced with a hyphen.

Example: Join – Employee with Dept – outer

Please note: A join can either use sorted data or perform an in-memory join. The
following rules should be applied to determine the type:
 If the data coming in is sorted, use a sorted join.
 If all the data files (except one) are smaller than 300MB in total size, use
the in-memory join with the largest file specified as the driver, else use the
sorted join.

As a best practice do not use the “select” parameter. Perform filters utilizing the
Filter By Expression component. Be advised: if the total size of the intermediary
results held in memory by the Join exceeds the number of bytes specified in
the max-core parameter, temporary files are written to disk. If max-core is set
too low, the component executes more slowly than expected. Conversely, if
max-core is set too high, the Join may utilize too many machine resources and slow
performance.
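
These sizing rules can be expressed as a simple decision helper. The sketch below, in
Python purely for illustration, only chooses the strategy described above; the
threshold comes from the rule, and everything else is an assumption:

    # Join-strategy decision per the rules above: in-memory join when all
    # inputs except the largest (the driver) total no more than ~300MB,
    # otherwise a sorted join. Sizes are in bytes.
    def choose_join_strategy(input_sizes, already_sorted=False, limit=300 * 1024**2):
        if already_sorted:
            return "sorted join"
        sizes = sorted(input_sizes, reverse=True)     # largest first
        driver, others = sizes[0], sizes[1:]
        if sum(others) <= limit:
            return "in-memory join (largest input, %d bytes, as the driver)" % driver
        return "sorted join"

    # choose_join_strategy([2000000000, 50000000, 10000000])
    #   -> in-memory join with the 2 GB input as the driver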

Rollup
The Rollup component is used to summarize a data stream using a specified key
in Ab Initio.

The naming standard for the Rollup component is as follows:


 The general component naming standards apply.
 Wherever possible, the name of the data stream should also be included as
part of the component information.
 The key used to perform the rollup must be specified as the component
information. The key must be prefaced with the word “for.”
Example: Rollup – Amount spent on Sponsors

Please note: A rollup can either use sorted data or perform an in-memory rollup. The
following rules should be applied to determine the type:
 If the data coming in is sorted, use a sorted rollup.
 If the data file is smaller than 250MB, use the in-memory rollup; otherwise use the
sorted rollup.

Similar to Join components, if the total size of the intermediary results held in
memory by the Rollup exceeds the number of bytes specified in the max-core
parameter, temporary files are written to disk. If max-core is set too low, the
component executes more slowly than expected. Conversely, if max-core is
set too high, the Rollup may utilize too many machine resources and slow
performance.

Aggregate
The Aggregate component is used to summarize groups of data records by
specified keys.

The naming standard for the Aggregate component is as follows:


 The general component naming standards apply.
 Wherever possible, the name of the data stream should also be included as
part of the component information.
 The key used to perform the aggregation must be specified as the component
information. The key must be prefaced with the word “for.”

Example: Aggregate – Total Transactions for Account

Please note: When an Aggregate is run in parallel, it is frequently preceded by a
Partition By Key, which distributes records with the same key value to the same
partition. Similar to Join components, if the total size of the intermediary results
held in memory by the Aggregate exceeds the number of bytes specified in
the max-core parameter, temporary files are written to disk. If max-core is set
too low, the component executes more slowly than expected. Conversely, if
max-core is set too high, the Aggregate may utilize too many machine resources
and slow performance.

Normalize
The Normalize component is used to generate multiple data records from each
input data record.

The naming standard for the Normalize component is as follows:


 The general component naming standards apply.
 Wherever possible, the name of the data stream should also be included as
part of the component information.
 The key used to perform the normalization must be specified as the component
information. The key must be prefaced with the word “for.”

Example: Normalize – Money spent by salesmen

Please note: As a best practice do not use the ‘select’ parameter. Perform filters
utilizing the Filter By Expression component.

Sort
The sort component is used to sort a data stream using a specified key in Ab
Initio.

The naming standard for the Sort component is as follows:


 The general component naming standards apply.
 Wherever possible, the name of the data stream should also be included as
part of the “Component Information”.
 The key used to perform the sort must be specified as the “Component
Information”. The key must be prefaced with the word “by.”

Example: Sort – Customer by Emp ID

Please note: As a general principle, the max core setting for sorts should be
150MB per CPU. Please be prudent in setting the value if there are many sorts in
the same phase of the graph.

Dedup Sorted
The Dedup Sorted component is used to remove duplicates from a sorted data
stream.

The naming standard for the Dedup Sorted component is as follows:


 The general component naming standards apply.
 Wherever possible, the name of the data stream should also be included as
part of the “Component Information”.
 The key used to perform the deduplication must be specified as the
“Component Information”. The key must be prefaced with the word
“using.”

Example: Dedup Sorted – Account using acct num

Please note: As a best practice, the dup port should be captured. If the duplicates
are not processed separately they should be written to a file.
Input Table/Output Table/Update DB Table
Ab Initio has a suite of database components for performing database operations.
These include Input Table, Output Table, and Update DB Table components,
which are used to unload, load, or update/insert a table respectively.

The naming standard for Input Table/Output Table/Update DB Table components
is as follows:
 The general component naming standards apply.
 The <table name> must be used as the “Component Information.” If the
table owner contains a “$,” do not use the “$.”
 For Update DB Table or Output Table components, “additional
information” should be provided as to whether the component performs an
insert only, update only or an upsert.

Example: Input Table – Account
Output Table – Account – insert
Update DB Table – Account – update

Please note: Always land the data to a file before performing any database
operation as this aids restartability. All log, error, and reject ports must be
connected to a file if the reject threshold is greater than zero (0). For Update DB
Table components, please be advised, SQL scripts must not have “;” at the end.

Partition and De-partition


The partition and de-partition components are used to parallelize a data stream in
Ab Initio.

The naming standard for the Partition and De-partition components are as
follows:
 The general component naming standards apply.
 Wherever possible, the name of the data stream should also be included as
part of the “Component Information”.
 The key used to perform the partitioning must be specified as the “Component
Information”. The key must be prefaced with the word “by.”
 If an expression is used, it is written out and used as the “Component
Information”.

Example: Partition By Key – trans by trans_id

Please note: All partition components must be followed by an explicit or implied
gather to prevent re-partitioning from increasing the level of parallelism. For
example, if a 4-way MFS is partitioned to a new 4-way MFS, the resulting flow
will be 16-way if a gather is not done following the partition component. As a
general rule, each sort should not use more than one-tenth of the memory available
on the server. Please be prudent in setting the value if there are many sorts in the
same phase of the graph. Furthermore, the layout of the component must be
propagated from its neighbors.

Run Program
The Run Program component is used as a wrapper for non-Ab Initio executables
in Ab Initio.

The naming standard for the Run Program component is as follows:


 The general component naming standards apply.
 If applicable, the name of the executable must be used as the “Component
Information”.
 If only commands are executed, a brief description of the command must
be used as the “Component Information”.

Example: Run Program – script_name.ksh
Run Program – move job control date file

Please note: The layout of the component must exist within the run directory.

Run SQL
The Run SQL component is used as a wrapper to SQL scripts in Ab Initio.

The naming standard for the Run SQL component is as follows:


 The general component naming standards apply.
 The name of the SQL script must be used as the “Component
Information”.

Example: Run SQL – sql_script_name.sql

Please note: The layout of the component must exist within the run directory.

Transformation Naming Standards:

 The naming standard for transformations should be
<component_name>.<process_description>.xfr
 All the file names must be in lower case
 All the transformations must be stored on the server and should
not be embedded in the graph
File Naming Standards and extensions

Ab Initio File Type            Suffix

Graph                          .mp
Shell script                   .ksh
DML                            .dml
Transform                      .xfr
Database connectivity file     .dbc
Config file                    .cfg
Data file                      .dat
Log file                       .lg
Error file                     .err
Reject file                    .rej
Ab Initio SQL script           .sql

Variable Naming Conventions and Best Practices:

 All the global variables must be initialized and reinitialized in the
transformation logic
 All the global variables must be declared with the prefix gb_
Example: let integer(1) gb_count;
 All the local variables must be declared with the prefix lc_
Example: let integer(1) lc_count;
 The following prefixes should be used when declaring variables:

Data Type    Prefix
Integer      in_
Decimal      dc_
Date         dt_
String       st_

Exception Handling in Ab Initio:

 If a component is in “NEVER ABORT” mode, reject, error and log ports
must be captured
 All the non-fatal errors must be written to the log and the records must be
written to the error port
 Use of “ABORT ON FIRST REJECT” is encouraged
