Kiran Abinitio


DWH CONCEPTS

1) What is a data warehouse and why do we need a data warehouse?

A data warehouse is a relational database that is designed for query and
analysis rather than for transaction processing. It usually contains historical data derived
from transaction data, but it can include data from other sources. It separates analysis
workload from transaction workload and enables an organization to consolidate data
from several sources.

Data warehousing definition by Inmon:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data.

Subject oriented: Data that gives information about a particular subject instead of about
a company's ongoing operations. For example, to learn more about your company's sales
data, you can build a warehouse that concentrates on sales. Using this warehouse, you
can answer questions like "Who was our best customer for this item last year?"

Integrated: Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.

Time-variant: All data in the data warehouse is identified with a particular time period.

Non-volatile: Data is stable in a data warehouse. More data is added but data is never
removed. This enables management to gain a consistent picture of the business.

2) Difference between OLTP and OLAP?

OLTP vs OLAP:
1) OLTP is online transaction processing; OLAP is online analytical processing.
2) OLTP contains only current data; OLAP contains current + historical data.
3) In OLTP, insert, update and delete operations are performed; in OLAP, mostly select operations are performed.
4) OLTP data is volatile in nature; OLAP data is non-volatile in nature.
5) OLTP has many joins; OLAP has few joins.
6) OLTP data is in normalized form; OLAP data is in denormalized form.
7) Aggregated data is rare in OLTP; aggregated data is common in OLAP.
3) What is data mart?

A data mart is a subset of the data warehouse. It is designed for a particular
line of business, such as sales, marketing, or finance. In a dependent data mart, the data
can be derived from an enterprise-wide data warehouse. In an independent data mart,
data can be collected directly from sources.

4) What is data mining?

Data mining is used for estimating the future. For example, for a
company/business organization, by using data mining we can predict the future of the
business in terms of revenue, employees, customers, orders, etc.

Generally, data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful information -
information that can be used to increase revenue, cut costs, or both.

5) What are dimension and fact tables??

Dimension table:

A dimension table provides additional, descriptive information (dimensions) about the
fields of a fact table.

If I want to know the number of resources used for a task, my fact table will store the
actual measure (of resources) while my dimension tables will store the task and resource
details.

Some examples of dimension tables:

Customer dimension --- Which will contain all the details of customer.

Product dimension --- Which will contain all the details of product.

Store dimension --- Which will contain all the details of store.

Time dimension --- Which will contain all the details of time.

Typical questions these dimensions help answer:
What was the cost of non-consumable goods sold in the northeast region in 1999?
What was the cost of kitchen products in New York City in the third quarter of 1999?
Fact Table:

A fact table consists of facts or measures. All the primary keys of the
dimension tables are used as foreign keys in the fact table. A fact table consists of two
sections: keys and measures.

We have 3 types of facts (measures):

a) Additive measures: Facts that can be summed up (participate in arithmetic
calculations) across all the dimensions of the fact table.
b) Non-additive measures: Facts that cannot be summed up across any dimension
(for example, ratios and percentages).
c) Semi-additive measures: Facts that can be summed up across some dimensions but
not others (for example, an account balance can be summed across accounts but not
across time).
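
As a rough illustration of the difference (a minimal Python sketch; the rows and column names are made up for this example and are not from any particular warehouse): sales_amount is additive across every dimension, while balance is semi-additive because summing it across dates would double-count.

# Hypothetical fact rows: (date, account, sales_amount, balance)
rows = [
    ("2011-03-01", "A1", 100.0, 500.0),
    ("2011-03-01", "A2",  50.0, 200.0),
    ("2011-03-02", "A1",  75.0, 575.0),
    ("2011-03-02", "A2",  25.0, 225.0),
]

# Additive: sales_amount may be summed across all dimensions (dates and accounts).
total_sales = sum(r[2] for r in rows)                        # 250.0

# Semi-additive: balance may be summed across accounts for one date,
# but not across dates; take the latest date instead.
latest = max(r[0] for r in rows)
closing_balance = sum(r[3] for r in rows if r[0] == latest)  # 800.0

print(total_sales, closing_balance)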

6) What are Junk and DEGENERATE dimensions?

Junk dimension:

Junk dimensions are dimensions that contain miscellaneous data (like flags and
indicators) that do not fit in the base dimension table.

Degenerate dimension:

A degenerate dimension is data that is dimensional in nature but is stored in the fact table.

The best examples are "Invoice Number", "Bill Number" and "PO Number": these are degenerate
dimensions in the transaction tables; they can be maintained in the fact table itself
instead of creating separate dimensions for them.

7) What is a factless fact table? Different types of facts?

A fact table without measures is called a factless fact table.

A factless fact table is useful for connecting two fact tables. Factless fact tables can be
used as bridge tables, or to store measures that are used directly without aggregation
(for example, max() and min() attributes).

8) What is normalization? Different types of normalization?

Normalization is the process of splitting the data from a single table into two or more
tables.

First normal form: A table is said to be in first normal form if all the column values are
atomic.

Second normal form: A table is in second normal form if all its non-key fields are fully
dependent on the whole key.

This means that each field in a table must depend on the entire key. Fields that do not
depend upon the whole (combination) key are moved to another table, on whose key they
do depend.

Third normal form: A table is said to be in third normal form if all the non-key fields of
the table are independent of all other non-key fields of the same table.

9) What are slowly changing dimensions (SCD) and what are the types of SCDs?

SCDs are of 3 types:

SCD1: Keeps only current data; the old value is simply overwritten.
SCD2: Keeps current + full historical data, usually by inserting a new row for every change.
SCD3: Keeps current + limited history, usually by holding the previous value in an additional column.
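
As a rough illustration of SCD2 behaviour only (a minimal Python sketch; the effective-date and current-flag field names are common conventions assumed for this example, not prescribed here):

from datetime import date

# Existing customer dimension rows (SCD2): one current row per natural key.
dim = [
    {"cust_id": 101, "city": "Austin", "eff_from": date(2010, 1, 1),
     "eff_to": None, "current": True},
]

def apply_scd2(dim, cust_id, new_city, change_date):
    """Expire the current row and insert a new row holding the new value."""
    for row in dim:
        if row["cust_id"] == cust_id and row["current"]:
            row["eff_to"] = change_date
            row["current"] = False
    dim.append({"cust_id": cust_id, "city": new_city,
                "eff_from": change_date, "eff_to": None, "current": True})

apply_scd2(dim, 101, "Dallas", date(2011, 3, 18))
# dim now holds two rows for cust_id 101, so the full history is preserved.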

10) What are star and snowflake schemas?

A fact table surrounded by N dimension tables, where the structure looks like a star, is
called a star schema.

Difference between snowflake and star schemas:

Star -- both fact and dimension tables are denormalized in nature.
Snowflake -- dimensions are normalized in nature and facts are denormalized in nature.

Advantages of star schema:

No need to use complicated joins.
Queries return results faster.

Advantages of snowflake schema:

We can easily make modifications directly in the tables.

Disadvantages of snowflake schema:

We have to use complicated joins since we have more tables.
There will be some delay in processing the query.
11) What is ODS?

ODS stands for Operational Data Store.

It is the final integration point in the ETL process before loading the data into the Data
Warehouse. It can support DSS processing.

12) What is Granularity?

In simple terms, the level of granularity defines the extent of detail. As an example, let
us look at the geographical level of granularity. We may analyze data at the levels of
COUNTRY, REGION, TERRITORY, CITY and STREET.

In this case, STREET is the most detailed (finest) level of granularity.

13) What is surrogate key in DWH??

A surrogate key is a substitute for the natural primary key in data warehousing.
It is just a unique identifier or number for each row that can be used as the primary key
of the table.

If we have many sources, it is better to create a new surrogate key instead of using
the natural primary keys of the different sources.
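
A minimal sketch of why a surrogate key helps when the same kind of record arrives from two sources with different natural keys (Python, with made-up source and column names used only for illustration):

import itertools

# Natural keys from two hypothetical source systems.
source_a = [{"src": "A", "natural_key": "CUST-001", "name": "Alex"}]
source_b = [{"src": "B", "natural_key": "9000017",  "name": "Andrus"}]

surrogate = itertools.count(start=1)   # simple warehouse-side sequence
dim_customer = []

for rec in source_a + source_b:
    dim_customer.append({
        "customer_sk": next(surrogate),         # surrogate key used by fact tables
        "source_system": rec["src"],            # keep lineage back to the source
        "source_natural_key": rec["natural_key"],
        "name": rec["name"],
    })

print(dim_customer)
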
Ab Initio

Ab Initio is a Latin phrase meaning "from the beginning" or "from first principles".

Ab Initio software helps you build large-scale data processing applications and run them
in parallel environments. Ab Initio software consists of three main programs:

Co>Operating System:

It is the server side of Ab Initio, where all the graphs run. It provides a distributed model
for process execution, file management, process monitoring, checkpointing, and debugging.

The current Co>Operating System version is 2.15.8.7 (simply 2.15).

To get the version of the Co>Operating System, execute the command m_env -v (or)
m_env -version at the command prompt or in UNIX (command line).

Graphical Development Environment (GDE):

GDE is the user's work area, where we create all our graphs.

The current GDE version is 1.15.11.1 (simply you can say 1.15).

To get the version of the GDE, go to Help > About Graphical Development Environment
in the GDE screen.

Enterprise Meta Environment (EME):


EME is the repository where all the objects are stored.
It is also used for version control and dependency analysis.

Architecture of Ab Initio:

[Diagram: the GDE (Graphical Development Environment) and the EME (Enterprise Meta
Environment) sit on top of the Co>Operating System, which in turn runs on the native
operating system; the Component Organizer (F9) is part of the GDE.]

Run Settings Validation:

Before starting work in the GDE, it is better practice to validate the run
settings. That means we have to check whether the user/password credentials, the server (host)
name and the Co>Operating System details are valid.

Go to Settings > Manage Connections


Basic Terms In Ab Initio:

Dataset: In simple terms, a dataset is a file. A file can be a mainframe file or any fixed-width or
delimited file.

You can also think of a dataset as a table in the database world.

Component: A component is an Ab Initio program.

There are various components like REFORMAT, SORT, JOIN, MERGE, DEDUP
SORTED, ROLLUP, SCAN, PARTITION BY KEY, etc.

Port: A port is a connection point for the input or output to a component.

We have different types of ports like input port, output port, reject port, error port and log
port.

Flow: A flow carries a stream of data between components in a graph. Flows connect
components via ports. Ab Initio supplies four kinds of flows with different patterns:
straight, fan-in, fan-out, and all-to-all.

Graph: A graph is a diagram that defines the various processing stages of a task and the
streams of data as they move from one stage to another. Visually, stages are represented
by components and streams are represented by flows. The collection of components and
flows comprise an Ab Initio graph.

Field: A field is equivalent to a column of a table in the database world. A field is also called
a variable, which holds a value.

Key: A key is the unique identifier for a group. Keys are used in many places in Ab Initio
development; we use a key to sort the data, join two files, roll up data, etc.

Symbols:

Boxes for processing and Data Transforms (Components)

Arrows for Data Flows between Processes

Cylinders for serial I/O files

Divided cylinders for parallel I/O files (Multi Files)

Grid boxes for database tables

To do cues:
To-do cues are yellow highlighted areas with question marks that require
action.

The following figure shows a graph with to-do cues:

Before you start acting on to-do cues, connect all the flows in the graph. This will remove
some of the to-do cues through propagation. Remove the remaining to-do cues from the
dataset components first, then from the program components (working from the outside
in).
The following table lists the different types of to-do cues and how to resolve them.

?? Layouts
Problem: The component has no layout. For program components, layout is set either by
propagation or manually.
Resolution: Double-click the layout indicator and select the layouts you want.

? Parameters or file location
Problem: The component's required parameters or file locations lack values.
Resolution: Double-click the square box and fill in the missing parameters or file locations.

out* Port names
Problem: The record format for the port is not set. Record formats are usually propagated,
but sometimes are set manually.
Resolution: Connect the port to another port with a known record format, or double-click
the port and add the new record format.
Status indicators while running a graph:
For quick diagnosis of the pass/fail state of components and graphs, the GDE displays
status indicators when you run a graph. The status is shown with colored LEDs, as
follows:

A white circle means unstarted (or unopened).

A gray circle means conditionalized out.

A green circle means running.

A blue circle means done (success).

A red circle means error.

Types of files:

We have different types of files in Ab Initio, as mentioned below.

1) Input file
2) Output file
3) Intermediate file
4) Lookup file
Introduction to multifiles:
A multifile is a parallel file that is composed of individual files on
different disks and nodes. The URL starts with mfile. A multifile will look as shown
below.

A multifile contains 2 kinds of partitions:

1) Control partition: Contains the paths of the data partitions.
2) Data partitions: Contain the actual data.

A Multi File System (MFS) is created using a command called m_mkfs, and deleted using
m_rmfs.

When you open a multifile (the control file), you will see the data partition paths, as shown below.

> cat testpart.dat

"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_001/mfs_6way/ main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_002/mfs_6way/ main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_003/mfs_6way/ main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_004/mfs_6way/ main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_005/mfs_6way/ main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_006/mfs_6way/ main/testpart.dat"

When you open the actual data path, you can see the data.

tail /ai/id/abinitio/ln1/data/parts/mfs_6way_001/mfs_6way/ main/testpart.dat


03/18/20110041108192 910G52011
03/18/20110038705844 910G52011
03/18/20110037797016 910G52011
03/18/20110038606661 913G52011
03/18/20110039221155 917G52011
03/18/20110130599749 979G31011
03/18/20110130076086 980G31011
03/18/20110130393630 980G31011
03/18/20110026112508 980G31011
03/18/20110139731897 984G31011

Introduction to the directory structure:

Whenever we create a sandbox, the following directories will be created by default.

DML - Directory that stores record format files
XFR - Directory that stores transform files
RUN - Directory where your graphs run
DB - Directory that stores database configuration files (.dbc)
MP - Directory that stores GDE graphs
PLAN - Directory that stores plan files

How to create a sandbox:

Go to Project > Create Sandbox
Data Manipulation Language (or) Record format:

DML is an acronym for Data Manipulation Language in the Ab Initio world. It is the Ab
Initio programming language you use to define record formats. DML is equivalent to
DDL in traditional databases. DML tells Ab Initio how to interpret your data.

Delimited.dml

record
decimal('|') cust_id;
string('|') last_name;
string('|') first_name;
string('|') street_addr;
string('|') state;
decimal('|') zip;
string('|') gender;
decimal('\n') income;
end

Example of data:

297457|Alex|Neil Steven|149 Inkwell St.|KY|40541|M|0073900
901288|Andrus|Tom|165 Eboli St.|WY|60313|M|0492500
662197|Bannon|Jeffrey C|21 Compuserve St.|CO|70307|M|0140200

record
decimal(6) cust_id; // Customer ID
string(18) last_name; // Last name
string(16) first_name; // First name
string(26) street_addr; // Street address
string(2) state; // State
decimal(5) zip; // Zipcode
string(1) gender; // Gender (M = male; F = female)
decimal(7) income; // Income (in dollars)
string(1) newline;
end

Example of data:

297457Alex Neil Steven 149 Inkwell St. KY40541M0073900
901288Andrus Tom 165 Eboli St. WY60313M0492500
662197Bannon Jeffrey C 21 Compuserve St. CO70307M0140200
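
For intuition about what the fixed-width record format above describes, here is a minimal Python sketch that pads one sample row to the declared widths and then slices it back into fields (purely illustrative; this is not how the Co>Operating System reads data, and the sample rows above lose their original padding when pasted):

# Field names and widths taken from the record format above.
fields = [("cust_id", 6), ("last_name", 18), ("first_name", 16),
          ("street_addr", 26), ("state", 2), ("zip", 5),
          ("gender", 1), ("income", 7)]

# Build one fixed-width line by padding each value to its declared width.
values = ["297457", "Alex", "Neil Steven", "149 Inkwell St.",
          "KY", "40541", "M", "0073900"]
line = "".join(v.ljust(w) for v, (_, w) in zip(values, fields)) + "\n"

# Slice the line back into fields according to the declared widths.
record, pos = {}, 0
for name, width in fields:
    record[name] = line[pos:pos + width].rstrip()
    pos += width

print(record["cust_id"], record["last_name"], record["income"])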

Relationship of Ab Initio with UNIX:


In Ab Initio, we connect to the UNIX server through the run settings (manage
connections). All the files (input/output) and objects (dml, xfr, mp, run, db, sql) are stored
on the UNIX server.

For Ab Initio we use UNIX as the server, where the Co>Operating System is installed. We can
create everything through UNIX that we can do through Ab Initio, except graphs.

How to connect to GDE:


Layout:

Layout is the information supplied to the Co>Operating System that tells it where the
program components/datasets will be executed, including:

-- The locations of files
-- The number and locations of the partitions of multifiles
-- The number of the partitions of program components and the locations where they
execute

The below 6 options are available for layout.

Propagate from neighbors -- the default; sets the selected component's layout from the
layouts of neighboring components that are connected by straight flows.

Component -- This option is used if we want to use another component's layout.

URL -- Specifies the path of the partition (control).

Custom -- Specifies the paths of the partitions (data partitions).

Hosts -- Specifies the names of the hosts where the partitions of the component run.

Database -- This option is enabled only for table components. The below 4 options are
present for database layouts:

Default layout -- The GDE generates an appropriate layout depending on the table
type (input, output or update); if the table is partitioned 8 ways, the data
is unloaded in parallel.
Serial -- Use to load a table serially rather than in parallel.
Table -- Specifies an existing table.
Explicit args -- In Explicit args, enter arguments to the m_db_layout command.
Table and program components have a layout tab.

A watcher file's layout will be the predecessor component's layout.

Watchers:

A watcher automatically adds intermediate files to flows so you can view the
data that passes through the flows.

The debugger places watcher files in the layout of the component downstream of the
watcher.

To add watcherGo to DebugEnable debuggerRight click on the flowAdd


watcher on flow.

To delete the watcher files Debug > Delete Watcher Datasets (or) m_rm -f -rmdata
<GDE-WATCHER-xxx>

Phases and checkpoints:

A phase is a stage of a graph that runs to completion before the start of the next
phase. By dividing a graph into phases, you can make the best use of resources such as
memory, disk space, and CPU cycles.

A checkpoint is a point at which the Co>Operating System saves all the
information it would need to restore a job to its state at that point. In case of failure, you
can recover completed phases of a job up to the last completed checkpoint.

When you set a phase in the GDE, by default it has a checkpoint at its phase break.

If the execution of the graph successfully passes the 1st checkpoint, the Co>Operating
System saves all the information it needs to restore the job to its state at that checkpoint.

As the execution of the graph successfully passes each succeeding checkpoint, the
Co>Operating System:

Deletes the information it has saved to be able to restore the job to its state at the
preceding checkpoint.
Deletes the temporary files it has written in the layouts of the components in all phases
since the preceding checkpoint.
Commits the effects on the file system of all phases since the preceding checkpoint.

Input file:

Represents one file, many files, or a multifile as an input to your graph

Description:

Label -- The appropriate name we have to give.

File type -- input, output, intermediate or lookup. Add to catalog option: if we want to use
a file as a lookup in the successor components of a graph, we have to click this option.

Data location -- URL (or) partitions.

Access:

File handling: Click the check box if you want to delete the file after the input phase
completes.

File protection: user, group, others (all will have default read and write).

Parameters:

If we specify add to catalog, we have to give key names.

Ports:

Propagate from neighbors


Use another port in graph
Use file
Embedded

Propagation: propagate through (check box)

In a REFORMAT component that has multiple output ports labeled out0, out1,
out2, and so on, selecting Propagate Through affects the out0 port, but not the other outn
ports. Similarly, in a JOIN component that has multiple input ports labeled in0, in1, and
so on, selecting Propagate Through affects the in0 port, but not the other inn ports.

Output file:
Represents one file, many files, or a multifile as an output from your graph

File handling:

If the file does not exist (check box):

Create file -- default
Fail

If the file exists (check box):

Delete and recreate file -- default
Append to file
Fail

Check box, don't roll back file if job fails -- Keeps the file state after a job fails. Useful for
retaining log files.

By default, this option is not selected.


File protection:

Default is read and write for all users, groups and others.

We can change access only for output and intermediate files.

Intermediate files:

File handling:

All options of the output file, plus a check box "delete file after input phase completes" in the
Access tab.

Lookup files:

A Lookup File can make graph processing quicker and more efficient. A Lookup
File represents one or more serial files or a multifile. The amount of data associated with a
Lookup File should be small enough to be held in main memory.

A Lookup File is not connected to other graph components, but its associated data is
accessible from other components.

Parameters of a lookup file:

Key--Name of the key field against which the component matches its arguments.
Record Format--The record format you want the component to use when returning data
records.

Lookup functions:

Lookup--Returns a data record from a lookup file.

lookup_local--Returns a data record from a partition of a lookup file.

lookup_count--Returns the number of matching data records in a lookup file.

lookup_count_local--Returns the number of matching data records in a partition of a
lookup file.

lookup_match--Searches for a specified expression in a lookup file.

lookup_match_local--Looks for a specified match in a local partition of a lookup file.
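
A rough Python analogy for what lookup and lookup_count give you: an in-memory index keyed on the lookup key (illustrative only; the data and key names are made up, and this is not how the Co>Operating System implements lookup files):

from collections import defaultdict

# Hypothetical lookup data, keyed on state.
lookup_rows = [
    {"state": "KY", "region": "South"},
    {"state": "WY", "region": "West"},
    {"state": "WY", "region": "Mountain"},
]

index = defaultdict(list)
for row in lookup_rows:
    index[row["state"]].append(row)     # held entirely in memory, like a Lookup File

def lookup(state):
    """First matching record, or None (roughly what lookup() returns)."""
    matches = index.get(state)
    return matches[0] if matches else None

def lookup_count(state):
    """Number of matching records (roughly lookup_count())."""
    return len(index.get(state, []))

print(lookup("KY"), lookup_count("WY"))   # first KY record, then 2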

Input table:

A database configuration file (.dbc) is used to connect to the database through the GDE.

.dbc file:

dbms: oracle
db_version: Oracle version number
db_home: Oracle home directory ($ORACLE_HOME)
db_name: Oracle instance name ($ORACLE_SID)
db_nodes: Machine on which Oracle server or client software runs
User: user name of database connection
encrypted_password: encrypted pwd of database.

Parameters:

Ora-interface: Default is api (only api mode is present for Input Table).

field_type_preference:

Delimited (default) -- Causes data fields in the generated DML record format to be
delimited (the delimiter is specified by the column_delimiter field) for all data types.

Variable -- Specifies that if a given column is variable-length in the database, the
corresponding field in the record input by Input Table will be variable-length.

Fixed -- Causes the generated DML record format to be fixed width for all data types. All
fields in the record input by Input Table will be padded out to some fixed length, based
on the definitions in the source database. For a data type such as CLOB, the database
package will attempt to estimate some reasonable size for the fixed-size field in the
Input Table record.

column_delimiter: The column delimiter needs to be specified here when
field_type_preference is set to delimited.

Default column_delimiter is "\x01".

Output table:

Access:

Table handling:

If the table does not exist:

-- Create table
-- Fail

If the table exists:

-- Truncate table before loading
-- Add new rows

Parameters:

When using an Oracle database, the below parameters are enabled for the output table.

ora_interface: Default is utility.
field_type_preference: Same as in the input table, as mentioned above.
column_delimiter: Same as in the input table, as mentioned above.
num_errors: Maximum number of errors allowed.
rows_per_commit: Number of rows processed before committing the records.

When using ora_interface api:

Commit number: Number of rows to process before committing the records.
Commit table: Database table in which to store commit progress. Used in api
mode only. This is an internally used table.

Syntax to create a commit table:

m_db create_commit_table config_file \
 -table commit_table_name \
 -index index_name
 [ -drop ] [ -print_ddl ]

The commit table contains the job id, vertex (partition name), done, rows_processed,
username, and job start time.

Job id: The job id parameter allows you to specify a unique job identifier for each run of
the graph.

The Co>Operating System uses this value to identify the graph's run state. Thus, for each
fresh run of the graph, the job identifier value must be fresh. But whenever a graph is
restarted after a previously failed run, the job identifier must be the same as it was before
(so that the restarted run can be correctly associated with the state of the uncompleted
previous run).

Parallelism:

The power of Ab Initio software to process large quantities of data is based on its
use of parallelism (doing more than one thing at the same time)

The Co>Operating System uses three types of parallelism:

1) Component parallelism
2) Pipeline parallelism
3) Data parallelism

1) Component parallelism: Component parallelism occurs when program
components execute simultaneously on different branches of a graph. We can supply data
for components in 2 ways:

a) Multiple separate datasets, one for each branch of the graph
b) A single dataset divided into partitions

The more branches a graph has, the greater the component parallelism. If a graph
has only one branch, component parallelism cannot occur.

Departition components, Join and Fuse components break component parallelism.
Filter by Expression, partition components, Replicate and Reformat (count > 1)
increase component parallelism.

2) Pipeline parallelism: Pipeline parallelism occurs when several connected
program components on the same branch of a graph execute simultaneously.

Sort, all transform components using the in-memory option, intermediate files and
phases will break pipeline parallelism.

3) Data parallelism: Data parallelism occurs when a graph separates data into
multiple divisions, allowing multiple copies of program components to operate on the
data in all the divisions simultaneously.

Partition components will increase data parallelism.
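
A small Python sketch of the idea behind data parallelism (split the data into partitions and run the same processing on every partition at the same time; the process pool here merely stands in for the partitioned copies of a component, it is not how Ab Initio runs):

from concurrent.futures import ProcessPoolExecutor

def process_partition(records):
    """The same 'program component' logic is applied to every partition."""
    return [r * 10 for r in records]

if __name__ == "__main__":
    data = list(range(12))
    # Divide the data into 4 partitions ...
    partitions = [data[i::4] for i in range(4)]
    # ... and let 4 copies of the component work on them simultaneously.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_partition, partitions))
    print(results)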

Partition components:

Broadcast:

Broadcast arbitrarily combines all records it receives into a single flow and
writes a copy of that flow to each of its output flow partitions. To avoid unpredictable
results, make the record format of the in and out ports identical.

Broadcast does the following:

1) Reads records from all flows on the in port.


2) Combines the records arbitrarily into a single flow.
3) Copies all the records to all the flow partitions connected to the out port.

Partition by Expression:

Partition by Expression distributes records to its output flow partitions
according to a specified DML expression.

Parameters:

Function---Specify a DML expression using a field or fields from the input record
format.

The expression must evaluate to a number between 0 and the number of flows
connected to the out port minus 1.

Partition by Expression does the following:


1) Reads records in arbitrary order from the flows connected to the in port.
2) Distributes the records to the flows connected to the out port, according to the
Expression in the function parameter

Partition by Key:

Partition by Key distributes records to its output flow partitions according
to key values.

Parameters:

Key---The key field name(s) on which the data is distributed.

Partition by Key does the following:

1) Reads records in arbitrary order from the in port.


2) Distributes records to the flows connected to the out port, according to the key
parameter, writing records with the same key value to the same output flow
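
A minimal Python sketch of the "same key value goes to the same output flow" guarantee, using a stable hash of the key modulo the number of partitions (illustrative only; the hashing scheme and record layout are made up, not the Co>Operating System's):

import zlib

def partition_by_key(records, key, n_partitions):
    """Send every record with the same key value to the same partition."""
    flows = [[] for _ in range(n_partitions)]
    for rec in records:
        # A stable hash of the key value decides the target partition.
        idx = zlib.crc32(str(rec[key]).encode()) % n_partitions
        flows[idx].append(rec)
    return flows

records = [{"cust_id": c, "amt": a} for c, a in
           [(101, 5), (102, 7), (101, 9), (103, 2)]]
for i, flow in enumerate(partition_by_key(records, "cust_id", 2)):
    print("partition", i, flow)   # both cust_id 101 records land together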

Partition by Percentage:

Partition by Percentage distributes a specified percentage of the total
number of input records to each output flow.

Parameters:

Percentages---List of percentages expressed as integers from 1 to 100, separated by
spaces.

Partition by Percentage does the following:

1) Reads records from the in port.


2) Writes a specified percentage of the input records to each flow on the out
port.

Partition by Range:

Partition by Range distributes records to its output flow partitions
according to the ranges of key values specified for each partition.

Parameters:

Key---Name(s) of the field(s) containing the key values to determine which partition to
add each input record to.
Partition by Range does the following:

1) Reads splitter records from the split port, and assumes that these records are sorted
according to the key parameter.
2) Determines whether the number of flows connected to the out port is equal to n (where
n–1 represents the number of splitter records). If not, Partition by Range writes an error
message and stops execution of the graph.

3) Reads records in arbitrary order from the flows connected to the in port.
4) Distributes the records to the flows connected to the out port according to the values of
the key field(s), as follows:

. Assigns records with key values less than or equal to the first splitter record to
the first output flow
. Assigns records with key values greater than the first splitter record, but less
than or equal to the second splitter record, to the second output flow, and so on

Partition by Round-robin:

Partition by Round-robin distributes blocks of records evenly to each
output flow in round-robin fashion.

Parameters:

Block-size---Number of records you want the component to distribute to one flow before
distributing the same number to the next flow.

Default is 1.

Partition by Round-robin does the following

1) Reads records from the in port.


2) According to the block_size parameter, distributes blocks of records to its
output flows in the order in which the flows are connected

Partition with Load Balance:

Partition with Load Balance distributes records to its output flow
partitions by writing more records to the flow partitions that consume records faster.

Partition with Load Balance does the following:

1) Reads records in arbitrary order from the flows connected to its in port.
2) Distributes those records among the flows connected to its out port by sending
more records to the flows that consume records faster.
3) Writes records until each flow's output buffer fills up.
De-Partition Components:

The departition components combine multiple flow partitions or multiple straight
flows into a single flow.

Concatenate:

Concatenate appends multiple-flow partitions of records one after another

Concatenate does the following:

1) Reads all records from the first flow connected to the in port (counting from
top to bottom on the graph) and copies them to the out port.
2) Reads all records from the next flow connected to the in port and appends
them to the records from the previously processed flow.
3) Repeats Step 2 for each subsequent flow connected to the in port.

Gather:
Gather combines records from multiple flow partitions arbitrarily.

Gather does the following:

1) Reads records from the flows connected to the in port.


2) Combines the records arbitrarily.
3) Writes the combined records to the out port.

Interleave:
Interleave combines blocks of records from multiple flow partitions in round-
robin fashion.
You can use Interleave to undo the effects of PARTITION BY ROUND-
ROBIN.

Parameters:
Block-size--- Number of records Interleave reads from each flow before reading the same
number of records from the next flow.

Default is 1.

Interleave does the following:

1) Reads the number of records specified in the block size parameter from the first
flow connected to the in port.
2) Reads the number of records specified in the block size parameter from the next
flow, and so on.
3) Writes the records to the out port.

Merge:
Merge combines records from multiple flow partitions that have been sorted
according to the same key specifier, and maintains the sort order.

Parameters:

Key---Name(s) of the key field(s) and the sequence specifier(s) you want Merge to use to
maintain the order of data records while merging them.
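
A rough Python analogy for Merge: combine partitions that are already sorted on the same key while keeping the overall sort order (heapq.merge does exactly this for individually sorted iterables; the sample partitions are made up):

import heapq
from operator import itemgetter

# Two flow partitions, each already sorted on the key "id".
part0 = [{"id": 1, "v": "a"}, {"id": 4, "v": "d"}, {"id": 7, "v": "g"}]
part1 = [{"id": 2, "v": "b"}, {"id": 4, "v": "e"}, {"id": 9, "v": "i"}]

# The merge maintains the overall sort order across partitions.
merged = list(heapq.merge(part0, part1, key=itemgetter("id")))
print([r["id"] for r in merged])   # [1, 2, 4, 4, 7, 9]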

Graph Parameters:

We have 2 types of graph parameters available in Ab initio.

1) Input (or) formal parameters.


2) Local parameters.

Input parameters:

An input parameter at the graph level gets its value from outside the graph when
it runs.
If we click the Input check box (in Edit --> Parameters, shortcut F11), then that
parameter becomes an input parameter.

Local parameters:

A local parameter at the graph level gets its value from inside the graph, when
you're defining it or working on the graph. In other words, you set a graph's local
parameter's value when you're editing the graph in the GDE — not when you're running
the graph.

If we unclick the Input check box (in Edit --> Parameters, shortcut F11), then that
parameter becomes a local parameter.

Order of evaluation of parameters:

When you run a graph, parameters are set and references evaluated by the
Co>Operating System in the following order:

1) The host setup script is run.


2) Common project (sandbox) parameters are evaluated.
3) Project (sandbox) parameters are evaluated.
4) The project-start.ksh script is run.
5) Input parameters are evaluated.
6) Graph (local) parameters are evaluated.
7) The graph Start Script is run.
8) DML evaluation occurs.

Conditional components:

You can make any component or sub graph conditional by specifying a conditional
expression that the Co>Operating System evaluates at runtime to determine whether to
run the component or sub graph.

To add the Condition tab, go to Settings ---> Preferences ---> Advanced ---> Components
---> click the Display condition tab check box ---> OK.

In the parameters we have to specify an expression that evaluates to either 1 or 0, or "True"
or "False".

The following is the set of valid false values:

"0" (the string) or 0 (the integer)
"false" or "False"
"F" or "f"

Any other string or integer value is considered to be true.

We have 2 types of disable methods:

a) Remove completely---The Co>Operating System removes the component and any flows
connected to it. If such a flow was connected to a required port of another component,
the Co>Operating System also removes that component and its flows, continuing the
process until it reaches an optional port.

b) Replace with flow (default)--The Co>Operating System disables the component and
uses a flow to connect the remaining components straight through in its place.

Sub graphs:

A sub graph is a piece of a graph containing components and flows. We can create a
sub graph in 2 ways. We can easily identify a sub graph by its double-line border.

1) Creating a new sub graph from scratch.
2) Creating a sub graph from components in an existing graph.

To create a new sub graph from scratch:


1) From the Insert menu, choose Empty Sub graph.

The New Empty Sub graph dialog opens.

2) In the New graph name box, enter the name you want to use for this sub graph.
You can use alphanumeric characters, numbers, and special characters such
as underscores in the name. Non-alphanumeric characters are eliminated when you
deploy the graph.

3) Click OK.
An empty sub graph icon, labeled with the name you specified, is
displayed in the workspace.

4) Double-click the empty sub graph. The sub graph expands into a drawing canvas.

5) Create the sub graph on the canvas by inserting components from the Component
Organizer and connecting them with flows.

6) Add ports to the sub graph.

7) (Optional) Create parameters for the sub graph.

To create a sub graph from components in an existing graph:

1) In an existing graph, select the components that the sub graph should contain.
2) From the GDE menu bar, choose Edit > Sub graph > Create from Selection.
The Create Sub graph From Selection dialog opens.

(Or)

Right click on the any of the selected component, Subgraph > Create from Selection

3) In the New graph name box, enter a name for the sub graph.

Sort Components:

Sort:

Sort sorts and merges records. You can use Sort to order records before you send
them to a component that requires grouped or sorted records.
The in port of the Sort component accepts fan-in and all-to-all flows (partitioned data).
Sort automatically performs a gather on its in port, so there's no need to gather data
before sorting it.
If the sort key field contains NULL values, the NULL records are listed first with
ascending sort order and last with descending sort order.

Parameters:

Key: Name(s) of the key field(s) and the sequence specifier(s) you want the component to
use when it orders records.
Max-core: Maximum memory usage in bytes.
Default is 100663296 (96 MB).
When the component reaches the number of bytes specified in the max-core parameter, it
sorts the records it has read and writes a temporary file to disk.

Runtime behavior of SORT:


Sort does the following:
1. Reads the records from all flows connected to the in port until it reaches the
number of bytes specified in the max-core parameter.
2. Sorts the records and writes the results to a temporary file on disk.
Sort stores temporary files in the working directories specified by its layout.
3. Repeats Steps 1 and 2 until it has read all records.
4. Merges all temporary files, maintaining the sort order.
5. Writes the result to the out port.

To estimate how many temporary files are created, we can use the formula:

3 * (input file size) / (max-core size)
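
For example, with the default max-core of 96 MB, sorting a roughly 1 GB input would write about 3 * 1024 / 96 = 32 temporary files (a rule-of-thumb estimate only; the input size here is hypothetical):

# Rule-of-thumb estimate of temporary files written by Sort.
input_size_mb = 1024        # hypothetical 1 GB input
max_core_mb = 96            # default max-core (96 MB)
temp_files = 3 * input_size_mb / max_core_mb
print(round(temp_files))    # about 32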

Sort Within Groups:

Sort within Groups refines the sorting of records already sorted
according to one key specifier.
It sorts the records within the groups formed by the first sort according to a second key
specifier.
If the sort key field contains NULL values, the NULL records are listed first with
ascending sort order and last with descending sort order.

Parameters:
Major-key: Specifies the name of the key, according to which the component assumes the
input has already been ordered.

Minor-key: Name(s) of the key field(s) you want the component to use when it refines
the sorting of records.
Max-core: Maximum memory usage in bytes before the component starts writing the
excess to disk.
Default is 10485760 (10 MB).

Allow-unsorted: Set to True to allow input not sorted according to the major-key
parameter.
When you set allow-unsorted to True, the boundary between groups of records occurs
when any change occurs in the value(s) of the field(s) specified in the major-key
parameter.
Default is False.

SAMPLE:
Sample selects a specified number of records at random from one or more
input flows.

Parameters:
Sample-size: Number of records you want to send to output.

Partition by Key and Sort:

Partition by Key and Sort partitions records by key values
and then sorts the records within each partition.

Partition by Key and Sort is a sub graph that contains two components, Partition by Key
and Sort.
Parameters:

Key: Name(s) of the key field(s) you want the component to use when it repartitions and
sorts records.
Max_core: Maximum memory usage in bytes.
The default value of Max_core is 100663296 (96 MB).
When the component reaches the number of bytes specified in the Max_core parameter, it
sorts the records it has read and writes a temporary file to disk.

Transform Components:

Reformat:
Reformat changes the format of records by dropping fields, or by using DML
expressions to add fields, combine fields, or transform the data in the records.

Parameters:
Count: Sets the number of:
out ports
reject ports
error ports
transform parameters
Default is 1.

Transform (n): Name of a file containing a transform function, or a transform string,
corresponding to an out port; n represents the number of an out port.
Transform functions for Reformat should have one input and one output.
Select: Filter for records before reformatting.
Reject-threshold: The component's tolerance for reject events. We have the following 3
options.
Abort on first reject (default): The component stops execution of the graph at the first
reject event it generates.

Never abort: The component does not stop execution of the graph, no matter how many
reject events it generates.

Use limit/ramp: The component uses the settings in the limit and ramp parameters to
determine how many reject events to allow before it stops execution of the graph.
Ramp: Rate of toleration of reject events in the number of records processed.
When the reject-threshold parameter is set to Use ramp/limit, the component uses the
value of the ramp and limit parameters in a formula to determine the component's
tolerance for reject events.
Default is 0.0.
Limit: A number representing reject events.
When the reject-threshold parameter is set to Use ramp/limit, the component uses the
values of the ramp and limit parameters in a formula to determine the component's
tolerance for reject events.
Default is 0.
Output-index: When you specify a value for this parameter, each input record goes to
exactly one transform-output port pair. For example, suppose there are 100 input records
and two output ports. Each output port receives between 0 and 100 records. According to
the transform function you specify for output-index, the split can be 50/50, 60/40, 0/100,
99/1, or any other combination that adds up to 100.

Output-indexes: The component calls the specified transform function for each input
record. The transform function uses the value of the input record to direct that input
record to particular transform-output ports.
The expected output of the transform function is a vector of numeric values. The
component considers each element of this vector as an index into the output transforms
and ports.

As an example of how the component uses the transform function specified in the output-
indexes parameter, consider the following:
out :: output_indexes(in) =
begin
out :1: if (in.kind == "a") [vector 0, 1, 2];
out :2: if (in.kind == "b") [vector 2, 3, 4];
out : : [vector 5];
end;

When you specify this transform function for the output-indexes parameter, it directs the
input record to the transform functions for:
Ports 0, 1, and 2 if the field in.kind is "a"
Ports 2, 3, and 4 if the field in.kind is "b"
Port 5 if otherwise
If the value returned by the output_index function does not correspond to an
output port, the corresponding record is ignored.

Logging: Specifies whether or not you want certain component events logged.

Log input: Specifies how often you want the component to send an input record to its log
port. The logging parameter must be set to True for this parameter to be available.
For example, if you select 100, the component sends every 100th input record to its log
port.

Log output: Specifies how often you want the component to send an output record to its
log port. The logging parameter must be set to True for this parameter to be available.
For example, if you select 100, the component sends every 100th output record to its log
port.

Log reject: Specifies how often you want the component to send a reject record to its log
port. The logging parameter must be set to True for this parameter to be available.
For example, if you select 100, the component sends every 100th reject record to its log
port.

Dedup Sorted:
Dedup Sorted separates one specified record in each group of records
from the rest of the records in the group.
Parameters:
Key: Name(s) of the key field(s) you want the component to use when determining
groups of data records.
Select: Filters the records according to the expression before the component separates
duplicates.
Keep: Specifies which records the component keeps to write to the out port. We have 3
options for this parameter:
a) first (default) -- keeps the first record of a group
b) last -- keeps the last record of a group
c) unique-only -- keeps only records with unique key values
The component writes the remaining records of each group to the dup
port.
Check-sort: Specifies whether you want processing to abort on the first record that is out
of sorted order.
True causes processing to abort on the first record out of order.
Default is true.
Reject threshold: Similar to reformat component, check reformat for details.
Logging: Similar to reformat component, check reformat for details.
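
A rough Python sketch of the three keep options on grouped (key-sorted) input (illustrative only; the records and key names are made up):

from itertools import groupby
from operator import itemgetter

# Input already sorted on the key, as Dedup Sorted requires.
records = [{"k": 1, "v": "a"}, {"k": 1, "v": "b"},
           {"k": 2, "v": "c"},
           {"k": 3, "v": "d"}, {"k": 3, "v": "e"}, {"k": 3, "v": "f"}]

def dedup_sorted(records, keep="first"):
    out, dup = [], []
    for _, grp in groupby(records, key=itemgetter("k")):
        grp = list(grp)
        if keep == "first":
            out.append(grp[0]); dup.extend(grp[1:])
        elif keep == "last":
            out.append(grp[-1]); dup.extend(grp[:-1])
        elif keep == "unique-only":
            (out if len(grp) == 1 else dup).extend(grp)
    return out, dup

out, dup = dedup_sorted(records, keep="unique-only")
print(out)   # only the key-2 record survives; all duplicates go to dup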

Filter By Expression:
Filter by Expression filters records according to a DML
expression, which specifies the selection criteria.
Parameters:
select_expr: Filters records according to the DML expression specified.
Reject threshold: Similar to reformat component, check reformat for details.
Logging: Similar to reformat component, check reformat for details.

Filter by Expression does the following:

1) Reads data records from the in port.


2) Applies the expression in the select_expr parameter to each record. If the expression
returns:

Non-0 value — Filter by Expression writes the record to the out port.
0 — Filter by Expression writes the record to the deselect port. If you do not
connect a flow to the deselect port, Filter by Expression discards the records.
NULL — Filter by Expression writes the record to the reject port and a descriptive
error message to the error port.

Multi Reformat:
Multi Reformat changes the format of records flowing from one to 20
pairs of in and out ports by dropping fields or by using
DML expressions to add fields, combine fields, or transform data in the records.
Parameters:
All parameters are same as reformat except default count is 2 in this component.

Join:
Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output
port. Additional ports allow you to collect rejected and unused records.
Parameters:

Count: An integer n specifying the total number of inputs (in ports) to join. The number
of input ports determines the number of the following ports and parameters:
Unused ports
reject ports
error ports
Record-required parameters
Dedup parameters
Select parameters
Override-key parameters

Default is 2.

Sorted-input: When set to Inputs must be sorted, the component requires sorted input,
and the maintain-order parameter is not available.

Default is Inputs must be sorted.


When set to in memory: Input need not be sorted, the component accepts
unsorted input and permits the use of the maintain-order parameter.

Key: Name(s) of the field(s) in the input records that must have matching values for Join
to call the transform function.

Transform: Either the name of the file containing the transform function, or a transform
string. The transform function takes the number of input arguments specified in the count
parameter.

Join-type: We have 3 types of joins.

Inner join (default) — sets the record-required parameters for all ports to True.
The GDE does not display the record-required parameters, because they all have the same
value.

Outer join — sets the record-required parameters for all ports to False. The GDE
does not display the record-required parameters, because they all have the same value.
Explicit — allows you to set the record-requiredn parameter for each port
individually.

Record-required (n): This parameter is available only when the join-type parameter is set
to Explicit. Suppose we have two input ports, in0 and in1, and we set record-required0 to
True and record-required1 to False (a left outer join): all the records from in0 and only the
matching records from in1 will pass to the transform. Similarly, if we set record-required0
to False and record-required1 to True (a right outer join), then all the records from in1 and
only the matching records from in0 will pass to the transform (see the sketch after the
parameter list below).

Dedup (n): We have two options for this parameter:

. Do not dedup this input (default)
. Dedup this input before joining

Set the dedup(n) parameter to "dedup this input before joining" to remove
duplicates from the corresponding in(n) port before joining.

Select (n): Filters for records before a join function. One per inn port; n represents the
number of an in port. If you use select (n) with dedup(n), the Join component performs
the select first, then removes the duplicate records that made it through the select.

Override-key (n): Alternative name(s) for the key field(s) for a particular inn port.

There is one override-key (n) parameter per in (n) port. The n corresponds to
the number of an in port.
Max-memory: Maximum memory usage in bytes before the component writes temporary
files to disk. Available only when the sorted-input parameter is set to Inputs must be
sorted. The default value is 8MB.

Driving: The driving input is the largest input, as specified by the driving parameter. All
other inputs are read into memory.
The driving parameter is available only when the sorted-input parameter is set to In
memory: Input need not be sorted.
Default driving port in0.
Max-core: Available only when the sorted-input parameter is set to in memory: Input
need not be sorted. If the total size of the intermediate results that the component holds in
memory exceeds the number of bytes specified in the max-core parameter, the
component writes temporary files to disk. Default value is 64MB.

Reject threshold: Same as specified in Reformat.
Logging: Same as specified in Reformat.
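
A minimal Python sketch of how record-required drives inner/left outer behaviour for two inputs keyed on cust_id (illustrative only; it assumes one record per key per input and ignores sorting, multiple matches, reject handling and transforms):

def join(in0, in1, key, required0=True, required1=True):
    """required0/required1 mirror record-required0/record-required1."""
    idx0 = {r[key]: r for r in in0}
    idx1 = {r[key]: r for r in in1}
    out = []
    for k in sorted(set(idx0) | set(idx1)):
        r0, r1 = idx0.get(k), idx1.get(k)
        if (r0 is None and required0) or (r1 is None and required1):
            continue                     # drop: a required side has no match
        out.append({"key": k, "in0": r0, "in1": r1})
    return out

in0 = [{"cust_id": 1, "name": "Alex"}, {"cust_id": 2, "name": "Tom"}]
in1 = [{"cust_id": 2, "amt": 50}, {"cust_id": 3, "amt": 70}]

print(join(in0, in1, "cust_id"))                    # inner join: only cust_id 2
print(join(in0, in1, "cust_id", required1=False))   # left outer: cust_id 1 and 2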

SCAN:
For every input record, Scan generates an output record that includes a running
cumulative summary for the group to which the input record belongs.

Parameters:
Sorted-input: When set to in memory: Input need not be sorted, the component accepts
ungrouped input, and requires the use of the max-core parameter.
When set to Input must be sorted or grouped, the component requires grouped input, and
the max-core parameter is not available.

Default is Input must be sorted or grouped

Key: The key column to group the records.

Transform: Either the name of the file containing the types and transforms functions, or a
transform string

This is a simple example of how to get the second highest salary in a group.

type temporary_type =
record
decimal(5) sal;
end;

/* This function is optional. */


/*Initialize temporary*/
temp::initialize(in)=
begin
temp.sal ::0;
end;

/*Do computation*/
temp::scan(temp,in)=
begin
temp.sal :: temp.sal+in.salary;
end;

/*Create output record*/


out::finalize(temp,in)=
begin
out.salary :: temp.sal;
out.* :: in.*;
out.rank :: next_in_sequence();
end;

out::output_select(out)=
begin
out :: 2;
end;

Max-core: Maximum memory usage in bytes.

Available only when the sorted-input parameter is set to In memory: Input need not be
sorted.

Default is 10 MB

Reject threshold: Same as specified in Reformat.
Logging: Same as specified in Reformat.

Runtime behavior of Scan:

At runtime, Scan does the following:
1) Input selection: If you have not defined the input_select function in your transform,
Scan processes all records.

If you have defined the input_select function, Scan filters the input records accordingly.
If the function evaluates to 0 for a particular record, Scan does not process the record. In
other words, the record does not appear on any output port.

If the function produces NULL for a particular record, scan:

Writes the record to the reject port


Writes a descriptive error message to the error port

If you do not connect flows to the reject or error ports, Scan discards the information. If
the function evaluates to anything other than 0 or NULL for a particular record, Scan
processes the record.

2) Temporary initialization: Scan passes the first record in each group to the initialize
transform function. The initialize transform function creates a temporary record for the
group, with record type temporary_type.

3) Computation: Scan calls the scan transform function for each record in a group,
including the first, using that record and the temporary record for the group as arguments.
The scan transform function returns a new temporary record.

4) Finalization: Scan calls the finalize transform function once for every input record.
Scan passes the input record and the temporary record that the scan function returned to
the finalize transform function. The finalize transform function produces an output record
for each input record.

5) Output selection: If you have not defined the output_select transform function, Scan
writes all output records to the out port. If you have defined the output_select transform
function, Scan filters the output records. The output_select transform function takes a
single argument — the record produced by finalization — and returns a value of 0 (false)
or non-0 (true). Scan ignores records for which output_select returns 0; it writes all others
to the out port.
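
A rough Python analogy of the initialize/scan/finalize cycle: a running cumulative salary per group on key-sorted input (illustrative only; this mirrors the idea, not the DML example above, and the data is made up):

from itertools import groupby
from operator import itemgetter

# Input sorted on the group key "dept", as Scan (sorted mode) expects.
rows = [{"dept": "D1", "salary": 100}, {"dept": "D1", "salary": 200},
        {"dept": "D2", "salary": 50},  {"dept": "D2", "salary": 75}]

out = []
for _, grp in groupby(rows, key=itemgetter("dept")):
    running = 0                       # initialize: temporary record per group
    for rec in grp:
        running += rec["salary"]      # scan: update the temporary record
        out.append({**rec, "cum_salary": running})  # finalize: one output per input

print(out)   # every record carries the running total for its group so far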

Normalize:
Normalize generates multiple output records from each of its input records.
You can directly specify the number of output records for each input record, or you can
make the number of output records dependent on a calculation.

Parameters:

Transform: The name of the file containing the types and transform functions, or a
transform string.

Length: This transform function specifies the number of times the normalize function will
be called for the current record. This function takes the input record as an argument.

A simple example of input dml for normalize is as below.

record
string(7) customer_id;
decimal(2) ntrans;
record
date("YYYY.MM.DD") transdate;
decimal(8.2) amount;
end transactions[ntrans];
string(1) newline;
end
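
A minimal Python sketch of what Normalize does with a record shaped like the DML above: the nested transactions vector of length ntrans is flattened into ntrans separate output records (illustrative only; the sample values are made up):

# One input record shaped like the DML above (ntrans nested transactions).
rec = {
    "customer_id": "C000001",
    "ntrans": 3,
    "transactions": [
        {"transdate": "2011.03.18", "amount": 10.00},
        {"transdate": "2011.03.19", "amount": 25.50},
        {"transdate": "2011.03.20", "amount": 7.75},
    ],
}

# length(in) = in.ntrans; the normalize function is called once per index.
out = []
for i in range(rec["ntrans"]):
    t = rec["transactions"][i]
    out.append({"customer_id": rec["customer_id"],
                "transdate": t["transdate"],
                "amount": t["amount"]})

print(len(out), out[0])   # 3 output records from 1 input record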

Rollup:
Rollup evaluates a group of input records that have the same key, and then
generates records that either summarize each group or select certain information from
each group.

Sorted-input: When set to in memory: Input need not be sorted, the component accepts
ungrouped input, and requires the use of the max-core parameter.
When set to Input must be sorted or grouped, the component requires grouped input, and
the max-core parameter is not available.
Default is Input must be sorted or grouped.

Key: Name(s) of the key field(s) the component can use to group or define groups of
records. If the value of the key-method parameter is Use key specifier, you must specify a
value for the key parameter.
Transform: Either the name of the file containing the types and transform functions, or a
transform string.
Max-core: Maximum memory usage in bytes.
Available only when the sorted-input parameter is set to In memory: Input need not be
sorted.
If the total size of the intermediate results the component holds in memory exceeds the
number of bytes specified in the max-core parameter, the component writes temporary
files to disk.
Default is 67108864 (64 MB).
Reject-threshold: Same as Reformat. Please check reformat for details
Logging: Same as Reformat. Please check reformat for details

We have the following functions available in Rollup.

Count: Counts the records in a group.
Sum: Sums a field's values for a group.
Min: Picks the minimum value of that group.
Max: Picks the maximum value of that group.
Avg: Computes the average value of that group.
Product: Computes the product of that group.
First: Picks the first record of that group.
Last: Picks the last record of that group.
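
A rough Python equivalent of a keyed rollup producing count, sum, min, max and average per group (illustrative only; the data and key names are made up):

from itertools import groupby
from operator import itemgetter

rows = [{"dept": "D1", "salary": 100}, {"dept": "D1", "salary": 300},
        {"dept": "D2", "salary": 250}]
rows.sort(key=itemgetter("dept"))      # Rollup (sorted mode) needs grouped input

summary = []
for dept, grp in groupby(rows, key=itemgetter("dept")):
    sal = [r["salary"] for r in grp]
    summary.append({"dept": dept, "count": len(sal), "sum": sum(sal),
                    "min": min(sal), "max": max(sal), "avg": sum(sal) / len(sal)})

print(summary)   # one summarized output record per key group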

String Functions:

decimal_lpad:

This function pads the left side of the string with the specified character to
match the specified character length.
This function also discards leading and trailing non decimal characters, and
you can specify either of two characters to represent the decimal point.

If str contains a minus sign, the function moves the minus sign to the left of
any padding, or to the left of the first digit.

decimal_lpad("42", 5) "00042"


decimal_lpad("42 ", 5, "#")"###42"
decimal_lpad("-42", 5)"-0042"

decimal_lrepad:

Same as decimal_lpad, but it also trims the leading zeros.

m_eval 'decimal_lrepad("00042",5,"#")' => "###42"

decimal_strip:

Returns a decimal from a string that has been trimmed of leading zeros and
non-numeric characters.

decimal_strip("-0042 ")"-42"
decimal_strip("+$42")"42"
decimal_strip("garbage")"0"

string_downcase:

Returns a string with any uppercase letters converted to lowercase.

string_downcase("abcXYZ")"abcxyz"

string_upcase:

Returns a string with any lowercase letters converted to uppercase.

string_upcase("abcXYZ")"ABCXYZ"

string_lrtrim:

Returns a string trimmed of leading and trailing blank characters.

string_lrtrim(" abc ")"abc"


string_lrtrim("John")"John"
string_ltrim:

Returns a string trimmed of leading blank characters.

m_eval 'string_ltrim(" abc ")' => "abc "

string_trim:

Returns a string trimmed of trailing blank characters.

m_eval 'string_trim(" abc ")' => " abc"

string_lpad:

Returns a string of a specified length, padded with a given character.

string_lpad("abc", 5)" abc"


string_lpad("abc", 5, "#")"##abc"
m_eval 'string_lpad(" abc ",6,"#")'"# abc "

string_lrepad:

Returns a string of a specified length, trimmed of leading and trailing


blanks and left-padded with a given character.

string_lrepad("abc", 5)" abc"


string_lrepad(" abc ", 5, "#")"##abc"

string_pad:

Returns a right-padded string.

string_pad("abc", 5)"abc "


string_pad("abc", 5, "#")"abc##"
m_eval 'string_pad(" abc ",6,"#")'" abc #"

string_index:

Returns the index of the first character of the first occurrence of a string
within another string.
string_index("abcdefgh eieio", "e")5
string_index("John Smith", ":")0
string_index("qwerty", "")1
m_eval 'string_index("","abc")0

string_rindex:

Returns the index of the first character of the last occurrence of a string within
another string.

string_rindex("abcdefgh eieio", "e")12


string_rindex("John Smith", "&")0
string_rindex("qwerty", "")7
m_eval 'string_rindex("","abc")'0

string_substring:

Returns a substring of a string. The returned substring can be of a
specific length and can begin at a specific character.

If start is less than 1, the function sets the value of start to 1.
If length is less than 1, the function sets the value of length to 0.
If start is greater than the length of the string, the function returns "".
If length is greater than the length of the string, the function returns just the
available characters.
The function returns NULL if any argument is NULL.

string_substring("John Smith", 8, 5) => "ith"
string_substring("abcdefgh", 4, 3) => "def"
string_substring("qqqqqqqq", 9, 3) => ""
string_substring("abcdefgh", 3, 12) => "cdefgh"
string_substring("abcdefgh", -3, 4) => "abcd"

string_concat:

Returns the concatenation of multiple string arguments. There must be at
least two and no more than 25 strings. The function returns NULL if any argument is
NULL.

string_concat("abcd", "efgh") => "abcdefgh"
string_concat("John", " ", "Smith") => "John Smith"
string_concat("xyz", "") => "xyz"

re_replace:

Returns a string after replacing all substrings matching a regular expression.

m_eval 're_replace("123abccdf","[0-9]+","")' "abccdf"


m_eval 're_replace("abcd?#?","[$,#?]","")' "abcd"
m_eval 're_replace("abcd?#?","[a-z]","")' "?#?"

Inquiry and error functions:

first_defined:

Returns the first defined (non-NULL) value from among two to 26 arguments.

first_defined("test", "jan", "feb", "mar") => "test"
first_defined(NULL, "jan", "feb") => "jan"
first_defined(NULL, 42, 43, 44) => 42
m_eval 'first_defined(NULL, NULL, NULL, "JJJJ", "jan", "feb")' => "JJJJ"

force_error:

Raises an error and sends the specified message to the component's error or
similar port. force_error can be used to filter out records having particular values or to
handle records with badly formatted data.

e.g) if (trans_code > 1000) force_error("Value is out of bounds.");

is_defined:

Tests whether an expression results in a non-NULL value

This function returns:
1 if expr evaluates to a non-NULL value
0 if expr evaluates to NULL
Error if evaluating expr results in an error

is_defined(123) => 1
is_defined(NULL) => 0

is_null:
Tests whether an expression results in a NULL value.

This function returns:
1 if the expression evaluates to NULL
0 if the expression does not evaluate to NULL
Error if evaluating the expression results in an error

is_null(123) => 0
is_null(NULL) => 1

is_error:

Tests whether an error occurs while evaluating an expression

This function returns:
1 if evaluating the expression results in an error
0 otherwise

is_error(a + b) => 0
is_error(1 / 0) => 1

is_valid:

Tests whether the result of an expression is valid

The is_valid function returns:
1 if the result of evaluating expr is a valid data item
0 if the result of evaluating expr is invalid or evaluates to NULL

is_valid(1) => 1
is_valid((date("YYYYMMDD"))"19960504") => 1
is_valid((date("YYYYMMDD"))"19920231") => 0
is_valid((date("YYYYMMDD"))"abcdefgh") => 0

next_in_sequence():
Returns a sequence of integers on successive calls, starting with 1. Each partition
of each component has its own private sequence. The sequence of numbers returned is
shared among all callers within a component. For example, calls to next_in_sequence
from different transform functions associated with different outputs of a Reformat
component affect one another.

next_in_sequence() => 1
next_in_sequence() => 2
next_in_sequence() => 3

Date Functions:

now():
Returns the current local date and time. The returned datetime result is in the
default format YYYYMMDDHH24MISSnnnnnn.

now1():
Returns a datetime value based on the system's local date and time. The first time a
component calls now1, the function returns the current local time from the system. On
every subsequent call within that component, now1 returns the same value it returned on
the first call. The returned datetime result is in the default format
YYYYMMDDHH24MISSnnnnnn.

Database components:

Run SQL:

Run SQL executes SQL statements in a database and writes confirmation
messages to the log port. You can use Run SQL to perform database operations.

Parameters:

DBConfigFile: Name of the database configuration file, which contains the
database connection details.
SQLFile: Name of the file containing the SQL statements to be executed against the
database; alternatively, the SQL statement can be embedded here.
Interface: API (default) or utility; utility mode must be used when we have to
execute PL/SQL blocks or stored procedures.

ignore_drop_table_errors:

True — Suppress and ignore errors generated by the database when the SQL
attempts to drop a nonexistent table.
False (default) — Do not ignore such errors.

num_errors: The maximum number of errors allowed.
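
For illustration, the file referenced by SQLFile (or the embedded statement) contains ordinary SQL for the target database. A hypothetical example, assuming a source table named stg_customer:

-- hypothetical statements; table names are assumptions
drop table tmp_customer;
create table tmp_customer as
select * from stg_customer where status = 'A';

With ignore_drop_table_errors set to True, the drop statement above will not fail the graph on the first run, when tmp_customer does not yet exist.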

Update table:
Update Table executes UPDATE, INSERT, and DELETE statements in
embedded SQL format (or a stored procedure) to modify a table in a database, and writes
status information to the log port.

Parameters:

DBConfigFile: Name of the database configuration file, which contains the
database connection details.

UpdateSqlFile: Reference to an SQL statement (UPDATE, INSERT, or DELETE)
to be executed against the specified table.

InsertSqlFile: Reference to an SQL statement (UPDATE, INSERT, or DELETE)
to be executed against the specified table.
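
As a sketch of the idea only (verify the exact placeholder syntax against your Co>Operating System documentation), the SQL referenced by UpdateSqlFile typically binds input record fields by name. The table and field names below are hypothetical:

-- hypothetical embedded UPDATE; the :field placeholders are bound from the input record
update customer
   set credit_limit = :credit_limit
 where customer_id  = :customer_id

Each input record arriving at the component supplies the values bound to the placeholders.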

Join with DB:

Join with DB joins records from the flow or flows connected to its in port
with records read directly from a database, and outputs new records containing data based
on, or calculated from, the joined records.

Parameters:

DBConfigFile: Name of the database configuration file, which contains the
database connection details.

select_sql: The SELECT statement to perform for each input record.

Transform: The transform package used to compute the output record based on
the input record and the result from the select_sql statement.

e.g) out::join_with_db(in, query_result) =
begin
end;

The query_result record format is generated automatically from the select_sql
statement and the database configuration file.
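
Filling in the stub above, a hedged example might look like the following, assuming select_sql returns a single column called credit_limit for the current input record (all field names are hypothetical):

out::join_with_db(in, query_result) =
begin
  out.* :: in.*;                                  // pass the input fields through
  out.credit_limit :: query_result.credit_limit;  // value fetched from the database
end;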

Validate Components:

Generate records:
Generate Records generates a specified number of records that you can use
as test data.

Parameters:

num_records: Number of records to generate.


command_line: Specifies options that modify the values generated for
particular fields. The options are listed below, where field represents the generated field
to which the option applies. More than one option can apply to a field.

-default field -- Specifies the default value for field. Do not use this option
if the field does not have a default value.
-expression field 'expression' -- Evaluates expression and stores the
resulting value in field. Use a single-quoted string to enclose expression.

e.g) -expression cust_greeting 'string_concat("hello ", "there")'

-minimum field minimum -- Limits the smallest generated value of field to
minimum.
-maximum field maximum -- Limits the largest generated value of field to
maximum.
-sequential field minimum -- Writes a sequence of numeric or date values
for field, starting with minimum.

Below is a simple example that generates test data for the following DML record format.

record
decimal(3) id=100;
decimal(3) id1;
decimal(3) id2;
string(10) name;
end

command_line: -default id -expression name 'string_concat("sures","hkuma")'
-minimum id1 110 -maximum id1 150 -sequential id2 10

Create Data:

CREATE DATA produces a series of one or more output records every time
its graph is run. The records in the series can be identical or distinct. This component is
useful for generating data to test graphs, particularly in cases where you need a flow with
exactly one fixed record on it.

Parameters:
Num-records: The number of records to create.
Identical-records: Choose whether to evaluate the transform for each record (the default)
or to evaluate the transform once and use that value for all records.
Transform: Contains one function, create(index), which takes an integer(8)
value index and returns a record of a type suitable for the out port. For example, consider a
CREATE DATA out port defined as follows:

record
decimal(6) customer_id ;
string(28) name;
string(1) gender;
string(26) street_addr;
string(2) state;
string(1) newline;
end

If we specify the transform as below:

out::create(index)=
begin
out.customer_id :: index;
out.gender :1: if (index % 3) "M";
out.gender :2: "F";
out.name :: "ABC";
out.street_addr :: "";
out.state :: "AP";
out.newline :: "";
end;

customer_id values will come in sequence, starting from index 1.

For gender, every 3rd record will get "F" and the rest will get "M".

The remaining columns are hard-coded, so the same values appear in every output record.

Check Order:

Check Order tests input records to determine whether the records are
sorted according to the key specifier in the key parameter.

If Check Order detects an incorrectly sorted record, it writes a single-line error
message to the out port. Then:

If the number of incorrectly sorted records does not exceed the number specified
in the limit parameter, Check Order continues with execution of the graph.
If the number of incorrectly sorted records exceeds the number specified in the limit
parameter, Check Order stops execution of the graph.

Parameters:
Key: Name(s) of the key field(s) and the sequence specifier(s) you want the
component to use when it verifies the order of data records.
Limit: Maximum number of incorrectly ordered records before the component
stops execution of the graph.

Compare Records:

Compare Records reads records from two flows, compares the records one by one,
and writes one-line text reports of mismatching records to the out port.

Parameters:

Limit: Maximum number of mismatching records before the component stops
execution of the graph (default is 0).

Validate Records:

Validate Records separates valid records from invalid records by using the
is_valid function internally.

Parameters:

reject-threshold: Same as Reformat.
logging: Same as Reformat.

Compress Components:
Compress:
Compress uses the UNIX compress utility to reduce the volume of data in a
flow. Use Deflate instead of Compress.

Deflate:
Deflate reduces the volume of data in a flow by writing compressed records to
its out port.

Parameters:

Compression: Specifies how much to compress the data. Choose from the following:
0 — No compression
1 — Causes Deflate to run fastest, but compresses least
2 through 8 — Intermediate choices
9 — Causes Deflate to run slowest, but compresses most
Default value is 1, which causes Deflate to run fastest, but compress the least.

Uncompress:
Uncompress decompresses data that was compressed by the Compress
component or the GZip component.

Inflate:
Inflate decompresses data that is in certain compressed formats. Inflate cannot
decompress data that is not in gzip- or Z-compressed format — such as data compressed
by the UNIX pack utility. To uncompress such data, use the Uncompress component.

Introduction to Sandbox and Check-in and Check-out:

A sandbox is a special directory (folder) containing a certain
minimum number of specific subdirectories for holding Ab Initio graphs and related files.
The subdirectories have standard names that indicate their function.
The sandbox directory itself can have any name. The sandbox's
properties (including its parameters) are recorded in various special and hidden files that
lie in this top-level directory.
Data files are not usually stored in sandboxes. Sandboxes are for organizing and
storing metadata, not the data itself.
Check-In:

air project import /Projects/<Projectname>/InfoCatalog/<sandboxname> -basedir /ai/src/AbInitio/users/dev/<user_id>/sand/<sandboxname> -dry-run -files - <<'EOF'

Check-Out:

air project export /Projects/<Projectname>/InfoCatalog/<sandboxname> -basedir /ai/src/abinitio/users/dev/<user_id>/sand/<sandboxname> -dry-run -files - <<'EOF'

If we want to do the same through the GDE, go to Project > Check In (or) Check Out;
the check-in window and the check-out wizard will guide you through the steps.
The check-in window will ask for a check-in comment.
Tagging:

Tagging is the process of grouping a set of objects under a single tag. To
create a tag, use the command below:

air tag create -exact <tag-name> <filepath>

To list all the objects in a tag, use the command below:

air tag ls -p <tag-name>

To add an object to an existing tag:

air tag add <tag-name>


Air Commands:

To find the versions of an object, use the command below:

air object versions /Projects/lesson/mp/JoinCustomers.mp


The output might look like the following, showing the untagged versions 737 and 747 as
well:

Version  Date                 User    Tag   Comment
737      2005-05-13 11:54:02  asmith
738      2005-05-16 14:55:50  asmith  tag1
738      2005-05-16 14:55:50  asmith  tag2
738      2005-05-16 14:55:50  asmith  tag3
747      2005-05-16 15:10:40  asmith
748      2005-05-16 15:10:56  asmith  tag4

To lock an object, use the command below:

air lock set /Projects/lesson/mp/JoinCustomers.mp

To unlock an object, use the command below:

air lock release /Projects/lesson/mp/JoinCustomers.mp

To show the locks held on objects, use the commands below:

air lock show -user <username>

air lock show -all

To view an EME object's contents:

air object cat /Projects/lesson/dml/a-customers.dml


Performance Tips:

1) Over-reliance on databases: There are many things that you can (and should) do
outside the database. For example, operations involving heavy computation are usually
better done with components, in the graph, rather than in a database.

Sorting will almost always be faster when you use the SORT component rather than
sorting in the database.

2) Use phases in the correct places: plan for a limited number of components in each
phase.

3) Bad placement of phase breaks: wherever a phase break occurs in a graph, the data
in the flow is written to disk; it is then read back into memory at the beginning of the
next phase. For example, putting a phase break just before a FILTER BY
EXPRESSION is probably a bad idea: the size of the data is probably going to be
reduced by the component, so why write it all to disk just before that happens?
4) Use Sort components only when required, as sorting breaks pipeline parallelism.
5) Instead of using Filter by Expression, consider using the select parameter of Reformat or Join.
6) Apply parallelism in the initial stages of the graph.
7) Use filter components at the start of the graph; that is, remove unnecessary data in the
initial stages and pass only the required data to downstream components.
8) Assign the driving port of a Join correctly.
9) Use a lookup file rather than a Join if one input is huge and the other is small
(see the sketch after this list).
10) If you need to process an Input File in several different ways, connect multiple
flows directly to the Input File, rather than through a Replicate component. If you
can connect directly to a file, the downstream components will be able to read the
data independently, which can improve performance.
11) Log only information that will be regularly examined.
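
As referenced in tip 9, the small input can be loaded as a Lookup File and read inside a transform with the DML lookup function, instead of sorting and joining the large flow. A minimal sketch, where "Store Lookup" is the label of a hypothetical Lookup File keyed on store_id and the field names are assumptions:

out::reformat(in) =
begin
  out.* :: in.*;
  // fetch the matching record from the in-memory lookup file by key
  out.store_name :: lookup("Store Lookup", in.store_id).store_name;
end;

If the key might be missing from the lookup file, the result can be guarded, for example with lookup_count or first_defined.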

Common errors we will get while running graphs:

1) Broken pipe error: The error message "broken pipe" means that a downstream
component has gone away, so the flow is broken. For example, the database might
have run out of memory, making database components in the graph unavailable.

2) No space left on device: This error message indicates that the file system that holds
/var/abinitio (the default location for AB_WORK_DIR) has run out of space. To
solve this problem, you need to free up some space in the file system, usually by
deleting some files.
3) A NULL value is assigned to a NOT NULL column.
4) The records are out of order.
5) Metadata error.
6) The depths are not equal.
7) Too many open files.
8) The user name and password are not correct.
9) DML mismatches.
10) The input file is not present in the path.

Advanced Topics:

PDL:

PDL (the Parameter Definition Language) is a simple set of rules and notations for
referencing parameter values.

To use PDL in a graph, you must enable dynamic script generation:

1. In the GDE, choose Settings > Graph Settings. This opens the Graph Settings dialog.
2. Select Script in the left pane of the dialog.
3. In the Script Generation drop-down list, select Dynamic (rather than GDE 1.13
Compatible).

If we use PDL instead of shell-interpreted parameters, the Co>Operating System does not
have to invoke a shell on the UNIX server every time a parameter is evaluated. Because
PDL is Ab Initio's own parameter language, it generally resolves faster than shell parameters.
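
As an illustration of the notation (treat the exact syntax as an assumption to verify in your GDE/Co>Operating System version): in a PDL-interpreted parameter value, ${name} substitutes the value of another parameter, and $[ ... ] evaluates an inline DML expression. The parameter names below are hypothetical.

OUTPUT_FILE = ${AI_SERIAL}/customers_${REGION}.dat
REGION_UC   = $[string_upcase(REGION)]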

Meta Programming:

The DML metaprogramming functions allow you to dynamically
derive types and transforms from other types and transforms within a graph. You can use
metaprogramming functions to:
 Create a graph whose output type is automatically based on the input type.
 Create a graph to convert non-DML data, such as data in spreadsheets, to DML.
 Create a transform that is built from the value of a parameter.
 Create a transform and accompanying output record that identifies valid or invalid
record fields.
Component Folding:
Component folding can reduce graph startup time and memory
usage. It is a feature of the Co>Operating System that reduces the
number of processes created by your graphs and can enhance performance. It effectively
combines several separate graph components into a single group at run time. Each
partition of this group runs as a single process, conserving memory and streamlining
execution.

Dependency Analysis:
Dependency analysis is the process by which the EME examines
a project in its entirety and traces how data is transformed and transferred, field by field,
from component to component, within and between graphs and across projects. Once you
invoke an analysis, the results are viewable in a Web browser, where you can see:

 How operations that occur later (downstream) in a graph are affected by
components earlier (upstream) in the graph
 What data is operated on by each component, and what the operations are
 All details of every component in the project
 What happens to each field in a data set throughout a graph and across graphs
and projects
 All graphs that use a particular data set
 Issues that might prevent the graph from working in a different environment

The results of dependency analysis are available in the EME Web Interface.
PLANS:
A combination of tasks is called a plan. The following task types are available
in Ab Initio:

Program Task: We can call programs (scripts) in this task.
Graph Task: We can call graphs and psets in this task.
Plan Task: We can call other plans in this task.
Conditional Task: We specify a condition, like conditional components.
Sub plan: Similar to a subgraph.

PSETS:
Psets are files that contain the input parameter values for a graph. For a generic graph we
create different psets instead of creating many separate graphs.
