Data warehouse characteristics:
Subject-oriented: Data that gives information about a particular subject instead of about
a company's ongoing operations. For example, to learn more about your company's sales
data, you can build a warehouse that concentrates on sales. Using this warehouse, you
can answer questions like "Who was our best customer for this item last year?"
Integrated: Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never
removed. This enables management to gain a consistent picture of the business.
OLTP vs OLAP:
1) OLTP stands for Online Transaction Processing; OLAP stands for Online Analytical Processing.
2) OLTP contains only current data; OLAP contains current plus historical data.
3) Insert, update, and delete operations are performed in OLTP; OLAP is mainly queried with select operations.
4) OLTP data is volatile in nature; OLAP data is non-volatile.
5) OLTP involves many joins; OLAP involves few joins.
6) OLTP data is in normalized form; OLAP data is in denormalized form.
7) Aggregated data is rare in OLTP; aggregated data is common in OLAP.
3) What is a data mart?
A data mart is a subset of the data warehouse that concentrates on a single subject area or business line (for example, sales or finance).
What is data mining?
Data mining is used to estimate the future. For example, if we take a business organization, by using data mining we can predict the future of the business in terms of revenue, employees, customers, orders, and so on.
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both.
Dimension table:
A dimension table stores the descriptive attributes of a business entity. For example, if I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my dimension table will store the task and resource details.
Customer dimension --- contains all the details of the customer.
Product dimension --- contains all the details of the product.
Store dimension --- contains all the details of the store.
Time dimension --- contains all the details of time.
What was the cost of non-consumable goods sold in the northeast region in 1999?
What was the cost of kitchen products in New York City in the third quarter of 1999?
Fact Table:
A fact table consists of facts or measures. All the primary keys of the dimension tables are used as foreign keys in the fact table. A fact table consists of two sections: keys and measures.
a) Additive measures: Additive facts can be summed across all of the dimensions in the fact table.
b) Non-additive measures: Non-additive facts cannot be summed across any dimension (for example, ratios and percentages).
c) Semi-additive measures: Semi-additive facts can be summed across some dimensions but not others (for example, account balances can be summed across accounts but not across time).
Junk dimension:
Junk dimensions are dimensions that contain miscellaneous data (like flags and
indicators) that do not fit in the base dimension table.
Degenerate dimension:
A degenerate dimension is data that is dimensional in nature but stored in a fact table.
Best example is "Invoice Number" "bill number" "PO Number" these all are degenerated
dimensions in the Transaction Tables they can be maintained in the fact table itself
instead of crating separate dimensions for this.
Factless fact table:
A factless fact table is useful for connecting two fact tables. Factless fact tables can be used as bridge tables, or to store measures that are used directly without aggregation (for example, max() and min() attributes).
Normalization is the process of splitting the data of a single table into two or more tables.
First normal form: A table is said to be in first normal form if all the column values are atomic (no repeating groups).
Second normal form: A table is in second normal form if all its non-key fields are fully dependent on the whole key.
This means that each field in the table must depend on the entire key.
Fields that do not depend on the whole (combination) key are moved to another table, on whose key they do depend.
Third normal form: A table is said to be in third normal form if all the non key fields of
the table are independent of all other non key fields of the same table.
It is the final integration point in the ETL process before loading the data into the Data
Warehouse. It can support DSS processing.
Level of granularity:
In simple terms, level of granularity defines the extent of detail. As an example, let
us look at geographical level of granularity. We may analyze data at the levels of
COUNTRY, REGION, TERRITORY, CITY and STREET.
Surrogate key:
A surrogate key is a substitution for the natural primary key in data warehousing.
It is just a unique identifier or number for each row that can be used for the primary key
to the table.
If we have many sources, it is better to create new surrogate key instead of using
the natural primary keys of different sources.
Ab Initio
Ab Initio software helps you build large-scale data processing applications and run them
in parallel environments. Ab Initio software consists of three main programs:
Co>Operating System:
It is the server side of Ab Initio, where all the graphs run. It provides a distributed model for process execution, file management, process monitoring, checkpointing, and debugging.
To get the version of the Co>Operating System, execute the command m_env -v (or) m_env -version at the command prompt or on the UNIX command line.
The GDE (Graphical Development Environment) is the user's work area, where we create all our graphs.
Architecture of Ab Initio:
[Architecture diagram showing the GDE (Graphical Development Environment), the EME (Enterprise Meta Environment), the Co>Operating System, and the Component Organizer (F9).]
Before starting to work with the GDE, it is good practice to validate the run settings. That means checking whether the user/password credentials, the server (host) name, and the Co>Operating System details are valid.
Dataset: In simple terms, a dataset is a file. A file can be a mainframe file, or any fixed-width or delimited file.
Port: We have different types of ports, such as the input port, output port, reject port, error port, and log port.
Flow: A flow carries a stream of data between components in a graph. Flows connect
components via ports. Ab Initio supplies four kinds of flows with different patterns:
straight, fan-in, fan-out, and all-to-all.
Graph: A graph is a diagram that defines the various processing stages of a task and the
streams of data as they move from one stage to another. Visually, stages are represented
by components and streams are represented by flows. The collection of components and
flows comprise an Ab Initio graph.
Field: A field is equivalent to a column of a table in the database world. A field is also called a variable, which holds a value.
Key: A key is the unique identifier for a group of records. Keys are used in many places in Ab Initio development: to sort data, join two files, roll up data, and so on.
Symbols:
To do cues:
To-do cues are yellow highlighted areas with question marks that require
action.
Before you start acting on to-do cues, connect all the flows in the graph. This will remove
some of the to-do cues through propagation. Remove the remaining to-do cues from the
dataset components first, then from the program components (working from the outside
in).
The following table lists the different types of to-do cues and how to resolve them.
?? (Layouts): The component has no layout. For program components, layout is set either by propagation or manually. To resolve, double-click the layout indicator and select the layouts you want.
? (Parameters or file location): The component's required parameters or file locations lack values. To resolve, double-click the square box and fill in the missing parameters or file locations.
out* (Port names): The record format for the port is not set. Record formats are usually propagated, but sometimes are set manually. To resolve, connect the port to another port with a known record format, or double-click the port and add the new record format.
Status Indicators while running graph:
For quick diagnosis of the pass/fail state of components and graphs, the GDE displays
status indicators when you run a graph. The status is shown with colored LEDs, as
follows:
Types of files:
Ab Initio datasets can be serial files or multifiles. A Multifile System (MFS) is created using a command called m_mkfs, and deleted using m_rmfs.
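As a rough sketch (directory paths and the partition count are placeholders; verify the exact argument form against your Co>Operating System documentation):
m_mkfs /u01/mfs/mfs_4way /data1/mfs/p0 /data2/mfs/p1 /data3/mfs/p2 /data4/mfs/p3   # create a 4-way MFS: control directory followed by its data partition directories
m_rmfs /u01/mfs/mfs_4way                                                           # remove the MFS when it is no longer needed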
When you open a multifile, you will see the data partition paths as shown below.
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_001/mfs_6way/main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_002/mfs_6way/main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_003/mfs_6way/main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_004/mfs_6way/main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_005/mfs_6way/main/testpart.dat"
"file://d00/ai/id/abinitio/ln1/data/parts/mfs_6way_006/mfs_6way/main/testpart.dat"
When you open the actual data path, you will be able to see the data.
Delimited.dml
record
decimal('|') cust_id;
string('|') last_name;
string('|') first_name;
string('|') street_addr;
string('|') state;
decimal('|') zip;
string('|') gender;
decimal('\n') income;
end
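A hypothetical data record matching this delimited format (values are made up) would look like:
100234|Smith|John|12 Main Street|NY|10001|M|55000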
Example of a fixed-width record format:
record
decimal(6) cust_id; // Customer ID
string(18) last_name; // Last name
string(16) first_name; // First name
string(26) street_addr; // Street address
string(2) state; // State
decimal(5) zip; // Zipcode
string(1) gender; // Gender (M = male; F = female)
decimal(7) income; // Income (in dollars)
string(1) newline;
end
For Ab Initio we use UNIX as the server, where the Co>Operating System is installed. Everything that can be done through the GDE can also be done from UNIX, except building graphs.
Propagate from neighbors (default)--Sets the selected component's layout from the layouts of neighboring components that are connected by straight flows.
Hosts--Specifies the names of the hosts where the partitions of the component run.
.database--This option is enabled only for table components. The four options below are available for a database layout.
Watchers:
A watcher automatically adds intermediate files to flows so you can view the
data that passes through the flows.
The debugger places watcher files in the layout of the component downstream of the
watcher.
To delete the watcher files, choose Debug > Delete Watcher Datasets, or run m_rm -f -rmdata <GDE-WATCHER-xxx>.
A phase is a stage of a graph that must run to completion before the start of the next phase. By dividing a graph into phases, you can make the best use of resources such as memory, disk space, and CPU cycles.
When you set a phase in the GDE, by default it has a checkpoint at its phase break.
If the execution of the graph successfully passes the first checkpoint, the Co>Operating System saves all the information it needs to restore the job to its state at that checkpoint.
As the execution of the graph successfully passes each succeeding checkpoint, the Co>Operating System:
. Deletes the information it had saved to restore the job to its state at the preceding checkpoint.
. Deletes the temporary files it has written in the layouts of the components in all phases since the preceding checkpoint.
. Commits the effects on the file system of all phases since the preceding checkpoint.
Input file:
Description:
Access:
File handling: Select the check box if you want to delete the file after the input phase completes.
File protection: user, group, others (all have read and write access by default).
Parameters:
Ports:
In a REFORMAT component that has multiple output ports labeled out0, out1,
out2, and so on, selecting Propagate Through affects the out0 port, but not the other outn
ports. Similarly, in a JOIN component that has multiple input ports labeled in0, in1, and
so on, selecting Propagate Through affects the in0 port, but not the other inn ports.
Output file:
Represents one file, many files, or a multifile as an output from your graph
File handling:
.create file (default)
.fail
Check box, don't rollback file if job fails---Keeps the file state after a job fails. Useful for retaining log files.
Default is read and write for all users, groups and others.
Intermediate files:
File handling:
All the options of the output file, plus a check box to delete the file after the input phase completes (on the Access tab).
Lookup files:
A Lookup File can make graph processing quicker and more efficient. A Lookup File represents one or more serial files or a multifile. The amount of data associated with a Lookup File should be small enough to be held in main memory.
A Lookup File is not connected to other graph components, but its associated data is accessible from other components.
Key--Name of the key field against which the component matches its arguments.
Record Format--The record format you want the component to use when returning data records.
Lookup functions:
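The commonly used DML lookup functions are lookup, lookup_count, lookup_match, and lookup_next. A minimal sketch of a Reformat transform that uses a lookup file (the lookup file label "Customers" and the field names are assumptions, not from this document):
out :: reformat(in) =
begin
  /* Fetch the customer name from the "Customers" lookup file; use a default when no match is found */
  out.cust_name :: if (lookup_count("Customers", in.cust_id) > 0)
                      lookup("Customers", in.cust_id).name
                   else "UNKNOWN";
  out.* :: in.*;
end;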
Input table:
A database configuration file (.dbc) is used to connect to the database through the GDE.
.dbc file:
dbms: oracle
db_version: Oracle version number
db_home: Oracle home directory ($ORACLE_HOME)
db_name: Oracle instance name ($ORACLE_SID)
db_nodes: Machine on which Oracle server or client software runs
User: user name of database connection
encrypted_password: encrypted pwd of database.
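A hypothetical .dbc file, with placeholder values, might look like this:
dbms: oracle
db_version: 11.2.0
db_home: /opt/oracle/product/11.2.0
db_name: ORCL
db_nodes: dbserver01
user: etl_user
encrypted_password: <encrypted password string>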
Parameters:
Ora-interface: Default is api (only API mode is available for the input table).
field_type_preference:
Delimited (default) -- Causes data fields in the generated DML record format to be
delimited (the delimiter is specified by the column delimiter field) for all data types.
Output table:
Access:
.Table handling
--create table
--fail
Parameters:
When using an Oracle database, the parameters below are enabled for the output table.
-table commit_table_name \
-index index_name
[ -drop ] [ -print_ddl ]
Commit table contains the job id, vertex (partition name), done, rows_processed,
username, job start time.
Job id: The job id parameter allows you to specify a unique job identifier for each run of
the graph.
The Co>Operating System uses this value to identify the graph's run state. Thus, for each
fresh run of the graph, the job identifier value must be fresh. But whenever a graph is
restarted after a previously failed run, the job identifier must be the same as it was before
(so that the restarted run can be correctly associated with the state of the uncompleted
previous run).
Parallelism:
The power of Ab Initio software to process large quantities of data is based on its use of parallelism (doing more than one thing at the same time). There are three kinds of parallelism:
1) Component parallelism: occurs when different components of a graph process different data at the same time. The more branches a graph has, the greater the component parallelism; if a graph has only one branch, component parallelism cannot occur.
2) Pipeline parallelism: occurs when a component and the components downstream of it process different records of the same data stream at the same time. Sort, transform components using the in-memory option, intermediate files, and phases break pipeline parallelism.
3) Data parallelism: Data parallelism occurs when a graph separates data into
multiple divisions, allowing multiple copies of program components to operate on the
data in all the divisions simultaneously.
Partition components:
Broadcast:
Broadcast arbitrarily combines all records it receives into a single flow and
writes a copy of that flow to each of its output flow partitions. To avoid unpredictable
results, make the record format of the in and out ports identical;
Partition by expression:
Parameters:
Function---Specify a DML expression using a field or fields from the input record format.
The expression must evaluate to a number between 0 and the number of flows
connected to the out port minus 1.
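For example, if four flows are connected to the out port, a hypothetical expression such as
in.cust_id % 4
sends each record to one of the partitions 0 through 3 based on its customer id.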
Partition by key:
Parameters:
Partition by Percentage:
Parameters:
Partition by Range:
Parameters:
Key---Name(s) of the field(s) containing the key values to determine which partition to
add each input record to.
Partition by Range does the following:
1) Reads splitter records from the split port, and assumes that these records are sorted
according to the key parameter.
2) Determines whether the number of flows connected to the out port is equal to n (where
n–1 represents the number of splitter records). If not, Partition by Range writes an error
message and stops execution of the graph.
3) Reads records in arbitrary order from the flows connected to the in port.
4) Distributes the records to the flows connected to the out port according to the values of
the key field(s), as follows:
. Assigns records with key values less than or equal to the first splitter record to
the first output flow
. Assigns records with key values greater than the first splitter record, but less
than or equal to the second splitter record, to the second output flow, and so on
Partition by Round-robin:
Parameters:
Block-size---Number of records you want the component to distribute to one flow before distributing the same number to the next flow.
Default is 1.
Partition with Load Balance:
1) Reads records in arbitrary order from the flows connected to its in port.
2) Distributes those records among the flows connected to its out port by sending more records to the flows that consume records faster.
3) Writes records until each flow's output buffer fills up.
De-Partition Components:
Concatenate:
1) Reads all records from the first flow connected to the in port (counting from
top to bottom on the graph) and copies them to the out port.
2) Reads all records from the next flow connected to the in port and appends
them to the records from the previously processed flow.
3) Repeats Step 2 for each subsequent flow connected to the in port.
Gather:
Gather combines records from multiple flow partitions arbitrarily.
Interleave:
Interleave combines blocks of records from multiple flow partitions in round-
robin fashion.
You can use Interleave to undo the effects of PARTITION BY ROUND-
ROBIN.
Parameters:
Block-size--- Number of records Interleave reads from each flow before reading the same
number of records from the next flow.
Default is 1.
1) Reads the number of records specified in the block size parameter from the first
flow connected to the in port.
2) Reads the number of records specified in the block size parameter from the next
flow, and so on.
3) Writes the records to the out port.
Merge:
Merge combines records from multiple flow partitions that have been sorted
according to the same key specifier, and maintains the sort order.
Parameters:
Key---Name(s) of the key field(s) and the sequence specifier(s) you want Merge to use to
maintain the order of data records while merging them.
Graph Parameters:
Input parameters:
Any input parameter at the graph level gets its value from outside the graph when the graph runs.
If we select the Input check box (in Edit > Parameters; shortcut F11), then that parameter becomes an input parameter.
Local parameters:
A local parameter at the graph level gets its value from inside the graph, when
you're defining it or working on the graph. In other words, you set a graph's local
parameter's value when you're editing the graph in the GDE — not when you're running
the graph.
If we clear the Input check box (in Edit > Parameters; shortcut F11), then that parameter becomes a local parameter.
When you run a graph, parameters are set and references evaluated by the
Co>Operating System in the following order:
Conditional components:
You can make any component or sub graph conditional by specifying a conditional
expression that the Co>Operating System evaluates at runtime to determine whether to
run the component or sub graph.
a) Remove completely--The Co>Operating System removes the component and all the flows connected to it.
b) Replace with flow (default)--The Co>Operating System disables the component and uses a flow to connect the components upstream and downstream of it.
Sub graphs:
A subgraph is a piece of a graph containing components and flows. We can create a subgraph in two ways. We can easily identify a subgraph by its double-line border.
2) In the New graph name box, enter the name you want to use for this subgraph. You can use alphanumeric characters and special characters such as underscores in the name. Non-alphanumeric characters are eliminated when you deploy the graph.
3) Click OK.
An empty sub graph icon, labeled with the name you specified, is
displayed in the workspace.
5) Create the sub graph on the canvas by inserting components from the Component
Organizer and connecting them with flows.
1) In an existing graph, select the components that the sub graph should contain.
2) From the GDE menu bar, choose Edit > Sub graph > Create from Selection.
The Create Sub graph From Selection dialog opens.
(Or)
Right-click any of the selected components, then choose Subgraph > Create from Selection.
3) In the New graph name box, enter a name for the sub graph.
Sort Components:
Sort:
Sort sorts and merges records. You can use Sort to order records before you send
them to a component that requires grouped or sorted records.
The in port of the Sort component accepts fan-in and all-to-all flows (partitioned data).
Sort automatically performs a gather on its in port, so there's no need to gather data
before sorting it.
If the sort key field contains NULL values, the NULL records are listed first with
ascending sort order and last with descending sort order.
Parameters:
Key: Name(s) of the key field(s) and the sequence specifier(s) you want the component to use when it orders records.
Max-core: Maximum memory usage in bytes.
Default is 100663296 (96 MB).
When the component reaches the number of bytes specified in the max-core parameter, it
sorts the records it has read and writes a temporary file to disk.
To calculate how many temporary files are created we will use the formula..
Sort within Groups:
Sort within Groups refines the sorting of records that are already sorted on a major key: within each major-key group it sorts the records according to a minor key.
Parameters:
Major-key: Specifies the name of the key, according to which the component assumes the
input has already been ordered.
Minor-key: Name(s) of the key field(s) you want the component to use when it refines
the sorting of records.
Max-core: Maximum memory usage in bytes before the component starts writing the
excess to disk.
Default is 10485760 (10 MB).
Allow-unsorted: Set to True to allow input not sorted according to the major-key
parameter.
When you set allow-unsorted to True, the boundary between groups of records occurs
when any change occurs in the value(s) of the field(s) specified in the major-key
parameter.
Default is False.
SAMPLE:
Sample selects a specified number of records at random from one or more
input flows.
Parameters:
Sample-size: Number of records you want to send to output.
Partition by Key and Sort is a sub graph that contains two components, Partition by Key
and Sort.
Parameters:
Key: Name(s) of the key field(s) you want the component to use when it repartitions and
sorts records.
Max_core: Maximum memory usage in bytes.
The default value of Max_core is 100663296 (96 MB).
When the component reaches the number of bytes specified in the Max_core parameter, it
sorts the records it has read and writes a temporary file to disk.
Transform Components:
Reformat:
Reformat changes the format of records by dropping fields, or by using DML
expressions to add fields, combine fields, or transform the data in the records.
Parameters:
Count: Sets the number of out ports, reject ports, error ports, and transform parameters.
Default is 1.
Never abort: The component does not stop execution of the graph, no matter how many
reject events it generates.
Use limit/ramp: The component uses the settings in the limit and ramp parameters to
determine how many reject events to allow before it stops execution of the graph.
Ramp: Rate of toleration of reject events in the number of records processed.
When the reject-threshold parameter is set to Use ramp/limit, the component uses the values of the ramp and limit parameters in a formula to determine the component's tolerance for reject events (the number of reject events tolerated is limit + ramp * number of records processed so far).
Default is 0.0.
Limit: A number representing reject events.
When the reject-threshold parameter is set to Use ramp/limit, the component uses the
values of the ramp and limit parameters in a formula to determine the component's
tolerance for reject events.
Default is 0.
Output-index: When you specify a value for this parameter, each input record goes to
exactly one transform-output port pair. For example, suppose there are 100 input records
and two output ports. Each output port receives between 0 and 100 records. According to
the transform function you specify for output-index, the split can be 50/50, 60/40, 0/100,
99/1, or any other combination that adds up to 100.
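As an illustration, a hypothetical output_index transform that routes large transactions to out1 and all other records to out0 might look like this (the amount field is an assumption):
out :: output_index(in) =
begin
  out :: if (in.amount > 1000) 1 else 0;
end;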
Output-indexes: The component calls the specified transform function for each input
record. The transform function uses the value of the input record to direct that input
record to particular transform-output ports.
The expected output of the transform function is a vector of numeric values. The
component considers each element of this vector as an index into the output transforms
and ports.
As an example of how the component uses the transform function specified in the output-
indexes parameter, consider the following:
out :: output_indexes(in) =
begin
out :1: if (in.kind == "a") [vector 0, 1, 2];
out :2: if (in.kind == "b") [vector 2, 3, 4];
out :: [vector 5];
end;
When you specify this transform function for the output-indexes parameter, it directs the
input record to the transform functions for:
Ports 0, 1, and 2 if the field in.kind is "a"
Ports 2, 3, and 4 if the field in.kind is "b"
Port 5 if otherwise
If the value returned by the output_index function does not correspond to an
output port, the corresponding record is ignored.
Logging: Specifies whether or not you want certain component events logged.
Log input: Specifies how often you want the component to send an input record to its log
port. The logging parameter must be set to True for this parameter to be available.
For example, if you select 100, the component sends every 100th input record to its log
port.
Log output: Specifies how often you want the component to send an output record to its
log port. The logging parameter must be set to True for this parameter to be available.
For example, if you select 100, the component sends every 100th output record to its log
port.
Log reject: Specifies how often you want the component to send a reject record to its log
port. The logging parameter must be set to True for this parameter to be available.
For example, if you select 100, the component sends every 100th reject record to its log
port.
Dedup Sorted:
Dedup Sorted separates one specified record in each group of records
from the rest of the records in the group.
Parameters:
Key: Name(s) of the key field(s) you want the component to use when determining
groups of data records.
Select: Filters the records according to the expression before the component separates
duplicates.
Keep: Specifies which records the component keeps to write to the out port. There are three options for this parameter:
a) first (default) — keeps the first record of a group
b) last — keeps the last record of a group
c) unique-only — keeps only records with unique key values
The component writes the remaining records of each group to the dup
port.
Check-sort: Specifies whether you want processing to abort on the first record that is out
of sorted order.
True causes processing to abort on the first record out of order.
Default is true.
Reject threshold: Similar to reformat component, check reformat for details.
Logging: Similar to reformat component, check reformat for details.
Filter By Expression:
Filter by Expression filters records according to a DML
expression, which specifies the selection criteria.
Parameters:
select_expr: Filters records according to the DML expression specified.
Reject threshold: Similar to reformat component, check reformat for details.
Logging: Similar to reformat component, check reformat for details.
The select_expr expression is evaluated for each input record, and the result determines where the record goes:
Non-0 value — Filter by Expression writes the record to the out port.
0 — Filter by Expression writes the record to the deselect port. If you do not
connect a flow to the deselect port, Filter by Expression discards the records.
NULL — Filter by Expression writes the record to the reject port and a descriptive
error message to the error port.
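For example, a hypothetical select_expr that keeps only New York customers (field name assumed) would be
state == "NY"
Records for which this evaluates to non-0 go to the out port.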
Multi Reformat:
Multi Reformat changes the format of records flowing from one to 20
pairs of in and out ports by dropping fields or by using
DML expressions to add fields, combine fields, or transform data in the records.
Parameters:
All parameters are the same as in Reformat, except that the default count is 2 in this component.
Join:
Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output
port. Additional ports allow you to collect rejected and unused records.
Parameters:
Count: An integer n specifying the total number of inputs (in ports) to join. The number of input ports determines the number of the following ports and parameters: unused ports, reject ports, error ports, record-required parameters, dedup parameters, select parameters, and override-key parameters.
Default is 2.
Sorted-input: When set to Inputs must be sorted, the component requires sorted input,
and the maintain-order parameter is not available.
Key: Name(s) of the field(s) in the input records that must have matching values for Join
to call the transform function.
Transform: Either the name of the file containing the transform function, or a transform
string. The transform function takes the number of input arguments specified in the count
parameter.
Join-type: Specifies the type of join. Choose one of the following:
Inner join (default) — sets the record-required parameters for all ports to True. The GDE does not display the record-required parameters, because they all have the same value.
Outer join — sets the record-required parameters for all ports to False. The GDE does not display the record-required parameters, because they all have the same value.
Explicit — allows you to set the record-required(n) parameter for each port individually.
Record-required(n): This parameter is available only when the join-type parameter is set to Explicit. Suppose we have two input ports, in0 and in1, and we set record-required0 to True and record-required1 to False (a left outer join): all the records from in0 and only the matching records from in1 are passed to the transform. Similarly, if we set record-required0 to False and record-required1 to True (a right outer join), all the records from in1 and only the matching records from in0 are passed to the transform.
Dedup(n): Set the dedup(n) parameter to "Dedup this input before joining" to remove duplicates from the corresponding in(n) port before joining.
Select (n): Filters for records before a join function. One per inn port; n represents the
number of an in port. If you use select (n) with dedup(n), the Join component performs
the select first, then removes the duplicate records that made it through the select.
Override-key (n): Alternative name(s) for the key field(s) for a particular inn port.
There is one override-key (n) parameter per in (n) port. The n corresponds to
the number of an in port.
Max-memory: Maximum memory usage in bytes before the component writes temporary
files to disk. Available only when the sorted-input parameter is set to Inputs must be
sorted. The default value is 8MB.
Driving: The driving input is the largest input, as specified by the driving parameter. All
other inputs are read into memory.
The driving parameter is available only when the sorted-input parameter is set to In
memory: Input need not be sorted.
The default driving port is in0.
Max-core: Available only when the sorted-input parameter is set to in memory: Input
need not be sorted. If the total size of the intermediate results that the component holds in
memory exceeds the number of bytes specified in the max-core parameter, the
component writes temporary files to disk. Default value is 64MB.
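As a sketch of the transform parameter for a simple two-input Join (field names are assumptions):
out :: join(in0, in1) =
begin
  /* take customer details from in0 and the measure from in1 */
  out.cust_id :: in0.cust_id;
  out.name    :: in0.name;
  out.amount  :: in1.amount;
end;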
SCAN:
For every input record, Scan generates an output record that includes a running
cumulative summary for the group to which the input record belongs.
Parameters:
Sorted-input: When set to in memory: Input need not be sorted, the component accepts
ungrouped input, and requires the use of the max-core parameter.
When set to Input must be sorted or grouped, the component requires grouped input, and
the max-core parameter is not available.
Transform: Either the name of the file containing the types and transforms functions, or a
transform string
This is a simple sketch of tracking the second-highest salary in a group with Scan (field names assumed; salaries assumed non-negative). The last output record of each group carries the group's second-highest salary, so a downstream Dedup Sorted with keep set to last would pick it out.
type temporary_type =
record
  decimal(8) max1; /* highest salary seen so far in the group */
  decimal(8) max2; /* second-highest salary seen so far */
end;
temp :: initialize(in) =
begin
  temp.max1 :: 0;
  temp.max2 :: 0;
end;
/* Do computation: update the two highest salaries with each record */
temp :: scan(temp, in) =
begin
  temp.max1 :: if (in.salary > temp.max1) in.salary else temp.max1;
  temp.max2 :: if (in.salary > temp.max1) temp.max1
               else if (in.salary > temp.max2) in.salary else temp.max2;
end;
out :: finalize(temp, in) =
begin
  out.sal :: temp.max2;
end;
Max-core: Maximum memory usage in bytes. Available only when the sorted-input parameter is set to In memory: Input need not be sorted.
Default is 10 MB.
1) Input selection: If you have defined the input_select function, Scan filters the input records accordingly. If the function evaluates to 0 for a particular record, Scan does not process the record. In other words, the record does not appear on any output port. If you do not connect flows to the reject or error ports, Scan discards the information. If the function evaluates to anything other than 0 or NULL for a particular record, Scan processes the record.
2) Temporary initialization: Scan passes the first record in each group to the initialize
transform function. The initialize transform function creates a temporary record for the
group, with record type temporary_type.
3) Computation: Scan calls the scan transform function for each record in a group,
including the first, using that record and the temporary record for the group as arguments.
The scan transform function returns a new temporary record.
4) Finalization: Scan calls the finalize transform function once for every input record.
Scan passes the input record and the temporary record that the scan function returned to
the finalize transform function. The finalize transform function produces an output record
for each input record.
5) Output selection: If you have not defined the output_select transform function, Scan
writes all output records to the out port. If you have defined the output_select transform
function, Scan filters the output records. The output_select transform function takes a
single argument — the record produced by finalization — and returns a value of 0 (false)
or non-0 (true). Scan ignores records for which output_select returns 0; it writes all others
to the out port.
Normalize:
Normalize generates multiple output records from each of its input records.
You can directly specify the number of output records for each input record, or you can
make the number of output records dependent on a calculation.
Parameters:
Transform: The name of the file containing the types and transform functions, or a
transform string.
Length: This transform function specifies the number of times the normalize function will
be called for the current record. This function takes the input record as an argument.
record
string(7) customer_id;
decimal(2) ntrans;
record
date("YYYY.MM.DD") transdate;
decimal(8.2) amount;
end transactions[ntrans];
string(1) newline;
end
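For the record format above, where each input record carries a vector of ntrans transactions, a minimal sketch of the length and normalize transform functions might look like this (output field names are assumptions):
out :: length(in) =
begin
  out :: in.ntrans;   /* call the normalize function once per element of the transactions vector */
end;
out :: normalize(in, index) =
begin
  out.customer_id :: in.customer_id;
  out.transdate   :: in.transactions[index].transdate;
  out.amount      :: in.transactions[index].amount;
end;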
Rollup:
Rollup evaluates a group of input records that have the same key, and then
generates records that either summarize each group or select certain information from
each group.
Parameters:
Sorted-input: When set to In memory: Input need not be sorted, the component accepts
ungrouped input, and requires the use of the max-core parameter.
When set to Input must be sorted or grouped, the component requires grouped input, and
the max-core parameter is not available.
Default is Input must be sorted or grouped.
Key: Name(s) of the key field(s) the component can use to group or define groups of
records. If the value of the key-method parameter is Use key specifier, you must specify a
value for the key parameter.
Transform: Either the name of the file containing the types and transform functions, or a
transform string.
Max-core: Maximum memory usage in bytes.
Available only when the sorted-input parameter is set to In memory: Input need not be
sorted.
If the total size of the intermediate results the component holds in memory exceeds the
number of bytes specified in the max-core parameter, the component writes temporary
files to disk.
Default is 67108864 (64 MB).
Reject-threshold: Same as Reformat. Please check reformat for details
Logging: Same as Reformat. Please check reformat for details
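As a sketch of the transform parameter in template (aggregation function) style, assuming customer transactions as input (field names are assumptions):
out :: rollup(in) =
begin
  out.cust_id      :: in.cust_id;           /* the group key */
  out.total_amount :: sum(in.amount);       /* total per customer */
  out.num_trans    :: count(in.amount);     /* number of records per customer */
end;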
String Functions:
decimal_lpad:
This function pads the left side of the string with the specified character to
match the specified character length.
This function also discards leading and trailing non decimal characters, and
you can specify either of two characters to represent the decimal point.
If str contains a minus sign, the function moves the minus sign to the left of
any padding, or to the left of the first digit.
decimal_lrepad :
m_eval 'decimal_lrepad("00042", 5, "#")' => "###42"
decimal_strip:
Returns a decimal from a string that has been trimmed of leading zeros and
non-numeric characters.
decimal_strip("-0042 ")"-42"
decimal_strip("+$42")"42"
decimal_strip("garbage")"0"
string_downcase:
string_downcase("abcXYZ") => "abcxyz"
string_upcase:
string_upcase("abcXYZ") => "ABCXYZ"
string_lrtrim:
string_trim:
string_lpad:
string_lrepad:
string_pad:
string_index:
Returns the index of the first character of the first occurrence of a string
within another string.
string_index("abcdefgh eieio", "e")5
string_index("John Smith", ":")0
string_index("qwerty", "")1
m_eval 'string_index("","abc")0
string_rindex:
Returns the index of the first character of the last occurrence of a string within
another string.
string_substring:
string_concat:
first_defined:
first_defined("test","jan","feb","mar")"test"
first_defined(NULL,"jan","feb")"jan"
first_defined(NULL,42,43,44)42
m_eval 'first_defined(NULL,NULL,NULL,"JJJJ","jan","feb")'"JJJJ"
force_error:
Raises an error and sends the specified message to the component's error or similar port. force_error can be used to filter records having particular values, or to handle records with badly formatted data.
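A hypothetical use inside a transform rule (field name assumed):
out.amount :: if (in.amount < 0) force_error("negative amount encountered") else in.amount;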
is_defined:
is_defined(123) => 1
is_defined(NULL) => 0
is_null:
Tests whether an expression results in a NULL value
This function returns:
is_null(123) => 0
is_null(NULL) => 1
is_error:
is_error(a + b) => 0
is_error(1 / 0) => 1
is_valid:
is_valid(1) => 1
is_valid((date("YYYYMMDD"))"19960504") => 1
is_valid((date("YYYYMMDD"))"19920231") => 0
is_valid((date("YYYYMMDD"))"abcdefgh") => 0
next_in_sequence():
Returns a sequence of integers on successive calls, starting with 1. Each partition of each component has its own private sequence. The sequence of numbers returned is shared among all callers within a component. For example, calls to next_in_sequence from different transform functions associated with different outputs of a Reformat component affect one another.
next_in_sequence() => 1
next_in_sequence() => 2
next_in_sequence() => 3
Date Functions:
now():
Returns the current local date and time. The returned date time result is in the
default format YYYYMMDDHH24MISSnnnnnn.
now1():
Returns a datetime value from the system local date and time. The first time a
component calls now1, the function returns the value returned from the system function
local time. Every other time the component calls now1, the function returns the same
value it returned on the first call. The returned datetime result is in the default format
YYYYMMDDHH24MISSnnnnnn.
Database components:
Run SQL:
Parameters:
DBConfigFile: Name of the database configuration (.dbc) file, which contains the database connection details.
SQLFile: Name of the file containing the SQL statements to be executed against the database, or the SQL statement can be embedded here.
Interface: API (default) or utility--utility mode is used only when we have to execute PL/SQL or stored procedures.
ignore_drop_table_errors:
True — Suppress and ignore errors generated by the database when the SQL
attempts to drop a nonexistent table.
False (default) — do not ignore such errors.
Update table:
Update Table executes UPDATE, INSERT, and DELETE statements in
embedded SQL format (or a stored procedure) to modify a table in a database, and writes
status information to the log port.
Parameters:
DBConfigFile: Name of the database configuration (.dbc) file, which contains the database connection details.
Join with DB:
Join with DB joins records from the flow or flows connected to its in port with records read directly from a database, and outputs new records containing data based on, or calculated from, the joined records.
Parameters:
DBConfigFile: Name of the database configuration (.dbc) file, which contains the database connection details.
Transform: The transform package used to compute the output record based on
the input record and the result from the select_sql statement.
e.g.) out :: join_with_db(in, query_result) =
begin
  /* field names assumed for illustration */
  out.cust_id   :: in.cust_id;
  out.cust_name :: query_result.name;
end;
Validate Components:
Generate records:
Generate Records generates a specified number of records that you can use
as test data.
Parameters:
-default field--Specifies the default value for field. Do not use this option
if the field does not have a default value.
-expression field 'expression'---Evaluates expression and stores the
resulting value in field. Use a single-quoted string to enclose expression.
Below is a simple example of generating test data for the following DML:
record
decimal(3) id=100;
decimal(3) id1;
decimal(3) id2;
string(10) name;
end
Create Data:
CREATE DATA produces a series of one or more output records every time
its graph is run. The records in the series can be identical or distinct. This component is
useful for generating data to test graphs, particularly in cases where you need a flow with
exactly one fixed record on it.
Parameters:
Num-records: The number of records to create.
Identical-records: evaluate transform for each record (default)
evaluate transform once and use that value for all records.
Transform: Contains one function, create (index), which takes an integer(8)
value index and returns a record of type suitable for the out port. For example, consider a
CREATE DATA out port defined as follows:
record
decimal(6) customer_id ;
string(28) name;
string(1) gender;
string(26) street_addr;
string(2) state;
string(1) newline;
end
out::create(index)=
begin
out.customer_id :: index;
out.gender :1: if (index % 3) "M";
out.gender :2: "F";
out.name :: "ABC";
out.street_addr :: "";
out.state :: "AP";
out.newline :: "";
end;
The rest of the columns are hard-coded, so the same values appear in every output record.
Check Order:
Check Order tests input records to determine whether the records are sorted according to the key specifier in the key parameter.
If Check Order detects an incorrectly sorted record, it writes a single-line error
message to the out port. And then:
If the number of incorrectly sorted records does not exceed the number specified
in the limit parameter, Check Order continues with execution of the graph.
If the number of incorrectly sorted records exceeds the number specified in the limit
parameter, Check Order stops execution of the graph.
Parameters:
Key: Name(s) of the key field(s) and the sequence specifier(s) you want the
component to use when it verifies the order of data records.
Limit: Maximum number of incorrectly ordered records before the component
stops execution of the graph.
Compare Records:
Parameters:
Validate Records:
Parameters:
Compress Components:
Compress:
Compress uses the UNIX compress utility to reduce the volume of data in a
flow. Use Deflate instead of Compress.
Deflate:
Deflate reduces the volume of data in a flow by writing compressed records to
its out port.
Parameters:
Compression: Specifies how much to compress the data. Choose from the following:
0 — No compression
1 — Causes Deflate to run fastest, but compresses least
2 through 8 — Intermediate choices
9 — Causes deflate to run slowest, but compresses most
Default value is 1, which causes Deflate to run fastest, but compress the least.
Uncompress:
Uncompress decompresses data that was compressed by the Compress component or the GZip component.
Inflate:
Inflate decompresses data that is in certain compressed formats. Inflate cannot
decompress data that is not in gzip- or Z-compressed format — such as data compressed
by the UNIX pack utility. To uncompress such data, use the Uncompress component.
Check-Out:
air project export /Projects/<Projectname>/InfoCatalog/<sandboxname> -basedir /ai/src/abinitio/users/dev/<user_id>/sand/<sandboxname> -dry-run -files - <<'EOF'
Check-in Window:
It will ask for a check-in comment.
Check-out wizard:
Tagging:
Tagging is the process of grouping a set of objects under a single tag. Tags are created, and their contents listed, using the air tag commands.
1) Over-reliance on databases: There are many things that you can (and should) do
outside the database. For example, operations involving heavy computation are usually
better done with components, in the graph, rather than in a database.
Sorting will almost always be faster when you use the SORT component rather than
sorting in the database.
2) Use phases in the correct places: plan a limited number of components per phase.
3) Bad placement of phase breaks: Wherever a phase break occurs in a graph, the data
in the flow is written to disk; it is then read back into memory at the beginning of the
next phase. For example, putting a phase break just before a FILTER BY
EXPRESSION is probably a bad idea: the size of the data is probably going to be
reduced by the component, so why write it all to disk just before that happens?
4) Use Sort components only when required, as sorting breaks pipeline parallelism.
5) Instead of a separate Filter by Expression component, consider using the select parameter of Reformat or Join.
6) Use parallelism techniques in the initial stages of the graph.
7) Use filter components at the start of the graph; that is, reduce unnecessary data in the initial stages of the graph and pass only the required data to the downstream components.
8) Assign the driving port of join correctly.
9) Use a lookup file rather than a Join when one input is huge and the other is small.
10) If you need to process an Input File in several different ways, connect multiple
flows directly to the Input File, rather than through a Replicate component. If you
can connect directly to a file, the downstream components will be able to read the
data independently, which can improve the performance.
11) Log only information that will be regularly examined.
1) Broken pipe error: The error message "broken pipe" means that a downstream component has gone away, so the flow is broken. For example, the database might have run out of memory, making database components in the graph unavailable.
2) No space left on device: This error message indicates that the file system that holds /var/abinitio (the default location for AB_WORK_DIR) has run out of space. To solve this problem, you need to free up some space in the file system, usually by deleting some files.
3) A null value is assigned to a not-null column.
4) The records are out of order.
5) Metadata error.
6) The depths (of the connected flows) are not equal.
7) Too many open files.
8) The user name or password is not correct.
9) DML mismatches.
10) The input file is not present in the specified path.
Advanced Topics:
PDL:
PDL (the Parameter Definition Language) is a simple set of rules and notations for
referencing parameter values.
If we use PDL instead of shell interpretation for a parameter, the parameter can be resolved without contacting the UNIX shell every time. Because PDL is Ab Initio's own parameter language, it performs faster than shell-interpreted parameters.
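For example, a parameter whose interpretation is set to PDL can embed a DML expression inside $[ ]. Here AI_SERIAL is assumed to be an existing project parameter:
$[string_concat(AI_SERIAL, "/customer_master.dat")]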
Meta Programming:
Dependency Analysis:
Dependency analysis is the process by which the EME examines
a project in its entirety and traces how data is transformed and transferred, field by field,
from component to component, within and between graphs and across projects. Once you
invoke an analysis, the results are viewable in a Web browser, where you can see:
The results of dependency analysis are available in the EME Web Interface as
shown below.
PLANS:
A combination of tasks is called a plan. Tasks in a plan can include graphs, programs, and other plans (subplans).
PSETS:
Psets are files that contain the input parameter values for a graph. For a generic graph, we create different psets instead of creating many separate graphs.