Ab Initio Questionnaire Final
Can we use Gather component to reduce data parallelism, and how does Gather differ from Merge component?
In what scenario is the record-required parameter in the Join component required?
Lookup functionality in Ab Initio. Briefly explain?
Explain the functionality of the Meta Pivot component?
What are the numeric results generated by the Parse function in a Read Raw component and how are they interpreted?
Explain Rollup component and Scan component?
Which component can we use to aggregate flow of records and then send them to another computer?
Can we use any of the Ab Initio Components to perform Database Operations?
Briefly explain about the ordered attribute?
Describe when to use Lookup instead of Join?
Briefly explain about the max-core parameter? What happens if the max-core parameter is set too low and if it is set too high?
Explain Fan-in, Fan-out and All-To-All flows?
What do you mean by Degree of Parallelism?
Explain Repartitioning?
What are the Departitioning components in Ab Initio?
If you need to do basic reformatting in parallel, which partition component do we use?
If you need to bring parallel data into a serial dataset in a particular order, what departition component would you use?
Explain the ordering of data in the departitioning components used in Ab Initio?
Which departitioning component is used to create an ordered serial flow?
Suggest some methods to avoid deadlocks during departitioning?
What is a Layout? Is it Serial or Parallel?
What is a serial layout and a Parallel Layout?
What are the different Layout options available in Ab Initio Components?
Which command do we use to create an empty multifile?
Explain the role of Host process and Agent process in running a Graph?
What do you mean by multifile and multidirectory?
Name few multifile commands?
What do you mean by a Layout Marker?
Name some of the components in Ab Initio used for testing and Validation?
What is the difference between is_valid, is_defined and is_blank functions?
What do you mean by a multistage component? Name any Multistage component?
Is Denormalize a multistage component? If yes, what are the different stages occurring in this component?
Explain the function of major key and minor key in a Sort within Groups component?
When is it a bad idea to replace Join with use of a lookup file in a Reformat component?
Explain the function of lookup_next?
Explain the function of lookup_local?
What is the purpose of using conditional components and how do you set the conditional components in GDE?
What is Abinitio?
Ab Initio is an ETL tool used for loading data from various source systems to a warehouse or a data mart. "Ab initio" means "from the beginning."
Functions of EME?
Enterprise Meta Environment (EME) is a high-performance, object-oriented storage system that inventories and manages various kinds of information associated with Ab Initio applications. It provides storage for all aspects of your data processing system, from design information to operations data. The EME also stores data formats and business rules, acting as the hub for data and definitions. Integrated metadata management provides a global, consolidated view of the structure and meaning of applications and data.
• Source control
  • Secure storage
  • Conflict management
  • Version history
  • Differencing
• Documentation
  • Ab Initio metadata
  • Non-Ab Initio metadata
• Analysis
  • Impact - Downstream
  • Dependency - Upstream
• Job status
  • Job completion
  • Tracking information
• Lifecycle management
  • Promotion: Dev to test to prod
  • Migration: Project to project
What is a Sandbox?
A sandbox is a collection of graphs and related files that are stored in a single directory tree, and
treated as a group for purposes of version control, navigation, and migration.
A sandbox can be a file system copy of a repository project (EME project)
Important Questions
Input File represents data records read as input to a graph from one or multiple serial files or
from a multifile.
Intermediate File represents one or multiple serial files or a multifile of intermediate results that a
graph writes during execution, and saves for your review after execution.
Lookup File represents one or multiple serial files or a multifile of data records small enough to
be held in main memory, letting a transform function retrieve records much more quickly than it
could retrieve them if they were stored on disk.
Output File represents data records written as output from a graph into one or multiple serial files
or a multifile.
SAS Input File represents data records read as input to a graph from a SAS dataset.
SAS Output File represents data records written as output from a graph into a SAS dataset.
Each database component, like all Ab Initio graph components, has a set of parameters that can be accessed by double-clicking the component (or right-clicking it and selecting Properties from the drop-down menu) to reach the component properties dialog box, and then clicking the Parameters tab.
The set of parameters offered for almost any given database component is dynamically
dependent on certain other values, namely:
The DBMS to which the component is connecting (i.e., the database specified by the dbms field
in the component's database configuration file)
(Secondarily, but only in some cases) the value specified in the component's dbms_interface parameter
For example, if you set up an Output Table component with a configuration file that specifies
teradata as its dbms, you will find that the component has many more parameters than an Output
Table set up for a dbms of, say, db2uss. Moreover, you will find that this parameter set changes
yet again if you go on to specify a ter_interface ("Teradata interface") of Multiload instead of
Tpump. Similar variations can be seen with other combinations of database component and
DBMS.
The parameter tables in the individual section for each component give listings of:
The non-database-specific parameters (i.e., the set of parameters which are always available for
the component, no matter which DBMS is specified in the configuration file)
The database-specific parameters in separate tables, one for each supported DBMS
match_required: Controls how input records for which no rows are returned from the select_sql statement are handled, but only when execute_on_miss is not specified:
True (the default): The input record is sent to the unused port, and the join_with_db transform function is not evaluated.
False: The join_with_db transform function is evaluated with a NULL value for query_result.
execute_on_miss
A statement executed when no rows are returned by the select_sql statement. The
execute_on_miss statement should be an INSERT (or possibly an UPDATE); after it is executed,
the select_sql is executed a second time. If no results are generated on the second attempt, the
input record is rejected. A database commit is by default performed after each execution of
execute_on_miss, but you can change this by setting the commit_number parameter.
do_query
If you define the optional transform function do_query, Join with DB calls this function for each input record first. If the function returns True, the normal transform processing is done: the select_sql statement is executed, and so on. If it returns False, no query is performed, and the join_with_db function is called with a value of NULL for query_result.
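A minimal do_query sketch (hedged; the field name cust_id is illustrative): skip the database query for records that cannot match.
out :: do_query(in) =
begin
  out :: in.cust_id != 0;  /* query the database only when a real key is present */
end;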
The value of maximum_matches determines the maximum number of returned rows that are
used by the transform. The transform function is evaluated once for each row returned, within
maximum_matches.
The default value is -1, which specifies that all returned rows be used.
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or the number of output records can
depend on some calculation.
You need to provide the length transform function, which tells Normalize how many times to execute the normalize transform function.
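A minimal sketch, assuming the input record carries a count field n and a vector field v (all names illustrative):
/* how many output records to produce for this input record */
out :: length(in) =
begin
  out :: in.n;
end;
/* called once per output record; index runs from 0 to length - 1 */
out :: normalize(in, index) =
begin
  out.id :: in.id;
  out.amount :: in.v[index].amount;
end;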
If you have not defined temporary_type, Normalize does not call the finalize function.
In addition, you have the option of defining a rollup operation in a Denormalize Sorted transform. This allows you to roll up and denormalize in the same operation. If you do this, you also need to define a temporary_type record to hold the data being rolled up.
The denormalize function is where the actual denormalization happens. Depending on the maximum size of the vector, you control whether each element is appended: as long as count is less than 20 (the fixed length of the vector defined in your output record format), update gets the value 1 (true) and elt is appended to the vector.
Types of Parallelism
Component Parallelism
A graph with multiple processes running simultaneously on separate data uses component
parallelism.
Data Parallelism
A graph that deals with data divided into segments and operates on each segment
simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data
parallelism. To support this form of parallelism, Ab Initio software provides Partition components to
segment data, and Departition components to merge segmented data back together.
Pipeline Parallelism
A graph with multiple components running simultaneously on the same data uses pipeline
parallelism.
Each component in the pipeline continuously reads from upstream components, processes data,
and writes to downstream components. Since a downstream component can process records
previously written by an upstream component, both components can operate in parallel.
To make a serial file parallel, you need to partition the data using partitioning components.
To run a component in parallel, the layout must be of $MFS type.
ROLLUP, JOIN, SCAN, and SORT require data to be partitioned by the key.
When the key is a composite key (multiple fields), partitioning is required only on the high-level (leading) key field.
Normalize: When we want to create separate data records from a single record, we use this component. Data from mainframes generally comes in vectors, and we need Normalize to separate them into multiple records.
De-Normalize: When we want to create a single data record from one or more records. This has the reverse effect of Normalize.
Example of denormalize:
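A minimal sketch of the denormalize function just described (hedged: the field names are illustrative, and exact function signatures vary by Co>Operating System version); it appends elements to the output vector only while count is below the fixed vector length of 20:
out :: denormalize(temp, in, count) =
begin
  out.temp :: temp;          /* carry the temporary (rollup) state forward */
  out.update :: count < 20;  /* 1 (true) while the vector still has room */
  out.elt :: in.txn_amount;  /* the element appended to the vector */
end;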
Redefine Format: Use this component when you need to change the record layout.
De-partition:
Concatenate reads all records from one flow at a time, in port order, appending each flow after the previous one.
Gather reads records from all flows in arbitrary order and combines them.
Merge reads records from all flows and maintains the sorted order of the data.
Input Table:- Input Table unloads data records from a database into an Ab Initio graph, allowing
you to specify as the source either a database table, or an SQL statement that selects data
records from one or more tables.
Output Table:- Output Table loads data records from a graph into a database, letting you specify
the records' destination either directly as a single database table, or through an SQL statement
that inserts data records into one or more tables
Update Table:- Update Table executes UPDATE, INSERT or DELETE statements in embedded
SQL format to modify a table in a database, and writes status information to the log port.
RunSQL:- Executes SQL statements in a database.
Input File:- Represents one file, many files, or a multifile as an input to your graph.
Output File:- Represents one file, many files, or a multifile as an output from your graph.
Lookup File:- Lookup Files are components containing shared data. Use lookup files with the
DML lookup functions to access records according to a key.
Concatenate:- Appends multiple flow partitions of data records one after another.
Gather:- Combines data records from multiple flow partitions arbitrarily.
Merge:- Combines data records from multiple flow partitions that have been sorted according to
the key specifier, and maintains the sort order.
Run Program:- Executes a standard UNIX or Windows NT program.
Trash:- Ends a flow by discarding all input data records
Partition by Key:- Distributes data records to its output flow partitions according to key values.
Partition by Percentage:- Distributes a specified percentage of the total number of input data
records to each output flow.
Partition by RoundRobin:- Distributes data records evenly to each output flow in round-robin
fashion.
Use the Interleave component to reverse the effects of Partition by Round-robin.
Sort:- Orders your data according to a key specifier.
Dedup Sorted:- Separates one specified data record in each group of data records from the rest
of the records in the group.
Filter by Expression:- Filters data records according to a specified DML expression.
Join:- Performs inner, outer, and semi-joins with multiple flows of data records. It maximizes
performance by loading input data records into main memory.
Reformat:- Changes the record format of your data by dropping fields or by using DML
expressions to add fields, combine fields, or modify the data.
Rollup:- Generates data records that summarize groups of data records. Rollup in Memory
maximizes performance by keeping intermediate results in main memory.
Denormalize Sorted:- Consolidates groups of related records in the input to a single data record
with a vector field for each group, and optionally computes summary fields for each group.
It's the inverse of Normalize.
Normalize:- Generates multiple output data records from each input data record. Normalize can
separate a data record with a vector field into several individual records, each containing one
element of the vector.
Leading Records:- Copies a specified number of records from its in port to its out port, counting from the first record in the input file.
record
string(6) first_name;
string(10) last_name;
decimal(3) age;
date("YYYYMMDD") date_of_hire;
end
This format describes a record with four fields: first_name, last_name, age, and date_of_hire.
Each field has a different type.
Grouping and Ordering metadata (in key specifiers)
For data records that are input to a component, the description of the way input records are
ordered, partitioned, or grouped.
For data records which a component is to output, the description of the way the component
should order, partition, or group output records.
Computational metadata (transform functions in .xfr files)
A collection of rules and other logic that drive data transformation in Transform components.
union
field-type field-name;
[field-type field-name; ...]
end
Vector:- A vector type is a compound data object containing a sequence of elements, all of which are the same type. The element type can be any base type or another compound type. There are four kinds of DML vectors:
Fixed-size vectors
Delimited vectors
Data-sized vectors
Self-sized vectors
Record:- A record type is a compound data type containing a sequence of named fields, each of which can be a different DML base or compound type.
record
field-type field-name;
[field-type field-name…..]
end
Example
The following specifier describes a flow in which the first record has the data field store_name,
and all subsequent records have the data fields item_name, price, and quantity.
record
if (get_flow_state() == 0)
string(7) store_name;
else
begin
string(6) item_name;
decimal(7,2) price;
decimal(3,0) quantity;
end
double next_flow_state() = 1;
end
Syntax
type reinterpret_as ( type, x )
Arguments
type A type.
x An expression.
This function evaluates x and interprets it as type type, and returns the result.
Example
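A small sketch (the byte layout is illustrative): reinterpret the bytes of a fixed-format date string as a record of numeric fields.
reinterpret_as(
  record
    decimal(4) yy;
    decimal(2) mm;
    decimal(2) dd;
  end,
  "20240115")  /* yields yy = 2024, mm = 1, dd = 15 */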
A component determines the values that are passed to the transform function and interprets the
results of the transform function. Most commonly, transform functions express record reformatting
logic. In general, however, you can use transform functions to encapsulate a wide variety of
computational knowledge that you can use in data cleansing, record merging, and record
aggregation.
Just like a user-defined function in any other programming language, transform functions can call
other transform functions.
Syntax
The following is the syntax of a transform function:
The list of local variable definitions, if any, must precede the list of statements, if any. The list of
statements, if any, must appear before the list of rules.
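A sketch of that general form (names illustrative):
out :: compute_amount(in) =
begin
  let decimal(10) total = 0;        /* local variable definitions come first */
  total = in.price * in.quantity;   /* then statements */
  out.amount :: total;              /* then rules */
end;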
The following is an example of a transform function. It takes an input record with fields named a,
b, and c, and produces an output record with fields named x, y, and z using the rules shown.
out :: trans1(in) =
begin
out.x :: in.a;
out.y :: in.b + 1;
out.z :: in.c + 2;
end;
If a rule results in NULL, DML tries the rule with the next highest priority. If all rules result in
NULL, DML assigns to each output its default value or sets its value to NULL if it can hold NULL.
However, if not all outputs have a default value or can hold NULL, the transform function results
in an error.
Example 1
The following set of rules assigns the ssn field of the output variable master by trying several
different things, in sequence. First, DML attempts to copy the value from source1. If source1
results in NULL, then DML tries to copy the value from source2. If that results in NULL, then DML
copies the value 999999999.
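A sketch of those rules as DML priority rules, using the record names from the description:
master.ssn :1: source1.ssn;
master.ssn :2: source2.ssn;
master.ssn :3: 999999999;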
Example 2
The following set of rules computes the partcode and src fields of the output variable new by
following a sequence of rules.
If the expression try_partcode1(old.partcode) succeeds, the rule assigns the resulting value and
1, respectively, to the fields partcode and src of new.
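A hedged sketch of the first rule (assuming try_partcode1 returns NULL on failure, so DML falls through to the next rule; subsequent rules would follow the same multi-assignment pattern):
new.partcode, new.src :1: try_partcode1(old.partcode), 1;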
Example
type myrec =
record
  string(5) x = 'X';
  decimal(5) y = -1;
  string(10) z;
end;
What are Packages ? Which all components can make use of Packages?
Packages are a high-level view of a transformation, and contain included files, global variables,
named types, helper functions, and transformation functions. Packages are available for
Reformat and Join as well as the multistage transform components.
What are the values we give for the sorted-input parameter of an Aggregate component?
"In memory: Input need not be sorted" and
"Input must be sorted or grouped"
When set to "In memory: Input need not be sorted", Aggregate accepts ungrouped input and requires the use of the max-core parameter. When set to "Input must be sorted or grouped", Aggregate requires grouped input, and the max-core parameter is not available. The default is "Input must be sorted or grouped".
Explain the conditions on which data moves to First, New, and Old ports in
an Assign Keys component?
1. If the natural key value of the input record matches the natural key value of any record on the key port, Assign Keys:
Assigns the surrogate key value of the record on the key port to the surrogate key field of the input record
Sends the input record to the old port
2. If the natural key value of the input record does not match the natural key value of any record on the key port, and this is the first occurrence of this natural key value, Assign Keys:
Creates a new value for the surrogate key field of the input record
Sends the input record to both the new and the first port
3. If the natural key value of the input record does not match the natural key value of any record on the key port, and this is not the first occurrence of this natural key value, Assign Keys:
Assigns the surrogate key value from the record on the first port that has the same natural key value as the input record (see 2 above) to the surrogate key field of the input record
Sends the input record to the new port, but not the first port
The ramp parameter contains a float that represents a rate of reject events per record processed.
The limit parameter contains an integer that represents a number of reject events.
The component stops the execution of the graph if the number of reject events exceeds the result of the following formula:
limit + (ramp * number of records processed so far)
For example, with limit = 10 and ramp = 0.01, the graph tolerates up to 10 + 0.01 * 5000 = 60 rejects once 5,000 records have been processed.
Otherwise, Fuse reads one record from each still-unfinished input port and a NULL from each
finished input port.
Can we use Gather component to reduce data parallelism, and how does
Gather differ from Merge component ?
You can use Gather to:
Reduce data parallelism, by connecting a single fan-in flow to the in port
The Gather component is not key-based; the Merge component is key-based. With Gather, record ordering is unpredictable; with Merge, records remain in sorted order. Gather is used for unordered departitioning and repartitioning; Merge is used for creating an ordered serial flow.
The setting of the record-required parameter determines whether or not Join calls the transform
function for a particular combination of key values in the data records on the in ports.
Although a Lookup File appears in the graph without any connected flows, its contents are
accessible from other components in the same or later phases.
What are the numeric results generated by the Parse function in a Read
Raw component and how are they interpreted?
The Parse function returns a positive number, zero, or a negative number.
A positive number indicates that the function should be called again if there is remaining input.
Zero indicates that the function should not be called again and that any remaining input should be
discarded.
A negative number indicates that the parser cannot parse the data and the function should
terminate with an error.
Rollup generates data records that summarize groups of data records. Rollup gives you more control over record selection, grouping, and aggregation than Aggregate. Rollup requires grouped input.
Scan generates a series of cumulative summary records - such as year-to-date totals - for groups
of data records. Scan requires grouped input.
Which component can we use to aggregate flow of records and then send
them to another computer?
Departition components. This (fan-in) flow pattern is used to merge data divided into many segments back into a single segment, so other programs can access the data.
When you connect a component running in parallel to any component via a fan-in flow, the
number of partitions of the original component must be a multiple of the number of partitions of
the component receiving the fan-in flow. For example, you can connect a component running 9
ways parallel to a component running 3 ways parallel, but not to a component running 4 ways
parallel.
Fan-out flow:- Fan-out flows connect components with a small number of partitions to
components with a larger number of partitions. The most common use of fan-out is to connect
flows from partition components. This flow pattern is used to divide data into many segments for
performance improvements. When you connect a Partition component running in parallel to
another component running in parallel via a fan-out flow, the number of partitions of the
component receiving the fan-out flow must be a multiple of the number of partitions of the
Partition component. For example, you can connect a Partition component with 3 partitions via a
fan-out flow to a component with 9 partitions, but not to a component with 10 partitions.
All-to-All flow :- All-to-all flows typically connect components with different numbers of
partitions. Data from any of the upstream partitions is sent to any of the downstream partitions.
The most common use of all-to-all flows is to repartition data, as described below.
Explain Repartitioning?
To repartition data is to change the degree of parallelism or the grouping of partitioned data. For
instance, if you have divided skewed data, you can repartition using a Partition by Key connected
to a Gather with an All-to-all flow. If the data is sorted on one key field and you want to sort on
another, you can repartition with a Partition by Key connected to a Sort with an all-to-all flow.
If you need to bring parallel data into a serial dataset in a particular order, what departition component would you use?
Merge Component.
A layout is a list of host and directory locations, usually given by the URL of a multifile. If the
locations are not in a multifile, the layout is a list of URLs called a custom layout.
A program component's layout lists the hosts and directories in which the component runs. A
dataset component's layout lists the hosts and directories in which the data resides. Layouts are
set on the Properties Layout tab.
The layout defines the level of parallelism. Parallelism is achieved by partitioning data and
computation across processors.
Explain the role of Host process and Agent process in running a Graph?
Script is invoked, creating Host process. Host process in turn spawns Agent Process. Agent
processes create Component processes on each processing computer. Component processes
do their jobs. Component processes communicate directly with datasets and each other to move
data around. As each Component process finishes with its data, it exits with success status.
When all of an Agent’s Component processes exit, the Agent informs the Host process that those
components are finished. The Agent process then exits. When all Agents have exited, the Host
process informs the GDE that the job is complete. The Host process then exits.
Multifile:
A multifile is a parallel file that is composed of individual files on different disks and/or
nodes. The individual files are partitions of the multifile. Each multifile contains one control
partition (A control partition is the file in a multifile that contains the locations (URLs) of the multifile's data partitions) and one or more data partitions (A data partition is a file in a multifile that
contains data). Multifiles are stored in distributed directories called multidirectories.
The data in a multifile is usually divided across partitions by one of these methods:
Random or round robin partitioning
Partitioning based on ranges or functions, or
Replication or broadcast, in which each partition is an identical copy of the serial data.
Name some of the components in AB Initio used for testing and Validation?
Check Order
Compare Checksums
Compare Records
Compute Checksum
Generate Random Bytes
Generate Records
Validate Records
This function returns the value 1 if the given string contains only blank characters or is a zero-length string, and 0 otherwise. (Because every character in a zero-length string is a blank, this function returns 1 for zero-length strings.)
Syntax
int is_blank ( string s )
Arguments
s A string to test.
Examples
is_blank("") 1
is_blank("abc") 0
is_blank("ooo") 1
is_blank("ooo.ooo") 0
Syntax
int is_defined ( x )
Arguments
x An expression to test.
Examples
is_defined(123) 1
is_defined(NULL) 0
Syntax
int is_valid ( x )
Arguments
x An expression to test.
Note the distinction between being valid and being defined: a defined item has data, but a valid
item has valid data. For example, a decimal field filled with alphabetic characters is defined but
not valid.
DML calls is_valid field functions only when is_valid is called on the record, not every time it
reads or creates a data record of that type.
Examples (blanks shown as spaces)
is_valid(1) 1
is_valid(" a ") 1
is_valid(234.234) 1
is_valid((decimal(8))"   ") 0
is_valid((decimal(8))" a ") 0
Multiple pieces of information may be conveyed from stage to stage by having multiple fields in the temporary type.
Example: Rollup.
Yes. The stages in Denormalize Sorted are:
Input Selection
Initialization
Rollup
Denormalize
Finalization
Output Selection
Explain the function of major key and minor key in a Sort within Groups component?
Sort within Groups assumes input records are sorted according to the major-key parameter. Sort within Groups reads data records from all the flows connected to the in port until it either reaches the end of a group or reaches the number of bytes specified in the max-core parameter. When Sort within Groups reaches the end of a group, it sorts the records of that group according to the minor-key parameter and writes the sorted records to the out port.
What is the purpose of using conditional components and how do you set the conditional components in GDE?
The GDE supports conditional components, where a shell expression determines at runtime whether or not to include certain components.
To turn on this feature:
File -> Preferences -> "Parameters" section of the dialog
Check "Conditional Components"
The "Condition" tab then appears on all components.
The Condition expression is a shell expression that, if it evaluates to "0" or "false", causes the component to not exist. Any other value means the component will exist. Make sure the shell expression returns the string "0" or "false", not the numerical 0.
In a local parameter:
the value is saved as part of the graph
the value is the same every time the graph is run
In a formal parameter:
the value is supplied at run time, so it can differ from run to run
Whenever we plan to reuse the same record structure (DML) or transform functions (XFR), we can store them in .dml and .xfr files respectively. By doing this, we get re-usability.
When should I use in-memory Rollup or Join versus sorted input and a Sort
component?
Whenever the data to be sorted is small in volume, there is no need for a separate Sort component; the data can be sorted in memory by the Rollup or Join itself. When the data volume is large, it is better to sort the data first and then send it to the Rollup or Join component. For Join, if the key the inputs are sorted on and the join key are the same, set sorted-input to "Inputs must be sorted"; Join then accepts the sorted data even when it comes in small amounts. A Sort component first has to sort the data and then send it to Join.
What does the error message "straight flows may only connect ports
having equal depths" mean?
This error message appears when you have two components connected by a straight flow and they
are running at different levels of parallelism.
What does the error message "Trouble writing to socket: No space left on
device" mean?
Your work directory, $AB_WORK_DIR, is full.
NOTE: Any jobs running when AB_WORK_DIR fills up are unrecoverable.
What is layout?
The layout of a component is a list of hostname/pathname locations which defines two things:
Where the component runs
The number of ways parallel the component runs or the depth of parallelism of the component.
A component's layout is a list of hostname/pathname locations which specifies where (on what computer and in what directory) the component runs and the number of ways parallel the component
runs. For example, if you set your layout to refer to a 4-way multifile, the Co>Operating System starts
four copies of the same executable, resulting in four parallel processes. Each one of the processes
runs in one partition of the multifile.
What does the error message "too many open files" mean, and how do I fix
it?
The "too many open files" error messsage occurs most commonly because the value of the max-core
parameter of the Sort component is set too low. In these cases, increasing the value of the max-core
parameter solves the problem.
What does the error message "Failed to allocate <n> bytes" mean and how
do I fix it?
The "failed to allocate" type of error message is generated when an Ab Initio process has exceeded its
limit for some type of memory allocation.
What is the difference between API mode and utility mode in database
components?
API and utility are two possible interfaces to databases from Ab Initio software; their uses may differ
depending upon the database in question.
A sketch of a parse function for a Read Raw component: it reads a 2-byte block size, skips the next 2 bytes, copies the block's data, and repeats until input is exhausted, returning 0 to indicate completion.
begin
  let integer(2) blocksize = 0;
  let string(2) raw_blocksize = read_string(2);
  while (is_defined(raw_blocksize))
  begin
    blocksize = reinterpret_as(integer(2), raw_blocksize);
    read_string(2);
    copy_data(blocksize - 4);
    raw_blocksize = read_string(2);
  end
  out :: 0;
end;
Temporary work area: When executing a graph, Ab Initio uses temporary space, by default /var/abinitio. To change this, set the AB_WORK_DIR variable.
AIR Commands:
1. air ls /Projects/…/<project_name> : displays all objects in that project
2. air lock set /Projects/…/<object> : sets a lock on that object
3. air lock show -user <user_id> : shows all locks issued by the user
4. air lock release -object <object> : releases the lock on a particular object
5. air rm <object> : removes an object from the EME
6. air dependency-analysis : performs dependency analysis
Explain PSETs:
PSETs are required for generic graphs.
A PSET defines an instance of a generic graph, storing the run-time parameter values used at run time.
PSETs make dependency analysis possible on generic graphs.
PSETs also make it possible to track run-time statistics of generic graphs.
Common Functions:
is_valid(): Checks validity; the data has to match its datatype.
Example: is_valid((date("YYYY-MM-DD"))"2005-02-31") fails, as February has no 31st.
is_defined(): Checks whether the data element has a value.
is_defined("2005-02-31") succeeds.
is_error(): the reverse effect of is_defined().
FILE MANAGEMENT
Use the m_ prefix to perform Unix-style operations on multifiles, e.g.:
m_touch : creates an empty multifile
m_mv : moves a multifile from one place to another
m_rm : removes a multifile
What version of Ab Initio graph you are using? What flavor of Unix you are
running?
GDE 1.14, CO-OP 2.14, UNIX Solaris
Where do you save your files? In local machine or server?
Server
If an xfr file changes, how do you figure out which dependent graphs have
been changed? (Ans - search for that xfr in ksh scripts)
What is re-partition?
Repartition means changing the degree of parallelism
It is also used to change the grouping of records within the partitions of partitioned data.
What is sandbox?
It's a file-system copy of an EME project, and is the development area for a project.
How do you find the top 5 employees on the basis of their salary (using a graph only)?
(Ans: use Rollup and increment a variable)
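An alternative sketch using components only (next_in_sequence() counts per partition, so the flow must be serial at this point): sort descending on salary, then keep the first five records with a Filter by Expression whose select expression is:
next_in_sequence() <= 5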
If I have a project parameter set, and I also have a variable exported in the start script, what takes priority?
The variable exported in the start script.
To run a graph from the command line, deploy it and run the generated script:
cd my_sandbox/run
./my_graph.ksh
Or run the .mp (graph) file directly, using the air sandbox run command. For example:
air sandbox run my_graph.mp
1) autosys
2) wrapper script
3) start and end script at graph level
4) conditional dml
What are the various ways of unloading data from a table?
Unload using MFS set in the layout: opens x sessions, where x is the degree of parallelism. This is the fastest way if the table is not heavily partitioned.
Unload using ABLOCAL: set "Database default" in the layout and write an ablocal_expr in the parameters; this helps unload the data faster if the table is lightly partitioned. Best used for dimension tables.
Unload using SERIAL: the ordinary way of unloading; it is better to add Oracle's parallel hint in order to increase speed. The parallel hint should be based on the DBA's approval, considering the system's performance on a regular basis.
Broadcast, by contrast, sends the same input to multiple output flows, but it can also act like a partition component: serial to MFS, MFS to MFS, etc.
What is the max-core parameter?
It is the amount of memory needed to process a component. (Ab Initio decides the default value, or we can also change it based on the number of graphs running in parallel at the same time.)
How will you include a .xfr in a Reformat to use a function defined in the .xfr file?
include "~$XFR/xyz.xfr";
After this include statement, use the function just like any predefined Ab Initio function.
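A small sketch (the function make_name is illustrative). Suppose ~$XFR/xyz.xfr defines:
out :: make_name(first, last) =
begin
  out :: string_concat(first, " ", last);
end;
Then a Reformat transform can include and call it:
include "~$XFR/xyz.xfr";
out :: reformat(in) =
begin
  out.full_name :: make_name(in.first_name, in.last_name);
end;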
If max-core is set so high that the computer's swap space is exhausted, you can cause your own graph, and possibly other applications, or even the computer itself, to fail. This has the worst possible effect on performance.
To size max-core:
1. Figure out the maximum graph memory usage, excluding max-core components and lookup files.
2. Estimate the available physical memory.
3. Subtract the graph's required memory from the available memory.
4. Take 1/2 to 3/4 of the result as a safety margin.
5. Divide by the number of partitions to get max-core for components; max-core is per partition.
For example: 8 GB of physical memory minus 2 GB of other graph memory leaves 6 GB; half of that is 3 GB; divided across 4 partitions, roughly 750 MB of max-core budget remains per partition.
What is ABLOCAL and how can I use it to resolve failures when unloading
in parallel?
Answer
Some complex SQL statements contain grammar that is not recognized by the Ab Initio parser
when unloading in parallel. In this case you can use the ABLOCAL construct to prevent the
INPUT TABLE component from parsing the SQL (it will get passed through to the database). It
also specifies which table to use for the parallel clause.
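A hedged sketch of the usual form (the table name cust_detail is illustrative), placing the ABLOCAL clause in the SQL given to the INPUT TABLE component:
select * from cust_detail where ABLOCAL(cust_detail)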
What does the error message "Failed to allocate <n> bytes" mean?
This error message is generated when an Ab Initio process has exceeded its limit for some type
of memory allocation. The entire computer is out of swap space.
What does the error message "Too many open files" mean?
This error message occurs most commonly when the value of the max-core parameter of the
SORT component is set too low. In these cases, increasing the value of the max-core parameter
solves the problem.
What does the error message "Trouble writing to socket: No space left on
device" mean?
This error message means your work directory (AB_WORK_DIR) is full.
NOTE: Any jobs running when AB_WORK_DIR fills up are unrecoverable
What is layout?
The layout of a component is a list of hostname/pathname locations that defines two things:
Where the component runs
How many ways parallel the component runs (the depth of parallelism of the component)
What is the difference between m_rollback and m_cleanup and when would
I use them?
m_rollback has the same effect as an automatic rollback — using the jobname.rec file, it rolls
back a job to the last completed checkpoint, or to the beginning if the job has not completed any
checkpoints. The m_cleanup commands are used when the jobname.rec file doesn't exist and
you want to remove temporary files and directories left by failed jobs.
Can I access different databases from the same graph — for example,
unload from Oracle and load into Teradata?
Yes, you can access different databases from within the same graph. Each component that
references a database needs its own database configuration file.
Why does the number of digits vary between database and DML types?
For example, the database type NUMBER(18) for Oracle databases becomes decimal(19) in
DML.
Databases use different conventions than DML for specifying the number of significant digits. For
example, the database type NUMBER(18) in Oracle databases allows 18 digits plus a sign.
However, because DML's decimal type requires a separate character for the sign, this type
requires a 19-character decimal
How do I check the status of a graph in the end script and perform an
appropriate action based on that status (such as doing cleanup or
sending an e-mail)?
You can verify the status of the graph in the end script using the variable $mpjret. If the graph
succeeded, the value of the variable is 0; otherwise, it is non-zero.
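A minimal end-script sketch (the notification command and address are illustrative):
if [ $mpjret -eq 0 ]; then
  echo "Graph completed successfully"
else
  echo "Graph failed with status $mpjret" | mailx -s "Graph failed" admin@example.com
fi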
.air-sandbox-overrides
This file exists only if you are using version 1.11 or a later version of the GDE. It contains the
user's private values for any parameters in .air-project-parameters that have the Private Value
flag set. It has the same format as the .air-project-parameters file.
Having these values in a separate file keeps them from being checked in with the other
parameter values. It also allows you to change the values without having to lock the parameters
file.
When you edit a value (in the project parameters editor) for a parameter that has the Private
Value flag checked, the value is stored in the .air-sandbox-overrides file rather than the .air-
project-parameters file.
.air-sandbox-overrides-stdenv
This file exists only if you are using a version of the GDE earlier than version 1.11. This file
contains the user's personal overrides of any parameters in .air-project-parameters. It has the
same format as the .air-project-parameters file.
You add a parameter to this file by specifying ";U" in a parameter's Description cell in the
sandbox parameter editor.
.air-sandbox-overrides-default
This file exists only if you are using a version of the GDE earlier than version 1.11. This file
contains the default values of any parameter that has been overridden by the user. It has the
same format as the .air-project-parameters file. Do not edit this file.
Scripts
ab_project_setup.ksh
By dotting in this script you can set up a sandbox environment for the project outside the sandbox
(there is no need to use it within the sandbox and you should not use it there). This allows
deployed scripts to be run outside their sandbox.
The system configuration file is named abinitiorc, and is usually set up by the Co>Operating
System administrator.
When the Co>Operating System runs a process that needs a value for a configuration variable, it looks for that value in the following places, in the order listed: the process environment, the user configuration file (.abinitiorc in the user's home directory), and finally the system configuration file ($AB_HOME/config/abinitiorc).
Example of an abinitiorc file:
AB_WORK_DIR: /export/appl/abinitio/ab_work_dir
AB_UMASK : 002
AB_OUTPUT_FILE_UMASK : 002
AB_DOT_WORK_CREATION_MODE : 0775
AB_AGENT_COUNT : 1
AB_CONNECTION : ssh
AB_SSH_NEEDS_PASSWORD : false
AB_SSH : /usr/SYSADM/bin/sshi
MFS_MOUNT_POINTS:/export/appl
AB_AIR_ROOT://dsysadm-etl01/export/appl/abinitio/repository/repo
AB_HOME
Specifies the root directory of the Co>Operating System installation.
AB_JOB
Specifies the base name of the job file for the current job
If you have not specified a value for the AB_JOB configuration variable, the GDE bases the
default value for AB_JOB on the name of the graph when it generates the script for that graph.
This means that the default value for AB_JOB for a particular graph will always be the same
AB_JOB_PREFIX
To avoid problems with multiple instances of a graph being run concurrently in the same
directory, you can make the AB_JOB value unique by exporting the AB_JOB_PREFIX
configuration variable. For example, if the AB_JOB_PREFIX value is ABC, use the following
command:
export AB_JOB_PREFIX=ABC
AB_WORK_DIR
is a configuration variable whose value is the directory used for Ab Initio scratch space and for
maintaining information related to graph recovery.
AB_DATA_DIR
The configuration variable AB_DATA_DIR provides a directory for temporary data files that lack a specific, writable path in their layouts, and thus it overrides the behavior above.
Utility Commands
Catalog management commands
The Co>Operating System catalog management commands provide a means of working with
catalogs of lookup files. A catalog is a list of lookup files. Each lookup file has:
A unique name
A primary key
The URL of a data file
The URL of the record format associated with the data file
Cleanup commands
Example of m_cleanup_du
find /var/abinitio/ -name "*.[hn]lg" -user e3umjk -mtime +7 | xargs m_cleanup_du -j
Data commands
The Co>Operating System data utility commands are:
m_dump
This DML utility prints information about data records, their record formats, and the
evaluations of expressions
Printing particular records
This example prints five data records, numbers 5, 6, 7, 8, and 9:
$ m_dump xyz.dml xyz.dat -start 5 -end 9
m_eval
This DML utility evaluates DML expressions and displays their derived types. Use it to
quickly test and evaluate simple, multiple, cast, and other expressions that you want to
use in a graph.
Example
m_eval '(date("YYYYMMDD")) (today() - 10)'
m_eval '((decimal(5))42)' returns "   42" (the value right-justified in five characters with leading blanks)
Example of m_mkfs
m_mkfs //pluto.us.com/usr/ed/mfs3 \
//pear.us.com/p/mfs3-0 \
//pear.us.com/q/mfs3-1 \
//plum.us.com/D:/mfs3-2
Example of m_mkdir
m_mkdir //pluto.us.com/usr/ed/mfs3/cust
XXurl
Manipulates filenames, file paths, URLs, and parts of URLs in various ways that might be useful
for graph developers and users.
Other commands
Other Co>Operating System commands are:
m_cs : shows the names of the character sets.
m_password : uses AES encryption with a 256-bit key.
m_env : a general-purpose utility for obtaining information about Ab Initio configuration variable settings in an environment.
m_queue
m_view_errors
m_ftp
m_db commands
m_db convert
------------
Converts .cfg files to .dbc formats and .dbc to .cfg formats.
m_db convert -cfg <cfg-filename>
m_db convert -dbc <dbc-filename>
m_db create
------------
Creates the database table with DDL equivalent to the metadata-string.
If the table already exists, an error will be returned unless the -existence_ok flag is supplied.
m_db create <dbc-name> -dml '<metadata-string>' -table <tablename> [-existence_ok]
m_db create <dbc-name> -dml_file <dml-file> -table <tablename> [-existence_ok]
m_db create_commit_table
------------------------
Creates a table used internally by Ab Initio for checkpointing information.
m_db describe
-------------
m_db describe -configs
Lists brief descriptions of all dbc files found.
m_db describe -oledb_providers
Describes OLE DB Providers for Windows NT.
m_db describe -oledb_data_sources
Describes OLE DB data sources for Windows NT.
m_db describe -supported_databases
Describes databases supported by this package.
m_db env
--------
Prints the value of environment variables used by this database vendor.
m_db env <dbc-name>
m_db find
----------
Shows which dbc file will be used for this dbc file name.
m_db find <dbc-name>
m_db genctl
------------
Generates a load control file based on the DML supplied.
m_db genctl <dbc-name> [<options>] -dml '<metadata-string>' -table <tablename> -component
<name>
m_db genctl <dbc-name> -dml_file <dml-filename> -table <tablename> -component <name>
<options>: [-unique_identifier <uid>] [-dsname <dsname>]
<options>: [-utproc <utproc>]
** Valid component names: db-db2uss-load
m_db genddl
------------
Generates database table creation DDL equivalent to the metadata-string.
m_db genddl <dbc-name> -dml '<metadata-string>' -table <tablename>
m_db genddl <dbc-name> -dml_file <dml-filename> -table <tablename>
m_db gendml
-----------
Generates appropriate metadata for that table/insert/select or expression.
m_db gendml <dbc-name> [<options>] -table <tablename>
m_db gendml <dbc-name> [<options>] -select '<sql-select-statement>'
m_db gendml <dbc-name> [<options>] -insert '<sql-insert-statement>'
m_db gendml <dbc-name> [<options>] -tables <texpr> [<options2>]
m_db gendml <dbc-name> [<options>] -views <texpr> [<options2>]
m_db gendml <dbc-name> [<options>] -aliases <texpr> [<options2>]
m_db gendml <dbc-name> [<options>] -all_objects <texpr> [<options2>]
m_db load
----------
Loads data to a database table or insert statement from stdin or a file or string.
m_db load <dbc-name> -dml '<metadata-string>' -table <tablename> [-data <string> || -data_file
<filename>]
m_db load <dbc-name> -dml_file <dml-filename> -table <tablename> [-data <string> || -data_file
<filename>]
m_db load <dbc-name> -dml '<metadata-string>' -insert '<sql-insert-statement>' [-data <string> ||
-data_file <filename>]
m_db load <dbc-name> -dml_file <dml-filename> -insert '<sql-insert-statement>' [-data <string> ||
-data_file <filename>]
m_db print
-----------
Prints out the configuration settings for this config file.
m_db print <dbc-name> -value <tagname>
m_db print <dbc-name> -all [-verbose]
m_db run
-------------
Runs SQL against the database.
m_db run <dbc-name> -sql '<sql-string>'
m_db truncate
--------------
Truncates the given table.
m_db truncate <dbc-name> -table <tablename>
m_db unload
-------------
Unloads data from database table, select or expression to stdout.
m_db unload <dbc-name> [<options>] -table <tablename>
m_db unload <dbc-name> [<options>] -select '<sql-select-statement>'
m_db unload <dbc-name> [<options>] -tables <texpr> [<options2>]
m_db unload <dbc-name> [<options>] -views <texpr> [<options2>]
m_db unload <dbc-name> [<options>] -aliases <texpr> [<options2>]
m_db unload <dbc-name> [<options>] -all_objects <texpr> [<options2>]
Examples: m_db unload mydb.dbc -table 'fred.mytable'
m_db version
--------------
Tries to get the database version from the database if it
can, and if not, it will return the version from the .dbc file.
m_db version <dbc-name>
Paging (or having very little free physical memory, which means you're close to paging) is the result of phases that have too many components trying to run at once, or of too much data parallelism.
In case of low-volume processing, go serial. When this is true, the graph's startup time will be disproportionately large in relation to the actual run time. Can the application process more data per run? Maybe it could use READ MULTIPLE FILES, for example, to read many little files per run (ad hoc multifiles), instead of running so many times.
Bad placement of phase breaks: Wherever a phase break occurs in a graph, the data in
the flow is written to disk; it is then read back into memory at the beginning of the next
phase. For example, putting a phase break just before a FILTER BY EXPRESSION is
probably a bad idea: the size of the data is probably going to be reduced by the
component, so why write it all to disk just before that happens?
Too many sorts: The SORT component breaks pipeline parallelism and causes additional
disk I/O to happen.
Tracking Performance
Track graph performance through the GDE and system utilities.
In general, completely fixed format records take less CPU to process than variable length
records.
Drop fields that aren’t needed as soon as possible. This is often done “for free” in
transform components.
"Flatten" out conditional fields as soon as possible. Often, conditional fields are used to store multiple record types in a single format. Split these into separate processing streams as soon as possible. Join them back at the end of the graph, if required.
Join, Rollup, and Scan can operate either in-memory or on sorted data. If your data does not fit in memory, and you need to do multiple joins or rollups on the same key, it will be most efficient to sort once and set the rollups and joins to expect sorted input.
Best performance will be gained when components have enough max-core to avoid spilling to disk. If max-core cannot accommodate a component's memory needs, data will spill to disk; then you need disk space to accommodate the entire component memory requirement.
Use Rollup or Filter by Expression as soon as possible if they will reduce the number of
records being processed.
Join as early as possible if this will reduce the number of records being processed.
Join as late as possible if this will increase the number of records or the width of records
being processed.
When joining a very small dataset to a very large dataset, it may be more efficient to
broadcast the small dataset or use it as a lookup file rather than repartition and re-sort
the large dataset.
Lookup Files benefit from fixed record formats.
When creating lookup files from source data, drop any fields which aren’t keys or lookup
results.
Use lookup multifiles and lookup_local() for very large datasets (see the sketch after this list).
Reducing the number of components may save startup costs, but may sometimes make
a graph harder to understand.
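A hedged sketch of a keyed lookup inside a transform (the file label Customers and the field names are illustrative); lookup_local() has the same shape but searches only the current partition of a lookup multifile:
out.region :: if (lookup_count("Customers", in.cust_id) > 0)
  lookup("Customers", in.cust_id).region
else "UNKNOWN";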
What are various versions of GDE you have worked on and what
are the major differences?
GDE 1.12, 1.13, and 1.14, with Co>Op 2.12, 2.13, and 2.14.
1.12
Rollup capabilities were enhanced: you can now define local variables, statements, key_change, input_select, and output_select, and still use aggregation functions.
Replay
The GDE allows you to save the tracking information of a run to a file, and later play it back, optionally at a higher speed.
1.13
Enabling the Debugger allows you to insert breakpoints into your transform functions.
1.14
Checkin is simpler and faster. In GDE Version 1.14, translation of graphs for analysis
purposes is no longer performed within the GDE; the options for controlling analysis
have been removed.
Sandbox view
What are the common tools/components which you have created in Abinitio
Explain about OTS load graphs
Audit entries components (explain the job dependency and job audit table)
The following expression generates a unique, dense, ascending sequence number across all partitions (record n of partition p receives number_of_partitions()*(n-1) + p + 1, plus a date-based offset; temp_today_date_decimal is a variable defined elsewhere):
(number_of_partitions()*next_in_sequence() + this_partition()) - number_of_partitions() + 1 + temp_today_date_decimal;