Abinitio Training: Medium Complex Components



> Data Aggregation

[Figure: input records are aggregated into one output record per city]

Input records:
0345Smith  Bristol   56
0212Spade  London    88
0322Jones  Compton   12
0492West   London    23
0121Forth  Bristol   77
0221Black  New York  42

Output records:
Bristol   63
Compton   12
London    31
New York  42

Confidential ©2012 Syntel, Inc.


> Data Aggregation of Sorted/Grouped Input

[Figure: the same input, sorted/grouped by city, is aggregated into one output record per city]

Input records (grouped by city):
0345Smith  Bristol   56
0121Forth  Bristol   77
0322Jones  Compton   12
0212Spade  London    88
0492West   London    23
0221Black  New York  42

Output records:
Bristol   63
Compton   12
London    31
New York  42



> Built-in Functions for Rollup

- The following aggregation functions are predefined and are only
  available in the rollup component:

  avg      max
  count    min
  first    product
  last     sum



> Rollup Wizard

Note the use of an aggregation function in the expression



ROLLUP

- Rollup evaluates a group of input records that have the same key,
  and then generates records that either summarize each group or
  select certain information from each group.



Two ways to use ROLLUP

- You can use a Rollup component in two ways, depending on how you
  define the transform parameter:
  - Define a transform that uses a template rollup function. This is
    called template mode and is most often used when you want to
    aggregate the data.
  - Create a transform using an expanded Rollup package. This is
    called expanded mode and allows for rollups that do not
    necessarily use regular aggregation functions.



Template Mode

- In the transform parameter, you specify an aggregation function
  that describes how the data should be "rolled up," or summarized,
  usually in some cumulative way.
- For example, counting the records in a group or summing a field.
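
As an illustration only, here is a minimal Python sketch of what a template-style rollup does conceptually. It is not Ab Initio DML: the records, the field positions, and the name `rollup_template` are hypothetical, and the count and sum here merely stand in for the built-in aggregation functions.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input records: (name, city, amount)
records = [
    ("Smith", "Bristol", 10),
    ("Spade", "London", 20),
    ("Forth", "Bristol", 5),
    ("West", "London", 7),
]

def rollup_template(rows, key_index, value_index):
    """Return one (key, count, total) record per key group."""
    # groupby only merges adjacent rows, so group the input by sorting first
    ordered = sorted(rows, key=itemgetter(key_index))
    out = []
    for key, group in groupby(ordered, key=itemgetter(key_index)):
        values = [row[value_index] for row in group]
        out.append((key, len(values), sum(values)))
    return out

summary = rollup_template(records, key_index=1, value_index=2)
print(summary)
```

Each key produces exactly one output record, however many input records share that key.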



Expanded Mode

- Expanded mode provides more control over the data rollup
- It lets you specify transformations that are not possible with
  template mode
- With an expanded Rollup package, you must define the following
  items:
  - DML type named temporary_type
  - initialize function that returns a temporary_type record
  - rollup function that takes two input arguments (an input record
    and a temporary_type record) and returns an updated
    temporary_type record
  - finalize function that returns an output record
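
The division of labour between the three functions can be sketched in Python. This is an analogy under stated assumptions, not DML: the `Temp` class plays the role of temporary_type, the driver `run_rollup` is hypothetical, and passing the last record of the group to `finalize` is an assumption made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Temp:
    """Stand-in for temporary_type: per-group working state."""
    count: int
    largest: int

def initialize(rec):
    # Called once per group, before any record is rolled up
    return Temp(count=0, largest=rec["amount"])

def rollup(temp, rec):
    # Called once per input record; returns the updated temporary record
    return Temp(temp.count + 1, max(temp.largest, rec["amount"]))

def finalize(temp, rec):
    # Called once per group; builds the output record
    return {"city": rec["city"], "count": temp.count, "largest": temp.largest}

def run_rollup(records, key):
    """Hypothetical driver: applies the three functions group by group."""
    state = {}  # key value -> (Temp, last record seen for that key)
    for rec in records:
        k = rec[key]
        temp = state[k][0] if k in state else initialize(rec)
        state[k] = (rollup(temp, rec), rec)
    return [finalize(temp, last) for temp, last in state.values()]

result = run_rollup(
    [{"city": "Bristol", "amount": 56},
     {"city": "London", "amount": 88},
     {"city": "Bristol", "amount": 77}],
    key="city",
)
print(result)
```

Because the temporary record can hold any fields, this style can express rollups that a single aggregation function cannot.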



Package View of a Rollup



Rollup Transform



Vector Creation Through Rollup



SCAN

- For every input record, Scan generates an output record that
  consists of a running cumulative summary for the group to which
  the input record belongs, up to and including the current record
- For example, the output records might include successive
  year-to-date totals for groups of records.
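
A small Python sketch of the running-summary idea, using hypothetical (key, value) rows and a hypothetical function name. It shows only the behaviour described above (one output per input, cumulative within the group) and is not how Scan is configured in a graph.

```python
from collections import defaultdict

def scan_running_total(rows):
    """Yield (key, value, running_total) for every input row."""
    totals = defaultdict(int)  # running total per key
    for key, value in rows:
        totals[key] += value
        yield key, value, totals[key]

# Hypothetical rows: (year, amount), in arrival order
rows = [("2011", 10), ("2011", 5), ("2012", 7), ("2011", 1)]
result = list(scan_running_total(rows))
print(result)
```

Unlike a rollup, the number of output records equals the number of input records.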



Template Mode

- Template mode is the simplest way to use SCAN. In the transform
  parameter, you specify an aggregation function that describes how
  the cumulative summary should be computed.



Expanded Mode

- Expanded mode provides more control over the scan. It lets you
  edit the expanded package, so you can specify transformations
  that are not possible with template mode
- With an expanded SCAN package, you must define the following
  items:
  - DML type named temporary_type
  - initialize function that returns a temporary_type record
  - scan function that takes two input arguments (an input record
    and a temporary_type record) and returns an updated
    temporary_type record
  - finalize function that returns an output record
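
The same three-function shape can be sketched in Python for a scan. Everything here is an illustrative assumption rather than DML: a plain dict stands in for temporary_type, `run_scan` is a hypothetical driver, and `finalize` is called after every record so that each input yields one output.

```python
def initialize(rec):
    # Stand-in for temporary_type: running sum and record count for the group
    return {"total": 0, "seen": 0}

def scan(temp, rec):
    # Returns the updated temporary record for the current input record
    return {"total": temp["total"] + rec["amount"], "seen": temp["seen"] + 1}

def finalize(temp, rec):
    # Builds the output record from the current record and the running state
    return {
        "account": rec["account"],
        "amount": rec["amount"],
        "running_avg": temp["total"] / temp["seen"],
    }

def run_scan(records, key):
    """Hypothetical driver: one output record per input record."""
    temps = {}
    out = []
    for rec in records:
        k = rec[key]
        temp = temps[k] if k in temps else initialize(rec)
        temps[k] = scan(temp, rec)
        out.append(finalize(temps[k], rec))
    return out

result = run_scan(
    [{"account": "a", "amount": 10},
     {"account": "a", "amount": 20},
     {"account": "b", "amount": 4}],
    key="account",
)
print([r["running_avg"] for r in result])
```

A running average is used here because it needs two pieces of state, which is the kind of case where a single template aggregation function is not enough.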



NORMALIZE

- Normalize generates multiple output records from each of its
  input records
- You can directly specify the number of output records for each
  input record
  OR
- You can make the number of output records dependent on a
  calculation



Run Time Behavior of Normalize

- Reads the input record
- Performs temporary initialization
- Performs iterations of the normalize transform function
- Sends the output record to the out port
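
The steps above can be sketched in Python for the case where the number of output records depends on a calculation. The record layout (a name plus a list of phone numbers) and the driver `run_normalize` are hypothetical, and the sketch keeps no temporary state, so the initialization step is omitted.

```python
def length(rec):
    # How many output records this input record should produce
    return len(rec["phones"])

def normalize(rec, index):
    # Builds the output record for one iteration
    return {"name": rec["name"], "phone": rec["phones"][index]}

def run_normalize(records):
    """Hypothetical driver mirroring the run-time steps."""
    out = []
    for rec in records:                        # read the input record
        for index in range(length(rec)):       # iterate the transform
            out.append(normalize(rec, index))  # send the record to the out port
    return out

flat = run_normalize([
    {"name": "Smith", "phones": ["111", "222"]},
    {"name": "West", "phones": []},
    {"name": "Jones", "phones": ["333"]},
])
print(flat)
```

An input record whose vector is empty produces no output records at all.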



Simple Example

 An Input file having vector data is normalized



Simple Example

 Output is as below:



Phase

- A phase is a stage of a graph that runs to completion before the
  start of the next phase
- By dividing a graph into phases, you can make the best use of
  resources such as memory, disk space, and CPU cycles
- The boundary between two phases is called a phase break, and it
  belongs to the first of the two phases



Checkpoint

- A checkpoint is a point at which the Co>Operating System saves all
  the information it would need to restore a job to its state at
  that point
- You can have checkpoints only at phase breaks
- As the execution of the graph successfully passes each succeeding
  checkpoint, the Co>Operating System:
  - Deletes the information it has saved to be able to restore the
    job to its state at the preceding checkpoint
  - Deletes the temporary files it has written in the layouts of
    the components in all phases since the preceding checkpoint
  - Commits the effects on the file system of all phases since the
    preceding checkpoint
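
The bookkeeping listed above can be pictured with a toy Python model. It is only a mental model under assumed names (`Job`, `run_phase`, `checkpoint`, and the file names are all hypothetical) and says nothing about how the Co>Operating System actually stores recovery information.

```python
class Job:
    """Toy model of checkpoint bookkeeping, for illustration only."""

    def __init__(self):
        self.recovery_info = None  # state saved at the latest checkpoint
        self.temp_files = []       # temporary files written since that checkpoint
        self.pending = []          # file-system effects not yet committed
        self.committed = []        # file-system effects already committed

    def run_phase(self, name):
        # A phase writes temporary files and produces uncommitted effects
        self.temp_files.append(f"{name}.tmp")
        self.pending.append(f"{name}.out")

    def checkpoint(self, name):
        self.recovery_info = name            # replaces the preceding checkpoint's info
        self.temp_files.clear()              # delete temporary files since the last one
        self.committed.extend(self.pending)  # commit the file-system effects
        self.pending.clear()

job = Job()
job.run_phase("phase0")
job.run_phase("phase1")
job.checkpoint("after_phase1")
job.run_phase("phase2")
print(job.committed, job.pending)
```

In this model a failure during phase2 would only need to undo `phase2.tmp` and `phase2.out`, since everything before the checkpoint is already committed.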



Pictorial Representation

 Please see below:



CHECK ORDER

- Check Order tests input records to determine whether the records
  are sorted according to the key specified in the key parameter
- limit: the maximum number of incorrectly ordered records allowed
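
A minimal Python sketch of an order check with a limit. The function name, the ascending-order assumption, the choice to compare each record with the one immediately before it, and raising an error once the limit is exceeded are all assumptions made for illustration.

```python
def check_order(records, key, limit=0):
    """Return positions of out-of-order records.

    Raises ValueError as soon as more than `limit` such records are found.
    """
    bad = []
    previous = None
    for position, rec in enumerate(records):
        current = key(rec)
        if previous is not None and current < previous:
            bad.append(position)
            if len(bad) > limit:
                raise ValueError(
                    f"more than {limit} out-of-order records; last at {position}"
                )
        previous = current  # always compare with the record just before
    return bad

positions = check_order([1, 2, 5, 3, 4], key=lambda r: r, limit=1)
print(positions)
```

With `limit=0` the same input would fail at the first misplaced record.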



COMPARE CHECKSUMS

- Compare Checksums compares two checksums generated by the
  Compute Checksum component


COMPARE RECORDS

- Compare Records reads records from two flows, compares the
  records one by one, and writes one-line text reports
- It compares records using a byte-by-byte comparison
- Limit



COMPUTE CHECKSUM

- Compute Checksum computes a checksum for records
- Output record format:

  record
    big endian real(8) sum_of_crc;
    big endian real(8) sum_of_lengths;
    unsigned big endian integer(4) xor_of_crc;
    unsigned big endian integer(4) record_count;
  end
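
To make the four output fields concrete, here is a Python sketch that fills a record of the same shape. It assumes a per-record CRC from `zlib.crc32`, which may well differ from the CRC the component uses, and the function names are hypothetical.

```python
import zlib
from functools import reduce

def compute_checksum(records):
    """Summarize a list of byte-string records into four checksum fields."""
    crcs = [zlib.crc32(r) for r in records]  # assumed CRC; unsigned 32-bit values
    return {
        "sum_of_crc": float(sum(crcs)),
        "sum_of_lengths": float(sum(len(r) for r in records)),
        "xor_of_crc": reduce(lambda a, b: a ^ b, crcs, 0),
        "record_count": len(records),
    }

def compare_checksums(left, right):
    """True when two checksum records agree on every field."""
    return left == right

a = compute_checksum([b"abc", b"de"])
b = compute_checksum([b"de", b"abc"])
print(compare_checksums(a, b))
```

In this sketch every field is a sum, an XOR, or a count, so the result does not depend on record order.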



GENERATE RANDOM BYTES

- Generate Random Bytes generates a specified number of records,
  each consisting of a specified number of random bytes
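
The equivalent idea in Python, as a rough sketch: `os.urandom` is used as the source of random bytes, and the function and parameter names are hypothetical.

```python
import os

def generate_random_bytes(num_records, record_size):
    """Return num_records records, each made of record_size random bytes."""
    return [os.urandom(record_size) for _ in range(num_records)]

random_records = generate_random_bytes(3, 8)
print(len(random_records), len(random_records[0]))
```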



VALIDATE RECORDS

- Validate Records separates valid records from invalid records
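
A short Python sketch of the separation. The validity rule in `is_valid` (non-empty city, integer amount) is purely hypothetical and chosen for illustration; the component's real rule depends on the record format in use.

```python
def is_valid(rec):
    # Hypothetical rule: city must be non-empty and amount must be an integer
    return bool(rec.get("city")) and isinstance(rec.get("amount"), int)

def validate_records(records):
    """Split records into (valid, invalid), preserving input order."""
    valid, invalid = [], []
    for rec in records:
        (valid if is_valid(rec) else invalid).append(rec)
    return valid, invalid

valid, invalid = validate_records([
    {"city": "Bristol", "amount": 56},
    {"city": "", "amount": 12},
    {"city": "London", "amount": "x"},
])
print(len(valid), len(invalid))
```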



Questions ?
