DataStage Training Day 1
1-1
Course Prerequisites
To gain full benefit from this class, you
should have a strong working knowledge of:
– Relational database management systems
(RDBMS)
– Structured Query Language (SQL)
– Windows GUI
– Open Database Connectivity (ODBC)
1-2
What is DataStage?
DataStage is a client/server ETL (Extract, Transform, Load) tool: you design jobs that extract data from source systems, transform it, and load it into target systems.
1-3
DataStage Architecture
[Architecture diagram: the client tools (Designer, Director, Manager, Administrator) connect to the DataStage server, which hosts the DataStage "Engine" (UniVerse objects) and the project Repository.]
1-4
Client Components
• Designer: Creates DataStage jobs that are compiled into executable programs.
• Director: Used to run and monitor DataStage jobs.
• Manager: A user interface to view and edit the contents of the repository and to import and export DataStage components.
• Administrator: A user interface for administrative tasks such as setting up users and adding projects.
1-5
DataStage Designer
1-6
Manager and Director
1-7
Repository
• Proprietary file format.
• A repository is created for each project.
• A single server can host multiple repositories.
• Stores all source/target definitions.
• Can be imported/exported in an XML format.
• Repository contents:
– Data Elements.
– Jobs.
– Routines.
– Shared Containers.
– Stage Types.
– Table Definitions.
– Transforms.
1-8
DataStage Administrator
• Creating and attaching a project to DataStage
• Executing a UniVerse command
• Cleaning up a project
• License
• Permissions
• Auto-purge of log files
• Inactivity timeout
• User for scheduled jobs
• User name troubleshooting
1-9
DataStage Designer Design Area
[Screenshot of the Designer window, with callouts for the stages palette and the repository tree.]
1 - 10
DataStage Server Stages
• A stage is like a station that performs a certain task, e.g. reading data or aggregating data.
• Each stage has its own editable properties.
1 - 11
ODBC Stage
• The ODBC stage can be used to extract, write, and aggregate data.
• The ODBC stage can have any number of inputs and outputs.
• The three editable property pages for this stage are "Stage", "Inputs", and "Outputs".
1 - 12
ODBC Stage - Stage Properties
• Stage name: identifies that particular stage.
• Select the DSN from the drop-down list.
• Specify a valid user name and password for that DSN (these can be supplied as job parameters entered at run time).
• Specify the database-specific quote character.
• You can choose the character set map for ODBC support (necessary for supporting various languages).
1 - 13
ODBC Stage - Output Properties (Extract)
[Screenshot callouts: add multiple tables which have been imported before; preview data; define the SQL type.]
• Generated query: built from the tables and columns chosen.
• User-generated query (SQL override).
• Stored procedure read.
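For a stored procedure read, the stage invokes the procedure through the standard ODBC call escape syntax; a minimal sketch (the procedure name GET_ORDERS and its parameter are hypothetical):

    -- ODBC call escape syntax; the parameter marker is bound at run time
    { CALL GET_ORDERS(?) }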
1 - 14
ODBC Stage - Output Properties (contd.)
Specify joins if multiple tables are used, or a WHERE clause to restrict the data being extracted.
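As a sketch, a query that joins two hypothetical imported tables and restricts the rows might look like:

    SELECT o.ORDER_ID, o.ORDER_DT, c.CUST_NAME
    FROM ORDERS o, CUSTOMERS c
    WHERE o.CUST_ID = c.CUST_ID        -- join between the two tables
      AND o.ORDER_DT >= '2004-01-01'   -- clause restricting the extract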
Isolation Level
Specify a suitable transaction isolation level. This specifies how potential conflicts between transactions
(i.e., dirty reads, nonrepeatable reads, and phantom reads) are handled. Choose from:
• None. The default isolation level for the database is used.
• Read Uncommitted. Dirty reads, nonrepeatable reads, and phantom reads are all possible.
• Read Committed. Dirty reads are not possible. Nonrepeatable reads and phantom reads are both
possible.
• Repeatable Read. Dirty reads and nonrepeatable reads are not possible, but phantom reads are
possible.
• Serializable. Transactions are serializable. Dirty reads, nonrepeatable reads, and phantom reads are
not possible.
• Versioning. Transactions are serializable, but have a higher concurrency than with Serializable.
Versioning is implemented using a non-locking protocol.
• Auto commit. Each write is committed separately. Do not use this in conjunction with transaction groups.
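For reference, these are the standard ODBC/SQL isolation levels; in plain SQL you would set one like this (exact syntax varies by database):

    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;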
1 - 15
ODBC Stage - Extracting Data
• Import table definitions before you can use them in an ODBC stage.
• You can extract from multiple tables in the same ODBC stage by specifying joins in the selection clause.
• You can use user-defined SQL or stored procedures to extract data.
• The columns loaded can come from multiple tables, but column names have to be unique.
• You can preview the data.
• All the columns loaded in the stage are owned by the link which connects it to the next stage.
1 - 16
ODBC Stage - Input Properties (Load)
•The input "Stage" page properties remain the same as the output properties.
•Choose the table name if the table definition has been imported before, or create the table in the target database.
•Choose the appropriate update action.
Update action
Specifies how the data is written. Choose the option you want from the drop-down list box:
•Clear the table, then insert rows. Deletes the contents of the table and adds the new rows.
•Insert rows without clearing. Inserts the new rows in the table.
•Replace existing rows completely. Deletes the existing rows, then adds the new rows to the table.
•Update existing rows only. Updates the existing data rows. If a row with the supplied key does not exist in the table then the table is not updated but
a warning is logged.
•Insert new or update existing rows. New rows are added or, if the insert fails, the existing rows are updated.
•Update existing or insert new rows. The existing data rows are updated or, if this fails, new rows are added.
•Call stored procedure. The data is written using a stored procedure. When you select this option, the Procedure name field appears.
•User-defined SQL. The data is written using a user-defined SQL statement. When you select this option, the View SQL tab is replaced by the Enter
SQL tab.
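As an illustration, "Insert new or update existing rows" corresponds to attempting an INSERT and, if it fails on the key, falling back to an UPDATE. A sketch with a hypothetical table and ODBC parameter markers:

    INSERT INTO CUSTOMERS (CUST_ID, CUST_NAME) VALUES (?, ?);
    -- if the insert fails on the primary key, the row is updated instead:
    UPDATE CUSTOMERS SET CUST_NAME = ? WHERE CUST_ID = ?;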
1 - 17
ODBC Stage - Input Properties: Columns
[Screenshot: right-click in the columns grid for editing options.]
1 - 18
ODBC Stage - Create Table Option
•You can define the DDL for table creation, or it can be generated by the tool.
•You can choose to drop the table before creating it if necessary.
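A sketch of what the generated DDL might look like with the drop option enabled (table and columns hypothetical):

    DROP TABLE CUSTOMERS;
    CREATE TABLE CUSTOMERS (
        CUST_ID   INTEGER NOT NULL,
        CUST_NAME VARCHAR(50)
    );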
1 - 19
ODBC Stage - Transaction Handling
•Isolation levels are chosen as discussed before.
• Rows per transaction: the number of rows written before the data is committed to the table. The default value is 0, signifying that all the rows are written before the data is committed.
• Parameter array size: the number of rows written at a time. The default is 1, signifying that each row is written in a separate operation. This should be tuned depending upon the storage space required.
Note: if the above two settings conflict, Rows per transaction takes precedence.
1 - 20
ODBC Stage - Recap
• Define the connection
• Define/import the tables and the table
metadata.
• Choose one of the following to read or
write data
– User Defined
– Generated
– Stored procedure
1 - 21
Sequential File Stage
• Used to read data from or write data to a flat file.
• The text file can be read from or created on any file on the server, or on any drive mapped on the server.
• Can have any number of outputs and inputs.
1 - 22
Sequential File Stage - Defining Input Data
When you write data to a sequential file stage it has an input link. The properties are defined on the Inputs page of the stage: file name and update action.
Note: the file definition has to be imported into the repository before using it in this stage. The procedure is similar to importing ODBC table definitions.
"Backup existing file" creates a backup before overwriting or appending to the file. The backup can be used if a job is reset during execution.
1 - 23
Sequential File Stage - Input Data Format
The Format tab contains parameters that define the format of the data in the file, such as the delimiter and quote character.
1 - 24
Sequential File Stage - Output Data Format
The General page and the Columns page remain the same for both reads and writes.
Additional properties:
• Missing column action: specifies the action taken for columns missing in the data being read. Options are:
  – Pad with NULL
  – Pad with empty string
  – Map empty string
• Suppress row truncation warnings: use this if, for example, you are reading only three columns from a file which has more than three columns defined.
1 - 25
Aggregator Stage
The Aggregator stage classifies data rows from a single input link into groups and computes totals or other aggregate functions; the results are output from the stage. You can have only one input link, but many output links if the data has to be grouped in different ways.
The Aggregator is an "active stage".
Before/after routines: routines which have been included in the repository and can be executed before or after the stage, e.g. run a DOS command or send email.
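Conceptually, an Aggregator stage does what a SQL GROUP BY does. A sketch with hypothetical columns, grouping on CAT_ID and summing a sales amount:

    SELECT CAT_ID, SUM(SALE_AMT) AS TOTAL_SALES
    FROM SALES
    GROUP BY CAT_ID;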
1 - 26
Aggregator Input Data
Data to be aggregated is passed from a previous stage's link, and that link is used as input for the Aggregator stage. The columns coming out of the previous stage are reflected on the Input Columns tab.
All the column headings except Sort and Sort Order have been discussed previously.
1 - 27
Aggregator Output Data
[Screenshot callout: the derivation editor.]
The standard columns for data type appear with two changes:
• Group: specifies whether a given column is used in the GROUP BY clause.
• Derivation: each input column needs a derivation for this stage. This is discussed in the next slide.
1 - 28
Aggregator Stage - Derivation Editor
Iteratively choose each column and define whether to group by it (by ticking its Group By box) or to apply an aggregation function chosen from the drop-down list.
Note: the functions available change according to the underlying column data type.
1 - 29
Transformation Stage
• Takes extracted data, performs conversions, and passes the data to another Transformer stage or to another DataStage stage.
• Can have any number of inputs and outputs.
• Has one primary data link and many reference links.
Double-click the stage to open the transformer editor.
[Screenshot: the editor is split into a link area and a metadata area.]
1 - 30
Transformation Editor
• Link area: where all the column definitions, expressions, stage variables, and constraints are defined. The link area is divided in two parts: the left pane shows the input links, and the right pane shows the output columns as well as their derivations, constraints, and stage variables. Each of these is explained in detail below.
• Metadata area: shows the column metadata, again split in two panes, one for the input columns and one for the output columns. Here you can edit the column properties as you deem necessary.
• Shortcut Area:
1 - 31
Transformer Concepts
By the time you use the Transformer stage you have already defined what input data is to be passed and changed. The input columns are transformed in the transformer editor, on the output pane, for each column chosen.
Input links: this stage has one primary link, which carries the bulk of the data to be transformed. It can also use many other links as reference links, which affect how the data is transformed (lookup activity).
Reference links: you can define key expressions for the key columns on the reference links. The most common expression specifies an equi-join between the primary table and the reference table.
[Diagram: a primary link and a reference link feeding the Transformer stage.]
1 - 32
Transformation - Key Expressions
Key expressions determine how rows on a reference link are matched against the data flowing through the stage. They are defined for reference link "key" columns. A column is made a key column by changing the Key field to Yes in the reference link metadata area.
In the example, CAT_ID is changed to a key column and then its key expression is defined in the link area.
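Conceptually, the reference-link lookup behaves like an equi-join; with hypothetical tables, the equivalent SQL would be:

    SELECT p.PROD_ID, p.SALE_AMT, r.CAT_NAME
    FROM SALES p, CATEGORY r        -- primary and reference tables
    WHERE r.CAT_ID = p.CAT_ID;      -- the key expression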
1 - 33
Transformation - Output Column Expressions
What can you do with the transformation stage?
Add/delete columns: you can add new columns by right-clicking the output pane in the link area and then defining the column metadata in the metadata area. The new column derivations can be custom strings, derivations, concatenations, or standard functions applied to input columns or stage variables. We will discuss them shortly.
Move/edit columns: in the output pane of the link area you can move the columns up and down by dragging. Also, in the metadata area you can edit the column definitions as required.
Before/after routines: this is an active stage, hence you can specify routines to be executed, passing suitable input values as discussed before.
1 - 34
Transformation Stage - Expression Editor
The expression editor allows you to define expressions for columns which have to be modified in the Transformer stage. What can you do with the expression editor?
Entering expressions: whenever the insertion point is in an expression box, you can right-click to open the suggested-operand menu. This gives you rich support for defining various elements and for using predefined transforms, routines, and functions. It also allows you to define expressions based on stage variables.
Continued…
1 - 35
Transformation Stage -Expression Editor
Suggested-operand menu for the Transformer stage:
• DS macros: provided with DataStage and require no arguments; you can select them from a drop-down list.
• DS functions: provides a list of available DataStage functions which can be used.
• DS constants: a list of predefined DataStage constants.
1 - 36
Transformation Stage -Expression Editor
Suggested-operand menu for the Transformer stage (contd.):
• DS routines: defined in the repository. You can use your own routines or the ones provided by DataStage.
• DS transforms: provides a list of available DataStage transforms. Replace the %Arg% placeholders with the real arguments required by that particular transform, in the required format. Transforms can be viewed in the Manager.
• Stage variables: shown in the output pane of the Transformer stage.
1 - 38
Transformation Constraints
You can limit the output data by specifying constraints in the Transformer stage. Constraints are BASIC expressions which can be defined for each output link of a Transformer stage. You can also specify a reject link to catch rows which haven't been written to any other output link. Constraints are defined in the transformer editor's constraints dialog.
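Conceptually, a constraint acts like a WHERE clause on an output link. If, say, one link should carry only large sales (names hypothetical), its constraint filters rows the way this SQL would:

    SELECT * FROM SALES WHERE SALE_AMT > 1000;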
Plug-in Stage
• Plug-in stages are mostly used to access external programs that can be leveraged by DataStage applications.
• DataStage supports a number of plug-in stages to perform special tasks. They have to be installed separately on the server and configured before use.
• After installation, these stages have to be registered on the server system before being used.
E.g. ORABULK: bulk loading of data into an Oracle database; BCPLOAD: bulk load for MS SQL Server.
Each plug-in stage has its own configurable properties, which are documented in the manuals. Please refer to them if you plan to use these stages.
1 - 40