DataStage Training Day 1

Download as pdf or txt
Download as pdf or txt
You are on page 1of 40

DataStage

Training Day 1

1-1
Course Prerequisites
To gain full benefit from this class, you
should have a strong working knowledge of:
– Relational database management systems
(RDBMS)
– Structured Query Language (SQL)
– Windows GUI
– Open Database Connectivity (ODBC)

1-2
What is DataStage?

Datastage is a comprehensive tool for the


fast,easy creation and maintenance of
data marts and datawarehouse.

1-3
DataStage Architecture
Repository DataStage
DataStage Manager Designer DataStage
Administrator Director

UniVerse Objects

SERVER
DataStage
"Engine" Repository

Connections to data sources, e.g. ODBC, native API, FTP, etc.

1-4
Client Components
• Designer:It creates Datastage jobs that are
compiled into executable programs.
• Director: used to run and monitor the Datastage jobs.
• Manager:A user interface to edit and view contents of
the repository and perform import and export of
DataStage Components.
• Administrator: A user interface to perform
administrative jobs like setting up users/adding
projects.

1-5
DataStage Designer

1-6
Manager and Director

1-7
Repository
• Proprietary file format.
• Repository is created for each project.
• Single Server can handle multiple repositories.
• Stores all source/target definitions.
• Can be imported/exported in an XML format.
• Repository Contents.
– Data Elements.
– Jobs.
– Routines.
– Shared Containers.
– Stage Types.
– Table Definitions.
– Transforms.

1-8
DataStage Administrator
 Creating & attaching a project to DataStage
 Executing a UniVerse Command
 Cleaning Up a Project
 License
 Permissions
 Auto-Purge of Log Files
 Inactivity Timeout
 User for Scheduled Jobs
 User Name Troubleshooting

1-9
DataStage Designer Design Area

Stages

repository

1 - 10
DataStage Server Stages
• Is like a station which performs certain tasks. E.g.
Read data, Aggregate Data
• Each Stage has its own editable properties

Stages used most often are:


• ODBC Stage
• Sequential file Stage
• Transformation Stage
• Aggregator stage

1 - 11
ODBC Stage
• ODBC stage can be used to extract,write and aggregate data
• ODBC stage can have any number of inputs and outputs.
• The three editable properties for this stage are “Stage”, “Inputs”, “Outputs”

Note:DSN has to be defined


on the server machine as a
system DSN.

Input: When data is


added/Loaded to a table
Output: When data is
read/Extracted from a
table

1 - 12
ODBC Stage - Stage Properties
• Stage Name: Stage Name to identify that particular
stage.
• You can add the DSN name from the drop down list.
• Specify valid username and password for the same.
( Can be used as a job parameters entered during run
time.)
• Specify the database specific quote character.
• You can choose the character map for the ODBC
support ( necessary for supporting various languages)

1 - 13
Add multiple tables
ODBC Stage - Output Properties (Extract) which have been
imported before

Preview
data
Define SQL Type
• Generated Query: Depending upon the columns chosen
• User Generated Query ( SQL Over Ride)
• Stored Procedure Read.

1 - 14
ODBC Stage - Output Properties Contd.. Specify joins if multiple tables are used or
any clause to restrict data being extracted

Verify the SQL being generated to extract


data. Can be user defined or generated

Isolation Level
Specify a suitable transaction isolation level. This specifies how potential conflicts between transactions
(i.e., dirty reads, nonrepeatable reads, and phantom reads) are handled. Choose from:
• None. The default isolation level for the database is used.
• Read Uncommitted. Dirty reads, nonrepeatable reads, and phantom reads are all possible.
• Read Committed. Dirty reads are not possible. Nonrepeatable reads and phantom reads are both
possible.
• Repeatable Read. Dirty reads and nonrepeatable reads are not possible, but phantom reads are
possible.
• Serializable. Transactions are serializable. Dirty reads, nonrepeatable reads, and phantom reads are
not possible.
• Versioning. Transactions are serializable, but have a higher concurrency than with Serializable.
Versioning is implemented using a non-locking protocol.
• Auto commit. Each write is committed separately. Do not use this in conjunction with transaction
groups
1 - 15
ODBC Stage - Extracting Data
• Add table definitions before you can incorporate them
in ODBC stage
• You can extract multiple tables from the same ODBC
stage by specifying joins in the selection clause
• You can have user defined SQL or used stored
procedures to extract data
• The columns loaded can be from multiple tables, but
column names have to be unique
• You can preview data
• All the columns loaded in the stage are owned by the
link which connects it to the next stage.

1 - 16
ODBC Stage - Input Properties (Load)
•The input “stage” properties remain the same as the output properties.
•You choose the table name if the table definition has been imported before or create table in the
target ODBC
•Choose the appropriate Update Action required.

Update action
Specifies how the data is written. Choose the option you want from the drop-down list box:
•Clear the table, then insert rows. Deletes the contents of the table and adds the new rows.
•Insert rows without clearing. Inserts the new rows in the table.
•Replace existing rows completely. Deletes the existing rows, then adds the new rows to the table.
•Update existing rows only. Updates the existing data rows. If a row with the supplied key does not exist in the table then the table is not updated but
a warning is logged.
•Insert new or update existing rows. New rows are added or, if the insert fails, the existing rows are updated.
•Update existing or insert new rows. The existing data rows are updated or, if this fails, new rows are added.
•Call stored procedure. The data is written using a stored procedure. When you select this option, the Procedure name field appears.
•User-defined SQL. The data is written using a user-defined SQL statement. When you select this option, the View SQL tab is replaced by the Enter
SQL tab.

1 - 17
ODBC Stage -Input Properties Columns

Right
click

• Columns can be defined by adding a new


row on the columns menu and editing the
row contents
• You can Specify data types keys etc
• If the target table is already defined and
imported you can load the table definition
as you did before.
•Column definitions can also be used from
the incoming link to this stage.

1 - 18
ODBC Stage - Create Table Option
•You can define the ddl for table creation or it can be generated by the tool .
•You can choose to drop table before creating if necessary

1 - 19
ODBC Stage - Transaction Handling
•Isolation levels are chosen as discussed before.
• Rows Per Transaction: This is the number of rows which are
written before the data is committed to the data table. The default
value is 0 signifying that al the rows are written before data is
committed to the table
• Parameter Array Size: this is the number of rows written at a
time. Default is 1signifying that each row is written in a separate
operation . This should be changed/tuned depending upon the
storage space required.
Note: if the above two settings conflict the Rows per transaction
takes precedence.
1 - 20
ODBC Stage - Recap
• Define the connection
• Define/import the tables and the table
metadata.
• Choose one of the following to read or
write data
– User Defined
– Generated
– Stored procedure

1 - 21
Sequential File Stage
• Used to extract/read data or write data to a Flat file.
• The text file can be read/created from any file on the server or any drive
mapped on the server.
• Can have any number of outputs and inputs

1 - 22
Sequential File Stage - Defining Input Data
When you write data to a sequential file stage it has an input link. The properties are defined on the inputs page of the stage

File name

Update Action

Note: The file has to imported in the repository before using it in this stage. The procedure is similar
to importing ODBC table definitions
Backup existing file creates a back up before overwriting or appending to it. This can be used if a
job is reset during execution
1 - 23
Sequential File Stage - Input Data Format
Format contains parameters that define the format of data in the file

Fixed width Column: If this is chosen the data is


written in fixed width columns and the width is
calculated according to the “Display” in column
definition
First line is column name:
Omit last new line: Omits the last new line
character in the file
Flush after every row: Used for named pipes only.
Records between reader and writer are handled one
record at a time
Delimiter:Specified if fixed width is not specified.
Quote Characters:
Default Padding
Default Null String

1 - 24
Sequential File Stage - Output Data Format
The general page and the columns page remains the same for both read and write
of data
Additional Properties
Missing Column Action: Allows
you to specify the action for columns
missing in the data being read.
Options are
• Pad with Null
•Pad with empty string
•Map empty String
Suppress for Truncation warnings:
This is used if you are reading only
three columns from a file which has
more than three columns defined.
1 - 25
Aggregator Stage
Aggregator Stage classifies data rows from single input link into
groups and computes totals or other aggregate functions. The
sum/totals are output from the stage.You can only have on input
link but many output links if the data has to be grouped in different
ways.
Aggregator is an “Active Stage”
Before/After Routines: These
are routines which have been
included in the repository and
can be executed before or after
the stage.
E.g.: run a dos command or
send email.

1 - 26
Aggregator Input Data
Data to be aggregated is passed from a previous stage link and that is used as input for the
aggregator stage.The columns passing out of the previous stage are reflected on input
columns tab.
All the column header except Sort and Sort order are discussed previously

Sort:Aggregator stage performs the


grouping of columns in memory. If the
input data coming from an ODBC or
Sequential file stage is sorted ( using order
by clause), there are performance
efficiencies which can be gained.
This column defines the sort column
position.
E.g. order by PRODUCT_ID,DATE_ID
Here the sort key for product column is 1
and for the date column is 2
Sort Order: Ascending( default)/Descending

1 - 27
Aggregator Output Data

Derivation editor

Note:Each column in the


input stage must either be
grouped or summarized.

The standard columns for data type appear with two changes:
Group: Specifies whether a give column is used in the group by clause.
Derivation: Each input column needs to have a derivation for this stage.This will be
discussed in the next slide.
1 - 28
Aggregator Stage - Derivation Editor
Iteratively choose each column and define
whether to use a group by or use some
aggregation function from the drop down list

Click if you
want to group
by a column

Note: The functions available change according the underlying column data type.

1 - 29
Transformation Stage
• Handles extracted data to perform conversions and pass data to another transformation
stage or other DataStage stage
• Can have any number inputs or outputs.
• Has one primary data link and many reference links
Double click the
stage to open the
transformation editor

Link
Area

Metadata
Area

1 - 30
Transformation Editor
• Link Area: Is where all the column definitions.expressions/stage variables and constraints
are defined.The link area is divided in two parts the left pane shows you the input links and the
right pane shows the output columns as well as their derivations,constraints and stage
variables. Each of them will be explained in detail.
• Metadata Area: This shows the column metadata again split in two panes one for the input
columns and one for the output columns. Here you can edit the columns as u seem necessary
for their properties.
• Shortcut Area:

1 - 31
Transformer Concepts
When you use the transformation stage you have already defined what input data has to be
passed and changed. The input columns are transformed in the transformation editor on
the output pane for each column chosen.
Input Links: This stage has one primary link which is used to get bulk of the data to be
transformed.It can also use many other links as reference links which will affect how the
data is to be transferred. (Look Up Activity)
Reference Links:You can define key expressions for columns on the input columns on the
reference links. The most common expression is specifying an equii-join between the
primary table and reference table
Primary Link

Reference Link

Key Expression Linking two tables using a


join

1 - 32
Transformation - Key Expressions
Key expressions are used to affect how the data flowing thru the stage is affected. They
are defined for reference link “key” columns. A column has to be defined as a key
column by changing that field to yes in the reference link metadata area

CAT_ID is changed to a key column and then its key expression is defined in the link area

• A Join is specified between the primary link


column CATEGORY_ID and the reference
link key column CAT_ID
• The key expressions are validated by the
expression editor. If they are incorrect they
appear as red denoting error.

1 - 33
Transformation - Output Column Expressions
What can you do with the transformation stage?

Add delete Columns: You can add new columns by right clicking the output pane in the Link
area and then defining the column metadata in the metadata area. The new column derivations
can be custom strings/derivations/concatenations/standard functions applied on input columns
or stage variables. We will discuss them shortly.
Move/edit Columns: In the output pane in the link area you can move the columns up and
down by dragging. Also in the metadata area you can edit the column definitions as per
requirements.
Before/After Routines: This is an active stage hence you can specify routines to be executed
by passing suitable input values as discussed before.
1 - 34
Transformation Stage - Expression Editor
Expression editor allows you to define expression for columns which have to be modified
in the transformer stage. What can you do with the expression editor?

Where can you open/use the expression editor?

Entering Expressions: Whenever the insertion point is an expression box you can right
click to open the suggested operand menu. This allows you a rich support of defining
various elements and using predefine transformations/routines and functions. Also allows
you to define expressions based on stage variables.
Continued…

1 - 35
Transformation Stage -Expression Editor
Suggested Operand menu for Transformer stage
DS Macro: They are provided with DataStage and require no arguments you can select them from a
drop down list
DS Functions:Provides a list of available DataStage functions which can be used.
DS constant: A list of predefined DataStage constants

1 - 36
Transformation Stage -Expression Editor
Suggested Operand menu for Transformer stage
DS Routines: They are defined in the repository. You can use your own routines or use the ones
provided by DataStage.
DS Transforms:Provides a list of available DataStage transforms. Replace the %Arg% with real
arguments required by that particular transformation in the required format. They can be viewed in the
Manager.

Please refer the manual for detailed description of each


transform/routines
1 - 37
Transformation Variables
You can define local variables for a given transformation stage. These can be
used to define key expressions or the output column derivations.They can be also
uses as follows

Stage variables are shown in the output pane of the transformations stage.

1 - 38
Transformation Constraints
You can define limits for output data by specifying constraints in the transformation stage.
Constraints are BASIC expression which can be defined for each output link of a
transformer stage. You can also specify reject links for rows which haven’t been written to
any output stage. To define constraints do the following

Similar dialogue box appears.

Specify the constraints expression as desired.


This can be used while writing data for partitioned tables on certain keys. ( E.g. Key_ID>
10) or any other filtering criterion desired.
1 - 39
Folder Stage
• Folder stages are used to read / write data as files in a directory located on a DataStage
Server
• Folder stage can read multiple files from a single directory and can deliver the files to the
job as rows on an output link

Plug-in Stage
• Plugin stages are mostly used for access of external programs that can be leveraged by
the DataStage applications.
• DataStage supports a number of plug-in stages to perform special tasks. They have to be
installed separately on the server and have to be configured before use.
• After installing these stages they have to be registered on the server system before being
used.
E.g.: ORABULK: bulk loading of data in Oracle database.
BCPLOAD: Bulk load for MS SQL Server.
Each plug-in stage has its own configurable properties which can be checked from the
manuals. Please refer them if you plan to use it.
1 - 40

You might also like