14 Good Design Tips in Datastage 8.0.1
14 Good Design Tips in Datastage 8.0.1
14 Good Design Tips in Datastage 8.0.1
1
------------------------------------------------------------------------------------------------------------------------------
1) When you need to run the same sequence of jobs again and again, better create a
sequencer with all the jobs that you need to run. Running this sequencer will run all the
jobs. You can provide the sequence as per your requirement.
2) If you are using a copy or a filter stage either immediately after or immediately before
a transformer stage, you are reducing the efficiency by using more stages because a
transformer does the job of both copy stage as well as a filter stage
3) Use Sort stages instead of Remove duplicate stages. Sort stage has got more
grouping options and sort indicator options.
5) Make use of Modify, Filter, and Aggregation, Col. Generator etc stages instead of
Transformer stage only if the anticipated volumes are high and performance becomes a
problem. Otherwise use Transformer. It is very easy to code a transformer than a modify
stage.
6)Avoid propagation of unnecessary metadata between the stages. Use Modify stage
and drop the metadata. Modify stage will drop the metadata only when explicitly
specified using DROP clause.
7)Add reject files wherever you need reprocessing of rejected records or you think
considerable data loss may happen. Try to keep reject file at least at Sequential file
stages and writing to Database stages.
8)Make use of Order By clause when a DB stage is being used in join. The intention is to
make use of Database power for sorting instead of Data Stage resources. Keep the join
partitioning as Auto. Indicate don’t sort option between DB stage and join stage using
sort stage when using order by clause.
9)While doing Outer joins, you can make use of Dummy variables for just Null checking
instead of fetching an explicit column from table.
10)Data Partitioning is very important part of Parallel job design. It’s always advisable to
have the data partitioning as ‘Auto’ unless you are comfortable with partitioning, since all
Data Stage stages are designed to perform in the required way with Auto partitioning.
11) Do remember that Modify drops the Metadata only when it is explicitly asked to do
so using KEEP/DROP clauses.
12) Range Look-up: Range Look-up is equivalent to the operator between. Lookup
against a range of values was difficult to implement in previous Data Stage versions. By
having this functionality in the lookup stage, comparing a source column to a range of
two lookup columns or a lookup column to a range of two source columns can be easily
implemented.
13) Use a Copy stage to dump out data to intermediate peek stages or sequential debug
files. Copy stages get removed during compile time so they do not increase overhead
14)Where you are using a Copy stage with a single input and a single output, you should
ensure that you set the Force property in the stage editor TRUE. This prevents
DataStage from deciding that the Copy operation is superfluous and optimizing it out of
the job.
6)Avoid propagation of unnecessary metadata between the stages. Use Modify stage and drop
the metadata. Modify stage will drop the metadata only when explicitly specified using DROP
clause.
This is not strictly true as there are many ways to achieve this, although Modify stage is the
recommended medium.
For example, you can use a dataset stage and only select columns you want to use. Such actions
should be annotated clearly.
You should identify columns not required and drop them at the earliest opportunity. You can drop
columns as part of the output in most stages. Ideally, local coding standards should make the
method consistent. As long as it is clearly indicated on the canvas or in the job description of
course ;)
10)Data Partitioning is very important part of Parallel job design. It’s always advisable to have
the data partitioning as ‘Auto’ unless you are comfortable with partitioning, since all Data Stage
stages are designed to perform in the required way with Auto partitioning.
I disagree strongly with your use of 'auto' as default. Unless you properly partition, you will get
unexpected and possibly incorrect results, especially if you are running multi-node. You have to
understand partitioning and utilise it correctly, especially when performing joining or lookup
operations.
DR
Palgun | May 28
Daren,
However you may not have understood my perspective in Point No 6. Please be noted that
though you drop columns in the Output tab from other stages(other than Modify), these
columns will be carried over thru-out the job at background. This can be proved if u use RCP.
In case you want to completely drop the columns, u need to drop them using Modify stage.
Sometimes you may need to read several columns and after a bit of processing theese
columns may be useless to be carried over further. These cases give the best use of a Modify
stage. Also note that NO OTHER STAGE drops the metadata at rutime.
Palgun | Jul 3
@Saurocks
You need to understnd that there a few stages in Datastage that have nil or minimal memory
usage and shuld be used in top priority whn it comes to 'performance'. Modify and Copy
stages are examples of this.
Furthermore, there is no other stage in Datastge which drops the columns at runtime, except
ofcourse the Modify stage. You may try to not send a few columns into the output tab of a
stage Ex Transformer stage, but those columns get propagated thruout the job, at runtime.
This can be checked using the "Runtime column propagation" option. However when you use
the Modify's Drop option, the columns get dropped all the way!
In this way, Modify is quite unique. Also the type conversions work in a faster way this stage
Arash | Aug 31
Hi all,
thank to Palgun for this blog and to all for your posts,
I have 20.000.000 of records in a job which use more than 20 range look-up and 2 of
them are not using dataset as reference(they use odbc source as reference)
my job take 55 min and now I m adding sort and RMV(which take 20 min separately [I
tested it in separate job])
Thank you
Best regards
Arash
Palgun | Aug 31
Dear Arash,
1) If the reference link in a lookup has very huge no. of records, better replace that Lookup
with a Join.
2) Are you using that copy stage just to sort data? If yes. better use a Sort stage itself and sort
the data and then pass them on to the Rdup stage. That is the best possible performance
tuning you can do.
Regards,
Palgun
Arash | Sep 21
Hi Palgun,
Thanks for your replay, but even if I use JOIN instead of lookup, it will be difficult to do
the
range!!
I mean range become complicate when you use a join :(
Regards
Arash
Reading Multiple files through Parallel jobs
Hi,
I want to read multiple files from a single directory and i have to do this through Parallel jobs.I'm
using DataStage version 7.5.2. Can any one help me in this regard.Your quick reply will be
highly appreciated.
Shiva Ram
Replied Oct 8, 2007
Hi,
Have the mode, file pattern in Sequential File Stage and grab a key word
which is common in all files, for eg., if you have files,
UNITED_STATES.txt, STATES_UNITED.txt, UNITED_STATES_OF_AMERICA.txt, if you
set the mode to file pattern and set the pattern as 'STATES' , all of these
three files will be read by one same sequential file stage.
Thanks
0
vasu koti
Replied Oct 9, 2007
HI
In the sequential stage format tab properties you need to set multiple file
s option
file1=path\abc.txt
file2=path\def.csv
with cheers
Vasudev> From: email@removed To: email@removed Dat
e: Mon, 8 Oct 2007 15:49:53 +0000> Subject: [ascential-l] Reading Multiple
files through Parallel jobs> > > > Hi,> > I want to read multiple files from a single
directory and i have to do this through Parallel jobs.I'm using DataStage v
ersion 7.5.2. Can any one help me in this regard.Your quick reply will be h
ighly appreciated.> > Thanks> Haris> > > > > >
how you will extract multiple files using sequential file with out using file pattern
Related
6 Replies
0
Probal Banerjee
Replied Aug 9
using loop.
probal1u
0
vasudev koti
Replied Aug 9
Hi Kiran Kumar
You can read multiple files with sequential file Stage. In the sequential file stage you can assign
properties you need to specify the
eg:
File=pathname of the file1
File=pathname of the .file2.
File=path name of the .file3
?n
being read from (repeat this for reading multiple files).
It executes in parallel mode.
karthik reddy
Replied Aug 9
Hi
i have multiple files like file1 and file2...etc
Thanks to advance
0
Deepak
Replied Aug 9
@ Kiran
Other option.
You can simply use "cat file1 file2 file3"
if you know the file name.
And this can be used in Filter option.
Or you can use Multiple instance job too. Pass the file name as Instance id.
White Papers and Webcasts
Popular
Related
• Closing the data privacy gap: Protecting sensitive data in non-production environments
Deepak
Replied Aug 9
@ Karthik
You cannot handle multiple delimiters in the same stage.
You have to pre process the files, to unify the delimiters.
Simple tr command should work out for that reason.
0
Shin Takakura
Replied Aug 9
If you don't know which file is using which delimiter, then you might need
some kind of pre-processor to convert all tabs to comma or vice versa in
the file to make the logic simple.
Best Regards,
Shin Takakura
How to handle null by modify stage?
Hi, I try to handle null warning by using modify stage but it's still warning log.
My data type >> ID (integer), name(varchar(20)
I put this in modify stage
>> ID:int32 = handle_null(ID, 98);name:string = handle_null(name, "NA");
Program run finish with warning like this :
Sequential_File_6: When checking operator: When validating export schema: At field "name":
Exporting nullable field without null handling properties
Sequential_File_6: When checking operator: When validating export schema: At field "ID":
Exporting nullable field without null handling properties
Please help clarify this , How can i reject warning?
Thank you in advance,
2 Replies
0
tigerpeng
Replied Oct 14, 2009
Renjie Luo
Replied Oct 19, 2009