DataStage FAQs
1. Main differences between DataStage 7.5.2 and 8.0.1
A. In DataStage 7.5.2 we have the Manager as a client. In 8.0.1 there is no separate Manager client; the Manager functionality is embedded into the Designer client.
B. In 7.5.2 QualityStage has a separate designer. In 8.0.1 QualityStage is integrated into the Designer.
C. In 7.5.2 code and metadata are stored in a file-based system. In 8.0.1 code is stored in a file-based system whereas metadata is stored in a database.
D. In 7.5.2 we required operating system authentication. In 8.0.1 we require both operating system authentication and DataStage authentication.
E. In 7.5.2 we don't have range lookup. In 8.0.1 we have range lookup.
F. In 7.5.2 a single Join stage can't support multiple references. In 8.0.1 a single Join stage can support multiple references.
G. In 7.5.2, when a developer opens a particular job, another developer cannot open the same job. In 8.0.1, when a developer has a job open and another developer wants to open the same job, it can be opened as a read-only job.
H. In 8.0.1 a compare utility is available to compare two jobs, e.g. one in development and one in production. In 7.5.2 this is not possible.
I. In 8.0.1 quick find and advanced find features are available; in 7.5.2 they are not.
J. In 7.5.2, when a job is run the first time, surrogate keys are generated from the initial value to n. When the same job is compiled and run again, the surrogate key is again generated from the initial value to n; automatic increment of the surrogate key is not available in 7.5.2. In 8.0.1 the surrogate key is incremented automatically; a state file is used to store the maximum value of the surrogate key.
1. What is Data Modeling and what are its types?
a) Logical Modeling: logical modeling deals with gathering the business requirements and converting them into a model.
b) Physical Modeling: physical modeling involves the actual design of a database according to the requirements established during logical modeling.
2) Dimensional Modeling is divided into 2 types:
a) Star Schema - simple and denormalized form; much faster compared to snowflake.
b) Snowflake Schema - complex with more granularity; more normalized form; slower.
2. What is a Surrogate Key?
A Surrogate Key is the Primary Key in a Dimension table. It is independent of the underlying database, i.e. the Surrogate Key is not affected by changes going on in the source database.
3. Differentiate between Database data and Data warehouse data.
Data in a Database is for OLTP: a) detailed or transactional, b) both readable and writable, c) current, d) volatile.
Data in a Data Warehouse is for OLAP: a) used for analysis and BI, b) read only, c) historical, d) non-volatile.
4. What is the flow of loading data into fact and dimension tables?
First Data should be loaded into Dimension tables where the surrogate keys are generated and then to Fact tables. The surrogate keys are referenced as foreign keys in Fact tables.
5. What is Orchestrate?
Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then purchased Orchestrate and integrated it with DataStage XE, releasing a new version, DataStage 6.0, i.e. Parallel Extender.
6. What are Stage Variables, Constraints and Derivations?
Stage Variable - an intermediate processing variable that retains its value during a read and does not pass the value to a target column (useful for comparisons and if-else logic). It can also hold a computed result that is reused in multiple columns of the target table(s).
Constraint - a filter condition which limits the number of records coming from the input according to a business rule.
Derivation - an expression that specifies the value to be passed to the target column.
7. What are Normal, Sparse and Range lookups?
Normal Lookup: whenever DataStage wants to look up, it places the reference table data into a memory buffer. It is used when the reference table or file contains little data.
Sparse Lookup: for each lookup a SQL query is fired against the database instead of buffering the data. Use a sparse lookup when the reference table is much larger than the source and the reference is a database of any type.
Range Lookup: this lets you search records based on a particular range; only the records in that range are searched instead of the entire record set, which gives better performance.
8. Explain Error Handling in DataStage and best practices.
In a DataStage sequence there is an "Exception Handler" activity. When you call your jobs from a DataStage sequence you should do the following. Step 1: go to the job properties of the master sequence and check the checkboxes "Add checkpoints so sequence is restartable on failure" and "Automatically handle activities that fail". Step 2: in your sequence use an Exception Handler activity; after it you may include an email notification activity. If a job fails, control passes to the Exception Handler activity and an email is sent notifying the user that the sequence has failed.
9. What are the different types of links in DataStage?
There are 3 different links in DataStage: 1. Stream link - a straight link. 2. Reference link - acts like a lookup. 3. Reject link - used to capture rejected rows in parallel jobs.
10. How to use an Excel file as input in DataStage?
You can use an Excel file as input by importing the .xls file. Step 1: go to Administrative Tools -> Data Sources (ODBC) -> System DSN, click the Add button and configure the corresponding .xls file as a System DSN. Make sure that the workbook contains the name of your Excel sheet. Step 2: import the Excel file into DataStage as an ODBC table definition. Step 3: use an ODBC stage as the input stage. You should then be able to use the Excel file effectively.
11. What is the default cache size? How do you change the cache size if needed?
The default cache size is 128 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.
12. Differentiate Primary Key and Partition Key?
A primary key is the key we define on a table column or set of columns (composite PK) to make sure that all the rows in those columns are unique. A partition key is the key we use while partitioning a table (in the database) for processing the source records in ETL; we should define the partition based on the stages (in DataStage) or transformations (in Informatica) used in the job or mapping. We partition the data to improve the target load process.
13. How to remove locked jobs using DataStage? Go to Director -> Tools -> Clear the Job Resources option; there you find the PID number. Select that PID and click Logout; the job gets released.
14.
How do you execute DataStage job from UNIX command line prompt?
/opt/Ascential/DataStage/DSEngine/bin/dsjob -server $ServerName \
    -user $User \
    -password $ServerPassword \
    -run -wait -mode NORMAL \
    -param FmtDirectory=$FMT_PATH \
    -param ReFmtDirectory=$REFMT_PATH \
    -param Twin=$TWIN \
    -param Group=$GROUP \
    -param PrcsDte=$PROCDTE \
    -param Cmelnd=$CME_IND \
    -param Mndlnd=$MND_IND \
    IDL $DSSEQUENCE.${APPLCDE}
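Once the run completes, the same dsjob utility can be used to check the finishing status and summarize the log; a small follow-up sketch (project and job names are placeholders):

    # Report the last run status and summarize the log entries for the job
    /opt/Ascential/DataStage/DSEngine/bin/dsjob -jobinfo MyProject MyJob
    /opt/Ascential/DataStage/DSEngine/bin/dsjob -logsum MyProject MyJob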
15. What are the types of Hashed files?
Hashed files are classified into 2 types: a) Static - subdivided into 17 types based on the primary key pattern. b) Dynamic - the default hashed file is dynamic ("Type 30 Random"). Three files are created when we create a hashed file: .data, .type and .over.
16. How to call a Stored Procedure which is in SQL Server database in DataStage job?
In the ODBC stage properties, click on Outputs -> General -> select Stored Procedure and browse to get the stored procedure. We can also use a Stored Procedure stage while designing parallel jobs.
17. Explain what SUMMARY TABLES are and their use.
Summary tables contain summarized or "aggregated" data to tune up query performance. Example: suppose we have a table with the columns a) Medicine_Name, b) Sell, c) Time, and the business requirement is to get the sales of a medicine on a monthly basis. If a query is fired against this table, it has to aggregate the sales to get the monthly figure each time. Instead, if a summary table is created containing the monthly sales records, the query cost decreases because the query reads the aggregated data directly from the summary table. In this scenario the summary table contains the columns a) Medicine_Name, b) Sell, c) Month, so for all the days of a month only one aggregated record appears, i.e. one row per month containing the aggregated data. This increases the performance of the query.
19. Where the DataStage hash files are stored when they are created?
There are two ways of specifying where the Hash Files will be created. a) Account: This is generally the project name. If this is selected the Hash file will be generated in the hash file path specific to the project/Account. b) Folder Location: If this option is selected the Hash file will be generated in the directory specified by the user. 20. How do you create a sequencer in DataStage parallel extender? Sequencer is an activity in the DataStage sequence which will run the output job connected to it in the sequence based on the input job status (ALL/ANY). Steps to create a sequence:
a) Go to Menu. In the Menu select File->New. b) In the window that appears select "Job Sequence". The other way of running the jobs in a sequence is using "Job Control" options in DataStage. For this a) Open a parallel job. b) Go to Job Properties-> Job Control Here you can select the jobs which should run before these parallel jobs. 21. How and where do we use or integrate Java with DataStage? Java can be integrated into DataStage using Java Pack. There are two stages which come in java pack a) Java Client: It is used for Sources or targets. b) Java Transformer: It is used for processing row by row. Both stages allow using the Java Classes in your DataStage jobs. 22. Where does director create its log files? The log files are stored in server. At the client side (windows) the log files are not stored. The logs are maintained in RT_Logn files on DataStage server. Deleting logs: You can set Auto purge of job logs from DataStage administrator to delete the log files automatically.
26. How can you see and delete a dataset in UNIX and Windows? You can use orchadmin dump <dataset>.ds to see the data in the dataset in UNIX, and orchadmin <delete | del | rm> <dataset>.ds to delete the dataset. In DataStage, delete it using the Dataset Management option in Designer or Director.
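As a hedged example of the commands above (the configuration file path and dataset name are placeholders, and option spellings can vary slightly by version):

    # orchadmin needs the same configuration file the job used
    export APT_CONFIG_FILE=/opt/Ascential/DataStage/Configurations/default.apt
    orchadmin describe /data/ds/customers.ds   # show the dataset header/schema information
    orchadmin dump /data/ds/customers.ds       # print the records held in the dataset
    orchadmin rm /data/ds/customers.ds         # remove the data files and the descriptor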
28. When do you use a separate Sort stage versus the inbuilt sort option within a stage? What is the difference?
29. How do you check the number of nodes while running a job in UNIX? Check the configuration file pointed to by the APT_CONFIG_FILE environment variable; the nodes defined there are the ones used at run time.
30. A Transformer stage is running in parallel. How can we run it in sequential mode? In the Transformer properties -> Advanced -> Execution mode you can select the default, sequential or parallel mode of execution.
31. A sequential file has 4 duplicate records. How do you get the 3rd one?
32. How can a server job be used in a PX job? Through server shared containers; we can select the stages and define the functionality using them.
33. How can you run a job in sequential mode? By specifying 1 node in the configuration file pointed to by APT_CONFIG_FILE.
34. If a job compiles with "job completed successfully" but the Director log page shows warning messages, what do we do? Generally we can ignore the warning messages shown in Director. If you want to abort the job based on the number of warning messages, you can specify that limit in DataStage Director -> Tools -> Options -> Limits tab -> Rows/Warnings.
35. Explain the difference between Data Sets and File Sets? Where each one is used and
how they are used? What is the size limit? The dataset and file set both are file stages.
A dataset can have a single input link or a single output link, whereas a file set can have a single input link or a single output link plus a single reject link. A file set can execute only in parallel mode, whereas a dataset can be configured to execute either in parallel or in sequential mode. The default size limit for both is 2 GB. 36. How do you implement this scenario: add 10 days to a particular date in DataStage PX? Extract the parts of the date using string functions, add the required number of days, reassemble the parts into a single string and convert the string back into a date object. (In a parallel Transformer the date function DateFromDaysSince can also be used to add a number of days to a date.)
38. Explain the difference between server transformer and parallel transformer? Or what
is the difference between a BASIC Transformer and a Normal Transformer? When do we go for a BASIC or a Normal Transformer? The main difference is that the server Transformer supports BASIC transforms only, while the parallel canvas supports both BASIC and parallel transforms. The server transformer is BASIC language based; the parallel transformer is C++ based. Transformer stage in server jobs vs Transformer stage in parallel jobs: 1. In server jobs the Transformer supports a single input stream link and multiple input reference links, whereas in parallel jobs it supports only a single input stream link and no reference links. 2. In server jobs lookups are done in the Transformer stage, whereas parallel jobs have a separate Lookup stage for this. 3. The parallel Transformer can perform surrogate key generation as built-in functionality, whereas no such functionality is available in the server Transformer stage. Difference: 1. A BASIC Transformer compiles to BASIC whereas a Normal Transformer compiles to C++. 2. A BASIC Transformer does not run on multiple nodes whereas a Normal Transformer can run on multiple nodes, giving better performance. 3. A BASIC Transformer takes less time to compile than the Normal Transformer. Usage: 1. BASIC Transformers should be used in server jobs.
2. Normal Transformer should be used in Parallel jobs as it will run on multiple nodes here giving better performance.
39. Where do you specify nodes for DataStage jobs? In DataStage Manager -> Tools -> Configurations. A default configuration file named "default.apt" exists there; we can add any number of nodes to it or create a new configuration file.
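A minimal sketch of a one-node configuration file, written from the UNIX shell and selected through APT_CONFIG_FILE (host name and paths are placeholders):

    # Create a single-node configuration file and point the environment at it
    cat > /tmp/one_node.apt <<'EOF'
    {
        node "node1"
        {
            fastname "etlserver"
            pools ""
            resource disk "/data/ds/datasets" {pools ""}
            resource scratchdisk "/data/ds/scratch" {pools ""}
        }
    }
    EOF
    export APT_CONFIG_FILE=/tmp/one_node.apt

Running a job against such a one-node file is also the simplest way to force sequential execution, as mentioned in question 33.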
40. How to retrieve hash file data from administrator command? Open the DataStage administrator. Select your project. On the right hand side you will see "Command" button. Click on it. You will get a command interface window. Here you can write select queries to retrieve data from a hash file. For example: Let's say you have created a Hash file say "TestHashFile". To retrieve the contents you will have to fire following query: select * from TestHashFile; Note: In the Hashed file stage you will have to select the option "Use Account Name" to create a hash file. 41. What is the difference between Filter and External Filter stages? Filter stage supports only SQL statements in where property where as External Filter supports UNIX commands in Filter command property.
42. How do you enable runtime column propagation in DataStage? Where does this option exist? There is an option in DataStage Administrator -> Projects tab -> Properties -> General tab -> "Enable runtime column propagation for parallel jobs". If you enable this, you can select runtime propagation to specify that columns encountered by a stage in a parallel job can be used even if they are not defined explicitly in the metadata. You can see the runtime propagation option in most of the active stages on the Output tab (if it exists) in the Columns sub-tab.
43. How can we filter in the Sequential File stage? How do you remove duplicates in a flat file? There is a Filter command option in the link properties tab where we can specify UNIX commands; use a Remove Duplicates stage to remove the duplicates.
44. How do you remove duplicate records from a file and capture the rejected records in a separate file? Sort the records on the key field and, in the Sort stage, set "Create Key Change Column" to true. Zero is then assigned to the duplicate records, so put a condition on the key-change value of zero and send those rows to the reject link.
45. What is the difference between a Filter and a Switch stage? A Switch stage can have a maximum of 128 output links; a Filter stage can have any number of output links.
46. How can I handle the before/after job subroutine in the transformer stage?
In the Transformer stage click on the Stage Properties button in the toolbar (left corner); it displays a stage properties page like the other job properties pages, where you can specify before/after-stage subroutines.
47. How do you load the top 3 marks (column name) from a sequential file to the target? First sort the data on the Marks column using a Sort stage, then use a Transformer stage; in the constraints use the system variable @INROWNUM (@INROWNUM <= 3). In this way you get the top 3 marks. (OR) This can be achieved with Sort and External Filter stages: first sort the data in descending/ascending order, then use an External Filter stage and give the UNIX command head -3 in the box.
48. When do you use the Dataset stage / Lookup File Set stage / Sequential File stage in a parallel job? What is the significance of these files and what are their differences?
49. What is the difference between job control and a sequencer? Job control just inserts compilation code; whenever you add a job it places executable code there, with no diagrammatic representation. A sequence is a diagrammatic representation of activities where we can connect different types of objects through activities; there is no interaction with executable code. A sequence can apply control to jobs, i.e. we can run a job depending on a condition, and we can run, schedule and monitor it like a normal job.
50. How to run a Shell Script within the scope of a DataStage job?
By using "ExcecSH" command at Before/After job properties. Select the EDIT tab in the toolbar ---> choose job properties ---> Select the job parameters ---> choose the before/ after routines ---> select the EXCESH Command. In the Sequence, we have an ExcecSH stage, through which we can run the shell scripts by providing the path and parameters for the script.
Link Partitioner: it receives data on a single input link and diverts the data to a maximum of 64 output links; the data is processed by the same stage and has the same metadata. Link Collector: it collects the data from up to 64 input links, merges it into a single data flow and loads it into the target. Both are active stages, and the design and mode of execution of the server jobs using them have to be decided by the designer.
55. What are the types of views in DataStage Director?
There are 3 types of views in DataStage Director:
a) Status View - list of jobs, status, dates and time taken to run.
b) Schedule View - scheduling details of jobs.
c) Log View - the log for the last few runs of each job, with warning messages, error messages, informational messages, event messages and program-generated messages.
How do you eliminate duplicate rows?
1. DataStage provides a Remove Duplicates stage in Enterprise Edition; using that we can eliminate the duplicates based on a key column. We can also use the Sort stage (option: Allow Duplicates True/False).
2. Duplicates can be eliminated by loading the corresponding data into a hashed file, specifying the column(s) on which you want to eliminate duplicates as the key(s) of the hashed file.
3. SQL: delete from tablename a where rowid > (select min(rowid) from tablename b where a.key_values = b.key_values).
58. What are XML files and how to read data from XML files and what stage to be used?
We can use the XML Meta Data Importer to import the XML source definition and the XML Input stage to read the XML document. For each element of the XML we give an XPath expression in the XML Input stage. 59. Give the OS of the server and the OS of the client in your most recent project. The server is on UNIX and the client machine, i.e. the machine where you design the jobs, is on Windows XP.
60. What is Hash file stage and what is it used for? Used for Look-ups. It is like a reference table. It can also be used in-place of ODBC, OCI Tables for better performance. We can also use the Hash File stage to remove duplicate rows by specifying the hash key on a particular field or column.
How do you define a user-defined environment variable (for example a file path) and use it as a job parameter?
1. Go to DataStage Administrator -> Projects -> Properties -> Environment -> User Defined. Here you see a grid where you can enter the parameter name and the corresponding path of the file.
2. Go to the Stage tab of the job, select the NLS tab, click on "Use Job Parameter" and select the parameter name you entered above. The selected parameter name appears in the text box beside the "Use Job Parameter" button; copy the parameter name from the text box and use it in your job. Keep the project default in the text box.
63. How do you pass the parameter to a job sequence if the job is running at night?
There are two ways: 1. Set the default values of the parameters in the job sequence and map these parameters to the job. 2. Run the job in the sequence using the dsjob utility, where we can specify the values for each parameter.
64. How do you catch bad rows from OCI stage? We will place some conditions like 'where' inside the OCI stage and the rejected rows can be obtained.
65. What happens if the job fails at night? Job Sequence will abort and an error exception handling job will send a mail to the corresponding member. You can define a job sequence to send an email using SMTP activity if the job fails. OR Log the failure to a log file using DSlogfatal/DSLogEvent from the job control tab in the job properties.
66. What is the order of execution done internally in the transformer with the stage
editor having input links on the left hand side and output links on the right hand side?
1. First the input (left-hand side) tables/links are joined.
2. Then the stage variables are evaluated.
3. Then the constraints in the transformer.
4. Then the derivations on the output columns.
68. How will you call an external function or subroutine from DataStage? You can call external functions/subroutines by using the before/after stage/job subroutines: ExecSH if it is a UNIX machine, ExecDOS if it is a Windows machine. 69. How do you merge two files in DS?
Use the funnel stage to merge the two files in PE. Also if you want to add a file to an existing dataset/ Sequential file, then you can write to the same dataset/sequential with dataset/sequential file set to APPEND mode. 70. What are Static Hash files and Dynamic Hash files?
Dynamic Hash Files can automatically adjust their size - modulus (no. of groups) and separation (group size) based on the incoming data. Type 30 is default dynamic hash file being used. Static Hash files do not adjust their modulus automatically and are best when data is Pre known/static. Overflow groups are used when the data size is equal or greater than the specified Large Record size in dynamic HFs. Since Static HFs do not create hashing groups automatically, when the group cannot accommodate a row it goes to overflow. Overflows should be minimized as much as possible to optimize the performance.
74. Have you ever involved in updating the DS versions like DS 5.X, if so tell us some of
the steps you have taken in doing so?
Yes. The following are some of the steps I have taken in doing so: 1) Take a back up of the whole project(s) by exporting the project as a .dsx file. 2) See that you are using the same parent folder for the new version also for your old jobs using the hard-coded file path to work. 3) After installing the new version import the old project(s) and you have to compile them again. You can use 'Compile All' tool for this. 4) Make sure that all your DB DSN's are created with the same name as the old ones. This step is for moving DS from one machine to another. 5) In case if you are just upgrading your DB from Oracle 8i to Oracle 9i, then there is tool in DS CD that can do this for you. 6) Do not stop the 6.0 server before the upgrade version 7.0. Install process collects the project information during the upgrade. There is NO rework (recompilation of existing jobs/routines) needed after the upgrade. 75. What other ETL's you have worked with? ETL tools are 1) Informatica by Informatica Corporation 2) DataStage by IBM 3) Ab Initio by Ab Initio software Corporation 4) Data Junction by Pervasive software 5) Oracle Warehouse Builder by Oracle Corporation 6) MS SQL Server Integration by Microsoft 7) Decision Stream or Data Manager by Cognos. 76. What is DS Designer used for - did u use it?
1. Design the server jobs and parallel jobs.
2. Create the local and shared containers.
3. Design the sequences.
4. Compile (single and multiple), validate and run the jobs.
77. What is DS Director used for - did u use it? DataStage Director is used to validate, run, monitor and debug the warnings & errors by viewing the log file and to schedule the Jobs. 78. What is DS Administrator used for - did u use it? The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales. It is primarily used to create/delete the DataStage project, assign the user roles to the project and set the parameters for the jobs at project level. Assigning the users to the project can also be done here.
79. What is DS Manager used for - did u use it? To manage the metadata, create/update routines, functions, export/import of jobs (.dsx or .xml) from Dev to Test and then to production environment.
80. Explain the differences between Oracle8i/9i?
1. Oracle 8i does not support the pseudo column SYSDATE, but 9i supports it. 2. In Oracle 8i we can create 256 columns in a table, but in 9i we can create up to 1000 columns. 3. 9i also offers improved multiprocessing and more dimensional modeling / data warehousing features.
81. What are Routines, where/how are they written, and have you written any routines before?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them using the Routine dialog box. The following program components are classified as routines:
Transform functions: functions you can use when defining custom transforms. DataStage has a number of built-in transform functions located in the Routines > Examples > Functions branch of the Repository; you can also define your own transform functions in the Routine dialog box.
Before/After subroutines: when designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines located in the Routines > Built-in > Before/After branch of the Repository; you can also define your own before/after subroutines using the Routine dialog box.
Custom UniVerse functions: specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box you can get DataStage to create a wrapper that enables you to call these functions from within DataStage; these functions are stored under the Routines branch in the Repository, and you specify the category when you create the routine.
Job control routines are also classified as routines. If NLS is enabled,
82. What are other Performance tunings you have done in your last project to increase
the performance of slowly running jobs?
Extraction side:
1. Staged the data coming from ODBC/OCI stages or any database on the server using hashed/sequential files for optimum performance; staging performs well because it avoids many trips to the database server.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' values for faster selects, inserts and updates; also used row buffering (in-process and inter-process, for server jobs) to improve performance.
3. Tuned the 'Tunables' in Administrator for each project; the default cache size for hashed files is 128 MB and can be increased to improve performance.
4. Sorted the data as much as possible in the database (using ORDER BY) and reduced the use of the DataStage sort for better job performance.
5. Worked with the DBA to create appropriate indexes on tables for better performance of DataStage queries.
6. Converted some of the complex joins/business logic in DataStage to stored procedures for faster execution of the jobs.
7. Use the power of RDBMS. 8. Try not to use a sort stage when you can use an ORDER BY clause in the database. 9. Using a constraint to filter records is much slower than performing a SELECT WHERE. Transformations Side:
87. What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job?
1. Under Windows: use the Wait For File activity in a sequence and then run the job; you can also schedule the sequence around the time the file is expected to arrive.
2. Under UNIX: poll for the file and, once the file has arrived, start the job or sequence (a polling sketch is shown after question 90 below).
88. Describe the string functions in DS.
Functions like [] -> the substring function and ':' -> the concatenation operator. Syntax: string[[start,]length], string[delimiter, instance, repeats].
89. How did you connect to DB2 in your last project? Using DB2 ODBC drivers, the DB2 plug-in stage or the Dynamic Relational Stage.
90. What are Sequencers? Sequencers are job control programs that execute other jobs based on the current job parameters. A Sequencer allows you to synchronize the control flow of multiple activities in a job sequence; it can have multiple input triggers as well as multiple output triggers. The Sequencer operates in two modes. ALL mode: all the inputs to the sequencer must be TRUE for any of the sequencer output triggers to fire. ANY mode: output triggers can fire if any of the sequencer inputs are TRUE.
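For the UNIX approach in question 87, a minimal polling sketch might look like the following shell script (file name, project, job and install paths are hypothetical):

    #!/bin/sh
    # Wait until the expected file arrives, then kick off the DataStage job
    FILE=/data/incoming/customer.dat
    until [ -f "$FILE" ]
    do
        sleep 60                      # check once a minute
    done
    /opt/Ascential/DataStage/DSEngine/bin/dsjob -run -wait -mode NORMAL MyProject LoadCustomer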
91. How would you call an external Java function which is not supported by DataStage?
Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function and write the return values from the Java program (if any) and use those files as a source in DataStage job. 92. What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
1. Use the crontab utility to invoke the dsjob command (or a shell script wrapping it) with the proper parameters passed (a sample crontab entry is sketched after question 93 below).
2. AUTOSYS: through Autosys you can automate the job by invoking the shell script written to schedule the DataStage jobs.
93. What is the difference between a routine, a transform and a function?
1. A routine describes the business logic, while a transform specifies the transformation of the data from one form to another using transformation rules.
2. By using routines we can return values, but transforms cannot return values.
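For the crontab approach in question 92, a sample entry might look like this (times, paths, project and job names are all placeholders):

    # m h dom mon dow  command  -  run the load sequence every day at 02:30
    30 2 * * * . /opt/Ascential/DataStage/DSEngine/dsenv; /opt/Ascential/DataStage/DSEngine/bin/dsjob -run -wait -mode NORMAL MyProject MasterSequence >> /tmp/MasterSequence.log 2>&1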
94. Tell me one situation in your last project, where you had faced a problem and how
you solved it? 1. The jobs in which data is read directly from ODBC/OCI stages are running extremely slow. I had to stage the data before sending to the transformer to make the jobs run faster. 2. The job aborts in the middle of loading some 500,000 rows. Have an option either cleaning/deleting the loaded data and then run the fixed job or run the job again from the row the job has aborted. To make sure the load is proper we opted for the former.
95. Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a
Truncate statement to the DB or a Delete statement? There is no TRUNCATE on ODBC stages; "Clear table" issues a DELETE FROM statement each time the job runs. In an OCI stage such as Oracle you do have both Clear and Truncate options; they are radically different in permissions (TRUNCATE requires ALTER TABLE permission whereas DELETE doesn't). 96. What are the command line functions that import and export DS jobs? A. dsimport.exe imports the DataStage components. B. dsexport.exe exports the DataStage components. Parameters: username, password, hostname, project name, current directory (C:/Ascential/DataStage7.5.1/dsexport.exe), filename (job name). 97. How do you rename all of the jobs to support your new file-naming conventions? Create an Excel spreadsheet with the new and old names. Export the whole project as a .dsx. Write a Perl/shell program which does a simple rename of the strings, looking them up in the Excel file. Then import the new .dsx file, probably into a new project for testing, and recompile all jobs. Be cautious: if the names of jobs have changed, your job control jobs or sequencer jobs must be updated too, so make the necessary changes to those sequences.
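A hedged sketch of the rename step in question 97, using sed instead of Perl (file and job names are hypothetical):

    # rename.sed holds one substitution per job, generated from the Excel sheet, e.g.
    #   s/OldCustomerLoad/NewCustomerLoad/g
    #   s/OldProductLoad/NewProductLoad/g
    sed -f rename.sed WholeProject.dsx > WholeProject_renamed.dsx
    # Import WholeProject_renamed.dsx into a test project and recompile all jobs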
98. Difference between Hashed file and Sequential file?
1. A hashed file stores data based on a hash algorithm and a key value; a sequential file is just a file with no key column.
2. A hashed file can be used as a reference for a lookup; a sequential file cannot.
3. A hashed file can be cached in DataStage memory (buffer); a sequential file cannot.
4. Duplicates can be removed in a hashed file by specifying a key on a column; this is not true for a sequential file.
5. There is a limit of 2 GB for both hashed files and sequential files.
99. How can we create Containers?
There are two types of containers:
1. Local Container: available to that particular job only.
2. Shared Container: can be used anywhere in the project.
Steps to create a Local/Shared Container:
Step 1: Select the stages required.
Step2:Edit--->Construct Container--->Local/Shared Shared containers are stored in the Shared Containers branch of the Tree Structure. They are of 2 types. 1. Server shared containers (Can also be used in Parallel jobs) 2. Parallel shared containers (Cannot be used in Server jobs). 100. What are the Job parameters? It is a runtime parameter which allows you to enter a value to restrict the data at the job run time. A job parameter is created with name, prompt, data type, default value and it is represented as follows #parameter name#.
101. How do you fix the error "OCI has fetched truncated data" in DataStage? This error occurs when Oracle Stage tries to fetch a column like 34.55676776... And actually its data type is decimal (10,2). The solution here is to either truncate or Round the data till 2 decimal positions.
102. How can we join one Oracle source and Sequential file? Join and look up stages can be used to join oracle and sequential file. 103. What about System variables?
DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE - the internal date when the program started (see the Date function).
@DAY - the day of the month extracted from the value in @DATE.
@FALSE - the compiler replaces the value with 0.
@FM - a field mark, Char(254).
@IM - an item mark, Char(255).
@INROWNUM - input row counter; for use in constraints and derivations in Transformer stages.
@OUTROWNUM - output row counter (per link); for use in derivations in Transformer stages.
@LOGNAME - the user login name.
@MONTH - the current month extracted from the value in @DATE.
@NULL - the null value.
@NULL.STR - the internal representation of the null value, Char(128).
@PATH - the pathname of the current DataStage project.
@SCHEMA - the schema name of the current DataStage project.
@SM - a sub-value mark (a delimiter used in UniVerse files), Char(252).
@SYSTEM.RETURN.CODE - status codes returned by system processes or commands.
@TIME - the internal time when the program started (see the Time function).
@TM - a text mark (a delimiter used in UniVerse files), Char(251).
@TRUE - the compiler replaces the value with 1.
@USERNO - the user number.
@VM - a value mark (a delimiter used in UniVerse files), Char(253).
@WHO - the name of the current DataStage project directory.
@YEAR - the current year extracted from @DATE.
104.
What's the difference between an operational data store (ODS) and a data warehouse?
An operational data store (or "ODS") is a database designed to integrate data from multiple sources to facilitate operations, analysis and reporting. Because the data originates
from multiple sources, the integration often involves cleaning, redundancy resolution and business rule enforcement. An ODS is usually designed to contain low level or atomic (indivisible) data such as transactions and prices as opposed to aggregated or summarized data such as net contributions. Aggregated data is usually stored in the Data warehouse. The data in ODS is volatile and the data which is nonvolatile, historical and time variant is DWH data.
105.
What's the difference between DataStage Developers and DataStage Designers? What skills are required for each?
A DataStage developer is the one who codes the jobs, whereas a DataStage designer works from the blueprints and designs the jobs, i.e. decides which stages are required to develop the code.
106.
107. In how many places can you call a routine?
There are four places you can call one:
(i) Transform of a routine: (a) date transformation, (b) upstring transformation
(ii) Transform of the Before & After subroutines
(iii) XML transformation
(iv) Web-based transformation
108.
Suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3 and job 4). Job 1 has 10,000 rows, but after the run only 5,000 rows have been loaded into the target table, the remaining rows are not loaded and the job is aborted. How can you sort out the problem?
When a job sequencer controls 4 jobs and job 1 has a problem, first go to the Director and check what type of problem it is showing: a data type problem, a warning message, a job failure or a job abort. If the job failed it may be a data type problem or a missing column action. Then go to the Run window -> Tracing -> Performance, or in your target stage -> General -> Action, where there are two options: (i) On Fail - Commit / Continue and (ii) On Skip - Commit / Continue. First check how much data has already been loaded; select the On Skip option with Continue so the already-loaded rows are skipped, and for the remaining data that was not loaded select On Fail with Continue. Run the job again and it should finish successfully.
109. What is Runtime Column Propagation (RCP)?
If RCP is enabled for a job, and specifically for those
stages whose output connects to the shared container input, then the metadata will be propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, the OSH has to perform an import and an export every time the job runs, and the processing time of the job increases.
110. What are the third party tools used in DataStage? Autosys, Control-M, TNG, event coordinator, compile all are some of them.
111. Where does a UNIX script invoked from DataStage execute - on the client machine or on the server?
DataStage jobs, and the scripts they call, execute on the server machine only.
112. How can we pass parameters to a job by using a file?
You can do this by reading the parameter values from a UNIX file in a wrapper script and then calling the DataStage job, which has the parameters defined and receives the values passed from UNIX (see the sketch after question 113 below).
113. What are the default nodes for DataStage Parallel Edition?
The number of nodes depends on the number of processors in your system, usually 2**n.
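For question 112, one hedged way to pass values from a UNIX file is to read them in a wrapper script and hand them to dsjob as -param arguments (the file format, names and paths are assumptions, and values are assumed to contain no spaces):

    #!/bin/sh
    # params.txt holds one name=value pair per line, e.g. PrcsDte=20240101
    PARAM_ARGS=""
    while read line
    do
        PARAM_ARGS="$PARAM_ARGS -param $line"
    done < /data/params/params.txt
    /opt/Ascential/DataStage/DSEngine/bin/dsjob -run -wait $PARAM_ARGS MyProject MyJob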
114. I want to process 3 files sequentially, one by one; while processing, the files should be picked up automatically. How can I do that?
If the metadata of all the files is the same, create a job with the file name as a parameter, then call that job from a routine or a sequence with the different file names (a sample script is sketched below). Alternatively, you can create a sequence to run the job.
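A minimal sketch for question 114 using that approach (project, job and parameter names are hypothetical):

    #!/bin/sh
    # Run the same parameterized job once per input file, waiting for each run to finish
    DSBIN=/opt/Ascential/DataStage/DSEngine/bin
    for f in cust_01.dat cust_02.dat cust_03.dat
    do
        $DSBIN/dsjob -run -wait -mode NORMAL -param InputFile="$f" MyProject LoadFileJob
    done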
115. What is ' insert for update ' in DataStage? There is a lock for update option in Hashed File Stage, which locks the hashed file for updating when the search key in the lookup is not found. 116.
What are the Repository Tables in DataStage and what are they?
A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad hoc, analytical, historical or complex queries. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries and navigation services. Metadata includes things like the name, length, valid values and description of a data element. Metadata is stored in a data dictionary and repository; it insulates the data warehouse from changes in the schema of operational systems. In DataStage, in the I/O and Transfer view under the Interface tab (Input, Output and Transfer pages) you will have 4 tabs; the last one is Build, and under that you can find the TABLE NAME. 117. What is version Control? Version Control is used to
i) store/view different versions of DS jobs
ii) run different versions of the same job
iii) revert to a previous version of a job
118.
thoughts? If a hashed file's output is connected to a Transformer stage, the hashed file is treated as a lookup file. If there is no primary link to the same Transformer stage, then this link is treated as the primary link itself. You can implement SCD in server jobs by using this lookup functionality. This will not return any error code. 119. What is Merging and how can it be done?
The Merge stage is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Consider two tables, Emp and Dept: if we want to join them, DeptNo is the common key, so we give that column name as the key, sort DeptNo in ascending or descending order and join the two tables.
120. How can I connect my DB2 database on AS400 to DataStage? Do I need to use ODBC first to open the database connectivity and then use an adapter for just connecting between the two?
You need to configure ODBC connectivity for any database in DataStage.
121. What is OCI and how is it used by the ETL tool?
OCI is the "Oracle Call Interface" of Oracle, used to load the data. It is the lowest level of Oracle being used for loading the data.
122. If data is partitioned in your job on key 1 and you aggregate on key 2, what issues could arise?
The data will have to be partitioned on both keys, and it will take some more time to execute.
123. What is meant by "Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate the unnecessary records even getting in before joins are made"?
This means try to improve the performance by avoiding use of constraints wherever possible and instead using them while selecting the data itself using a where clause. This improves performance.
124. What is an incremental load and how is it implemented?
Incremental load means daily load. Whenever you select data from the source, select the records which were loaded or updated between the timestamp of the last successful load and the start date and time of today's load. For this you have to pass parameters for those two dates: store the last run's date and time in a file, read it through a job parameter, and use the current date and time as the second argument.
125. How do you read an XLS (Excel) file in DataStage?
Open the ODBC Data Source Administrator found in Control Panel/Administrative Tools. Under the System DSN tab, add the Microsoft Excel driver. Then you will be able to access the XLS file from DataStage.
126.
What is NLS in DataStage? How do we use NLS in DataStage, and what are its advantages? If the NLS option was not chosen at installation time and I now want to use it, what can I do - reinstall DataStage, or uninstall and install it again?
NLS (National Language Support) is basically a local language setting (character set). Once you install DataStage you get NLS; just log in to the Administrator and set the NLS of your project based on the requirement (you need to map the NLS to your project). For example, if you know you have a file with some Greek characters and you set the NLS for Greek, DataStage will recognize those special characters while running the job. By using NLS we can: process data in a wide range of languages; use local formats for dates, times and money; and sort data according to local rules. If NLS is installed, various extra features appear in the product. For server jobs, NLS is implemented in the DataStage Server engine; for parallel jobs, NLS is implemented using the ICU library.
127. What is APT_CONFIG_FILE?
DataStage understands the architecture of the system through the configuration file identified by APT_CONFIG_FILE. This file contains information about node names, disk storage and so on. APT_CONFIG is just an environment variable used to identify the *.apt file; don't confuse the variable with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.
128. What is the difference between server jobs and parallel jobs?
Server jobs: These are available if you have installed DataStage Server Edition. They run on the DataStage server, connecting to other data sources as necessary.
Parallel jobs: These are only available if you have installed Enterprise Edition. They run on DataStage servers that are SMP, MPP or cluster systems; they can also run on a separate z/OS (USS) machine if required.
Server jobs can be run on SMP and MPP machines, but performance (speed) is lower; parallel jobs run on SMP/MPP/cluster machines with higher performance. The server engine generates DataStage BASIC, the parallel engine generates Orchestrate shell script (OSH) and C++, and the mainframe edition generates COBOL and JCL. In server and mainframe jobs you tend to do most of the work in the Transformer stage; in parallel jobs you tend to use specific stage types for specific tasks (and the parallel Transformer stage doesn't do lookups). Parallel jobs also provide automatic partitioning and collection of data, which would have to be managed manually (if at all) in the server environment.
129. How is DataStage 4.0 functionally different from the Enterprise Edition now? What are the exact changes?
There are lots of changes in DS EE: the CDC stage, the Stored Procedure stage, etc.
130.
What is the meaning of the following? 1) If an input file has an excessive number of rows and can be split-up then use standard logic to run jobs in parallel 2) Tuning should occur on a job-by-job basis. Use the power of DBMS.
If you have SMP machines you can use IPC, Link Collector and Link Partitioner stages for performance tuning; if you have cluster/MPP machines you can use parallel jobs. The third point is about tuning the performance of the job: by using the power of the database - analyzing tables, creating indexes, creating partitions - one can improve the performance of the SQL used in the jobs.
131.
What are DataStage multi-byte and single-byte file conversions in mainframe jobs? How do we use those conversions in DataStage? What is UTF-8 and what is its use?
132. What types of routines are there in DataStage?
There are three kinds of routines in DataStage:
1. Server routines, used in server jobs, are written in the BASIC language.
2. Parallel routines, used in parallel jobs, are written in C/C++.
3. Mainframe routines, used in mainframe jobs, are written in COBOL/JCL.
133.
What are the validations you perform after creating a job in designer? What are the different types of errors you faced during loading and how did you solve them?
Check for the Parameters, check whether the input files/tables exist or not and also check for the Usernames, data source names, passwords etc
134. What does a complex design mean and how do you handle it in DataStage?
Complex design means having many joins and lookups. We can implement any complex design in DataStage easily by following simple tips that also increase performance. There is no hard limitation on the number of stages in a job, but for better performance use at most about 20 stages in each job; if the design exceeds that, split it into another job. Don't use more than 7 lookups in one Transformer stage; otherwise include one more Transformer stage.
135. Can we use a shared container as a lookup in server jobs?
Yes, we can use a shared container as a lookup in server jobs. Wherever we use the same lookup in multiple places, we develop the lookup in a shared container and then reuse that shared container as the lookup.
136. Is it possible to run parallel jobs in server jobs?
No, you cannot run parallel jobs in the Server edition. But you can run parallel jobs like server jobs (sequentially) by changing the APT_CONFIG file and setting node = 1, or by setting a node map constraint to node1.
137. What is troubleshooting in server jobs? What are the different kinds of errors encountered while running any job?
138.
How can you implement slowly changing dimensions in DataStage? Can you join a flat file and a database in DataStage? How?
Yes, we can implement SCD in DataStage. As an example, SCD Type 2 with a timestamp: Step 1: the timestamp is created via a shared container; it returns the system time and one key. To satisfy the lookup condition we create a key column by using the Column Generator. Step 2: the source is a Data Set and the lookup table is an Oracle OCI stage. Using the Change Capture stage we find the differences; the Change Capture stage returns a value in change_code, and based on that return value we know whether the row is an insert, edit or update. If it is an insert/update we write it with the current timestamp and keep the old row with its old timestamp as history.
Yes, we can join a flat file and a database in an indirect way. First create a job that populates the data from the database into a sequential file, named e.g. Seq_First. Take the flat file you have and use a Merge stage to join these two files. You have various join types in the Merge stage, such as Pure Inner Join, Left Outer Join and Right Outer Join; use whichever suits your requirements.
139.
Does Enterprise Edition only add the parallel processing for better performance? Are any stages/transformations available in the enterprise edition only?
DataStage Standard Edition was previously called as DataStage Server Edition. DataStage Enterprise Edition was originally called Orchestrate, and then renamed to Parallel Extender when purchased by Ascential. DataStage Enterprise: Server jobs, sequence jobs, parallel jobs. The enterprise edition offers parallel processing features for scalable high volume solutions. Designed mainly for Unix, it now supports Windows, Linux and Unix System Services on mainframes. DataStage Enterprise MVS: Server jobs, sequence jobs, parallel jobs, MVS jobs. MVS jobs are jobs designed using an alternative set of stages that are generated into Cobol/JCL code and are transferred to a mainframe to be compiled and run. Jobs are developed on a Unix or Windows server and then transferred to the mainframe to be compiled and run. The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container. Server jobs only accept server stages; MVS jobs only accept MVS stages. There are some stages that are common to all types (such as aggregation) but they tend to have different fields and options within that stage. Row Merger and Row splitter are only present in parallel Stage. 140. What are OraBulk and BCP stages? ORABULK is used to load bulk data into single table of target oracle database. BCP is used to load bulk data into a single table for Microsoft SQL server and Sybase.
141.
If you're running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create?
It depends on the number of active stages and how they are linked. Only active stages can create a process. For e.g. if there are 6 active stages (like transforms) linked by some passive stages, the total no. of processes will be 6x4=24.
142.
How can I specify a filter command for processing data while defining sequential file output data?
When you open a Sequential File stage to edit it, the General tab has a check box "Stage uses filter commands"; select this and then go to the required page (Inputs/Outputs). On that page's General tab the Filter Command field is enabled; enter the required filter command, such as gzip (compress) or gunzip (uncompress) on UNIX. In short: Sequential File stage - edit | General tab - select the "Stage uses filter commands" option | Inputs/Outputs page - Filter Command - enter/browse the required command.
143.
What is the User Variables activity and where is it used?
Using the User Variables activity we can create variables in a job sequence. These variables are then available to all the activities in that sequence. This activity is most commonly placed at the start of the job sequence.
144. Can you convert a snowflake schema into a star schema?
Yes, we can convert it by attaching one hierarchy to the lowest level of another hierarchy.
145. How do you extract data from more than one heterogeneous source - for example a sequential file, Sybase and Oracle - in a single job?
You can convert all the heterogeneous sources into sequential files & join them using Merge stage (OR) You can write a user defined query in the source itself to join them.
146. What is the difference between a Data Set and a File Set?
Dataset: DataStage parallel extender jobs use data sets to manage data within a job; you can think of each link in a job as carrying a data set. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other DataStage jobs. You can use orchadmin dump <dataset>.ds to view a dataset in UNIX. A fileset is a mechanism DataStage uses to manage a collection of sequential files; the file-collection metadata is held in a fileset file with an .fs extension.
147.
What is the exact difference between Join, Merge and Lookup Stage?
The exact difference between Join, Merge and lookup is The three stages differ mainly in the memory they use. DataStage doesn't know how large your data is, so cannot make an informed choice whether to combine data using a join stage or a merge stage or a lookup stage. Here's how to decide what to use: If the reference datasets are big enough to cause trouble, use a Join. A Join does a HighSpeed sort on the driving and reference datasets. This can involve I/O if the Data is big Enough, but the I/O is all highly optimized and sequential. Once the sort is over the Join processing is very fast and never involves paging or other I/O. Unlike Join stages and Lookup stages, the Merge stage allows you to specify several Reject links as many as input links.
Data selection is very easy in Join and more difficult in Merge. Join doesn't support reject links whereas Merge supports reject links. Join supports only 2 files or 2 databases whereas Merge supports any number of inputs and outputs. So, performance-wise, Join is better than Merge. Lookup is used just for comparison purposes when the reference data is small.
148.
There are three different types of user-created stages available for PX. What are they? Which would you use? What are the disadvantages for using each type?
These are the three different stages: i) Custom ii) Build iii) Wrapped
149.
DataStage from Staging to MDW (Master Data Warehouse) is only running at 1 row per second! What do we do to remedy?
In general, if you have too many stages (especially transformers and hashed-file lookups) there will be a lot of overhead and the performance will degrade drastically. I would suggest writing a query instead of doing several lookups. If many lookups are being done, ensure that you have appropriate indexes on the tables being queried. If you don't want to write the query and prefer intermediate stages, ensure that you eliminate unneeded data between stages so that the data volume doesn't cause overhead; a re-ordering of stages may be needed for good performance. Other things that could be looked at in general: 1. For massive transactions, set the hashing size and buffer size to appropriate values so that as much as possible is performed in memory and there is no I/O overhead to disk. 2. Enable row buffering and set an appropriate size for the row buffer. 3. It is important to use appropriate objects between stages for performance.
150.
How does the hashed file do a lookup in server jobs? How does it compare the key values?
A hashed file is used for two purposes: 1. to remove duplicate records, and 2. for reference lookups. A hashed file consists of three parts: each record has a hash key, a key header and a data portion. By using the hash algorithm and the key value, the lookup is fast.
151. How does DataStage compare with Informatica?
Informatica has more developer-friendly features, but when it comes to scalability of performance it is much inferior compared to DataStage. Here are a few areas where Informatica is inferior:
1. Partitioning - DataStage PX provides many more robust partitioning options than Informatica; you can also re-partition the data whichever way you want.
2. Parallelism - Informatica does not support full pipeline parallelism.
3. File lookup - Informatica supports flat-file lookups, but the caching is horrible; DataStage supports hashed files, lookup file sets and datasets for much more efficient lookups.
4. Merge/Funnel - DataStage has very rich functionality for merging or funneling streams; in Informatica the only way is to do a Union, which by the way is always a Union-all.
152.
Build-op stages generate C++ code (object-oriented), whereas a subroutine is an ordinary program that you can call anywhere in your project.
153. If I add a new environment variable in Windows, how can I access it in DataStage?
You can view all the environment variables in the Designer and check them in Job Properties. You can add and access environment variables from Job Properties.
154. Is it possible to call one job in another job in server jobs?
Yes, although "calling" is not quite the right word: you attach/add the other job through the job properties, and you can attach zero or more jobs. Steps: Edit --> Job Properties --> Job Control, click Add Job and select the desired job.
155. What is hashing algorithm and explain briefly how it works?
Hashing is key-to-address translation: the value of a key is transformed into a disk address by an algorithm, usually a relative block and an anchor point within the block. How well the algorithms work is closely related to statistical probability. The algorithms are usually quite simple and use division and remainder techniques. Interestingly, these approaches are called "Monte Carlo techniques" because the behavior of a hashing or randomizing algorithm can be simulated by a roulette wheel, where the slots represent the blocks and the balls represent the records.
156.
Is it possible to move the data from an Oracle warehouse to an SAP warehouse using DataStage?
We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer the data from Oracle to the SAP warehouse. These plug-in packs are available with DataStage version 7.5.
157. Is it possible to access the same job by two users at a time in DataStage? No, it is not possible to access the same job by two users at the same time. DS will produce the following error: "Job is accessed by other user".
158.
What are the differences between the data stage 7.0 and 7.5 in server jobs?
There are lots of new stages available in DS 7.5, e.g. the CDC stage, the Stored Procedure stage, etc. In server jobs we got the Stored Procedure stage, the Command stage and the generate-report option in the File tab. In job sequences, stages such as Start Loop activity, End Loop activity, Terminate Loop activity and User Variables activity were introduced. In parallel jobs, the Surrogate Key stage and the Stored Procedure stage were introduced.
159.
How much would be the size of the database in DataStage? What is the difference between Inprocess and Interprocess?
Regarding the database size, it varies and depends upon the project. In-process and inter-process are the two row-buffering options: in-process buffering passes rows between the stages of a job within a single process, whereas inter-process buffering runs the active stages as separate processes that exchange blocks of rows through shared buffers, which helps on multi-processor servers.
160.
The steps required are: 1. Select the source stage depending on the source, e.g. flat file, database or XML. 2. Select the stages required for the transformation logic, such as Transformer, Link Collector, Link Partitioner, Aggregator, Merge, etc. 3. Select the final target stage where you want to load the data: a data warehouse, data mart, ODS or staging area. 161. What is the difference between DRS and ODBC stage?
The DRS stage should be faster than the ODBC stage because it uses native database connectivity; you need to install and configure the required database clients on your DataStage server for it to work. The Dynamic Relational Stage was originally leveraged for PeopleSoft so that a single job could run against any of the supported databases, and it supports ODBC connections too. ODBC uses the ODBC driver for a particular database, whereas DRS tries to make switching from one database to another seamless by using native connectivity for the chosen target. The DRS stage can also be used in parallel jobs.
162.
What are Profile Stage, Quality Stage and Audit Stage in DataStage?
Profile stage is used to analyze the data and their relationship. Quality stage is used to check for the quality. Audit Stage -------------
163.
An environment variable is a predefined variable that can be used while creating a DS job. We create/declare these variables in the Administrator, and while designing the job we set their values in the job properties (Parameters tab). For example, for a database username/password we declare the variables in the Administrator and set the values (e.g. scott and tiger) in the Parameters tab of the job properties. Example: $APT_CONFIG_FILE. There are many such environment variables; go to job properties and click "Add Environment Variable" to see the full list.
164. What is DataStage engine? What is its purpose? The DataStage server contains the DataStage engine. The DS server interacts with the client components and the repository; only when the engine is running can we develop and run jobs.
165. What is the purpose of exception activity in DataStage 7.5? The stages that follow the Exception activity are executed whenever an unknown error occurs while the job sequencer is running.
166. What is a project? Specify its various components?
You always enter into the DataStage through a DataStage project. When you start a DataStage Client you are prompted to connect to a project. Each project contains: 1. DataStage jobs. 2. Built-in components. These are predefined components used in a job. 3. User-defined components. These are customized components created using the DataStage Manager or DataStage Designer.
What is the meaning of file extender in DataStage server jobs? Can we run the DataStage job from one job to another job, and where is the file data stored? File extender means adding columns or records to an existing file. We can run one DataStage job from another job in DataStage.
167.
168.
1. If you want to know whether a job is part of a sequence, right-click on the job in the Manager and select Usage Analysis; it will show all of the job's dependents. 2. To find how many jobs use a particular table. 3. To find how many jobs use a particular routine. In this way you can find all the dependents of a particular object; it is nested, so you can move forward and backward and see all the dependents.
169. What is the purpose of using a key, and what is the difference between a surrogate key and a natural key?
We use keys to provide relationships between entities (tables). By using primary and foreign key relationships we can maintain integrity between the data. The natural key is the one coming from the OLTP system, while the surrogate key is an artificial key that we create in the target DW and can use instead of the natural key. Surrogate keys play a major role in SCD2 scenarios. 170. Where are flat files actually stored? What is the path? Flat files store the data, and the path can be given in the General tab of the Sequential File stage. Normally flat files are stored on FTP servers or in local folders; .CSV, .XLS and .TXT are common flat-file formats.
171.
When you compile a job, DataStage checks basic things such as whether all the important stage parameters have been set and the mappings are correct, and then it creates an executable job. You validate a compiled job to make sure that all the connections are valid, all the job parameters are set, and a valid output can be expected after running the job. It is like a dry run: you do not actually touch the live data, but you gain confidence that things will work. When we say "validating a job", we are talking about running the job in "check only" mode. The following checks are made: connections are made to the data sources or data warehouse; SQL SELECT statements are prepared; files are opened; intermediate files in Hashed File, UniVerse or ODBC stages that use the local data source are created, if they do not already exist.
172.
Apart from the full outer join (which does not support more than two links), all other join types support joining more than two inputs. 1) Number of reject links: a Join does not support reject links, whereas a Merge has as many reject links as there are update links (if there are n input links, one is the master link and n-1 are update links). 2) Data selection: in a Join there are various ways data can be selected, e.g. the different join types such as inner, outer (left, right, full) and cross join, so you have different criteria for dropping or keeping a row. In a Merge, data in the master record and update records are combined only when both have the same values for the merge key columns.
173.
How to implement type2 slowly changing dimension in DataStage? Give me with example?
Slowly changing dimensions are a common problem in data warehousing. For example, a customer called Lisa in company ABC lives in New York and later moves to Florida; the company must now modify her address. In general there are three ways to solve this problem. Type 1: the new record replaces the original record, leaving no trace of the old record at all. Type 2: a new record is added to the customer dimension table, so the customer is essentially treated as two different people. Type 3: the original record is modified to reflect the change. In Type 1 the new value overwrites the existing one, so no history is maintained - the record of where she lived before is lost - but it is simple to use. In Type 2 a new record is added, so both the original and the new record are present and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the dimension table grows, so storage and performance can become a concern; Type 2 should only be used if the data warehouse needs to track the historical changes. In Type 3 there are two columns, one for the original value and one for the current value; for example, a new column is added showing the original address as New York and the current address as Florida. This keeps part of the history without increasing the table size, but when the customer moves from Florida to Texas the New York information is lost, so Type 3 should only be used if the changes will occur only a finite number of times.
174.
How we can call the routine in DataStage job? Explain with steps?
Routines are used for implementing business logic. There are two types: 1) Before-job subroutines and 2) After-job subroutines. Steps: double-click the Transformer stage, right-click on any one of the mapping fields, select the [dsroutines] option, write the business logic in the edit window, and select either the Before or After subroutine option.
175.
Controlling DataStage jobs from other DataStage jobs: consider two jobs XXX and YYY. Job YYY can be executed from job XXX by using DataStage macros in routines. To execute one job from another, the routine follows these steps: 1. Attach the job using the DSAttachJob function. 2. Run the job using the DSRunJob function. 3. Stop the job, if required, using the DSStopJob function. JCL stands for Job Control Language; it is used to run a number of jobs at a time, with or without loops. Steps: click Edit in the menu bar, select Job Properties and enter the parameters (for example parameter prompts such as STEP_ID, SRC, DSN, username and password as string types), then go to the job control tab, select the jobs from the list box and run the job.
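The same control can also be scripted outside DataStage with the dsjob command line; the sketch below is illustrative only, assuming the DataStage environment (dsenv) has been sourced so that dsjob is on the PATH, and the project name and job name are assumptions.

    # Run job YYY from a shell script, wait for it, then query and (if needed) stop it.
    dsjob -run -wait MyProject YYY      # attach-and-run equivalent from the shell
    dsjob -jobinfo MyProject YYY        # report the job's status and run times
    dsjob -stop MyProject YYY           # stop the job if it has to be killed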
176.
A job can be made read-only by the following process: export the job in .dsx format and change the attribute that stores the read-only information from 0 (editable job) to 1 (read-only job). Then import the job again and overwrite or rename the existing job, so that you have both forms. How to drop the index before loading data into the target and how to rebuild it in DataStage? This can be achieved with the "Direct Load" option of the SQL*Loader utility.
177.
178.
179. How to kill the job in data stage? You should use kill -14 jobid. Sometimes using kill -9 jobid leaves things in a bad state. 180.
What is the difference between sequential file and a dataset? When to use the copy stage?
Sequential file: acts as a source and as permanent storage for a target; its extension is typically .txt. Dataset (.ds): acts as a temporary storage stage, mainly used just before the target stage. While using this stage the input data is partitioned and converted into the internal dataset format, so it is easy to load into the target stage. A dataset supports parallelism but a sequential file does not. Copy: acts as a placeholder; it has a single input and many outputs, so it is very easy to add a new stage to the job later, otherwise you would have to modify the whole job.
181. How to parameterize a field in a sequential file? I am using DataStage as the ETL tool and a sequential file as source.
We cannot parameterize a particular field in a sequential file; instead we can parameterize the source file name of the sequential file, e.g. #FILENAME#.
182. What does separation option in static hash-file mean?
The different hashing algorithms are designed to distribute records evenly among the groups of the file, based on characters and their position in the record ids. When a hashed file is created, Separation and Modulo respectively specify the group buffer size and the number of groups (buffers) allocated for the file. When a static hashed file is created, DataStage creates a file that contains the number of groups specified by the modulus. Size of the hash file = modulus (number of groups) * separation (group buffer size); for example, a modulus of 100 and a separation of 4 give 100 * 4 = 400 buffer units of group space.
183.
What do you mean by active stages and passive stages in DataStage environment? Active Stage: It is the "T" of ETL and Passive Stage : It is the "E & L" of ETL
184. What is Runtime Column Propagation and how to use it? If the data has more columns than are defined in the metadata, and runtime column propagation is enabled, DataStage propagates those extra columns through the rest of the job.
185.
How do you check the consistency and integrity of the model and the repository?
186.
Give one real time situation where link partitioner stage used?
If we want to send data from the source to the targets quickly, we use the Link Partitioner stage in server jobs. It can create a maximum of 64 partitions and it is an active stage. Normally we cannot connect two active stages, but this stage is allowed to connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, again with a maximum of 64 partitions; this is also an active stage. To avoid connecting an active Transformer directly to the Link Collector we use the Inter-Process Communication (IPC) stage: as this is a passive stage, the data can then be collected by the Link Collector. But we can use the IPC stage only when the target is a passive stage.
187. How do you clean the DataStage repository? Remove the log files periodically.
188.
Can both the source systems and the target data warehouse (Oracle, SQL Server, etc.) be on a Windows environment, or must one of the systems be on UNIX/Linux?
In the server edition you can have both on Windows, but in PX the target should be on UNIX. 189. What is the difference between static hash files and dynamic hash files? A static hash file cannot change its number of groups except through manual resizing, whereas a dynamic hash file automatically changes its number of groups (modulus) in response to the amount of data stored in the file.
190. How can I schedule the cleaning of the file &PH& by dsjob? Create a job with a dummy Transformer and Sequential File stage. In the Before-Job subroutine, use ExecTCL to execute the command CLEAR.FILE &PH&.
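Some teams also clean up old phantom files directly from the operating system; the sketch below is only an illustration, and the project path and the 7-day retention period are assumptions (run it only when no jobs are active).

    # Delete phantom log files older than 7 days from a project's &PH& directory.
    find "/opt/IBM/InformationServer/Server/Projects/MyProject/&PH&" -type f -mtime +7 -exec rm -f {} \;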
191.
What is the difference between symmetrically parallel processing, massively parallel processing?
Symmetric Multi Processing (SMP) - some hardware resources may be shared by processors; the processors communicate via shared memory and run a single operating system. Cluster or Massively Parallel Processing (MPP) - known as "shared nothing", where each processor has exclusive access to its hardware resources; cluster systems can be physically dispersed, and the processors have their own operating systems and communicate via a high-speed network. SMP is the processing of programs by multiple processors that share a common operating system and memory, which is why it is also called "tightly coupled multiprocessing"; a single copy of the operating system is in charge of all the processors, and an SMP configuration rarely exceeds about 16 processors. SMP is better than MPP when online transaction processing is done, where many users access the same database with a relatively simple set of common transactions; one main advantage of SMP is its ability to dynamically balance the workload among processors and so serve more users at a faster rate. Massively Parallel Processing (MPP) is the processing of programs by multiple processors that work on different parts of the program and have their own operating systems and memories. These processors communicate with each other through message interfaces, and there are cases with up to 200 processors running a single application; an interconnect arrangement of data paths allows messages to be sent between the processors working on a single application or product. The setup for MPP is more complicated than for SMP: experience and in-depth knowledge are needed to partition the database among the processors and to assign the work to them. An MPP system can also be called a loosely coupled system, and an MPP is considered better than an SMP for applications that allow a number of databases to be searched in parallel.
192.
#!/bin/bash
# Stop and restart the DataStage engine; run this script as the dsadm user.
DSHOMEBIN=/home/dsadm/Ascential/DataStage/DSEngine/bin
# If any DataStage client connections are still active, kill them first.
for PID in $(ps -ef | grep DataStage | grep -v grep | awk '{print $2}'); do
    kill -9 "$PID"
done
$DSHOMEBIN/uv -admin -stop  > /dev/null
$DSHOMEBIN/uv -admin -start > /dev/null
# Verify the processes and check the connection.
echo "Started properly"
193. How to find the process id? Explain with steps? You can find it in UNIX by using the ps -ef command. It displays all the processes currently running on the system along with their process ids. 194. How to find the number of rows in a sequential file? Using the Row Count system variable.
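From the operating system, the line count of a sequential file can also be checked directly; a one-line sketch (the file path is an assumption):

    # Count the data rows in a sequential file; subtract 1 if the file has a header row.
    wc -l < /data/source/customers.txt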
195.
You can improve the performance of a hashed file by: 1. Preloading the hash file into memory - enable the preload option in the hashed file output stage. 2. Write caching - data is written into a cache before being flushed to disk; enabling this ensures that hash file pages are written to disk in order from the cache rather than in the order in which individual rows were written. 3. Pre-allocating - estimate the approximate size of the hash file so that the file does not need to be split too often after write operations. 196. What is difference between ETL and ELT?
ETL usually scrubs the data and then loads it into the data mart or data warehouse, whereas ELT loads the data first and then uses the RDBMS to scrub and reload it into the data mart or DWH. ETL = Extract >>> Transform >>> Load; ELT = Extract >>> Load >>> Transform.
197.
Transaction size - the transaction size for new jobs is now handled by "Rows per transaction" on the Transaction Handling tab of the Input page. Rows per transaction - the number of rows written before a commit is executed for the transaction; the default value is 0, meaning all the rows are written and then committed to the data table in one transaction. Array size - the number of rows written to or read from the database at a time; the default value is 1, that is, each row is written in a separate statement.
198.
Oracle OCI: we can write the source query in this stage, but we cannot write a lookup query in it; instead we use a hashed file stage for the lookup. We have the facility to write multiple statements to run before (Oracle OCI/Output/SQL/Before) or after (Oracle OCI/Output/SQL/After) the actual query. We do not have a multi-row lookup facility in this stage. ODBC: we can write both the source query and the lookup query in this stage itself; we do not have the facility to write multiple before/after statements, but we do have the multi-row lookup facility.
199.
Can you tell me for what purpose .dsx files are used in the DataStage?
.dsx is the standard file extension for exported DataStage jobs. Whenever we export a job or a sequence, the file is exported in .dsx format. A standard usage is that we develop the job in our test environment and, after testing, we export the .dsx file and import it into the production environment.
200.
In parallel jobs we can use surrogate key generator stage. In server jobs we can use an inbuilt routine called KeyMgtGetNextValue. You can also generate the surrogate key in the database using the sequence generator.
201.
If we have two sources with the same metadata, how do we check whether the data in the two sources is the same, and abort the job if it is not?
Use a Change Capture stage and send its output into a Transformer. Write a routine that aborts the job and call it from the Transformer (for example when @INROWNUM = 1), so that if any row indicating a difference reaches the Transformer, the job is aborted. See the sketch after this answer for an operating-system-level alternative.
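Outside DataStage, the same "sources must match" check is sometimes scripted before the load; the sketch below is illustrative only and the extract file names are assumptions.

    # Abort (non-zero exit) if the two extracted files are not identical.
    if ! cmp -s extract_a.dat extract_b.dat; then
        echo "Source data sets differ - aborting the load" >&2
        exit 1
    fi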
202.
How do you call procedures in DataStage? Use the Stored Procedure Stage.
203.
The straight link is the one where data are passed to next stage directly and The reference link is the one where it shows that it has a reference (reference key) to the main table. 204. Disadvantages of staging area? Disadvantage of staging area is disk space as we have to dump data into a local area.
205.
206.
What are the new features of DataStage 7.1 from DataStage 6.1?
207. How to write and execute routines for PX jobs in C++? You define and store the routines in the DataStage repository (e.g. in the Routines folder); they are compiled with a C++ compiler and then executed by the parallel jobs that reference them.
208.
1. What are various processes which start when the DataStage engine starts? 2. What are the changes need to be done on the database side, if I have to use DB2 stage?
3. Is the DataStage engine responsible for compilation, execution, or both? Three processes start when the DataStage engine starts: 1. DSRPC, 2. DataStage Engine Resources, 3. DataStage Telnet Service.
209.
How we use the DataStage Director and its run-time engine to schedule running the solution, testing and debugging its components, and monitoring the resulting executable versions on ad hoc or scheduled basis?
210. Where can you output data using the Peek Stage? The output of the Peek stage can be viewed in the DataStage Director log and can also be saved as a text file. 211. Why is a hash file faster than a sequential file or ODBC stage? Hashed files are stored locally on the DataStage server and can be cached in memory, and a keyed lookup goes directly to the group holding the key, so there is no scan or network round trip.
212.
What is Ad-Hoc access? What is the difference between Managed Query and Ad-Hoc access?
213.
1. Difference between Hash file and Sequential File? What is modulus? 2. What is iconv and oconv functions? 3. How can we join one Oracle source and Sequential file? 4. How can we implement Slowly Changing Dimensions in DataStage? 5. How can we implement Lookup in DataStage Server jobs? 6. What are the third party tools used in DataStage? 7. What is the difference between routine and transform and function? 8. What are the Job parameters? 9. What is Plug-in? 10. How can we improve the performance of DataStage jobs? 11. How can we create Containers? 12. What about System variables? 13. What is the use of Usage analysis? 14. Different ways a project can be moved to production? Ans: export-import and Version control. 15. What database is the DataStage repository use? Answer: Universe Database 16. How is the scheduling done in the project? 17. Which version of DataStage is used in the project? 18. What is the performance tuning required while dealing with large data? 19. What does the reject option in transformer do? 20. What is the architecture of DataStage?
21. How do you define and use the job parameters? 22. What are stage variables, system variables and environment variables? 23. How to use routines in DataStage? 24. What is difference between shared-container and local-container? 25. How do you connect to Oracle? 26. Please explain any ETL process that you have developed? 27. What is hash file? Types of hash files. 28. If you are doing any changes in shared-container will it reflect in all the jobs wherever you used this shared container? 29. Have u written any custom routines in your project? If so explain? 30. How do you get log info into a file? 31. What is before job subroutine/After job subroutine? When do you use them? 32. How do you backup and restore the project? 33. What is Clear Status File and when do you use it? 34. What is Cleanup Resources and when do you use it? 35. Can I join a flat file and oracle and load into oracle? Is this possible? 36. While loading some data into target suddenly there is a problem loading Process stopped how can u start loading from the records that were left? 37. What are the general problems that u face in DataStage? 38. What are the various reports that could be generated using this DataStage? 39. How to remove blank spaces from data? 40. What is Active and Passive stage? 41. What all are the stages you have used in your project? 42. Could DataStage generate test cases? 43. What is difference between hash file and sequential file? 44. What is the difference between Transform and routine? 45. What is sequencer? These are some of DataStage PX Jobs questions which can be asked in interviews. 1) Types of parallel processing 2) What is the SMP (Symmetric Multi Processing) and MPP (Massively Parallel Processing)? 3) What are CPU limited, Memory limited and Disk I/O limited jobs? 4) Can one combine pipeline & partition parallelism? 5) Advantages of PX over server job 6) Is it possible to create user-defined stage in PX? 7) Can I use hash file in PX? 8) What is surrogate key stage? 9) What is the use of APT_DUMP_SCORE? Ans: To get messages in logs such as no. of processes, no. of nodes used. 10) What are the four types of joins possible in Joiner stage? Ans: Inner, Left Outer, Right Outer, Full outer 11) What are the components of APT_CONFIG_FILE? Ans: Nodes, Fast node, Pools, Resource 12) What are the points that needs to be considered while creating the config file? Ans: Available nodes, CPU time, available memory, what other process to be executed on same nodes, are there any configurations restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them, Get breakdown of the resource usage? Is the hardware config SMP, Cluster or MPP? 13) When are wrappers created? Ans: only for executable commands for UNIX, dos 14) When are buildups created? Ans: More functionality, complex logic needed. 15) When are custom stage created?
Ans: new operators need which are not in EE 16) What are different job sequencer stages? 17) What is iconv and oconv functions? 18) Can we implement Slowly Changing Dimensions in DataStage? What are Job parameters? 19) What does the reject option in transformer do? 20) How is parallelism executed? 21) What is RCP? 22) Difference between join, merge, and lookup stage? 23) What is dataset? 24) Differences between dataset, fileset and lookup file set? Questions on Data Warehousing concept 1. What is Data Warehouse? 2. What is difference between Data Warehouse and Data Mart? 3. What is Star schema? 4. What is Snow-flake schema? 5. What is fact and dimension? 6. What is surrogate key? 7. What Normalization? Explain 3rd Normalized form? 8. What is the difference between OLTP and OLAP? 9. Are you involved in data modeling? If yes which tool/tech you are using? 10. Which schema modeling techniques you ever used? 11. What do you mean by summary table? 12. What are Degenerated Dimensions? 13. What is fact less fact? Fact table that contains no measures or facts are called "factless fact tables" or "junction tables". Oracle question based on data warehouse? 1. What is parallel execution 2. What are Bitmap and B-Tree indexes? Explain Local Vs Global variables 3. What is materialized view 4. What is page size/array size in oracle? 5. What are integrity constraints? 6. How can one tune SQLs in Oracle?
214.
Testing of jobs can be performed at many diff levels: Unit testing, SIT (System Integration Testing) and UAT (User Acceptance Testing) Phases. Testing basically involves functionality and performance tests. First, data for the job needs to be created to test the functionality. By changing the data we will see whether the requirements are met by the existing code. Each iteration of code change should be accompanied by a testing iteration. Performance tests basically involve load tests and see how well the existing code Perform in a finite period of time. Performance tuning can be performed on SQL or the job design or the basic/OSH (Orchestrate Shell) code for faster processing times. In addition, all job designs should include an error correction and fail over support so that the code will be robust. 215. What is pivot stage? Why are u using? What purpose that stage will be used?
The Pivot stage is used to turn horizontal columns into vertical rows; for example, a source row with quarterly columns Q1, Q2, Q3 and Q4 can be pivoted into four output rows, one per quarter.
An environment variable is a predefined variable that we can use while creating a DS job. We can set it either at the project level or at the job level; once set, the variable is available to that project/job. We can also define new environment variables; for that, go to the DS Administrator. 217. What is a routine?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of routines: 1) Transform functions 2) Before-after job subroutines 3) Job Control routines.
218. How can we create environment variables in DataStage? This mostly comes under Administrator part. As a Designer, we can add directly by Designer -- view -- job properties -- parameters -- add environment variable -- under user defined -- then add.
219.
What is the use of Hash file? Why cant we use sequential file?
The primary use of the hash file is to do a look up. You can use a sequential file for look up but you need to write your own routine to match the columns. Coding time and execution time will be more expensive. But when you generate a hash file the hash file indexes the key by an inbuilt hashing algorithm. So when a look up is made, it will be much faster. Also it eliminates the duplicate rows.
220.
How to read the data from EXCEL FILES? Explain with steps?
To read data from an Excel file: 1) save the file as .csv (comma-separated values); 2) use a flat file (Sequential File) stage in the DataStage job; 3) double-click the stage and assign the .csv file as its input; 4) import or type the metadata for the file (once the metadata is in place, click View Data to check the values); then do the rest of the transformation as needed. 221. What is meant by performance tuning technique? Give an example.
Performance tuning means we have to take some action to increase the performance of the
slowly running job by: 1) using Link Partitioner and Link Collector to speed up the flow; 2) using sorted data for aggregation; 3) using a sorter at the source side and aggregation at the target side; 4) tuning the OCI stage 'Array Size' and 'Rows per Transaction' values for faster selects, inserts and updates; 5) not using an IPC stage at the target side; etc.
222.
Unit testing: in a DataStage scenario, unit testing is the technique of testing individual DataStage jobs for their functionality. Integration testing: when two or more jobs are tested together for their combined functionality, that is called integration testing. 223. How can we improve performance in the Aggregator stage? Sort the data before sending it to the Aggregator.
224.
Pipeline parallelism - consider three CPUs connected in series: as data is fed into the first one it starts processing, and the output is simultaneously transferred to the second CPU, and so on. You can compare this to three sections of a pipe: as water enters the pipe it starts moving through all the sections. Partition parallelism - consider three CPUs connected in parallel and fed with data at the same time, which spreads the load and improves efficiency. You can compare this to one big pipe containing three inner pipes: as water is fed in, a large quantity is consumed in less time.
225.
CREATE INDEX emp_ename ON emp(ename) TABLESPACE users STORAGE (INITIAL 20K NEXT 20k PCTINCREASE 75);
226.
What are the main diff between server job and parallel job in DataStage?
There are a lot of differences in how the same stages behave in server and parallel jobs. For example, in parallel jobs a sequential file (or any other file stage) can have either an input link or an output link, whereas in server jobs it can have both (and more than one). Server jobs compile and run within the DataStage server, while parallel jobs compile and run on the DataStage UNIX server. A server job extracts all rows from the source to the next stage, and only then does that stage become active and pass rows on towards the target or DWH, which takes time. Parallel jobs use two kinds of parallelism: 1. pipeline parallelism - rows are passed to the next stage as soon as they are extracted, so the downstream stage is active at the same time and passes rows on towards the target; this pipeline runs within a single node between source and target; 2. partition parallelism - the data is split across more than one node between source and target.
227. What is the use of job control? Job control is used for scripting. With the help of scripting, we can set parameters for a caller job, execute it, do error handling etc tasks. 228. What are different types of file formats? Some are comma delimited csv files, tab delimited text files...
229.
What is complex Flat File stage? In which situation we are using this one?
A complex flat file can be used to read the data at the initial level. By using CFF, we can read ASCII or EBCDIC (Extended Binary coded Decimal Interchange Code) data. We can select the required columns and can omit the remaining. We can collect the rejects (bad formatted records) by setting the property of rejects to "save" (other options: continue, fail). We can flatten the arrays (COBOL files).
230. Why job sequence is use for? What are batches? What is the difference between job sequence and batches? Job Sequence allows you to specify a sequence for server or parallel jobs to run. The sequence can also have control information, for e.g., you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you defined a job sequence, it can be scheduled and run using DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job. A Batch is a collection of jobs grouped together to perform a specific task. I.e. It is a special type of job created using DataStage director which can be scheduled to run at a specific time. Difference between Sequencers and Batches: Unlike as in sequencers, in batches we cannot provide the control information. 231. For what purpose is the Stage Variable is mainly used?
Stage variable is a temporary storage memory variable in transformer stage. If we are doing calculations repeatedly, we can store that result in a stage variable. It can be used in situations where you want to Store a Previous record value in a variable and compare that with current record value and use in if then else conditional statements. 232. What is a sequential file that has single input link? Sequential file always has a single link because it cannot accept the multiple links or threads. Data in sequential file always runs sequentially.
233. What are different types of star schema? Multi star schema or galaxy schema is one of the types of star schema.
234. What is the sequencer stage?
Lets say there are two jobs (J1 & J2) as the input links and one job (J3) as output link for a sequencer stage in a DS job sequencer. The sequencer can be set to "ALL" or "ANY". If it is set to "ALL", the sequencer triggers the third job (J3) only after the 2 input linked jobs (J1 & J2) completes and if it is set to "ANY", it just waits for any of the job (J1 or J2) to complete and then triggers the third one.
235. What is the use of tunable? Tunables is a tab in project property in DataStage Administrator, in that we can change the value of cache size between 000 to 999 MB. Default size of cache size is 128MB.
236. What is user activity in DataStage? The User variable activity stage defines some variables; those are used in the sequence in future.
237. What is the alternative way where we can do job control?
Job control can be done using : DataStage job Sequencers DataStage Custom routines Scripting Scheduling tools like Autosys
Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as EBCDIC using DataStage? Currently, the total is converted to ASCII, even though the individual records are stored as EBCDIC.
238.
239.
Suppose you have table "sample" & three columns in that table sample:
Cola  Colb  Colc
1     10    100
2     20    200
3     30    300
Assume cola is the primary key. How will you fetch the record with the maximum cola value into the target system using DataStage? You can use an OCI stage to read the source table. In the OCI stage, write a user-defined SQL query such as "SELECT * FROM sample WHERE cola = (SELECT MAX(cola) FROM sample)", which fetches the row with the maximum cola value, and then load the data into the target table.
240.
It is always preferable to use a Sort stage before an Aggregator stage. Based on the aggregation logic we should sort the incoming data, using hash partitioning on the keys, and then use the same partitioning on the Aggregator stage; this is the most common approach. 241. Is it possible to query a hash file? Justify your answer. Yes, it is possible. A hashed file is like a database table for all practical purposes, so you can log in to the uvsh/TCL prompt and query it like any other database table. 242. What is the difference between RELEASE THE JOB and KILL THE JOB? Releasing a job releases it from any dependencies and runs it; killing a job kills the job that is currently running or scheduled to run.
243. What is process to remove duplicate records using Copy Stage? In the copy stage, there is an option to sort the input data where you can select the key on the basis of which you can perform the sort operation. There is an option named 'Unique'. Select this, it will return the unique record after the sort operation. 244.
What is the difference between IBM Web Sphere DataStage 7.5 (Enterprise Edition) & Standard Ascential DataStage 7.5 Version?
IBM acquires DataStage from Ascential. DataStage 7.5 was released with the name of Ascential DataStage and only 8.0 was released with IBM Websphere DataStage 8.0. IBM Information Server is also known as DS 8.0. It has more features like Quality Stage & MetaStage. It maintains its repository in DB2 unlike files in 7.5. Also it has a stage specifically for implementing the SCD 1 & 2 (SCD Stage). The advanced edition of DataStage is Websphere DataStage and Quality stage, it is released by IBM itself and given the version as 8.0.1, in this there are only 3 client tools (administrator, designer, director), here they have removed the manager, it is included in designer itself (for importing and exporting) and in this they have added some extra stages like SCD stage, by using this we can implement scd1 and scd2 directly, and some other advanced stages. They have included the Quality Stage, which is used for data validation which is very important for DWH. There are so many things available in Quality stage. 245. What is the difference between Merge Stage and Lookup Stage?
Lookup stage: 1. Used to perform lookups. 2. Multiple reference links, a single input link, a single output link and a single reject link. 3. Large memory usage, because the reference data is held in memory and paging may be required. 4. Data on the input or reference links does not need to be sorted. Merge stage: 1. Combines a sorted master dataset with sorted update datasets. 2. Several reject links and multiple output links can exist. 3. Less memory usage. 4. The data needs to be sorted. 246. What is repository? The repository resides in a specified database. It holds all the metadata, raw data, mapping information and the respective mapping details. 247. How do you extract job parameters from a file? Through a User Variables activity in a sequencer job, or by calling a routine. 248. Aggregators - what does the warning "Hash table has grown to xyz" mean? The Aggregator cannot land data to disk the way the Sort stage does, so the data going into the Aggregator occupies system memory. If system memory fills up, you get this kind of message. One solution is to process the data in multiple chunks with multiple Aggregators.
249. What are the different Debug stages available in DataStage PX? Head, tail, row generator, column generator, peek, sample, write range map. 250.
Where we can use the Stages Link Partitioner, Link Collector & IPC? Whether in Server Jobs or in Parallel Jobs? And SMP is a Parallel or Server?
We use Link partitioner and link collector stages in server jobs to speed up processing. Suppose you have a source and target and a transformer in between that does some processing, applying fns etc. You can speed it up by using link partitioner to split the data from source into different links, apply the Business logic and then collect the data back using link collector and pump it into output. IPC stage is also intended to speed up processing.
251.
What is the difference between the Dynamic & Static RDBMS Stages?
252.
What is TX and what is the use of this in DataStage? As I know TX stand for Transformer Extender, but I don't know how it will work and where we will be used?
WebSphere Transformation Extender (WTX) is the universal transformation engine for WebSphere that addresses these complex data challenges to integration. Through its unique ability to speak to and process any data type in its native format, it tackles the "hard, ugly challenges" in integrating systems and information across the enterprise through a codeless, graphical approach to development.
253.
When you are able to handle Null handling and Data type changes in ODBC stages why you need Modify Stage?
The Modify stage is used to change data types: if the source contains a varchar and the target expects an integer, we use the Modify stage to convert it according to the requirement, and we can also make changes to lengths. 254. Explain a specific scenario where we would use range partitioning? If the data volume is large and you cannot process the full data in one pass, you would generally use range partitioning; it partitions the data by ranges of a column's values. Have you ever been in contact with the DS administration people? What are the situations and what problems did you face? Creating projects, maintaining user rights, and killing hung jobs.
255.
256.
257.
Runtime column propagation is used in the case of partial schema usage: when we only know about the columns to be processed and we want all other columns to be propagated to the target as they are, we check the Enable RCP option (in the Administrator, on the output page Columns tab, or on the stage page General tab) and we only need to specify the schema of the columns we are concerned with. According to the documentation, runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job: if RCP is enabled for a project, you can define just the columns you are interested in and ask DataStage to propagate the other columns through the various stages, so those columns can be extracted from the data source and end up on the data target without being explicitly operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema that describes all the columns in the file, and you need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are: Sequential File, File Set, External Source, External Target, Column Import and Column Export.
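For illustration, a schema file for a sequential file uses the Orchestrate record-schema syntax; the sketch below is an assumption-laden example (the column names, types and lengths are made up), not taken from the original answer.

    record
    (
        CustomerID: int32;
        Name:       string[max=50];
        Balance:    decimal[10,2];
    )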
258. What is the use of environmental variables?
Environment variables are the ones which set the environment. Once you set these variables in DataStage you can use them in any job as a parameter. For example, to connect to a database you need a user id, password and schema; these are constant throughout the project, so they are created as environment variables and used wherever needed as #Var#. If the password or schema changes, there is no need to touch all the jobs: change it at the environment-variable level and that takes care of all the jobs. 259. What is the difference between Job Control and Job Sequence?
Job control specially used to control the job, means through this we can pass the parameters, some conditions, some log file information, dashboard information, load recover etc..., Job sequence is used to run the group of jobs based upon some conditions. For final/incremental processing we keep all the jobs in one diff sequence and we run the jobs at a time by giving some triggers.
260. How you will define global variables in DS? The user variable stage allows you to define global variables within a sequence. These variables can be used anywhere in the sequence, The values of the user variables are set by expressions in the stages properties.
261. How to find duplicate records using transformer stage in server edition?
1. Write a SQL query, depending upon the fields. 2. Use a hashed file, which by its nature does not allow duplicates. 3. Attach a Transformer stage to identify and remove duplicates: send the unique rows to one output and direct all input rows to another output (the "rejects"). This approach requires sorted input.
262.
263.
For every job in DataStage, a phantom is generated for the job as well as for every active stage that contributes to the job. These phantoms write logs about the stage/job. If any abnormality occurs, an error message is written; these errors are called phantom errors. The logs are stored in the &PH& folder. Phantoms can be killed through the DataStage Administrator or at the server level.
264.
A hashed file is a passive stage. In the DS server edition, hashed files are used in jobs where lookups are required, and generally in intermediate jobs where data needs to be stored in an intermediate table-like format. All file stages in DataStage (hashed file, sequential file, etc.) are passive stages; a stage that performs some processing is an active stage, e.g. Transformer, Sort, Aggregator.
265.
What are Orchestrate options in generic stage, what are the option names? Value? Name of an Orchestrate operator to call? What are the orchestrate operators available in DataStage for AIX environment?
Orchestrate is the old name for PX and EE. All the internal stages generally called Internal Operators in PX. You can write Custom operators if there is any necessity of enhancing the functionality that is not available in Orchestrate. Its a very complex process and only few people actually do this kind of coding. You can write this code in C++ and add the complied code to DataStage. Please verify the Custom Operators Docs in EE or PX versions. 266. How do you configure api_dump?
267.
How to attach an MTR file (Map Trace) via email and the Map Trace is used to
268. What are the important considerations while using join stage instead of Lookups? 1. If the volume of data is high then we should use Join stage instead of Lookup. 2. If you need to capture mismatches between the two sources, lookups provide easy option.
269.
A phantom process is an orphaned process: sometimes processes keep running on the server even though you kill the parent process, and threads keep running without any source process; these are called phantom processes. The directory called &PH& captures the logs of phantom processes. In the Transformer stage, the order of the output links should be such that dependent links come after independent links; otherwise the job aborts with a phantom error message.
270. How to find out individual counts for a particular table? While implementing the SCD type-2, due to maintaining of history, the primary key constraint will be violated. To have a primary key for the table for the sake of unique key we use surrogate key. 271. How to work with Repeating fields in DataStage? We should create some staging variable & define the value for repeating fields there & then just map these to respective columns. 272.
How do you load partial data after a job fails? The source has 10,000 records and the job failed after 5,000 records were loaded, so its status is aborted. Instead of removing the 5,000 records from the target, how can I resume the load?
There are lots of ways of doing this, but we keep the Extract, Transform and Load processes separate. Generally the load job never fails unless there is a data issue, and all data issues are cleared in the transform step; some database tools also handle this automatically. To do it manually, keep track of the number of records loaded in a hash file or text file and update it as you insert records. If the job fails in the middle, read the number from the file and process only the records after that point, ignoring the earlier record numbers; the @INROWNUM function can be used for better results. 273. How can we increment the surrogate key value for every insert into the target database? We have the Surrogate Key Generator stage in the DataStage enterprise edition, which generates keys incrementally. In the server edition we have to use stage variables or pre-built routines such as KeyMgtGetNextValue.
274.
What is the baseline to implement partition or parallel execution method in DataStage job? E.g. more than 2 million records only advised?
275. A batch is running and it is scheduled to run in 5 minutes. But after 10 days the time changes to 10 minutes. What type of error is this and how to fix it?
276.
How do you delete header and footer on the source sequential file and how do you create header and footer on target sequential file using DataStage?
In the Designer palette, under Development/Debug, we can find the Head and Tail stages; by using these we can drop the header and footer rows. Using UNIX, the header and footer can be deleted with sed, e.g. sed '1d;$d' sourcefile > targetfile (delete the first and last line). 277. What is Invocation ID?
An 'invocation id' is what makes a 'multi-instance' job unique at runtime. With normal jobs, you can only have one instance of it running at any given time. Multi-instance jobs extend that and allow you to have multiple instances of that job running (hence the name). So, you can run multiple 'copies' of the same job as long as the currently running invocation ids are unique.
278.
What is the difference between Sequential Stage & Dataset Stage? When do you use them?
A sequential file can be used as a final target, but a dataset cannot; it can only be used as a temporary/intermediate target. A dataset can serve as an intermediate lookup store (much as a hashed file does in server jobs) and improves performance. A dataset has no size limit, whereas a sequential file has a size limit. A sequential file does not handle null values (we need to supply some value), but a dataset does. Performance is much better with a dataset than with a sequential file because of parallelism.
279.
If there is a fatal error the job aborts; warnings do not abort the job, but we still have to handle them, and the log should ultimately be clear of warnings as well. Many different errors appear in different jobs, e.g. "parameter not found in job", a child job failing for some reason, or a control job failing for some reason. 280. How to connect two stages which do not have any common columns between them? If two stages do not have the same column name, put a Transformer stage in between and map the required columns. If they are sources, you can instead use a Copy stage after one of them and rename the column to match your requirement, then join the two sources. If it is an Oracle database, select user-defined SQL and use a column alias to change the column name. 281. How to develop the SCD using LOOKUP stage?
We can implement an SCD with the Lookup stage, but only SCD type 1, not type 2. Take the source (file or database) and a dataset as the reference link for the lookup, compare the source with the dataset in the Lookup stage, then in a Transformer apply the condition and split the rows to two targets, one for inserts and one for updates, writing the SQL insert and update statements manually.
282.
How can we improve the performance of the job while handling huge amount of data?
1. Decrease the usage of Transformer stages. 2. Use sorted input for the Aggregator. 3. Filter the data in the source itself (if it is an Oracle database, write user-defined SQL). 4. Use the Tunables tab in the Administrator for cache memory and buffer size (only if you know what you are doing, otherwise you may hit problems such as deadlocks). 5. Depending on the requirement use Join, Lookup or Merge (it depends on the data volumes on the input and reference links); use a dataset for the reference link of a lookup. 6. Use appropriate partitioning methods on the different stages.
283.
I am running huge amount of data through a sort stage and it is very slow. Is there any option which can increase the performance with in the sort stage?
If it is an Oracle database you can write a user-defined query to sort and remove duplicates in the source itself, and with suitable key-based partitioning you can improve performance further. Otherwise, use a key-based partitioning technique on the Sort stage, keeping the same partitioning as the previous stage, remove duplicates, and give it a unique partition key.
284.
Hi friends, I know while using lookup stage if the lookup table has huge amount of data we will go for sparse lookup or else will use normal lookup. Would
anybody explain how the memory is allocated for both types of lookups? The lookup type is not decided by the data on the reference link but by the amount of data on the input link: only if your input link has few records should you go for a sparse lookup. There is no memory allocation for a sparse lookup; an SQL WHERE clause is fired directly against the reference-link database to fetch each lookup key value. In a normal lookup, the entire reference-link data is read into memory and the lookup is done in memory, so memory only comes into the picture for normal lookups, not for sparse lookups.
285.
If a record is duplicated 3 times then how to get middle duplicated record? Is it advisable to use basic Transformer in Parallel jobs?
I think we have to use remove duplicate stage here to remove duplicates. First we have to take dataset, after that remove duplicates stage and target, OR source as dataset and transformer (in this give constraint) and target.
286. What is the purpose of Debugging stages? Where do we use them in real time? The main uses of the debugging stages (Row Generator, Peek, Tail, Head, etc.) are to monitor jobs and to generate mock data when we do not have real data to test with. 287.
In DataStage job successfully finished. But showing errors in director window. How do we fix and find the cause for the errors?
In the Designer the job may show green (finished), but in the Director log view you may find fatal errors or warnings: the job completed but the data was not loaded into the target, perhaps because of mismatched data or data type/length problems. Check the log messages to find the cause and fix it.
288.
What is the use of clear status file in DS? How we will use it? Explain with examples?
When you clear a job's status file you reset the status records associated with all stages in that job, so use this option with great care and only when you believe a job has hung or aborted. In particular, before clearing a status file you should: 1) try to reset the job, and 2) ensure that all of the job's processes have ended.
289.
If you have a huge volume of data to be referenced, which stage will you use? Join or Lookup stage? Why?
For a huge volume of reference data we should use Join. The Lookup stage requires all the reference data to be loaded into memory before it can start matching it against the input, so it is advisable to use the Join stage rather than the Lookup stage. If you want to capture reject records, you can use a Transformer (or Filter) after the Join stage and capture the null values, or use a Merge stage instead of the Join stage to capture the rejects.
290.
How can I load a flat file into the target as fast as possible, assuming there is no bottleneck on the source side, i.e. no performance issues in the source?
291.
At the time of incremental loading I need to use a reusable file path. In this situation, how do I assign the path of the file to PROJDEF using job parameters?
292.
How do you write a UNIX shell script to run 2 jobs? How do you schedule the 2 jobs using UNIX crontab to run at a particular time?
Use crontab -e to edit your crontab file, or create one if it doesn't already exist, and enter the following line into the file:
30 08 10 06 * /home/nikhilanshuman/runjob.sh
The schedule fields mean:
30 - 30th minute
08 - 08 AM
10 - 10th day of the month
06 - 6th month (June)
* - every day of the week
/home/nikhilanshuman/runjob.sh is the path where the script was saved.
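The script referenced in the crontab line could itself use the dsjob client to run the two jobs one after the other. A minimal sketch of what runjob.sh could look like, assuming dsjob is available and using hypothetical project and job names (MyProject, Job1, Job2):
#!/bin/sh
# Minimal sketch of runjob.sh: run two DataStage jobs one after the other.
# Assumes the dsjob client is available, e.g. after sourcing $DSHOME/dsenv,
# and uses hypothetical project/job names.
. $DSHOME/dsenv

# -jobstatus makes dsjob wait for the job to finish and return its status.
dsjob -run -jobstatus MyProject Job1
status1=$?
echo "Job1 finished with status $status1"

dsjob -run -jobstatus MyProject Job2
status2=$?
echo "Job2 finished with status $status2"
Remember to make the script executable (chmod +x) and to use absolute paths inside it, since cron runs with a minimal environment.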
293.
1. Is the hash file active or passive? What if we take it as a source?
2. Can you take a sequential file as a lookup?
3. A dynamic hash file (type 30) has two types: 1) generic 2) specific. What do they mean?
4. How do you connect the Merge stage when the source is two tables?
5. What is the purpose of Merge?
6. How can a DataStage job be scheduled in UNIX?
7. How do you know how many rows were rejected?
8. What is the use of the Universe stage?
9. What is a sequential file buffer?
10. What is the difference between a DS routine, a DS transform and a DS function in the Transformer stage?
11. If one table contains 100 rows and another contains 1000, which one do we take as the lookup?
12. If a file has 100 records and the job aborts after 70 records, how do we load the remaining records?
13. Can shared containers be used to bring server jobs into parallel jobs?
14. Is it possible to extract an Excel sheet using the Sequential File stage?
15. How do you decide when to go for Join or Lookup?
16. What is the difference between hash, range and entire partitioning?
17. Which partitioning key do you select in the Lookup stage in parallel jobs?
18. What is a routine?
19. Where is the hash file stored if I take a hash file as the target?
20. What are system variables and where do we use them? What are the system variables in your project?
21. What are plug-ins?
22. What is the purpose of a factless fact table?
23. If the sources are OCI and flat files, why do you need ODBC in your project rather than OCI?
24. Can we take a sequential file as the source to a hash file? Does it work, and if not, what error does it give?
25. Can the Aggregator and Transformer stages be used for sorting? How?
26. How do you automate jobs?
27. How can we join one Oracle source and flat files?
28. If there are one million records, do we use OCI or some other stage?
29. What is the purpose of the filter option in the Sequential File stage in server jobs?
30. How do you filter data using a sequential file?
31. What is the difference between clearing the log file and clearing the status file?
32. I want to process 3 files sequentially at a time; how can we do that in a job sequence?
33. What is version control in DataStage?
34. How do you release a job?
35. What is the difference between parallel jobs and parallel shared containers?
36. How do you set the scheduling time?
37. In a job sequence I have 5 jobs but want to run only jobs 1 and 4 now and the remaining later; how can you do that?
38. What is the difference between a lookup file set and a file set?
39. What is data migration?
40. I have 1000 records; how do I load only the last 100 records?
41. How do you do cleansing in your DataStage job?
42. Where do we write routines?
43. How do you execute a job sequence?
44. I am using an Oracle database but not the OCI stage; without OCI or ODBC, how can I extract the data?
45. I have 4 jobs; can they all run at a time? Is it possible, and what is the status?
46. Why would a job be aborted? Why would it be validating?
47. What is an environment variable? Where is it stored?
48. How do you connect to a database source, i.e. specify the path?
49. What is the difference between the server sequence stage and the parallel sequence stage?
50. What is the difference between a dataset and a sequential file in parallel jobs?
51. What is the difference between the Change Apply and Change Capture stages?
52. What are the differences between the Copy, Difference and Transformer stages?
53. What are the differences between Funnel, Filter and External Filter in parallel jobs?
54. What are the differences between a file set, a dataset, a lookup file set and a sequential file?
55. What is the difference between a hash file and a sequential file in server jobs?
56. What is the difference between static and dynamic hash file types?
57. What is the difference between server sequential files and parallel sequential files?
58. What are build-up and audit stages? What is their purpose?
59. How do you find multiple jobs' process IDs without using Cleanup Resources in DataStage Director?
60. Where do we find the process IDs of the last 5 job executions?
61. How many format types are available to export data using DataStage Manager?
62. Can you convert server jobs into parallel jobs?
63. How do you set the configuration file nodes if there are a million records?
64. I have two types of metadata in a sequential file; is it possible to load the target?
65. What is the difference between a hash file and a dataset?
66. I have two files: one contains 100,000 records and the other contains 1000 records. How will we set the nodes?
In Oracle Enterprise edition, when do we select Upsert mode?
What is the difference between a server sequential file and a parallel sequential file?
What is the difference between the server Transformer stage and the parallel Transformer stage?
In the OCI stage, when do we go for insert then update, or update then insert?
What is the disadvantage of the Surrogate Key stage?
294.
If the data is not processed evenly across the nodes, the Surrogate Key stage can generate holes in the sequence (it simply skips values). For example: aa-1, bb-2, ba-.., bg-.., cc-3, dd-4, ddd-5; here ba and bg have no sequence key because of the holes.
295.
What is the difference between change capture and change apply stages?
The Change Capture stage is used to get the difference between two sources, i.e. the before dataset and the after dataset. The source used as the reference for capturing the changes is called the after dataset, and the source in which we are looking for changes is called the before dataset. Change Capture adds a field called "change_code" to its output; from this change code one can recognize what kind of change each record represents: delete, insert or update. The Change Apply stage is used along with Change Capture: it takes the change code from the Change Capture output and applies the changes to the before dataset based on that code. In short, Change Capture captures the changes between the two sources and Change Apply applies those changes to produce the output file.
296.
I am eager to know about DataStage Engine commands. Can you tell me something about them?
DataStage Engine commands are otherwise known as Universe commands. They are used to retrieve the details that are stored in the DataStage repository in the form of files and tables. We can also access hash files and the VOC contents using these commands (e.g. EVERY and LISTL). The following commands are used to start and stop the DS Engine:
$DSHOME/bin/uv -admin -start
$DSHOME/bin/uv -admin -stop
297.
Using a server job, how do you transform data in an XML file into a sequential file? I have used XML Input, XML Transformer and a Sequential File stage.
Import the XSD for the XML file via Manager -> metadata importer and store it. If no transformation functions are required on the data in the XML file, there is no need for a Transformer. To load data directly from an XML file to a sequential file, use these stages in the design: FOLDER -> XML INPUT -> SEQUENTIAL FILE (if transformation is required, you can use a Transformer stage between the XML Input stage and the Sequential File stage). In the Folder stage properties, assign the file path (it can be given as a wildcard); the whole data in the file is taken as a single row and streamed through the Folder stage, and the output metadata will be 'File' (the filename, a varchar key) and 'Data' (the single row of data, long varchar 9999). In the XML Input stage, under Source, select 'File' as the source column with the content set to file path (or select 'Data' with XML document content accordingly). On the output link, import the column metadata from the stored table definition and map it directly to the sequential file (or to the Transformer and then the sequential file).
298.
Suppose we have 3 jobs in a sequencer and job 1 fails while running, but we still have to run job 2 and job 3. How can we do that?
To run a job even if its previous job in the sequence failed, go to the Triggers tab of that particular job activity in the sequence. There you will find three fields. Name: the name of the link going to the next job (e.g. for job activity 1 the link name is the link going to job activity 2). Expression Type: allows you to trigger the next job activity based on the status you want; for example, if job 1 fails and you want to run job 2 and job 3, go to the trigger properties of job 1 and select the expression type "Failed - (Conditional)". This way job 2 runs even if job 1 aborted. There are many other options available. Expression: editable only for some expression types; for the "Failed" type you cannot change this field.
299.
How to fetch the last row from a particular column in a sequential file?
300.
Double-click the Transformer stage and open the stage properties (the first icon in the header line). Go to Inputs -> Partitioning and select a partitioning technique other than Auto. Enable Perform Sort, then enable Unique, and select the required column name. The output will then contain unique values, so the duplicates are removed. 301. What is the use of combinability mode?
302.
Wrapped stage: enables you to run an existing sequential program in parallel.
Build stage: enables you to write a C expression that is automatically generated into a parallel custom stage.
Custom stage: provides a complete C++ API for developing complex and extensible stages.
303.
If we define the values of variables in the DSParams file, is there no need to give the values at job level or project level? And how do we configure this file at job level so that we need not hard-code the values?
304.
305.
306.
307.
308.
What is Orchestrate Schema? Distinguish internal data type (Orchestrate schema) vs. external data type?
309.
What are these terms used for in DataStage - Downstream and Upstream?
Things to Learn:
Manager:
Routines:
1. ExecSH
2. DSWaitForFile
3. DSSendMail
SDK date routines:
1. Convert date/timestamp to OraOCI format
2. ODBC format
3. Date current GMT time
User defined:
1. Find job
2. Any date to YYYY-MM-DD format (uses the date functions Iconv and Oconv)
3. GetNextKeyValue: this routine is used in server jobs to read the maximum surrogate key value of a dimension from the hash file (the maximum key value is loaded into the hash file by a job at the start of a sequence). It is used when inserting a new record into the dimension table.
Transforms:
1. Dates
2. Strings
3. Numeric
Functions:
1. Date functions: current timestamp, null handling, strings (concat, trim, left, right), numeric
312.
313.