Datastage
Pipeline Parallelism
1. Extract, Transform and Load processes execute simultaneously.
2. The downstream process starts while the upstream process is running, like a conveyor belt moving rows from process to process.
3. Advantages: reduces disk usage for staging areas and keeps processors busy.
4. Still has limits on scalability.
Partition Parallelism
1. Divide the incoming stream of data into subsets known as partitions, to be processed separately.
2. Each partition is processed in the same way.
3. Facilitates near-linear scalability. However, the data needs to be evenly distributed across the partitions; otherwise the benefits of partitioning are reduced.
Within parallel jobs, pipelining, partitioning and repartitioning are automatic. The job developer only identifies:
1. Sequential or Parallel mode (by stage)
2. Partitioning method
3. Collection method
4. Configuration file
Configuration File
One of the great strengths of the WebSphere DataStage Enterprise Edition is that, when designing parallel jobs, you don't have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don't necessarily have to change your job design. WebSphere DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.
The WebSphere DataStage Designer provides a configuration file editor to help you define configuration files for the parallel engine. To use the editor, choose Tools > Configurations; the Configurations dialog box appears. You specify which configuration will be used by setting the $APT_CONFIG_FILE environment variable. This is set on installation to point to the default configuration file, but you can set it on a project-wide level from the WebSphere DataStage Administrator or for individual jobs from the Job Properties dialog.
Configuration files are text files containing string data. The general form of a configuration file is as follows:
{ node "n1" { fastname "s1" pool "" "n1" "s1" "app2" "sort" resource disk "/orch/n1/d1" {} resource disk "/orch/n1/d2" {"bigdata"} resource scratchdisk "/temp" {"sort"} } } Node names Each node you define is followed by its name enclosed in quotation marks, for example: node "orch0" For a single CPU node or workstation, the nodes name is typically the network name of a processing node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to learn a nodes network name:
$ uname -n
On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you replace the network name with a logical node name. In this case, you need a fast name for each logical node. If you run an application from a node that is undefined in the corresponding configuration file, each user must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node invoking the parallel job.

Fastname
Syntax: fastname "name"
This option takes as its quoted attribute the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fastname is the physical node name that stages use to open connections for high-volume data transfers. The attribute of this option is often the network name. For an SMP, all CPUs share a single connection to the network, and this setting is the same for all parallel engine processing nodes defined for an SMP. Typically, this is the principal node name, as returned by the UNIX command uname -n.

Node pools and the default node pool
Node pools allow association of processing nodes based on their characteristics. For example, certain nodes can have large amounts of physical memory, and you can designate them as compute nodes. Others can connect directly to a mainframe or some form of high-speed I/O; these nodes can be grouped into an I/O node pool.
The option pools is followed by the quoted names of the node pools to which the node belongs. A node can be assigned to multiple pools, as in the following example, where node1 is assigned to the default pool ("") as well as the pools node1, node1_css, and pool4.
node "node1" {
  fastname "node1_css"
  pools "" "node1" "node1_css" "pool4"
  resource disk "/orch/s0" {}
  resource scratchdisk "/scratch" {}
}
A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list.
Once you have defined a node pool, you can constrain a parallel stage or parallel job to run only on that pool, that is, only on the processing nodes belonging to it. If you constrain both a stage and a job, the stage runs only on the nodes that appear in both pools. Nodes or resources that name a pool declare their membership in that pool.
We suggest that when you initially configure your system you place all nodes in pools that are named after the node's name and fast name. Additionally, include the default node pool in this pool, as in the following example:
node "n1" {
  fastname "nfast"
  pools "" "n1" "nfast"
}
By default, the parallel engine executes a parallel stage on all nodes defined in the default node pool. You can constrain the processing nodes used by the parallel engine either by removing node descriptions from the configuration file or by constraining a job or stage to a particular node pool.

Disk and scratch disk pools and their defaults
When you define a processing node, you can specify the options resource disk and resource scratchdisk. They indicate the directories of file systems available to the node. You can also group disks and scratch disks in pools. Pools reserve storage for a particular use, such as holding very large data sets. Pools defined by disk and scratchdisk are not combined; therefore, two pools that have the same name and belong to both resource disk and resource scratchdisk define two separate pools.
A disk that does not specify a pool is assigned to the default pool. The default pool may also be identified by "" and { } (the empty pool list). For example, the following code configures the disks for node1:
node "n1" {
  resource disk "/orch/s0" {pools "" "pool1"}
  resource disk "/orch/s1" {pools "" "pool1"}
  resource disk "/orch/s2" { } /* empty pool list */
  resource disk "/orch/s3" {pools "pool2"}
  resource scratchdisk "/scratch" {pools "" "scratch_pool1"}
}
In this example:
1. The first two disks are assigned to the default pool.
2. The first two disks are assigned to pool1.
3. The third disk is also assigned to the default pool, indicated by { }.
4. The fourth disk is assigned to pool2 and is not assigned to the default pool.
5. The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.

Buffer scratch disk pools
Under certain circumstances, the parallel engine uses both memory and disk storage to buffer virtual data set records. The amount of memory defaults to 3 MB per buffer per processing node. The amount of disk space for each processing node defaults to the amount of available disk space specified in the default scratchdisk setting for the node. The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used for temporary storage other than buffering.
Here is an example configuration file that defines a buffer scratch disk pool:
{
  node "node1" {
    fastname "node1_css"
    pools "" "node1" "node1_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
  node "node2" {
    fastname "node2_css"
    pools "" "node2" "node2_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
}
In this example, each processing node has a single scratch disk resource in the buffer pool, so buffering will use /scratch0 but not /scratch1. However, if /scratch0 were not in the buffer pool, both /scratch0 and /scratch1 would be used because both would then be in the default pool.
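To tie the pieces together, here is a minimal sketch of a complete configuration file for a two-node logical layout on a single SMP host, together with the environment variable that points the engine at it. The host name, directories and file path below are placeholders for illustration, not values taken from this document:

{
  node "node1" {
    fastname "devhost"
    pools ""
    resource disk "/data/ds/node1" {}
    resource scratchdisk "/scratch/node1" {}
  }
  node "node2" {
    fastname "devhost"
    pools ""
    resource disk "/data/ds/node2" {}
    resource scratchdisk "/scratch/node2" {}
  }
}

$ export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/2node.apt

In practice $APT_CONFIG_FILE is usually set at the project level in Administrator or per job in the Job Properties dialog, as described above; the export is simply the command-line equivalent.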
Partitioning
The aim of most partitioning operations is to end up with a set of partitions that are as near equal size as possible, ensuring an even load across your processors. When performing some operations, however, you will need to take control of partitioning to ensure that you get consistent results. A good example of this would be where you are using an Aggregator stage to summarize your data. To get the answers you want (and need), you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition.
Random partitioner
Records are randomly distributed across all processing nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition. The random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.
Entire partitioner
Every instance of a stage on every processing node receives the complete data set as input. It is useful when you want the benefits of parallel execution, but every instance of the operator needs access to the entire input data set. You are most likely to use this partitioning method with stages that create lookup tables from their input.
Same partitioner
The stage using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method. This is normally the method WebSphere DataStage uses when passing data between stages in your job.
Hash partitioner
Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record. The hash partitioner examines one or more fields of each input record (the hash key fields). Records with the same values for all hash key fields are assigned to the same processing node. This method is useful for ensuring that related records are in the same partition, which may be a prerequisite for a processing operation. For example, for a remove duplicates operation, you can hash partition records so that records with the same partitioning key values are on the same node. You can then sort the records on each node using the hash key fields as sorting key fields, then remove duplicates, again using the same keys. Although the data is distributed across partitions, the hash partitioner ensures that records with identical keys are in the same partition, allowing duplicates to be found. Hash partitioning does not necessarily result in an even distribution of data between partitions. For example, if you hash partition a data set based on a zip code field, where a large percentage of your records are from one or two zip codes, you can end up with a few partitions containing most of your records. This behavior can lead to bottlenecks because some nodes are required to process more records than other nodes.
Modulus partitioner
Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation. In data mining, data is often arranged in buckets, that is, each record has a tag containing its bucket number. You can use the modulus partitioner to partition the records according to this number. The modulus partitioner assigns each record of an input data set to a partition of its output data set as determined by a specified key field in the input data set. This field can be the tag field. The partition number of each record is calculated as follows:
partition_number = fieldname mod number_of_partitions
where: fieldname is a numeric field of the input data set and number_of_partitions is the number of processing nodes on which the partitioner executes. If a partitioner is executed on three processing nodes it has three partitions.
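As a quick arithmetic illustration (the bucket values here are invented), a record whose tag field holds 27, running on three processing nodes, lands in partition 0, while a tag of 28 lands in partition 1:

$ echo $(( 27 % 3 ))    # partition_number for fieldname = 27, 3 partitions
0
$ echo $(( 28 % 3 ))    # partition_number for fieldname = 28
1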
Range partitioner
Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range. This method is also useful for ensuring that related records are in the same partition. A range partitioner divides a data set into approximately equal size partitions based on one or more partitioning keys. In order to use a range partitioner, you have to make a range map. You can do this using the Write Range Map stage. The range partitioner guarantees that all records with the same partitioning key values are assigned to the same partition and that the partitions are approximately equal in size so all nodes perform an equal amount of work when processing the data set. Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The random and round robin partitioning methods also guarantee that the partitions of a data set are equivalent in size. However,
these partitioning methods are keyless; that is, they do not allow you to control how records of a data set are grouped together within a partition.
DB2 partitioner
Partitions an input data set in the same way that DB2 would partition it. For example, if you use this method to partition an input data set containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node. Any reads and writes of the DB2 table would entail no network activity.
Auto partitioner
The most common method you will see on the WebSphere DataStage stages is Auto. This just means that you are leaving it to WebSphere DataStage to determine the best partitioning method to use depending on the type of stage, and what the previous stage in the job has done. Typically WebSphere DataStage would use round robin when initially partitioning data, and same for the intermediate stages of a job.
Collecting
Collecting is the process of joining the multiple partitions of a single data set back together again into a single partition. There may be a stage in your job that you want to run sequentially rather than in parallel, in which case you will need to collect all your partitioned data at this stage to make sure it is operating on the whole data set. Note that collecting methods are mostly non-deterministic. That is, if you run the same job twice with the same data, you are unlikely to get data collected in the same order each time. If order matters, you need to use the sorted merge collection method.
Ordered collector
Reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the order of totally sorted input data sets. In a totally sorted data set, both the records in each partition and the partitions themselves are ordered. This may be useful as a preprocessing action before exporting a sorted data set to a single data file.
Auto collector
The most common method you will see on the parallel stages is Auto. This normally means that WebSphere DataStage will eagerly read any row from any input partition as it becomes available, but if it detects that, for example, the data needs sorting as it is collected, it will do that. This is the fastest collecting method.
The DataStage Parallel Framework implements a producer-consumer data flow model. Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets).
The partitioning method is associated with the producer; the collector method is associated with the consumer. eCollectAny is specified for parallel consumers, although no collection occurs.
The producer and consumer are separated by the following indicators:
-> Sequential to Sequential
<> Sequential to Parallel
=> Parallel to Parallel (SAME)
#> Parallel to Parallel (not SAME)
>> Parallel to Sequential
> No producer or no consumer
The notation may also include [pp] when the Preserve Partitioning flag is set.
At runtime, the DataStage Parallel Framework can only combine stages (operators) that:
1. Use the same partitioning method. Repartitioning prevents operator combination between the corresponding producer and consumer stages. Implicit repartitioning (e.g. sequential operators, node maps) also prevents combination.
2. Are combinable. This is set automatically within the stage/operator definition, or set within DataStage Designer in the Advanced stage properties.
The Lookup stage is a composite operator. Internally it contains more than one component, but to the user it appears to be one stage:
1. LUTCreateImpl - reads the reference data into memory
2. LUTProcessImpl - performs the actual lookup processing once the reference data has been loaded
At runtime, each internal component is assigned to operators independently.
Job Compilation
1. Operators. These underlie the stages in a WebSphere DataStage job. A single stage may correspond to a single operator, or a number of operators, depending on the properties you have set, and whether you have chosen to partition, collect or sort data on the input link to a stage. At compilation, WebSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
2. OSH. This is the scripting language used internally by the WebSphere DataStage parallel engine.
3. Players. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).

The DataStage Designer client generates all code. It validates link requirements, mandatory stage options, transformer logic, etc., and then:
1. Generates the OSH representation of the job data flow and stages. GUI stages are representations of Framework operators. Stages in parallel shared containers are statically inserted in the job flow. Each server shared container becomes a dsjobsh operator.
2. Generates transform code for each parallel Transformer. This is compiled on the DataStage server into C++ and then into the corresponding native operators. To improve compilation times, previously compiled Transformers that have not been modified are not recompiled. Force Compile recompiles all Transformers (use after client upgrades).
3. Buildop stages must be compiled manually, within the GUI or using the buildop UNIX command line.
Viewing of generated OSH is enabled in DataStage Administrator. OSH is visible in:
1. Job Properties
2. Job run log
3. View Data
4. Table Definitions
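For orientation only, the sketch below shows the general shape of an OSH data flow: operators joined by pipes, each with its options, reading and writing records described by a Framework schema. This is not the generated output of any particular job; the operator names (generator, peek) are standard parallel operators, but the exact option spellings are recalled from memory and should be treated as assumptions rather than a reference.

# Hand-written illustrative flow: generate ten rows with a single int32
# column and print them to the job log (option names are assumptions)
$ osh "generator -schema record(a:int32;) -records 10 | peek"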
Terminology
Framework term              DataStage term
schema                      table definition
property                    format
type                        SQL type + length [and scale]
virtual dataset             link
record/field                row/column
operator                    stage
step, flow, OSH command     job
Framework                   DS engine

The GUI uses both terminologies. Log messages (info, warnings, errors) use the Framework terms.
Runtime Architecture
The generated OSH and the configuration file are used to compose a job SCORE, similar to the way an RDBMS builds a query optimization plan. The score:
1. Identifies the degree of parallelism and node assignment for each operator
2. Inserts sorts and partitioners as needed to ensure correct results
3. Defines the connection topology (datasets) between adjacent operators
4. Inserts buffer operators to prevent deadlocks (e.g. fork-joins)
5. Defines the number of actual UNIX processes. Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements.
6. The job SCORE is used to fork UNIX processes with communication interconnects for data, messages, and control. Setting $APT_PM_SHOW_PIDS shows the UNIX process IDs in the DataStage log.
It is only after these steps that processing begins. This is the startup overhead of an Enterprise Edition job.
Job processing ends when:
1. The last row (end of data) is processed by the final operator in the flow, or
2. A fatal error is encountered by any operator, or
3. The job is halted (SIGINT) by DataStage Job Control or human intervention (e.g. DataStage Director STOP).
Conductor - the initial Framework process (the score composer)
- Creates Section Leader processes (one per node)
- Consolidates messages to the DataStage log
- Manages orderly shutdown
Section Leader (one per node)
- Forks Player processes (one per stage)
- Manages up/down communication
Players - the actual processes associated with stages
- Combined players: one process only
- Send stderr and stdout to the Section Leader
- Establish connections to other players for data flow
- Clean up upon completion
Default communication:
- SMP: shared memory
- MPP: shared memory (within a hardware node) and TCP (across hardware nodes)
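A quick way to see this startup work in the job log is to turn on score and PID reporting before running the job. A minimal sketch follows; the installation path is an assumption, and in practice these variables are usually set as project or job level environment variables in Administrator rather than exported in a shell:

$ export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default.apt   # hypothetical path
$ export APT_DUMP_SCORE=True      # write the composed job score to the job log
$ export APT_PM_SHOW_PIDS=True    # show UNIX process IDs in the DataStage log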
Introduction
What is IBM WebSphere DataStage?
1. Design jobs for ETL
2. Ideal tool for data integration projects
3. Import, export, create and manage metadata for use within jobs
4. Schedule, run and monitor jobs, all within DataStage
5. Administer your DataStage development and execution environments
6. Create batch (controlling) jobs

What are the components/applications in the IBM Information Server Suite?
1. DataStage
2. QualityStage
3. Metadata Server, consisting of Metadata Access Services and Metadata Analysis Services
4. Repository, which is DB2 by default
5. Business Glossary
6. Federation Server
7. Information Services Director
8. Information Analyzer
9. Information Server console

Explain the DataStage Architecture?
The DataStage client components are:
Administrator - administers DataStage projects and conducts housekeeping on the server
Designer - creates DataStage jobs that are compiled into executable programs
Director - used to run and monitor DataStage jobs
The Repository is used to store DataStage objects. The Repository, which is DB2 by default, is shared by other applications in the Suite.

What are the uses of DataStage Administrator?
The Administrator is used to add and delete projects, and to set project properties. The Administrator also provides a command line interface to the DataStage repository.
Use the Administrator Project Properties window to:
1. Enable job administration in Director, enable runtime column propagation, set auto-purging options, protect the project and set environment variables on the General tab
2. Set user and group privileges on the Permissions tab
3. Enable or disable server-side tracing on the Tracing tab
4. Specify a username and password for scheduling jobs on the Schedule tab
5. Specify parallel job defaults on the Parallel tab
6. Specify job sequencer defaults on the Sequencer tab

Explain the DataStage development workflow?
1. Define project properties - Administrator
2. Open (attach to) your project
3. Import metadata that defines the format of data stores your jobs will read from or write to
4. Design the job - Designer
5. Compile and debug the job - Designer
6. Run and monitor the job - Director

What is the DataStage project repository?
All your work is stored in a DataStage project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata on your server.
Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them. Multiple users can work in the same project at the same time. However, DataStage will prevent multiple users from editing the same DataStage object (job, table definition, etc.) at the same time.

What are the different types of DataStage jobs?
Parallel jobs:
1. Executed by the DataStage parallel engine
2. Built-in functionality for pipeline and partition parallelism
3. Compiled into OSH (Orchestrate Scripting Language)
4. OSH executes operators (executable C++ class instances)
Server jobs:
1. Executed by the DataStage server engine
2. Compiled into BASIC
Job sequencers:
1. Master (controlling) jobs that kick off jobs and other activities
2. Can kick off server or parallel jobs
3. Executed by the DataStage server engine

What are the design elements of parallel jobs?
Stages - implemented as OSH operators
Passive stages (E and L of ETL) - read/write data, e.g. Sequential File, DB2, Oracle, Peek stages
Active stages (T of ETL) - transform/filter/aggregate/generate/split/merge data, e.g. Transformer, Aggregator, Join, Sort stages
Links - pipes through which the data moves from stage to stage

What are the different types of parallelism?
Pipeline parallelism:
1. Transform, clean and load processes execute simultaneously
2. The downstream process starts while the upstream process is running
3. Reduces disk usage for staging areas
4. Keeps processors busy
5. Still has limits on scalability
Partition parallelism:
1. Divide the incoming stream of data into subsets (partitions) to be processed by the same operator
2. The operation is performed on each partition of data separately and in parallel
3. Facilitates near-linear scalability, provided the data is evenly distributed
4. If the data is evenly distributed, the data will be processed n times faster on n nodes.
Suite component roles:
a. DataStage Administrator - full permission to work in DataStage Administrator, Designer and Director
b. DataStage user - permissions are assigned within DataStage: Developer, Operator, Super Operator and Production Manager. A DataStage user cannot delete projects and cannot set permissions.
A user ID that is assigned Suite roles can immediately log onto the Information Server console. What about a user ID that is assigned a DataStage Suite component role? If the user ID is assigned the DataStage Administrator role, then the user will immediately acquire DataStage Administrator permission for all projects. If the user ID is assigned the DataStage user role, one more step is required: a DataStage administrator must assign a corresponding role to that user ID on the Permissions tab. When Suite users or groups have been assigned the DataStage Administrator role they automatically appear on the Permissions tab. Suite users or groups that have a DataStage User role need to be manually added.

Explain DataStage credential mapping?
All the Suite users without their own DataStage credentials will be mapped to this user ID and password. Here the username and password are demohawk/demohawk. demohawk is assumed to be a valid user on the DataStage server machine and to have file permissions on the DataStage engine and project directories. Suite users can also be mapped individually to specific users. Note that demohawk need not be a Suite administrator or user.

What information is required to log in to DataStage Administrator?
Domain - host name and port number of the application server. Recall that multiple DataStage servers can exist in a domain, although they must be on different machines.
DataStage server - the server that has the DataStage projects you want to administer.

Explain the DataStage roles?
1. DataStage Developer - full access to all areas of a DataStage project
2. DataStage Operator - run and manage released DataStage jobs
3. DataStage Super Operator - can open the Designer and view the repository in read-only mode
4. DataStage Production Manager - create and manipulate protected projects

DataStage Designer

Explain import and export and their corresponding procedures?
1. Backing up jobs and projects
2. Maintaining different versions of a job or project
3. Moving DataStage objects from one project to another
4. Sharing jobs and projects between developers
Export --> DataStage components
By default, objects are exported to a text file in a specific format; the default extension is dsx. Alternatively, you can export the objects to an XML document. The directory you export to is on the DataStage client, not the server. Objects can also be exported from the list of found objects using the search functionality.

Import --> DataStage components
Use Import all to begin the import process. Use Import selected to import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning. For large imports you may want to disable "Perform impact analysis"; this adds overhead to the import process.

Import --> Table Definitions
A table definition describes the columns and format of files and tables. Table definitions can be imported for:
1. Sequential files
2. Relational tables
3. COBOL files
4. XML
5. ODBC data sources, etc.
Table definitions can be loaded into job stages that access data with the same format. In this sense the metadata is reusable.

Creating Parallel Jobs

What is a parallel job?
A parallel job is an executable DataStage program created in DataStage Designer using components from the repository. It compiles into Orchestrate Script Language (OSH) and object code (from generated C++).
DataStage jobs are:
1. Designed and built in Designer
2. Scheduled, invoked and monitored in Director
3. Executed under the control of DataStage
Use the import process in Designer to import metadata defining sources and targets.

What are the benefits of renaming links and stages?
1. Documentation
2. Clarity
3. Fewer development errors

Explain the Row Generator stage?
1. Produces mock data
2. No input link; single output link
3. On the Properties tab, specify the number of rows
4. On the Columns tab, load or specify column definitions
You have a cluster of nodes available to run DataStage jobs. The network configuration between the servers is a private network with a 1 GB connection between each node. The public name is on a 100 MB network, which is what each hostname is identified with. In order to use the private network for communications between each node you need to use an alias for each node in the cluster. The Information Server Engine node (conductor node) is where the DataStage job starts. Which environment variable must be used to identify the hostname for the Engine node?
A. $APT_SERVER_ENGINE
B. $APT_ENGINE_NODE
C. $APT_PM_CONDUCTOR_HOSTNAME
D. $APT_PM_NETWORK_NAME
Answer: C

Which three privileges must the user possess when running a parallel job? (Choose three.)
A. read access to APT_ORCHHOME
B. execute permissions on local copies of programs and scripts
C. read/write permissions to the UNIX /etc directory
D. read/write permissions to APT_ORCHHOME
E. read/write access to disk and scratch disk resources
Answer: A,B,E

Which two tasks will create DataStage projects? (Choose two.)
A. Export and import a DataStage project from DataStage Manager.
B. Add new projects from DataStage Administrator.
C. Install the DataStage engine.
D. Copy a project in DataStage Administrator.
Answer: B,C

Which three defaults are set in DataStage Administrator? (Choose three.)
A. default prompting options, such as Autosave job before compile
B. default SMTP mail server name
C. project level default for Runtime Column Propagation
D. project level defaults for environment variables
E. project level default for Auto-purge of job log entries
Answer: C,D,E

Which two must be specified to manage Runtime Column Propagation? (Choose two.)
A. enabled in DataStage Administrator
B. attached to a table definition in DataStage Manager
C. enabled at the stage level
D. enabled with environmental parameters set at runtime
Answer: A,C

You are reading customer data using a Sequential File stage and transforming it using the Transformer stage. The Transformer is used to cleanse the data by trimming spaces from character fields in the input. The cleansed data is to be written to a target DB2 table. Which partitioning method would yield optimal performance without violating the business requirements?
A. Hash on the customer ID field
B. Round Robin
C. Random
D. Entire
Answer: B

A job contains a Sort stage that sorts a large volume of data across a cluster of servers. The customer has requested that this sorting be done on a subset of servers identified in the configuration file to minimize impact on database nodes. Which two steps will accomplish this? (Choose two.)
A. Create a sort scratch disk pool with a subset of nodes in the parallel configuration file.
B. Set the execution mode of the Sort stage to sequential.
C. Specify the appropriate node constraint within the Sort stage.
D. Define a non-default node pool with a subset of nodes in the parallel configuration file.
Answer: C,D

You have a compiled job and parallel configuration file. Which three methods can be used to determine the number of nodes actually used to run the job in parallel? (Choose three.)
A. within DataStage Designer, generate report and retain intermediate XML
B. within DataStage Designer, show performance statistics
C. within DataStage Director, examine log entry for parallel configuration file
D. within DataStage Director, examine log entry for parallel job score
E. within DataStage Director, open a new DataStage Job Monitor
Answer: C,D,E

Which environment variable, when set to true, causes a report to be produced which shows the operators, processes and data sets in the job?
A. APT_DUMP_SCORE
B. APT_JOB_REPORT
C. APT_MONITOR_SIZE
D. APT_RECORD_COUNTS
Answer: A

A job reads from a dataset using a DataSet stage. This data goes to a Transformer stage and then is written to a sequential file using a Sequential File stage. The default configuration file has 3 nodes. The job creating the dataset and the current job both use the default configuration file. How many instances of the Transformer run in parallel?
A. 3
B. 1
C. 7
D. 9
Answer: A

Your job reads from a file using a Sequential File stage running sequentially. The DataStage server is running on a single SMP system. One of the columns contains a product ID. In a Lookup stage following the Sequential File stage, you decide to look up the product description from a reference table. Which two partition settings would correctly find matching product descriptions? (Choose two.)
A. Hash algorithm, specifying the product ID field as the key, on both the link coming from the Sequential File stage and the link coming from the reference table.
B. Round Robin on both the link coming from the Sequential File stage and the link coming from the reference table.
C. Round Robin on the link coming from the Sequential File stage and Entire on the link coming from the reference table.
D. Entire on the link coming from the Sequential File stage and Hash, specifying the product ID field as the key, on the link coming from the reference table.
Answer: A,C
A job design consists of an input fileset followed by a Peek stage, followed by a Filter stage, followed by an output fileset. The environment variable APT_DISABLE_COMBINATION is set to true, and the job executes on an SMP using a configuration file with 8 nodes defined. Assume also that the input dataset was created with the same 8 node configuration file. Approximately how many data processing processes will this job create?
A. 32
B. 8
C. 16
D. 1
Answer: A

Which two statements are true of the column data types used in Orchestrate schemas? (Choose two.)
A. Orchestrate schema column data types are the same as those used in DataStage stages.
B. Examples of Orchestrate schema column data types are varchar and integer.
C. Examples of Orchestrate schema column data types are int32 and string[max=30].
D. OSH import operators are needed to convert data read from sequential files into schema types.
Answer: C,D

You have set the "Preserve Partitioning" flag for a Sort stage to request that the next stage preserves whatever partitioning it has implemented. Which statement describes what will happen next?
A. The job will compile but will abort when run.
B. The job will not compile.
C. The next stage can ignore this request but a warning is logged when the job is run depending on the stage type that ignores the flag.
D. The next stage disables the partition options that are normally available in the Partitioning tab.
Answer: C

What is the purpose of the uv command in a UNIX DataStage server?
A. Cleanup resources from a failed DataStage job.
B. Start and stop the DataStage engine.
C. Provide read access to a DataStage EE configuration file.
D. Report DataStage client connections.
Answer: B

Which two statements regarding the usage of data types in the parallel engine are correct? (Choose two.)
A. The best way to import RDBMS data types is using the ODBC importer.
B. The parallel engine will use its interpretation of the Oracle meta data (e.g., exact data types) based on interrogation of Oracle, overriding what you may have specified in the Columns tabs.
C. The best way to import RDBMS data types is using the Import Orchestrate Schema Definitions using orchdbutil.
D. The parallel engine and server engine have exactly the same data types so there is no conversion cost overhead from moving data between the engines.
Answer: B,C

Which two describe a DataStage EE installation in a clustered environment? (Choose two.)
A. The C++ compiler must be installed on all cluster nodes.
B. Transform operators must be copied to all nodes of the cluster.
C. The DataStage parallel engine must be installed or accessible in the same directory on all machines in the cluster.
D. A remote shell must be configured to support communication between the conductor and section leader nodes.
Answer: C,D
Which partitioning method would yield the most even distribution of data without duplication?
A. Entire
B. Round Robin
C. Hash
D. Random
Answer: B

Which three accurately describe the differences between a DataStage server root installation and a non-root installation? (Choose three.)
A. A non-root installation enables auto-start on reboot.
B. A root installation must specify the user "dsadm" as the DataStage administrative user.
C. A non-root installation inherits the permissions of the user who starts the DataStage services.
D. A root installation will start DataStage services in impersonation mode.
E. A root installation enables auto-start on reboot.
Answer: C,D,E

Your job reads from a file using a Sequential File stage running sequentially. You are using a Transformer following the Sequential File stage to format the data in some of the columns. Which partitioning algorithm would yield optimized performance?
A. Hash
B. Random
C. Round Robin
D. Entire
Answer: C

Which three UNIX kernel parameters have minimum requirements for DataStage installations? (Choose three.)
A. MAXUPROC - maximum number of processes per user
B. NOFILES - number of open files
C. MAXPERM - disk cache threshold
D. NOPROC - no process limit
E. SHMMAX - maximum shared memory segment size
Answer: A,B,E

Which partitioning method requires specifying a key?
A. Random
B. DB2
C. Entire
D. Modulus
Answer: D

When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted?
A. export operator
B. copy operator
C. import operator
D. tsort operator
Answer: A
Which statement is true when Runtime Column Propagation (RCP) is enabled?
A. DataStage Manager does not import meta data.
B. DataStage Director does not supply row counts in the job log.
C. DataStage Designer does not enforce mapping rules.
D. DataStage Administrator does not allow default settings for environment variables.
Answer: C
Persistent Storage
Sequential file stage
The Sequential File stage is a file stage. It allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link. The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file. By default a complete file will be read by a single node (although each node might read more than one file). For fixed-width files, however, you can configure the stage to behave differently:
1. You can specify that a single file can be read by multiple nodes. This can improve performance on cluster
systems.
2. You can specify that a number of readers run on a single node. This means, for example, that a single file
can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node). (These two options are mutually exclusive.)
File This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file.
You can specify multiple files by repeating the File property
File pattern Specifies a group of files to import. Specify file containing a list of files or a job parameter
representing the file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.
Read method This property specifies whether you are reading from a specific file or files or using a file pattern to
select files (e.g., *.txt).
Missing file mode Specifies the action to take if one of your File properties has specified a file that does not exist.
Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the file has a node name prefix of *: in which case it is OK. The default is Depends.
Keep file partitions Set this to True to partition the imported data set according to the organization of the input
file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.
Reject mode Allows you to specify behavior if a read record does not match the expected schema (record does not
match the metadata defined in column definition). Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Report progress Choose Yes or No to enable or disable reporting. By default the stage displays a progress report
at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.
Number of readers per node This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. Specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file.
This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP system.
Read from multiple nodes This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Number of Readers Per Node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system. WebSphere DataStage knows the number of nodes available, and using the fixed length record size and the actual size of the file to be read, allocates the reader on each node a separate region within the file to process. The regions will be of roughly equal size. Note that sequential row order cannot be maintained when reading a file in parallel.
File update mode This property defines how the specified file or files are updated. The same method applies to all
files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists you will get an error at runtime. By default this property is set to Overwrite.
Using RCP With Sequential Stages Runtime column propagation (RCP) allows WebSphere DataStage to be
flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask WebSphere DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, and so WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are:
1. Sequential File
2. File Set
3. External Source
4. External Target
5. Column Import
6. Column Export
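The schema file itself is a plain text file in the Orchestrate record-schema syntax. The following is a minimal sketch for a comma-delimited file; the column names, format properties and types are illustrative assumptions, not values taken from this document:

record
  {final_delim=end, delim=',', quote=double}
(
  CustomerID: int32;
  Name: string[max=30];
  Balance: nullable decimal[10,2];
)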
$APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffer size in Kbytes, with a default of 128 (128K). Increasing this may improve performance. Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations. $APT_CONSISTENT_BUFFERIO_SIZE - Some disk arrays have read ahead caches that are only effective when data is read repeatedly in like-sized chunks. Setting APT_CONSISTENT_BUFFERIO_SIZE=N will force stages to read data in chunks which are size N or a multiple of N.
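A minimal sketch of how these variables might be set for a session is shown below; the values are illustrative assumptions and should be tuned for your own disk subsystem:

$ export APT_IMPORT_BUFFER_SIZE=256            # read (import) buffer in KB, default 128
$ export APT_EXPORT_BUFFER_SIZE=256            # write (export) buffer in KB, default 128
$ export APT_CONSISTENT_BUFFERIO_SIZE=1048576  # bytes; match the disk array's read/write size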
The data set descriptor file contains the following information:
1. Data set header information.
2. Creation time and date of the data set.
3. The schema (metadata) of the data set.
4. A copy of the configuration file used when the data set was created.
Data Sets are the structured internal representation of data within the Parallel Framework. They consist of:
1. Framework schema (format = name, type, nullability)
2. Data records (the data itself)
3. Partitions (the subset of rows held by each node)
Virtual Data Sets exist in memory and correspond to DataStage Designer links. Persistent Data Sets are stored on disk and consist of:
1. A descriptor file (metadata, configuration file, data file locations, flags)
2. Multiple data files (one per node, stored in the disk resource file systems), for example node1:/local/disk1/ and node2:/local/disk2/
There is no DataSet operator; the Designer GUI inserts a copy operator.
When to use persistent Data Sets: when writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints). They are stored in native internal format (no conversion overhead), retain data partitioning and sort order (end-to-end parallelism across jobs), and give maximum performance through parallel I/O.
Why Data Sets are not intended for long-term or archive storage: the internal format is subject to change with new DataStage releases, they require access to named resources (node names, file system paths, etc.), and the binary format is platform-specific.
For fail-over scenarios, servers should be able to cross-mount filesystems. You can read a dataset as long as your current $APT_CONFIG_FILE defines the same node names (fastnames may differ). orchadmin (with the -x option) lets you recover data from a dataset if the node names are no longer available.
1. Viewing the schema
Click the Schema icon from the tool bar to view the record schema of the current data set. This is presented in text form in the Record Schema window.
2. Viewing the data
Click the Data icon from the tool bar to view the data held by the current data set. This opens the Data Viewer Options dialog box, which allows you to select a subset of the data to view.
Rows to display. Specify the number of rows of data you want the data browser to display.
Skip count. Skip the specified number of rows before viewing data.
Period. Display every Pth record where P is the period. You can start after records have been skipped by using the Skip property. P must equal or be greater than 1.
Partitions. Choose between viewing the data in all partitions or the data in the partition selected from the drop-down list.
Click OK to view the selected data; the Data Viewer window appears.
3. Copying data sets
Click the Copy icon on the tool bar to copy the selected data set. The Copy data set dialog box appears, allowing you to specify a path where the new data set will be stored. The new data set will have the same record schema, number of partitions and contents as the original data set.
Note: You cannot use the UNIX cp command to copy a data set because WebSphere DataStage represents a single data set with multiple files.
4. Deleting data sets
Click the Delete icon on the tool bar to delete the current data set. You will be asked to confirm the deletion.
Note: You cannot use the UNIX rm command to delete a data set because WebSphere DataStage represents a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.
Orchadmin Commands
Orchadmin is a command line utility provided by DataStage to examine and manage data sets. The general callable format is:
$ orchadmin <command> [options] [descriptor file]
Before using orchadmin, you should make sure that either the working directory or $APT_ORCHHOME/etc contains the file config.apt, or that the environment variable $APT_CONFIG_FILE is defined for your session. The various commands available with orchadmin are:
1. CHECK: $orchadmin check
Validates the configuration file contents, such as the accessibility of all nodes defined in the configuration file and the scratch disk definitions. Throws an error when the config file is not found or not defined properly.
2. COPY: $orchadmin copy <source.ds> <destination.ds>
Makes a complete copy of the source dataset under a new destination descriptor file name. Please note that:
a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name; the data files are not copied.
b. The new dataset will be arranged according to the configuration file currently in use, not according to the old configuration file that was in use with the source.
3. DELETE: $orchadmin <delete | del | rm> [-f | -x] descriptorfiles
The UNIX rm utility cannot be used to delete datasets. The orchadmin delete or rm command should be used to delete one or more persistent data sets.
-f: Force delete. If some nodes are not accessible, -f deletes the dataset partitions from the accessible nodes and leaves the partitions on the inaccessible nodes as orphans.
-x: Use the current config file while deleting, rather than the one stored in the dataset.
4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds
This is the single most important command. Without any options, it lists the number of partitions, number of segments, valid segments, and the preserve-partitioning flag details of the persistent dataset.
-c: Print the configuration file that is written in the dataset, if any.
-p: List the partition level information.
-f: List the file level information in each partition.
-e: List the segment level information.
-s: List the metadata schema of the dataset.
-v: List all segments, valid or otherwise.
-l: Long listing; equivalent to -f -p -s -v -e.
5. DUMP: $orchadmin dump [options] descriptorfile.ds
The dump command is used to dump (extract) the records from the dataset. Without any options, the dump command lists all the records, starting from the first record of the first partition through the last record of the last partition.
-delim <string>: Use the given string as the delimiter for fields instead of a space.
-field <name>: List only the given field instead of all fields.
-name: List all values preceded by the field name and a colon.
-n numrecs: List only the given number of records per partition.
-p period(N): List every Nth record from each partition, starting from the first record.
-skip N: Skip the first N records from each partition.
-x: Use the current system configuration file rather than the one stored in the dataset.
6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds
Without options, truncate deletes all the data (that is, the segments) from the dataset.
-f: Force truncate. Truncate the accessible segments and leave the inaccessible ones.
-x: Use the current system config file rather than the one stored in the dataset.
-n N: Leave the first N segments in each partition and truncate the remaining ones.
7. HELP: $orchadmin -help OR $orchadmin <command> -help
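Putting the commands above together, a typical inspection session might look like the following sketch. The dataset paths and the configuration file location are hypothetical; the options themselves are the ones described above:

$ export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default.apt
$ orchadmin check                                      # validate the configuration file
$ orchadmin describe -l /data/ds/customers.ds          # long listing: partitions, segments, schema
$ orchadmin dump -name -n 10 /data/ds/customers.ds     # first 10 records per partition, with field names
$ orchadmin copy /data/ds/customers.ds /data/ds/customers_bkp.ds
$ orchadmin rm /data/ds/customers_bkp.ds               # never use UNIX rm on a data set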
When performing lookups, Lookup File Set stages are used with Lookup stages. When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup.

http://www.dsxchange.com/viewtopic.php?t=113394
A Hashed File is only available in server jobs. It uses a hashing algorithm (without building an index) to determine the location of keys within its structure. It is not amenable to parallelism. The contents of a hashed file may be cached in memory when using the Hashed File stage to service a reference input link. New rows to be written to a hashed file may first be written to a memory cache, then flushed to disk. All writes to a hashed file using an existing key overwrite the previous row. Duplicate key values are not permitted.
A Lookup File Set is only available in parallel jobs. It uses an index (based on a hash table) to determine the location of keys within its structure. It is a parallel structure; it has its records spread over the processing nodes specified when it was created. The records in the Lookup File Set are loaded into a virtual Data Set before use, and the index is also loaded into memory. Duplicate key values are (optionally) permitted. If the option is not selected, duplicates are rejected when writing to the Lookup File Set.

http://www.dsxchange.com/viewtopic.php?t=93287
I did testing on a Windows machine processing 100,000 primary rows against 100,000 lookup rows with a 1 to 1 match. Two key fields of char 255 and two non key fields also of char 255. I deliberately chose fat key fields. The dataset as a lookup took 2-3 minutes. The fileset as a lookup took about 40 seconds. Ran it a few times with the same results.
One interesting result was memory utilisation: the fileset was consistently lighter than the dataset, by as much as 30% on RAM memory. This may be due to the keep/drop key field option of the fileset stage. If you set keep to false, the key fields in the fileset are not loaded into memory as they are not required on the output side of the lookup. I am guessing that the fileset version was moving and storing 510 char less for each lookup than the dataset version. In a normal lookup these key fields travel up the reference link and back down it again; in a lookup fileset they only travel up.
When I switch the same job onto an AIX box with several gig of RAM I get 7 seconds for the dataset and 4 for the fileset. With an increase to 500,000 rows I get 23 seconds for the dataset and 7 seconds for the fileset. This difference may not be so apparent if your key fields are shorter. The major drawback of a lookup fileset is that it doesn't have the Append option of a dataset; you can only overwrite it.
1. In the Output Link Properties tab, specify the name of the lookup file set being used in the lookup.
2. Ensure column metadata has been specified for the lookup file set.
By default the stage will write to the file set in entire mode; the complete data set is written to each partition. If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default (auto) collection method.
5. Add another record by clicking one of the buttons at the bottom of the records list. Each button offers a different insertion point. A new record is created with the default name of NEWRECORD.
6. Double-click NEWRECORD to rename it.
7. Repeat steps 3 and 4 for each new record that you need to create.
8. Right-click the master record in the list and select Toggle Master Record. Only one master record is permitted.
Column definitions
You must define columns to specify what data the CFF stage will read or write. If the stage will read data from a file that contains multiple record types, you must first create record definitions on the Records tab. If the source file contains only one record type, or if the stage will write data to a target file, then the columns belong to the default record called RECORD_1. You can load column definitions from a table in the repository, or you can type column definitions into the columns grid. You can also define columns by dragging a table definition from the Repository window to the CFF stage icon on the Designer canvas.
Loading columns
The fastest way to define column metadata is to load columns from a table definition in the repository. To load columns: 1. Click the Records tab on the Stage page. 2. Click Load to open the Table Definitions window. This window displays all of the repository objects that are in the current project. 3. Select a table definition in the repository tree and click OK. 4. Select the columns to load in the Select Columns From Table window and click OK. 5. If flattening is an option for any arrays in the column structure, specify how to handle array data in the Complex File Load Option window.
Typing columns
You can also define column metadata by typing column definitions in the columns grid. To type columns: 1. Click the Records tab on the Stage page. 2. In the Level number field of the grid, specify the COBOL level number where the data is defined. If you do not specify a level number, a default value of 05 is used. 3. In the Column name field, type the name of the column. 4. In the Native type field, select the native data type. 5. In the Length field, specify the data precision. 6. In the Scale field, specify the data scale factor. 7. Optional: In the Description field, type a description of the column.
3. Select the record ID column from the Column list. This list displays all columns from the selected record, except the first OCCURS DEPENDING ON (ODO) column and any columns that follow it.
4. Select the = operator from the Op list.
5. Type the identifying value for the record ID column in the Value field. Character values must be enclosed in single quotation marks.
4. Click OK to save your changes and to close the CFF stage editor.
Reject links
The CFF stage can have a single reject link, whether you use the stage as a source or a target. For CFF source stages, reject links are supported only if the source file contains a single record type without any OCCURS DEPENDING ON (ODO) columns. For CFF target stages, reject links are supported only if the target file does not contain ODO columns. You cannot change the selection properties of a reject link. The Selection tab for a reject link is blank. You cannot edit the column definitions for a reject link. For writing files, the reject link uses the input link column definitions. For reading files, the reject link uses a single column named rejected that contains raw data for the columns that were rejected after reading because they did not match the schema.
2. Secure FTP (SFTP) Select this option if you want to transfer files between computers over a secured channel. Secure FTP (SFTP) uses the SSH (Secure Shell) protected channel for data transfer between computers over a nonsecure network such as a TCP/IP network. Before you can use SFTP to transfer files, you should configure the SSH connection without any pass phrase for RSA authentication (a minimal key-setup sketch follows this section).
Force Parallelism You can set either Yes or No. In general, the FTP Enterprise stage tries to start as many processes as needed to transfer the n files in parallel. However, you can force the parallel transfer of data by setting this property to Yes. This allows m processes at a time, where m is the number of nodes specified in the WebSphere DataStage configuration file. If m is less than n, the stage transfers the first m files and then starts on the next m, until all n files are transferred. When you set Force Parallelism to Yes, you should only give one URI.
Overwrite Set this option to have any existing files overwritten by this transfer.
Restartable Mode When you specify a restartable mode of Restartable transfer, WebSphere DataStage creates a directory for recording information about the transfer in a restart directory. If the transfer fails, you can run an identical job with the restartable mode property set to Restart transfer, which will reattempt the transfer. If the transfer repeatedly fails, you can run an identical job with the restartable mode option set to Abandon transfer, which will delete the restart directory. Restartable mode has the following dependent properties:
1. Job Id Identifies a restartable transfer job. This is used to name the restart directory.
2. Checkpoint directory Optionally specifies a checkpoint directory to contain restart directories. If you do not specify this, the current working directory is used. For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint, the files would be written to /home/bgamsworth/checkpoint/pftp_jobid_100.
Schema file Contains a schema for storing data. Setting this option overrides any settings on the Columns tab. You can enter the path name of a schema file, or specify a job parameter, so the schema file name can be specified at run time.
Transfer Type Select a data transfer type to transfer files between computers. You can select either the Binary or ASCII mode of data transfer. The default data transfer mode is binary.
When reading a delimited Sequential File, you are instructed to interpret two contiguous field delimiters as NULL for the corresponding field regardless of data type. Which three actions must you take? (Choose three.) A. Set the data type to Varchar. B. Set the field to nullable. C. Set the "NULL Field Value" to two field delimiters (e.g., "||" for pipes). D. Set the "NULL Field Value" to ''. E. Set the environment variable $APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL. Answer: B,D,E
$APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL - When set, allows a zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.
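As referenced above, SFTP requires passphrase-less RSA authentication. A minimal sketch, assuming OpenSSH is available on both machines; the user and host names (dsadm, etlhost2) are hypothetical:
$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa     # generate an RSA key pair with an empty passphrase
$ ssh-copy-id dsadm@etlhost2                   # install the public key on the remote host (or append id_rsa.pub to ~/.ssh/authorized_keys there)
$ ssh dsadm@etlhost2 uname -n                  # verify that login now succeeds without a prompt
Once the key-based login works without prompting, the FTP Enterprise stage can open SFTP connections to that host non-interactively.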
Which two attributes are found in a Data Set descriptor file? (Choose two.) A. A copy of the job score. B. The schema of the Data Set. C. A copy of the partitioned data. D. A copy of the configuration file used when Data Set was created. Answer: B,D When importing a COBOL file definition, which two are required? (Choose two.) A. The file you are importing is accessible from your client workstation. B. The file you are importing contains level 01 items. C. The column definitions are in a COBOL copybook file and not, for example, in a COBOL source file. D. The file does not contain any OCCURS DEPENDING ON clauses. Answer: A,B Which three features of datasets make them suitable for job restart points? (Choose three.) A. They are indexed for fast data access. B. They are partitioned. C. They use datatypes that are in the parallel engine internal format. D. They are persistent. E. They are compressed to minimize storage space. Answer: B,C,D Which statement describes a process for capturing a COBOL copybook from a z/OS system? A. FTP the COBOL copybook to the server platform in text mode and capture the metadata through Manager. B. Select the COBOL copybook using the Browse button and capture the COBOL copybook with Manager. C. FTP the COBOL copybook to the client workstation in text mode and capture the copybook with Manager. D. FTP the COBOL copybook to the client workstation in binary and capture the metadata through Manager. Answer: C The high performance ETL server on which DataStage EE is installed is networked with several other servers in the IT department with a very high bandwidth switch. A list of seven files (all of which contain records with the same record layout) must be retrieved from three of the other servers using FTP. Given the high bandwidth network and high performance ETL server, which approach will retrieve and process all seven files in the minimal amount of time? A. In a single job, use seven separate FTP Enterprise stages the output links of which lead to a single Sort Funnel stage, then process the records without landing to disk. B. Setup a sequence of seven separate DataStage EE jobs, each of which retrieves a single file and appends to a common dataset, then process the resulting dataset in an eighth DataStage EE job. C. Use three FTP Plug-in stages (one for each machine) to retrieve the seven files and store them to a single file on the fourth server, then use the FTP Enterprise stage to retrieve the single file and process the records without landing to disk. D. Use a single FTP Enterprise stage and specify seven URI properties, one for each file, then process the records without landing to disk. Answer: D An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage? A. No special settings are required. XML Input stage automatically detects the repetition element from the XPath expression. B. Set the "Key" property for the column on the output link to "Yes".
C. Check the "Repetition Element Required" box on the output link tab. D. Set the "Nullable" property for the column on the output link to "Yes". Answer: B Using FTP, a file is transferred from an MVS system to a LINUX system in binary transfer mode. Which data conversion must be used to read a packed decimal field in the file? A. treat the field as a packed decimal B. packed decimal fields are not supported C. treat the field as ASCII D. treat the field as EBCDIC Answer: A When a sequential file is read using a Sequential File stage, the parallel engine inserts an operator to convert the data to the internal format. Which operator is inserted? A. import operator B. copy operator C. tsort operator D. export operator Answer: A Which type of file is both partitioned and readable by external applications? A. fileset B. Lookup fileset C. dataset D. sequential file Answer: A Which two statements are true about XML Meta Data Importer? (Choose two.) A. XML Meta Data Importer is capable of reporting syntax and semantic errors from an XML file. B. XPATH expressions that are created during XML metadata import cannot be modified. C. XML Meta Data Importer can import Table Definitions from only XML documents. D. XPATH expressions that are created during XML metadata import are used by XML Input stage and XML Output stage. Answer: A,D Which two statements are correct about XML stages and their usage? (Choose two.) A. XML Input stage converts XML data to tabular format. B. XML Output stage converts tabular data to XML hierarchical structure. C. XML Output stage uses XSLT stylesheet for XML to tabular transformations. D. XML Transformer stage converts XML data to tabular format. Answer: A,B Which "Reject Mode" option in the Sequential File stage will write records to a reject link? A. Output B. Fail C. Drop D. Continue Answer: A
A single sequential file exists on a single node. To read this sequential file in parallel, what should be done? A. Set the Execution mode to "Parallel". B. A sequential file cannot be read in parallel using the Sequential File stage. C. Select "File Pattern" as the Read Method. D. Set the "Number of Readers Per Node" optional property to a value greater than 1. Answer: D When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted? A. export operator B. copy operator C. import operator D. tsort operator Answer: A A bank receives daily credit score updates from a credit agency in the form of a fixed width flat file. The monthly_income column is an unsigned nullable integer (int32) whose width is specified as 10, and null values are represented as spaces. Which Sequential File property will properly import any nulls in the monthly_income column of the input file? A. Set the record level fill char property to the space character (' '). B. Set the null field value property to a single space (' '). C. Set the C_format property to '"%d. 10"'. D. Set the null field value property to ten spaces (' '). Answer: D An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage? A. Set the "Nullable" property for the column on the output link to "Yes". B. Set the "Key" property for the column on the output link to "Yes". C. Check the "Repetition Element Required" box on the output link tab. D. No special settings are required. XML Input stage automatically detects the repetition element from the XPath expression. Answer: B During a sequential file read, you experience an error with the data. What is a valid technique for identifying the column causing the difficulty? A. Set the "data format" option to text on the Record Options tab. B. Enable tracing in the DataStage Administrator Tracing panel. C. Enable the "print field" option at the Record Options tab. D. Set the APT_IMPORT_DEBUG environmental variable. Answer: C On which two does the number of data files created by a fileset depend ? (Choose two.) A. the size of the partitions of the dataset B. the number of CPUs C. the schema of the file D. the number of processing nodes in the default node pool Answer: A,D What are two ways to delete a persistent parallel dataset? (Choose two.) A. standard UNIX command rm B. orchadmin command rm
C. delete the dataset Table Definition in DataStage Manager D. delete the dataset in Data Set Manager Answer: B,D A parts supplier has a single fixed width sequential file. Reading the file has been slow, so the supplier would like to try to read it in parallel. If the job executes using a configuration file consisting of four nodes, which two Sequential File stage settings will cause the DataStage parallel engine to read the file using four parallel readers? (Choose two.) (Note: Assume the file path and name is /data/parts_input.txt.) A. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers per node option to 2. B. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the read from multiple nodes option to yes. C. Set read method to file pattern, and set the file pattern property to '/data/(@PART_COUNT)parts_input.txt'. D. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers per node option to 4. Answer: B,D
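As a side note on the dataset-deletion answer above, removing a persistent dataset with orchadmin deletes both the descriptor file and the data files it references, whereas a plain UNIX rm would remove only the descriptor and leave the data files orphaned. A minimal sketch, assuming the parallel environment (APT_ORCHHOME, APT_CONFIG_FILE) is set and using a hypothetical dataset path:
$ orchadmin rm /data/customers.ds    # removes the descriptor and the underlying data files
The Data Set Management tool in the GUI achieves the same result interactively.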
Data Transformation
Transformer Stage
Transformer stages can have a single input and any number of outputs. A Transformer can also have a reject link that takes any rows which have not been written to any of the output links because of a write failure or expression evaluation failure. In order to write efficient Transformer stage derivations, it is useful to understand what items get evaluated and when. The evaluation sequence is as follows:
Evaluate each stage variable initial value
For each input row to process:
  Evaluate each stage variable derivation value, unless the derivation is empty
  For each output link:
    Evaluate each column derivation value
    Write the output record
  Next output link
Next input row
The stage variables and the columns within a link are evaluated in the order in which they are displayed on the parallel job canvas. Similarly, the output links are also evaluated in the order in which they are displayed.
System variables
WebSphere DataStage provides a set of variables containing useful system information that you can access from an output derivation or constraint.
1. @FALSE The value is replaced with 0.
2. @TRUE The value is replaced with 1.
3. @INROWNUM Input row counter.
4. @OUTROWNUM Output row counter (per link).
5. @NUMPARTITIONS The total number of partitions for the stage.
6. @PARTITIONNUM The partition number for the particular instance.
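A commonly used derivation pattern, shown here as a hedged sketch rather than text from the product documentation, combines these variables to generate a sequence of integers that is unique across partitions (it assumes @INROWNUM starts at 1 in each partition):
@PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS
Each partition then produces a disjoint, interleaved set of values, which is why @PARTITIONNUM and @NUMPARTITIONS are the two system variables needed for cross-partition unique sequences.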
Triggers tab
The Triggers tab allows you to choose routines to be executed at specific execution points as the Transformer stage runs in a job. The execution point is per-instance, i.e., if a job has two Transformer stage instances running in parallel, the routine will be called twice, once for each instance. The available execution points are Before-stage and After-stage. At this release, the only available built-in routine is SetCustomSummaryInfo. You can also define custom routines to be executed; to do this you define a C function, make it available in a UNIX shared library, and then define a Parallel routine which calls it (see WebSphere DataStage Designer Client Guide for details on defining a Parallel Routine). Note that the function should not return a value.
A constraint otherwise link can be defined by:
1. Clicking on the Otherwise/Log field so a tick appears and leaving the Constraint fields blank. This will catch any rows that have failed to meet constraints on all the previous output links.
2. Setting the constraint to OTHERWISE. This will be set whenever a row is rejected on a link because the row fails to match a constraint. OTHERWISE is cleared by any output link that accepts the row.
3. The otherwise link must occur after the output links in link order so that it catches rows that have failed to meet the constraints of all the output links. If it is not last, a row that satisfies a constraint on a later link may be sent down the otherwise link as well as down that later link.
4. Clicking on the Otherwise/Log field so a tick appears and defining a Constraint. This will cause the number of rows written to that link (i.e. rows which satisfy the constraint) to be recorded in the job log as a warning message.
Note: You can also specify a reject link which will catch rows that have not been written on any output links due to a write error or null expression error. Define this outside the Transformer stage by adding a link and using the shortcut menu to convert it to a reject link.
Conditionally Aborting a Job
Use the Abort After Rows setting in the output link constraints of the parallel Transformer to conditionally abort a parallel job. You can specify an abort condition for any output link. The abort occurs after the specified number of rows occurs in one of the partitions. When the Abort After Rows threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or un-flushed file buffers.
Functions and Operators
Concatenation Operator :
Substring operator Input_String[starting position, length]
String Functions 1. Len(<string>) 2. Trim(<string>) 3. UpCase/DownCase(<string>)
Null Handling functions 1. IsNull 2. IsNotNull 3. NullToValue 4. NullToZero 5. SetNull()
Type Conversions 1. StringToTimestamp 2. StringToDecimal
(A short derivation example using these functions appears at the end of this section.)
Using Transformer stages
In general, it is good practice not to use more Transformer stages than you have to. You should especially avoid using multiple Transformer stages where the logic can be combined into a single stage. It is often better to use other stage types for certain types of operation:
1. Use a Copy stage rather than a Transformer for simple operations such as: Providing a job design placeholder on the canvas. (Provided you do not set the Force property to True on the Copy stage, the copy will be optimized out of the job at run time.) Renaming columns. Dropping columns. Implicit type conversions. Note that, if runtime column propagation is disabled, you can also use output mapping on a stage to rename, drop, or convert columns on a stage that has both inputs and outputs.
2. Use the Modify stage for explicit type conversion and null handling.
3. Where complex, reusable logic is required, or where existing Transformer-stage based job flows do not meet performance requirements, consider building your own custom stage.
4. Use a BASIC Transformer stage where you want to take advantage of user-defined functions and routines.
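As referenced above, a minimal sketch of output column derivations using the functions listed in this section; the link and column names (lnk_in, first_name, last_name, income) are hypothetical, not from the original text:
OutFullName derivation:  UpCase(Trim(lnk_in.first_name) : ' ' : Trim(lnk_in.last_name))
OutIncome derivation:    If IsNull(lnk_in.income) Then SetNull() Else StringToDecimal(lnk_in.income)
The first derivation combines the concatenation operator with string functions; the second guards a type conversion with a null test so that a NULL input does not cause a conversion error (the output column must be nullable for SetNull() to be valid).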
SCD Stage
The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension. You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements. SCD stages support both SCD Type 1 and SCD Type 2 processing: 1. SCD Type 1 Overwrites an attribute in a dimension table. 2. SCD Type 2 Adds a new row to a dimension table. Each SCD stage processes a single dimension and performs lookups by using an equality matching technique. If the dimension is a database table, the stage reads the database to build a lookup table in memory. If a match is found, the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage creates a new row in the dimension table. All of the columns that are needed to create a new dimension row must be present in the source data.
1. If a dimension contains a Type 2 column, you must select a Current Indicator column, an Expiration Date column, or both. An Effective Date column is optional. You cannot assign Type 2 and Current Indicator to the same column.
2. If a dimension contains only Type 1 columns, no Current Indicator, Effective Date, Expiration Date, or SK Chain columns are allowed.
1. (blank) The column has no SCD purpose. This purpose code is the default.
2. Surrogate Key The column is a surrogate key that is used to identify dimension records.
3. Business Key The column is a business key that is typically used in the lookup condition.
4. Type 1 The column is an SCD Type 1 field. SCD Type 1 column values are always current. When changes occur, the SCD stage overwrites existing values in the dimension table.
5. Type 2 The column is an SCD Type 2 field. SCD Type 2 column values represent a point in time. When changes occur, the SCD stage creates a new dimension row.
6. Current Indicator (Type 2) The column is the current record indicator for SCD Type 2 processing. Only one Current Indicator column is allowed.
7. Effective Date (Type 2) The column is the effective date for SCD Type 2 processing. Only one Effective Date column is allowed.
8. Expiration Date (Type 2) The column is the expiration date for SCD Type 2 processing. An Expiration Date column is required if there is no Current Indicator column; otherwise it is optional.
9. SK Chain The column is used to link a record to the previous record or the next record by using the value of the Surrogate Key column. Only one Surrogate Key column can exist if you have an SK Chain column.
To use the key source: 1. On the Input page, select the reference link in the Input name field.
2. Click the Surrogate Key tab. 3. In the Source type field, select the source type. 4. In the Source name field, type the name of the key source, or click the arrow button to browse for a file or to insert a job parameter. If the source is a flat file, type the name and fully qualified path of the state file, such as C:/SKG/ProdDim. If the source is a database sequence, type the name of the sequence, such as PRODUCT_KEY_SEQ. 5. Provide additional information about the key source according to the type: If the source is a flat file, specify information in the Flat File area. If the source is a database sequence, specify information in the DB sequence area. Calls to the key source are made by the NextSurrogateKey function. On the Dim Update tab, create a derivation that uses the NextSurrogateKey function for the column that has a purpose code of Surrogate Key. The NextSurrogateKey function returns the value of the next surrogate key when the SCD stage creates a new dimension row. A DataStage job contains a parallel Transformer with a single input link and a single output link. The Transformer has a constraint that should produce 1000 records, however only 900 came out through the output link. What should be done to identify the missing records? A. Turn trace on using DataStage Administrator. B. Add a Reject link to the Transformer stage. C. Scan generated osh script for possible errors. D. Remove the constraint on the output link. Answer: B Which three actions are performed using stage variables in a parallel Transformer stage? (Choose three.) A. A function can be executed once per record. B. A function can be executed once per run. C. Identify the first row of an input group. D. Identify the last row of an input group. E. Lookup up a value from a reference dataset. Answer: A,B,C Which two system variables must be used in a parallel Transformer derivation to generate a unique sequence of integers across partitions? (Choose two.) A. @PARTITIONNUM B. @INROWNUM C. @DATE D. @NUMPARTITIONS Answer: A,D What would require creating a new parallel Custom stage rather than a new parallel BuildOp stage? A. A Custom stage can be created with properties. BuildOp stages cannot be created with properties. B. In a Custom stage, the number of input links does not have to be fixed, but can vary, for example from one to two. BuildOp stages require a fixed number of input links. C. Creating a Custom stage requires knowledge of C/C++. You do not need knowledge of C/C++ to create a BuildOp stage. D. Custom stages can be created for parallel execution. BuildOp stages can only be built to run sequentially. Answer: B
Your input rows contain customer data from a variety of locations. You want to select just those rows from a specified location based on a parameter value. You are trying to decide whether to use a Transformer or a Filter stage to accomplish this. Which statement is true? A. The Transformer stage will yield better performance because the Filter stage Where clause is interpreted at runtime. B. You cannot use a Filter stage because you cannot use parameters in a Filter stage Where clause. C. The Filter stage will yield better performance because it has less overhead than a Transformer stage. D. You cannot use the Transformer stage because you cannot use parameters in a Transformer stage constraint. Answer: A In a Transformer you add a new column to an output link named JobName that is to contain the name of the job that is running. What can be used to derive values for this column? A. a DataStage function B. a link variable C. a system variable D. a DataStage macro Answer: D Which statement describes how to add functionality to the Transformer stage? A. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of a function written and compiled in C++. B. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of an external program. C. Create a new server routine in the Routines category that specifies the name and category of a function written in DataStage Basic. D. Edit the C++ code generated by the Transformer stage. Answer: A Which three statements about the Enterprise Edition parallel Transformer stage are correct? (Choose three.) A. The Transformer allows you to copy columns. B. The Transformer allows you to do lookups. C. The Transformer allows you to apply transforms using routines. D. The Transformer stage automatically applies 'NullToValue' function to all non-nullable output columns. E. The Transformer allows you to do data type conversions. Answer: A,C,E Which two stages allow field names to be specified using job parameters? (Choose two.) A. Transformer stage B. Funnel stage C. Modify stage D. Filter stage Answer: C,D The parallel dataset input into a Transformer stage contains null values. What should you do to properly handle these null values? A. Convert null values to a valid values in a stage variable. B. Convert null values to a valid value in the output column derivation. C. Null values are automatically converted to blanks and zero, depending on the target data type. D. Trap the null values in a link constraint to avoid derivations. Answer: A
Which two would require the use of a Transformer stage instead of a Copy stage? (Choose two.) A. Drop a column. B. Send the input data to multiple output streams. C. Trim spaces from a character field. D. Select certain output rows based on a condition. Answer: C,D In which situation should a BASIC Transformer stage be used in a DataStage EE job? A. in a job containing complex routines migrated from DataStage Server Edition B. in a job requiring lookups to hashed files C. in a large-volume job flow D. in a job requiring complex, reusable logic Answer: A You have three output links coming out of a Transformer. Two of them (A and B) have constraints you have defined. The third you want to be an Otherwise link that is to contain all of the rows that do not satisfy the constraints of A and B. This Otherwise link must work correctly even if the A and B constraints are modified. Which two are required? (Choose two.) A. The Otherwise link must be first in the link ordering. B. A constraint must be coded for the Otherwise link. C. The Otherwise link must be last in the link ordering. D. The Otherwise check box must be checked. Answer: C,D Which two statements are true about DataStage Parallel Buildop stages? (Choose two.) A. Unlike standard DataStage stages they do not have properties. B. They are coded using C/C++. C. They are coded using DataStage Basic. D. Table Definitions are used to define the input and output interfaces of the BuildOp. Answer: B,D
3. Click Add Rule to add the new message rule to the chosen handler.
Running a job
dsjob -run [-mode [NORMAL | RESET | VALIDATE]] [-param name=value] [-warn n] [-rows n] [-wait] [-stop] [-jobstatus] [-userstatus] [-local] [-opmetadata [TRUE | FALSE]] [-disableprjhandler] [-disablejobhandler] [-useid] project job|job_id
-mode specifies the type of job run. NORMAL starts a job run, RESET resets the job and VALIDATE validates the job. If -mode is not specified, a normal job run is started.
-param specifies a parameter value to pass to the job. The value is in the format name=value, where name is the parameter name, and value is the value to be set. If you use this to pass the value of an environment variable for a job (as you may do for parallel jobs), you need to quote the environment variable and its value, for example -param '$APT_CONFIG_FILE=chris.apt', otherwise the current value of the environment variable will be used.
-warn n sets warning limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITWARN specified as the LimitType parameter).
-rows n sets row limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITROWS specified as the LimitType parameter).
-wait waits for the job to complete (equivalent to the DSWaitForJob function).
-stop terminates a running job (equivalent to the DSStopJob function).
-jobstatus waits for the job to complete, then returns an exit code derived from the job status.
-userstatus waits for the job to complete, then returns an exit code derived from the user status if that status is defined. The user status is a string, and it is converted to an integer exit code. The exit code 0 indicates that the job completed without an error, but that the user status string could not be converted. If a job returns a negative user status value, it is interpreted as an error.
-local use this when running a DataStage job from within a shell script on a UNIX server. Provided the script is run in the project directory, the job will pick up the settings for any environment variables set in the script and any settings specific to the user environment.
-opmetadata use this to have the job generate operational meta data as it runs. If MetaStage, or the Process Meta Data MetaBroker, is not installed on the machine, then the option has no effect. If you specify TRUE, operational meta data is generated, whatever the default setting for the project. If you specify FALSE, the job will not generate operational meta data, whatever the default setting for the project.
-disableprjhandler use this to disable any error message handler that has been set on a project-wide basis.
-disablejobhandler use this to disable any error message handler that has been set for this job.
-useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To run a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob -jobid command.
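A minimal sketch of invoking a parallel job from a UNIX shell; the project name (dstage1), job name (LoadCustomers), parameter name (SRC_DIR) and file paths are hypothetical:
$ dsjob -run -mode NORMAL -param SRC_DIR=/data/in -param '$APT_CONFIG_FILE=/opt/cfg/4node.apt' -jobstatus dstage1 LoadCustomers
$ echo $?   # with -jobstatus, the exit code is derived from the finishing job status
Quoting the $APT_CONFIG_FILE parameter, as described above, prevents the shell from expanding it before dsjob sees it.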
Stopping a job
You can stop a job using the -stop option:
dsjob -stop [-useid] project job|job_id
-stop terminates a running job (equivalent to the DSStopJob function).
-useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To stop a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob -jobid command.
Listing Projects
The following syntax displays a list of all known projects on the server:
dsjob -lprojects
This syntax is equivalent to the DSGetProjectList function.
Listing Jobs
The following syntax displays a list of all jobs in the specified project:
dsjob -ljobs project
project is the name of the project containing the jobs to list. This syntax is equivalent to the DSGetProjectInfo function.
Listing Stages
The following syntax displays a list of all stages in a job:
dsjob -lstages [-useid] project job|job_id
This syntax is equivalent to the DSGetJobInfo function with DSJ_STAGELIST specified as the InfoType parameter.
Listing Links
The following syntax displays a list of all the links to or from a stage:
dsjob -llinks [-useid] project job|job_id stage
This syntax is equivalent to the DSGetStageInfo function with DSJ_LINKLIST specified as the InfoType parameter.
Listing Parameters
The following syntax displays a list of all the parameters in a job and their values:
dsjob -lparams [-useid] project job|job_id
Listing Invocations
The following syntax displays a list of the invocations of a job:
dsjob -linvocations
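A short sketch of the listing commands used together from a UNIX shell; the project and job names (dstage1, LoadCustomers) are hypothetical:
$ dsjob -lprojects
$ dsjob -ljobs dstage1
$ dsjob -lstages dstage1 LoadCustomers
$ dsjob -lparams dstage1 LoadCustomers
Each command writes its list to stdout, so the output can be captured or piped to other UNIX tools.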
type can be one of the following:
INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED All control logs.
RESET Job reset.
BATCH Batch control.
ANY All entries of any type. This is the default if type is not specified.
-max n limits the number of entries retrieved to n.
Generating a Report
The dsjob command can be used to generate an XML format report containing job, stage, and link information.
dsjob -report [-useid] project job|job_id [report_type]
report_type is one of the following:
BASIC Text string containing start/end time, time elapsed and status of job.
DETAIL As basic report, but also contains information about individual stages and links within the job.
LIST Text string containing full XML report.
By default the generated XML will not contain a <?xml-stylesheet?> processing instruction. If a stylesheet is required, specify a ReportLevel of 2 and append the name of the required stylesheet URL, i.e., 2:styleSheetURL. This inserts a processing instruction into the generated XML of the form:
<?xml-stylesheet type="text/xsl" href="styleSheetURL"?>
The generated report is written to stdout. This syntax is equivalent to the DSMakeJobReport function.
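A hedged example of capturing a detailed report to a file; the project and job names are hypothetical:
$ dsjob -report dstage1 LoadCustomers DETAIL > LoadCustomers_report.xml
Because the report goes to stdout, redirecting it is the usual way to keep it for later inspection or XSLT processing.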
Job Sequence
What is a Job Sequence?
1. A master controlling job that controls the execution set of subordinate jobs
2. Passes values to subordinate job parameters
3. Controls the order of execution (links)
4. Specifies conditions under which the subordinate jobs get executed (triggers)
5. Specifies complex flow of control Loops, All/Any sequencer, Wait for file
6. Performs system activities (Email, Execute system commands and executables)
7. Can include Restart checkpoints
What are the Job Sequence stages?
1. Run stages Job Activity: Run a job, Execute Command/Routine Activity: Run a system command, Notification Activity: Send an email
2. Flow Control stages Sequencer: Go All/Any, Wait for file: Go when file exists/doesn't exist, Loop: Start loop and End Loop, Nested Condition: Go if condition satisfied
3. Error handling Exception Handler, Terminator
4. Variables User Variables
What are the compilation options in Job Sequence properties?
1. Add checkpoints so sequence is restartable on failure Restart Functionality
2. Automatically handle activities that fail Exception stage to handle aborts
3. Log warnings after activities that finish with status other than OK
4. Log report messages after each run
What are the inputs for Job Activity stage?
1. Job name (select from list)
2. Execution Action (select from list)
3. Parameters
4. Do not checkpoint run (select/unselect checkbox)
What are the Job Activity Execution Actions?
1. Run
2. Reset if required, then run
3. Validate
What are the different types of triggers for a Job Activity?
OK (Conditional), Failed (Conditional), Warning (Conditional), Custom (Conditional), UserStatus (Conditional), Unconditional, Otherwise
Custom Trigger Example: Job_1.$JobStatus=DSJS.RUNOK or Job_1.$JobStatus=DSJS.RUNWARN
What are the inputs for Execute Command stage?
1. Command
2. Parameters
3. Do not checkpoint run (select/unselect checkbox)
What are the inputs for Notification stage?
1. SMTP Mail server name
2. Sender's email address
3. Recipient's email address
4. Email subject
5. Attachment
6. Email body
7. Include job status in email (select/unselect checkbox)
8. Do not checkpoint run (select/unselect checkbox)
What are the inputs for Wait for file stage?
1. Filename
2. Wait for file to appear / Wait for file to disappear (Select one of the two options)
3. Timeout length (disabled if the Do not timeout option is selected)
4. Do not timeout
5. Do not checkpoint run
Explain the Nested Condition stage?
The Nested Condition stage is used to branch out to other activities based on trigger conditions.
Explain the Loop stage?
The Loop stage is made up of Start Loop and End Loop. The Start Loop connects to one of the Run activities (preferably Job Activity). This Activity stage connects to the End Loop. The End Loop connects to the Start Loop activity by means of a reference link. The 2 types of looping are 1. Numeric (For counter n to n Step n) 2. List (For each thing in list)
Explain the Error handling and Restartability?
Error handling is enabled using the Automatically handle activities that fail option. The control is passed to the Exception stage when an Activity fails. Restartability is enabled using the Add checkpoints so sequence is restartable on failure option. If a sequence fails, then when the Sequence is re-run, activities that completed successfully in the prior run are skipped over (unless the Do not checkpoint run option was set for an activity).
Which three are valid ways within a Job Sequence to pass parameters to Activity stages? (Choose three.) A. ExecCommand Activity stage B. UserVariables Activity stage C. Sequencer Activity stage D. Routine Activity stage E. Nested Condition Activity stage Answer: A,B,D
Which three are valid trigger expressions in a stage in a Job Sequence? (Choose three.) A. Equality(Conditional)
B. Unconditional C. ReturnValue(Conditional) D. Difference(Conditional) E. Custom(Conditional) Answer: B,C,E A client requires that any job that aborts in a Job Sequence halt processing. Which three activities would provide this capability? (Choose three.) A. Nested Condition Activity B. Exception Handler C. Sequencer Activity D. Sendmail Activity E. Job trigger Answer: A,B,E Which command can be used to execute DataStage jobs from a UNIX shell script? A. dsjob B. DSRunJob C. osh D. DSExecute Answer: A Which three are the critical stages that would be necessary to build a Job Sequence that: picks up data from a file that will arrive in an directory overnight, launches a job once the file has arrived, sends an email to the administrator upon successful completion of the flow? (Choose three.) A. Sequencer B. Notification Activity C. Wait For File Activity D. Job Activity E. Terminator Activity Answer: B,C,D Which two statements describe functionality that is available using the dsjob command? (Choose two.) A. dsjob can be used to get a report containing job, stage, and link information. B. dsjob can be used to add a log entry for a specified job. C. dsjob can be used to compile a job. D. dsjob can be used to export job executables. Answer: A,B
Other Topics
Environment Variables
APT_BUFFER_FREE_RUN This environment variable is available in the DataStage Administrator, under the Parallel category. It specifies how much of the available in-memory buffer to consume before the buffer resists accepting new data. This is expressed as a decimal representing the percentage of Maximum memory buffer size (for example, 0.5 is 50%). When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk.
APT_BUFFER_MAXIMUM_MEMORY Sets the default value of Maximum memory buffer size. The default value is 3145728 (3 MB). Specifies the maximum amount of virtual memory, in bytes, used per buffer.
APT_BUFFER_MAXIMUM_TIMEOUT DataStage buffering is self tuning, which can theoretically lead to long delays between retries. This environment variable specifies the maximum wait before a retry in seconds, and is by default set to 1.
APT_BUFFERING_POLICY This environment variable is available in the DataStage Administrator, under the Parallel category. Controls the buffering policy for all virtual data sets in all steps. The variable has the following settings: AUTOMATIC_BUFFERING (default). Buffer a data set only if necessary to prevent a data flow deadlock. FORCE_BUFFERING. Unconditionally buffer all virtual data sets. Note that this can slow down processing considerably. NO_BUFFERING. Do not buffer data sets. This setting can cause data flow deadlock if used inappropriately.
APT_DECIMAL_INTERM_PRECISION Specifies the default maximum precision value for any decimal intermediate variables required in calculations. Default value is 38.
APT_DECIMAL_INTERM_SCALE Specifies the default scale value for any decimal intermediate variables required in calculations. Default value is 10.
APT_CONFIG_FILE Sets the path name of the configuration file. (You may want to include this as a job parameter, so that you can specify the configuration file at job run time.)
APT_DISABLE_COMBINATION Globally disables operator combining. Operator combining is DataStage's default behavior, in which two or more (in fact any number of) operators within a step are combined into one process where possible. You may need to disable combining to facilitate debugging. Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory. It also disables internal optimizations for job efficiency and run times.
APT_EXECUTION_MODE By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode: ONE_PROCESS one-process mode, MANY_PROCESS many-process mode, NO_SERIALIZE many-process mode without serialization.
APT_ORCHHOME Must be set by all DataStage Enterprise Edition users to point to the top-level directory of the DataStage Enterprise Edition installation.
APT_STARTUP_SCRIPT As part of running an application, DataStage creates a remote shell on all DataStage processing nodes on which the job runs. By default, the remote shell is given the same environment as the shell from which DataStage is invoked. However, you can write an optional startup shell script to modify the shell configuration of one or more processing nodes. If a startup script exists, DataStage runs it on remote shells before running your application. APT_STARTUP_SCRIPT specifies the script to be run. If it is not defined, DataStage searches ./startup.apt, $APT_ORCHHOME/etc/startup.apt and $APT_ORCHHOME/etc/startup, in that order. APT_NO_STARTUP_SCRIPT disables running the startup script.
APT_NO_STARTUP_SCRIPT Prevents DataStage from executing a startup script. By default, this variable is not set, and DataStage runs the startup script. If this variable is set, DataStage ignores the startup script. This may be useful when debugging a startup script. See also APT_STARTUP_SCRIPT.
APT_STARTUP_STATUS Set this to cause messages to be generated as parallel job startup moves from phase to phase. This can be useful as a diagnostic if parallel job startup is failing.
APT_MONITOR_SIZE This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the minimum number of records the DataStage Job Monitor reports. The default is 5000 records.
APT_MONITOR_TIME This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the minimum time interval in seconds for generating monitor information at runtime. The default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.
APT_NO_JOBMON Turns off job monitoring entirely.
APT_PM_NO_SHARED_MEMORY By default, shared memory is used for local connections. If this variable is set, named pipes rather than shared memory are used for local connections. If both APT_PM_NO_NAMED_PIPES and APT_PM_NO_SHARED_MEMORY are set, then TCP sockets are used for local connections.
APT_PM_NO_NAMED_PIPES Specifies not to use named pipes for local connections. Named pipes will still be used in other areas of DataStage, including subprocs and setting up of the shared memory transport protocol in the process manager.
APT_RECORD_COUNTS Causes DataStage to print, for each operator Player, the number of records consumed by getRecord() and produced by putRecord(). Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.
APT_NO_PART_INSERTION DataStage automatically inserts partition components in your application to optimize the performance of the stages in your job. Set this variable to prevent this automatic insertion.
APT_NO_SORT_INSERTION DataStage automatically inserts sort components in your job to optimize the performance of the operators in your data flow. Set this variable to prevent this automatic insertion.
APT_SORT_INSERTION_CHECK_ONLY When sorts are inserted automatically by DataStage, if this is set, the sorts will just check that the order is correct; they won't actually sort. This is a better alternative to shutting partition and sort insertion off using APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.
APT_DUMP_SCORE Configures DataStage to print a report showing the operators, processes, and data sets in a running job.
APT_PM_PLAYER_MEMORY Setting this variable causes each player process to report the process heap memory allocation in the job log when returning.
APT_PM_PLAYER_TIMING Setting this variable causes each player process to report its call and return in the job log. The message with the return is annotated with CPU times for the player process.
OSH_DUMP If set, it causes DataStage to put a verbose description of a job in the job log before attempting to execute it.
OSH_ECHO If set, it causes DataStage to echo its job specification to the job log after the shell has expanded all arguments.
OSH_EXPLAIN If set, it causes DataStage to place a terse description of the job in the job log before attempting to run it.
OSH_PRINT_SCHEMAS If set, it causes DataStage to print the record schema of all data sets and the interface schema of all operators in the job log.
APT_STRING_PADCHAR Overrides the pad character of 0x0 (ASCII null), used by default when DataStage extends, or pads, a string field to a fixed length.
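A minimal sketch of setting a few of these variables from a UNIX shell before launching a job; the configuration file path and project/job names are hypothetical:
$ export APT_CONFIG_FILE=/opt/IBM/config/4node.apt   # which parallel configuration file to use
$ export APT_DUMP_SCORE=1                            # print the job score (operators, processes, data sets)
$ export APT_PM_PLAYER_TIMING=1                      # report CPU times per player process
$ dsjob -run -jobstatus dstage1 LoadCustomers
In practice these are more often set as job parameters or as project defaults in the Administrator, as noted above for APT_CONFIG_FILE.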
XML Meta Data Importer reports any syntax and semantic errors when you open a source file. In the following example, the Parser Output pane indicates that at least one quote is missing from line 3.
To highlight the error in the Source pane, double-click the error in the Parser Output pane. After correcting the error outside of the XML Meta Data Importer, you can load the revised source file. To reload the file, choose File Refresh. You can process an XML schema file (.xsd) or an XML document (.xml). The file can be located on your file system or accessed with a URL.
Example
The following input does not include a namespace prefix. Input
<Person xmlns="mynamespace"> <firstName>John</firstName> </Person>
Output
<ns1:Person xmlns:ns1="mynamespace"> <ns1:firstName>John</ns1:firstName> </ns1:Person>
General Rule
In general, the XML Meta Data Importer assigns the prefix defns to the target namespace. For example: <xsd:schema targetNamespace="mynamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:element name="Person"> <xsd:complexType> <xsd:sequence> <xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/> </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:schema> The firstName node generates the following XPath expression: /defns:Person/defns:firstName where defns=mynamespace
<xsd:element name="state" minOccurs="1" maxOccurs="1" /> <xsd:element name="zip" minOccurs="1" maxOccurs="1" /> </xsd:sequence> </xsd:complexType> </xsd:schema> The street node generates the following XPath expression: /defns:Person/defns:address/ns2:street where defns=demonamespace and ns2=othernamespace
The firstName tree node generates the following XPath expression: /Person/firstName
If you select an element box, you get all the sub nodes and the actual content of the element. Your selection is reflected in the Table Definition pane:
An asterisk appears after the title Table Definition when you modify the table definition. It disappears when you save the information.
Transformation Settings
These properties control the values that can be shared by multiple output links of the XML Input stage. They fall into these categories: Requiring the repetition element Processing NULLs and empty values Processing namespaces Formatting extracted XML fragments To use these values with a specific output link, select the Inherit Stage properties box on the Transformation Settings tab of the output link.
XML Output requires XPath expressions to transform tabular data to XML. A table definition stores the XPath expressions. Using the Description property on the Columns pages within the stage, you record or maintain the XPath expressions.
2. Types to find 3. Include descriptions (If checked, the text in short and long descriptions will be searched)
Impact Analysis
Right click over a stage or table definition:
1. Select Find where table definitions used
2. Select Find where table definitions used (deep) Deep includes additional object types
Displays a list of objects using the table definition.
1. Select Find dependencies
2. Select Find dependencies (deep) Deep includes additional object types
Displays a list of objects dependent on the one selected.
Graphical Functionality
1. Display the dependency path
2. Collapse selected objects
3. Move the graphical object
4. Birds-eye view
Comparison
1. Cross project compare
2. Compare against
The two objects that can be compared are 1. Jobs and 2. Table Definitions.
Aggregator Stage
1. Grouping Keys
2. Aggregations Aggregation Type - Count Rows, Calculation, Re-Calculation
Aggregation Type - Count Rows: Count Output Column - Name of the output column which consists of the number of records based on grouping keys
Aggregation Type - Calculation, Re-Calculation: Column for Calculation - Input Column to be selected for calculation
Options
Allow Null Output - True means that NULL is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. False means 0 is output when all input values for the calculation column are NULL.
Method - Hash (Hash table) or Sort (Pre-Sort). The default method is Hash. Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory. Sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys.
Use the Hash method for inputs with a limited number of distinct groups
1. Uses 2K of memory/group
2. Calculations are made for all groups and stored in memory (Hash table structure, hence the name)
3. Incoming data does not need to be pre-sorted
4. Results are output after all rows have been read
5. Useful when the number of unique groups is small
Use the Sort method with a large (or unknown) number of distinct key column values
1. Requires inputs pre-sorted on key columns (Does not perform the sort! Expects the sort)
2. Results are output after each group
3. Can handle an unlimited number of groups
Sort Aggregator - one of the lightweight stages that minimize memory usage by requiring data in key column sort order. Lightweight stages that minimize memory usage by requiring data in key column sort order:
1. Join
2. Merge
3. Sort Aggregator
Sort Stage
DataStage Designer provides two methods for parallel (group) sorting:
1. Sort stage - Parallel Execution
2. Sort on a link when the partitioning is not Auto - Identified by the Sort icon
Both methods use the same tsort operator. Sorting on a link provides easier job maintenance (fewer stages on the job canvas) but fewer options.
The Sort stage offers more options than a link sort. The Sort Utility should be DataStage, as it is faster than the UNIX Sort. Stable sort preserves the order of non-key columns within each sort group but is slightly slower than a non-stable sort. Stable sort is enabled by default on Sort stages but not on link sorts. If disabled, no prior ordering of records is guaranteed to be preserved by the sorting operation.
Sort Key Modes
1. Don't Sort (Previously Sorted) means that input records are already sorted by this key. The Sort stage will then sort on secondary keys, if any.
2. Don't Sort (Previously Grouped) means that input records are already grouped by that key but not sorted
3. Sort Sort by this key
Advantages of Don't Sort (Previously Sorted)
1. Uses significantly less memory/disk
2. Sort is now on previously sorted key column groups, not the entire data set
3. Outputs rows after each group
DataStage provides two methods for generating a sequentially (totally) sorted result
1. Sort stage - Sequential Execution mode
2. Sort Merge Collector
In general a parallel Sort + Sort Merge Collector will be faster than a Sequential Sort. By default the Parallel Framework will insert tsort operators as necessary to ensure correct results. But by setting $APT_SORT_INSERTION_CHECK_ONLY we can force the inserted tsort operator to verify that the data is sorted instead of actually performing the sort operation.
By default each tsort operator (Sort stage, link sort and inserted sort) uses 20MB per partition as an internal memory buffer. But the Sort stage provides the "Restrict Memory Usage" option.
1. Increasing this value can improve performance if the entire (or group) data can fit into memory
2. Decreasing this value may hurt performance, but will use less memory
When the memory buffer is filled, sort uses temporary disk space in the following order
1. Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
2. Scratch disks in the $APT_CONFIG_FILE default disk pool
3. The default directory specified by $TMPDIR
4. The UNIX /tmp directory
Removing Duplicates
Can be done by the Sort stage
Use the unique option
No choice on which duplicate to keep
Stable sort always retains the first row in the group
Non-stable sort is indeterminate
Remove Duplicates stage
Can choose to retain first or last