An Introduction To MultiLoad

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 9

An Introduction to MultiLoad

Why it is called MultiLoad


If we were going to be stranded on an island with a Teradata Data Warehouse and we could only take along one Teradata load utility, clearly, MultiLoad would be our choice. MultiLoad has the capability to load multiple tables at one time from either a LAN or Channel environment. This is in stark contrast to its fleet-footed cousin, FastLoad, which can only load one table at a time. And it gets better, yet! This feature rich utility can perform multiple types of DML tasks, including INSERT, UPDATE, DELETE and UPSERT on up to five (5) empty or populated target tables at a time. These DML functions may be run either solo or in combinations, against one or more tables. For these reasons, MultiLoad is the utility of choice when it comes to loading populated tables in the batch environment. As the volume of data being loaded or updated in a single block, the performance of MultiLoad improves. MultiLoad shines when it can impact more than one row in every data block. In other words, MultiLoad looks at massive amounts of data and says, Bring it on! Leo Tolstoy once said, All happy families resemble each other. Like happy families, the Teradata load utilities resemble each other, although they may have some differences. You are going to be pleased to find that you do not have to learn all new commands and concepts for each load utility. MultiLoad has many similarities to FastLoad. It has even more commands in common with TPump. The similarities will be evident as you work with them. Where there are some quirky differences, we will point them out for you.

Two MultiLoad Modes: IMPORT and DELETE


MultiLoad provides two types of operations via modes: IMPORT and DELETE. In MultiLoad IMPORT mode, you have the freedom to mix and match up to twenty (20) INSERTs, UPDATEs or DELETEs on up to five target tables. The execution of the DML statements is not mandatory for all rows in a table. Instead, their execution hinges upon the conditions contained in the APPLY clause of the script. Once again, MultiLoad demonstrates its user-friendly flexibility. For UPDATEs or

DELETEs to be successful in IMPORT mode, they must reference the Primary Index in the WHERE clause. The MultiLoad DELETE mode is used to perform a global (all AMP) delete on just one table. The reason to use .BEGIN DELETE MLOAD is that it bypasses the Transient Journal (TJ) and can be RESTARTed if an error causes it to terminate prior to finishing. When performing in DELETE mode, the DELETE SQL statementcannot reference the Primary Index in the WHERE clause. This due to the fact that a primary index access is to a specific AMP; this is a global operation. The other factor that makes a DELETE mode operation so good is that it examines an entire block of rows at a time. Once all the eligible rows have been removed, the block is written one time and a checkpoint is written. So, if a restart is necessary, it simply starts deleting rows from the next block without a checkpoint. This is a smart way to continue. Remember, when using the TJ all deleted rows are put back into the table from the TJ as a rollback. A rollback can take longer to finish then the delete. MultiLoad does not do a rollback; it does a restart. The Purpose of DELETE MLOAD

In the above diagram, monthly data is being stored in a quarterly table. To keep the contents limited to four months, monthly data is rotated in and out. At the end of every month, the oldest month of data is removed and the new month is added. The cycle is add a

month, delete a month, add a month, delete a month. In our illustration, that means that January data must be deleted to make room for Mays data. Here is a question for you: What if there was another way to accomplish this same goal without consuming all of these extra resources? To illustrate, lets consider the following scenario: Suppose you have Table A that contains 12 billion rows. You want to delete a range of rows based on a date and then load in fresh data to replace these rows. Normally, the process is to perform a MultiLoad DELETE to DELETE FROM Table A WHERE <date-column> < 2002-02-01. The final step would be to INSERT the new rows for May using MultiLoad IMPORT.

Block and Tackle Approach


MultiLoad never loses sight of the fact that it is designed for functionality, speed, and the ability to restart. It tackles the proverbial I/O bottleneck problem like FastLoad by assembling data rows into 64K blocks and writing them to disk on the AMPs. This is much faster than writing data one row at a time like BTEQ. Fallback table rows are written after the base table has been loaded. This allows users to access the base table immediately upon completion of the MultiLoad while fallback rows are being loaded in the background. The benefit is reduced time to access the data. Amazingly, MultiLoad has full RESTART capability in all of its five phases of operation. Once again, this demonstrates its tremendous flexibility as a load utility. Is it pure magic? No, but it almost seems so. MultiLoad makes effective use of two error tables to save different types of errors and a LOGTABLE that stores built-in checkpoint information for restarting. This is why MultiLoad does not use the Transient Journal, thus averting time-consuming rollbacks when a job halts prematurely. Here is a key difference to note between MultiLoad and FastLoad. Sometimes an AMP (Access Module Processor) fails and the system administrators say that the AMP is down or offline. When using FastLoad, you must restart the AMP to restart the job. MultiLoad, however, can RESTART when an AMP fails, if the table is fallback protected. As the same time, you can use the AMPCHECK option to make it work like FastLoad if you want.

MultiLoad Imposes Limits

Rule #1: Unique Secondary Indexes are not supported on a Target Table. Like FastLoad, MultiLoad does not support Unique Secondary Indexes (USIs). But unlike FastLoad, it does support the use of Non-Unique Secondary Indexes (NUSIs) because the index subtable row is on the same AMP as the data row. MultiLoad uses every AMP independently and in parallel. If two AMPs must communicate, they are not independent. Therefore, a NUSI (same AMP) is fine, but a USI (different AMP) is not. Rule #2: Referential Integrity is not supported. MultiLoad will not load data into tables that are defined with Referential Integrity (RI). Like a USI, this requires the AMPs to communicate with each other. So, RI constraints must be dropped from the target table prior to using MultiLoad. Rule #3: Triggers are not supported at load time. Triggers cause actions on related tables based upon what happens in a target table. Again, this is a multi-AMP operation and to a different table. To keep MultiLoad running smoothly, disable all Triggers prior to using it. Rule #4: No concatenation of input files is allowed. MultiLoad does not want you to do this because it could impact are restart if the files were concatenated in a different sequence or data was deleted between runs. Rule #5: The host will not process aggregates, arithmetic functions or exponentiation. If you need data conversions or math, you might be better off using an INMOD to prepare the data prior to loading it.

Error Tables, Work Tables and Log Tables


Besides target table(s), MultiLoad requires the use of four special tables in order to function. They consist of two error tables (per target table), one worktable (per target table), and one log table. In essence, the Error Tables will be used to store any conversion, constraint or uniqueness violations during a load.Work Tables are used to receive and sort data and SQL on each AMP prior to storing them permanently to disk. A Log Table (also called, Logtable) is used to store successful checkpoints during load processing in case a RESTART is needed. HINT: Sometimes a company wants all of these load support tables to be housed in a particular database. When these tables are to be stored in any database other than the users own default database, then you

must give them a qualified name (<databasename>.<tablename>) in the script or use the DATABASE command to change the current database. Where will you find these tables in the load script? The Logtable is generally identified immediately prior to the .LOGON command. Worktables and error tables can be named in the BEGIN MLOAD statement. Do not underestimate the value of these tables. They are vital to the operation of MultiLoad. Without them a MultiLoad job can not run. Now that you have had the executive summary, lets look at each type of table individually. Two Error Tables: Here is another place where FastLoad and MultiLoad are similar. Both require the use of two error tables per target table. MultiLoad will automatically create these tables. Rows are inserted into these tables only when errors occur during the load process. The first error table is the acquisitionError Table (ET). It contains all translation and constraint errors that may occur while the data is being acquired from the source(s). The second is the Uniqueness Violation (UV) table that stores rows with duplicate values for Unique Primary Indexes (UPI). Since a UPI must be unique, MultiLoad can only load one occurrence into a table. Any duplicate value will be stored in the UV error table. For example, you might see a UPI error that shows a second employee number 99. In this case, if the name for employee 99 is Kara Morgan, you will be glad that the row did not load since Kara Morgan is already in the Employee table. However, if the name showed up as David Jackson, then you know that further investigation is needed, because employee numbers must be unique. Each error table does the following: Identifies errors Provides some detail about the errors Stores the actual offending row for debugging

You have the option to name these tables in the MultiLoad script (shown later). Alternatively, if you do not name them, they default to ET_<target_table_name> and UV_<target_table_name>. In either case, MultiLoad will not accept error table names that are the same as target table names. It does not matter what you name them. It is recommended that you standardize on the naming convention to make it easier for everyone on your team. For more details on how these

error tables can help you, see the subsection in this chapter titled, Troubleshooting MultiLoad Errors. Log Table: MultiLoad requires a LOGTABLE. This table keeps a record of the results from each phase of the load so that MultiLoad knows the proper point from which to RESTART. There is one LOGTABLE for each run. Since MultiLoad will not resubmit a command that has been run previously, it will use the LOGTABLE to determine the last successfully completed step. Work Table(s): MultiLoad will automatically create one worktable for each target table. This means that in IMPORT mode you could have one or more worktables. In the DELETE mode, you will only have one worktable since that mode only works on one target table. The purpose of worktables is to hold two things: The Data Manipulation Language (DML) tasks The input data that is ready to APPLY to the AMPs

The worktables are created in a database using PERM space. They can become very large. If the script uses multiple SQL statements for a single data record, the data is sent to the AMP once for each SQL statement. This replication guarantees fast performance and that no SQL statement will ever be done more than once. So, this is very important. However, there is no such thing as a free lunch, the cost is space. Later, you will see that using a FILLER field can help reduce this disk space by not sending unneeded data to an AMP. In other words, the efficiency of the MultiLoad run is in your hands.

Supported Input Formats


Data input files come in a variety of formats but MultiLoad is flexible enough to handle many of them. MultiLoad supports the following five format options: BINARY, FASTLOAD, TEXT, UNFORMAT and VARTEXT.

BINARY

Each record is a 2-byte integer, n, that is followed byn bytes of data. A byte is the smallest means of storage of for Teradata. This format is the same as Binary, plus a marker (X 0A or X 0D) that specifies the end of the record. Each record has a random number of bytes and is followed by an end of the record marker.

FASTLOAD TEXT

UNFORMAT

The format for these input records is defined in the LAYOUT statement of the MultiLoad script using the components FIELD, FILLER and TABLE. This is variable length text RECORD format separated by delimiters such as a comma. For this format you may only use VARCHAR, LONG VARCHAR (IBM) or VARBYTE data formats in your MultiLoad LAYOUT. Note that two delimiter characters in a row will result in a null value between them.

VARTEXT

Figure 5-1

oading extract using Teradata Multiload


Goal
Load customer extract file using FTP command line script and Teradata MultiLoad tool. The ftp command line tool will fetch the extract from a source system into an ETL server. The MultiLoad Job Script will load the script into a Teradata database.

FTP script
An FTP command line script will be used to get the get_cust_extract.txt control script has the following contents:
open 10.150.193.128 etltoolsinfouser dwpwd1 prompt lcd /dw/input/Dwh_cust_extract.txt mget Dwh_cust_extract*.txt quit

extract

file.

The

Invoke FTP commands in command prompt (Windows or Unix) by issuing the following command: ftp -s:get_cust_extract.txt

Teradata MultiLoad (Mload)


The goal is to create a Teradata MultiLoad job script to import data into an empty table and select data from the table to verify the import task.

The datawarehouse customers extract contains the following records:

MultiLoad is a Teradata utility with a purpose to do fast, high-volume loading on multiple tables and views of a Teradata RDBMS. MultiLoad processes a series of MultiLoad commands and Teradata SQL statements as a batch job script which are used for session control and data transfers. Among others, the MultiLoad script will use the following statements to load the data:

LAYOUT command and the series of FIELD commands specify each field of the data records that are sent to the Teradata database. The DML LABEL command introduces the INSERT statement. IMPORT command starts the import task, by specifying the VARtext format with a field delimiter. It uses the input file parameter and the custdml insert statement. The multiload script is composed of the following statements:
.LOGTABLE dwlogtable; .LOGON tdp1/etltoolsinfo,dwpwd1; .begin import mload tables customers; .layout custlayout; .field ID 1 INTEGER; .field CUST_ID * VARCHAR(6); .field CUST_NAME * VARCHAR(30); .field CUST_GROUP * VARCHAR(30); .field CUST_SEGMENT * VARCHAR(10); .field CUST_COUNTRY_ID * VARCHAR(3); .dml label custdml; insert into customers.*; .import infile /dw/input/Dwh_cust_extract.txt format VARtext ';'

layout custlayout apply custdml; .end mload; .logoff;

To load the extract, the following mload statement needs to be issued:


mload < load_cust_extract.mload

And the following Teradata BTEQ script is invoked (Basic Teradata Query) to verify the MultiLoad loading process:
bteq .logon tdp1/etltoolsinfo,dwpwd1; select count(1) CustCnt from customers; .quit
The query result will show the number of records in the data warehouse customer table: *** Query completed. 1 row found. 1 columns returned. *** Total elapsed time was 1 seconds. CustCnt ----22

You might also like