An Introduction To MultiLoad
For DELETEs to be successful in IMPORT mode, they must reference the Primary Index in the WHERE clause. The MultiLoad DELETE mode, by contrast, is used to perform a global (all-AMP) delete on just one table. The reason to use .BEGIN DELETE MLOAD is that it bypasses the Transient Journal (TJ) and can be RESTARTed if an error causes it to terminate prior to finishing. When performing in DELETE mode, the DELETE SQL statement cannot reference the Primary Index in the WHERE clause. This is due to the fact that a Primary Index access goes to one specific AMP, whereas this is a global operation. The other factor that makes a DELETE mode operation so efficient is that it examines an entire block of rows at a time. Once all the eligible rows have been removed, the block is written one time and a checkpoint is written. So, if a restart is necessary, MultiLoad simply resumes deleting rows with the next block after the last checkpoint. This is a smart way to continue. Remember, when using the TJ, all deleted rows are put back into the table from the TJ as a rollback. A rollback can take longer to finish than the delete itself. MultiLoad does not do a rollback; it does a restart.

The Purpose of DELETE MLOAD
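As a sketch, a DELETE-mode job might look like the following. The logtable, logon string, table and column names here are all hypothetical; the point is the structure: the DELETE statement sits between .BEGIN DELETE MLOAD and .END MLOAD, and its WHERE clause does not touch the Primary Index.

```
.LOGTABLE restart_log_mld;
.LOGON tdp1/loaduser,loadpwd;
.BEGIN DELETE MLOAD TABLES Quarterly_Sales;

   /* No Primary Index column in the WHERE clause --
      this is a global, all-AMP operation           */
   DELETE FROM Quarterly_Sales
   WHERE  Sale_Date < '2002-02-01';

.END MLOAD;
.LOGOFF;
```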
In the above diagram, monthly data is being stored in a quarterly table. To keep the contents limited to four months, monthly data is rotated in and out. At the end of every month, the oldest month of data is removed and the new month is added. The cycle is add a month, delete a month, add a month, delete a month. In our illustration, that means that January data must be deleted to make room for May's data. Here is a question for you: What if there was another way to accomplish this same goal without consuming all of these extra resources? To illustrate, let's consider the following scenario: Suppose you have Table A that contains 12 billion rows. You want to delete a range of rows based on a date and then load in fresh data to replace these rows. Normally, the process is to perform a MultiLoad DELETE to DELETE FROM Table A WHERE <date-column> < '2002-02-01'. The final step would be to INSERT the new rows for May using MultiLoad IMPORT.
Rule #1: Unique Secondary Indexes are not supported on a Target Table. Like FastLoad, MultiLoad does not support Unique Secondary Indexes (USIs). But unlike FastLoad, it does support the use of Non-Unique Secondary Indexes (NUSIs), because the index subtable row is on the same AMP as the data row. MultiLoad uses every AMP independently and in parallel. If two AMPs must communicate, they are not independent. Therefore, a NUSI (same AMP) is fine, but a USI (different AMP) is not. Rule #2: Referential Integrity is not supported. MultiLoad will not load data into tables that are defined with Referential Integrity (RI). Like a USI, this requires the AMPs to communicate with each other. So, RI constraints must be dropped from the target table prior to using MultiLoad. Rule #3: Triggers are not supported at load time. Triggers cause actions on related tables based upon what happens in a target table. Again, this is a multi-AMP operation, and to a different table. To keep MultiLoad running smoothly, disable all Triggers prior to using it. Rule #4: No concatenation of input files is allowed. MultiLoad does not want you to do this because it could impact a restart if the files were concatenated in a different sequence or data was deleted between runs. Rule #5: The host will not process aggregates, arithmetic functions or exponentiation. If you need data conversions or math, you might be better off using an INMOD to prepare the data prior to loading it.
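Rules #1 through #3 can be satisfied with ordinary DDL before the load starts. The following is only a sketch with hypothetical object names, and the exact syntax of the RI clause can vary by Teradata release:

```
/* Rule #1: drop the Unique Secondary Index before the load */
DROP INDEX emp_ssn_usi ON Employee_Table;

/* Rule #2: drop the Referential Integrity constraint */
ALTER TABLE Employee_Table
  DROP FOREIGN KEY (Dept_No) REFERENCES Department_Table;

/* Rule #3: disable the trigger until the load completes */
ALTER TRIGGER Employee_Audit_Trig DISABLED;
```

After the MultiLoad job finishes, the index and constraint are recreated and the trigger re-enabled (ALTER TRIGGER ... ENABLED).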
must give them a qualified name (<databasename>.<tablename>) in the script or use the DATABASE command to change the current database. Where will you find these tables in the load script? The Logtable is generally identified immediately prior to the .LOGON command. Worktables and error tables can be named in the BEGIN MLOAD statement. Do not underestimate the value of these tables. They are vital to the operation of MultiLoad. Without them a MultiLoad job cannot run. Now that you have had the executive summary, let's look at each type of table individually. Two Error Tables: Here is another place where FastLoad and MultiLoad are similar. Both require the use of two error tables per target table. MultiLoad will automatically create these tables. Rows are inserted into these tables only when errors occur during the load process. The first error table is the Acquisition Error Table (ET). It contains all translation and constraint errors that may occur while the data is being acquired from the source(s). The second is the Uniqueness Violation (UV) table, which stores rows with duplicate values for Unique Primary Indexes (UPI). Since a UPI must be unique, MultiLoad can only load one occurrence into a table. Any duplicate value will be stored in the UV error table. For example, you might see a UPI error that shows a second employee number 99. In this case, if the name for employee 99 is Kara Morgan, you will be glad that the row did not load, since Kara Morgan is already in the Employee table. However, if the name showed up as David Jackson, then you know that further investigation is needed, because employee numbers must be unique. Each error table does the following: Identifies errors Provides some detail about the errors Stores the actual offending row for debugging
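Because the error tables are ordinary Teradata tables, they can be inspected with plain SQL after the job runs. A sketch, assuming the default ET_ naming convention and a target table called Employee_Table (error-table column names such as ErrorCode and ErrorField can vary by release, so check DBC or a HELP TABLE first):

```
/* Summarize acquisition errors by error code and offending field */
SELECT ErrorCode, ErrorField, COUNT(*)
FROM   ET_Employee_Table
GROUP  BY 1, 2
ORDER  BY 3 DESC;
```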
You have the option to name these tables in the MultiLoad script (shown later). Alternatively, if you do not name them, they default to ET_<target_table_name> and UV_<target_table_name>. In either case, MultiLoad will not accept error table names that are the same as target table names. It does not matter to MultiLoad what you name them, but it is recommended that you standardize on a naming convention to make it easier for everyone on your team. For more details on how these
error tables can help you, see the subsection in this chapter titled "Troubleshooting MultiLoad Errors." Log Table: MultiLoad requires a LOGTABLE. This table keeps a record of the results from each phase of the load so that MultiLoad knows the proper point from which to RESTART. There is one LOGTABLE for each run. Since MultiLoad will not resubmit a command that has been run previously, it will use the LOGTABLE to determine the last successfully completed step. Work Table(s): MultiLoad will automatically create one worktable for each target table. This means that in IMPORT mode you could have one or more worktables. In DELETE mode, you will only have one worktable, since that mode works on only one target table. The purpose of worktables is to hold two things: The Data Manipulation Language (DML) tasks The input data that is ready to APPLY to the AMPs
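For example, the worktable and both error tables can be named explicitly in the .BEGIN MLOAD statement. The names below are hypothetical; any table left unnamed falls back to the defaults described above:

```
.BEGIN IMPORT MLOAD
   TABLES      Employee_Table
   WORKTABLES  WT_Employee_Table
   ERRORTABLES ET_Employee_Table UV_Employee_Table;
```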
The worktables are created in a database using PERM space, and they can become very large. If the script uses multiple SQL statements for a single data record, the data is sent to the AMP once for each SQL statement. This replication guarantees fast performance and that no SQL statement will ever be done more than once. So, this is very important. However, there is no such thing as a free lunch; the cost is space. Later, you will see that using a FILLER field can help reduce this disk space by not sending unneeded data to an AMP. In other words, the efficiency of the MultiLoad run is in your hands.
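For instance, in a hypothetical layout, input columns that no SQL statement will APPLY can be declared as FILLER so they are parsed from the input record but never sent to the AMPs or stored in the worktable:

```
.LAYOUT cust_layout;
.FIELD  CUST_ID     * VARCHAR(6);
.FILLER AUDIT_STAMP * VARCHAR(26);  /* read from the input record,
                                       but never sent to the AMPs */
.FIELD  CUST_NAME   * VARCHAR(30);
```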
MultiLoad accepts the following input record formats:

BINARY: Each record is a 2-byte integer, n, that is followed by n bytes of data. A byte is the smallest unit of storage for Teradata.

FASTLOAD: This format is the same as BINARY, plus a marker (X'0A' or X'0D') that specifies the end of the record.

TEXT: Each record has an arbitrary number of bytes and is followed by an end-of-record marker.

UNFORMAT: The format of these input records is defined in the LAYOUT statement of the MultiLoad script using the components FIELD, FILLER and TABLE.

VARTEXT: This is variable-length text RECORD format separated by delimiters such as a comma. For this format you may only use VARCHAR, LONG VARCHAR (IBM) or VARBYTE data formats in your MultiLoad LAYOUT. Note that two delimiter characters in a row will result in a null value between them.
Figure 5-1
FTP script
An FTP command-line script will be used to get the extract file. The get_cust_extract.txt control script has the following contents:

open 10.150.193.128
etltoolsinfouser
dwpwd1
prompt
lcd /dw/input/
mget Dwh_cust_extract*.txt
quit
Invoke the FTP commands from the command prompt by issuing the following command (the -s: switch is the Windows syntax; most Unix ftp clients read the same command file from standard input instead): ftp -s:get_cust_extract.txt
MultiLoad is a Teradata utility whose purpose is fast, high-volume loading of multiple tables and views in a Teradata RDBMS. MultiLoad processes a series of MultiLoad commands and Teradata SQL statements as a batch job script; the commands handle session control and data transfers, while the SQL statements perform the actual maintenance. Among others, the MultiLoad script will use the following statements to load the data:
The .LAYOUT command and its series of .FIELD commands specify each field of the data records that are sent to the Teradata database. The .DML LABEL command introduces the INSERT statement. The .IMPORT command starts the import task by specifying the VARTEXT format with a field delimiter; it uses the input file parameter and the custdml insert statement. The MultiLoad script is composed of the following statements:
.LOGTABLE dwlogtable;
.LOGON tdp1/etltoolsinfo,dwpwd1;
.BEGIN IMPORT MLOAD TABLES customers;
.LAYOUT custlayout;
.FIELD ID              * VARCHAR(10);  /* VARTEXT layouts accept only
                                          VARCHAR/VARBYTE fields     */
.FIELD CUST_ID         * VARCHAR(6);
.FIELD CUST_NAME       * VARCHAR(30);
.FIELD CUST_GROUP      * VARCHAR(30);
.FIELD CUST_SEGMENT    * VARCHAR(10);
.FIELD CUST_COUNTRY_ID * VARCHAR(3);
.DML LABEL custdml;
INSERT INTO customers.*;
.IMPORT INFILE /dw/input/Dwh_cust_extract.txt
   FORMAT VARTEXT ';'
   LAYOUT custlayout
   APPLY custdml;
.END MLOAD;
.LOGOFF;
And the following Teradata BTEQ (Basic Teradata Query) script is invoked to verify the MultiLoad loading process:
bteq
.logon tdp1/etltoolsinfo,dwpwd1;
select count(1) CustCnt from customers;
.quit
The query result will show the number of records in the data warehouse customer table:

*** Query completed. 1 row found. 1 columns returned.
*** Total elapsed time was 1 seconds.

    CustCnt
    -------
         22