What Is Partitioned Primary Index (PPI) in Teradata?: Advantages
What Is Partitioned Primary Index (PPI) in Teradata?: Advantages
What Is Partitioned Primary Index (PPI) in Teradata?: Advantages
Partitioned primary index is physically splitting the table into a series of subtables,
one for every partitioning value. When a single row is accessed, it looks first at the
partitioning value to determine the subtable, then at the primary index to calculate
the rowhash for the row(s).
For example, we have PPI on a MONTH Column, the rows of particular months are all
sorted with in the same partition and whenever data is accessed for particular month,
it will retrive the data in a faster way.
It helps to avoid full table scans.
Advantages:
Range queries dont have to utilize a Full Table Scan.
Deletions of entire partitions are lightning fast.
PPI provides an excellent solution instead of using Secondary Indexes
Tables that hold yearly information dont have to be split into 12 smaller tables
to avoid Full Table Scans (FTS). This can make modeling and querying easier.
Fastload and Multiload work with PPI tables, but not with all Secondary Indexes.
Disadvantages:
A two-byte Partition number is added to the ROW-ID and it is now called a ROW
KEY. The two-bytes per row will add more Perm Space to a table.
Joins to Non-Partitioned Tables can take longer and become more complicated
for Teradata to perform.
Basic select queries utilizing the Primary Index can take longer if the Partition
number is not also mentioned in the WHERE clause of the query.
You cant have a Unique Primary Index (UPI) if the Partition Number is not at
least part of the Primary Index. You must therefore create a Unique Secondary
Index to maintain uniqueness.
Sub queries and Correlated Sub queries are two important concepts in Teradata and
used most of the times.
The basic concept behind a subquery is that it retrieves a list of values that are used
for comparison against one or more columns in the main query. Here the subquery is
executed first and based on the result set, the main query will be executed.
For example,
Select empname,deptname from employee where empid IN ( select empid from
salarytable where salary>10000).
In the above query, empid will be choosen first based on the salary in the subquery
and main query will be executed based on the result subset.
It first reads a row from the main query and then goes into the subquery to find the
rows that match the specified column value. Then it goes for the next row from the
main query. This process continues until all the qualifying rows from MAIN query.
For example,
select empname,deptno, salary
from employeetable as emp
where
salary=(select max(salary) from employeetable as emt where emt.deptno=emp.deptno)
Above query returns the highest paid employee from each department. This is also
one of the scenario based questions in teradata.
The operation for a correlated subquery differs from that of a normal subquery.
Instead of comparing the selected subquery values against all the rows in the main
query, the correlated subquery works backward. It first reads a row in the main
query, and then goes into the subquery to find all the rows that match the specified
column value. Then, it gets the next row in the main query and retrieves all the
subquery rows that match the next value in this row. This processing continues until
all the qualifying rows from the main SELECT are satisfied.
Thus, when a second row comes into the comparison that contains the same value as
an earlier row, there is no need to re-read the matching rows again. That operation
has already been done once and the AMPs remember the answer from the first
comparison.
First of all use EXPLAIN plan to see how the query is performing. Keywords like
Product joins, low confidence are measures of poor performance.
Make Sure, STATS are collected on the columns used in WHERE Clause and JOIN
columns. If STATS are collected, explain plan will show HIGH CONFIDENCE This
tells the optimizer about the number of rows in that table which will help the
optimizer to choose the redistribution/duplication of smaller tables.
Check the joining columns & WHERE Clause whether PI, SI or PPI are used.
Check whether proper alias names are used in the joining conditions.
.
COLLECT STATISTICS ON tablename INDEX (columnname) will collect statistics
on an index.
COLLECT STATISTICS ON tablename INDEX (col1, col2, ...) will collect statistics
on multiple columns of an index.
It is a false lock which is applied on the table to prevent two users from getting
conflicting locks with all-AMP requests.
PE will determine an particular AMP to manage all AMP LOCK requests for given table
and Put Pseudo lock on the table.
With more duplicates GROUP BY is more efficient while if we have fewer duplicates
the DISTINCT is efficient.
8. What is spool space and when running a job if it reached the maximum spool
space how you solve the problem in Teradata?
Spool space is the space which is required by the query for processing or to hold the
rows in the answer set. Spool space reaches maximum when the query is not properly
optimized. We must use appropriate condition in WHERE clause and JOIN on correct
columns to optimize the query. Also make sure unnecessary volatile tables are
dropped as it occupies spool space.
The two bytes are for the number of bytes for the binary length of the field.
It stores the exact no of characters stored in varchar.
10. How to make sure BTEQ utility is not erroring out while dropping a table
when the table doesnt exist in Teradata?
MultiLoad delete is faster than normal Delete command, since the deletion happens in
data blocks of 64Kbytes, whereas delete command deletes data row by row.
Transient journal maintains entries only for Delete command since Teradata utilities
doesnt support Transient journal loading
For smaller table deletes, simple DELETE command is enough. Multiload delete is
useful when the delete has to be performed on a large table in teradata.
Say, the fastload job fails. Till the fastload failed, some number of rows was sent to
the AMP's.
Now if you restart FLOAD, it would start loading record from the last checkpoint
and some of the consecutive rows are sent for the second time. These will be caught
as duplicate rows are found after sorting of data.
This restart logic is the reason that Fastload will not load duplicate rows into a
MULTISET table. It assumes they are duplicates because of this logic. Fastload
support Multiset table but does not support the duplicate rows. Multiset tables are
tables that allow duplicate rows. When Fastload finds the duplicate rows it discards
it. Fast Load can load data into multiset table but will not load the duplicate rows
You can use the IDENTITY function. Create your table, and when you define the
columns, define an integer like:
My_ID_col INTEGER GENERATED ALWAYS AS IDENTITY (START WITH 1
INCREMENT BY 1 MINVALUE -2147483647 MAXVALUE 100000000 NO CYCLE);
13. A certain load is being imposed on the table and that too,
every hour. The traffic in the morning is relatively low, and that of the
night is very high. As per this situation, which is the most advisable utility
and how is that utility supposed to be loaded?
The most suggestible utility here has to be Tpump. By making use of packet size
decreasing or increasing, the traffic can be easily handled.
14. Highlight a few of the advantages that ETL tools have over TD.