What Is Partitioned Primary Index (PPI) in Teradata?: Advantages

Download as odt, pdf, or txt
Download as odt, pdf, or txt
You are on page 1of 5
At a glance
Powered by AI
Some key takeaways from the document are that Partitioned Primary Index (PPI) in Teradata physically splits tables into subtables based on partitioning values like months. This allows for faster access of data within partitions and avoids full table scans. Subqueries retrieve a list of values used for comparison in the main query, while correlated subqueries determine rows to select based on values from another table by combining subquery and join processing.

Advantages of PPI include avoiding full table scans for range queries, fast deletions of partitions, and an alternative to secondary indexes. Disadvantages include added space for the partition number and potential slower joins/queries if the partition is not mentioned. Unique indexes are also not possible without the partition as part of the primary index.

A subquery retrieves a list of values independently from the main query, while a correlated subquery works with each row from the main query to find matching rows and continues this process for each main query row. Correlated subqueries can be an efficient way to find rows like the highest paid employee per department.

1. What is Partitioned Primary Index (PPI) in Teradata?

Partitioned primary index is physically splitting the table into a series of subtables,
one for every partitioning value. When a single row is accessed, it looks first at the
partitioning value to determine the subtable, then at the primary index to calculate
the rowhash for the row(s).

For example, we have PPI on a MONTH Column, the rows of particular months are all
sorted with in the same partition and whenever data is accessed for particular month,
it will retrive the data in a faster way.
It helps to avoid full table scans.

2. What are the advantages and disadvantages of PPI in Teradata?

Advantages:
Range queries dont have to utilize a Full Table Scan.
Deletions of entire partitions are lightning fast.
PPI provides an excellent solution instead of using Secondary Indexes
Tables that hold yearly information dont have to be split into 12 smaller tables
to avoid Full Table Scans (FTS). This can make modeling and querying easier.
Fastload and Multiload work with PPI tables, but not with all Secondary Indexes.

Disadvantages:
A two-byte Partition number is added to the ROW-ID and it is now called a ROW
KEY. The two-bytes per row will add more Perm Space to a table.
Joins to Non-Partitioned Tables can take longer and become more complicated
for Teradata to perform.
Basic select queries utilizing the Primary Index can take longer if the Partition
number is not also mentioned in the WHERE clause of the query.
You cant have a Unique Primary Index (UPI) if the Partition Number is not at
least part of the Primary Index. You must therefore create a Unique Secondary
Index to maintain uniqueness.

3. SubQuery and Correlated Subquery in teradata?

Sub queries and Correlated Sub queries are two important concepts in Teradata and
used most of the times.
The basic concept behind a subquery is that it retrieves a list of values that are used
for comparison against one or more columns in the main query. Here the subquery is
executed first and based on the result set, the main query will be executed.

For example,
Select empname,deptname from employee where empid IN ( select empid from
salarytable where salary>10000).
In the above query, empid will be choosen first based on the salary in the subquery
and main query will be executed based on the result subset.

Correlated Subquery is an excellent technique to use when there is a need to


determine which rows to SELECT based on one or more values from another table. It
combines subquery processing and Join processing into a single request.

It first reads a row from the main query and then goes into the subquery to find the
rows that match the specified column value. Then it goes for the next row from the
main query. This process continues until all the qualifying rows from MAIN query.

For example,
select empname,deptno, salary
from employeetable as emp
where
salary=(select max(salary) from employeetable as emt where emt.deptno=emp.deptno)

Above query returns the highest paid employee from each department. This is also
one of the scenario based questions in teradata.
The operation for a correlated subquery differs from that of a normal subquery.
Instead of comparing the selected subquery values against all the rows in the main
query, the correlated subquery works backward. It first reads a row in the main
query, and then goes into the subquery to find all the rows that match the specified
column value. Then, it gets the next row in the main query and retrieves all the
subquery rows that match the next value in this row. This processing continues until
all the qualifying rows from the main SELECT are satisfied.

Although this sounds terribly inefficient and is inefficient on other databases,


it is extremely efficient in Teradata. This is due to the way the AMPs handle this
type of request. The AMPs are smart enough to remember and share each value that
is located.

Thus, when a second row comes into the comparison that contains the same value as
an earlier row, there is no need to re-read the matching rows again. That operation
has already been done once and the AMPs remember the answer from the first
comparison.

4. What are the Performance improvement techniques available in Teradata?

First of all use EXPLAIN plan to see how the query is performing. Keywords like
Product joins, low confidence are measures of poor performance.

Make Sure, STATS are collected on the columns used in WHERE Clause and JOIN
columns. If STATS are collected, explain plan will show HIGH CONFIDENCE This
tells the optimizer about the number of rows in that table which will help the
optimizer to choose the redistribution/duplication of smaller tables.

Check the joining columns & WHERE Clause whether PI, SI or PPI are used.

Check whether proper alias names are used in the joining conditions.

Split the queries into smaller subsets in case of poor performance.

Collect Statistics Syntax in Teradata


The following are the Collect Statistics Syntaxes in Teradata.

COLLECT STATISTICS ON tablename COLUMN columnname; will collect statistics


on a column.

.
COLLECT STATISTICS ON tablename INDEX (columnname) will collect statistics
on an index.

COLLECT STATISTICS ON tablename INDEX (col1, col2, ...) will collect statistics
on multiple columns of an index.

HELP STATISTICS tablename; will display the number of distinct values of


the columns.

COLLECT STATISTICS table name; refreshes (recollects) the table statistics.

DROP STATISTICS ON tablename ...;" will drop the statistics.

5. What does Pseudo Table Locks mean in EXPLAIN Plan in Teradata?

It is a false lock which is applied on the table to prevent two users from getting
conflicting locks with all-AMP requests.
PE will determine an particular AMP to manage all AMP LOCK requests for given table
and Put Pseudo lock on the table.

6. Which is more efficient GROUP BY or DISTINCT to find duplicates in


Teradata?

With more duplicates GROUP BY is more efficient while if we have fewer duplicates
the DISTINCT is efficient.

7. What is the difference between TIMESTAMP (0) and TIMESTAMP (6) in


teradata?
Both have the Date and Time Values. The major difference is that TIMESTAMP (6)
has microsecond too.

8. What is spool space and when running a job if it reached the maximum spool
space how you solve the problem in Teradata?

Spool space is the space which is required by the query for processing or to hold the
rows in the answer set. Spool space reaches maximum when the query is not properly
optimized. We must use appropriate condition in WHERE clause and JOIN on correct
columns to optimize the query. Also make sure unnecessary volatile tables are
dropped as it occupies spool space.

9. Why does varchar occupy 2 extra bytes?

The two bytes are for the number of bytes for the binary length of the field.
It stores the exact no of characters stored in varchar.

10. How to make sure BTEQ utility is not erroring out while dropping a table
when the table doesnt exist in Teradata?

Setting the error level to 0 will achieve this.


ERRORLEVEL (3807) SEVERITY 0;
DROP TABLE TABLENAME;
ERRORLEVEL (3807) SEVERITY 8;

10. Which is faster MultiLoad delete or Delete command in


Teradata?

MultiLoad delete is faster than normal Delete command, since the deletion happens in
data blocks of 64Kbytes, whereas delete command deletes data row by row.
Transient journal maintains entries only for Delete command since Teradata utilities
doesnt support Transient journal loading

For smaller table deletes, simple DELETE command is enough. Multiload delete is
useful when the delete has to be performed on a large table in teradata.

11. Why Fload doesnt support multiset table in Teradata?

Fload does not support Multiset table because of restart capability.

Say, the fastload job fails. Till the fastload failed, some number of rows was sent to
the AMP's.
Now if you restart FLOAD, it would start loading record from the last checkpoint
and some of the consecutive rows are sent for the second time. These will be caught
as duplicate rows are found after sorting of data.

This restart logic is the reason that Fastload will not load duplicate rows into a
MULTISET table. It assumes they are duplicates because of this logic. Fastload
support Multiset table but does not support the duplicate rows. Multiset tables are
tables that allow duplicate rows. When Fastload finds the duplicate rows it discards
it. Fast Load can load data into multiset table but will not load the duplicate rows

12. How to generate a sequence column in Teradata?

You can use the IDENTITY function. Create your table, and when you define the
columns, define an integer like:
My_ID_col INTEGER GENERATED ALWAYS AS IDENTITY (START WITH 1
INCREMENT BY 1 MINVALUE -2147483647 MAXVALUE 100000000 NO CYCLE);

13. A certain load is being imposed on the table and that too,
every hour. The traffic in the morning is relatively low, and that of the
night is very high. As per this situation, which is the most advisable utility
and how is that utility supposed to be loaded?

The most suggestible utility here has to be Tpump. By making use of packet size
decreasing or increasing, the traffic can be easily handled.

14. Highlight a few of the advantages that ETL tools have over TD.

Some of the advantages that ETL tools have over TD are: -

Multiple heterogeneous destinations, as well as sources can be operated.


Debugging process is much easier with the help of ETL tools owing to full-
fledged GUI support.
Components of ETL tools can be easily reused, and as a result, if there is an
update to the main server, then all the corresponding applications connected to
the server are updated automatically.
De-pivoting and pivoting can be easily done using ETL tools.

15. How many sessions of MAX is PE capable of handling at a


particular time?

PE can handle a total of 120 sessions at a particular point of time.

You might also like