The Teradata Database - Part 3 Usage Fundamentals PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

The Teradata Database

Part 3: Usage Fundamentals

1
Welcome to the third in the series of courses designed to provide you with an introduction to the
Teradata Database.
The first course, Database Basics, served as a refresher, or provided a foundation, to those who are new
to the world of databases. The second course, Teradata Database Fundamentals, builds on that
foundation and transitions to exploring the functionality. This course will take your knowledge one step
further.
By the end of this course you should be able to:
• Identify some common methods of handling analytics
• Explain how Teradata Everywhere works
• Identify some Teradata concepts
• Recognize the functions and types of compression
• Identify tools and utilities used for distribution and access of data
• Explain the uses of secondary and join indices

2
Module 1 - Teradata and The SQL Engine
The following will be covered during this module:

• Common methods and examples of handling analytics


• Teradata EverywhereTM - what it is and how it works
• Query Types
• Row vs. Set Processing
• Response Time vs Throughput

From Teradata Database to SQL Engine

Teradata provides access to the most popular analytic tools from Teradata's SQL-based tools used by
power users. The foundation of the Teradata SQL Engine is shown here.

Types of Analysis

3
More about types of analysis

Data Processing is on the Move


In the previous course we discussed how the use of analytics is on the rise not only in the corporate
business world, but everywhere such as professional sports and social media. The following are several
common methods of handling analytics.

Type Examples Number of Rows Response Time


Accessed
OLTP On-Line Transaction Processing (OLTP) Small Seconds
OLTP’s primary system features are
providing immediate client feedback and
supporting or facilitating high
transaction-oriented applications.
DSS Decision Support System (DSS) Large Seconds or minutes
DSS provide management at all levels of
the Organization, with information
which supports understanding of the
current Business position.
OLAP On-line Analytical Processing (OLAP) Large number of detail Seconds or minutes
With OLAP, the user may be looking for rows or moderate number
historical trends, sales rankings or of summary rows
seasonal inventory fluctuations for the
entire corporation. Usually, this involves
a lot of detailed data to be retrieved,
processed and analyzed. Therefore,
response time can be in seconds or
minutes.
Data Data Mining (DM) Moderate to large Phase 1: Minutes or
Mining DM (predictive modeling) involves number of detailed hours
analyzing moderate to large amounts of historical rows Phase 2: Seconds or
detailed historical data to detect less
behavioral patterns (e.g., buying,
attrition, or fraud patterns) that are
used to predict future behavior.

4
Type Examples
Update a checking account to reflect a deposit.
OLTP
Debit transaction takes place against current balance to reflect
amount of money withdrawn at ATM.

How many child size blue jeans were sold across all of
DSS
our Eastern stores in the month of March?
What were the monthly sale shoes for retailer X?

OLAP Show the top ten selling items across all stores for 2018.
Show a comparison of sales from this week to last week.

Data
Which customers are most likely to leave?
Mining
Which customers are most likely to respond to this promotion?

Both cursor and set processing define sets of rows of the data to process; but, while a cursor processes
the rows sequentially, set processing takes its sets all at once. Both types of processing can be
performed with a single command.

The Advantage of Using Detailed Data

Until recently, most business decisions were based on summary data. The problem is that summarized
data cannot answer some questions with accuracy. Here's a real-life example that most of us can relate
to.

Your monthly bank statement records your checking account activity. What if your bank statement only
showed you the total amount of deposits and withdrawals and you wanted to know if a certain check
had cleared? To answer that question, you would need a list of every check received by your bank.

This checking account information is detail data.

In order to perform analytics and answer business questions, decision-makers must have four things:
• The right data
• Enough detail data
• Proper data structure
• Enough computer power to access and produce reports on the data

Detail data provides a more accurate and granular information which results in the ability to make the
best informed business decisions.

Next let’s take a look at how Teradata is helping customers meet their data and analytic needs.

5
Teradata Everywhere™

Teradata EverywhereTM is a flexible, agile, and


scalable way to ensure a high return on analytic
investments.

Enables analytic users throughout the organization to use


their preferred analytic tools and engines across data
sources, at scale.

Deploy Anywhere
Provides analytic processing across flexible deployment
options, including the Teradata Cloud and public clouds, as
well as on-premises via Teradata hardware or commodity
hardware.

Empowers companies to purchase software in more


accommodating ways based on specific use cases through
simplified pricing bundles, subscription-based licenses, and
as-a-service option.

Future-proofs buying decisions by taking advantage of our


software license portability that provides flexibility to run
analytics across deployment options.

Query Types

Strategic queries face different types of scalable performance challenges compared to direct access
queries, such as OLTP or highly tuned tactical transactions.

Direct access queries process very few rows, and the work is done in a localized area, usually involving
one or a small number of parallel units.

Analytic queries read considerably more data, often scanning all of the rows of one or many tables and
perform significantly more complex operations on the data.

6
Row vs. Set Processing

Row-by-row processing can make for a slow query because one row is retrieved at a time, all
calculations are performed on it, then it is updated or inserted. Then the next row is retrieved and
processed, and so on.

For example, let's say that you wanted to sum all payment rows with balances of $100 or less. Using
row-by-row processing rows are selected one by one looking for those that meet the criteria.

Set processing, on the other hand, processes relational data set-by-set, without a cursor. A single SQL
statement completely processes all rows that meet the condition as a set.

For example, let's say that you wanted to sum all payment rows with balances of $100 or less. With set
processing, a single SQL statement completely processes all rows that meet the condition as a set.

With sufficient rows to process, this can be 10 to 30 or more times faster than the row-at-a-time
processing.

Response Time vs. Throughput

When we think of how fast a transaction is and determining the speed of that transaction, we can also
measure how long it takes to complete the task or you can measure how much gets done per unit time.
These terms are called Response Time and Throughput.

Response Time Throughput


Measures the average duration of Measures quantity of queries completed during a time
queries interval
A measure of process completion A measure of the amount of work processed
Is how long the processing takes How many queries were processed
The elapsed time per query The number of queries executed in an hour

Response Time
The phrase “response time” is often used in operating system contexts and referred to as access
time, transmission time, or execution time, depending on the context.

Throughput
For operating systems, throughput is often measured as tasks or transactions per unit time. For storage
systems or networks, throughput is measured as bytes or bits per unit time. For processors, the number
of instructions executed per unit time is an important component of performance.

7
Request for Data

Hopefully it is apparent how important accurate and quick return of data is. Let’s take a look at exactly
how a request for data is processed. This graphic simply demonstrates the path the request takes
through the process.

Let's say that you have received a request from your boss who wants to
know how many widgets had more than 15% profit margin in the last
month, but he only wants to know this for the Eastern region (USA).
Below is how the request would process:

1. A SQL request is sent from the client to Teradata.


2. The Request is passed to the Parsing Engine(s). The Parsing Engine(s) parse the request into AMP
steps.
3. The Parsing Engine(s) Dispatcher sends steps to the AMPs over the BYNET.
4. AMPs perform operations on the data.
5. Response is sent back to the Parsing Engines over the BYNET (MPP-Multi-node system).
6. Parsing Engine Dispatcher receives the response.
7. A response to the request is returned to the client.

Summary

So far you have learned more about how the Teradata Database is used for analytics. Additionally, you
learned about Teradata EverywhereTM and how it works.
In the next module you will review some additional Teradata concepts.

8
Module 2 - Additional Teradata Concepts
The following will be covered during this module:

• Teradata Virtual Storage


• Storage Performance vs. Data Temperature
• Multi-temperature Data
• Teradata Intelligent Memory
• Teradata Compression
• Linear Scalability

Teradata Virtual Storage

Now that we have done a short review of the physical components,


lets take a look at how some of those are used in other facets of the
database.

Teradata Virtual Storage maintains statistics on frequency of data


access (e.g., data temperature) and on the performance characteristics
of the disk. Data is referred to on a continuum of hot to cold.

Teradata Virtual Storage solves the need for:


near real-time analytics on current, frequently accessed hot data.
strategic analytics on large amounts of history, compliance, and
archive data that are less frequently accessed cold data.

Let's take a quick look at how that works.

Traditional Approach

Each AMP owns the same number of specific disk drives and places its data on those drives – all drives
are required to be identical in a clique. Data is placed on each drive without consideration of data
characteristics or usage.

9
Teradata Virtual Storage

With Teradata Virtual Storage, the storage is owned by Teradata Virtual Storage and is allocated to
AMPs in small pieces from a shared pool of disks in the clique. Cliques can now have multiple drive sizes
and types. Data is automatically and transparently migrated within storage based on data temperature.

Storage Performance vs. Data

As data is accessed less frequently it is referred to as data aging or becoming "cooler". But just
because data is older doesn't imply that it will only continue to get cooler. In fact, cooler data
can become warm/hot again as access increases. As portions of the data change from hot to
cold or vice versa, Teradata Virtual Storage automatically moves the data to the most
appropriate storage type. This can improve system performance over time.

For example, sales data from this quarter may remain hot until several months in the future as current
quarter/previous quarter comparisons are run in different areas of the business. After six months, that
sales data may cool off until nearly a year later when it becomes increasingly queried (i.e., becomes
hotter) by comparison against the new current quarter's sales data.

Multi-Temperature Data

To summarize, Teradata Virtual Storage (TVS) enhances performance with multi-temperature data
placement by placing the hot data on fast storage and cold (frozen) data on the slow storage. The key
concepts of multi-temperature data are performance of storage (differences between drives on
different controllers, hard disk drives - HDD vs. solid state drives SSD) and data access patterns
determined by the frequency of access and updates.

10
This diagram illustrates that TVS can, in a hybrid disk drive system, place the hot data (red) onto
faster solid-state drives (SSD) and the cooler less used data (green) onto slower hard disk drives
(HDD).

Teradata Intelligent Memory

Whenever data on the very hot data list is read from disk during query processing, a copy is kept in
Teradata Intelligent Memory's (TIM) extended memory area for future use.

When data in Intelligent Memory is needed by another query, Teradata Database automatically looks to
Intelligent Memory, eliminating the need for solid state disk (SDD) or hard disk drive (HDD) I/O, which is
up to 3,000 times slower than memory access.

If hotter data is used, it replaces the coolest data in Intelligent Memory so that the very hottest data is
always available in memory for fast query performance with the lowest possible amount of disk I/O. But,
since data is often used actively for an extended period, data is frequently kept in Intelligent Memory for
hours, days, or even weeks at a time and used many thousands of times without further physical disk
I/O.

11
Teradata Compression Comparison

Why would I want to use compression?

Compression is useful to free up additional disk space, but it may or may not enhance performance
depending on the type of compression used, the workload, the frequency of data access, and the system
capabilities. Compression generally affects performance as follows:

• The cycle of decompressing and recompressing required to access and store data may
significantly increase CPU usage compared to uncompressed data.
• Storing data in dense, compressed form may reduce I/O requirements.

Now let's take a look at some types of compression.

If the table is stored in columnar format - an organization can choose any combination or all three on a
column/table.

12
Linear Scalability

Teradata is linearly expandable. Linear scalability enables systems to grow in order to support more
users, data, queries, and query complexity without experiencing performance degradation. As the
configurations grow, performance increase happens linearly. It is smooth and predictable.

For example, a customer


wants to double their
performance.

If the amount of data and


the number of queries
stay the same, then they
could just double the
number of AMPs in their
system and thus the
performance would
potentially double… as in a
Linear growth!

Summary

So far you have learned about the Teradata and the SQL Engine and reviewed some additional Teradata
concepts.

In the next module you will learn about data distribution and access.

13
Module 3 - Data Distribution and Access
The following will be covered during this module:

• Accessing data
• Secondary and Join Indexes
• How the Teradata Database distributes rows
• Row distribution vs. hashing
• UPI NUPI row distribution
• Row partitioning
• Normalization and Referential integrity

Comparison of Primary and Secondary Indexes

Before we move into accessing data, let’s first do a review of primary and secondary indexes.

P *Primary indexes are required in a table, with one exception (NoPI tables discussed later) - Primary
indexes are used to distribute rows to an AMP - Secondary indexes are optional.

The table below shows a comparison of the two.

14
Comparison of Primary and Secondary Indexes

Accessing Data
Data is accessed in a number of different ways. The Optimizer chooses the fastest access method based
on a case-by-case basis to determine which method is more efficient.

Unique Primary Index (UPI)


The UPI is the most efficient way to access data. Accessing data through a UPI is a one-AMP operation
that leads directly to the single row with the desired UPI value.

Unique Secondary Index (USI)


A USI is a very efficient way to access data. Data access through a USI is always a two-AMP operation,
which leads directly to the single row with the desired USI value.

USI retrieval will always retrieve one row at most whereas the number of rows a NUPI will retrieve
depends upon the selectivity of the PI value. The performance of a NUPI will depend upon how non
unique the PI is.

Non-Unique Primary Index (NUPI)


Accessing data through a NUPI is a one-AMP operation that may lead to multiple rows with the desired
NUPI value. The system creates a spool file during a NUPI access, if needed. NUPI access is efficient if the
number of physical block reads is small.

No Primary Index (NoPI)


A NoPI table is simply a table without a primary index. As rows are inserted into a NoPI table, they are
appended at the end of the table and never inserted in the middle of a hash sequence. Retrievals are
likely to be full table scans (FTS) unless other index or table structures are in place e.g. columnar.

Non-Unique Secondary Index (NUSI)


The non-unique secondary index (NUSI) is efficient only if the number of rows accessed is a small
percentage of the total data rows in the table. NUSI access is an all-AMPs operation since the NUSI
subtables must be scanned on each AMP. It is a multiple rows operation since there can be many rows
per NUSI value. A spool file will be created, if needed.

Full-Table Scan (FTS)


The Full-Table Scan is efficient in that each row is scanned only once. Although index access is generally
preferred to a FTS, there are cases where they are the best way to access the data. Like the situation
with NUPIs and USIs, Full Table Scans can sometimes be more efficient than a NUSI. Since V2R5 a
Cylinder read which reads all the data blocks in that cylinder just needs 1 I/O. A FTS will only read each
data block once whereas a NUSI would read the index subtable then access the data blocks; this is
efficient only when the NUSI is nearly unique. Statistics will give the optimizer the demographic
information to make the right decision to select a FTS.

Secondary Indexes

There are three general ways to access a table; two of which have been covered already:
• Primary Index access (one AMP access)
• Full Table Scan (all AMP access).

15
The third option is via a Secondary Index access (two or all AMP access). As we discussed earlier, there
are two types of secondary indexes USI (unique) and NUSI (non unique).

Provide an Alternate path to the data


Secondary Indexes are used to improve performance by allowing the user to avoid scanning the entire
table providing an alternate path to the rows of a table. One table can have from 0 to 32 secondary
indexes.

Improve Table Performance


As we previously mentioned, secondary indexes require at least two and possibly all AMPs, depending
on the index and the type of operation. A benefit is that searches will typically be less expensive than a
full table scan. However, secondary indexes add overhead to the table, both in terms of disk space and
maintenance (due to sub-tables) but may be dropped when not needed and recreated whenever they
would be helpful.

Do not affect table distribution


A Secondary Index is like a Primary Index in that it allows the user to locate rows, but different in that it
has no influence on the way rows are distributed among AMPs. A unique secondary index can be used to
maintain uniqueness within a column or set of columns.

Join Indexes

A join index is a system-maintained index table that stores and maintains the joined rows of one or
more tables (multiple table join index) and, optionally, aggregates selected columns, referred to as an
aggregate join index.

There are multiple ways in which a join index can be used. Three common ways include:
• Single table Join Index — Distribute the rows and columns of a single table on the hash value of a
foreign key value
• Multi-Table Join Index — Pre-join multiple tables; stores and maintains the result from joining two
or more tables.
• Aggregate Join Index — Aggregate one or more columns of a single table or multiple tables into a
summary table

NoPI Tables

Data is effectively appended to an AMP in a NoPI table. In addition, a NoPI table:


• Has no duplicate row checking. Duplicate rows can exist in a NoPI table after a TPT LOAD operation.
• A table without a primary index. Data is distributed to the AMPs by methods other than the primary
index (i.e., Random Generator).
• Is useful in staging and in data labs when the data is not known and there is no obvious primary
index.
• Is useful as a staging table and can be loaded quickly using Teradata Parallel Transporter (TPT) LOAD.

16
How the Teradata Database Distributes Rows
The Teradata Database uses a Shared-Nothing Architecture - each AMP is responsible for its own portion
of the database and does not share it with any other AMP. The Teradata Database uses hashing to
randomly distribute data across all AMPs. The Primary Index choice determines whether the rows of a
table will be evenly or unevenly distributed across the AMPs. Even distribution is desirable because in
an operation involving all rows of the table (such as a full-table scan), each AMP will have an equal
portion of work to do resulting in evenly distributed workloads.

The Teradata Database's automatic hash distribution eliminates costly data maintenance tasks. There is
no specific order to the placement of the data. The benefits of having unordered data are that they
don't need any maintenance to preserve order, and they are independent of any query being submitted.
There is never a need to reorganize data or index space which makes the database easier to manage and
maintain. As a result, strategic business data is more accessible to the users.

Primary Key (PK) vs. Primary Index (PI)


Some tables may use the same columns for both the PK and the PI. A well-designed database will often
use a PI that is different from the PK. Here is a comparison between the two:

Primary Key:
• Logical concept of data modeling
• Teradata doesn't need to recognize
• No limit on number of columns
• Documented in data model (Optional in CREATE TABLE)
• Must be unique
• Identifies each row
• Values should not change
• May not be NULL – requires a value
• Does not imply an access path
• Chosen for logical correctness

Primary Index:
• Physical mechanism for access and storage
• Each table can have (at most) one primary index
• 64 column limit
• Defined in CREATE TABLE statement
• May be unique or non-unique
• Identifies a maximum of 1 (UPI) or multiple rows (NUPI)
• Values may be changed (Delete + Insert)
• May be NULL
• Defines most efficient access path
• Chosen for physical performance

17
Distribution of Rows

Ideally, the rows of every large table will be distributed among all of the AMPs. There may be some
circumstances where this is not true. A column-partitioned table or join index may have a primary AMP
index as an alternative to having no primary index. A primary AMP index (PA) allows for single-AMP
access when the values of the primary AMP index column(s) are specified or to provide better
distribution of the rows.

Choosing a Primary Index for a table is perhaps the most critical decision a database designer
makes. The choice will affect the distribution of the rows of the table and, consequently, the
performance of the table in a production environment. Unique Primary Indexes (UPI’s) are often
desirable because they guarantee the uniform distribution of the rows of that table.

Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a
column (or columns) which have non-unique values; that is, there are duplicate values. This type of
index is called a Non-Unique Primary Index or NUPI. While not a guarantor of uniform row
distribution, the degree of uniqueness of the index will determine the degree of uniformity of the
distribution.

Row Distribution via Hashing

The value of the Primary Index for a specific row determines the AMP
assignment for that row.

This is done using a hashing algorithm. A hashing algorithm is a function


that converts a data string into a numeric string output of fixed length.

In the Teradata Database the hashing algorithm takes any length input
string and produces a fixed length (32 bit value) output called the row
hash. In Teradata, a hash bucket is simply an entry in a hash map.

Each hash map entry points to a single AMP.

Changing the number of AMPs does not require any adjustment to the
hashing algorithm. The hash map is adjusted and any affected rows
redistributed.

18
UPI NUPI Row Distribution

UPI Row Distribution


The same value stored in the same data type will always produce the same hash value. If the Primary
Index is unique, Teradata can distribute the rows evenly.

NUPI Row Distribution


If, however, the order number was used as a non-unique Primary Index (NUPI), then the row distribution
among AMPs will possibly be uneven. All rows with the same primary index value (in other words, with
the same order number) will be stored on the same AMP. If the Primary Index is slightly non-unique,
that is, there are only four or five rows per index value; the table will still distribute evenly. But what if
there are hundreds or thousands of rows for some index values? The rows with the same primary index
value would all be sent to the same AMP resulting in skewed data.

Row Partitioning

Indexing can improve performance for different types of queries and workloads. For example, secondary
indexes, join indexes, or hash indexes may be utilized to improve performance for known queries.
Primary indexes can be partitioned or non-partitioned.

The Teradata Database also provides the ability to row partition a table. A row partitioned table permits
rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for
range queries that are predicated on primary index values and partitioned columns. It also Increases
query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary
indexes.

In a partitioned table, each row is uniquely identified by the following:

•Row ID = Partition # + Row Hash + Uniqueness Value


•Row Key = Partition # + Row Hash (e.g., Row Key will appear in Explain plans)–In a partitioned table,
data rows will have the Partition # included as part of the data row.

For tables that are row partitioned, Teradata utilizes a 3-level scheme to distribute and later locate
the data.

The 3 levels are:


1. Rows are distributed across all AMPs (and accessed via the Primary Index) based upon HBN
(Hash Bucket Number) which is the first part of the Row Hash.
2. At the AMP level, rows are first ordered by their partition number.
3. Within the partition, data rows are logically stored in Row ID sequence.

Some common business queries generally require a full-table scan of a large table, even though
it's predictable that a fairly small percentage of the rows will qualify.

An Example

A trend analysis application that compares current month sales to the previous month, or to the same
month of the previous year, using a table with several years of sales detail.

19
Multi-Level Row Partitioning (MLPPI)
Allows each partition at a given level to be further partitioned into sub-partitions. Each partition for a
level is sub-partitioned using a partitioning expression defined for the next lower level.

Teradata Columnar
Teradata Columnar offers the ability to partition a table by column. It can be used alone or in
combination with row partitioning in multilevel partitioning definitions and can automatically compress
physical rows where appropriate. The major advantage of using column partitioning is to improve the
performance of queries that access a subset of the columns from a table.

Normalization

Normalization is a set of rules and a methodology for making sure that the attributes in a design are
carried in the correct entity to map accurately to reality, eliminate data redundancy and minimize
update abnormalities.

Normalization is a technique for placing non-key attributes in tables in order to:


• Minimize redundancy
• Provide optimum flexibility
• Eliminate update anomalies

Referential Integrity

Referential integrity refers to the accuracy and consistency of data within a relationship and is a reliable
mechanism to prevent accidental database corruption. We discussed in a previous course that data is
linked between two or more tables. Referential integrity requires that, whenever a foreign key value is
used it must reference a valid, existing primary key in the parent (or primary) table. Because of this, we
need to ensure that data on both sides of the relationship remain intact.

Summary

So far you have learned about the Teradata and the SQL Engine, reviewed some additional Teradata
concepts, and you just learned about data distribution and access.

20

You might also like