The Teradata Database - Part 3 Usage Fundamentals
Welcome to the third in the series of courses designed to provide you with an introduction to the
Teradata Database.
The first course, Database Basics, served as a refresher for some and provided a foundation for those who are new to the world of databases. The second course, Teradata Database Fundamentals, built on that foundation and explored the functionality of the Teradata Database. This course takes your knowledge one step further.
By the end of this course you should be able to:
• Identify some common methods of handling analytics
• Explain how Teradata Everywhere works
• Identify some Teradata concepts
• Recognize the functions and types of compression
• Identify tools and utilities used for distribution and access of data
• Explain the uses of secondary and join indexes
Module 1 - Teradata and The SQL Engine
Teradata provides access from the most popular analytic tools through to Teradata's SQL-based tools used by power users. At the foundation of all of these is the Teradata SQL Engine.
Types of Analysis
More about types of analysis
Types and examples:
• OLTP: Update a checking account to reflect a deposit. A debit transaction takes place against the current balance to reflect the amount of money withdrawn at an ATM.
• DSS: How many child-size blue jeans were sold across all of our Eastern stores in the month of March? What were the monthly shoe sales for retailer X?
• OLAP: Show the top ten selling items across all stores for 2018. Show a comparison of sales from this week to last week.
• Data Mining: Which customers are most likely to leave? Which customers are most likely to respond to this promotion?
Both cursor and set processing define sets of rows of the data to process; however, while a cursor processes the rows sequentially, set processing operates on the entire set at once. Either type of processing can be initiated with a single command.
Until recently, most business decisions were based on summary data. The problem is that summarized
data cannot answer some questions with accuracy. Here's a real-life example that most of us can relate
to.
Your monthly bank statement records your checking account activity. What if your bank statement only
showed you the total amount of deposits and withdrawals and you wanted to know if a certain check
had cleared? To answer that question, you would need a list of every check received by your bank.
In order to perform analytics and answer business questions, decision-makers must have four things:
• The right data
• Enough detail data
• Proper data structure
• Enough computer power to access and produce reports on the data
Detail data provides more accurate and granular information, which enables the best-informed business decisions.
Next let’s take a look at how Teradata is helping customers meet their data and analytic needs.
Teradata Everywhere™
Deploy Anywhere
Provides analytic processing across flexible deployment
options, including the Teradata Cloud and public clouds, as
well as on-premises via Teradata hardware or commodity
hardware.
Query Types
Strategic queries face different types of scalable performance challenges compared to direct access
queries, such as OLTP or highly tuned tactical transactions.
Direct access queries process very few rows, and the work is done in a localized area, usually involving
one or a small number of parallel units.
Analytic queries read considerably more data, often scanning all of the rows of one or many tables and
perform significantly more complex operations on the data.
Row vs. Set Processing
Row-by-row processing can make for a slow query because one row is retrieved at a time, all
calculations are performed on it, then it is updated or inserted. Then the next row is retrieved and
processed, and so on.
For example, let's say that you wanted to sum all payment rows with balances of $100 or less. Using row-by-row processing, rows are selected one by one and tested against the criteria.
Set processing, on the other hand, processes relational data set-by-set, without a cursor: a single SQL statement completely processes all rows that meet the condition as a set. In the same example, one statement sums every qualifying payment row in a single operation.
With sufficient rows to process, set processing can be 10 to 30 or more times faster than row-at-a-time processing.
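To make the contrast concrete, here is a minimal sketch in Teradata-style SQL; the payments table, its balance column, and the procedure name are all hypothetical:

    -- Set processing: one statement processes every qualifying row at once.
    SELECT SUM(balance)
    FROM payments
    WHERE balance <= 100;

    -- Cursor (row-by-row) processing: the same result, one row at a time,
    -- inside a stored procedure.
    REPLACE PROCEDURE sum_small_balances (OUT total DECIMAL(12,2))
    BEGIN
      DECLARE bal  DECIMAL(12,2);
      DECLARE done INTEGER DEFAULT 0;
      DECLARE pay_cur CURSOR FOR
        SELECT balance FROM payments WHERE balance <= 100;
      DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
      SET total = 0;
      OPEN pay_cur;
      FETCH pay_cur INTO bal;
      WHILE done = 0 DO
        SET total = total + bal;   -- one calculation per retrieved row
        FETCH pay_cur INTO bal;
      END WHILE;
      CLOSE pay_cur;
    END;

The set-based statement also lets the Optimizer spread the work across all AMPs in parallel, which is largely where the speedup comes from.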
When we think about how fast a transaction is, there are two ways to measure its speed: how long it takes to complete the task, and how much gets done per unit of time. These measures are called response time and throughput.
Response Time
The phrase "response time" is often used in operating system contexts, where it may also be called access time, transmission time, or execution time, depending on the context.
Throughput
For operating systems, throughput is often measured as tasks or transactions per unit time. For storage
systems or networks, throughput is measured as bytes or bits per unit time. For processors, the number
of instructions executed per unit time is an important component of performance.
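As a simple worked illustration (the numbers are invented): a system that completes 1,200 transactions in 60 seconds has a throughput of 1,200 / 60 = 20 transactions per second. Response time, by contrast, is measured per request, so any individual transaction within that same minute might still have taken anywhere from a few milliseconds to several seconds, depending on queuing.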
Request for Data
By now it should be apparent how important the accurate and quick return of data is. Let's take a look at exactly how a request for data is processed.
Let's say that you have received a request from your boss who wants to
know how many widgets had more than 15% profit margin in the last
month, but he only wants to know this for the Eastern region (USA).
Below is how the request would be processed:
1. The SQL request is submitted to the Parsing Engine, which checks the syntax and verifies that you have the rights to access the data.
2. The Optimizer creates the least expensive plan, and the steps are dispatched across the BYNET to the AMPs.
3. Each AMP retrieves the qualifying rows from its own portion of the data, in parallel.
4. The results are returned across the BYNET, through the Parsing Engine, back to the requester.
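Expressed in SQL, the request itself might look like the following sketch, where the widget_sales table, its columns, and the date range are hypothetical:

    SELECT COUNT(*)
    FROM widget_sales
    WHERE region        = 'East'
      AND profit_margin > 0.15                    -- more than 15% margin
      AND sale_date BETWEEN DATE '2018-06-01'
                        AND DATE '2018-06-30';    -- "the last month"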
Summary
So far you have learned more about how the Teradata Database is used for analytics. Additionally, you
learned about Teradata Everywhere™ and how it works.
In the next module you will review some additional Teradata concepts.
Module 2 - Additional Teradata Concepts
Traditional Approach
Each AMP owns the same number of specific disk drives and places its data on those drives – all drives
are required to be identical in a clique. Data is placed on each drive without consideration of data
characteristics or usage.
Teradata Virtual Storage
With Teradata Virtual Storage, the storage is owned by Teradata Virtual Storage and is allocated to
AMPs in small pieces from a shared pool of disks in the clique. Cliques can now have multiple drive sizes
and types. Data is automatically and transparently migrated within storage based on data temperature.
As data is accessed less frequently it is referred to as data aging or becoming "cooler". But just
because data is older doesn't imply that it will only continue to get cooler. In fact, cooler data
can become warm/hot again as access increases. As portions of the data change from hot to
cold or vice versa, Teradata Virtual Storage automatically moves the data to the most
appropriate storage type. This can improve system performance over time.
For example, sales data from this quarter may remain hot until several months in the future as current
quarter/previous quarter comparisons are run in different areas of the business. After six months, that
sales data may cool off until nearly a year later when it becomes increasingly queried (i.e., becomes
hotter) by comparison against the new current quarter's sales data.
Multi-Temperature Data
To summarize, Teradata Virtual Storage (TVS) enhances performance with multi-temperature data placement, putting hot data on fast storage and cold (frozen) data on slow storage. The key concepts of multi-temperature data are the performance of the storage (differences between drives on different controllers, and between hard disk drives (HDD) and solid state drives (SSD)) and the data access patterns, determined by the frequency of access and updates.
This diagram illustrates that TVS can, in a hybrid disk drive system, place the hot data (red) onto
faster solid-state drives (SSD) and the cooler less used data (green) onto slower hard disk drives
(HDD).
Whenever data on the very hot data list is read from disk during query processing, a copy is kept in
Teradata Intelligent Memory's (TIM) extended memory area for future use.
When data in Intelligent Memory is needed by another query, the Teradata Database automatically looks to Intelligent Memory, eliminating the need for solid state drive (SSD) or hard disk drive (HDD) I/O, which is up to 3,000 times slower than memory access.
If hotter data is used, it replaces the coolest data in Intelligent Memory so that the very hottest data is
always available in memory for fast query performance with the lowest possible amount of disk I/O. But,
since data is often used actively for an extended period, data is frequently kept in Intelligent Memory for
hours, days, or even weeks at a time and used many thousands of times without further physical disk
I/O.
Teradata Compression Comparison
Compression is useful to free up additional disk space, but it may or may not enhance performance
depending on the type of compression used, the workload, the frequency of data access, and the system
capabilities. Compression generally affects performance as follows:
• The cycle of decompressing and recompressing required to access and store data may
significantly increase CPU usage compared to uncompressed data.
• Storing data in dense, compressed form may reduce I/O requirements.
If the table is stored in columnar format, an organization can choose any combination of the three compression types (multi-value, algorithmic, and block-level compression), or all three, on a given column or table.
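As a brief sketch of one of these mechanisms, multi-value compression is declared per column in the table definition; the table, columns, and city values here are hypothetical:

    CREATE TABLE customer
      (cust_id   INTEGER NOT NULL
      ,cust_city VARCHAR(30)
         COMPRESS ('London', 'Paris', 'New York'))  -- frequent values stored once
    UNIQUE PRIMARY INDEX (cust_id);

Rows containing one of the listed values carry only a small bit pattern instead of the full string, trading a little CPU for less disk space and I/O.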
Linear Scalability
Teradata is linearly expandable. Linear scalability enables systems to grow to support more users, data, queries, and query complexity without experiencing performance degradation. As the configuration grows, performance increases linearly, smoothly and predictably.
Summary
So far you have learned about Teradata and the SQL Engine and reviewed some additional Teradata concepts.
In the next module you will learn about data distribution and access.
Module 3 - Data Distribution and Access
The following will be covered during this module:
• Accessing data
• Secondary and Join Indexes
• How the Teradata Database distributes rows
• Row distribution vs. hashing
• UPI NUPI row distribution
• Row partitioning
• Normalization and Referential integrity
Before we move into accessing data, let’s first do a review of primary and secondary indexes.
• Primary indexes are required in a table, with one exception (NoPI tables, discussed later)
• Primary indexes are used to distribute rows to an AMP
• Secondary indexes are optional
Comparison of Primary and Secondary Indexes
Accessing Data
Data is accessed in a number of different ways. The Optimizer chooses the fastest access method on a case-by-case basis.
A UPI retrieval will always return one row at most, whereas the number of rows a NUPI retrieval returns depends upon the selectivity of the PI value; the performance of a NUPI depends upon how non-unique the PI is.
Secondary Indexes
There are three general ways to access a table; two of which have been covered already:
• Primary Index access (one AMP access)
• Full Table Scan (all AMP access).
The third option is via Secondary Index access (two-AMP or all-AMP access). As we discussed earlier, there are two types of secondary indexes: USI (unique) and NUSI (non-unique).
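A minimal sketch of defining each type; the employee table and its columns are hypothetical:

    -- USI: at most one row per value; retrieval is typically a two-AMP operation.
    CREATE UNIQUE INDEX (ssn) ON employee;

    -- NUSI: may qualify many rows; retrieval is typically an all-AMP operation.
    CREATE INDEX (last_name) ON employee;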
Join Indexes
A join index is a system-maintained index table that stores and maintains the joined rows of one or
more tables (multiple table join index) and, optionally, aggregates selected columns, referred to as an
aggregate join index.
There are multiple ways in which a join index can be used. Three common ways include:
• Single table Join Index — Distribute the rows and columns of a single table on the hash value of a
foreign key value
• Multi-Table Join Index — Pre-join multiple tables; stores and maintains the result from joining two
or more tables.
• Aggregate Join Index — Aggregate one or more columns of a single table or multiple tables into a
summary table
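As a hedged sketch of the multi-table case, a join index that pre-joins two hypothetical tables might look like this (all table and column names are made up):

    CREATE JOIN INDEX cust_ord_ji AS
      SELECT c.cust_id, c.cust_name, o.order_id, o.order_date
      FROM customer c
      INNER JOIN orders o
        ON c.cust_id = o.cust_id;

Once defined, the Optimizer can use cust_ord_ji automatically whenever a query joins customer and orders on cust_id; queries never reference a join index directly.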
NoPI Tables
How the Teradata Database Distributes Rows
The Teradata Database uses a Shared-Nothing Architecture - each AMP is responsible for its own portion
of the database and does not share it with any other AMP. The Teradata Database uses hashing to
randomly distribute data across all AMPs. The Primary Index choice determines whether the rows of a
table will be evenly or unevenly distributed across the AMPs. Even distribution is desirable because in
an operation involving all rows of the table (such as a full-table scan), each AMP will have an equal
portion of work to do resulting in evenly distributed workloads.
The Teradata Database's automatic hash distribution eliminates costly data maintenance tasks. There is
no specific order to the placement of the data. The benefits of having unordered data are that they
don't need any maintenance to preserve order, and they are independent of any query being submitted.
There is never a need to reorganize data or index space which makes the database easier to manage and
maintain. As a result, strategic business data is more accessible to the users.
Primary Key:
• Logical concept of data modeling
• Teradata doesn't need to recognize it
• No limit on number of columns
• Documented in data model (Optional in CREATE TABLE)
• Must be unique
• Identifies each row
• Values should not change
• May not be NULL – requires a value
• Does not imply an access path
• Chosen for logical correctness
Primary Index:
• Physical mechanism for access and storage
• Each table can have (at most) one primary index
• 64 column limit
• Defined in CREATE TABLE statement
• May be unique or non-unique
• Identifies at most one row (UPI) or multiple rows (NUPI)
• Values may be changed (Delete + Insert)
• May be NULL
• Defines most efficient access path
• Chosen for physical performance
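A minimal sketch of how the primary index is declared in practice; the employee table and its columns are hypothetical:

    CREATE TABLE employee
      (emp_id    INTEGER NOT NULL
      ,dept_id   INTEGER
      ,last_name VARCHAR(30))
    UNIQUE PRIMARY INDEX (emp_id);   -- UPI: unique values, uniform distribution

    -- A NUPI alternative: all rows with the same dept_id hash to the same AMP,
    -- so distribution is only as uniform as dept_id is unique.
    -- PRIMARY INDEX (dept_id)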
Distribution of Rows
Ideally, the rows of every large table will be distributed among all of the AMPs. There may be some
circumstances where this is not true. A column-partitioned table or join index may have a primary AMP
index as an alternative to having no primary index. A primary AMP index (PA) allows for single-AMP
access when the values of the primary AMP index column(s) are specified or to provide better
distribution of the rows.
Choosing a Primary Index for a table is perhaps the most critical decision a database designer
makes. The choice will affect the distribution of the rows of the table and, consequently, the
performance of the table in a production environment. Unique Primary Indexes (UPIs) are often desirable because they guarantee uniform distribution of the rows of that table.
Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a
column (or columns) which have non-unique values; that is, there are duplicate values. This type of
index is called a Non-Unique Primary Index or NUPI. While not a guarantor of uniform row
distribution, the degree of uniqueness of the index will determine the degree of uniformity of the
distribution.
The value of the Primary Index for a specific row determines the AMP
assignment for that row.
In the Teradata Database the hashing algorithm takes any length input
string and produces a fixed length (32 bit value) output called the row
hash. In Teradata, a hash bucket is simply an entry in a hash map.
Changing the number of AMPs does not require any adjustment to the
hashing algorithm. The hash map is adjusted and any affected rows
redistributed.
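Teradata exposes each step of this chain through built-in functions, so you can inspect where a given primary index value would land; a quick sketch against the hypothetical employee table from earlier:

    SELECT emp_id
          ,HASHROW(emp_id)                       AS row_hash  -- 32-bit row hash
          ,HASHBUCKET(HASHROW(emp_id))           AS bucket    -- entry in the hash map
          ,HASHAMP(HASHBUCKET(HASHROW(emp_id)))  AS amp_no    -- AMP that owns the row
    FROM employee;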
UPI NUPI Row Distribution
Row Partitioning
Indexing can improve performance for different types of queries and workloads. For example, secondary
indexes, join indexes, or hash indexes may be utilized to improve performance for known queries.
Primary indexes can be partitioned or non-partitioned.
The Teradata Database also provides the ability to row partition a table. A row-partitioned table permits rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for range queries that are predicated on primary index values and partitioning columns. It also increases query efficiency by avoiding full-table scans, without the overhead and maintenance costs of secondary indexes.
For tables that are row partitioned, Teradata utilizes a 3-level scheme to distribute and later locate
the data.
Some common business queries generally require a full-table scan of a large table, even though
it's predictable that a fairly small percentage of the rows will qualify.
An Example
A trend analysis application that compares current month sales to the previous month, or to the same
month of the previous year, using a table with several years of sales detail.
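A sketch of a row-partitioned table that would support this kind of trend analysis; the sales table, its columns, and the date range are hypothetical:

    CREATE TABLE sales
      (store_id  INTEGER NOT NULL
      ,item_id   INTEGER NOT NULL
      ,sale_date DATE    NOT NULL
      ,amount    DECIMAL(10,2))
    PRIMARY INDEX (store_id, item_id)
    PARTITION BY RANGE_N(sale_date BETWEEN DATE '2016-01-01'
                                       AND DATE '2019-12-31'
                                  EACH INTERVAL '1' MONTH);

A query constrained to one or two months then touches only those partitions on each AMP instead of scanning the whole table.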
Multi-Level Row Partitioning (MLPPI)
Multilevel partitioning allows each partition at a given level to be further partitioned into sub-partitions. Each partition for a level is sub-partitioned using a partitioning expression defined for the next lower level.
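Extending the sketch above, a second level could sub-partition each month by store; again, the names and ranges are purely illustrative:

    CREATE TABLE sales_ml
      (store_id  INTEGER NOT NULL
      ,item_id   INTEGER NOT NULL
      ,sale_date DATE    NOT NULL
      ,amount    DECIMAL(10,2))
    PRIMARY INDEX (store_id, item_id)
    PARTITION BY (RANGE_N(sale_date BETWEEN DATE '2016-01-01'
                                        AND DATE '2019-12-31'
                                   EACH INTERVAL '1' MONTH)     -- level 1: month
                 ,RANGE_N(store_id BETWEEN 1 AND 1000 EACH 100)); -- level 2: store band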
Teradata Columnar
Teradata Columnar offers the ability to partition a table by column. It can be used alone or in
combination with row partitioning in multilevel partitioning definitions and can automatically compress
physical rows where appropriate. The major advantage of using column partitioning is to improve the
performance of queries that access a subset of the columns from a table.
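A minimal sketch of a column-partitioned table; note that, as a NoPI variant, the hypothetical table below is defined with NO PRIMARY INDEX:

    CREATE TABLE sales_cp
      (store_id  INTEGER
      ,item_id   INTEGER
      ,sale_date DATE
      ,amount    DECIMAL(10,2))
    NO PRIMARY INDEX
    PARTITION BY COLUMN;

A query that selects only amount and sale_date can then read just those column partitions rather than every row in full.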
Normalization
Normalization is a set of rules and a methodology for making sure that the attributes in a design are carried in the correct entity, so that the design maps accurately to reality, eliminates data redundancy, and minimizes update anomalies.
Referential Integrity
Referential integrity refers to the accuracy and consistency of data within a relationship and is a reliable
mechanism to prevent accidental database corruption. We discussed in a previous course that data is
linked between two or more tables. Referential integrity requires that, whenever a foreign key value is
used it must reference a valid, existing primary key in the parent (or primary) table. Because of this, we
need to ensure that data on both sides of the relationship remain intact.
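As a sketch of how this rule can be declared and enforced at the database level (both tables and their columns are hypothetical, and customer.cust_id is assumed to be that table's primary key):

    CREATE TABLE orders
      (order_id INTEGER NOT NULL
      ,cust_id  INTEGER NOT NULL
      ,CONSTRAINT fk_cust
         FOREIGN KEY (cust_id) REFERENCES customer (cust_id))
    PRIMARY INDEX (order_id);

With the constraint in place, an insert into orders with a cust_id that has no matching customer row is rejected.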
Summary
So far you have learned about Teradata and the SQL Engine, reviewed some additional Teradata concepts, and learned about data distribution and access.