
Born to be Parallel, and Beyond

Teradata’s Enduring Performance Advantage

Carrie Ballinger, Performance Engineering, Teradata Labs


02.16 EB3053 DATA WAREHOUSING
Table of Contents

Introduction
Multidimensional Parallel Capabilities
Parallel-Aware Optimizer
The BYNET's Considerable Contribution
A Flexible, Fast Way to Find and Store Data
Work Flow Self-Regulation
Workload Management
Conclusion
Endnotes

Introduction

When you hear about the Teradata® Database these days it is often with an emphasis on recently-implemented, highly-visible features, such as Teradata QueryGrid™ or Unified Data Architecture™. This paper asks you to step back in time, say to the early 1980s, when Teradata Database was first emerging from its startup days. Focus your attention for a moment on the original foundation and key characteristics of what was at that time a new parallel database.

The basic building blocks that were put in place initially remain in place, and they continue to elevate and extend the major advantages of Teradata Database today. Teradata Database's surprisingly enduring performance advantage is a direct result of these early, somewhat unconventional design decisions made by a handful of imaginative architects.

This paper describes and illustrates some of these key fundamental components of Teradata Database that are as critical to performance now as they were then, and upon which today's new features and capabilities rest.

Discussions of these specific areas are included in this paper:

• Multidimensional parallel capabilities
• A parallel-aware query optimizer
• The BYNET's considerable contribution
• A flexible and fast way to find and store data
• Internal self-regulation of the flow of work
• Managing the flow of work externally with Workload Management

The scope of this whitepaper is limited to important, foundational components of the database. It is not a comprehensive discussion of all the aspects of Teradata Database.

There have been several important capabilities introduced over the years that rest on top of that foundation, but that are not discussed in this paper. They include user-defined functions, table operators, data types like XML, JSON/BSON/UBJSON for semi-structured data, geospatial, columnar, temporal, Teradata Intelligent Memory, and in-memory optimizations.


Multidimensional Parallel Capabilities

Emerging in the late 1970s, Teradata Database was the first commercially available SQL-based "parallel processing" machine designed from the base up to support user business queries. Since then, parallel processing has become a necessity for any serious database offering, as demand for data analytics continues to drive even higher volumes, greater numbers of users, and more real-time performance.

With a design goal of eliminating single-threaded operations, the original architects of Teradata Database parallelized everything, from the entry of SQL statements to the smallest detail of their execution. The database's entire foundation was constructed around the idea of giving each component in the system many look-alike counterparts. Not knowing where future bottlenecks might spring up, early developers weeded out all possible single points of control and effectively eliminated the conditions that can breed gridlock in a system.

Limitless interconnect pathways and multiple optimizers, host channel connections, gateways, and units of parallelism are supported in Teradata, increasing flexibility and control over performance that is crucial to large-scale data analytics today.

Teradata's basic unit of parallelism is the AMP (Access Module Processor), a virtual processing unit that manages all database operations for its portion of a table's data. Many AMPs are typically configured on a given node; from 20 to 40 or more AMPs per node is common today.

Once configured, data loads, backups, index builds, in fact everything that happens in a Teradata system, is distributed across a pre-defined number of AMPs. The parallelism is predictable and understandable.

Each AMP acts like a microcosm of the database, supporting such things as data loading, reading, writing, journaling and recovery for all the data that it owns (see Figure 1). The parallel units also have knowledge of each other and work cooperatively together behind the scenes. This teamwork among parallel units is an unusual strength of Teradata Database, driving higher performance with minimal overhead.

[Figure 1. Inside a unit of parallelism: each AMP independently performs row locking, reading/writing, sorting, aggregating, index building, transaction journaling, loading of its own data, and backup and recovery.]

Types of Query Parallelism

While the AMP is the fundamental unit of parallelism, there are two additional parallel dimensions woven into Teradata Database, specifically for query performance. These are referred to here as "within-a-step" parallelism and "multi-step" parallelism. The following sections describe these three dimensions of parallelism.

Parallel Execution Across AMPs

Probably the most recognizable type of parallelism is parallel execution across multiple AMPs. It involves breaking the request into subdivisions, and working on each subdivision at the same time, with one single answer delivered. Parallel execution can incorporate all or part of the operations within a query, and can significantly reduce the response time of an SQL statement, particularly if the query reads and analyzes a large amount of data.

Parallel execution is usually enabled in Teradata by hash-partitioning the data across all the AMPs defined in the system. Once data is assigned to an AMP, the AMP provides all the database services on its allocation of data blocks. All relational operations such as table scans, index scans, projections, selections, joins, aggregations, and sorts execute in parallel across the AMPs simultaneously. Each operation is performed on an AMP's data independently of the data associated with the other AMPs.
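To make the idea concrete, here is a minimal sketch in Python (not Teradata code; the hash function, worker pool, and sample rows are all invented for illustration) of hash-partitioned data where each unit of parallelism aggregates its own rows independently before the partial results are combined into one answer:

```python
# Conceptual sketch: rows are hash-partitioned across AMP-like workers,
# each worker aggregates its own partition independently, and the
# partial results are then combined into a single answer set.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_AMPS = 4

def amp_for_row(primary_index_value):
    # Stand-in for Teradata's hashing of the primary index value.
    return hash(primary_index_value) % NUM_AMPS

def local_aggregate(rows):
    # Each "AMP" sums sales for its own rows, independently of the others.
    totals = defaultdict(float)
    for store_id, amount in rows:
        totals[store_id] += amount
    return totals

rows = [(1, 10.0), (2, 5.0), (1, 7.5), (3, 2.0), (2, 1.0)]

# Partition rows by hashing the "primary index" (store_id here).
partitions = defaultdict(list)
for row in rows:
    partitions[amp_for_row(row[0])].append(row)

# All partitions are processed in parallel, one sub-aggregate per AMP.
with ThreadPoolExecutor(max_workers=NUM_AMPS) as pool:
    partials = pool.map(local_aggregate, partitions.values())

# Combine the per-AMP sub-totals into the final answer.
final = defaultdict(float)
for partial in partials:
    for store_id, amount in partial.items():
        final[store_id] += amount
print(dict(final))
```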



Within-a-Step Parallelism

A second dimension of parallelism that will naturally unfold during query execution is an overlapping of selected database operations referred to here as within-a-step parallelism. The optimizer splits an SQL query into a small number of high-level database operations called "steps" and dispatches these distinct steps for execution to the AMPs, one after another.

A step can be a small piece or a large chunk of work. It can be simple, such as "scan a table and return the result," or complex, such as "scan two tables and apply predicates to each, join the two tables, redistribute the join result on specified columns, sort the redistributed rows, and place the redistributed rows in an intermediate table."

Within each of these potentially large chunks of work that we call steps, multiple relational operations can be processed in parallel by pipelining. While a table scan is taking place, rows that are selected from the scan can be pipelined into a join process immediately. Pipelining is the ability to begin one task before its predecessor task has completed, and it will take place whenever possible within each distinct step (see Figure 2).

[Figure 2. Pipelining of 4 operations within one query step: a select/project of the Product table, a select/project of the Inventory table, a join of the two tables, and a redistribution of the joined rows to other AMPs all overlap between the start of the step (Time 1) and step completion (Time 4).]

This dynamic execution technique, in which a second operation jumps off of a first one to perform portions of the step in parallel, is key to increasing the basic query parallelism. The relational-operator mix of a step is carefully chosen by the Teradata optimizer to avoid stalls within the pipeline.

Multi-Step Parallelism

Multi-step parallelism is enabled by executing multiple "steps" of a query simultaneously, across all the participating units of parallelism in the system. One or more tasks are invoked for each step on each AMP to perform the actual database operation. Multiple steps for the same query can be executing at the same time to the extent that they are not dependent on results of previous steps.

Figure 3 is a representation of how all three types of parallelism might appear in a query's execution.

[Figure 3. Multiple types of parallelism combined. Within-a-step parallelism: multiple operations are pipelined (scan Product, scan Inventory, join Product and Inventory, redistribute the joined rows). Multi-step parallelism: steps 1.1 and 1.2, and steps 2.1 and 2.2 (which joins Items and Orders), execute simultaneously. Query execution parallelism: four AMPs perform each step on their data blocks at the same time. Later steps join the spools (step 3), sum (step 4), and return the final answer (step 5).]

The figure shows four AMPs supporting a single query's execution, with the query optimized into 7 steps. Step 1.2 and Step 2.2 each demonstrate within-a-step parallelism, where two different tables are scanned and joined together (three different operations are performed). The result of those three operations is pipelined into a sort and then a redistribution, all in one step. Steps 1.1 and 1.2 together (as well as 2.1 and 2.2 together) demonstrate multi-step parallelism, as two distinct steps are chosen to execute at the same time, within each AMP.
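As an illustration of pipelining, the following Python sketch (table contents and operator names are invented, and generators stand in for the database's tasks) chains scan, selection, and join operators so that each selected row flows into the join before the scan has finished:

```python
# Conceptual sketch of within-a-step pipelining: each operator is a
# Python generator, so a row selected by the scan flows into the join
# immediately, before the scan itself completes.
def scan(table):
    for row in table:          # rows are produced one at a time
        yield row

def apply_predicate(rows, predicate):
    for row in rows:           # starts consuming before scan() completes
        if predicate(row):
            yield row

def join(left_rows, right_table_by_key, key):
    # A simple hash join: probe rows are joined as they arrive.
    for row in left_rows:
        match = right_table_by_key.get(row[key])
        if match is not None:
            yield {**row, **match}

product = [{"sku": 1, "name": "bolt"}, {"sku": 2, "name": "nut"}]
inventory_by_sku = {1: {"on_hand": 40}, 2: {"on_hand": 0}}

# Scan -> select -> join, all overlapped within one "step".
step = join(apply_predicate(scan(product), lambda r: r["sku"] < 3),
            inventory_by_sku, "sku")
for joined_row in step:
    print(joined_row)
```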



This multifaceted parallelism is not easy to choreograph unless it is planned for in the early stages of product evolution. An optimizer that generates three dimensions of parallelism for one query, such as described here, must be intimately familiar with all the parallel capabilities that are available and know how and when to use them. But most importantly, Teradata Database applies these multiple dimensions of parallelism automatically, without user intervention or special setup.

Multi-Statement Requests

In addition to the three dimensions of parallelism shown in Figure 3, Teradata Database offers an SQL extension called a Multi-Statement Request that allows several distinct SQL statements to be bundled together and sent to the optimizer as if they were one unit. Teradata Database will attempt to execute these SQL statements in parallel, as long as there are no dependencies among the statements.

When this feature is used, any sub-expressions that the different SQL statements have in common will be executed once, and the results shared among them. Known as common sub-expression elimination, this means that if six select statements were bundled together into a single request, and all contained the same sub-query, that sub-query would only be executed once. Even though these SQL statements are executed in an interdependent, overlapping fashion, each query in a multi-statement request will return its own distinct answer set (see Figure 4, and the sketch at the end of this section).

[Figure 4. A multi-statement request: performed one at a time, three SQL statements access pricing data, customer data, and store data in sequence, each returning its own answer set; bundled as a multi-statement request, the three accesses run in parallel and all three answer sets are returned.]

Evolution

Many features have been added to the Teradata Database over the years that take advantage of the database's inherent parallelism. Things like the FastExport utility, which pulls large volumes of data out of the database across all AMPs in parallel, or ordered analytic functions, which perform complex windowing on top of the parallel foundation, are a few examples.

Teradata QueryGrid, a more recent feature, relies on Teradata Database parallelism as well when accessing rows from a foreign server. The QueryGrid feature provides a single interface to combine data across different systems (including heterogeneous systems), minimizing the need for data duplication.

For example, with QueryGrid you can issue queries from a Teradata Database that access, filter, and return rows from a Hadoop platform. You can then join data from Teradata tables to those rows brought in from Hadoop, if required, all in a single SQL statement.

The parallelism of the Teradata Database contributes to the performance of QueryGrid when a moderate or a large number of rows are being accessed from a foreign server. Multiple streams of data can be brought over from Hadoop in parallel, directly connecting to and coming to rest on different AMPs in the Teradata configuration. Each AMP receives and spools its subset of the Hadoop data. All AMPs work in parallel, as though taking in and processing the data that just arrived from Hadoop were part of just another query step. Without the indigenous parallelism of the Teradata Database, QueryGrid could not offer the same level of efficiency when importing data from a foreign server.
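Returning to multi-statement requests, the sketch below shows what submitting one might look like from a client program. It assumes the open-source teradatasql DB API driver, with placeholder credentials and table names; driver details may differ. Each bundled SELECT returns its own distinct answer set:

```python
# A sketch of submitting a multi-statement request, assuming the Python
# "teradatasql" DB API driver (host, user, password, and table names are
# placeholders). Three SELECTs are bundled into one request; a shared
# sub-expression would be evaluated only once by the database.
import teradatasql

multi_statement_request = """
SELECT * FROM pricing_data;
SELECT * FROM customer_data;
SELECT * FROM store_data;
"""

with teradatasql.connect(host="mysystem", user="me", password="secret") as con:
    with con.cursor() as cur:
        cur.execute(multi_statement_request)
        # Walk through the distinct answer sets, one per statement.
        while True:
            for row in cur.fetchall():
                print(row)
            if not cur.nextset():
                break
```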



Parallel-Aware Optimizer

Having an array of parallel techniques can turn into a disadvantage if they are not carefully applied around the needs of each particular request. Envisioning the power of these combined dimensions of parallelism, early architects of Teradata Database constructed a query optimizer that was fully in tune with these choices and had the smarts to know when and how to apply them.

Join Planning

When the optimizer begins the task of building a query plan, one of its goals is to maximize the throughput of this particular piece of work. Think of a query that has to access six tables to build its answer set. One of the jobs of the optimizer is to determine which tables to access and join first, and which tables to access and join later in the plan. It also has to consider what type of join to use to bring the rows of two tables together, and the method of accessing each table (indexed access or table scan, for example).

One unique capability built into the original Teradata optimizer is the ability to access and join multiple tables simultaneously, constructing a wide or "bushy" query plan, rather than a serial one-join-at-a-time plan. Those six tables discussed above could be joined in a strictly linear fashion: join table1 to table2, then join their result to table3, then join their result to table4, and so on, as shown in Figure 5. This spreads the resource usage required by the query over a longer period of time, potentially impacting elapsed time.

Teradata's optimizer seeks out tables within the query that have logical relationships between them, such as Items and Orders in Step 2.2 of Figure 3. It also groups tables that can be accessed and joined independently from the other subsets of tables. Those are often candidates to execute within parallel steps. Figure 5 illustrates the differences when optimizing a six-table join between a plan that is restricted to linear joins and one that has the option of performing some of the joins in parallel.

[Figure 5. A bushy query plan vs. a serial plan: the serial plan joins Table 1 to Table 2, then the result to Table 3, then Table 4, Table 5, and Table 6 in turn; the plan with parallel joins joins Table 1 with Table 2, Table 3 with Table 4, and Table 5 with Table 6 independently, then joins those results together.]
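The difference between the two plan shapes can be sketched with nested tuples; the depth function below is illustrative only, counting sequential join levels, which is where the bushy plan wins:

```python
# Illustrative only: two shapes of join plan for six tables, written as
# nested tuples. In the linear plan every join depends on the previous
# one; in the bushy plan ("t1" JOIN "t2"), ("t3" JOIN "t4"), and
# ("t5" JOIN "t6") have no mutual dependencies and can run as parallel steps.
linear_plan = ((((("t1", "t2"), "t3"), "t4"), "t5"), "t6")

bushy_plan = ((("t1", "t2"), ("t3", "t4")), ("t5", "t6"))

def depth(plan):
    # Number of sequential join levels; lower depth means more
    # opportunity to overlap joins in parallel steps.
    if isinstance(plan, str):
        return 0
    return 1 + max(depth(plan[0]), depth(plan[1]))

print(depth(linear_plan))  # 5 sequential join levels
print(depth(bushy_plan))   # 3 levels; the leaf joins can run concurrently
```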



Sizing up the Environment

But understanding the dimensions of parallelism by itself was found to be inadequate when it came to creating an optimizer suitable for complex queries on a parallel database. Early optimizer architects made sure that other "environmental" factors were considered during plan building as well.

The optimizer knows about the number of AMPs and the number of nodes in the current configuration. It considers the processing power of the hardware, and uses costing algorithms in devising estimated costs of potential plans. Putting all this information together, the optimizer comes up with a price in terms of resources expected to be used for each of several candidate query plans, then picks the least costly candidate. The lowest-cost plan is the plan which will take the least system resources to execute. These final cost estimates are externalized in prose-like query "explain" text, for the user to read.

An important piece of information that the optimizer always looks for when building a plan is statistics: data demographics about the tables and columns that participate in a query. This statistical data, which is stored as histograms in the data dictionary, helps to determine the best order of joins, and it helps the optimizer assess the size of the data set that results from joining two tables. This information is used to select the best method of implementing the joins.

Thinking in Parallel

The Teradata optimizer was born into a parallel world. Because it was built on top of a shared-nothing architecture, it has been forced to think with a completely parallel mindset.

For example, before it settles on a step that does a table duplication,1 the optimizer evaluates the number of AMPs that the table will be copied to, estimates the number of rows that will be fanned out, and considers the total data load across the BYNET. Then it gives that an estimated cost and compares it to other alternatives that might involve less movement of data or a different join order. Its entire focus is to deliver a query plan that will execute a user-submitted query with the least possible effort.

Hiding Complexity

One thing customers have always liked about Teradata Database's optimizer is that it relieves the user who submits the query of having to get involved in directing the query plan. Optimization happens behind the scenes, and all influence over a plan, other than which statistics to collect, is taken out of the hands of the user. There is complete freedom to submit very complex ad hoc analytic queries, or canned tactical dashboard queries, or quick single-row look-up queries, because the optimizer will adjust to whatever is thrown at it.

Evolution

Many years of feedback and experience with Teradata data warehouse users have helped developers discover ways to enhance the optimizer's capabilities to better meet real-world needs. During this evolution, existing components are often expanded or used in a new way, without the need to start all over or change the underlying foundation. Row-IDs, for example, were repurposed so they could support no primary index tables, and were expanded from the original 8 bytes to 16 bytes in order to support new table partitioning opportunities.

There has been a continuous stream of new optimizer features over the last 30 years, all building on the original foundation. Some of the key ones include:

• More sophisticated statistics collection options, including sampling, thresholds for recollections, and optimizer-initiated statistics collection skipping.
• Extrapolation of statistical information at run-time when statistics collections are outdated.
• Join indexes (materialized views) synchronized with base tables, whose access is managed by the optimizer.
• New types of joins, including a star join that brings together unrelated small tables first before joining to a fact table, and in-memory hash joins to take advantage of the in-memory processing possible on today's large-memory processors.
• Incremental Planning and Execution (IPE), where the optimizer builds a partial plan, executes that first fragment, then, based on the output of the first fragment, builds and optimizes the second and subsequent fragments.
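A toy sketch of this kind of parallel-aware costing decision follows; the formulas and numbers are invented, not Teradata's actual algorithms, but they show the shape of the choice between duplicating a small table and redistributing both tables:

```python
# A toy illustration of parallel-aware costing (invented formulas):
# before choosing a join geography, the planner weighs duplicating the
# smaller table to every AMP against hash-redistributing both tables,
# and picks the cheaper plan.
def cost_duplicate(small_table_rows, num_amps, row_cost=1.0):
    # Every AMP receives a full copy of the small table.
    return small_table_rows * num_amps * row_cost

def cost_redistribute(left_rows, right_rows, row_cost=1.0):
    # Each row of both tables crosses the interconnect roughly once.
    return (left_rows + right_rows) * row_cost

amps = 100
plans = {
    "duplicate small table": cost_duplicate(5_000, amps),
    "redistribute both tables": cost_redistribute(5_000, 2_000_000),
}
best = min(plans, key=plans.get)
print(plans, "->", best)
```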



The BYNET's Considerable Contribution

Teradata Database was designed as a shared-nothing architecture, the hardware as well as the software. AMPs and parsing engines2 (PE) operate in isolation from one another. Messages are the means of communication among these many moving parts, and the interconnect is the glue that holds the pieces together and speeds along all of the parallel activities that are running on the Teradata Database (see Figure 6).

[Figure 6. AMPs and PEs communicate using messages: a parsing engine exchanges messages with AMP 0, AMP 1, and AMP 2.]

From the beginning, the interconnect was treated as something more than a delivery device for messages. Instead, the design of Teradata Database widely exploited the interconnect to increase performance and simplify user interaction with the database wherever possible. Functionality provided by the original Teradata interconnect, known as the YNet, lived in the hardware and the hardware driver code on the individual nodes.

The YNet did much more than a standard interconnect usually does. The BYNET, in use today, inherited these same capabilities. Beyond just passing messages, the BYNET is a bundle of intelligence and low-level functions that aid in efficient processing at practically every point in a query's life. It offers coordination as well as oversight and control to every optimized query step.

This section explores distinctive characteristics that were built into the YNet and that were passed on to the current BYNET. These original YNet benefits were essential for unveiling the original Teradata Database, and have proven to be equally indispensable today:

• Message Delivery: Sends, optimizes, and guarantees message arrival
• Multi-AMP Coordination: Oversees step completion and error handling when multiple AMPs are working on the same query step
• Final Answer Set Ordering: Efficient, dynamic merge of the final answer set across parallel units, bypassing expensive sort/merge routines
• Isolation: Insulates the database from configuration details, recognizes and adjusts to hardware failures
• Resource Conservation: Streamlines message traffic
  – Identifies and minimizes the number of AMPs involved in receiving a given message by setting up dynamic BYNET groups
  – Buffers up multiple small messages going to the same AMP or node, sending fewer large messages
• Congestion Control: Regulates high-volume messages to prevent overruns or bottlenecks

The following sections provide more detail about a few of these specific benefits performed by the BYNET.

Messaging

A key role of the BYNET is to support communication between the PEs and AMPs, and also from AMPs to other AMPs. This communication comes in many forms, some of it very straightforward:

• Sending a step from the dispatcher module on the PE to AMPs in order to initiate a query step
• Redistributing rows from one AMP to another in order to support different join geographies
• Sorting a final answer set from multiple AMPs

These simple message-passing requirements are performed using a low-level messaging approach, bypassing more heavyweight protocols for communication. For example, making a costly TCP/IP connection between AMPs and PEs every time a message needs to be sent is never required.



There is no connection setup or teardown cost whenever a process like row redistribution or table duplication needs to happen. And when rows are being redistributed or duplicated, outbound rows are never sent one at a time. Like car-poolers, individual messages are bundled up so that fewer messages ever need to be sent.

Even though message protocols are low-cost, Teradata Database goes further by minimizing interconnect traffic. Same-AMP, localized activity is encouraged wherever possible. AMP-based ownership of data keeps activities such as locking and some of the simple data processing local to the AMP. Hash partitioning that supports co-location of to-be-joined rows reduces data transporting prior to a join. All aggregations are ground down to the smallest possible set of sub-totals at the local (AMP) level first before being brought together globally via messaging. And even when BYNET activity is required, use of dynamic BYNET groups (originally called dynamic YNet groups) keeps the number of AMPs that must exchange messages down to the bare minimum.

Even More Performance Benefits

Teradata Database always uses broadcasts across the BYNET in cases where all or most of the AMPs in the system require the same information, such as the duplication of the rows of a table, or sending a dispatcher message for an all-AMP operation such as applying a table-level lock. But the cheaper point-to-point messaging option is considered when a smaller subset of AMPs is required.

BYNET point-to-point communication is similar to a standard phone call over the public telephone network. These mono-cast circuits connect one sender node to one receiver node. Generally known as a non-collision architecture, this approach minimizes the total volume of data in motion. And because it understands the hardware configuration and which AMPs are on which node, the BYNET can further optimize the process by delivering only one message to each physical node, with pointers to each of the AMPs on that node. This reduces message sending tremendously. And it eliminates the need for receiving and collating delivery confirmations from each AMP in the system.

Dynamic BYNET Groups

The Dynamic BYNET Group is simply an on-the-spot association of the AMPs that will be working on one specific query step. It is possible for many of these BYNET groups to exist at any point in time. When a query is optimized and the first step is ready to be dispatched, a message will be automatically sent across the BYNET, but directed only to the AMPs that are actually needed in doing that step's work. This may be all the AMPs in the system, or it may be a subset, or just one. Receipt of this step message causes the AMP to be automatically enrolled in the BYNET group without the database software having to initiate a separate communication.

Group AMP functionality is an optimizer opportunity that takes advantage of dynamic BYNET groups to better service tactical queries. This feature eliminates some all-AMP steps and replaces them with a step that engages just a subset of the AMPs, if the subset is all that the query requires. This reduces the resources required for such a query, and frees up unneeded AMPs to service other work, thus increasing throughput.

Semaphores

Even though the BYNET uses a light touch with message passing, it offers an even less intrusive technique, called channels, which it uses for behind-the-scenes inter-AMP coordination during the execution of a query step.

As a step begins to execute, one or more channels are established that loosely associate all AMPs in the dynamic BYNET group that is executing the step. The channels use monitoring and signaling semaphores in order to communicate things like the completion or the success/failure of each participating AMP. Semaphores are parallel infrastructure objects that are globally available because they live within the BYNET. Each completion semaphore, for example, contains a count that reflects how close that BYNET group's AMPs are to completing that step, as shown in Figure 7.

The semaphores' job is to signal when the first AMP in the group completes the optimizer step being worked on, and when the last AMP in the group completes the same step. This eliminates the need for every AMP to send a message to the dispatcher, and for a bunch of code to collate the responses and figure out when everyone has responded.

The completion semaphore's count is reduced by one when an AMP reports in, and will be reduced to zero when the last AMP completes. When the final AMP completes its work, it sees that the semaphore is registering zero and knows it is the last one done. Because it is the last participant to complete this step, this AMP sends a message to the dispatcher to send out the next optimized step for that query.



[Figure 7. A completion semaphore: the BYNET begins Step 1 across 3 AMPs and establishes a completion semaphore with a count of 3; as each AMP finishes its work the count drops (Time 2: count = 2, Time 3: count = 1, Time 4: count = 0); the last AMP done sends the message to the dispatcher for the next step, and the semaphore is disbanded.]
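The completion-semaphore protocol of Figure 7 can be modeled with ordinary Python threads; this is a minimal sketch, with invented class and function names, and the real semaphores live inside the BYNET rather than in database code:

```python
# A minimal sketch of the completion-semaphore idea: the count starts at
# the number of participating AMPs, each AMP decrements it when its part
# of the step is done, and only the AMP that drives the count to zero
# sends the single "last-done" message to the dispatcher.
import threading

class CompletionSemaphore:
    def __init__(self, amp_count, on_last_done):
        self.count = amp_count
        self.lock = threading.Lock()
        self.on_last_done = on_last_done

    def report_done(self, amp_id):
        with self.lock:
            self.count -= 1
            last = (self.count == 0)
        if last:
            # Only one message crosses back to the parsing engine,
            # whether the group has 3 AMPs or 3,000.
            self.on_last_done(amp_id)

def dispatcher_notify(amp_id):
    print(f"AMP {amp_id} was last done; dispatcher sends the next step")

sem = CompletionSemaphore(3, dispatcher_notify)
threads = [threading.Thread(target=sem.report_done, args=(i,)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```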

This "last-done" message is the only actual message sent back to the parsing engine concerning this step, whether the dynamic BYNET group is composed of three or 3,000 AMPs.

Some queries fail to complete. If the cause of a failure is limited to a single AMP, it is important that the other participating AMPs hear about the failure immediately and stop working on that query. If a tight coordination did not exist among AMPs in the same BYNET group, the problem-free AMPs would continue to work on the doomed query step, eating up resources in unproductive ways. Semaphores provide the means of alerting all AMPs in the group if one AMP should, for example, run out of spool,3 or otherwise terminate the step, using signaling across the BYNET in lieu of messaging.

Without the BYNET's ability to combine and consolidate information from across all units of parallelism, each AMP would have to independently talk to every other AMP in the system about each query step that is underway. As the configuration grows, such a distributed approach to coordinating query work would quickly become a bottleneck.

Final Answer Set Sort/Merge

Never needing to materialize a query's final answer set inside the database has long been a differentiator of the Teradata Database. The final sort/merge of a query takes place within the BYNET as the answer set rows are being funneled up to the client.

Three levels participate in returning a sorted answer set (see Figure 8):

• Level 1 (AMP-level): Each AMP performs a local sort in parallel with other AMPs and creates a spool file for its part of the answer set.
• Level 2 (Node-level): Each node merges and sorts one buffer's worth of data from all its contributing AMPs.
• Level 3 (PE-level): The PE receives and sorts node-level buffers, building one buffer's worth of sorted data to return to the client.

The highest values are sent up through the three tiers first, while the part of the answer set that contains the lower values remains in the spool files at the AMP level until the higher values have been returned to the client.



[Figure 8. Merging/sorting a final answer set within the BYNET: rows are sorted and spooled on each AMP (AMP-level); each node builds one buffer's worth of sorted data off the top of its AMPs' sorted spool files (node-level); and the PE builds a single sorted buffer from the node buffers, returning the answer to the client one buffer at a time (PE-level).]
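The three-level merge of Figure 8 can be approximated with Python's heapq.merge. This simplified sketch merges ascending and ignores buffering, whereas the real mechanism returns the highest values first, a buffer at a time; the spool contents mirror the figure's example values:

```python
# A rough sketch of the three-level answer-set merge using heapq.merge.
import heapq

# Level 1: each AMP has already sorted and spooled its part of the answer.
amp_spools_node1 = [[4, 8], [1, 7], [3, 11]]    # AMPs 1-3 on node 1
amp_spools_node2 = [[6, 10], [2, 12], [5, 9]]   # AMPs 4-6 on node 2

# Level 2: each node merges the sorted streams of its own AMPs.
node1_stream = heapq.merge(*amp_spools_node1)
node2_stream = heapq.merge(*amp_spools_node2)

# Level 3: the parsing engine merges the node-level streams and hands
# the final, fully ordered answer set to the client incrementally.
for value in heapq.merge(node1_stream, node2_stream):
    print(value, end=" ")   # 1 2 3 4 5 6 7 8 9 10 11 12
```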

During this process there is minimal physical I/O performed. For a given row, only a single write—into the AMP's spool table—is ever performed. And only a single read of that same row is ever needed—to retrieve the data for the user in the correct order.

A big benefit of this approach is that the final answer set never has to be brought together in one location for a possibly large final sort. Rather, the answer set is finalized and returned to the client a buffer at a time. A potential "big sort" penalty has been eliminated; or actually, it never existed.

Evolution

The original architecture of the YNet interconnect easily evolved into the current BYNET, and as it did, it underwent several transformations. The BYNET brought greater availability and reliability, supporting multiple virtual broadcast trees and multiple paths.

Recent enhancements to BYNET functionality include:

• Converting YNet hardware functionality into software capabilities: A virtualized BYNET allows the Teradata Database to embrace whatever is the optimal general-purpose interconnect functionality at any point in time.
• Moving BYNET hardware into software also allows for transparent and consistent inter-AMP communications, making it possible for AMPs on the same node to contact each other without going over the interconnect, thereby reducing delays and traffic congestion.
• Support for multiple Teradata systems sharing the same network infrastructure, with intra-system communications isolated over private partitions.



A Flexible, Fast Way to Find and Store Data

The previous sections discussed the original functionality of the Teradata Database in terms of the parallelism, the optimizer, and the BYNET. Another very important factor behind the enduring performance of Teradata Database is how space is managed. What is the effort of locating a row in the database? What happens when a row is inserted and there is no room in the data block where it belongs?

This section will focus on the original architecture of the sub-system that handles space management in the database, a sub-system called the "file system." The file system is responsible for the logical organization and management of the rows, along with their reliable storage and retrieval.

At first glance, managing space seems like a trivial exercise, something a robot could easily be programmed to do. But the file system in Teradata was architected to be extremely adaptable, simple on the outside but surprisingly inventive on the inside. It was designed from Day One to be fluid and open to change. The file system's built-in flexibility is achieved by means of:

• Logical addressing, which allows blocks of data to be dynamically shifted to different physical locations when needed, with minimal impact to active work.
• The ability for data blocks to expand and contract on demand, as a table matures.
• Reliance on an array of unobtrusive background tasks that do continuous space adjustments and clean-up.

Teradata was architected in such a way that no space is allocated or set aside for a table until such time as it is needed. Rows are stored in variable-length data blocks that are only as big as they need to be. These data blocks can dynamically change size and be moved to different locations on the cylinder, or even to a different cylinder, without manual intervention or end-user knowledge.

This section takes a close look at how the file system frees the administrator from mundane data placement tasks, and at the same time provides an environment that is friendly to change.

How Data is Organized

Teradata permanently assigns data rows to AMPs using a simple scheme that lends itself to an even distribution of data—hash partitioning. As initially designed, the values found in the columns selected by the DBA to be that table's primary index are put through a hashing algorithm, and two outputs are produced (see Figure 9), often referred to as the "hash" of the row:

• A hash bucket, which maps to one AMP when applied against a pre-defined hash map
• A hash-ID, which becomes part of the row's unique "row-ID"4

In addition to being a distribution technique, this hash approach to data placement serves as an indexing strategy, which reduces the amount of DBA work normally required to set up direct access to a row. To retrieve a row, the primary index data value is passed to the hashing algorithm, which generates the two hash outputs: 1.) the hash bucket, which points to the AMP; and 2.) the hash-ID, which helps to locate the row within the file system structure on that AMP. There is no space or processing overhead involved in either building a primary index or accessing a row through its primary index value, as no special index structure needs to be built to support the primary index.

[Figure 9. A row's primary index hash bucket points to the AMP that owns it: a customer row is inserted, the hashing algorithm produces a hash bucket and a hash-ID, and the hash bucket points to one of 16 AMPs spread across 4 nodes.]
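A simplified model of this two-output hashing scheme is sketched below; the hash function, bucket count, and hash-map layout are invented stand-ins for Teradata's internal versions:

```python
# A simplified model of hashed row placement: the primary index value
# hashes to a bucket, a pre-defined hash map assigns each bucket to an
# AMP, and the hash-ID becomes part of the row's row-ID.
import zlib

NUM_BUCKETS = 65536
AMPS = list(range(16))   # 16 AMPs, as in the 4-node example of Figure 9

# Pre-defined hash map: every bucket is owned by exactly one AMP.
hash_map = {b: AMPS[b % len(AMPS)] for b in range(NUM_BUCKETS)}

def row_hash(primary_index_value):
    # Stand-in hash; returns (hash_bucket, hash_id).
    h = zlib.crc32(repr(primary_index_value).encode())
    return h % NUM_BUCKETS, h

def owning_amp(primary_index_value):
    bucket, _ = row_hash(primary_index_value)
    return hash_map[bucket]

# Insert and retrieve use the same arithmetic, so no separate index
# structure is ever needed to find the row's AMP.
print(owning_amp("customer-1042"))
```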



Hashed data placement is very easy to use and requires no setup. The only effort a DBA makes is the selection of the columns that will comprise the primary index of the table. From that point on, the process is completely automated. No files need to be allocated, sized, monitored, or named. No DDL needs to be created beyond specifying the primary index in the original CREATE TABLE statement. No unload-reload activity is ever required.

Once the owning AMP is identified by means of the hash bucket, the hash-ID is used to look up the physical location of the row on disk. Which cylinder and sector holds the row is determined by means of a tree-like, three-level indexing structure, as shown in Figure 10 and further explained in the following sections.

[Figure 10. A three-level indexing structure identifies a row's location on an AMP: the master index (one per AMP) is a sorted list of cylinder indexes; the cylinder indexes (many per AMP) are sorted lists of data blocks; and the data blocks (many per cylinder) hold rows sorted by row-ID.]

Master Index

The master index (MI) is the topmost level of the tree-like structure. An instance of the MI is memory-resident on each AMP. The MI lists all cylinders managed by that AMP in sorted ascending sequence, and contains a series of cylinder index descriptors. These descriptors contain the lowest table-ID and row-ID, and the highest table-ID and row-ID, for the data residing on each individual cylinder. Each descriptor provides a pointer to the physical location of the cylinder index for that cylinder. The MI also includes a list of free cylinders available on that AMP.

Cylinder Index

The cylinder index (CI) is the second level. Contrary to the master index, cylinder indexes may or may not be held in AMP memory. While there is only a single MI per AMP, there may be thousands of CIs, many of which may rarely be used.

Physically, the cylinder index is part of the cylinder itself. Think of it as the first data block in the cylinder, as it can be read into memory and updated just like any other data block. There are actually two cylinder indexes on each cylinder, so that when a CI is being updated, another version is available to read, and no blocking will take place.

A CI is composed of a sorted list of data block descriptors. These data block descriptors hold the first table-ID and row-ID for each data block, the number of rows in the block, and its physical location and length, as well as a free block list indicating where free space exists on the cylinder.

Data Blocks

The third level in the tree is the data block itself. Data blocks contain a series of rows that are from the same table. The rows within the data blocks are always sorted by hash. No space is allocated ahead of time for a data block; rather, space is dynamically made available at the time data is loaded or a new row is inserted.

Although there is an effort to keep all data blocks from one table ordered physically by hash on a cylinder, growth that causes blocks to exceed their maximum allowable size and split sometimes makes that difficult. But because the entries are logically sorted in the master index and the cylinder index, any data block within a cylinder can be directly accessed even when it is physically out of sequence with other data blocks. See Figure 12 for an example.

The data block is the physical I/O unit of work in the Teradata file system. Block sizes are not expected to be uniform in length, which alleviates the need for the administrator to spend time optimizing storage or tuning row sizes. But each table has a block size specification, which acts as an upper limit on the size of a multi-row data block. Although this may increase from release to release, the system will enforce a maximum data block size that caps how large a physical data block may be; this maximum is at least as large as the largest row size allowed. The administrator may choose a smaller data block size as the maximum multi-row data block size, on a table-by-table basis.
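The lookup path through the three levels can be sketched as a pair of binary searches followed by a block scan; the descriptor contents here are pared down to sorted key ranges and are purely illustrative (real descriptors also carry table-IDs, lengths, and free lists):

```python
# An illustrative lookup through the three levels: binary-search the
# master index for the cylinder, then the cylinder index for the data
# block, then scan the block for the row-ID.
import bisect

master_index = [           # (lowest row-ID on cylinder, cylinder number)
    (100, 7), (500, 3), (900, 12),
]
cylinder_indexes = {       # per cylinder: (first row-ID in block, block)
    3: [(500, "blk-a"), (700, "blk-b")],
}
data_blocks = {"blk-a": [500, 510, 640], "blk-b": [700, 820, 899]}

def find_row(row_id):
    # Level 1: master index -> cylinder whose range covers row_id.
    keys = [k for k, _ in master_index]
    cyl = master_index[bisect.bisect_right(keys, row_id) - 1][1]
    # Level 2: cylinder index -> data block descriptor.
    ci = cylinder_indexes[cyl]
    keys = [k for k, _ in ci]
    block = ci[bisect.bisect_right(keys, row_id) - 1][1]
    # Level 3: the data block itself holds the sorted rows.
    return row_id in data_blocks[block], cyl, block

print(find_row(640))   # (True, 3, 'blk-a')
```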



Easy Accommodation of Data Growth

Teradata Database was built using a logical addressing model as a low-impact way to adjust to data growth. Data in a Teradata system is stored in flexibly-sized data blocks that are loosely anchored together via logical-to-physical indexing, and are able to be floated from one physical disk location to another as needed. Multi-row data blocks that grow beyond a DBA-specified maximum size automatically split to make room for more rows. If a particular block needs to grow beyond the space it has on its cylinder, that block can be moved to a different location on the same cylinder, or to a different cylinder. When this happens, the appropriate cylinder index is updated to reflect the new physical location. Figure 11 explains this behavior visually.

[Figure 11. A new row is inserted into an existing data block: if there is space on the cylinder, the data block simply expands; if there is no free space on the cylinder, the block splits across two cylinders.]

This adaptable behavior delivers numerous benefits. Random growth is accommodated at the time it happens. Rows can easily be moved from one location to another without affecting in-flight work or any other data objects that reference that row. There is never a need to stop activity and re-organize the physical data blocks, or adjust pointers.

This flexibility to consolidate or expand data blocks anytime allows Teradata Database to do many space-related housekeeping tasks in the background and avoid the table unloads and reloads common to fixed-size page databases. This advantage increases database availability and translates to fewer maintenance tasks for the DBA.

Contrast this to the fixed-size page approach to storing rows. The physical address of such rows can only be changed by removing and re-inserting the rows, which usually requires the database to be brought down for a data reorganization procedure. And if overflow pages are required (a concept unknown to Teradata Database), this can cause performance degradation, because overflow pages add additional I/O to accessing and inserting rows.

Table Scan Efficiency with Logical Addressing

This section illustrates an example of how logical addressing works when a query is scanning a table whose data blocks have matured over time. Every effort is made by the file system to keep rows from the same table physically co-located and in row hash sequence. But as a table grows, and some blocks split because they get too large, strict physical ordering is not always sustainable. This is where the advantage of logical addressing arises.

In the example (Figure 12), the first data block and the second data block are immediately adjacent to one another. Data block 1 has no room for growth, and therefore has to split into two smaller data blocks in order to accommodate a new row and still maintain row hash order within the data block.

[Figure 12. Table scan using logical addressing after a new row is inserted: before the insert, the scan reads the first data block (row hashes 11, 13, 19) and then the second (32, 35, 36); after a row with row hash 18 is inserted, the first block splits, so the scan reads the first block (11, 13), then the new third block (18, 19), then the second block (32, 35, 36), following logical rather than physical order.]
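A toy version of the Figure 12 block split appears below; block capacity and block naming are invented, but the sketch shows how only the cylinder index changes while scans continue to follow logical row-hash order:

```python
# A toy version of a data block split under logical addressing.
MAX_ROWS_PER_BLOCK = 3

# Cylinder index: list of (first row hash in block, block id), kept sorted.
cylinder_index = [(11, "block1"), (32, "block2")]
blocks = {"block1": [11, 13, 19], "block2": [32, 35, 36]}

def insert_row(row_hash):
    # Find the block whose logical range should hold this row hash.
    idx = max(i for i, (first, _) in enumerate(cylinder_index)
              if first <= row_hash)
    block_id = cylinder_index[idx][1]
    rows = sorted(blocks[block_id] + [row_hash])
    if len(rows) <= MAX_ROWS_PER_BLOCK:
        blocks[block_id] = rows
        return
    # Split: the upper half moves to a new block; only the cylinder
    # index changes, so physical placement of the new block is free.
    mid = len(rows) // 2
    blocks[block_id] = rows[:mid]
    new_id = f"block{len(blocks) + 1}"
    blocks[new_id] = rows[mid:]
    cylinder_index.insert(idx + 1, (rows[mid], new_id))

insert_row(18)
# A scan follows the cylinder index, not physical adjacency:
for _, block_id in cylinder_index:
    print(block_id, blocks[block_id])
# block1 [11, 13] / block3 [18, 19] / block2 [32, 35, 36]
```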



To keep a semblance of order across what could end up being a messy free-for-all if logical addressing ran amok, the file system imposes several internal conventions on the ordering of rows within data blocks and across cylinders:

1. A data block can only contain rows from one table.
2. The row hash values present in a given data block are either all greater than, or all less than, the row hashes maintained on any other data block of the same table.
3. The key values for a cylinder's cylinder index are either all greater than, or all less than, the values maintained on any of the other cylinders within an AMP.

As a result, when a table is scanned in row hash order, all the rows in one data block can be processed before moving on to the next data block. And while the next data block may not be physically adjacent to the current data block within the cylinder, all data blocks on one cylinder can be read before moving on to the next cylinder. There is never a need to go back and read part of an already-read cylinder. And no pointer chains are ever needed to locate logically adjacent data blocks. This also means that when using a special I/O-saving feature known as a cylinder read, all data blocks that are on the cylinder can be processed in the correct order using a single I/O.

Background Clean-up Tasks

From the very first instance of Teradata Database, background tasks have played an important role. Transparent to the user and DBA, the database continuously performs clean-up tasks, such as consolidating small pockets of free space on a cylinder (called "defrag"). This housekeeping work is done only when it is required, and at a low priority, so as not to impact other active work. Consequently, DBA intervention or system down time is typically not required to keep the rows organized.

The list below shows a few of these self-managing tasks:

• Defrag: Consolidates fragmented space, one cylinder at a time—the fuller the disks, the more work the task will do.
• AutoCylPack: Combines adjacent, sparsely filled cylinders; tends to run these "cylpacks" when the system is idle.
• Transient journal purging: Removes transaction journal entries from disk periodically, after the commit has taken place, rather than holding up transaction completion for this work.
• FSG5 cache purging: Looks for older data blocks that have been updated but are still resident in the FSG cache, and writes them to disk.

Evolution

In the original Teradata release, each AMP logically and physically owned and managed its own data on its own physical set of disks. As time passed and more advanced disk technologies became available, the logical whereabouts of the data as seen by the file structure components and its actual physical address on disk became abstracted. Logical addressing, which was already in place as the key file system approach for accommodating every-day data growth, made this decoupling straightforward.

One of the most noteworthy features contributing to virtualization of the file system is called Teradata Virtual Storage (TVS), which, when it was introduced, enhanced multi-temperature capabilities by optimizing data placement on the disks to speed access to the most frequently used hot data. Within the file system, TVS has taken over all responsibility for the physical organization and management of data on disk. Using TVS, for example, a cylinder might be migrated to a different physical storage device depending on its access pattern.

TVS enables the mixing of drive sizes and device types (e.g., spinning and solid state disks) within a single Massively Parallel Processing (MPP) system. If there is a mix of devices, hot data is moved to fast storage, while cold data is moved to slower devices. All of this relocation activity happens behind the scenes, in the background, managed by TVS. TVS is integrated with the Teradata Database and is data-object aware: it is able to understand, for example, the difference between temporary spool files and user table data.

This decoupling, this virtualization of disk space management (as shown in Figure 13), makes the Teradata Database even more scalable and higher performing. And it builds on, rather than replaces, the original way of doing things. The DBA still defines the table logically, and the database takes care of allocating file space and deciding where the data resides at any given time.



[Figure 13. Disassociation of physical from logical addressing (evolution in the abstraction of physical storage): the logical association of the data (master index, cylinder indexes, data blocks) stays the same over time, while the physical location of the data becomes separated from its logical representation—originally JBOD, then RAID, and finally Teradata Virtual Storage.]

Some of the features in addition to TVS that were built on, and rely upon, the original file system architecture include:

• Table partitioning that supports row, column, or a combination of both approaches, plus dynamic partition elimination, with all partitioned rows stored and manipulated using the same original internal file system conventions, including the row-ID layout.
• Block-level compression, in which all rows within a data block are compressed as a unit.
• Temperature-based block-level compression (compress on cold), an extension of block-level compression where only data blocks residing on infrequently accessed cylinders undergo compression.
• No Primary Index tables, which offer a fast approach for dumping data randomly across the AMPs that bypasses hash partitioning, well-suited for temporary staging tables.

Work Flow Self-Regulation

A shared-nothing parallel database has a special challenge when it comes to knowing how much new work it can accept, and how to identify congestion that is starting to build up inside one or more of the parallel units. With the optimizer attempting to apply multiple dimensions of parallelism to each query that it sees, it is easy to reach very high resource utilization within a Teradata system, even with just a handful of active queries.

Designed for stress, the Teradata Database is able to function with large numbers of users, a very diverse mix of work, and a fully-loaded system. Being able to keep on functioning full throttle under conditions of extreme stress relies on internal techniques that were built inside the database to automatically and transparently manage the flow of work, while the system stays up and productive.

Even though the data placement conventions in use with Teradata Database lend themselves to even placement of the data across AMPs, the data is not always accessed by queries in a perfectly even way. During the execution of a multi-step query, there will be occasions when some AMPs require more resources for certain steps than do other AMPs.



For example, if a query from an airline company site is executing a join based on airport codes, you can expect whichever AMP is performing the join for rows with Atlanta (ATL) to need more resources than the AMP that is joining rows with Anchorage (ANC).

Does the database keep pushing work down to the AMPs even when some AMPs can't handle more? At what point does the database put on the brakes and stop sending new work messages to the AMPs? Is a central coordinator task necessary to poll the work levels of each AMP, then mandate when more work can be sent down or when it's time to put on the brakes? A central coordinator task is something to avoid in a shared-nothing parallel database, because it becomes a non-parallel operation which can become a bottleneck as activity, database size, or the number of units of parallelism increases.

AMP-Level Control

Teradata Database manages the flow of work that enters the system in a highly decentralized manner, in keeping with its shared-nothing architecture. There is no centralized coordinator. There is no message-passing between AMPs to determine if it's time to hold back new requests. Rather, each AMP evaluates its own ability to take on more work, and temporarily pushes back when it experiences a heavier load than it can efficiently process. And when an AMP does have to push back, it does that for the briefest moments of time, often measured in milliseconds.

This bottom-up control over the flow of work was fundamental to the original architecture of the database as designed. All-AMP step messages come down to the AMPs, and each AMP will decide whether to begin working on a message, put it on hold, or ignore it. This AMP-level mindfulness is the cornerstone of the database's ability to accept impromptu swings of very high and very low demand, and gracefully and unobtrusively manage whatever comes its way.

Individual AMPs have two techniques at their disposal when they experience stress:

1. Queuing up arriving messages
2. Turning away new messages

Both techniques provide a short-lived breather for the AMP, allowing it to focus on finishing off the tasks at hand before taking on new work.

Queuing Up Arriving Messages

Each AMP contains a set of structures that aid in performing database work. Among those structures are a defined number of AMP worker tasks (AWTs). Each AMP has, by default, 80 AWTs.

AMP Worker Tasks

AWTs are the tasks inside each AMP that get the database work done. This database work may be initiated by internal database software routines, such as deadlock detection or other background tasks. Or the work may originate from a user-submitted query. These pre-allocated AWTs are assigned to each AMP at startup and, like taxi cabs queued up for fares at the airport, they wait for work to arrive, do the work, and come back for more work.

Because of their stateless condition, AWTs respond quickly to a variety of database execution needs. There is a fixed number of AWTs on each AMP. For a task to start running, it must acquire an available AWT. Having an upper limit on the number of AWTs per AMP keeps the number of activities performing database work within each AMP at a reasonable level. AWTs play the role of both expeditor and governor.

As part of the optimization process, a query is broken into one or many AMP execution steps. An AMP step may be simple, such as reading one row using a unique primary index or applying a table-level lock. Or an AMP step may be a very large block of work, such as scanning a table, applying selection criteria to the rows read, redistributing the rows that are selected, and sorting the redistributed rows.

The Message Queue

When all AMP worker tasks on an AMP are busy servicing other query steps, arriving work messages are placed in a message queue that resides in the AMP's memory. This is a holding area until an AWT frees up and can service the message.

This queue is sequenced first by message work type, which is a category indicating the importance of the work message. Within work type, the queue is sequenced by the priority of the request the message is coming from. Messages that start new work are always placed at the end of the message queue, adhering to one of the underlying philosophies of Teradata Database: it's more important to complete work that is already active than it is to start new work (see Figure 14).



[Figure 14. The work message queue is ordered by work type (descending), with spawned messages queued ahead of new messages.]
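The interplay of AWTs and the message queue can be modeled schematically as follows; the work-type and priority values are invented, and the real queue logic is far richer:

```python
# A schematic model of AWTs and the message queue: a fixed pool of
# worker tasks serves messages; when all AWTs are busy, arriving
# messages queue in memory, ordered by work type first and request
# priority second, with brand-new work sorting behind active work.
import heapq

AWT_LIMIT = 80   # default number of AMP worker tasks per AMP

class Amp:
    def __init__(self):
        self.busy_awts = 0
        self.message_queue = []   # heap of (work_type, priority, msg)

    def arrive(self, work_type, priority, msg):
        if self.busy_awts < AWT_LIMIT:
            self.busy_awts += 1          # an AWT picks the message up
            return f"running: {msg}"
        # No AWT free: hold the message until one comes back.
        heapq.heappush(self.message_queue, (work_type, priority, msg))
        return f"queued: {msg}"

    def awt_finished(self):
        self.busy_awts -= 1
        if self.message_queue:
            _, _, msg = heapq.heappop(self.message_queue)
            self.busy_awts += 1
            return f"dequeued: {msg}"
        return "idle"

amp = Amp()
amp.busy_awts = AWT_LIMIT                 # simulate a fully busy AMP
print(amp.arrive(work_type=1, priority=2, msg="spawned step"))
print(amp.arrive(work_type=2, priority=1, msg="new query step"))
print(amp.awt_finished())   # spawned work is dequeued before new work
```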

It is normal during busy processing times to have some AMPs run out of AWTs and begin to queue arriving messages. This condition is usually temporary, and in most cases is transparent to the user submitting queries.

Messages representing a new query step are broadcast to all participating AMPs by the dispatcher. In such a case, some AMPs may provide an AWT immediately, while other AMPs may have to queue the message. Some AMPs may dequeue their message and start working on the step sooner than others. This is typical behavior on a busy system where each AMP is managing its own flow of work.

Once a message has either acquired an AWT or been accepted onto the message queue on each AMP in the dynamic BYNET group, it is assumed that each AMP will eventually process it, even if some AMPs take longer than others. The sync point for the parallel processing of each step is at step completion, when each AMP signals across the completion semaphore that it has completed its part. The BYNET channels set up for this purpose are discussed more fully in the BYNET section of this paper.

Turning Away New Messages

The message queue is the first level of push-back on an AMP. However, the message queue cannot expand endlessly when more and more messages are landing on the queue and fewer and fewer are being released to run. Teradata Database's flow control mechanisms therefore go a step further than queuing up work messages.

In order to conserve memory, the length of the message queue is controlled. When the number of messages queued reaches a threshold, the AMP turns to a second mechanism of relief—sending newly-arriving messages back to their source.

Messages have senders and receivers. In the case of a new step being sent to the AMPs, the dispatcher is the sender and some number of AMPs are the receivers. However, in cases such as spawned work,6 one AMP is the sender and another AMP or AMPs are the receivers.

Flow Control Gates

Each AMP has flow control gates that monitor and manage messages arriving from senders. There are separate flow control gates for each different message work type.7 New work messages have their own flow control gates, as do spawned work messages. The flow control gates keep a count of the active AWTs of that work type, as well as how many messages are queued up waiting for an AWT.

Figure 15. Flow control gates close when a threshold of messages is reached. (In the illustration, the gate for broadcast spawned messages is open, while the gate for broadcast new messages is closed; rejected messages are retried by the sender.)


Figure 16. AMPs handle messages independently from one another. (AMP 1 has AWTs available and starts new work; AMP 2 has exhausted its AWTs and queues new work; AMP 3 is in flow control and rejects new work, which its senders retry.)

Once the queue of messages of a certain work type grows to a specified length, new messages of that type are no longer accepted, and that AMP is said to be in a state of flow control, as shown in Figure 15. The flow control gate temporarily closes, pulling in the welcome mat, and arriving messages are returned to the sender. The sender, often the dispatcher module within the PE, continues to retry the message until that message can be received on that AMP's message queue.
Getting Back to Normal

Because Teradata Database is a message-passing database, there are many different types of message queues within the system architecture. All of these queues (sometimes referred to as "mailboxes") have limits set on the number of messages they can accommodate at any point in time. Any of them, including the work message queue already discussed, can go into a state of flow control if the system is temporarily overwhelmed by work. Every AMP handles these decisions independently of every other AMP, as illustrated in Figure 16.

Because the acceptance and rejection of work messages happens at the lowest level, in the AMP, there are no layers to go through when the AMP is able to return to normal message delivery and processing. The impact of turning the flow of messages off and back on is kept local—only the AMP hit by an over-abundance of messages at that point in time throttles back, and only temporarily.

Riding the Wave of Full Usage

Teradata was designed as a throughput engine, able to exploit parallelism to maximize the resource usage of each request when only a few queries are active, while at the same time continuing to churn out answer sets in high-demand situations. To protect overall system health under extreme usage conditions, highly-decentralized internal controls were put into the foundation, as discussed in this section.

The original architecture related to flow control and AMP worker tasks has needed very little improvement, or even tweaking, over the years. 80 AWTs per AMP is still the default setting for new Teradata systems. Message work types, the work message queue, and retry logic all work the same as they always did.

A few extensions to AMP worker tasks have emerged over time (a simplified illustration follows this list), including:

•• Reserve pools of AWTs set aside exclusively for tactical queries, protecting high-priority work from being impacted when there is a shortage of AWTs.

•• Automatic reserve pools of AWTs just for load utilities, which become available when the number of AWTs per AMP is increased to a very high level; these are intended to reduce resource contention between queries and load jobs on enterprise platforms with especially high concurrency.
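As a simplified illustration of the reserve-pool idea—hypothetical Python, with an invented pool class and an invented reserve size (only the 80-AWT default comes from the text above)—a small number of workers is held back so that tactical work can still acquire an AWT when the general pool is exhausted:

```python
class AWTPool:
    """Toy AMP worker task pool with a reserve held back for tactical work."""

    def __init__(self, total=80, tactical_reserve=2):
        # 80 AWTs per AMP is the long-standing default; the reserve
        # size used here is purely illustrative.
        self.free = total - tactical_reserve
        self.reserve = tactical_reserve

    def acquire(self, tactical=False):
        if self.free > 0:
            self.free -= 1
            return True
        if tactical and self.reserve > 0:
            self.reserve -= 1          # only tactical work may dip into the reserve
            return True
        return False                   # caller queues the message instead

pool = AWTPool()
pool.free = 0                          # pretend all general AWTs are busy
assert not pool.acquire()              # an ordinary step must wait
assert pool.acquire(tactical=True)     # a tactical step still gets a worker
```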



Workload Management

The second section in this whitepaper called attention to the multifaceted parallelism available for queries on Teradata Database. The subsequent section discussed how the optimizer uses those parallel opportunities in smart ways to improve performance on a query-by-query basis. And the previous section illustrated internal AMP-level controls that keep high levels of user demand and an over-abundance of parallelism from bringing the system to its knees.

In addition to those automatic controls at the AMP level, Teradata has always had some type of system-level workload management—mainly priority differences—used by the internal database routines.

The Original Four Priorities

One of the challenges faced by the original architects of Teradata Database was how to support maximum levels of resource usage on the platform and still get critical pieces of internal database code to run quickly when needed. For example, if a rollback is taking place due to an aborted transaction, it benefits the entire system if the reversal of updates that cleans up the failure can be executed quickly.

It was also important to ensure that background tasks running inside the database didn't lag too far behind. If city streets are so congested with automobile traffic that the weekly garbage truck can't get through and is delayed for weeks at a time, a health crisis could arise.

The solution the original architects found was a simple priority scheme that applied priorities to all tasks running on the system. This rudimentary approach offered four priority buckets, each with a greater weight than the one before: L for Low, M for Medium, H for High, and R for Rush. The default priority was Medium, and indeed most work ran at Medium and was considered equally important to other Medium-priority work that was active.

However, database routines and even small pieces of code could assign themselves one of the other three priorities, based on the importance of the work. Developers, for example, decided to give all END TRANSACTION activity the Rush priority, because finishing almost-completed work at top speed frees up valuable resources sooner, and was seen as critical within the database. In addition, if the administrator wanted to give a favored user a higher priority, all that was involved was manually adding one of the priority identifiers into the user's account string.

Background tasks discussed in the section about space management were designed to use priorities as well. Some of these tasks, like the task that deletes transient journal rows that are no longer needed, were designed to start out at the Low priority, but to increase their priority over time if the system was so busy that they could not get their work accomplished. This approach kept such tasks in the background most of the time, except when their need to complete became critical.
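A minimal sketch of that aging behavior follows—illustrative Python, not Teradata code; the "patience" interval and function name are invented for this example. A starved background task is promoted one priority level for each interval it has waited:

```python
PRIORITIES = ["L", "M", "H", "R"]      # Low, Medium, High, Rush

def escalate(current, starved_seconds, patience=300):
    """Promote a starved background task one level per 'patience' interval.

    The five-minute patience value is illustrative, not a Teradata setting.
    """
    steps = min(starved_seconds // patience, len(PRIORITIES) - 1)
    target = max(PRIORITIES.index(current), steps)
    return PRIORITIES[target]

assert escalate("L", 0) == "L"         # plenty of resources: stay low
assert escalate("L", 700) == "H"       # starved ~12 minutes: escalate
assert escalate("L", 10_000) == "R"    # critically overdue: rush
```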
Impact of Mixed Workloads

The simple approach to priorities was all the internal database tasks required. And early users of the database were satisfied running all their queries at the default Medium priority. But requirements shifted over time as users of the Teradata Database began to supplement their traditional decision support queries with new, more varied types of workloads.

In the late 1990s, a few Teradata sites began to issue direct look-up queries against entities like their Inventory tables or their Customer databases at the same time as their standard decision support queries were running. Call centers started using data in their Teradata Database to validate customer accounts and recent interactions. Tactical queries and online applications blossomed, at the same time as more sites turned to continuous loading to supplement their batch windows, giving their end users more timely access to recent activity. Service level goals entered the picture. Stronger, more flexible workload management was required.

Evolution of Workload Management

While the internal management of the flow of work has changed little, the capabilities within system-level workload management have expanded dramatically over the years. As the first step beyond the original four priorities, Teradata engineering developed a more extensive priority scheduler composed of multiple resource partitions and performance groups, with the flexibility of assigning your own customized weighting values. These custom weightings and additional enhancements make it easier to match controls to business workloads and priorities than the original capabilities, which were designed more for controlling internal system work.
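The weighting concept can be made concrete with a small sketch—illustrative Python; the performance group names and weights below are invented, not recommended settings. Each performance group receives a share of CPU proportional to its assigned weight:

```python
def cpu_shares(weights):
    """Split 100% of CPU across performance groups by relative weight."""
    total = sum(weights.values())
    return {group: round(100 * w / total, 1) for group, w in weights.items()}

# Invented performance groups and weights, for illustration only.
print(cpu_shares({"tactical": 40, "reporting": 20, "batch_load": 10, "background": 5}))
# -> {'tactical': 53.3, 'reporting': 26.7, 'batch_load': 13.3, 'background': 6.7}
```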



Additional workload management features and options that have evolved over the years include:

•• Concurrency control mechanisms, called throttles, that can be placed at multiple levels and tailored to specific types of queries or users (illustrated in the sketch after this list).

•• An improved and more effective priority scheduler, accompanying the Linux SLES 11 operating system, that can better protect short, critical work from more resource-intensive, lower-priority jobs.

•• Rules to reject queries that are poorly written or that are inappropriate to run at certain times of the day.

•• The ability to automatically change workload settings by time of day or system conditions.

•• The ability to automatically reduce the priority of a running query that exceeds the threshold of resources consumed for its current priority.

•• Two complete collections of workload management features, referred to as Teradata Active System Management and Teradata Integrated Workload Management.

•• A user-friendly front-end GUI, called Viewpoint Workload Designer, that supports ease of setup and tuning.
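A throttle can be thought of as a concurrency cap, as in this hedged Python sketch (the limit and class name are invented; actual throttle rules are richer than this): queries beyond the cap are delayed rather than rejected, and released as running queries finish.

```python
class Throttle:
    """Toy concurrency throttle: admit up to 'limit' queries; delay the rest."""

    def __init__(self, limit=10):
        self.limit = limit                   # illustrative concurrency cap
        self.running = 0
        self.delay_queue = []

    def submit(self, query):
        if self.running < self.limit:
            self.running += 1
            return "run"
        self.delay_queue.append(query)       # held until a running query finishes
        return "delayed"

    def finish(self):
        self.running -= 1
        if self.delay_queue:
            self.running += 1
            return self.delay_queue.pop(0)   # release the oldest delayed query
        return None
```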
Workload management in Teradata has proven to be a rapidly expanding area, indispensable to customers that are running a wide variety of work on their Teradata platform. While internal background tasks and subsets of the database code continue to run at the four priority levels initially defined for them, many Teradata sites have discovered that their end users' experiences are better, and that they can get more work through the system, when they take advantage of the wider workload management choices available today. And many do just that.

Conclusion

Foundations are important. If you remember the fairy tale of the three little pigs, the first little pig built his house of straw, the second little pig built his house of sticks, and the third little pig built his house of bricks. When the big bad wolf came along, he was able to blow down the houses built of straw and sticks, but not the house built of bricks. So only the third little pig survived to live happily ever after.

Teradata Database is a survivor. Its ability to grow in new directions and continue to sustain its core competencies is a direct result of its strong, tried-and-true foundation.

An interesting pattern has emerged over the years as the Teradata Database has matured, a pattern that underscores the unusual adaptability of the database: logical components and their physical implementation have become more and more disassociated.

•• In Version 1 Release 1, each AMP was a physical node that actually owned its own disk drives and directly managed how data was located on its disks. Today an AMP is a software virtual processor that co-exists on an SMP node with other such virtual processors, all of which share the node's resources. Yet each AMP maintains its shared-nothing characteristics, the same as in the first release.

•• The YNet was a proprietary, physical interconnect, supported with bits of code on each AMP. Today, what was the YNet has evolved into the BYNET. Due to recent virtualization techniques, the BYNET is free to run on whatever off-the-shelf interconnect hardware offers the most benefit. Although there have been many enhancements over time, all of the original reliability, coordination, and performance capabilities designed into the YNet remain intact.

•• The master index, cylinder index, and data block file system structures were originally put in place to point to the actual physical locations of rows on an AMP's disks. Today the underlying storage is completely managed by TVS, yet the same index structures remain in place to keep the logical associations in order. Even though new features have been added over the years, the essential building blocks of the file system are the same.



The natural evolution toward the virtualization of key database functionality is significant because it broadens the usefulness of the Teradata Database. For much of its history, the Teradata Database has run on purpose-built hardware, where the underlying platform has been optimized to support high throughput, critical SLAs, and solid reliability. While those benefits remain well-suited to enterprise platforms, this virtualization opens the door for the Teradata Database to participate in more portable, less demanding solutions. Public or private cloud architectures, for example, can now enjoy the core Teradata Database capabilities described in this white paper.

This white paper attempts to familiarize you with a few of the features that make up important building blocks of the Teradata Database, so you can see for yourself the elegance and the durability of the architecture. The paper points out recent enhancements that have grown out of this original foundation, building on it rather than replacing it.

These foundational components have such widespread consequences that they simply cannot be tacked on as an afterthought. The database must be born with them.

Endnotes

1 Table duplication is a join choice where the smaller table is duplicated in its entirety to all other AMPs working on that step.

2 The parsing engine is the virtual processing unit that communicates with clients and with AMPs. It performs session management, parsing, and optimization, and enforces workload management rules. There may be one or more parsing engines per node, and each can support 120 sessions at a time.

3 A spool file is an intermediate answer set that temporarily holds results from one query step that feed into a subsequent step. Users have limits on how much spool space their queries may use, and a query that exceeds its spool limit will be aborted.

4 A row-ID is the unique identifier of a row within the logical file system structure. It includes the hash-ID and, in addition, detail to help locate the row, such as the partition number and the hash uniqueness value. Row-ID and hash-ID are often used synonymously.

5 FSG refers to the file system sub-system and its manipulations of segments of data. FSG stands for "File Segment." FSG cache is the primary and original cache on the Teradata Database.

6 Spawned work takes place when a query step requires more than one AWT to get its work done, such as during row redistribution, where one AWT is required to read and redistribute the rows to other AMPs, and a second AWT is required to receive rows being sent to it from other AMPs.

7 A work type is given to each arriving work message in order to reflect the importance of the work to be performed. "New" messages and "spawned" messages use different work types, for example.

10000 Innovation Drive, Dayton, OH 45342     Teradata.com

QueryGrid and Unified Data Architecture are trademarks, and Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S.
and worldwide. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications
without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com
for more information.

Copyright © 2016 by Teradata Corporation     All Rights Reserved.     Produced in U.S.A.

02.16 EB3053

