A step can be a small piece or a large chunk of work. It can be simple, such as "scan a table and return the result," or complex, such as "scan two tables and apply predicates to each, join the two tables, redistribute the join result on specified columns, sort the redistributed rows, and place the redistributed rows in an intermediate table."

Within each of these potentially large chunks of work that we call steps, multiple relational operations can be processed in parallel by pipelining. While a table scan is taking place, rows that are selected from the scan can be pipelined into a join operation that is underway at the same time.

Figure 3 is a representation of how all three of these types of parallelism might appear in a query's execution.

[Figure 3: Within-a-step parallelism (multiple operations are pipelined) and query execution parallelism (four AMPs perform each step on their own data blocks at the same time), shown for query steps such as 1. Scan Product, 2. Scan Inventory, 3. Join Product and Inventory, 4. Redistribute joined rows.]
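To make these two ideas concrete, here is a minimal Python sketch (illustrative only, not Teradata code; the table contents and function names are invented): a scan generator feeds selected rows straight into a join while the scan is still running, and four workers stand in for four AMPs executing the same step on their own data blocks.

```python
# Minimal sketch: pipelining within a step, plus the same step running on every AMP at once.
from concurrent.futures import ThreadPoolExecutor

def scan(table, predicate):
    """Stream qualifying rows one at a time instead of materializing the whole scan."""
    for row in table:
        if predicate(row):
            yield row              # each selected row is available to the next operation immediately

def join(left_rows, right_table, key):
    """Consume scanned rows as they arrive and probe the other table."""
    lookup = {r[key]: r for r in right_table}
    for row in left_rows:          # rows flow in while the scan is still producing
        match = lookup.get(row[key])
        if match:
            yield {**row, **match}

def run_step_on_amp(amp_slice):
    """One AMP's work for the step: scan Product, join to Inventory, build a local spool."""
    product, inventory = amp_slice
    pipeline = join(scan(product, lambda r: r["qty"] > 0), inventory, "item_id")
    return list(pipeline)

# Four AMPs execute the same step on their own slices of the data at the same time.
amp_slices = [([{"item_id": i, "qty": i}], [{"item_id": i, "store": "A"}]) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    spools = list(pool.map(run_step_on_amp, amp_slices))
```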
Join Planning
When the optimizer begins the task of building a query plan, one of its goals is to maximize the throughput of this particular piece of work. Think of a query that has to access six tables to build its answer set. One of the jobs of the optimizer is to determine which tables to access and join first, and which tables to access and join later in the plan. It also has to consider what type of join to use to bring the rows of two tables together, and the method of accessing each table (indexed access or table scan, for example).

Teradata's optimizer seeks out tables within the query that have logical relationships between them, such as Items and Orders in Step 2.2 of Figure 3. It also groups tables that can be accessed and joined independently from the other subsets of tables. Those are often candidates to execute within parallel steps. Figure 5 illustrates the differences when optimizing a six-table join between a plan that is restricted to linear joins and one that has the option of performing some of the joins in parallel.
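As a rough illustration of the difference, the sketch below (a hypothetical scheduling helper, not the Teradata optimizer; table names other than Items and Orders are placeholders) groups joins whose inputs do not depend on one another into the same parallel step, whereas a plan restricted to linear joins would run all five joins one after another.

```python
# Illustrative sketch: group independent joins into parallel steps.
def schedule_parallel_steps(joins):
    """joins: list of (left_input, right_input, output) tuples.
    Returns a list of steps; each step holds joins that can run concurrently."""
    steps, pending, produced = [], list(joins), set()
    while pending:
        ready = [j for j in pending
                 if all(src in produced or not src.startswith("spool") for src in j[:2])]
        if not ready:
            break                          # defensive guard for malformed input
        steps.append(ready)
        for j in ready:
            produced.add(j[2])
            pending.remove(j)
    return steps

six_table_join = [
    ("Items", "Orders", "spool1"),         # independent of the next pair...
    ("Product", "Inventory", "spool2"),    # ...so both land in step 1 and run in parallel
    ("spool1", "spool2", "spool3"),
    ("spool3", "Stores", "spool4"),
    ("spool4", "Calendar", "answer"),
]
print(schedule_parallel_steps(six_table_join))   # four steps instead of five sequential joins
```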
Teradata Database was designed as a shared-nothing architecture, the hardware as well as the software. AMPs and parsing engines2 (PE) operate in isolation from one another. Messages are the means of communication among these many moving parts, and the interconnect is the glue that holds the pieces together and speeds along all of the parallel activities that are running on the Teradata Database (see Figure 6).

The capabilities the BYNET provides include:

•• Message Delivery: Sends, optimizes, and guarantees message arrival
•• Multi-AMP Coordination: Oversees step completion and error handling when multiple AMPs are working on the same query step
•• Final Answer Set Ordering: Efficient, dynamic merge of the final answer set across parallel units, bypassing expensive sort/merge routines
•• Isolation: Insulates the database from configuration details, recognizes and adjusts to hardware failures
•• Resource Conservation: Streamlines message traffic
– Identifies and minimizes the number of AMPs involved in receiving a given message by setting up dynamic BYNET groups

[Figure: The parsing engine begins Step 1 across three AMPs; as each AMP finishes its "Step 1 Work" it signals "Step 1 Done" on a BYNET software semaphore, and a single message goes to the dispatcher for the next step.]
When the last AMP in the dynamic BYNET group finishes its work on a step, a message goes to the dispatcher to send out the next optimized step for that query. This "last-done" message is the only actual message sent back to the parsing engine concerning this step, whether the dynamic BYNET group is composed of three or 3,000 AMPs.

Some queries fail to complete. If the cause of a failure is limited to a single AMP, it is important that the other participating AMPs hear about the failure immediately and stop working on that query. If a tight coordination does not exist among AMPs in the same BYNET group, then the problem-free AMPs will continue to work on the doomed query step, eating up resources in unproductive ways. Semaphores provide the means of alerting all AMPs in the group if one AMP should, for example, run out of spool,3 or otherwise terminate the step, using signaling across the BYNET in lieu of messaging.

Without the BYNET's ability to combine and consolidate information from across all units of parallelism, each AMP would have to independently talk to each other AMP in the system about each query step that is underway. As the configuration grows, such a distributed approach to coordinating query work would quickly become a bottleneck.
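A minimal sketch of this coordination idea, assuming a simple shared counter in place of the actual BYNET semaphore machinery (class and parameter names are invented):

```python
# Only the last AMP to finish triggers the "last done" message to the dispatcher,
# whether the dynamic group holds 3 AMPs or 3,000; a failure flag lets the others stop early.
import threading

class StepSemaphore:
    def __init__(self, amps_in_group, dispatcher):
        self.remaining = amps_in_group
        self.aborted = False
        self.lock = threading.Lock()
        self.dispatcher = dispatcher

    def signal_done(self, amp_id):
        with self.lock:
            self.remaining -= 1
            if self.remaining == 0 and not self.aborted:
                self.dispatcher(f"step complete, last done by AMP {amp_id}")  # the only message sent

    def signal_failure(self, amp_id):
        with self.lock:
            self.aborted = True   # other AMPs check this flag and stop work on the doomed step

sem = StepSemaphore(3, dispatcher=print)
for amp in range(3):
    threading.Thread(target=sem.signal_done, args=(amp,)).start()
```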
Final Answer Set Sort/Merge
Never needing to materialize a query's final answer set inside the database has long been a differentiator of the Teradata Database. The final sort/merge of a query takes place within the BYNET as the answer set rows are being funneled up to the client.

Three levels participate in returning a sorted answer set (see Figure 8):

•• Level 1 (AMP-level): Each AMP performs a local sort in parallel with other AMPs and creates a spool file for its part of the answer set.
•• Level 2 (Node-level): Each node merges and sorts one buffer's worth of data from all its contributing AMPs.
•• Level 3 (PE-level): The PE receives and sorts node-level buffers, building one buffer's worth of sorted data to return to the client.

The rows that sort first are sent up through the three tiers first, while the part of the answer set that sorts later remains in the spool files at the AMP level until the earlier rows have been returned to the client.

[Figure 8: AMP-level: rows sorted and spooled on each AMP (AMP 1: 4, 8; AMP 2: 1, 7; AMP 3: 3, 11; AMP 4: 6, 10; AMP 5: 2, 12; AMP 6: 5, 9). Node-level: each node builds one buffer's worth of sorted data off the top of its AMPs' sorted spool files (Node 1: 1, 3, 4; Node 2: 2, 5, 6). PE-level: the first buffer returned to the client.]
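The sketch below walks through the three levels using the row values shown in Figure 8; Python's heapq.merge stands in for the BYNET merge, and the buffer size is invented for illustration.

```python
# Level 1: rows already sorted and spooled on each AMP; Level 2: node-level merge;
# Level 3: PE-level merge, returned to the client one buffer at a time.
import heapq

amp_spools = {
    "AMP1": [4, 8], "AMP2": [1, 7], "AMP3": [3, 11],    # node 1
    "AMP4": [6, 10], "AMP5": [2, 12], "AMP6": [5, 9],   # node 2
}
nodes = {"NODE1": ["AMP1", "AMP2", "AMP3"], "NODE2": ["AMP4", "AMP5", "AMP6"]}

def node_merge(node):
    """Level 2: lazily merge the node's AMP spools, pulling only what is needed."""
    return heapq.merge(*(iter(amp_spools[amp]) for amp in nodes[node]))

final_stream = heapq.merge(*(node_merge(n) for n in nodes))   # Level 3: PE-level merge

BUFFER_ROWS = 4
buffer = []
for row in final_stream:          # rows leave the AMP spools only as the client consumes them
    buffer.append(row)
    if len(buffer) == BUFFER_ROWS:
        print("return buffer to client:", buffer)
        buffer = []
if buffer:
    print("return buffer to client:", buffer)
```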
During this process there is minimal physical I/O performed. For a given row, only a single write—into the AMP's spool table—is ever performed. And only a single read of that same row is ever needed—to retrieve the data for the user in the correct order.

A big benefit of this approach is that the final answer set never has to be brought together in one location for a possibly-large final sort. Rather, the answer set is finalized and returned to the client a buffer at a time. A potential "big sort" penalty has been eliminated; or actually, it never existed.

Evolution
The original architecture of the YNet interconnect easily evolved into the current BYNET, and as it did, it underwent several transformations. The BYNET brought greater availability and reliability, supporting multiple virtual broadcast trees and multiple paths.

Recent enhancements to BYNET functionality include:

•• Converting YNet hardware functionality into software capabilities: A virtualized BYNET allows the Teradata Database to embrace whatever is the optimal general-purpose interconnect functionality at any point in time.
•• Moving BYNET hardware into software allows for transparent and consistent inter-AMP communications, making it possible for AMPs on the same node to contact each other without going over the interconnect, thereby reducing delays and traffic congestion.
•• Support for multiple Teradata systems sharing the same network infrastructure, with intra-system communications isolated over private partitions.
The hashing algorithm produces two outputs for each row: a hash bucket, which determines the AMP that owns the row, and a hash-ID, which becomes part of the row's unique "row-ID."4

In addition to being a distribution technique, this hash approach to data placement serves as an indexing strategy, which reduces the amount of DBA work normally required to set up direct access to a row. To retrieve a row, the primary index data value is passed to the hashing algorithm, which generates the two hash outputs: 1.) the hash bucket, which points to the AMP; and 2.) the hash-ID, which helps to locate the row within the file system structure on that AMP. There is no space or processing overhead involved in either building a primary index or accessing a row through its primary index value, as no special index structure needs to be built to support the primary index.

Figure 9. A row's primary index hash bucket points to the AMP that owns it.
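A hypothetical sketch of the two hash outputs described above; Teradata's real hash function, bucket count, and row-ID layout are not shown here, and the values below are invented.

```python
# Inserting and retrieving a row use the same computation, so no separate index structure
# is needed to support primary index access.
import hashlib

NUM_HASH_BUCKETS = 1024 * 1024      # illustrative value only
NUM_AMPS = 4                        # illustrative configuration

def hash_primary_index(pi_value):
    """Return (hash bucket, hash-ID) for a primary index value."""
    digest = hashlib.md5(repr(pi_value).encode()).digest()
    hash_bucket = int.from_bytes(digest[:4], "big") % NUM_HASH_BUCKETS
    hash_id = int.from_bytes(digest[4:8], "big")     # becomes part of the row-ID
    return hash_bucket, hash_id

def amp_for_bucket(hash_bucket):
    """Stand-in for the map that assigns hash buckets to AMPs."""
    return hash_bucket % NUM_AMPS

bucket, row_hash = hash_primary_index(("customer", 10018))
print(f"row lands on AMP {amp_for_bucket(bucket)}, located within that AMP by hash-ID {row_hash}")
```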
This section will focus on the original architecture of the sub-system that handles space management in the database, a sub-system called the "file system." The file system is responsible for the logical organization and management of the rows, along with their reliable storage and retrieval.

At first glance, managing space seems like a trivial exercise, something a robot could easily be programmed to do. But the file system in Teradata was architected to be extremely adaptable, simple on the outside but surprisingly inventive on the inside. It was designed from Day One to be fluid and open to change. The file system's built-in flexibility is achieved by means of:

•• Logical addressing, which allows blocks of data to be dynamically shifted to different physical locations when needed, with minimal impact to active work.
•• The ability for data blocks to expand and contract on demand, as a table matures.
•• Reliance on an array of unobtrusive background tasks that do continuous space adjustments and clean-up.
Figure 11. A new row is inserted into an existing data block (shown for the case where there is space on the cylinder and the case where there is no free space on the cylinder).

Figure 12. Table scan using logical addressing after a new row is inserted (before and after a row with row hash ID 18 is added to a data block holding row hashes 11, 13, and 19).
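The following sketch, with invented data structures, shows why logical addressing keeps a table scan stable when a block grows or is rewritten elsewhere: the lookup is by row hash, not by physical address. The row hashes match those shown in Figures 11 and 12.

```python
# Logical addressing sketch: a small index maps row-hash ranges to data blocks,
# so a block can expand or move without changing how rows are found.
class DataBlock:
    def __init__(self, rows, physical_location):
        self.rows = sorted(rows)                 # row hashes held by this block
        self.physical_location = physical_location

class CylinderIndex:
    def __init__(self):
        self.blocks = []                         # list of (first_row_hash, DataBlock)

    def add_block(self, block):
        self.blocks.append((block.rows[0], block))
        self.blocks.sort(key=lambda entry: entry[0])

    def block_for(self, row_hash):
        candidate = None
        for first_hash, block in self.blocks:    # logical lookup: no physical address involved
            if first_hash <= row_hash:
                candidate = block
        return candidate

    def insert(self, row_hash):
        block = self.block_for(row_hash)
        block.rows.append(row_hash)
        block.rows.sort()
        # The grown block is rewritten at a (possibly new) physical location;
        # the logical index above still finds it by row hash.
        block.physical_location = "cylinder-1/slot-9"

index = CylinderIndex()
index.add_block(DataBlock([11, 13, 19], "cylinder-1/slot-2"))
index.insert(18)                                 # the new row with row hash 18 from Figure 12
print(index.block_for(18).rows)                  # a scan still reads [11, 13, 18, 19] in order
```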
[Figure: The physical location of data becomes separated from its logical representation over time: originally JBOD, then RAID, finally Teradata Virtual Storage.]
Some of the features in addition to TVS, which were built on and rely upon the original file system architecture, include:

•• Table partitioning that supports row, column, or a combination of both approaches and dynamic partition elimination, with all partitioned rows stored and manipulated using the same original internal file system conventions, including the row-ID layout.
•• Block-level compression, in which all rows within a data block are compressed as a unit.
•• Temperature-based block-level compression (compress on cold), an extension of block-level compression where only data blocks residing on infrequently accessed cylinders undergo compression.
•• No Primary Index tables that offer a fast approach for dumping data randomly across the AMPs that bypasses hash partitioning, well-suited for temporary staging tables.

Work Flow Self-Regulation
A shared-nothing parallel database has a special challenge when it comes to knowing how much new work it can accept, and how to identify congestion that is starting to build up inside one or more of the parallel units. With the optimizer attempting to apply multiple dimensions of parallelism to each query that it sees, it is easy to reach very high resource utilization within a Teradata system, even with just a handful of active queries.

Designed for stress, the Teradata Database is able to function with large numbers of users, a very diverse mix of work, and a fully-loaded system. Being able to keep on functioning full throttle under conditions of extreme stress relies on internal techniques that were built inside the database to automatically and transparently manage the flow of work, while the system stays up and productive.

Even though the data placement conventions in use with Teradata Database lend themselves to even placement of the data across AMPs, the data is not always accessed by queries in a perfectly even way. During the execution of a multi-step query, there will be occasions when some AMPs require more resources for certain steps than do other AMPs.
Does the database keep pushing work down to the AMPs even when some AMPs can't handle more? At what point does the database put on the brakes and stop sending new work messages to the AMPs? Is a central coordinator task necessary to poll the work levels of each AMP, then mandate when more work can be sent down or when it's time to put on the brakes? A central coordinator task is something to avoid in a shared-nothing parallel database because it becomes a non-parallel operation which can become a bottleneck as activity, database size, or number of units of parallelism increases.

AMP-Level Control
Teradata Database manages the flow of work that enters the system in a highly-decentralized manner, in keeping with its shared-nothing architecture. There is no centralized coordinator. There is no message-passing between AMPs to determine if it's time to hold back new requests. Rather, each AMP evaluates its own ability to take on more work, and temporarily pushes back when it experiences a heavier load than it can efficiently process. And when an AMP does have to push back, it does that for the briefest moments of time, often measured in milliseconds.

This bottom-up control over the flow of work was fundamental to the original architecture of the database as designed. All-AMP step messages come down to the AMPs, and each AMP will decide whether to begin working on it, put it on hold, or ignore it. This AMP-level mindfulness is the cornerstone of the database's ability to accept impromptu swings of very high and very low demand, and gracefully and unobtrusively manage whatever comes its way.

Individual AMPs have two techniques at their disposal when they experience stress:

1. Queuing up arriving messages
2. Turning away new messages

Both techniques provide a short-lived breather for the AMP, allowing it to focus on finishing off the tasks at hand before taking on new work.

AMP Worker Tasks
AWTs are the tasks inside of each AMP that get the database work done. This database work may be initiated by internal database software routines, such as deadlock detection or other background tasks. Or the work may originate from a user-submitted query. These pre-allocated AWTs are assigned to each AMP at startup and, like taxi cabs queued up for fares at the airport, they wait for work to arrive, do the work, and come back for more work.

Because of their stateless condition, AWTs respond quickly to a variety of database execution needs. There is a fixed number of AWTs on each AMP. For a task to start running, it must acquire an available AWT. Having an upper limit on the number of AWTs per AMP keeps the number of activities performing database work within each AMP at a reasonable level. AWTs play the role of both expeditor and governor.

As part of the optimization process, a query is broken into one or many AMP execution steps. An AMP step may be simple, such as reading one row using a unique primary index or applying a table-level lock. Or an AMP step may be a very large block of work, such as scanning a table, applying selection criteria on the rows read, redistributing the rows that are selected, and sorting the redistributed rows.

The Message Queue
When all AMP worker tasks on an AMP are busy servicing other query steps, arriving work messages are placed in a message queue that resides in the AMP's memory. This is a holding area until an AWT frees up and can service the message.

This queue is sequenced first by message work type, which is a category indicating the importance of the work message. Within work type, the queue is sequenced by the priority of the request the message is coming from. Messages that start new work are always placed at the end of the message queue, adhering to one of the underlying philosophies of Teradata Database: it's more important to complete work that is already active than it is to start new work (see Figure 14).
Figure 14. The work message queue is ordered by work type (descending).
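A simplified sketch of the two mechanisms just described, using invented work-type and priority values: a fixed pool of AWTs per AMP, and a message queue ordered by work type (descending) and then priority when no AWT is free.

```python
# Fixed AWT pool plus a work message queue; new-work messages carry a lower work type
# and therefore land behind work that is already active.
import heapq

AWT_LIMIT = 80                       # the long-standing default number of AWTs per AMP

class AMP:
    def __init__(self):
        self.free_awts = AWT_LIMIT
        self.queue = []              # heap of (-work_type, -priority, arrival_order, message)
        self.arrivals = 0

    def receive(self, message, work_type, priority):
        if self.free_awts > 0:
            self.free_awts -= 1      # an AWT picks the message up immediately
            return "running"
        self.arrivals += 1
        heapq.heappush(self.queue, (-work_type, -priority, self.arrivals, message))
        return "queued"              # held in AMP memory until an AWT frees up

    def awt_finished(self):
        if self.queue:
            _, _, _, message = heapq.heappop(self.queue)
            return f"AWT reused for: {message}"
        self.free_awts += 1
        return "AWT returned to the pool"

amp = AMP()
amp.free_awts = 0                                   # pretend the AMP is already saturated
amp.receive("spawned work for step 7", work_type=8, priority=1)
amp.receive("new work for step 1", work_type=1, priority=3)
print(amp.awt_finished())                           # the higher work type message runs first
```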
It is normal during busy processing times to have some AMPs run out of AWTs and begin to queue arriving messages. This condition is usually temporary, and in most cases is transparent to the user submitting queries.

Messages representing a new query step are broadcast to all participating AMPs by the dispatcher. In such a case, some AMPs may provide an AWT immediately, while other AMPs may have to queue the message. Some AMPs may dequeue their message and start working on the step sooner than others. This is typical behavior on a busy system where each AMP is managing its own flow of work.

Once a message has either acquired an AWT or been accepted onto the message queue across each AMP in the dynamic BYNET group, then it is assumed that each AMP will eventually process it, even if some AMPs take longer than others. The sync point for the parallel processing of each step is at step completion, when each AMP signals across the completion semaphore that it has completed its part. The BYNET channels set up for this purpose are discussed more fully in the BYNET section of this paper.

Turning Away New Messages
The message queue is the first level of push-back on an AMP. However, the message queue cannot endlessly expand in cases where more and more messages are landing on the queue and fewer and fewer are being released to run. Teradata Database's flow control mechanisms go a step further than queuing up work messages.

In order to conserve memory, the length of the message queue is controlled. When the number of messages queued reaches a threshold, the AMP will turn to a second mechanism of relief—sending newly-arriving messages back to their source.

Messages have senders and receivers. In the case of a new step being sent to the AMPs, the dispatcher is the sender and some number of AMPs are the receivers. However, in cases such as spawned work,6 one AMP is the sender and another AMP or AMPs are the receivers.

Flow Control Gates
Each AMP has flow control gates that monitor and manage messages arriving from senders. There are separate flow control gates for each different message work type.7 New work messages will have their own flow control gates, as will spawned work messages. The flow control gates keep a count of the active AWTs of that work type as well as how many messages are queued up waiting for an AWT.
[Figure 15: An AMP's flow control gates with 3 spawned messages and 20 new messages waiting for AWTs; newly arriving messages are rejected now and retried later by the sender.]
Once the queue of messages of a certain work type grows to a specified length, new messages of that type are no longer accepted and that AMP is said to be in a state of flow control, as shown in Figure 15. The flow control gate will temporarily close, pulling in the welcome mat, and arriving messages will be returned to the sender. The sender, often the dispatcher module within the PE, continues to retry the message until that message can be received on that AMP's message queue.
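The following sketch illustrates the gate behavior described above; the threshold, retry interval, and class names are invented for illustration and are not Teradata's actual parameters.

```python
# A flow control gate for one message work type: accept until the queue threshold is
# reached, then reject so the sender retries; the gate reopens as the backlog drains.
import collections
import time

class FlowControlGate:
    QUEUE_THRESHOLD = 20                      # illustrative limit per message work type

    def __init__(self):
        self.queue = collections.deque()
        self.in_flow_control = False

    def offer(self, message):
        if len(self.queue) >= self.QUEUE_THRESHOLD:
            self.in_flow_control = True       # gate closes: pull in the welcome mat
            return "rejected"                 # message goes back to the sender
        self.queue.append(message)
        return "accepted"

    def drain_one(self):
        if self.queue:
            self.queue.popleft()
        if len(self.queue) < self.QUEUE_THRESHOLD:
            self.in_flow_control = False      # gate reopens once the backlog clears

def dispatcher_send(gate, message, retry_delay=0.001):
    while gate.offer(message) == "rejected":  # the sender keeps retrying the same message
        time.sleep(retry_delay)
        gate.drain_one()                      # stand-in for an AWT working off the queue

gate = FlowControlGate()
for n in range(25):
    dispatcher_send(gate, f"new work message {n}")
print(len(gate.queue), "messages waiting for an AWT")
```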
Getting Back to Normal
Because Teradata Database is a message-passing database, there are many different types of message queues within the system architecture. All of these different queues (sometimes referred to as "mailboxes") have limits set on the number of messages they can accommodate at any point in time. Any of them, including the work message queue already discussed, can go into a state of flow control if the system is temporarily overwhelmed by work. Every AMP handles these decisions independently of every other AMP, as illustrated in Figure 16.

Because the acceptance and rejection of work messages happens at the lowest level, in the AMP, there are no layers to go through when the AMP is able to get back to normal message delivery and processing. The impact of turning on and turning off the flow of messages is kept local—only the AMP hit by an over-abundance of messages at that point in time throttles back temporarily.

Riding the Wave of Full Usage
Teradata was designed as a throughput engine, able to exploit parallelism in order to maximize resource usage of each request when only a few queries are active, while at the same time able to continue churning out answer sets in high-demand situations. To protect overall system health under extreme usage conditions, highly-decentralized internal controls were put into the foundation, as discussed in this section.

The original architecture related to flow control and AMP worker tasks has needed very little improvement or even tweaking over the years. 80 AWTs per AMP is still the default setting for new Teradata systems. Message work types, the work message queue, and retry logic all work the same as they always did.

There have been a few extensions in regard to AMP worker tasks that have emerged over time, including:

•• Setting up reserve pools of AWTs exclusively for use by tactical queries, protecting high-priority work from being impacted when there is a shortage of AWTs (see the sketch that follows this list).
•• Automatic reserve pools of AWTs just for load utilities that become available when the number of AWTs per AMP is increased to a very high level, intended to reduce resource contention between queries and load jobs for enterprise platforms with especially high concurrency.
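A hypothetical sketch of the reserve-pool idea from the first bullet; the pool sizes and class names are illustrative only, not Teradata's actual configuration.

```python
# A slice of the AMP's AWTs is set aside for tactical (high-priority) work so it never
# waits behind a flood of ordinary queries when the general pool is exhausted.
class AWTPool:
    def __init__(self, total=80, tactical_reserve=6):   # sizes are illustrative only
        self.general_free = total - tactical_reserve
        self.tactical_free = tactical_reserve

    def acquire(self, tactical=False):
        if tactical and self.tactical_free > 0:
            self.tactical_free -= 1
            return "tactical AWT"
        if self.general_free > 0:
            self.general_free -= 1
            return "general AWT"
        return "no free AWT: the message must wait on the queue"

pool = AWTPool()
pool.general_free = 0                  # ordinary work has exhausted the general pool
print(pool.acquire(tactical=True))     # a tactical query still gets an AWT immediately
print(pool.acquire(tactical=False))    # ordinary work has to wait
```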
QueryGrid and Unified Data Architecture are trademarks, and Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S.
and worldwide. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications
without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com
for more information.
02.16 EB3053