Introduction To MIMD Architectures
A MIMD (Multiple Instruction stream, Multiple Data stream) computer system has a number of
independent processors that operate upon separate data concurrently. Hence each processor has
its own program memory or has access to program memory, and similarly each processor has its
own data memory or access to data memory. Clearly there needs to be a mechanism to load the
program and data memories and a mechanism for passing information between processors as
they work on some problem. MIMD has clearly emerged as the architecture of choice for general-
purpose multiprocessors. MIMD machines offer flexibility: with the correct hardware and
software support, they can function as single-user machines focusing on high performance for
one application, as multiprogrammed machines running many tasks simultaneously, or as some
combination of these. There are two types of MIMD architectures: distributed memory
MIMD architecture and shared memory MIMD architecture.
Unlike shared memory architectures, a multicomputer gives a processor no direct access to a
remote memory block. Rather than allocating the processor to a different process or letting it
stall while waiting on remote data, distributed memory MIMD architectures pass messages
explicitly. Thus the main point of this architectural design is to develop a message passing
parallel computer system organized so that the processor time spent in communication within
the network is reduced to a minimum.
II. Components of the MultiComputer and Their Tasks
Within a multicomputer, there are a large number of nodes and a communication network linking
these nodes together. Inside each node there are three important elements with tasks related to
message passing: the computation processor, the communication processor, and the router.
The router's main task is to transmit a message from one node to the next and to
assist the communication processor in organizing the passage of the message
through the network of nodes.
The development of each of these three elements took place progressively, as
hardware became more complex and capable. First came the Generation I
multicomputers, in which messages were passed through direct links between nodes,
but there were no communication processors or routers. Then Generation II
multicomputers came along with independent switch units that were separate from
the processor, and finally in the Generation III multicomputers all three
components exist.
Below is the picture for the three categories of interconnection topologies (one-dimensional,
two-dimensional, and three-dimensional) and the designs that fall under each one. Along with
this, there is a table comparing each topology based on the tradeoffs listed above.
B. Switching
This mechanism of message storing and movement can be thought of as working like the
mail service, and is probably just as slow. A message is split into "packets"
that can be stored in a buffer contained within each intermediate node (the
nodes between the source node sending the message and the intended
receiver of the message). A "packet" usually contains a header and some data.
The header works like a tag telling the switching unit which node to
forward the data to in order to reach the destination node.
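The split into header-tagged packets can be sketched as follows; the field names, packet size, and node numbering are illustrative assumptions, not part of any real protocol.

```python
# Illustrative sketch (assumed field names): a message is split into
# fixed-size packets, each carrying a header that names the destination
# node so that intermediate switching units know where to forward it.

from dataclasses import dataclass

@dataclass
class Packet:
    destination: int   # header: tells each switch where to forward
    seq: int           # header: position of this packet in the message
    data: bytes        # payload portion of the message

def packetize(message: bytes, destination: int, payload_size: int = 4):
    """Split a message into packets of at most payload_size bytes each."""
    return [
        Packet(destination, seq, message[i:i + payload_size])
        for seq, i in enumerate(range(0, len(message), payload_size))
    ]

packets = packetize(b"HELLO WORLD", destination=7)
# The receiver reorders by sequence number and strips the headers.
reassembled = b"".join(p.data for p in sorted(packets, key=lambda p: p.seq))
```

The sequence number in the header is what lets the destination reassemble the message even if packets arrive out of order.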
The network latency of this store-and-forward scheme is

NL = (P/B)*D

where P is the packet length, B is the channel bandwidth, and D is the number of
nodes the message traverses.
1. Circuit Switching
The circuit switching method is like a telephone system, in which the message path is
established from source node to destination node before the message travels along it. The three
phases of this switching scheme are the circuit establishment phase, the transmission phase,
and the termination phase. The first is exactly what it sounds like: a circuit is set up from
source node to destination node, node by node, by a special probe. Once the circuit exists,
the entire message can be sent at once. One rule is followed whenever more than one message
is being transmitted at a given time: a collision may result between two messages if there is
an intermediate node that both messages need to pass through to reach their respective
destinations, but circuit switching does not allow more than one message to use a channel
at a time, so the message that arrived last must wait.
NL = (P/B)*D + M/B

where P is the probe length, B is the channel bandwidth, D is the number of nodes traversed,
and M is the message length.
The transmission phase transmits the message along the established circuit, and during the
termination phase the circuit is torn down once the message has reached the destination.
The major advantage of circuit switching is that the elimination of intermediate buffering
reduces memory consumption.
3. Virtual Cut-Through
In this scheme, the message is initially divided into small subunits, a set of flow control
digits (flits). If the message is traveling and reaches a busy channel, the flits, or portions
of the message, are buffered at the intermediate node where the channel was unavailable.
NL = (HF/B)*D + M/B

where B is the channel bandwidth, D is the number of nodes traversed, M is the message length,
and HF is the length of the header flit.
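The latency formulas for the three schemes can be compared numerically; the parameter values plugged in below are invented purely for illustration.

```python
# Network latency of the three switching schemes, per the formulas above.
# P = packet (or probe) length, B = channel bandwidth, D = nodes traversed,
# M = message length, HF = header flit length. The concrete numbers at the
# bottom are arbitrary illustrative values.

def store_and_forward(P, B, D):
    return (P / B) * D               # every node stores the whole packet

def circuit_switching(P, B, D, M):
    return (P / B) * D + M / B       # probe sets up the path, then one burst

def virtual_cut_through(HF, B, D, M):
    return (HF / B) * D + M / B      # only the header flit pays the per-node cost

B, D, M = 100.0, 10, 10_000.0        # bandwidth, hops, message length
sf  = store_and_forward(P=M, B=B, D=D)        # whole message forwarded hop by hop
cs  = circuit_switching(P=32.0, B=B, D=D, M=M)
vct = virtual_cut_through(HF=8.0, B=B, D=D, M=M)
```

With these numbers, only the short probe or header flit pays the per-node cost in the latter two schemes, which is why both beat store-and-forward by an order of magnitude for long messages.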
4. Wormhole Routing
This method of switching is similar to the virtual cut-through mechanism, the difference
being that each intermediate node's buffer is sized to hold exactly one flit. Also, packet
replication is possible, meaning that copies of the flits can be sent out through several
output channels (multi-directional). Lastly, wormhole routing introduces the concept of
virtual channels, whereby multiple messages can use the same physical channel to transmit
their respective messages.
These virtual channels have several advantages: they increase network throughput by reducing
the time physical channels sit idle, they can guarantee communication bandwidth to system
related functions such as debugging, and they can be used to avoid deadlock, a concept
explained in the next paragraph.
In a deadlock, a subset of the messages are all blocked at once, each awaiting a free buffer
that can only be released by another blocked message. This problem can be resolved by
rerouting the message at the node where it is blocked (as in adaptive wormhole routing
solutions), by moving the flits all the way back to the source node and taking another path
toward the destination, or by the last solution, virtual channels.
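One way to see the deadlock condition is as a cycle in the "waits-for" relation between blocked messages. The sketch below is a generic cycle check, not any particular router's algorithm; the message names are invented.

```python
# Sketch: deadlock among blocked messages is a cycle in the "waits-for"
# relation (each message waits on the buffer held by the next). A cycle
# means none of them can ever proceed. Generic illustration only.

def has_deadlock(waits_for):
    """waits_for: dict mapping a blocked message to the message holding
    the buffer it needs. Returns True if some subset waits in a cycle."""
    for start in waits_for:
        seen = set()
        node = start
        while node in waits_for:
            if node in seen:
                return True          # came back around: circular wait
            seen.add(node)
            node = waits_for[node]
    return False

# Messages A..C each wait on a buffer held by the next: a cycle, so deadlock.
assert has_deadlock({"A": "B", "B": "C", "C": "A"})
# A chain ending at a free buffer eventually drains: no deadlock.
assert not has_deadlock({"A": "B", "B": "C"})
```

Breaking any one edge of the cycle (by rerouting, backtracking, or moving one message onto a separate virtual channel) makes the check fail, which is exactly why the three remedies above work.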
C. Routing
Routing is the determination of a path, whether it is the shortest, the longest, or somewhere
in between, from the source to the destination node. Two broad categories of routing exist:
deterministic and adaptive. In deterministic routing, the path is determined solely by the
source and destination nodes; in adaptive routing, the intermediate nodes can "see" a blocked
channel on the way to the destination and reroute onto a new path. This works much the way
people alter their travel routes when hearing about an accident on I-495.
Deterministic routing has two sub-categories. One of them is source routing, a method in
which the source node determines the entire routing path to the destination. In the other
method, distributed routing, each intermediate node takes part in choosing the next step of
the path toward the destination.
Street-sign routing requires the message header to carry only the turn information for the
path: at each node where the message must change direction, the turn has been predetermined
and recorded in the header, while at every other node the message continues in a default
direction. Each intermediate node compares its own address against the header; on a miss,
the default direction is taken.
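A hedged sketch of the street-sign idea on a 2D grid: the header lists only the nodes at which the message must turn, and it goes straight everywhere else. The coordinates, directions, and hop count are invented for illustration.

```python
# Street-sign routing sketch (hypothetical grid coordinates/directions):
# the header carries {node: new_direction} entries; at every other node
# the message keeps going in its current, default direction.

def street_sign_route(start, start_dir, turns, hops):
    """start: (x, y) node; start_dir: (dx, dy) unit step on the grid;
    turns: {node: new_dir}, taken from the message header;
    hops: number of channel traversals to simulate."""
    node, direction = start, start_dir
    trail = [node]
    for _ in range(hops):
        direction = turns.get(node, direction)   # header match => turn here
        node = (node[0] + direction[0], node[1] + direction[1])
        trail.append(node)
    return trail

# Head east from (0, 0); the header says: at (3, 0) turn north.
trail = street_sign_route((0, 0), (1, 0), {(3, 0): (0, 1)}, hops=5)
```

Because only the turning points are stored, the header stays short no matter how long the straight runs between turns are.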
Dimension-ordered routing is best explained through the diagram provided in Figure 17.14.
It is a routing method in which a message moves along one "dimension" until it reaches the
required coordinate, and then moves through the next dimension. The order in which the
dimensions are traversed is fixed in advance.
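Dimension-ordered routing on a 2D mesh can be sketched like this; it is a minimal illustration (real routers do the coordinate comparison in hardware, one hop at a time).

```python
# Dimension-ordered (X-then-Y) routing on a 2D mesh: correct the X
# coordinate completely before touching Y. Minimal illustrative sketch.

def dimension_ordered_path(src, dst):
    x, y = src
    path = [src]
    while x != dst[0]:                 # travel along dimension X first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then along dimension Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

path = dimension_ordered_path((0, 0), (2, 3))
```

Fixing the dimension order removes any routing freedom, which is what makes the scheme deterministic.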
Table Lookup Routing
In this type of routing, a table is formulated so that any given node knows where to forward
its message. Implemented in hardware, this style of routing is expensive, because a very large
lookup table requires a very large chip area; it is better suited to software implementations.
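In software the idea reduces to one dictionary lookup per node; the tiny four-node ring topology below is an invented example.

```python
# Table-lookup routing sketch: each node holds a table mapping every
# possible destination to the neighbour the message should be forwarded
# to. The four-node ring topology below is purely illustrative.

tables = {
    0: {1: 1, 2: 1, 3: 3},   # from node 0: reach 1 or 2 via 1; reach 3 via 3
    1: {0: 0, 2: 2, 3: 2},
    2: {0: 3, 1: 1, 3: 3},
    3: {0: 0, 1: 0, 2: 2},
}

def table_route(src, dst):
    node, hops = src, [src]
    while node != dst:
        node = tables[node][dst]     # one table lookup per node
        hops.append(node)
    return hops

route = table_route(0, 2)
```

The per-node cost is constant, but every node must store an entry for every destination, which is exactly the chip-area problem mentioned above.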
There are two general categories of adaptive routing. One is referred to as profitable routing,
in which the only channels taken as the message travels from source node to destination node
are the ones that move it closer to the destination. Profitable routing ensures a minimum path
length and tends to minimize the possibility of a deadlock, discussed earlier. The other
category, misrouting protocols, is more flexible and allows the most cost-efficient means of
getting the message to its destination whenever the network "situation" makes that appropriate,
even if a step moves away from the destination. Adaptive routing types also vary on whether the
algorithm they utilize allows backtracking the message out of a bad situation. If backtracking
is permitted, the header of the message must carry a tag so that it does not keep looping over
the same path; if the header becomes very large because of this, time is lost during the
search. Lastly, any given protocol in adaptive routing can be either partially adaptive,
meaning it can choose from only some of the channels to move through, or completely adaptive,
in which case it can move to any channel it wants.
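Profitable routing can be sketched by filtering, at each node of a 2D mesh, the neighbour set down to the channels that strictly reduce the distance to the destination. The blocked-link set and the first-option tie-break below are illustrative assumptions.

```python
# Profitable adaptive routing sketch on a 2D mesh: at each node, only
# channels that strictly decrease the distance to the destination are
# candidates, so blocked channels can be avoided while the path stays
# minimal. Blocked-link set and tie-breaking rule are invented.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def profitable_route(src, dst, blocked=frozenset()):
    node, path = src, [src]
    while node != dst:
        neighbours = [(node[0] + dx, node[1] + dy)
                      for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        # profitable = strictly closer to the destination, and not blocked
        options = [n for n in neighbours
                   if manhattan(n, dst) < manhattan(node, dst)
                   and (node, n) not in blocked]
        node = options[0]            # any profitable channel will do
        path.append(node)
    return path

# With the eastward link out of (0, 0) blocked, the message detours north
# yet still uses a minimum-length path.
path = profitable_route((0, 0), (2, 2), blocked={((0, 0), (1, 0))})
```

A misrouting protocol would differ precisely in allowing `options` to include non-profitable channels when all profitable ones are blocked.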
The creation of the J-machine at MIT fulfilled two important goals: it supported many parallel
programming models, such as the data flow model and the actor-based object-oriented model,
through the MDP (Message Driven Processor) chip, and it showed that combining many MDP chips
into a network of nodes can produce what is called a fine grain system.
The MDP combines a processor, a memory unit, and a communication unit on a single chip.
The node connecting method chosen is a 3D mesh, using a deterministic wormhole
routing scheme.
The above diagram is the design layout of a typical MDP chip with six bi-directional
bit-parallel channels. There are six network ports corresponding to the six directions
possible in a 3D grid (+X, -X, +Y, -Y, +Z, -Z). Here, messages are divided into flits, the
leading ones corresponding to the addresses X, Y, and Z. If a message comes to a node in
dimension D, the node's address is compared with the address flit. If there is a match, all
the flits behind it "move one over" in the direction of the destination node. (Recall that
wormhole routing is in use.) It is the router unit that makes this determination, and in the
above diagram there are three such router units, one for each of the three dimensions.
If a message is blocked, it remains compressed within the router unit, using a flit buffer.
Once the path clears, the message is uncompressed and can move forward. Higher-priority
messages are transferred more quickly by using separate virtual channels to arrive at
their destinations faster.
The MDP system also contains a network input unit and output unit. The output unit uses an
eight-word buffer each time a flit of the message is moved from one node to the next. The
input unit is a collection buffer for the message portion of arriving flits.
Architecture of Message Driven Processor
The MDP's value in the fine grain system has three aspects: it reduces the overhead time for
receiving a message, it reduces context switch time, and it provides hardware support for
object-oriented programming.
There are three "levels" of execution based on how urgently the message needs to be handled,
from least urgent to most: background priority, priority 1, and priority 2.
Type checking of messages is also possible in the MDP system through a 4-bit tag. Tags
named Cfut and Fut are used for synchronization checks, ensuring that values are passed
correctly.
The network instructions here include SEND, SENDE, SEND2, and SEND2E, and are used to send
messages between processors. The first instruction sends the X, Y, and Z coordinates of the
destination processor, held in R0. The second instruction sends the contents of R1
(register 1) and R2 (register 2). Finally, the third instruction combines a word from memory
with the end of the message to let the J-machine know the message has finished being sent.
Together, the message unit (MU) and instruction unit (IU) determine which message
has the highest priority and allow that one to be passed first by suspending the
instruction currently in execution. Messages can be stored in queue fashion, thus achieving
the well known FIFO property.
The J-machine has several flaws. It has too few register sets (R0-R3) and makes many memory
references, and some messages are locked out for a long time if they end up at the back of
the queue. The external bandwidth of the MDP structure is three times smaller than that found
in many network designs. Along with this, the J-machine's inability to perform floating point
operations is a big downside, which leads us to the medium grain systems.
The medium grain system is based on the Transputer, a complete microcomputer on a single chip,
containing a central processor, a floating point unit, static RAM, an interface for external
memory, and several communication links. Transputers are used to support a synchronous
parallel programming model.
Two generations of Transputers have been developed in the past few decades. The first
generation was mainly used for applications involving signal processing: small systems
with quick communication links. Its main advantage was that it eliminated the use of a
multi-bus system, yet it was not much quicker in terms of computation. The newer generation
incorporates the C104 router chip and a better processing node structure.
The major features of the Transputer T9000 include a 32-bit integer processor, a 64-bit
floating point unit, 16 KB of CPU cache, four serial communication links, a virtual channel
processor, and a programmable memory interface (PMI). An internal crossbar connects the CPU,
the PMI, and the virtual channel processor to the four banks of the main cache.
This Transputer has a small register set (six registers), yet it still maintains fast context
switching. Register A is the top of the evaluation stack. The fourth register, W, points to
the workspace area. The fifth register acts as a PC (program counter), and the sixth register
is responsible for "elaborating" operands.
Processes can be in one of three states: Active, Ready, and Inactive. In the active state, the
program counter of the process is loaded into the register PC and W points to the process's
workspace. In the ready state, the PC is stored in the process's workspace, and the processor
maintains two Ready lists (one for high-priority and one for low-priority processes). A Ready
list, like a normal queue, has a front and a back "pointer" for its head and its tail.
The last state of a Transputer's processes is the inactive state, in which the process awaits
one of the following: the completion of a channel operation, the arrival of a specified time,
or access to an open semaphore.
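The two Ready lists behave as plain FIFO queues, with the high-priority list always served first. A minimal scheduling sketch follows; the process names and priority labels are invented.

```python
# Sketch of a two-level Ready-list scheduler: two FIFO queues, the
# high-priority list always drained before the low-priority one.
# Process names and priority labels are invented for illustration.

from collections import deque

ready = {"high": deque(), "low": deque()}

def schedule(process, priority="low"):
    ready[priority].append(process)          # enqueue at the back (tail)

def next_process():
    for level in ("high", "low"):            # high-priority list served first
        if ready[level]:
            return ready[level].popleft()    # dequeue from the front (head)
    return None                              # nothing ready to run

schedule("editor")
schedule("interrupt_handler", "high")
schedule("compiler")
order = [next_process() for _ in range(3)]
```

Within each list the front/back pointers give plain FIFO order; priority only decides which list is consulted first.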
In order to switch from one process to another, the current process's PC is saved in its
workspace and the process is appended to the appropriate Ready list; the process at the front
of the Ready list is then loaded into the PC and W registers.
Each Transputer has within it an OS-Link, used to chain many Transputers together. Data
between two Transputers is sent byte by byte. The OS-Link has a relatively slow speed of
operation compared to other hardware available on the market, but is still cheaper if cost
is a concern. However, problems may result with the OS-Link if the process mapping strategy
for linking the Transputers becomes complex.
Due to this, a new concept known as virtual links was developed. Virtual links are built on
a hierarchy of bit level protocols, token level protocols, packet level protocols, and
message passing protocols.
Bit level protocols use DS-Links rather than OS-Links and have four wires: a data line and a
strobe line in each of the input and output directions. In the bit level protocol,
transmission of messages is faster than before, but the receiving rate of messages is still
somewhat slow.
Token level protocols are used to prevent the process sending a message from overrunning the
receiver's input buffer. Here, the receiver of a message lets the sender know when it is OK
to send more of the message toward the destination. The tokens associated with this include:
00 Flow control token (FCT)
01 End of packet (EOP)
10 End of message (EOM)
11 Escape token (ESC)
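The FCT mechanism amounts to credit-based flow control: the sender may transmit only as many data tokens as the FCTs it has received allow. A minimal sketch follows; the credit-per-FCT value and buffer sizes are invented.

```python
# Token-level flow control sketch: the receiver grants credit with FCT
# tokens; the sender stops when its credit is exhausted, so it can never
# overrun the receiver's input buffer. Credit size is invented.

class Sender:
    def __init__(self):
        self.credit = 0                      # data tokens we may still send

    def receive_fct(self, tokens_per_fct=8):
        self.credit += tokens_per_fct        # one FCT buys more data tokens

    def try_send(self):
        if self.credit == 0:
            return False                     # must wait for the next FCT
        self.credit -= 1
        return True

s = Sender()
blocked_before = s.try_send()                # no credit yet -> blocked
s.receive_fct()
sent = sum(s.try_send() for _ in range(10))  # only 8 of 10 attempts succeed
```

Because the receiver issues an FCT only when buffer space is free, the sender's credit count can never exceed what the input buffer can absorb.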
At the packet level protocol, packetizing of messages takes place rather than use of the
flit method, but it is essentially the same concept as the token level protocol.
At the message passing level, a "virtual link" list is used that is similar to the idea of a
Ready list. Whenever a message is sent out or received, the VCP unit of the T9000
"deschedules" the message currently being worked on, saves the identifier of the new process,
and begins executing it. Once this is complete, the process that had been waitlisted continues
at the command of the VCP unit. In this way, the VCP lets messages of higher priority through.
In a remote system, the packets of a message no longer use the header in the same way.
Instead, a C104 chip is used to provide a different routing switch scheme. The C104 chip
(Figure 17.25) is shown below.
The C104 chip uses a deterministic, distributed, minimal routing protocol called interval
labeling. The intervals on a given router are non-overlapping, and a packet's destination
address determines the interval, and hence the output link, through which the packet will be
sent. The routing decision unit for each packet is the interval selector unit, which contains
thirty-five limit and base comparators; more than one interval can be assigned to the same
output link. When a packet arrives at the C104 router, its header is examined by the interval
selector unit and the crossbar switch is set according to the header's destination address.
After this, all the tokens of the packet pass straight through until the terminating token
reaches its destination.
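Interval labeling can be sketched as a list of non-overlapping half-open address ranges, one per output link; the interval table below is an invented example, not a real C104 configuration.

```python
# Interval-labeling sketch: each output link owns a non-overlapping
# half-open interval [base, limit) of destination addresses, and a packet
# goes out on the link whose interval contains its destination address.
# The interval table and link names are invented for illustration.

intervals = [
    (0,  8,  "link0"),
    (8,  16, "link1"),
    (16, 32, "link2"),
]

def select_link(destination):
    for base, limit, link in intervals:      # the base/limit comparators
        if base <= destination < limit:
            return link
    raise ValueError("no interval covers this destination")

out = select_link(12)
```

Because the intervals are non-overlapping, every destination address selects exactly one output link, which is what makes the protocol deterministic.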
The Parsytec GC
Characteristics:
a. Data network
b. Control Network
c. I/O network
The major topology designs mainly used in coarse grain systems include the mesh, the cube,
and the fat tree, due to their lower implementation costs and better than average
scalability.
A. Homogeneous Architectures
Intel Paragon
The Intel Paragon is the successor of Intel's hypercube-based machines; its nodes are
connected in a 2D mesh. The Paragon contains three node types: compute nodes, service nodes,
and input/output nodes. The compute nodes' sole responsibility is computation; they are
multiprocessor nodes, in contrast to the other two node types, which are general purpose
nodes. Each general purpose node contains an expansion port for the addition of an
input/output interface. The input/output nodes themselves serve to support input/output
connectivity, and the service nodes provide interactive use for many users.
The Multi-Processor (MP) node in this architecture is a small shared memory multiprocessor.
The node's memory is shared among four processors, and each has a 25 KB second-level cache.
The MP node also has an i860 chip used as a message processor.
The message passing process is started by the application processor but is performed mostly
by the message processor. The software used in message passing is executed from the message
processor's internal cache.
The architecture also contains a message routing unit known as the Mesh Router Controller
(MRC), organized as a 2D mesh. The MRC allows message passing at the high speed of
200 Mbyte/second. The MRC is made up of a 5x5 crossbar switch with flit buffers and several
input controllers. Along with this, two block transfer engines are included to support
communication within a node. The Network Interface Controller (NIC) provides a pipelined
interface between the MRC of the node and the node's memory.
B. Hybrid Architectures
The node in this system is comparable to the one found in the Transputer T9000 system, though
the new PowerPC 601 CPU is much more powerful than the CPU found in the T9000. Instead of the
virtual channels of the T9000, four T805 Transputers are used, and there are sixteen
bi-directional links on the system instead of the four found on the T9000. Also, the number
of communication processors, the number of CPUs, and the quantity of node memory provided are
based on customer need. The CPU and VCP in this system send signals to each other to reduce
the overhead due to excess communication. The multi-chip concept in the architecture is
combined with a 3D mesh interconnection topology. A single cube in the GC/PowerPlus machine
has four sets of 5 nodes connected through 16 C004 switch units. This system also uses a
wormhole routing scheme in message distribution.
Essentially, the coarse grain part of the system is derived through multi-threading, and it
borrows the virtual channel concept from the T9000. Application threads communicate through
the communication processor by placing a channel operation command and its parameters in
shared memory, thus sending an interrupt to the VCP.