NOC BOOK 6
Here we assume that:
1. A router only decides the distribution of traffic to its next-hop routers.
2. The ratios are on a per-destination basis, i.e., at a given node, all arriving packets destined for the same node use the same ratio, while packets that use the same output ports but go to different destinations are distributed independently according to different ratios.
3. Minimal routing is used in our algorithm. Thus, for every node, there are at most two ports to a destination, and the sum of the port ratios for a destination equals one. If there is only one permitted output port, all traffic is forced onto that port.

A. Distributed delay measurement and propagation
Next, we illustrate the measurement and propagation of the global delay information using an example in a 4x4 mesh topology, shown in Figure 1 (a). Assume all nodes in the network need to measure the delay to node 9.
Firstly, each node periodically estimates the local waiting time in the input queues for all five output ports. For every output port, this time is considered the local queuing delay $l[p]$ through port $p$, and it is approximated by counting the number of flits in the input buffers that have already requested a virtual channel to the next-hop router.
Then, at the 1st clock cycle, the delay from node 9 to itself is just the queuing delay on the ejection port of node 9. $\mathrm{Avg}_{9}[9]$ stands for the average delay from node 9 to itself and is equal to:
$$\mathrm{Avg}_{9}[9] = l[Ej] \tag{1}$$
This delay information $\mathrm{Avg}_{9}[9]$ is then propagated to all neighbors of node 9 at the 2nd clock cycle. Nodes 8, 10, 5 and 13 receive $\mathrm{Avg}_{9}[9]$ through their east (E), west (W), south (S) and north (N) ports respectively, as shown in Figure 1 (b). Each of these nodes estimates its delay to node 9 by adding $\mathrm{Avg}_{9}[9]$ to its locally measured delay on the port leading to node 9. For instance, at node 10, only the west port leads to node 9, and the average delay from node 10 to node 9 is given as:
$$\mathrm{Avg}_{10}[9] = l[W] + \mathrm{Avg}_{9}[9] \tag{2}$$
Once all one-hop routers have finished measuring the path delay, at the 3rd clock cycle all two-hop routers (12, 14, 11, 4, 6 and 1) receive updates for the delay to node 9. For instance, node 6 receives updates about the average delay to node 9 from nodes 10 and 5, connected to its north and west ports respectively. Node 6 can then estimate its average delay by computing a weighted mean of the delays through the north and west ports, with the weights given by the traffic split ratios along these ports at node 6.
$$A[N][9] = \mathrm{Avg}_{10}[9] + l[N] \tag{3}$$
$$A[W][9] = \mathrm{Avg}_{5}[9] + l[W] \tag{4}$$
$$\mathrm{Avg}_{6}[9] = W[N] \cdot A[N][9] + W[W] \cdot A[W][9] \tag{5}$$
Here, $A[N][9]$ and $A[W][9]$ represent the delay through the north and west ports respectively, and $W[N]$ and $W[W]$ stand for the traffic split ratios at node 6 for destination node 9.
Carrying on in this manner, after some clock cycles all nodes in the network are able to measure their delay to node 9 through the candidate output ports permitted by minimal routing. This process repeats periodically to ensure that the global congestion information stored in the nodes is always up to date.

Figure 1 Example of the distributed delay propagation: (a) 1st step, (b) 2nd step, (c) 3rd step, (d) 4th step

B. Adaption of traffic split ratio
The purpose of the traffic split ratio is to use the global congestion information, which is measured and propagated to each node, to balance the traffic load uniformly across the whole network. For each node in the network, the adaption of the per-destination traffic split ratios is triggered when delay information from valid downstream routers is received by the current node. The same adaption algorithm is repeated for all nodes in the network.
Suppose that at node i there are two output ports $p_x$ and $p_y$ leading to the destination j along paths permitted by minimal routing. As discussed in part A, $A[x][j]$ and $A[y][j]$, the delays to node j through ports $p_x$ and $p_y$ respectively, can be estimated by the current node. Here, we assume that the delay through port x is higher than that through port y, which means that
$$A[x][j] > A[y][j]$$
We then use this information to update the traffic split ratios with the equations below.
$$\Delta = \min\left(0.25 \cdot \frac{A[x][j] - A[y][j]}{A[x][j]},\; W[x][j]\right) \tag{6}$$
$$W[x][j]_{new} = W[x][j] - \Delta; \qquad W[y][j]_{new} = W[y][j] + \Delta \tag{7}$$
The basic idea of the above equations is to increase the traffic split ratio of the port with lower downstream delay and decrease the ratio of the port with higher delay. To avoid ratios becoming negative, we choose the minimum of the scaled relative delay difference and the current ratio of the higher-delay port.

2.2 Runtime Fault tolerant mechanism
A mechanism to handle soft/permanent faults in the network at runtime is necessary for a modern routing algorithm to deal with potential hard errors over the lifetime of the network. In our project, we propose and implement a runtime mechanism to cope with potential permanent link failures.
Since a broken link always means a topology change, the original routing table may lead to an error state, and reconfiguration is necessary to ensure complete reachability for all surviving nodes. In general, there are two families of solutions, distinguished by their method of reconfiguration. The first deploys routing tables and logic that are updated upon each fault occurrence at runtime.
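The delay propagation of part A, equations (1) to (5), can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names, the row-major node numbering of the 4x4 mesh, and the uniform example delays are hypothetical and not taken from the implementation. It also assumes that the weighted mean is taken over the per-port delays $A[p][j]$, that is, the downstream average plus the local queuing delay. Nodes are processed in order of hop distance, which mimics the cycle-by-cycle spreading of the delay information away from the destination.

```python
def minimal_next_hops(node, dest, width=4):
    """Neighbors of `node` that are one hop closer to `dest` (minimal routing)."""
    x, y = node % width, node // width
    dx, dy = dest % width, dest // width
    hops = []
    if dx != x:
        hops.append(node + (1 if dx > x else -1))        # east or west neighbor
    if dy != y:
        hops.append(node + (width if dy > y else -width))  # next or previous row
    return hops


def propagate_delays(dest, local_delay, ratio, width=4):
    """Estimate avg[n], the delay from every node n to `dest`.

    local_delay[(n, m)]: queuing delay at n on the port toward neighbor m
    local_delay[(dest, dest)]: queuing delay on the ejection port of dest
    ratio[(n, m)]: split ratio at n for traffic to dest through neighbor m
    """
    def hop_count(n):
        return abs(n % width - dest % width) + abs(n // width - dest // width)

    avg = {dest: local_delay[(dest, dest)]}               # equation (1)
    for n in sorted(range(width * width), key=hop_count):
        if n == dest:
            continue
        hops = minimal_next_hops(n, dest, width)
        # Per-port delay: downstream average plus local queuing delay.
        per_port = {m: avg[m] + local_delay[(n, m)] for m in hops}
        # Weighted mean with the split ratios as weights.
        avg[n] = sum(ratio[(n, m)] * per_port[m] for m in hops)
    return avg


# Hypothetical example: every queue has delay 1 and traffic is split evenly.
local = {(9, 9): 1.0}
ratio = {}
for n in range(16):
    hops = minimal_next_hops(n, 9)
    for m in hops:
        local[(n, m)] = 1.0
        ratio[(n, m)] = 1.0 / len(hops)
avg = propagate_delays(9, local, ratio)
```

With these uniform inputs the estimate at each node is simply its hop count to node 9 plus the ejection delay, which makes the sketch easy to check by hand.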
EECS 578 Final Project Report 3
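The split-ratio update of equations (6) and (7) in part B can likewise be sketched in a few lines. This is a minimal illustration, not the hardware logic: the function name and argument order are hypothetical, the symmetric handling of the case where port y is the slower one is an assumption, and so is the choice to leave the ratios unchanged when both delays are zero.

```python
def adapt_split_ratio(delay_x, delay_y, w_x, w_y, step=0.25):
    """Move traffic from the slower port to the faster one.

    Returns the new pair (w_x, w_y). `step` is the 0.25 factor of equation (6).
    """
    if delay_x < delay_y:
        # Port y is the slower one: apply the same rule with the roles swapped.
        new_y, new_x = adapt_split_ratio(delay_y, delay_x, w_y, w_x, step)
        return new_x, new_y
    if delay_x == 0:
        # Assumption: with no measured delay there is nothing to rebalance.
        return w_x, w_y
    # Equation (6): scaled relative delay difference, clamped to the ratio
    # of the slower port so that no ratio becomes negative.
    delta = min(step * (delay_x - delay_y) / delay_x, w_x)
    # Equation (7): the sum of the two ratios stays constant.
    return w_x - delta, w_y + delta


# Port x is twice as slow as port y: an eighth of the traffic is moved.
balanced = adapt_split_ratio(20.0, 10.0, 0.5, 0.5)      # (0.375, 0.625)
# The clamp applies when the slower port carries little traffic already.
clamped = adapt_split_ratio(100.0, 0.0, 0.125, 0.875)   # (0.0, 1.0)
```

Because the same amount is subtracted from one ratio and added to the other, the two ratios keep summing to one, as required by the minimal routing assumption.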
The second solution relies on offline software to complete the reconfiguration once a faulty link is detected, and then communicates with the surviving topology through a central node. Our solution builds on the first family while combining it with the global congestion information forwarding. We assume that when a link failure occurs, the node connected to that link detects the fault and stops injecting new packets/flits until the reconfiguration is finished. The routing table reconfiguration works as follows:
Firstly, if a link error is detected, every node in the network works as a root node and starts to broadcast a reconfiguration flag to all other nodes in the network, hop-by-hop and only through the healthy links. Meanwhile, the delay measurement and propagation process discussed in 2.1 is also initiated at this node, so the delay information Avg[i] is transmitted as well.
Then, each node that receives the reconfiguration flag does the following:
• Stall the router pipeline. On receiving a reconfiguration flag, the node stops the pipeline and freezes the virtual channel allocation and switch allocation until the reconfiguration is complete for all nodes.
• Update the routing table. For ports that received the flag, calculate and store the new traffic split ratio W[x][i] based on the delay information propagated from downstream nodes. For ports that did not receive the flag, invalidate the current split ratio and set it to zero. Then calculate the average delay from the current node to the root node. This step provides the safe paths as well as the global congestion information for the current node. This step is illustrated in Table 1.
• Flag forwarding. The node sends the reconfiguration flag to its neighbors only through ports that neither received a flag nor connect to a faulty link.
For nodes detecting a permanent link error, repeat the above process to obtain an updated routing table with safe paths from other nodes to this faulty node, as well as the network congestion information, which is used to select among these safe paths adaptively.
This reconfiguration algorithm makes use of some ideas from our global congestion propagation process: both transmit information from one destination to every possible source. Thus, if any link error occurs, the reconfiguration process works together with the distributed delay propagation to obtain full reachability to all surviving nodes as well as the global congestion state. Figure 2 illustrates an example in which one link breaks in a 4x4 mesh network.

Figure 2 Example of the reconfiguration process: (a) 1st step, (b) 2nd step, (c) 3rd step, (d) 4th step

Table 1 Traffic split ratio update based on the flag signal and delay information during reconfigurations
Destination (i) West North East North
Ratio (W) 0.6 0.55 0.4 0 0 0.45 0
Flag received Yes No Yes No

III. DEADLOCK RECOVERY MECHANISM
We use an escape virtual channel to realize the deadlock-free feature in GCA. The key idea is to provide an escape path (escape virtual channel) for every deadlocked packet. The routing algorithm for the escape path must be deadlock-free. Thus, when a packet is found to be stuck in deadlock, we can send it onto the escape path, and the packet can then use this deadlock-free path to reach its destination.

A. How escape virtual channel works:
Our approach to dealing with deadlock is not to avoid it, but rather to recover from it. There are two key phases in any deadlock recovery algorithm: detection and recovery [1]. In our algorithm, we separate the process into three stages: Detection, Filtering and Recovery.
1. Detection:
In the detection phase, the network must be able to detect whether it has reached a deadlock situation. Determining exactly whether the network is in deadlock requires finding a cycle in the resource wait-for graph. This is difficult and costly, so we use a conservative detection mechanism: timeout counters. Each input port of the router is equipped with a timeout counter. The counter is reset in only two cases: (1) when the input port receives a flit, and (2) when we detect a deadlock and allocate an escape virtual channel for that packet. In all other cases, we increase the counter by 1 per step. When the counter reaches the specified deadlock upper bound, the filtering stage is triggered.
2. Filtering:
In this phase, the network needs to figure out whether a recovery request is a real deadlock or just a false positive. We do this by checking the states of the virtual channels. A virtual channel has four states: idle, routing, virtual channel allocation (vc_alloc) and active. If any virtual channel is in the idle state, or a packet is just ready for ejection, we consider the deadlock a false positive. Otherwise, we allocate an escape virtual channel for those virtual channels in the vc_alloc state (this means that if all the virtual channels are in their active states, we do not allocate any escape virtual channel for this input either).
3. Recovery:
In this phase, we have selected those input virtual channels whose packets (head flits) have been waiting for an available virtual channel for a long time (longer than the deadlock timeout). We apply a priority selector here to help us determine which
virtual channel should be the first to obtain the escape virtual channel. After allocating the escape virtual channel, we clear the timeout counter on that input port. The process is described as an FSM in Figure 3.
In order to implement this algorithm, we first need to add a bit (named root_arrived) to each flit, which indicates whether the flit has passed through the root node.
The reason why this algorithm is deadlock-free is that it is based on the GCA table, which always gives a closer-to-destination direction even when there are permanent faults in the NoC. So when we use GCA to send a flit from the source to the root and then from the root to the destination, we actually disallow those paths that traverse a down link followed by an up link. In this way, the implemented algorithm is deadlock-free.
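The detection and filtering stages of the deadlock recovery mechanism lend themselves to a small behavioral sketch. The class and method names below are hypothetical, and the model is a software illustration rather than the router logic. It assumes one timeout counter per input port, returns every virtual channel that qualifies for an escape virtual channel, and leaves the choice among them to a priority selector, whose policy is not specified here.

```python
from dataclasses import dataclass
from enum import Enum


class VCState(Enum):
    IDLE = "idle"
    ROUTING = "routing"
    VC_ALLOC = "vc_alloc"
    ACTIVE = "active"


@dataclass
class InputPort:
    vc_states: list      # one VCState per virtual channel of this input port
    timeout: int = 0     # timeout counter of this input port

    def step(self, received_flit, ejection_ready, deadlock_bound):
        """Advance one step; return indices of VCs that need an escape VC."""
        # Detection: reset case (1), a flit arrived on this port.
        if received_flit:
            self.timeout = 0
            return []
        self.timeout += 1
        if self.timeout < deadlock_bound:
            return []
        # Filtering: an idle VC or a packet ready for ejection means
        # the timeout is treated as a false positive.
        if ejection_ready or VCState.IDLE in self.vc_states:
            return []
        blocked = [i for i, state in enumerate(self.vc_states)
                   if state is VCState.VC_ALLOC]
        if blocked:
            # Reset case (2): an escape VC is about to be allocated.
            self.timeout = 0
        return blocked


# A port whose first VC is stuck in vc_alloc, with a deadlock bound of 3 steps.
port = InputPort([VCState.VC_ALLOC, VCState.ACTIVE])
results = [port.step(False, False, deadlock_bound=3) for _ in range(3)]

# A port with an idle VC never raises a recovery request.
idle_port = InputPort([VCState.IDLE, VCState.VC_ALLOC])
idle_results = [idle_port.step(False, False, deadlock_bound=3) for _ in range(3)]
```

In this sketch a false positive leaves the counter untouched, so the filter is evaluated again on the following step; this follows from resetting the counter only in the two cases listed for the detection stage.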
input buffer can have a choice of at most two output ports, which map to one of the four quadrants, and the split ratios are normalized such that they always add up to one.

C. Adapt Split Ratio
The computations involved in the adaptation of split ratios are given as follows:
()
To simplify the implementation of these computations in hardware, we always assume λ = 0.25, which reduces the multiplication to a shift operation. The division is also avoided by extracting only the most significant bit of L[ph][j] that is set and ignoring the remaining less significant bits. This reduces the division to a shift operation.

Figure 6 Logic for Adaption of Weights

D. Reconfiguration flag forwarding unit
unit is needed to receive and forward the flag signal where the

network is conducted in comparison with some extant routing algorithms in BookSim, as well as a comparison in saturation throughput. For the evaluation of performance on a faulty network, an increasing number of faults is inserted into the network with the random fault generator at a fixed injection rate, so that the fault tolerance of the proposed routing algorithm is tested.

A. Evaluation of GCA algorithm in non-faulty network
Dimension-order, min-adaptive and xy_yx-adaptive routing are used for comparison with the proposed routing algorithm in a non-faulty network, as they are typical deterministic/adaptive routing algorithms on mesh networks. For four different traffic patterns (uniform, shuffle, bitrev and transpose), the average packet latency is measured for the three extant algorithms as well as the proposed GCA routing algorithm. The result is shown in Figure 7.

[Figure 7: average latency versus injection rate for the Uniform, Shuffle, Transpose and Bit Reverse traffic patterns, comparing GCA, Dimension Order, xy-yx Adaptive and Min Adaptive]
Figure 9 Average delay vs. number of faults

As seen in Figure 9, owing to the effective reconfiguration stage in dealing with faults, the average latency increases slowly as the number of faults increases.

C. Comparison between proposed work and some published routing algorithms
Finally, a table is presented for a comparison between the proposed GCA routing algorithm and some published routing algorithms.

[Table 2: comparison of this work with [3], [4], [5] and [7]]

As we can see from Table 2, the proposed routing algorithm works well even in comparison with some published work.

[1] R. Ramanujam and B. Lin, "Destination-based congestion awareness for adaptive routing in 2D mesh networks", ACM Transactions on Design Automation of Electronic Systems, vol. 18, no. 4, pp. 1-27, 2013.
[2] K. Aisopos, A. DeOrio, L. Peh, and V. Bertacco, "ARIADNE: Agnostic Reconfiguration In A Disconnected Network Environment", International Conference on Parallel Architectures and Compilation Techniques (PACT), Galveston Island, TX, October 2011.
[3] P. Gratz, B. Grot and S. Keckler, "Regional Congestion Awareness for Load Balance in Networks-on-Chip", HPCA, 2008.
[4] D. Seo, A. Ali, W. Lim, N. Rafique and M. Thottethodi, "Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks", ACM SIGARCH Computer Architecture News, vol. 33, no. 2, pp. 432-443, 2005.
[5] A. Singh, W. Dally, B. Towles and A. Gupta, "Globally Adaptive Load-Balanced Routing on Tori", IEEE Comput. Arch. Lett., vol. 3, no. 1, pp. 2-2, 2004.
[6] S. Jovanovic, C. Tanougast, S. Weber, and C. Bobda, "A new deadlock-free fault-tolerant routing algorithm for NoC interconnections", in Proc. Int. Conf. Field Program. Logic Appl., Aug.–Sep. 2009, pp. 326–331.