Streaming Application On Many-Core Systems
Ali Abuassal∗
∗ Department of Electronic Engineering, The University of York, York, UK
[email protected]
Fig. 3. Microprocessors Trends Data [10].
[Figure: processing units (PU) connected to data pools.]
[Figure: 2-D mesh NoC in which each router exchanges data with its neighbours through TX/RX channel pairs.]
[Figure: NoC layers, with labels including NoC OS, router layer (packet routing, switches), and physical layer (physical transmission of phits, interconnect links).]
in 2016 as announced by the University of California, Davis [35]. To connect such complicated architectures, Networks-on-Chip (NoC), introduced by Benini and De Micheli [5], have become the default standard.

1) From Bus to NoC: Many recent electronic devices contain complete systems embedded in a chip. As a result, intra-chip communication requirements have become crucial. Point-to-point links are often adopted to satisfy the requirements of on-chip communication that connects the top-level modules, but, as the complexity of the systems increases, wire density and length have grown. Thus point-to-point architectures have become increasingly infeasible, due to high propagation delays, high power dissipation, poor scalability, and lack of reliability. In particular, global wires across the chip, which cannot be scaled with technology [52], have become a bottleneck in bus-based interconnections. Moreover, in deep sub-micron VLSI technology, delay, power, and reliability are crucial issues [53]. Therefore, centralized approaches to global communication across top-level modules are no longer suitable for advanced architectures that contain a large number of cores. Current on-chip bus interconnect templates, for instance the AMBA and CoreConnect buses from ARM and IBM (respectively), are widely used in multiprocessor systems-on-chip (MP-SoC) but do not scale to large numbers of cores as they allow only one communication transaction at a time. Consequently, the average communication bandwidth of every core is inversely proportional to the total number of cores in the chip [52]. This implies that the on-chip bus architecture is inherently not scalable for many-core devices.

The Network-on-Chip (NoC) approach was introduced by [5] to provide solutions for the aforementioned issues and to introduce a structured and scalable communication architecture [54].

2) Concept of Network-on-Chip: A typical Network-on-Chip (Fig.6) is a scalable general-purpose on-chip communication network that enables the design of increasingly complex multi-processor systems. A NoC divides the long cross-chip wires into smaller segments, allowing their properties to be optimized and controlled. NoCs use packets to route data from the source to the destination processor through a network fabric. Packets are sent in a pipelined fashion, which enhances the operating frequency and addresses signal integrity, that is, ensuring that all transmitted signals are received correctly and do not interfere with one another. The basic NoC architecture, proposed by [5] and [54], was inspired by the modern land development process, in which road and communication infrastructures are laid first, then buildings are placed and built. By mimicking the modern city, the chip is divided into rectangular tiles (processing elements) and roads between tiles, where on-chip network routers are placed. Fig.6 provides an overview of a typical 2-D mesh NoC, which consists of 5-port routers that can exchange data through two unidirectional channels with each of the four neighbouring routers (north, east, south, and west) as well as with the local processing node/core. A router is not only responsible for its associated tile but also routes signals from/to other routers.

3) Network-on-Chip layers: A Network-on-Chip can be divided into the following layers:

Application layer: In this layer, applications are broken down into a set of computation and communication tasks. The functionalities of this layer include message synchronization and management; for instance, performance factors such as energy and speed can be optimized.

Transport layer: According to Fig.7, the transport layer provides end-to-end communication (between switches) and delivery of data using the network router layer. Its functionalities therefore include packetization of data at the source and depacketization at the destination.

Network router layer: This layer provides the service of communicating a packet from one resource to another using the network of switches. Buffering packets and taking routing decisions in the intermediate switches are the functionalities of this layer.

Physical layer: This layer is concerned with the physical characteristics of the NoC for connecting switches and resources with each other.
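The packetization and depacketization attributed to the transport layer above can be sketched in a few lines. This is only an illustrative model: the flit payload size, the Flit fields, and the function names packetize and depacketize are assumptions made for the example, not details taken from the cited NoC designs.

```python
from dataclasses import dataclass
from typing import List

FLIT_PAYLOAD_BYTES = 4  # assumed payload size per flit, chosen only for illustration


@dataclass
class Flit:
    kind: str       # "head", "body" or "tail" (hypothetical labels)
    src: int        # source node id
    dst: int        # destination node id
    seq: int        # position of this flit inside the packet
    payload: bytes


def packetize(message: bytes, src: int, dst: int) -> List[Flit]:
    """Split a message into fixed-size flits at the source."""
    chunks = [message[i:i + FLIT_PAYLOAD_BYTES]
              for i in range(0, len(message), FLIT_PAYLOAD_BYTES)] or [b""]
    flits = []
    for seq, chunk in enumerate(chunks):
        if seq == 0:
            kind = "head"
        elif seq == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        flits.append(Flit(kind, src, dst, seq, chunk))
    return flits


def depacketize(flits: List[Flit]) -> bytes:
    """Rebuild the message at the destination, whatever the arrival order."""
    ordered = sorted(flits, key=lambda f: f.seq)
    return b"".join(f.payload for f in ordered)
```

In this sketch the sequence number lets the destination reassemble the message even if flits arrive out of order; a real network interface would typically carry the routing information only in the header.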
4) Topology: Topology is one of the fundamental features of NoC design because it plays a crucial role in overall network performance and cost. The topology determines the physical connections between the nodes and the number of alternative paths between them, and as a result it affects the network traffic distribution. Several topologies have been proposed for high-performance parallel computing (for example, hypercubes) but 2D meshes, rings, or tori are the most common topologies in current integrated circuits [55]. Fig.8 shows the three most used topologies.

Fig. 8. Three typical NoC topologies. (a) Ring. (b) 2D mesh. (c) 2D torus.

The ring topology (Fig.8(a)) is commonly used in networks with a small number of nodes, for instance in the six-node Ivy Bridge [56] and Cell processors [57]. Since the ring network has a simple structure, the routing logic and flow control are easy to implement. Moreover, a centralized control mechanism can be applied [56] [57]. Scalability is limited in the ring topology but it can be improved using multiple rings, as Intel did in its Knights Corner processor, where ten rings are used [58].
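One way to see why a single ring scales poorly compared with the 2D mesh and 2D torus of Fig.8 is to compare the average number of hops between two distinct nodes. The sketch below is a simplified model that assumes shortest-path routing and uniform traffic; the function names are hypothetical.

```python
from itertools import product


def average_hops_ring(n: int) -> float:
    """Average shortest-path hop count between distinct nodes of an n-node ring."""
    total = 0
    for a in range(n):
        for b in range(n):
            if a != b:
                d = abs(a - b)
                total += min(d, n - d)  # go whichever way round is shorter
    return total / (n * (n - 1))


def average_hops_grid(k: int, wraparound: bool) -> float:
    """Average hop count in a k x k mesh (wraparound=False) or torus (True)."""
    nodes = list(product(range(k), repeat=2))
    total = 0
    for (x1, y1), (x2, y2) in product(nodes, repeat=2):
        if (x1, y1) == (x2, y2):
            continue
        dx, dy = abs(x1 - x2), abs(y1 - y2)
        if wraparound:
            # torus links let a packet take the shorter way in each dimension
            dx, dy = min(dx, k - dx), min(dy, k - dy)
        total += dx + dy
    n = len(nodes)
    return total / (n * (n - 1))
```

For 16 nodes this model gives about 4.27 hops for the ring, 2.67 for a 4x4 mesh, and 2.13 for a 4x4 torus, and the gap widens as the node count grows.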
The 2D mesh topology (Fig.8(b)) is highly compatible with wire routing in CMOS technology and it can also potentially reduce or even eliminate deadlocks (9) in the NoC. These factors, in addition to its high scalability, make the 2D mesh topology the most commonly utilized NoC topology in industry and academia. Some examples include the U.T. Austin TRIPS [59], Intel Teraflops [60], and Tilera TILE64 [61] chips. Most researchers assume 2D mesh topologies for their theoretical research and simulations. Congestion at the centre of the 2D mesh is the most obvious limitation, becoming a bottleneck for large numbers of cores. However, according to [62], this issue might be reduced by assigning more wiring resources to the centre portion, in other words by designing asymmetric topologies.

The torus topology (Fig.8(c)) helps to balance network utilization, as the wraparound links reduce the latency and congestion in the centre. However, deadlock may still occur and additional mechanisms are often used, such as multiple virtual channels (VCs) [63] or bubble flow control (BFC) [64] [65]. 2D and higher-dimension tori are utilized in many-core systems such as the K Computer [66], Blue Gene [67] [68], the Cray T3E [69], and the SpiNNaker chip [70].

Other topologies have also been proposed, such as star and tree. In the former, all the nodes are connected to a central (common) node, usually referred to as the super-node, while the latter has a central root node which is connected to one or more nodes of a lower hierarchy. According to [71], both star and tree topologies have very limited scalability due to the high cost of link implementation. However, they can profit from 3D topology, where the average distance between the nodes is dramatically reduced.

Research on topologies has mainly focused on reducing latency, improving throughput, and reducing area and power overhead [55]. Reducing the latency in the NoC is probably the most interesting avenue for topology design because it has a huge impact on the overall performance of the NoC. [72] provides an extensive study evaluating various topologies for on-chip communication and raises interesting conclusions, such as the fact that connecting more than one core to a single router is an efficient method to reduce latency. Connecting routers that are not neighbours by adding physical channels can also reduce latency. However, the authors of [73], while generally agreeing with [72], argue that adding more channels and connecting more cores to the same router increases the design complexity dramatically.

Another important research aspect for topology design is improving network throughput. According to [74], the central portions of a mesh network are more likely to be congested, which limits throughput. Therefore, [74] proposes a heterogeneous mesh network that adds higher bandwidth to the routers in the centre. This heterogeneous network requires the division of large packets into smaller ones and also the combination of small packets into large ones, in order to provide communication between narrow and wide channels.

The last important research field in topology design is the reduction of area and power overheads. Regarding power consumption, [75] proposes a cubic ring topology, in which 30% of the routers can be turned off dynamically to reduce power consumption. [76] places a ring alongside a mesh topology, so the ring part acts as a backup connection in case mesh routers are turned off. In this proposal, all mesh routers can be shut down to save power. Virtual channels that can dynamically configure links between routers to form a bus structure to reduce power and latency are proposed by [77].

5) Router architecture: The architecture of the router defines the area overhead, power consumption, and routing delay. A canonical router is composed of input units, routing computation logic, switch allocators, a crossbar, and output units. Fig.9 illustrates the architecture of a common router.
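As a rough illustration of the routing computation stage in such a router, the sketch below selects an output port for a 5-port mesh router (north, east, south, west, local). Dimension-ordered XY routing is used here only as an assumed example of a routing function; the coordinate convention and the names xy_output_port and xy_route are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y); assumed convention: x grows eastwards, y northwards

# Coordinate change caused by leaving a router through each inter-router port.
STEP = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}


def xy_output_port(current: Coord, dest: Coord) -> str:
    """Pick the output port: resolve the X offset first, then the Y offset."""
    cx, cy = current
    dx, dy = dest
    if dx > cx:
        return "east"
    if dx < cx:
        return "west"
    if dy > cy:
        return "north"
    if dy < cy:
        return "south"
    return "local"  # the packet has arrived: deliver it to the attached core


def xy_route(src: Coord, dest: Coord) -> List[str]:
    """List the output port chosen at every router on the way from src to dest."""
    ports = []
    current = src
    while True:
        port = xy_output_port(current, dest)
        ports.append(port)
        if port == "local":
            return ports
        sx, sy = STEP[port]
        current = (current[0] + sx, current[1] + sy)
```

The number of inter-router hops on such a route equals the Manhattan distance between source and destination, which is why topology choices that shorten this distance reduce latency.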
[Fig. 9: router architecture, with input channels (from other routers and from the local processing element), routing and arbitration logic, a crossbar, and output channels (to other routers and to the local processing element).]
[Fig. 10: cycle-by-cycle timing of a three-flit packet (F1 to F3) travelling from router 0 to router 3 under different flow controls; panel (a) spans cycles 0 to 11 and panel (b) spans cycles 0 to 8.]

circuit switching reserves physical link(s) from source to
from source to destination, the header flit reserves the physical links. When the destination receives the header flit, it sends an acknowledgement to the source which then sends the rest of the data.
Store-and-forward (SAF): In SAF flow control [85] the entire packet is received first and then forwarded to the next router. Fig.10(a) illustrates an example of SAF, in which a packet consisting of three flits (F1 to F3) is routed from router 0 to router 3. At each router the entire packet is received flit by flit before it is forwarded to the next router. As a result, SAF introduces a latency of at least N clock cycles at every hop (where N is the number of flits in a packet).

Fig. 11. Router architecture with virtual channel.
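The per-hop cost of SAF can be made concrete with a small latency model. The sketch below assumes one clock cycle per flit per router stage and no contention; the pipelined formula is the usual textbook approximation for flit-by-flit forwarding and is included only for contrast. Function and parameter names are hypothetical.

```python
def saf_latency(num_stages: int, flits_per_packet: int) -> int:
    """Store-and-forward: every stage receives the whole packet before forwarding.

    num_stages counts the routers the packet occupies in turn; each one costs
    N cycles, where N is the number of flits in the packet.
    """
    return num_stages * flits_per_packet


def pipelined_latency(num_stages: int, flits_per_packet: int,
                      stall_cycles: int = 0) -> int:
    """Flit-by-flit forwarding: the header advances one stage per cycle and the
    remaining flits follow in a pipeline. stall_cycles models time spent waiting
    for downstream buffer space.
    """
    return num_stages + flits_per_packet - 1 + stall_cycles
```

With four routers and a three-flit packet, the SAF model gives 12 cycles, which matches the span of cycles 0 to 11 in the SAF example, while the pipelined model gives 6 cycles when nothing stalls.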
Virtual cut-through (VCT): In order to reduce the serialization latency, another type of flow control was introduced by [86]. Virtual cut-through flow control forwards the packet flit by flit as soon as the header flit (F1) is received by the router, if the link is available. As can be seen in Fig.10(b), every router forwards the packet immediately after receiving the header flit. However, as VCT allocates the buffers at packet granularity, it requires the downstream router to have enough buffer space for the entire packet prior to forwarding the header flit. For instance, in Fig.10(b) at cycle 2 the target buffer does not have sufficient free slots for the entire packet, therefore the packet waits until there is enough space (after 3 more cycles).

Wormhole: Unlike SAF and VCT, wormhole flow control (Fig.10(c)) requires only one buffer slot to be available in the downstream router before starting to forward the packet. Generally, if there is no congestion in the network, wormhole and VCT perform equally. When congestion occurs, however, they behave differently because wormhole requires only one buffer slot while VCT requires buffer slots for the whole packet before forwarding it [87].

Virtual channels (VC) can be used with any of the aforementioned flow controls. VCs consist of buffers which can hold one or more flits of a packet. Several virtual channels might share the bandwidth of a single physical channel Ch. VCs can reduce or avoid deadlock: for instance, as can be seen in Fig.11, if a blocked packet A fills buffer b0, the other buffers b1, b2, and b3 remain available, allowing other packets to pass while packet A is holding buffer b0 [78].

Other flow controls have been proposed by various researchers to improve performance and achieve better resource allocation. [88] proposes a bufferless flow control that handles resource congestion at the flit granularity. If more than one flit competes for the same port, only one flit is granted the port and the rest will be stored or dropped. Mitchell et al. [89] took bufferless flow control further by proposing a single-cycle adaptive routing and bufferless network (SCARAB) in which the non-granted flits are dropped and a negative acknowledgement is sent back to the source node to retransmit. An adaptive flow control that combines buffered and bufferless techniques is proposed in [90]. It allows routers to buffer the flits in case of high load, but it turns off the buffers to apply a bufferless scheme to save power in case of low load. [91] shows that bubble flow control exhibits a reduction in base latency values of over 40% with respect to the corresponding wormhole. A research group at the National University of Defence Technology [92] has proposed a hybrid flow control similar to VCT in terms of injecting packets, but using packet movement typical of wormhole flow control.

To summarize, research on flow control techniques mainly focuses on reducing the packet transmission latency, reducing power consumption by using bufferless methods, and avoiding deadlock in the network.

J. Mapping Applications on Many-Core Processors

The following processes are required before mapping applications onto a many-core platform:
• The parallelization of the application, including defining communication between the parallelized tasks and
synchronization. This can be done by one of the standard application parallelization tools such as [93] [94] [95].
• The transformation of the application into a task graph, using for example task graph generators such as Task Graphs For Free (TGFF) [96] and Synchronous Data-flow Graphs (SDFGs) [97].
• An analysis of the constraints of the application, such as power consumption and performance.
• When considering heterogeneous platforms, task binding is required to map tasks to suitable HW resources.

Mapping application tasks on many-core platforms can be carried out either statically (at design time) or dynamically (at run time). Mapping at design time is used for applications with known computation and communication behaviour, but is less efficient for dynamic workloads (for instance, when adding new applications at run time). Run-time mapping methods, on the other hand, consider applications during their operation, generally using task migration to move tasks in case the application requirements change or a new application is introduced into the platform.

1) Design Time Mapping: Design-time mapping techniques must have a complete picture of the system and application beforehand in order to make appropriate decisions on resource use. Because there are no computational or time restrictions involved, a higher quality of mapping can be obtained compared to run-time mapping techniques, which are usually restricted to a local view and operate under tight constraints. Most of the literature on mapping covers design-time methods.

[98] proposes the Communication Weighted Model (CWM) as a mapping technique to reduce the overall power consumption by reducing energy consumption in communication. Unlike other mapping strategies, [99] takes into consideration the dynamic behaviour of the target application and thus potential contentions in the communication between cores, showing that a 42% average reduction in the execution time of the mapped application can be obtained, together with a 21% average reduction in the total energy consumption. A mapping scheme based on a branch-and-bound algorithm is proposed in [100] to map applications on hybrid NoCs. The results show that this scheme can reduce communication latency.

A two-step genetic mapping algorithm is used in [101] to optimize the application execution time. The algorithm proposes mathematical delay models and maps vertices of a multi-task graph to available cores so that every task can meet its respective deadline. A genetic mapping algorithm that utilizes dynamic voltage scaling (DVS) is proposed in [102] to decrease power consumption. Considering DVS during the mapping optimization can save up to 51% of the energy.

As these mapping methods determine the placement of tasks at design time, they are not applicable to applications with dynamic workloads, which need remapping or run-time mapping.

2) Run Time Mapping: In contrast to design-time mapping, run-time mapping must take into account the time taken to remap tasks as this affects the overall execution time of the application. Moreover, in dynamic mapping tasks are generally mapped one by one. Therefore, greedy algorithms are normally utilized for efficient mapping so that performance metrics (communication latency, execution time, power consumption, etc.) can be optimized. Furthermore, run-time mapping provides several advantages over static mapping, as it can adapt to the available resources and also discard defective parts of the platform (allowing fault tolerance techniques).

[103] presents a heuristic algorithm which is distributed over the processors and therefore can be applied to systems of any size. Moreover, tasks added at run time can be handled without any difficulty, allowing for online optimisation. Tasks can also be migrated based on local information on processor workloads, task size, communication requirements, and link contention. The mapping results for several example task sets suggest that the performance of mappings obtained by this algorithm is within 25% of that of the exact algorithm for a 3x3 mesh topology platform. Task allocation strategies based on bin-packing algorithms with task migration ability are proposed by [104], where various types of algorithm are combined to obtain better allocation results. The system can shut down idle processors and apply dynamic voltage scaling to processors with slack, thus reducing power consumption.

To cope with the dynamism of application workloads at run time and improve the efficiency of the underlying system architecture, [105] presents a hybrid task mapping algorithm that combines a static mapping exploration and a dynamic mapping optimization to achieve an overall improvement of system efficiency. The algorithm was evaluated using a heterogeneous MPSoC system with three real applications. According to [105], the results reveal the effectiveness of the proposed algorithm: in test cases with three simultaneously active applications, the mapping solutions derived by the approach have average performance improvements ranging from 45.9% to 105.9% and average energy savings ranging from 14.6% to 23.5%.

To map streaming applications on a heterogeneous platform, [106] proposes a run-time spatial mapping technique that contains four steps. This algorithm is implemented on an ARM926 operating at 100 MHz and is able to obtain significant improvements in latency. [107] suggests that allowing multiple tasks to be allocated per core will reduce the energy used by the NoC.

One of the aims of dynamic mapping is to achieve load balance in NoCs, thus avoiding hotspots. Most of the literature focuses on local optimization such as minimizing the number
of hops between communicating tasks, which may lead to hotspot zones and underutilization of resources. Recently, some researchers have tried to address these issues. For instance, [108] proposes a runtime mapping heuristic which has a cost function that targets temporal workload and energy consumption balance in large-scale systems.

However, using a centralized manager (CM) approach (which is the case in most of the literature) has several issues with respect to the proposed research, including: a single point of failure, high monitoring traffic generated by the CM, a bottleneck around the CM because each core sends its status to the CM after mapping, and, in some instances, the fact that the CM itself becomes a hotspot.

K. Fault Classes

In general, three kinds of faults are the basis of most fault models:

Transient faults occur for a short time. For instance, the change of value of one bit (bit-flip) can corrupt the header of a packet. For this type of fault, error control could be implemented at the link or in the transmitter and receiver [85]. Transient faults are typically generated as a result of terrestrial cosmic neutrons and alpha particles which come from radioactive impurities in the device or packaging material [109] [110] [111]. A flipping of bits in SRAM or DRAM memory might happen due to these particles.

Intermittent faults appear frequently but are not permanent. They are therefore not easy to distinguish from transient faults. However, [109] proposes three features that can distinguish between intermittent and transient faults: (1) intermittent faults occur repeatedly at the same location, (2) errors induced by intermittent faults tend to occur in bursts, and (3) replacement of the offending circuit removes the intermittent fault. As an example, electromagnetic interference such as crosstalk or self-coupling could cause intermittent faults on wires.

Permanent faults represent defects in the hardware. Both transient and intermittent faults could lead to logic faults which will progress into permanent faults over time [112]. Unlike the other types of faults, permanent faults do not disappear, because they represent permanent damage to the circuits or wires, which often manifests as short/open circuit errors due to ageing and physical failure. Permanent faults can be divided into two classes:
• Logic faults, in which CMOS devices (transistors) or wires are permanently open or shorted.
• Delay faults, which cause transistors or wires to be slower than previously, as a result possibly causing set-up and/or hold time violations, which generate incorrect logic values.

In order for a fault to be handled, it is important to differentiate between the three aforementioned fault classes. While transient and intermittent faults can be handled by soft techniques (error correcting codes or multi-path routing) [113], permanent faults are more challenging to correct.

L. Faults in NoCs

Packet-switched NoCs [5] are widely used instead of the traditional shared bus for on-chip interconnects in many-core systems. However, failures that appear in any part of the NoC can compromise the correct functionality of the entire system. Therefore, it becomes advisable to introduce fault-tolerance features.

Broadly speaking, a fault can appear at three different layers of NoC architectures, for each of which specific fault-tolerance mechanisms are relevant. Various fault tolerance techniques have been suggested to tackle errors [114] at the transport layer, router layer, and physical layer (Fig.7).

1) Fault Tolerance in the Transport (link) Layer: The transport layer, also known as end-to-end communication as it links routers (ends) (Fig.7), provides communication services between network routers. In order to achieve reliable end-to-end communication for NoCs, fault tolerance is needed at this layer. According to [112], fault-tolerance schemes in this layer can be classified into four types:

Automatic repeat request (ARQ) is basically a time redundancy technique based on acknowledgements and resends to achieve reliable system operation and to correct corrupted packets. The acknowledgement (ACK) or negative acknowledgement (NACK) signal is generated by the receiver side when it decodes the packet, encoded by the predecessor. If the packet is corrupted, then the receiver will send a NACK signal; upon receiving the NACK signal the sender will send the packet again. A time-out (predefined time) mechanism is also used to account for errors in ACK and NACK: if the predecessor does not receive the ACK/NACK signal it will keep retransmitting the packet until it receives the ACK/NACK signal. In such a situation, a buffer is needed to store the packets [111].

In Forward error correction (FEC), the predecessor encodes the packets using an error correction code that lets the receiver decode and correct the error itself without sending an acknowledgement signal. Block codes and convolutional codes are the two main types of FEC [112]. One of the most widely used codes in NoC end-to-end communication is the Hamming code, which is a classical block code. [115] studies various forward error correction methods including Hamming codes for NoC communication.

In the hybrid of ARQ and FEC (HARQ), packets are encoded using an error correction code, so the receiver can decode packets, detect errors, and correct them, but if the error cannot be corrected, it will send a request to the predecessor to resend the packet. This process is repeated as long as the error is not corrected by the receiver. [116] proposes a HARQ that can be configured to work in different modes (detection, correction, and mixed-mode) based on the specific application to achieve various quality of service levels. While the correction mode allows correcting corrupted packets and always forwards them without sending an acknowledgement to the source to retransmit, which reduces the latency, the detection mode bypasses or even disables the decoding part completely but sets a flag as
soon as it detects an error, so that the source will transmit
the packet again. This mode reduces power consumption by
switching off correction components, but increases latency. In
mixed mode, different error control approaches are applied to
various parts of the transmitted packets, for example errors in
headers can be corrected, while errors in the payload can only
be detected [110].
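The interplay between FEC and ARQ in a hybrid scheme can be sketched with a small block code. The example below uses an extended Hamming (8,4) code, which corrects single-bit errors and detects double-bit errors; this code choice, the channel callback, and all function names are assumptions made for illustration and do not reproduce the configurable scheme of [116].

```python
from typing import Callable, List, Optional, Tuple

Bits = List[int]


def encode(data: Bits) -> Bits:
    """Encode 4 data bits as 7 Hamming bits (positions 1..7) plus overall parity."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    p0 = 0
    for bit in word:
        p0 ^= bit
    return word + [p0]


def decode(received: Bits) -> Tuple[str, Optional[Bits]]:
    """Return ("ok" | "corrected" | "retransmit", data or None)."""
    word, p0 = list(received[:7]), received[7]
    syndrome = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= pos    # XOR of set positions is 0 for a valid codeword
    overall = p0
    for bit in word:
        overall ^= bit         # 1 if an odd number of bits were flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        if syndrome != 0:
            word[syndrome - 1] ^= 1   # FEC: flip the single faulty bit back
        status = "corrected"          # syndrome == 0 means only p0 was hit
    else:
        return "retransmit", None     # double error: fall back to ARQ
    return status, [word[2], word[4], word[5], word[6]]


def harq_transfer(data: Bits, channel: Callable[[Bits, int], Bits],
                  max_attempts: int = 4) -> Tuple[Optional[Bits], int]:
    """Send data through `channel`, resending while the receiver cannot correct."""
    for attempt in range(1, max_attempts + 1):
        status, decoded = decode(channel(encode(data), attempt))
        if status != "retransmit":
            return decoded, attempt
    return None, max_attempts
```

In this sketch a single flipped bit is repaired at the receiver without any resend, while a double error triggers a retransmission request, which mirrors the latency versus reliability trade-off described above.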
Spatial redundancy techniques use alternative links or routes when the current link is identified as faulty. A reconfigurable network interface (NI) with two ports is proposed by [117]: a main port and a spare port, used as backup. By reconfiguring the NI, some internal faults and a broken primary port can be tolerated. One core with multiple NIs that connect to more than one router is presented in [118], improving the fault tolerance of the connections between NIs and routers but still suffering from errors in the communication due to faulty behaviours of NI components. [119] proposes a functional fault model for the NI components by evaluating their susceptibility to faults.

In [120] and [121] a spatial redundancy technique is utilized that includes transmitting a packet over disjoint paths. In case of an error on one transmission path, the uncorrupted packets that are transferred via the alternative paths can be used. Multi-path routing [122] can decrease latency compared to a retransmission technique. However, multi-path routing increases the utilization of NoCs due to spatial redundancy.

The three kinds of faults described in the previous section (transient, intermittent, and permanent) can affect the correctness of the payload or header, and therefore can potentially destroy the packets completely. Most of the literature regarding detecting and protecting from errors in end-to-end communication uses implicit models, which imply the existence of errors in some bits in the packet due to faults. [123] introduces a link-fault technique to diagnose faulty links in NoCs based on functional fault models and implements packet-address-driven test configurations. [123] labels a link as faulty if a packet enters a router and either the packet is corrupted or it is not sent to the relevant output port (corrupted header).

To detect and correct errors in end-to-end communication, error control coding can be used. For instance, an encoder can be added to the sender and a decoder can be added to the receiver. Parity codes or cyclic redundancy check (CRC) codes are used as error detecting codes. In such cases the sender network interface will have one or more buffers to store the transmitted packets. If the receiver detects an error in the packet, it will request the packet to be sent again from the sender. However, this scheme has disadvantages: first of all, it cannot precisely locate the position of the fault since the checking is done at the receiver only; moreover, it can only overcome transient faults, while in case of intermittent and permanent faults it is highly likely that the resent data will be affected by the same fault. Using error correction codes, some faults can be handled by the receiver. However, “only a limited number of faults can be handled and it gets overwhelmed when permanent faults accumulate over time” [112].

[123] proposes a strategy to locate faulty links in the NoC by extending the use of test configurations for diagnostic purposes. The algorithm employs three test configurations for mesh-like NoCs: (a) straight paths, (b) turning paths, and (c) local resource connections. These configurations cover the entire NoC as seen in Fig.12. The first configuration (a) drives the packets straight across the NoC, thus the faults in the straight connections will be checked, while (b) tests the routing paths by taking advantage of deterministic XY routing. Finally, the links to the resources are covered by (c). Although these test configurations can locate faults in individual connections between the routers and inside the switches, thus achieving high fault coverage, the authors do not mention when to run this test. Obviously, it is infeasible to run an application during the test, and therefore there is a trade-off between fault detection and performance.

Fig. 12. Test configurations in [123]: (a) straight paths; (b) turning paths; (c) local resource connections.

2) Fault Tolerance in the Network Router Layer: Fault tolerance at this level generally has a higher cost compared to that of the transport layer, because of the complexity of the router architecture, and may require additional components or memory for storing routing tables.

An approach for dynamic testing to detect up to 85% of errors in routing logic, FIFO control paths, and the arbiter is presented in [124]. This methodology is implemented in a NoC with a basic packet switch and without considering QoS support.

When a faulty router is detected and localized using one of the fault detection techniques, it must be eliminated (bypassed). One of the strategies to bypass routers with faults is to modify the routing, so that the packets can be routed away from the defective router [112]. In [125] a basic XY routing technique is implemented. In order to avoid a faulty router, adaptive routing can be employed. This strategy, however, is not deadlock free for the commonly used wormhole switching NoCs. The adaptivity of the odd-even turn model is presented in [126] to bypass rectangular areas. A routing algorithm called Minimal and Adaptive Fault-tolerant Algorithm (MAFA) is presented in [127] to route packets via shortest paths in the presence of faulty links. [112] presents a table that lists and compares fault tolerant techniques.

A single-fault tolerant (SFT) technique is proposed by [128] for a 2D-Mesh NoC using various deterministic routing algorithms on the same physical channel based on multiple virtual channels. [129] proposed an ant colony optimization-based fault-aware routing (ACO-FAR) algorithm to avoid hotspots around faulty routers. ACO-FAR consists of three
steps: (1) detection of fault information (encounter), (2) search for alternative paths (search), and (3) alternative path selection (select).
Despite the existence of fault tolerance techniques at the network router level, there are still some limitations because they mainly focus on routing packets away from the faulty region, which might add traffic to the surrounding areas, consequently generating more congestion zones.
3) Faults in the Physical Layer: The reliability of CMOS devices can be affected by different kinds of physical failures. Generally, these faults can be categorised into four groups: radiation, electrostatic discharge, electromagnetic interference, and ageing.
Radiation: Radioactive impurities in the circuit and packaging materials generate alpha particles, which, like terrestrial cosmic rays, can cause soft errors [109] [110] [111]. These particles can flip one or more bits in memory. This is usually referred to as a Single Event Upset (SEU), whereas if a particle drives a gate or wire to an incorrect voltage level it is called a Single Event Transient (SET).
Electrostatic Discharge: Devices can break down due to a high electric current, which can enter via I/O pins or be introduced by strong electric fields. Electrostatic breakdown is classified into three types: (1) dielectric oxide breakdown, (2) PN junction breakdown, and (3) wiring breakdown. However, internal components such as NoCs will rarely be affected due to the protection at the pins of the ICs [130].
Electromagnetic Interference: Crosstalk between long parallel wires is said to be the main source of electromagnetic interference. With the scaling of the technology the wires become thinner, thus increasing delay and resistance. Consequently, the coupling capacitance and inductance between parallel wires are growing. Moreover, the signal on one wire can influence the signal on the neighbouring wire, which increases signal delay, glitches, and damped voltage oscillations [131].
Ageing: The performance of CMOS devices decreases over time due to a number of physical effects. For instance, some of the carriers (electrons or holes) can make their way through the insulating silicon oxide layer beneath the gate, in a process known as Hot Carrier Injection (HCI). As a consequence, the switching characteristics of transistors gradually change, particularly the threshold voltage [132].
Fault tolerance in the physical layer is not explored in this paper because it does not fall within the scope of the proposed research.
III. CONCLUSION
This paper has presented a literature review on some areas relevant to many-core systems and their applications. It starts by explaining the scaling of transistors and how this is affected by physical limitations, which led to the idea of putting more than one core on the same chip. This hastened the introduction of many-core systems, which provide appealing architectures but also introduce new research challenges. In order to utilize these new architectures, applications are required to be processed in parallel. The fundamental laws of parallel computing are explained together with the various levels of parallelism. The mechanisms of mapping applications onto many-core platforms are discussed and relevant literature is explored. Furthermore, fault-tolerance techniques in the three layers of NoC are presented. Even though extensive research has already been carried out in the area of many-core systems and on how they can be utilized efficiently, a breakthrough is still needed at different levels of these systems.
ACKNOWLEDGEMENTS
I am thankful to my supervisors Dr. Gianluca Tempesti and Dr. Martin Trefzer, who provided guidance and expertise that greatly assisted the research. I am also grateful to Nizar and Pedro for their technical help in understanding and setting up the Graceful platform.
REFERENCES
[1] G. E. Moore, “Cramming more components onto integrated circuits, reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff.” IEEE Solid-State Circuits Society Newsletter, vol. 11, no. 5, pp. 33–35, Sept 2006.
[2] András Vajda, Programming Many-Core Chips. Springer Science+Business Media, 233 Spring Street, New York, USA: Springer, 2011.
[3] A. K. Singh, M. Shafique, A. Kumar, and J. Henkel, “Mapping on multi/many-core systems: Survey of current and emerging trends,” in 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), May 2013, pp. 1–10.
[4] J. Torrellas, “How to build a useful thousand-core manycore system?” in 2009 IEEE International Symposium on Parallel Distributed Processing, May 2009, pp. 1–1.
[5] L. Benini and G. D. Micheli, “Networks on chips: a new soc paradigm,” Computer, vol. 35, no. 1, pp. 70–78, Jan 2002.
[6] M. A. A. Faruque, R. Krist, and J. Henkel, “Adam: Run-time agent-based distributed application mapping for on-chip communication,” in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 760–765.
[7] P. Campos, N. Dahir, C. Bonney, M. Trefzer, A. Tyrrell, and G. Tempesti, “Xl-stage: A cross-layer scalable tool for graph generation, evaluation and implementation,” in 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), July 2016, pp. 354–359.
[8] J. Cavazos, “Lecture 1 The Multicore Revolution,” Dept. of Computer & Information Sciences, University of Delaware.
[9] Jack Dongarra and others, The Sourcebook of Parallel Computing, 1st ed. Elsevier, November 2002.
[10] Karl Rupp, “40 years of microprocessor trend data,” June 2015.
[11] S. Borkar and A. A. Chien, “The future of microprocessors,” Communications of the Association for Computing Machinery, vol. 54, no. 5, pp. 67–77, May 2011.
[12] F. N. Najm, “A survey of power estimation techniques in vlsi circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 446–455, Dec 1994.
[13] M. J. Flynn, “Some computer organizations and their effectiveness,” IEEE Trans. Comput., vol. 21, no. 9, pp. 948–960, Sep. 1972. [Online]. Available: http://dx.doi.org/10.1109/TC.1972.5009071
[14] M. D. Hill and M. R. Marty, “Amdahl’s law in the multicore era,” Computer, vol. 41, no. 7, pp. 33–38, Jul. 2008. [Online]. Available: http://dx.doi.org/10.1109/MC.2008.209
[15] J. L. Gustafson, “Reevaluating amdahl’s law,” Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988. [Online]. Available: http://doi.acm.org/10.1145/42411.42415
[16] N. J. Gunther, “A new interpretation of amdahl’s law and geometric scalability,” https://arxiv.org/abs/cs/0210017, accessed: 2017-07-19.
[17] A. H. Karp and H. P. Flatt, “Measuring parallel processor performance,” Commun. ACM, vol. 33, no. 5, pp. 539–543, May 1990. [Online]. Available: http://doi.acm.org/10.1145/78607.78614
[18] James Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly Media, July 2007.
[19] A. Vajda, Programming Many-Core Chips, 1st ed. Springer Publishing Company, Incorporated, 2011.
[20] M. Frigo, C. E. Leiserson, and K. H. Randall, “The implementation of the cilk-5 multithreaded language,” SIGPLAN Not., vol. 33, no. 5, pp. 212–223, May 1998. [Online]. Available: http://doi.acm.org/10.1145/277652.277725
[21] Barbara Chapman, Gabriele Jost, and Ruud van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2007.
[22] “The OpenMP API specification for parallel programming,” http://www.openmp.org/, accessed: 2017-07-22.
[23] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-passing Interface. Cambridge, MA, USA: MIT Press, 1994.
[24] Jason Sanders and Edward Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, July 2010.
[25] “Cuda zone,” https://developer.nvidia.com/cuda-zone, accessed: 2017-07-23.
[26] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe, “Petabricks: A language and compiler for algorithmic choice,” in Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’09. New York, NY, USA: ACM, 2009, pp. 38–49. [Online]. Available: http://doi.acm.org/10.1145/1542476.1542481
[27] “Petabricks,” http://projects.csail.mit.edu/petabricks/, accessed: 2017-07-23.
[28] “Streamit,” http://groups.csail.mit.edu/cag/streamit/, accessed: 2017-07-23.
[29] “All Programmable SoC with Hardware and Software Programmability,” accessed: 2017-07-20. [Online]. Available: https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html
[30] “World’s first 1,000-processor chip,” https://www.ucdavis.edu/news/worlds-first-1000-processor-chip/, accessed: 2017-07-20.
[31] V. Nollet, P. Avasare, H. Eeckhaut, D. Verkest, and H. Corporaal, “Run-time management of a mpsoc containing fpga fabric tiles,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 1, pp. 24–33, Jan 2008.
[32] J. A. de Oliveira and H. van Antwerpen, The Philips Nexperia Digital Video Platform. Boston, MA: Springer US, 2003, pp. 67–96.
[33] “Many-core fabricated chips information page,” http://vcl.ece.ucdavis.edu/misc/many-core.html, accessed: 2017-07-19.
[34] “Intel Core i9-7900X X-series Processor,” accessed: 2017-07-23. [Online]. Available: http://ark.intel.com/products/123613/Intel-Core-i9-7900X-X-series-Processor-13-75M-Cache-up-to-4_30-GHz
[35] B. Bohnenstiehl, A. Stillmaker, J. Pimentel, T. Andreas, B. Liu, A. Tran, E. Adeagbo, and B. Baas, “A 5.8 pj/op 115 billion ops/sec, to 1.78 trillion ops/sec 32nm 1000-processor array,” in 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), June 2016, pp. 1–2.
[36] “AMD Radeon R9 Series Gaming Graphics Cards with High-Bandwidth Memory,” accessed: 2017-07-23. [Online]. Available: http://www.amd.com/en-us/products/graphics/desktop/
[37] “Intel Xeon Phi Processors,” accessed: 2017-07-23. [Online]. Available: https://www.intel.com/content/www/us/en/support/processors.html
[38] “TILE-Gx72 Processor Product Brief,” accessed: 2017-07-23. [Online]. Available: http://www.mellanox.com/related-docs/prod_multi_core/PB_TILE-Gx72.pdf
[39] “A clustered manycore processor architecture for embedded and accelerated applications,” accessed: 2017-07-23. [Online]. Available: http://ieee-hpec.org/2013/index_htm_files/44.pdf
[40] M. B. Healy et al., “Design and analysis of 3d-maps: A many-core 3d processor with stacked memory,” in IEEE Custom Integrated Circuits Conference 2010, Sept 2010, pp. 1–4.
[41] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, “Power7: Ibm’s next-generation server processor,” IEEE Micro, vol. 30, no. 2, pp. 7–15, March 2010.
[42] “QorIQ P4080 Communications Processor Product Brief,” accessed: 2017-07-23. [Online]. Available: http://cache.freescale.com/files/32bit/doc/prod_brief/P4080PB.pdf
[43] M. Butts, “Synchronization through communication in a massively parallel processor array,” IEEE Micro, vol. 27, no. 5, pp. 32–40, Sept 2007.
[44] “A 167-processor 65 nm computational platform with per-processor dynamic supply voltage and dynamic clock frequency scaling,” accessed: 2017-07-23. [Online]. Available: http://vcl.ece.ucdavis.edu/pubs/2008.06.symp.vlsi/
[45] “Nvidia tesla: a unified graphics and computing architecture,” http://www.serc.iisc.ernet.in/~vss/courses/PPP.old/GPU/tesla-architecture-ieeemicro.pdf, accessed: 2017-07-23.
[46] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C. C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, “Tile64 - processor: A 64-core soc with mesh interconnect,” in 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, Feb 2008, pp. 88–598.
[47] D. C. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P. M. Harvey, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluszny, M. Riley, D. L. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa, “Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor,” IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 179–196, Jan 2006.
[48] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, “The raw microprocessor: a computational fabric for software circuits and general-purpose programs,” IEEE Micro, vol. 22, no. 2, pp. 25–35, Mar 2002.
[49] S. R. Vangal et al., “An 80-tile sub-100-w teraflops processor in 65-nm cmos,” IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 29–41, Jan 2008.
[50] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C. C. Miao, J. F. B. III, and A. Agarwal, “On-chip interconnection architecture of the tile processor,” IEEE Micro, vol. 27, no. 5, pp. 15–31, Sept 2007.
[51] C. Ramey, “Tile-gx100 manycore processor: Acceleration interfaces and architecture,” in 2011 IEEE Hot Chips 23 Symposium (HCS), Aug 2011, pp. 1–21.
[52] S.-J. Chen, Y.-C. Lan, W.-C. Tsai, and Y.-H. Hu, Communication Centric Design. New York, NY: Springer New York, 2012, pp. 3–13.
[53] R. Ho, K. W. Mai, and M. A. Horowitz, “The future of wires,” Proceedings of the IEEE, vol. 89, no. 4, pp. 490–504, Apr 2001.
[54] W. J. Dally and B. Towles, “Route packets, not wires: On-chip inteconnection networks,” in Proceedings of the 38th Annual Design Automation Conference, ser. DAC ’01. New York, NY, USA: ACM, 2001, pp. 684–689. [Online]. Available: http://doi.acm.org/10.1145/378239.379048
[55] Sheng Ma and others, Networks-on-Chip: From Implementations to Programming Paradigms, 1st ed. Springer, 2015.
[56] S. Damaraju, V. George, S. Jahagirdar, T. Khondker, R. Milstrey, S. Sarkar, S. Siers, I. Stolero, and A. Subbiah, “A 22nm ia multi-cpu and gpu system-on-chip,” in 2012 IEEE International Solid-State Circuits Conference, Feb 2012, pp. 56–57.
[57] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, “Introduction to the cell multiprocessor,” IBM J. Res. Dev., vol. 49, no. 4/5, pp. 589–604, Jul. 2005. [Online]. Available: http://dl.acm.org/citation.cfm?id=1148882.1148891
[58] Intel Xeon Phi Coprocessor, Datasheet, Intel, 2013, rev. 002.
[59] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivakumar, S. W. Keckler, and D. Burger, “On-chip interconnection networks of the trips chip,” IEEE Micro, vol. 27, no. 5, pp. 41–50, Sept 2007.
[60] S. Vangal et al., “An 80-tile 1.28tflops network-on-chip in 65nm cmos,” in 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, Feb 2007, pp. 98–589.
[61] D. Wentzlaff et al., “On-chip interconnection architecture of the tile processor,” IEEE Micro, vol. 27, no. 5, pp. 15–31, Sept 2007.
[62] A. K. Mishra, N. Vijaykrishnan, and C. R. Das, “A case for heterogeneous on-chip interconnects for cmps,” in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011, pp. 389–400. [Online]. Available: http://doi.acm.org/10.1145/2000064.2000111
[63] William Dally and Brian Towles, Principles and Practices of Interconnection Networks, 1st ed. Elsevier, December 2003.
[64] M. H. Cho, M. Lis, K. S. Shim, M. Kinsy, T. Wen, and S. Devadas, “Oblivious routing in on-chip bandwidth-adaptive networks,” in 2009 18th International Conference on Parallel Architectures and Compilation Techniques, Sept 2009, pp. 181–190.
[65] L. Chen, R. Wang, and T. M. Pinkston, “Critical bubble scheme: An efficient implementation of globally aware network flow control,” in 2011 IEEE International Parallel Distributed Processing Symposium, May 2011, pp. 592–603.
[66] Y. Ajima, S. Sumimoto, and T. Shimizu, “Tofu: A 6d mesh/torus interconnect for exascale computers,” Computer, vol. 42, no. 11, pp. 36–40, Nov 2009.
[67] N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. Giampapa, P. Heidelberger, S. Singh, B. D. Steinmacher-Burow, T. Takken, M. Tsao, and P. Vranas, “Blue gene/l torus interconnection network,” IBM J. Res. Dev., vol. 49, no. 2, pp. 265–276, Mar. 2005. [Online]. Available: http://dx.doi.org/10.1147/rd.492.0265
[68] D. Chen et al., “The ibm blue gene/q interconnection fabric,” IEEE Micro, vol. 32, no. 1, pp. 32–43, Jan. 2012. [Online]. Available: http://dx.doi.org/10.1109/MM.2011.96
[69] S. L. Scott et al., “The cray t3e network: Adaptive routing in a high performance 3d torus,” 1996.
[70] “SpiNNaKer Chip,” accessed: 2017-07-31. [Online]. Available: http://apt.cs.manchester.ac.uk/projects/SpiNNaker/SpiNNchip/
[71] S. S. B. M. A. Gaikwad, “A comparative study of different topologies for network-on-chip architecture,” International Journal of Computer Applications, March 2013.
[72] J. Balfour and W. J. Dally, “Design tradeoffs for tiled cmp on-chip networks,” in Proceedings of the 20th Annual International Conference on Supercomputing, ser. ICS ’06. New York, NY, USA: ACM, 2006, pp. 187–198. [Online]. Available: http://doi.acm.org/10.1145/1183401.1183430
[73] P. Kumar, Y. Pan, J. Kim, G. Memik, and A. Choudhary, “Exploring concentration and channel slicing in on-chip network router,” in 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip, May 2009, pp. 276–285.
[74] A. K. Mishra, N. Vijaykrishnan, and C. R. Das, “A case for heterogeneous on-chip interconnects for cmps,” in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011, pp. 389–400. [Online]. Available: http://doi.acm.org/10.1145/2000064.2000111
[75] B. Zafar, J. Draper, and T. M. Pinkston, “Cubic ring networks: A polymorphic topology for network-on-chip,” in 2010 39th International Conference on Parallel Processing, Sept 2010, pp. 443–452.
[76] L. Chen and T. M. Pinkston, “Nord: Node-router decoupling for effective power-gating of on-chip routers,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2012, pp. 270–281.
[77] L. Huang, Z. Wang, and N. Xiao, “Vbon: Toward efficient on-chip networks via hierarchical virtual bus,” Microprocess. Microsyst., vol. 37, no. 8, pp. 915–928, Nov. 2013. [Online]. Available: http://dx.doi.org/10.1016/j.micpro.2012.06.013
[78] W. J. Dally and C. L. Seitz, “Interconnection networks for high-performance parallel computers,” I. D. Scherson and A. S. Youssef, Eds. Los Alamitos, CA, USA: IEEE Computer Society Press, 1994, ch. Deadlock-free Message Routing in Multiprocessor Interconnection Networks, pp. 345–351. [Online]. Available: http://dl.acm.org/citation.cfm?id=201173.201227
[79] J. Duato, “A new theory of deadlock-free adaptive routing in wormhole networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 12, pp. 1320–1331, Dec 1993.
[80] G.-M. Chiu, “The odd-even turn model for adaptive routing,” IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 7, pp. 729–738, 2000.
[81] B. Fu, Y. Han, J. Ma, H. Li, and X. Li, “An abacus turn model for time/space-efficient reconfigurable routing,” in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011, pp. 259–270. [Online]. Available: http://doi.acm.org/10.1145/2000064.2000096
[82] J. Duato, “A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 10, pp. 1055–1067, Oct 1995.
[83] S. Ma, N. D. E. Jerger, and Z. Wang, “Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip,” in HPCA, 2012.
[84] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002.
[85] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003.
[86] P. Kermani, “Virtual cut-through: A new computer communication switching technique,” 1979.
[87] W. J. Dally and C. L. Seitz, “The torus routing chip,” Distributed Computing, pp. 187–196, 1986.
[88] T. Moscibroda and O. Mutlu, “A case for bufferless routing in on-chip networks,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA ’09. New York, NY, USA: ACM, 2009, pp. 196–207. [Online]. Available: http://doi.acm.org/10.1145/1555754.1555781
[89] M. Hayenga, N. E. Jerger, and M. Lipasti, “Scarab: A single cycle adaptive routing and bufferless network,” in 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2009, pp. 244–254.
[90] S. A. R. Jafri, Y. J. Hong, M. Thottethodi, and T. N. Vijaykumar, “Adaptive flow control for robust performance and energy,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2010, pp. 433–444.
[91] V. Puente, R. Beivide, J. A. Gregorio, J. M. Prellezo, J. Duato, and C. Izu, “Adaptive bubble router: a design to improve performance in torus networks,” in Proceedings of the 1999 International Conference on Parallel Processing, 1999, pp. 58–67.
[92] S. Ma, Z. Wang, Z. Liu, and N. E. Jerger, “Leaving one slot empty: Flit bubble flow control for torus cache-coherent nocs,” IEEE Transactions on Computers, vol. 64, no. 3, pp. 763–777, March 2015.
[93] “Imec unveils tools to speed design of energy-efficient multi-processor soc platforms,” http://embedded-computing.com/news/imec-multi-processor-soc-platforms/, accessed: 2017-07-20.
[94] J. Ceng et al., “Maps: An integrated framework for mpsoc application parallelization,” in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 754–759.
[95] D. Cordes, O. Neugebauer, M. Engel, and P. Marwedel, “Automatic extraction of task-level parallelism for heterogeneous mpsocs,” in 2013 42nd International Conference on Parallel Processing, Oct 2013, pp. 950–959.
[96] R. P. Dick, D. L. Rhodes, and W. Wolf, “Tgff: task graphs for free,” in Hardware/Software Codesign, 1998. (CODES/CASHE ’98) Proceedings of the Sixth International Workshop on, Mar 1998, pp. 97–101.
[97] S. Stuijk, M. Geilen, and T. Basten, “Sdf3: Sdf for free,” in Sixth International Conference on Application of Concurrency to System Design (ACSD’06), June 2006, pp. 276–278.
[98] J. Hu and R. Marculescu, “Energy- and performance-aware mapping for regular noc architectures,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 4, pp. 551–562, April 2005.
[99] C. Marcon, A. Borin, A. Susin, L. Carro, and F. Wagner, “Time and energy efficient mapping of embedded applications onto nocs,” in Proceedings of the ASP-DAC 2005. Asia and South Pacific Design Automation Conference, 2005., vol. 1, Jan 2005, pp. 33–38 Vol. 1.
[100] G. Jiang, Z. Li, F. Wang, and S. Wei, “Mapping of embedded applications on hybrid networks-on-chip with multiple switching mechanisms,” IEEE Embedded Systems Letters, vol. 7, no. 2, pp. 59–62, June 2015.
[101] T. Lei and S. Kumar, “Algorithms and tools for network on chip based system design,” in 16th Symposium on Integrated Circuits and Systems Design, 2003. SBCCI 2003. Proceedings., Sept 2003, pp. 163–168.
[102] D. Wu, B. M. Al-Hashimi, and P. Eles, “Scheduling and mapping of conditional task graph for the synthesis of low power embedded systems,” IEE Proceedings - Computers and Digital Techniques, vol. 150, no. 5, pp. 262–73, Sept 2003.
[103] P. Zipf, G. Sassatelli, N. Utlu, N. Saint-Jean, P. Benoit, and M. Glesner, “A decentralised task mapping approach for homogeneous multiprocessor network-on-chips,” Int. J. Reconfig. Comput., vol. 2009, pp. 3:1–3:14, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1155/2009/453970
[104] E. W. Brião, D. Barcelos, and F. R. Wagner, “Dynamic task allocation strategies in mpsoc for soft real-time applications,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE
’08. New York, NY, USA: ACM, 2008, pp. 1386–1389. [Online]. Available: http://doi.acm.org/10.1145/1403375.1403709
[105] W. Quan and A. D. Pimentel, “A hybrid task mapping algorithm for heterogeneous mpsocs,” ACM Trans. Embed. Comput. Syst., vol. 14, no. 1, pp. 14:1–14:25, Jan. 2015. [Online]. Available: http://doi.acm.org/10.1145/2680542
[106] P. K. F. Hölzenspies, J. L. Hurink, J. Kuper, and G. J. M. Smit, “Run-time spatial mapping of streaming applications to a heterogeneous multi-processor system-on-chip (mpsoc),” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’08. New York, NY, USA: ACM, 2008, pp. 212–217. [Online]. Available: http://doi.acm.org/10.1145/1403375.1403427
[107] M. Mandelli, A. Amory, L. Ost, and F. G. Moraes, “Multi-task dynamic mapping onto noc-based mpsocs,” in Proceedings of the 24th Symposium on Integrated Circuits and Systems Design, ser. SBCCI ’11. New York, NY, USA: ACM, 2011, pp. 191–196. [Online]. Available: http://doi.acm.org/10.1145/2020876.2020920
[108] M. Mandelli, G. Castilhos, G. Sassatelli, L. Ost, and F. G. Moraes, “A distributed energy-aware task mapping to achieve thermal balancing and improve reliability of many-core systems,” in Proceedings of the 28th Symposium on Integrated Circuits and Systems Design, ser. SBCCI ’15. New York, NY, USA: ACM, 2015, pp. 13:1–13:7. [Online]. Available: http://doi.acm.org/10.1145/2800986.2800992
[109] C. Constantinescu, “Trends and challenges in vlsi circuit reliability,” IEEE Micro, vol. 23, no. 4, pp. 14–19, July 2003.
[110] A. Ejlali, B. M. Al-Hashimi, P. Rosinger, and S. G. Miremadi, “Joint consideration of fault-tolerance, energy-efficiency and performance in on-chip networks,” in 2007 Design, Automation Test in Europe Conference Exhibition, April 2007, pp. 1–6.
[111] D. Bertozzi, L. Benini, and G. D. Micheli, “Error control schemes for on-chip communication links: the energy-reliability tradeoff,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 6, pp. 818–831, June 2005.
[112] M. Radetzki, C. Feng, X. Zhao, and A. Jantsch, “Methods for fault tolerance in networks-on-chip,” ACM Comput. Surv., vol. 46, no. 1, pp. 8:1–8:38, Jul. 2013. [Online]. Available: http://doi.acm.org/10.1145/2522968.2522976
[113] S. Murali, T. Theocharides, N. Vijaykrishnan, M. J. Irwin, L. Benini, and G. D. Micheli, “Analysis of error recovery schemes for networks on chips,” IEEE Design Test of Computers, vol. 22, no. 5, pp. 434–442, Sept 2005.
[114] J. M. Montanana, D. de Andres, and F. Tirado, “Fault tolerance on nocs,” in 2013 27th International Conference on Advanced Information Networking and Applications Workshops, March 2013, pp. 138–143.
[115] T. Lehtonen, P. Liljeberg, and J. Plosila, “Analysis of forward error correction methods for nanoscale networks-on-chip,” in Proceedings of the 2nd International Conference on Nano-Networks, ser. Nano-Net ’07. ICST, Brussels, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2007, pp. 3:1–3:5. [Online]. Available: http://dl.acm.org/citation.cfm?id=1459290.1459295
[116] D. Rossi, P. Angelini, and C. Metra, “Configurable error control scheme for noc signal integrity,” in 13th IEEE International On-Line Testing Symposium (IOLTS 2007), July 2007, pp. 43–48.
[117] L. Fiorin, L. Micconi, and M. Sami, “Design of fault tolerant network interfaces for nocs,” in 2011 14th Euromicro Conference on Digital System Design, Aug 2011, pp. 393–400.
[118] V. Rantala, T. Lehtonen, P. Liljeberg, and J. Plosila, “Multi network interface architectures for fault tolerant network-on-chip,” in 2009 International Symposium on Signals, Circuits and Systems, July 2009, pp. 1–4.
[119] L. Fiorin and M. Sami, “Fault-tolerant network interfaces for networks-on-chip,” IEEE Transactions on Dependable and Secure Computing, vol. 11, no. 1, pp. 16–29, Jan 2014.
[120] S. Pasricha, Y. Zou, D. Connors, and H. J. Siegel, “Oe+ioe: A novel turn model based fault tolerant routing scheme for networks-on-chip,” in Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, ser. CODES/ISSS ’10. New York, NY, USA: ACM, 2010, pp. 85–94. [Online]. Available: http://doi.acm.org/10.1145/1878961.1878979
[121] A. Patooghy and S. G. Miremadi, “Xyx: A power & performance efficient fault-tolerant routing algorithm for network on chip,” in 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, Feb 2009, pp. 245–251.
[122] S. Murali, D. Atienza, L. Benini, and G. D. Micheli, “A multi-path routing strategy with guaranteed in-order packet delivery and fault-tolerance for networks on chip,” in 2006 43rd ACM/IEEE Design Automation Conference, July 2006, pp. 845–848.
[123] J. Raik, R. Ubar, and V. Govind, “Test configurations for diagnosing faulty links in noc switches,” in 12th IEEE European Test Symposium (ETS’07), May 2007, pp. 29–34.
[124] M. R. Kakoee, V. Bertacco, and L. Benini, “A distributed and topology-agnostic approach for on-line noc testing,” in Proceedings of the Fifth ACM/IEEE International Symposium, May 2011, pp. 113–120.
[125] C. Bobda, A. Ahmadinia, M. Majer, J. Teich, S. Fekete, and J. van der Veen, “Dynoc: A dynamic infrastructure for communication in dynamically reconfugurable devices,” in International Conference on Field Programmable Logic and Applications, 2005., Aug 2005, pp. 153–158.
[126] T. F. Pereira, D. R. de Melo, E. A. Bezerra, and C. A. Zeferino, “Mechanisms to provide fault tolerance to a network-on-chip,” IEEE Latin America Transactions, vol. 15, no. 6, pp. 1034–1042, June 2017.
[127] M. Ebrahimi, M. Daneshtalab, J. Plosila, and H. Tenhunen, “Mafa: Adaptive fault-tolerant routing algorithm for networks-on-chip,” in 2012 15th Euromicro Conference on Digital System Design, Sept 2012, pp. 201–207.
[128] Z. Zhang, A. Greiner, and S. Taktak, “A reconfigurable routing algorithm for a fault-tolerant 2d-mesh network-on-chip,” in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 441–446.
[129] H. K. Hsin, E. J. Chang, C. A. Lin, and A. Y. Wu, “Ant colony optimization-based fault-aware routing in mesh-based network-on-chip systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 11, pp. 1693–1705, Nov 2014.
[130] Alpha and Omega Semiconductor, Power Semiconductor Reliability Handbook, Sunnyvale, CA 94085, U.S.A., 2010.
[131] M. Cuviello, S. Dey, X. Bai, and Y. Zhao, “Fault modeling and simulation for crosstalk in system-on-chip interconnects,” in 1999 IEEE/ACM International Conference on Computer-Aided Design. Digest of Technical Papers (Cat. No.99CH37051), Nov 1999, pp. 297–303.
[132] J. Keane and C. H. Kim, “An odometer for cpus,” IEEE Spectrum, vol. 48, no. 5, pp. 28–33, May 2011.