
Evolutionary optimisation of streaming application on many-core systems
Ali Abuassal∗
∗ Department of Electronic Engineering, The University of York, York, UK
[email protected]

Abstract—The high performance requirements of today's technologies can be satisfied by relying on many-core systems. However, how these architectures can be utilized efficiently is still questionable. This paper presents a literature review on parallel computing in general and on many-core systems in particular, including their architectures, topologies, how applications are mapped onto them, and fault tolerance. Mechanisms and methodologies proposed in the literature to achieve high-performance and reliable systems, and thus the maximum benefits of these advanced architectures, are also reviewed.

Index Terms—Transistor scaling, parallel computation, many-core systems, network-on-chip, fault-tolerance, FPGA.

Fig. 1. Example of a many-core system based on a 2D mesh topology.

I. INTRODUCTION

A. Background

The scaling of semiconductor technology has led to billions of transistors on a single chip. As Moore predicted, the number of transistors on a single chip has doubled every two years [1]. This increase in chip density, however, seems to have reached its end due to physical limitations, particularly in terms of frequency and the difficulty of dissipating heat. Many-core systems (systems consisting of large numbers of processor units operating in parallel, using only local clocks for synchronization) are generally seen as an opportunity to increase computational power within these physical constraints. By moving to many-core, issues with clock speeds can be avoided by effectively operating in a GALS (globally asynchronous locally synchronous) regime, while heat can be better managed since processors in the system will run at lower frequencies (or sometimes be stopped if required).

Multiple-core processors have been available for almost two decades. The IBM POWER4 processor was released in 2001 as the first general-purpose processor to feature multiple processing cores on the same silicon die. Since then, multi-core processors have become the norm and are nowadays seen as the most effective way to improve the performance of high-end processors [2] [3]. Current hardware roadmaps call for doubling the number of on-chip cores approximately every two years [4].

As the number of cores increases, however, a paradigm shift from multi-core to many-core is required. Unlike multi-core, many-core implies parallelism both at the architecture and at the application level, with multiple cores executing tasks belonging to the same application.

Many-core systems are designed specifically for a high degree of parallel processing. In order to deal with the bandwidth requirements of inter-processor communication, Networks on Chip (NoCs) have been introduced by [5] to provide a power-efficient, scalable and reliable on-chip communication paradigm.

However, to exploit the power of many-core systems, a design shift is necessary. Fig. 1 depicts an overview of a 2D mesh topology (by far the most common topology used in many-core systems) of m by n cores (nodes). According to Amdahl's law, the speed of processing can be increased if and only if the application can be fully or partially parallelized. Obtaining the maximum benefits from many-core systems depends on being able to define applications as conglomerations of many parallelizable tasks, which can be distributed (or mapped) across the cores [6]. Streaming applications are an example of such applications; they consist of directed acyclic graphs of tasks, where each task is fully executed locally before the results are passed to the next processing node(s). Fig. 2 illustrates an example of a streaming application graph. Streaming applications include several multimedia, signal processing, and/or image processing applications. These applications are easier to develop and more efficient in using the resources of a parallel architecture.

This research is funded by the Department of Electronic Engineering at the University of York.
Fig. 2. Distributed application graph created using the XL-STaGe tool [7].

B. Motivation

Broadly speaking, many-core architectures should allow performance to increase linearly with the number of cores. While this idea sounds rather straightforward, there are major obstacles to its implementation. In order to obtain the maximum benefits from processors, on one side the hardware architecture has to provide improved computational power, while on the other side software developers need to take maximum advantage of the new hardware architecture by developing algorithms that fit it.

In this context, there are many questions to be answered. For instance, on the hardware side, how will the cores be connected and how will they communicate with each other? At the software level, how are these resources going to be managed with low overhead, and how can an application efficiently be mapped on the platform? Also, at the application level, what kind of applications and algorithms are suited to the architecture? [4]. In other words, as Bill Gates put it, "Multicore: This is the one which will have the biggest impact on us. We have never had a problem to solve like this. A breakthrough is needed in how applications are done on multicore devices" [8].

Mapping an application on a many-core system is known to be an NP-hard problem. Therefore, heuristics based on application knowledge must be considered to achieve close-to-optimal solutions [3]. Mapping must also take into account user requirements (quality of service and power constraints) for each application.

In the architectural model that will be used for this research, to handle the mapping process one core is dedicated as a central manager (monitor) to oversee the system at runtime. For a small number of cores one monitor could be enough, but as the number of cores increases, distributed monitoring (which divides the system into monitoring regions) will be necessary. When a distributed monitoring method is used, the platform is divided into smaller regions (monitoring areas) and communication between monitors must be considered.

In conclusion, the complexity of mapping an application increases rapidly with the number of cores, and this represents a crucial constraint when utilizing a many-core system with thousands or millions of cores. Some of the key issues involved in managing how an application is mapped on a many-core system include:

Scalability: Algorithms and methodologies for task mapping must be able to work efficiently for an arbitrarily large number of cores.

Dynamic workloads: In many cases, application requirements change according to runtime parameters, which implies the need to design adaptable algorithms able to change the mapping on-line to meet new scenarios (which might include new applications being launched as well as dynamic requirements of existing ones).

Reliability: It is generally accepted that next-generation technologies will be more susceptible to faults (at fabrication as well as at runtime, due to radiation and/or ageing effects), making the design of mechanisms to improve system reliability a necessity for many-core devices.

C. Aim of the Research

The main goal of this research is to develop scalable mechanisms and methodologies for adaptive on-line mapping of application(s) on many-core systems with thousands or millions of processing units (cores). The research will focus on performance, power consumption, and/or fault-tolerance as targets for optimization. The project will investigate the implementation of hierarchical multiple monitors. Monitoring areas with dynamic size will also be explored. The performance of this approach will be evaluated using the platform developed in the Graceful project.

It should be noted that this research will NOT involve:

• The design of a new NoC architecture (but the existing architecture will be modified, as will the software).
• Attempting to address software deadlocks in parallel processing.
• The design of a new programming language and/or compiler.
• The development of a new fault model (faults will be addressed in order to deal with the errors).

The following sections present a short background literature review of some of the key areas involved in the proposed research.

II. LITERATURE REVIEW

A. Transistor scaling

In the last half century, the field of computing has undergone vast and rapid changes, with substantial evolution of technology, hardware architecture and usage of the systems. Even with all these changes, the long-term evolution of performance seems to be steady and continuous, following Moore's Law rather closely [9].
Fig. 3. Microprocessor trends data [10].

Fig. 4. Diagram comparing the classification of processors according to Flynn (SISD, SIMD, MISD, and MIMD instruction/task and data pools).

The MOS transistor has been the workhorse of digital electronics for decades, scaling in performance by nearly five orders of magnitude and providing the foundation for today's unprecedented computational performance. The basic recipe for technology scaling was laid down by Robert N. Dennard in the early 1970s and has been followed for the past three decades. This scaling recipe calls for reducing transistor dimensions by 30% every generation (two years) and keeping electric fields constant everywhere in the transistor to maintain reliability.

Classical transistor scaling provided three major benefits that have made possible rapid growth in computation performance: first, as transistor dimensions are scaled by 30% (0.7x), their area shrinks by 50%, doubling the transistor density every technology generation; second, as the transistor is scaled, its performance (switching speed) increases by about 40%, providing higher system performance; third, to keep the electric field constant, supply voltage is reduced by 30%, reducing energy by 65%. Consequently, in every technology generation transistor integration doubles, circuits are 40% faster, and system power consumption (with twice as many transistors) stays the same.

This scaling enabled a three-orders-of-magnitude increase in microprocessor performance over the past 20 years. Chip architects exploited transistor density to create complex architectures and transistor speed to increase frequency, all within a "reasonable" power and energy envelope [11].

B. Trends in Microprocessors

It is generally accepted that the scaling of transistors described in the previous section can not continue at the same rate, due to physical limitations: the power wall, caused by the effects on the scaling of clock speeds and the difficulty of dissipating on-chip heat. A memory wall also occurs because the gap between memory access time and processing speed is growing larger.

Fig. 3 (plotted by Horowitz et al.) shows the trends in scaling compared to frequency, power consumption, and number of cores. It can be seen that, even though the number of transistors has increased, the frequency has stayed almost level since 2005.

The use of multiple processing units seems to be an appealing solution for the aforementioned problems, and the trend toward multiple-core processing chips is now well established; however, new architectural approaches have started to emerge in recent years. For example, multiple power-efficient processors combined with hardware accelerators are becoming the preferred choice for many designers to deliver the best trade-off between performance and power consumption [12]. This approach implies spreading the application tasks over multiple processing elements where: (1) each processing element can be individually turned on or off, thereby saving power; (2) each processing element can run at its own optimized supply voltage and frequency; (3) it is easier to achieve load balance among processor cores and to distribute heat across the die; and (4) it is possible to produce lower die temperatures and improve reliability and leakage.

An old classification of processor architectures (Fig. 4) by Michael J. Flynn [13], based on instructions, can be applied to today's multiprocessing systems by replacing instruction by task. According to Flynn, there are four types of processor architectures: single instruction single data stream (SISD); single instruction multiple data stream (SIMD); multiple instruction single data stream (MISD); and multiple instruction multiple data stream (MIMD). While systems in the MISD category are rarely built, the other three architectures are common and can be distinguished simply by the differences in their respective instruction cycles:

In a SISD architecture, there is a single instruction cycle; operands are fetched in serial fashion into a single processing unit before execution.
A SIMD architecture is like SISD in terms of the instruction cycle, but it has multiple sets of operands that may be fetched to multiple processing units and operated upon simultaneously within a single instruction cycle.

MIMD architectures have multiple instruction cycles that can be active at any given time, each independently fetching instructions and operands into multiple processing units and operating on them in a concurrent fashion. This category includes multiple-processor systems in which each processor has its own program control, rather than sharing a single control unit.

C. Fundamental Issues in Parallelism

The idea of parallel computing dates back over four decades. Many of the basic parallel algorithms, and especially the fundamental concepts and laws of parallelism, were defined in the last century. An analysis of the many-core systems field would not be complete without a survey of the fundamental laws of parallelism.

Amdahl's Law is the most commonly cited law in parallelism, defining the limits of speed that can theoretically be gained when an application is parallelized. Eq. 1 shows Amdahl's law, with Slatency(s) being the theoretical speed-up of the execution of the task, and n and p the number of cores and the proportion of the program that can be parallelized, respectively.

Slatency(s) = 1 / ((1 - p) + p/n)    (1)

Amdahl's Law applicability to many-core systems was explored for the first time by [14]. In this study, three scenarios of multi-core system were considered in order to analyse the maximum achievable speed, under the same assumptions as in the original form of Amdahl's Law. The three scenarios were:

• Symmetric multi-cores, consisting of two or more identical cores.
• Asymmetric multi-cores, where the number of cores is reduced and one powerful core is added to deal with the unparallelized part.
• Dynamic multi-cores, in which the hardware can either act as a large pool of simple cores (when the parallel part is executed) or as a single fast core, when executing the serial part.

In the symmetric multi-cores, the results were equivalent to the original scenario outlined by Amdahl's law, while the results for the other two scenarios were more interesting. In the asymmetric (single asymmetric) multi-cores, the speed-up is more than that expected by Amdahl's classical law. However, different applications may have various speed-ups due to differences in the level of parallelism. For this reason the third scenario, fully dynamic multi-cores, achieves speed-ups that are greater than (and never worse than) the asymmetric case. However, it is not yet clear how to build such a dynamic chip.

Gustafson's Law [15], published in 1988, states that the speed-up when a multi-core processor is used is a function of the number of cores N and the sequential portion of the application np. The law is defined by Eq. 2.

Speedup(s) = N - (N - 1) * np    (2)

In fact, both Amdahl's and Gustafson's laws are fundamentally equivalent, as both describe how the performance of the parallel section of computation depends on the nature of the application.

Gunther's Law [16], also known as the Universal Scalability Law, is an empirical formula based on real measurement data. It aims at modelling performance scalability considering the impact of synchronization and other delay sources related to keeping a coherent state of the system, thus achieving a more accurate model of the scalability curve. The mathematical formula can be seen in Eq. 3.

C(N, s, k) = N / (1 + s * (N - 1) + k * N * (N - 1))    (3)

Where N is the number of cores, s is the sequential portion of an application, and k is a parameter modelling coherence (the latency for data to become consistent) in the system. Gunther's Law can be seen as Amdahl's law considering the overhead cost of synchronization among multiple cores. This law is capable of modelling the retrograde performance phenomena that are usually observed in practice, where many applications stop gaining performance after exceeding a certain number of cores.

The Karp-Flatt Metric [17] represents a practical method to estimate how well an application can be scaled over a number of processor units. It can be used in addition to the aforementioned laws because it determines the serial portion np of an application. Eq. 4 illustrates the formula.

np = (1/S - 1/N) / (1 - 1/N)    (4)

Where np is the estimated non-parallelized (serial) portion of the application and S is the measured speed-up on N cores (basically, the smaller the value of np, the higher the performance).
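As a concrete illustration of how these laws differ, the short C program below (an illustrative sketch, not taken from the cited works) evaluates the predictions of Eqs. 1-4 for a hypothetical application whose serial portion is 5%; the coherence parameter k for Gunther's law is likewise an assumed value.

#include <stdio.h>

/* Predicted speed-ups for a program with parallelizable fraction p on n cores.
 * All parameter values below (p, np, k) are illustrative only. */
static double amdahl(double p, double n)     { return 1.0 / ((1.0 - p) + p / n); }  /* Eq. 1 */
static double gustafson(double np, double n) { return n - (n - 1.0) * np; }         /* Eq. 2 */
static double gunther(double s, double k, double n)                                 /* Eq. 3 */
{ return n / (1.0 + s * (n - 1.0) + k * n * (n - 1.0)); }
static double karp_flatt(double speedup, double n)                                  /* Eq. 4 */
{ return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n); }

int main(void)
{
    const double p  = 0.95;    /* parallelizable fraction (assumed)          */
    const double np = 0.05;    /* serial fraction                            */
    const double k  = 0.0001;  /* coherence cost for Gunther's law (assumed) */

    for (int n = 2; n <= 1024; n *= 2) {
        double s = amdahl(p, n);
        printf("n=%4d  Amdahl=%6.2f  Gustafson=%7.2f  Gunther=%6.2f  Karp-Flatt np=%.3f\n",
               n, s, gustafson(np, n), gunther(np, k, n), karp_flatt(s, n));
    }
    return 0;
}

Running this sketch shows the qualitative behaviour described above: Amdahl's curve saturates near 1/np = 20, Gustafson's grows roughly linearly, and Gunther's eventually turns downwards, matching the retrograde behaviour it is meant to capture.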
D. Parallelism and Performance

The introduction of many-core technologies has brought to the fore several of the fundamental issues of parallelism. As software, hardware, and applications have evolved, there is a real demand to efficiently run multiple tasks to take advantage of the available hardware capabilities in many-core based systems. Therefore, an appropriate many-core programming model must be developed, and it must exploit all kinds of available parallelism.

Broadly speaking, there are different types of parallelism that a programmer can exploit:

Bit-Level Parallelism (BLP) extends the hardware architecture to work concurrently on larger data, by extending the word length (for instance, from 8 to 16), allowing multiple identical operations to be carried out in a single cycle.
Instruction-Level Parallelism (ILP) relies on identifying independent instructions and executing them concurrently. However, it is not always possible to find independent instructions, as most programs are written in a sequential manner.

Thread-Level Parallelism (TLP) enables a program to operate with multiple threads at the same time rather than waiting on other threads.

Task-Level Parallelism (TaLP), also known as control or function parallelism, focuses on speeding up execution by spreading processes among various parallel cores working on the same or different data. The main difficulty with TaLP is not how to distribute the tasks, but rather how the application program can be divided into tasks. TaLP allows independent cores to run concurrently and exchange messages occasionally.

Data-Level Parallelism (DLP), also known as loop-level parallelism, allows multiple cores to process data concurrently. In many-core systems, data parallelism is exploited by making every core perform the same task on different portions of the distributed data.
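As a minimal sketch of data-level parallelism (using a loop-level directive of the kind provided by OpenMP, discussed in the next section), the example below scales an array in parallel: each core applies the same operation to its own portion of the data. The array size and arithmetic are arbitrary choices made for illustration.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static float data[N];

    /* Data-level parallelism: the iteration space is split across the
     * available cores, each applying the same operation to a different
     * portion of the array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = data[i] * 2.0f + 1.0f;

    printf("processed %d elements on up to %d threads\n", N, omp_get_max_threads());
    return 0;
}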
E. Parallel Programming Languages

Computing with multiple processors has introduced new and complex challenges related to the cooperation between the processors. Parallel programming languages, libraries, and tools have been developed to allow users to program algorithms and applications on parallel processing units. In this section some of the most widely used parallel programming models and libraries will be discussed.

Thread Building Blocks (TBB) [18] has been developed by Intel as a parallel programming library for multi-core systems. TBB is basically a collection of C/C++ library functions and templates which implement mechanisms for designing software that can execute on multi-core architectures. TBB is a standard programming library not limited to Intel processors, as it has been integrated with Microsoft's Visual Studio [19].

Cilk [20] is a multi-core programming model proposed by MIT. It extends C and, more recently, C++. It has been acquired by Intel to be used alongside the well-established Thread Building Blocks library.

Task Parallel Library (TPL) [18] is a task-parallel programming framework based on languages such as C# or Visual Basic. It has been an integral part of Microsoft's .NET framework since 2007. TPL introduces parallel constructs such as parallel For and ForEach loops. TPL seems to be Microsoft's solution for parallel programming in .NET based languages [19].

Open Multi-Processing (OpenMP) is a standardized programming model specification for shared-memory based multiprocessing in C, C++, and Fortran [21] [22]. It is supported by most hardware and software vendors and has therefore been widely used, particularly in the high-performance computing field. OpenMP introduces the concepts of explicit tasks and decoupled execution and includes other features such as parallel regions with nested parallelism.

Open Computing Language (OpenCL) was developed first by Apple, but was later adopted by several other major chip vendors via the Khronos Compute Working Group, which is responsible for the OpenCL standardization [19]. It can be used by CPUs and GPUs. As OpenCL is extended from C, it supports most of the C features (excluding function pointers, recursion, variable-length arrays, and bit fields) but adds further features such as work-groups, synchronization, and new data types. It also has built-in functions for image processing.

Message Passing Interface (MPI) is a standard message passing library developed by researchers from academia and industry to support a wide range of parallel computing platforms. The main aim of MPI is to establish a portable, efficient, and flexible standard for message passing [23].
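A minimal message-passing sketch using the standard MPI C API is shown below (illustrative only): rank 0 sends a value to rank 1, which mirrors the explicit producer/consumer style that message-passing implementations of streaming tasks rely on.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {                 /* producer task */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* consumer task */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
    }
    MPI_Finalize();
    return 0;
}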
Compute Unified Device Architecture (CUDA) is a parallel computing programming model developed by NVIDIA to exploit the computing power of graphics processing units (GPUs) [24]. CUDA is widely used in thousands of applications and published works. According to NVIDIA, CUDA is supported by an installed base of over 375 million CUDA-enabled GPUs in workstations, compute clusters, and supercomputers. Some code examples can be found in [25].

PetaBricks is an implicitly parallel language and compiler developed by MIT to provide algorithmic choices. These choices are given in a way that lets the compiler select one of them (tuning at a finer granularity and choosing one of the provided options) [26] [27]. The PetaBricks implementation contains three components:

• a source-to-source compiler from the PetaBricks language to C++;
• an auto-tuning system and choice framework to find optimal choices and set parameters;
• a runtime library used by the generated code.

StreamIt is a programming language and a compilation infrastructure extended from the C programming language, also developed by MIT, to empower both programmers and compiler writers to leverage the unique properties of the streaming domain [28].

F. Parallel Hardware Architectures

Parallel hardware has become a very important factor in computer processing technology. Currently, many desktop computers contain dual-core, quad-core, or octa-core CPUs and graphics processing units (GPUs). Moreover, many embedded systems contain MCSoC solutions. In this section, various types of parallel hardware architectures, including multiprocessors, SoCs, multi-core, and many-core, will be discussed.

Multiprocessors basically consist of multiple CPU cores that are not on the same chip. Multiprocessors were primarily developed for IT servers in the 1990s and later became known under the label of supercomputers or high performance computers (HPC).
TABLE I
EXAMPLES OF MULTI/MANY-CORE CHIPS FROM 2002 TO 2017 [33]

Year   Processor            Cores   Organization
2017   Core i9              10      Intel [34]
2016   KiloCore             1000    UC Davis [35]
2015   Radeon R9 Nano       64      AMD [36]
2014   Xeon Phi 7120X       61      Intel [37]
2013   Tilera TILE-Gx72     72      EZchip (prev. Tilera) [38]
2012   Kalray MPPA-256      288     Kalray [39]
2011   3D-Maps              64      Georgia Tech, Lee [40]
2010   Power 7              8       IBM [41]
2009   QorIQ P4080          8       Freescale Semiconductor [42]
2008   Ambric Am2045        336     Ambric, Inc. [43]
2008   AsAP2                167     UC Davis [44]
2008   GeForce 8800 Ultra   16      Nvidia [45]
2007   Tilera TILE64        64      Tilera Corporation [46]
2005   Cell                 9       Sony, Toshiba, IBM [47]
2002   RAW                  16      MIT [48]

Fig. 5. Zedboard based on the Zynq-7000 All Programmable SoC [29].

System-on-Chip (SoC) devices are integrated circuits that contain all components of a system (normally an embedded system) in a single chip and can include reconfigurable modules, such as FPGA devices. SoCs contain several blocks including analog, digital, mixed-signal, and radio-frequency functions, linked through industry-standard bus mechanisms (such as AMBA) or through custom links. The ZYNQ7000 family [29] from Xilinx is an example of a SoC that includes programmable logic. A diagram of the Zedboard, based on the ZYNQ7000 All Programmable SoC, can be seen in Fig. 5.

Multi-core Processors contain a few cores in the same chip (typically 2, 4, 6, or 8 cores). Real examples of multi-core processors include the i3, i5, and i7 processors by Intel. In general, multi-core processors imply parallelism in the hardware architecture only.

Many-core Processors: unlike multi-core, many-core processors apply parallelism at both the architecture and the application level. Generally speaking, many-core architectures have a number of potential advantages over other architectures:

• They match well with future process technologies, as they can exploit large numbers of cores.
• The cores can be extensively optimized.
• Broadly, performance should scale linearly with the number of cores.
• Faulty core(s) can be isolated or discarded.
• The clock rate of each core can be configured individually and dynamically, and core(s) can be turned on or off to save power and reduce heat.

Field Programmable Gate Array (FPGA) accelerators can be utilized with any of the above architectures. FPGAs are devices that consist of a matrix of reconfigurable gate array circuitry. The reconfigurable circuitry of the FPGA can be used to create a hardware implementation of the software application to accelerate execution and/or save power.

G. Many-Core Processors Organization

To provide an overview of many-core systems, in this section the two main many-core system categories, namely homogeneous and heterogeneous, will be discussed. Table I presents some of the best known devices marketed under the label of multi-/many-core.

Homogeneous many-core systems: In homogeneous systems, all cores are identical. Research groups in academia have proposed several homogeneous platforms, for instance the 16-core Raw Architecture Workstation (RAW) processor by the Massachusetts Institute of Technology (MIT) or the Tera-op, Reliable, Intelligently adaptive Processing System (TRIPS) from the University of Texas. In 2016, KiloCore was announced as the world's first 1,000-core chip [30].

Heterogeneous many-core systems: Unlike homogeneous systems, heterogeneous many-core SoCs contain different types of processing units (PUs) in the same chip. The PUs can be graphics processing units (GPUs), central processing units (CPUs), FPGA fabrics, dedicated intellectual property cores (IPs), accelerators, specialized memories, etc. The use of heterogeneous components can improve computation performance by taking advantage of the unique features of different types of PUs. However, homogeneous system architectures tend to be the preferred choice in most implementations because they are slightly easier to program than heterogeneous ones.

An MPSoC consisting of four different types of PUs is presented in [31]. These PUs are connected in a 3x3 mesh topology. The Philips Nexperia [32] contains three processing units: one MIPS CPU for control processing and two DSP cores to process media content.

H. Network on Chip

As CMOS technology improves, the number of cores within a single chip has been continually rising. Recently, prototype or commercial chips that contain 64 or more cores have been manufactured [49] [50] [51]. The core count reached 1,000 in 2016, as announced by the University of California, Davis [35].
Fig. 6. Overview of a NoC: a grid of processing systems (PS), each attached to a NoC router through TX/RX channels.

Fig. 7. Network-on-Chip layers and modules: the application layer (NoC application, NoC OS), the transport layer (packet transmission between cores, network interfaces), the router layer (packet routing, switches), and the physical layer (physical transmission of phits, interconnect links).

To connect such complicated architectures, the Network-on-Chip (NoC), introduced by Benini and De Micheli [5], has become the de facto standard.

1) From Bus to NoC: Many recent electronic devices contain complete systems embedded in a chip. As a result, intra-chip communication requirements have become crucial. Point-to-point links are often adopted to satisfy the requirements of on-chip communication connecting the top-level modules but, as the complexity of the systems increases, wire density and length have grown. Thus point-to-point architectures have become more and more infeasible, due to high propagation delays, high power dissipation, poor scalability, and lack of reliability. In particular, global wires across the chip, which can not be scaled with technology [52], have become a bottleneck in bus-based interconnections. Moreover, in deep sub-micron VLSI technology, delay, power, and reliability are crucial issues [53]. Therefore, centralized approaches to global communication across top-level modules are no longer suitable for advanced architectures that contain a large number of cores. Current on-chip bus interconnect templates, for instance the AMBA and CoreConnect buses from ARM and IBM (respectively), are widely used in multiprocessor systems on chip (MP-SoC) but do not scale to large numbers of cores, as they allow only one communication transaction at a time. Consequently, the average communication bandwidth of every core is inversely proportional to the total number of cores in the chip [52]. This implies that the on-chip bus architecture is inherently not scalable for many-core devices.

The Network-on-Chip (NoC) approach has been introduced by [5] to provide solutions for the aforementioned issues and to introduce a structured and scalable communication architecture [54].

2) Concept of Network-on-Chip: A typical Network-on-Chip (Fig. 6) is a scalable, general-purpose on-chip communication network that enables the design of increasingly complex multi-processor systems. A NoC divides the long cross-chip wires into smaller segments, allowing their properties to be optimized and controlled. NoCs use packets to route data from the source to the destination processor through a network fabric. Packets are sent in a pipelined fashion, which enhances the operating frequency and solves the signal integrity problem (signal integrity being a set of measurements to ensure that all transmitted signals are received correctly and do not interfere with one another). The basic NoC architecture, proposed by [5] and [54], was inspired by the modern land development process, in which road and communication infrastructures are laid first, then buildings are placed and built. By mimicking the modern city, the chip is divided into rectangular tiles (processing elements) and roads between tiles, where the on-chip network routers are placed. Fig. 6 provides an overview of a typical 2D mesh NoC, which consists of 5-port routers that can exchange data through two unidirectional channels with the four neighbouring routers (north, east, south, and west) as well as with the local processing node/core. A router is not only responsible for its associated tile but also routes signals from/to other routers.

3) Network-on-Chip layers: A Network-on-Chip can be divided into the following layers:

Application layer: In this layer, applications are broken down into a set of computation and communication tasks. The functionalities of this layer include message synchronization and management; for instance, performance factors like energy and speed can be optimized here.

Transport layer: According to Fig. 7, the transport layer provides the end-to-end communication (between switches) and delivery of data using the network router layer. Therefore, its functionalities include packetization of data at the source and depacketization at the destination.

Network router layer: This layer provides the service of communicating a packet from one resource to another using the network of switches. Buffering of packets and taking routing decisions in the intermediate switches are the functionalities of this layer.

Physical layer: This layer is concerned with the physical characteristics of the NoC for connecting switches and resources with each other.
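To make the division of responsibilities more concrete, the fragment below is a simplified, assumed illustration (not an interface defined in the cited works) of how a transport layer might packetize a message into fixed-size flits carrying a small header with the source and destination node identifiers; depacketization at the destination is the reverse operation.

#include <string.h>

#define FLIT_PAYLOAD 4                /* bytes of payload per flit (assumed) */

struct flit {
    unsigned char src, dst;           /* source and destination node ids     */
    unsigned char seq, last;          /* sequence number, end-of-packet flag */
    unsigned char payload[FLIT_PAYLOAD];
};

/* Split 'len' bytes of a message into flits; returns the number of flits
 * written to 'out'.  Buffer management and error handling are omitted. */
int packetize(unsigned char src, unsigned char dst,
              const unsigned char *msg, int len, struct flit *out)
{
    int nflits = (len + FLIT_PAYLOAD - 1) / FLIT_PAYLOAD;
    for (int i = 0; i < nflits; i++) {
        int chunk = (len - i * FLIT_PAYLOAD) < FLIT_PAYLOAD
                  ? (len - i * FLIT_PAYLOAD) : FLIT_PAYLOAD;
        out[i].src  = src;
        out[i].dst  = dst;
        out[i].seq  = (unsigned char)i;
        out[i].last = (i == nflits - 1);
        memset(out[i].payload, 0, FLIT_PAYLOAD);
        memcpy(out[i].payload, msg + i * FLIT_PAYLOAD, chunk);
    }
    return nflits;
}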
4) Topology: Topology is one of the fundamental features of NoC design because it plays a crucial role in overall network performance and cost. The topology determines the physical connections between the nodes; it also determines the number of alternative paths between the nodes, and as a result it affects the network traffic distribution. Several topologies have been proposed for high-performance parallel computing (for example, hypercubes), but 2D meshes, rings, or tori are the most common topologies in current integrated circuits [55]. Fig. 8 shows the three most used topologies.

Fig. 8. Three typical NoC topologies: (a) ring, (b) 2D mesh, (c) 2D torus.

The ring topology (Fig. 8(a)) is commonly used in small networks with a small number of nodes, for instance in the six-node Ivy Bridge [56] and Cell processors [57]. Since the ring network has a simple structure, the routing logic and flow control are easy to implement. Moreover, a centralized control mechanism can be applied [56] [57]. Scalability is limited in the ring topology, but it can be improved using multiple rings, as Intel did in its Knights Corner processor, where ten rings are used [58].

The 2D mesh topology (Fig. 8(b)) is highly compatible with wire routing in CMOS technology and it can also potentially reduce or even eliminate deadlocks in the NoC. These factors, in addition to its high scalability, make the 2D mesh topology the most commonly utilized NoC topology in industry and academia. Some examples include the U.T. Austin TRIPS [59], Intel Teraflops [60], and Tilera TILE64 [61] chips. Most researchers assume 2D mesh topologies for their theoretical research and simulations. Congestion at the centre of the 2D mesh is its most obvious limitation, becoming a bottleneck for large numbers of cores. However, according to [62], this issue might be reduced by assigning more wiring resources to the centre portion, in other words designing asymmetric topologies.

The torus topology (Fig. 8(c)) helps to balance network utilization, as the wraparound links reduce the latency and the congestion in the centre. However, deadlock may still occur, and additional mechanisms are often used, such as multiple virtual channels (VCs) [63] or bubble flow control (BFC) [64] [65]. 2D and higher-dimension tori are utilized in many-core systems such as the K Computer [66], Blue Gene [67] [68], the Cray T3E [69], and the SpiNNaker chip [70].
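The effect of the wraparound links can be seen with a small calculation: the minimal hop count between two nodes in a mesh is the plain Manhattan distance, while in a torus each dimension can also be traversed the "short way around". The helper below (an illustrative sketch) computes both for an n-by-n network.

#include <stdio.h>
#include <stdlib.h>

/* Minimal hop count between (x0,y0) and (x1,y1) in an n x n 2D mesh. */
static int mesh_hops(int x0, int y0, int x1, int y1)
{
    return abs(x0 - x1) + abs(y0 - y1);
}

/* In a 2D torus the wraparound link allows each dimension to be crossed
 * in whichever direction is shorter. */
static int torus_hops(int n, int x0, int y0, int x1, int y1)
{
    int dx = abs(x0 - x1), dy = abs(y0 - y1);
    if (n - dx < dx) dx = n - dx;
    if (n - dy < dy) dy = n - dy;
    return dx + dy;
}

int main(void)
{
    int n = 8;   /* corner-to-corner traffic in an 8x8 network */
    printf("mesh:  %d hops\n", mesh_hops(0, 0, n - 1, n - 1));     /* 14 */
    printf("torus: %d hops\n", torus_hops(n, 0, 0, n - 1, n - 1)); /* 2  */
    return 0;
}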
Other topologies have also been proposed, such as star and tree. In the former, all the nodes are connected to a central (common) node, which is usually referred to as the super-node; the latter has a central root node which is connected to one or more nodes of a lower hierarchy. According to [71], both star and tree topologies have very limited scalability due to the high cost of link implementation. However, they can profit from a 3D topology, where the average distance between the nodes is dramatically reduced.

Research on topologies has mainly focused on reducing latency, improving throughput, and reducing area and power overhead [55]. Reducing the latency in the NoC is probably the most interesting avenue for topology design because it has a huge impact on the overall performance of the NoC. [72] provides an extensive study evaluating various topologies for on-chip communication and reaches interesting conclusions, such as the fact that connecting more than one core to a single router is an efficient method to reduce latency. Connecting routers that are not neighbours by adding physical channels can also reduce latency. However, the authors in [73], while generally agreeing with [72], argue that adding more channels and connecting more cores to the same router increases the design complexity dramatically.

Another important research aspect for topology design is improving network throughput. According to [74], the central portions of a mesh network are more likely to be congested, as a result limiting throughput. Therefore, [74] proposes a heterogeneous mesh network that adds higher bandwidth to the routers in the centre. This heterogeneous network requires the division of large packets into smaller ones and also the combination of small packets into large ones, in order to provide communication between narrow and wide channels.

The last important research field in topology design is the reduction of area and power overheads. Regarding power consumption, [75] proposes a cubic ring topology, in which 30% of the routers can be turned off dynamically to reduce power consumption. [76] places a ring alongside a mesh topology, so the ring part acts as a backup connection in case mesh routers are turned off; in this proposal, all mesh routers can be shut down to save power. Virtual channels that can dynamically configure links between routers to form a bus structure, reducing power and latency, are proposed by [77].

5) Router architecture: The architecture of the router defines the area overhead, power consumption, and routing delay. A canonical router is composed of input units, routing computation logic, switch allocators, a crossbar, and output units. Fig. 9 illustrates the architecture of a common router.
Fig. 9. General architecture of a router (input channels from the local processing element and from other routers, routing and arbitration logic, crossbar, and output channels to other routers).

Fig. 10. Three types of flow control: (a) SAF, (b) VCT, (c) wormhole.

Input units contain an input buffer, which is used to maintain the channel availability until the buffer is filled. To increase throughput and reduce latency, a large buffer can be used, as more packets can be stored and forwarded, but this comes at the price of a larger area and, consequently, more power consumption. First-in-first-out buffers are usually utilized in the input unit, in which data flits are processed depending on the order of their arrival time.
The routing logic processes flits to determine the output direction and then sends a direction request to the arbitration logic.

The arbitration logic receives requests from the routing logic and produces a grant signal if the requested direction is available. Usually, the number of direction requests is equal to the number of physical output directions of the router. For instance, in a 2D mesh topology there are five output channels (ports).

The crossbar is basically a set of multiplexers, one for each output port. The control signals of the multiplexers are generated by the corresponding arbiter. One flit can be sent out in each clock cycle, depending on the arbitration result.

6) Routing: After setting the NoC topology, routing logic is needed to define the paths of packets from sender nodes to receivers. Routing algorithms can be classified into two types according to the traversed path length: (1) minimal routing always sends the packets within the minimal quadrant defined by the sender and receiver node pair; (2) non-minimal routing also considers paths outside the minimal quadrant. As minimal routing has fewer hops than non-minimal routing, its power consumption is generally lower.

Routing algorithms can also be classified into deterministic and non-deterministic, based on the number of paths that a packet can take. While the deterministic method offers only one path for each pair of sender and receiver, non-deterministic routing offers multiple choices. Non-deterministic routing is itself classified into oblivious routing and adaptive routing. In oblivious routing the network status is not taken into account, so the packet is sent randomly via one of the available paths. On the other hand, adaptive routing checks the congestion in the available paths and then passes the packet through the least congested path.
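As an example of a deterministic, minimal routing policy, the sketch below implements dimension-order (XY) routing for a 2D mesh: a packet first travels along the X dimension until it reaches the destination column, then along Y. XY routing is a standard textbook scheme used here purely for illustration; it is not a routing algorithm proposed by this work.

enum port { LOCAL, NORTH, EAST, SOUTH, WEST };

/* Dimension-order (XY) routing: route in X first, then in Y.
 * (cx, cy) is the current router, (dx, dy) the destination.
 * Because packets never turn from the Y dimension back to X, the cyclic
 * channel dependencies that cause deadlock in a mesh are avoided. */
enum port xy_route(int cx, int cy, int dx, int dy)
{
    if (dx > cx) return EAST;
    if (dx < cx) return WEST;
    if (dy > cy) return NORTH;
    if (dy < cy) return SOUTH;
    return LOCAL;   /* packet has arrived: deliver to the local core */
}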
Packets in a NoC are divided by routers into flits (the amount of data that can be sent in one cycle). Each flit that arrives at a router is stored in a buffer until it can pass to the downstream router. The control logic processes the first flit in the buffer to determine the packet's destination and whether or not it is meant for the local processing element. The decision of which direction to send the packet is made by the control unit according to routing tables (data tables stored in the router, which contain information about the mapping of the NoC around the current router), arbitration mechanisms, and, in the case of adaptive routing, the status of the downstream buffer. When the control set-up is completed, the flit is passed to its related output direction via the crossbar.

7) Deadlock: Deadlock can occur when a buffer on the path of a packet is filled or blocked and can not hold any more data. Deadlocks can lead to situations where packets get stuck in the NoC and never arrive at the destination node. One of the basic requirements of a reliable network is deadlock-freedom. Most of the research on routing algorithms focuses on achieving deadlock freedom, which requires that there is no cyclic dependency across the network resources (channels and buffers) [78] [79]. [80] [81] eliminate the cyclic resource dependency by not allowing routing to utilize certain turns. Other researchers [82] [83] propose fully adaptive routing that considers all the minimal paths, using virtual channels (VCs) as a backup to avoid deadlocks.
I. Flow Control

To allocate the network resources (buffer capacity and channel bandwidth), a flow control algorithm is used to transmit packets. There are four primary types of flow control: circuit switching, store-and-forward (SAF), virtual cut-through (VCT), and wormhole flow control.

Circuit switching reserves physical link(s) from source to destination from the time the connection is established until the transmission of data is completed [84]. The reservation is done by sending a header flit into the network, which contains the source and destination addresses. On its way from source to destination, the header flit reserves the physical links. When the destination receives the header flit, it sends an acknowledgement to the source, which then sends the rest of the data.

Store-and-forward (SAF): In SAF flow control [85] the entire packet is received first and then forwarded to the next router. Fig. 10(a) illustrates an example of SAF, in which a packet consisting of three flits (F1 to F3) is routed from router 0 to router 3. At each router the entire packet is received flit by flit before it is forwarded to the next router. As a result, SAF introduces a latency of at least N clock cycles at every hop (where N is the number of flits in a packet).

Virtual cut-through (VCT): In order to reduce the serialization latency, another type of flow control was introduced by [86]. Virtual cut-through flow control forwards the packet flit by flit as soon as the header flit (F1) is received by the router, provided the link is available. As can be seen in Fig. 10(b), every router forwards the packet immediately after receiving the header flit. However, as VCT allocates the buffers at packet granularity, it requires the downstream router to have enough buffer space for the entire packet prior to forwarding the header flit. For instance, in Fig. 10(b) at cycle 2 the target buffer does not have sufficient free slots for the entire packet, therefore the packet waits until there is enough space (after 3 more cycles).

Wormhole: Unlike SAF and VCT, wormhole flow control (Fig. 10(c)) requires only one buffer slot to be available in the downstream router before starting to forward the packet. Generally, if there is no congestion in the network, wormhole and VCT perform equally. When congestion occurs, however, they behave differently, because wormhole requires only one buffer slot while VCT requires buffer slots for the whole packet before forwarding it [87].
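The practical difference between the two schemes comes down to the admission test performed before a packet (or its next flit) is forwarded to the downstream router, as the small sketch below illustrates (the buffer representation is assumed purely for illustration).

#include <stdbool.h>

struct down_buffer {
    int free_slots;     /* free flit slots in the downstream input buffer */
};

/* Virtual cut-through: the whole packet must fit downstream before the
 * header flit is forwarded. */
bool vct_can_forward(const struct down_buffer *b, int packet_flits)
{
    return b->free_slots >= packet_flits;
}

/* Wormhole: a single free flit slot downstream is enough to keep the
 * packet moving; under congestion the packet stalls in place, holding
 * buffers along its path. */
bool wormhole_can_forward(const struct down_buffer *b)
{
    return b->free_slots >= 1;
}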
Fig. 11. Router architecture with virtual channels (several VC buffers b0-b3 sharing one physical channel Ch).

Virtual channels (VCs) can be used with any of the aforementioned flow controls. VCs consist of buffers which can hold one or more flits of a packet. Several virtual channels might share the bandwidth of a single physical channel Ch. VCs can reduce or avoid deadlock: for instance, as can be seen in Fig. 11, if a blocked packet A fills buffer b0, the other buffers b1, b2, and b3 are still available, allowing other packets to pass while packet A is holding buffer b0 [78].

Other flow controls have been proposed by various researchers to improve performance and achieve better resource allocation. [88] proposes a bufferless flow control that handles resource congestion at the flit granularity: if more than one flit competes for the same port, only one flit is granted the port and the rest will be stored or dropped. Mitchell et al. [89] took bufferless flow control further by proposing a single-cycle adaptive routing and bufferless network (SCARAB) in which the non-granted flits are dropped and a negative acknowledgement is sent back to the source node to trigger retransmission. An adaptive flow control that combines buffered and bufferless techniques is proposed in [90]: it allows routers to buffer the flits in case of high load, but turns off the buffers and applies a bufferless scheme to save power in case of low load. [91] shows that bubble flow control exhibits a reduction in base latency values of over 40% with respect to the corresponding wormhole scheme. A research group at the National University of Defence Technology [92] has proposed a hybrid flow control similar to VCT in terms of injecting packets, but using packet movement typical of wormhole flow control.

To summarize, research on flow control techniques mainly focuses on reducing the packet transmission latency, reducing power consumption by using bufferless methods, and avoiding deadlock in the network.

J. Mapping Applications on Many-Core Processors

The following processes are required before mapping applications onto a many-core platform:

• The parallelization of the application, including defining communication between the parallelized tasks and synchronization. This can be done with one of the standard application parallelization tools such as [93] [94] [95].
• The transformation of the application into a task graph, using for example task graph generators such as Task Graphs For Free (TGFF) [96] and Synchronous Data-flow Graphs (SDFGs) [97].
• An analysis of the constraints of the application, such as power consumption and performance.
• When considering heterogeneous platforms, task binding is required to map tasks to suitable hardware resources.

Mapping application tasks on many-core platforms can be carried out either statically (at design time) or dynamically (at run time). Mapping at design time is used for applications with known computation and communication behaviour, but it is less efficient for dynamic workloads (for instance, when adding new applications at run time). Run-time mapping methods, on the other hand, consider applications during their operation, generally using task migration to move tasks in case the application requirements change or a new application is entered into the platform.

1) Design Time Mapping: Design-time mapping techniques must have a whole picture of the system and application beforehand in order to make an appropriate decision about using the resources. Because there are no computational or time restrictions involved, a high quality of mapping can be obtained compared to run-time mapping techniques, which are usually restricted to a local view and operate under tight constraints. Most of the literature on mapping covers design-time methods.

[98] proposes the Communication Weighted Model (CWM) as a mapping technique to reduce the overall power consumption by reducing the energy consumed in communication. Unlike other mapping strategies, [99] takes into consideration the dynamic behaviour of the target application, and thus potential contentions in the communication between cores, showing that a 42% average reduction in the execution time of the mapped application can be obtained, together with a 21% average reduction in the total energy consumption. A mapping scheme based on a branch-and-bound algorithm is proposed in [100] to map applications on hybrid NoCs; the results show that this scheme can reduce communication latency.

A two-step genetic mapping algorithm is used in [101] to optimize the application execution time. The algorithm proposes mathematical delay models and maps the vertices of a multi-task graph to available cores so that every task can meet its respective deadline. A genetic mapping algorithm that utilizes dynamic voltage scaling (DVS) is proposed in [102] to decrease power consumption; considering DVS during the mapping optimization can save up to 51% of the energy.

As these mapping methods determine the placement of tasks at design time, they are not applicable to applications with dynamic workloads, which need remapping or run-time mapping.

2) Run Time Mapping: In contrast to design-time mapping, run-time mapping must take into account the time taken to remap tasks, as this affects the overall execution time of the application. Moreover, in dynamic mapping, tasks are generally mapped one by one. Therefore, greedy algorithms are normally utilized for efficient mapping so that performance metrics (communication latency, execution time, power consumption, etc.) can be optimized. Furthermore, run-time mapping provides several advantages over static mapping, as it can adapt to the available resources and also discard defective parts of the platform (allowing fault tolerance techniques).

[103] presents a heuristic algorithm which is distributed over the processors and can therefore be applied to systems of any size. Moreover, tasks added at run time can be handled without any difficulty, allowing for online optimisation. Tasks can also be migrated based on local information on processor workloads, task size, communication requirements, and link contention. The mapping results for several example task sets suggest that the performance of mappings obtained by this algorithm is within 25% of that of the exact algorithm for a 3x3 mesh topology platform. Task allocation strategies based on bin-packing algorithms with task migration ability are proposed by [104], where various types of algorithm are combined to obtain better allocation results. The system can shut down idle processors and apply dynamic voltage scaling to processors with slack, thus reducing power consumption.

To cope with the dynamism of application workloads at runtime and improve the efficiency of the underlying system architecture, [105] presents a hybrid task mapping algorithm that combines static mapping exploration and dynamic mapping optimization to achieve an overall improvement of system efficiency. The algorithm was evaluated using a heterogeneous MPSoC system with three real applications. According to [105], the results reveal the effectiveness of the proposed algorithm: in test cases with three simultaneously active applications, the mapping solutions derived by the approach have average performance improvements ranging from 45.9% to 105.9% and average energy savings ranging from 14.6% to 23.5%.

To map streaming applications on a heterogeneous platform, [106] proposes a run-time spatial mapping technique that contains four steps. This algorithm is implemented on an ARM926 operating at 100 MHz and is able to obtain significant improvements in latency. [107] suggests that allowing multiple task allocations per core will reduce the energy used by the NoC.
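To give a flavour of the greedy, one-task-at-a-time style of run-time mapping referred to above, the sketch below maps tasks onto a 2D mesh by placing each task on the free core closest (in hop count) to the core of the already-mapped task it communicates with most. This is a generic illustration under assumed data (the mesh size and the task graph are invented), not the algorithm of any of the cited works.

#include <stdio.h>
#include <stdlib.h>

#define MESH 4                        /* 4 x 4 mesh, illustrative size */
#define NTASKS 6

static int core_of[NTASKS];           /* task -> core index              */
static int busy[MESH * MESH];         /* core occupancy                  */
static int pred[NTASKS] = { -1, 0, 1, 1, 2, 3 };  /* heaviest communication partner (already mapped), illustrative task graph */

static int hops(int a, int b)         /* Manhattan distance between cores */
{
    return abs(a % MESH - b % MESH) + abs(a / MESH - b / MESH);
}

/* Greedy run-time mapping: place each task on the free core nearest to
 * the core of its heaviest communication partner. */
static void map_tasks(void)
{
    for (int t = 0; t < NTASKS; t++) {
        int anchor = (pred[t] >= 0) ? core_of[pred[t]] : 0;
        int best = -1;
        for (int c = 0; c < MESH * MESH; c++) {
            if (busy[c]) continue;
            if (best < 0 || hops(c, anchor) < hops(best, anchor))
                best = c;
        }
        core_of[t] = best;
        busy[best] = 1;
        printf("task %d -> core (%d,%d)\n", t, best % MESH, best / MESH);
    }
}

int main(void) { map_tasks(); return 0; }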
of hops between communicating tasks, which may lead to L. Faults in NoCs
hotspot zones and underutilization of resources. Recently, Packet-switched NoCs [5] are widely used instead of tra-
some researchers have tried to address these issues. For ditional shared bus for on-chip interconnects in many-core
instance, [108] proposes a runtime mapping heuristic which systems. However, failures that appear in any part of the NoCs,
has a cost function that targets temporal workload and energy can compromise the correct functionality of the entire system.
consumption balance in large scale systems. Therefore, it becomes advisable to introduce fault-tolerance
features.
However, using a centralized manager (CM) approach Broadly speaking, a fault can appear at three different layers
(which is the case in most of the literature) has several issues of NoC architectures, for each of which specific fault-tolerance
with respect to the proposed research including: a single point mechanisms are relevant. Various fault tolerance techniques
of failure, high monitoring traffic by the CM, a bottleneck have been suggested to tackle errors [114] at the transport
around the CM because each core is sending its status to the layer, router layer, and physical layer (Fig.7).
CM after mapping, and, in some instances, the fact that the 1) Fault Tolerance in the Transport (link) Layer: The
CM itself becomes a hotspot. transport layer, also known as end-to-end communication as it
links routers (ends) (Fig.7), provides communication services
However, using a centralized manager (CM) approach (which is the case in most of the literature) raises several issues with respect to the proposed research, including: a single point of failure; high monitoring traffic towards the CM; a bottleneck around the CM, because each core sends its status to the CM after mapping; and, in some instances, the fact that the CM itself becomes a hotspot.

K. Fault Classes

In general, three kinds of faults are the basis of most fault models:

Transient faults occur for a short time. For instance, the change of value of one bit (a bit-flip) can corrupt the header of a packet. For this type of fault, error control could be implemented at the link or in the transmitter and receiver [85]. Transient faults are typically generated as a result of terrestrial cosmic neutrons and alpha particles, which come from radioactive impurities in the device or packaging material [109] [110] [111]. A flipping of bits in SRAM or DRAM memory might happen due to these particles.
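As a minimal illustration of such a bit-flip and of link-level error control, the sketch below appends an even-parity bit to a toy packet header and detects a single transient flip at the receiver. The packet format is invented for the example.

```python
import random

def parity(bits):
    """Even parity over a list of 0/1 values."""
    return sum(bits) % 2

header = [1, 0, 1, 1, 0, 0, 1, 0]          # toy 8-bit packet header (assumed format)
sent = header + [parity(header)]           # append parity before transmission

# a transient fault flips one random bit while the flit crosses the link
received = list(sent)
i = random.randrange(len(received))
received[i] ^= 1

data, p = received[:-1], received[-1]
if parity(data) != p:
    print("transient error detected: request retransmission")
else:
    print("parity holds (an even number of flips would escape this simple check)")
```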
Intermittent faults appear frequently but are not permanent. They are therefore not easy to distinguish from transient faults. However, [109] proposes three features that can distinguish between intermittent and transient faults: (1) intermittent faults occur repeatedly at the same location, (2) errors induced by intermittent faults tend to occur in bursts, and (3) replacement of the offending circuit removes the intermittent fault. As an example, electromagnetic interference such as crosstalk or self-coupling could cause intermittent faults on wires.
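These indicators can be turned into a simple log-analysis heuristic, as in the sketch below, which flags a location as intermittent when errors recur there and cluster in time. The thresholds and the log format are assumptions made for illustration.

```python
from collections import defaultdict

def classify(error_log, min_repeats=3, burst_window=100):
    """error_log: list of (timestamp, location) tuples; thresholds are illustrative."""
    by_loc = defaultdict(list)
    for t, loc in error_log:
        by_loc[loc].append(t)
    verdict = {}
    for loc, times in by_loc.items():
        times.sort()
        bursty = any(t2 - t1 <= burst_window for t1, t2 in zip(times, times[1:]))
        if len(times) >= min_repeats and bursty:
            verdict[loc] = "intermittent"     # recurring and clustered in time
        else:
            verdict[loc] = "transient"        # isolated upset
    return verdict

log = [(10, "router_3_link_N"), (40, "router_3_link_N"), (55, "router_3_link_N"),
       (900, "router_7_buffer")]
print(classify(log))
```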
Permanent faults represent defects in the hardware. Both transient and intermittent faults could lead to logic faults which will progress into permanent faults over time [112]. Unlike the other types of faults, permanent faults do not disappear, because they represent permanent damage to the circuits or wires, which often manifests as short/open circuit errors due to ageing and physical failure. Permanent faults can be divided into two classes:

• Logic faults, in which CMOS devices (transistors) or wires are permanently open or short-circuited.
• Delay faults, which cause transistors or wires to become slower than before, possibly causing set-up and/or hold time violations, which generate incorrect logic values.

In order for a fault to be handled, it is important to differentiate between the three aforementioned fault types. While transient and intermittent faults can be handled by soft techniques (error correcting codes or multi-path routing) [113], permanent faults are more challenging to correct.

L. Faults in NoCs

Packet-switched NoCs [5] are widely used instead of the traditional shared bus for on-chip interconnects in many-core systems. However, failures that appear in any part of the NoC can compromise the correct functionality of the entire system. Therefore, it becomes advisable to introduce fault-tolerance features.

Broadly speaking, a fault can appear at three different layers of the NoC architecture, for each of which specific fault-tolerance mechanisms are relevant. Various fault tolerance techniques have been suggested to tackle errors [114] at the transport layer, the router layer, and the physical layer (Fig.7).

1) Fault Tolerance in the Transport (link) Layer: The transport layer, also known as the end-to-end communication layer as it links routers (ends) (Fig.7), provides communication services between network routers. In order to achieve reliable end-to-end communication in NoCs, fault tolerance is needed at this layer. According to [112], fault-tolerance schemes in this layer can be classified into four types:

Automatic repeat request (ARQ) is basically a time redundancy technique based on acknowledgements and re-sends to achieve reliable system operation and to correct corrupted packets. The acknowledgement (ACK) or negative acknowledgement (NACK) signal is generated by the receiver side when it decodes the packet encoded by the predecessor. If the packet is corrupted, the receiver will send a NACK signal; upon receiving the NACK signal the sender will send the packet again. A time-out (a predefined time) mechanism is also used to account for errors in the ACK and NACK themselves: if the predecessor does not receive the ACK/NACK signal it will keep retransmitting the packet until it receives the ACK/NACK signal. In such a situation, a buffer is needed to store the packets [111].
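The following sketch captures the essence of a stop-and-wait ARQ exchange: the sender keeps the packet buffered and retransmits until the receiver signals a correct reception, with a retry limit standing in for the time-out. It is a behavioural illustration, not a specific published NoC implementation; the channel model is an assumption.

```python
import random

def send_with_arq(packet, channel, max_retries=5):
    """Stop-and-wait ARQ sketch: keep the packet buffered and retransmit until the
    receiver acknowledges it (a retry limit stands in for the time-out)."""
    for attempt in range(1, max_retries + 1):
        received = channel(packet)               # transmission may corrupt the packet
        if received == packet:                   # receiver decodes and checks it
            return "ACK", attempt                # positive acknowledgement
        # otherwise the receiver answers NACK and the buffered copy is resent
    return "FAIL", max_retries

def noisy_channel(packet, error_rate=0.3):
    """Assumed link model: with some probability a single bit of the packet flips."""
    if random.random() < error_rate:
        flipped = list(packet)
        i = random.randrange(len(flipped))
        flipped[i] ^= 1
        return flipped
    return list(packet)

print(send_with_arq([1, 0, 1, 1], noisy_channel))
```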
In forward error correction (FEC), the predecessor encodes the packets using an error correction code that lets the receiver decode and correct the error itself, without sending an acknowledgement signal. Block codes and convolution codes are the two main types of FEC [112]. One of the most widely used codes in NoC end-to-end communication is the Hamming code, which is a classical block code. [115] studies various forward error correction methods, including Hamming codes, for NoC communication.
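The classical Hamming(7,4) code mentioned above can be sketched in a few lines: four data bits are protected by three parity bits, and the syndrome computed at the receiver points directly at a single erroneous bit, which is then corrected without any retransmission.

```python
def hamming74_encode(d):
    """d: list of 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct a single-bit error and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]      # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]      # positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]      # positions 4,5,6,7
    syndrome = s1 + (s2 << 1) + (s3 << 2)
    if syndrome:                        # the syndrome is the 1-based error position
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[5] ^= 1                            # a single-bit error injected on the link
assert hamming74_decode(code) == word
print("corrected:", hamming74_decode(code))
```

The cost of this scheme is the extra parity bits per flit and the encoder/decoder logic, which is the latency/area trade-off discussed for FEC in the surveyed work.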
In hybrid ARQ/FEC (HARQ) schemes, packets are encoded using an error correction code, so the receiver can decode packets, detect errors, and correct them; but if an error cannot be corrected, the receiver sends a request to the predecessor to resend the packet. This process is repeated as long as the error is not corrected by the receiver. [116] proposes a HARQ scheme that can be configured to work in different modes (detection, correction, and mixed mode) based on the specific application, to achieve various quality of service levels. While the correction mode allows correcting corrupted packets and always forwards them without sending an acknowledgement to the source to retransmit, which reduces the latency, the detection mode bypasses or even disables the decoding part completely but sets a flag as soon as it detects an error, so that the source will transmit the packet again. This mode reduces power consumption by switching off the correction components, but increases latency. In mixed mode, different error control approaches are applied to various parts of the transmitted packets; for example, errors in headers can be corrected, while errors in the payload can only be detected [110].

Spatial redundancy techniques use alternative links or routes when the current link is identified as faulty. A reconfigurable network interface (NI) with two ports is proposed by [117]: a main port and a spare port, used as backup. By reconfiguring the NI, some internal faults and a broken primary port can be tolerated. One core with multiple NIs that connect to more than one router is presented in [118], improving the fault tolerance of the connections between NIs and routers but still suffering from errors in the communication due to faulty behaviours of NI components. [119] proposes a functional fault model for the NI components by evaluating their susceptibility to faults.

In [120] and [121] a spatial redundancy technique is utilized that includes transmitting a packet over disjoint paths. In case of an error on one transmission path, the uncorrupted packets that are transferred via the alternative paths can be used. Multi-path routing [122] can decrease latency compared to a retransmission technique. However, multi-path routing increases the utilization of the NoC due to the spatial redundancy.

The three kinds of faults described in the previous section (transient, intermittent, and permanent) can affect the correctness of the payload or header and can therefore potentially destroy packets completely. Most of the literature regarding detecting and protecting against errors in end-to-end communication uses implicit fault models, which assume the existence of errors in some bits of the packet due to faults. [123] introduces a link-fault technique to diagnose faulty links in NoCs based on functional fault models and implements packet-address-driven test configurations. [123] labels a link as faulty if a packet enters a router and either the packet is corrupted or it is not sent to the relevant output port (corrupted header).

To detect and correct errors in end-to-end communication, error control coding can be used: for instance, an encoder can be added to the sender and a decoder to the receiver. Parity codes or cyclic redundancy check (CRC) codes are used as error detecting codes. In such cases the sender network interface will have one or more buffers to store the transmitted packets; if the receiver detects an error in a packet, it will request the packet to be sent again by the sender. However, this scheme has disadvantages: first of all, it cannot precisely locate the position of the fault, since the checking is done at the receiver only; moreover, it can only overcome transient faults, while in the case of intermittent and permanent faults it is highly likely that the resent data will be affected by the same fault. Using error correction codes, some faults can be handled by the receiver. However, "only a limited number of faults can be handled and it gets overwhelmed when permanent faults accumulate over time" [112].

Fig. 12. Test configurations in [123]: (a) straight paths; (b) turning paths; (c) local resource connections.

[123] proposes a strategy to locate faulty links in the NoC by extending the use of test configurations for diagnostic purposes. The algorithm employs three test configurations for mesh-like NoCs: (a) straight paths, (b) turning paths, and (c) local resource connections. These configurations cover the entire NoC, as seen in Fig.12. The first configuration (a) drives the packets straight across the NoC, so the faults in the straight connections are checked, while (b) tests the routing paths by taking advantage of deterministic XY routing. Finally, the links to the local resources are covered by (c). Although these test configurations can locate faults in individual connections between the routers and inside the switches, thus achieving high fault coverage, the authors do not mention when this test should be run. Obviously, it is infeasible to run an application during the test, and therefore there is a trade-off between fault detection and performance.

2) Fault Tolerance in the Network Router Layer: Fault tolerance at this level generally has a higher cost compared to that of the transport layer, because of the complexity of the router architecture, and may require additional components or memory for storing routing tables.

An approach for dynamic testing that detects up to 85% of errors in the routing logic, FIFO control paths, and arbiter is presented in [124]. This methodology is implemented in a NoC with a basic packet switch and without considering QoS support.

When a faulty router is detected and localized using one of the fault detection techniques, it must be eliminated (bypassed). One of the strategies to bypass routers with faults is to modify the routing, so that packets can be routed away from the defective router [112]. In [125] a basic XY routing technique is implemented. In order to avoid a faulty router, adaptive routing can be employed. This strategy, however, is not deadlock free for the commonly used wormhole-switching NoCs. The adaptivity of the odd-even turn model is exploited in [126] to bypass rectangular areas. A routing algorithm called Minimal and Adaptive Fault-tolerant Algorithm (MAFA) is presented in [127] to route packets via shortest paths in the presence of faulty links. [112] presents a table that lists and compares fault-tolerant techniques.
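To illustrate the bypassing idea, the sketch below computes a deterministic XY route on a 2D mesh and takes a naive one-hop detour when the next router is marked faulty. It is only an illustration of the concept: unlike the odd-even turn model used in [126] or MAFA [127], it does not guarantee deadlock freedom or minimal paths, and the mesh coordinates and fault set are assumptions.

```python
def xy_route(src, dst, faulty=frozenset()):
    """Deterministic XY routing on a 2D mesh with a naive one-hop detour around a
    faulty router. Illustrative only: a real fault-tolerant router must also enforce
    turn restrictions or use virtual channels to remain deadlock free."""
    path = [src]
    x, y = src
    while (x, y) != dst:
        if x != dst[0]:
            nxt = (x + (1 if dst[0] > x else -1), y)     # move along X first
        else:
            nxt = (x, y + (1 if dst[1] > y else -1))     # then along Y
        if nxt in faulty:                                # bypass: sidestep in Y (or X)
            alt = (x, y + 1) if x != dst[0] else (x + 1, y)
            if alt in faulty or alt in path:
                raise RuntimeError("no simple detour available in this sketch")
            nxt = alt
        path.append(nxt)
        x, y = nxt
    return path

# detour around a faulty router at (2, 0) on the way from (0, 0) to (3, 2)
print(xy_route((0, 0), (3, 2), faulty={(2, 0)}))
```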
A single-fault tolerant (SFT) technique is proposed by [128] for a 2D-mesh NoC, using various deterministic routing algorithms on the same physical channel based on multiple virtual channels. [129] proposes an ant colony optimization-based fault-aware routing (ACO-FAR) algorithm to avoid hotspots around faulty routers. ACO-FAR consists of three steps: (1) detection of fault information (encounter), (2) search for alternative paths (search), and (3) alternative path selection (select).

Despite the existence of fault tolerance techniques at the network router level, there are still some limitations, because they mainly focus on routing packets away from the faulty region, which might add additional traffic to the surrounding areas, consequently generating more congestion zones.

3) Faults in the Physical Layer: The reliability of CMOS devices can be affected by different kinds of physical failures. Generally, these faults can be categorised into four groups: radiation, electrostatic discharge, electromagnetic interference, and ageing.

Radiation: Radioactive impurities in the circuit and packaging materials generate alpha particles, which, together with terrestrial cosmic neutrons, can cause soft errors [109] [110] [111]. A bit flip could happen to one or more bits in a memory due to these particles. This is usually referred to as a Single Event Upset (SEU), but if a particle drives a gate or wire to generate an incorrect voltage level it is called a Single Event Transient (SET).

Electrostatic Discharge: A breakdown of devices could happen due to high electric current, which can enter via the I/O pins or be introduced by strong electric fields. Electrostatic breakdown is classified into three types: (1) dielectric oxide breakdown, (2) PN junction breakdown, and (3) wiring breakdown. However, internal components such as NoCs will rarely be affected, due to the protection at the pins of the ICs [130].

Electromagnetic Interference: The crosstalk between long parallel wires is said to be the main source of electromagnetic interference. With the scaling of the technology the wires become thinner, thus increasing delay and resistance. Consequently, the coupling capacitance and inductance between parallel wires are growing. Moreover, the signal on one wire can influence the one on the adjacent wire, which increases signal delay, glitches, and damped voltage oscillations [131].

Ageing: The performance of CMOS devices decreases over time due to a number of physical effects. For instance, some of the carriers (electrons or holes) can make their way into the insulating silicon oxide layer beneath the gate, in a process known as Hot Carrier Injection (HCI). As a consequence, the switching characteristics of the transistors gradually change, particularly the threshold voltage [132].

Fault tolerance in the physical layer is not explored further in this paper because it does not fall within the scope of the proposed research.

III. CONCLUSION

This paper has presented a literature review of some areas relevant to many-core systems and their applications. It starts by explaining the scaling of transistors and how this is affected by physical limitations, which led to the idea of putting more than one core on the same chip. This hastened the introduction of many-core systems, which provide appealing architectures but also introduce new research challenges. In order to utilize these new architectures, applications are required to be processed in parallel. The fundamental laws of parallel computing are explained together with the various levels of parallelism. The mechanisms for mapping applications onto many-core platforms are discussed and relevant literature is explored. Furthermore, fault-tolerance techniques in the three layers of the NoC are presented. Even though extensive research has already been carried out in the area of many-core systems and on how they can be utilized efficiently, a breakthrough is still needed at different levels of these systems.

ACKNOWLEDGEMENTS

I am thankful to my supervisors Dr. Gianluca Tempesti and Dr. Martin Trefzer, who provided guidance and expertise that greatly assisted the research. I am also grateful to Nizar and Pedro for technical help in understanding and setting up the Graceful platform.

REFERENCES

[1] G. E. Moore, "Cramming more components onto integrated circuits, reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff.," IEEE Solid-State Circuits Society Newsletter, vol. 11, no. 5, pp. 33–35, Sept 2006.
[2] A. Vajda, Programming Many-Core Chips. New York, NY, USA: Springer Science+Business Media, 2011.
[3] A. K. Singh, M. Shafique, A. Kumar, and J. Henkel, "Mapping on multi/many-core systems: Survey of current and emerging trends," in 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), May 2013, pp. 1–10.
[4] J. Torrellas, "How to build a useful thousand-core manycore system?" in 2009 IEEE International Symposium on Parallel Distributed Processing, May 2009, pp. 1–1.
[5] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm," Computer, vol. 35, no. 1, pp. 70–78, Jan 2002.
[6] M. A. Al Faruque, R. Krist, and J. Henkel, "ADAM: Run-time agent-based distributed application mapping for on-chip communication," in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 760–765.
[7] P. Campos, N. Dahir, C. Bonney, M. Trefzer, A. Tyrrell, and G. Tempesti, "XL-STaGe: A cross-layer scalable tool for graph generation, evaluation and implementation," in 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), July 2016, pp. 354–359.
[8] J. Cavazos, "Lecture 1: The Multicore Revolution," Dept. of Computer & Information Sciences, University of Delaware.
[9] J. Dongarra et al., The Sourcebook of Parallel Computing, 1st ed. Elsevier, November 2002.
[10] K. Rupp, "40 years of microprocessor trend data," June 2015.
[11] S. Borkar and A. A. Chien, "The future of microprocessors," Communications of the ACM, vol. 54, no. 5, pp. 67–77, May 2011.
[12] F. N. Najm, "A survey of power estimation techniques in VLSI circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 446–455, Dec 1994.
[13] M. J. Flynn, "Some computer organizations and their effectiveness," IEEE Trans. Comput., vol. 21, no. 9, pp. 948–960, Sep. 1972. [Online]. Available: http://dx.doi.org/10.1109/TC.1972.5009071
[14] M. D. Hill and M. R. Marty, "Amdahl's law in the multicore era," Computer, vol. 41, no. 7, pp. 33–38, Jul. 2008. [Online]. Available: http://dx.doi.org/10.1109/MC.2008.209
[15] J. L. Gustafson, "Reevaluating Amdahl's law," Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988. [Online]. Available: http://doi.acm.org/10.1145/42411.42415
[16] N. J. Gunther, "A new interpretation of Amdahl's law and geometric scalability," https://arxiv.org/abs/cs/0210017, accessed: 2017-07-19.
[17] A. H. Karp and H. P. Flatt, "Measuring parallel processor performance," Commun. ACM, vol. 33, no. 5, pp. 539–543, May 1990. [Online]. Available: http://doi.acm.org/10.1145/78607.78614
[18] James Reinders, Intel Threading Building Blocks: Outfitting C++ for [44] “A 167-processor 65 nm computational platform with per-
Multi-core Processor Parallelism. OReilly Media, July 2007. processor dynamic supply voltage and dynamic clock
[19] A. Vajda, Programming Many-Core Chips, 1st ed. Springer Publishing frequency scaling,” accessed: 2017-07-23. [Online]. Available:
Company, Incorporated, 2011. http://vcl.ece.ucdavis.edu/pubs/2008.06.symp.vlsi/
[20] M. Frigo, C. E. Leiserson, and K. H. Randall, “The [45] “Nvidia tesla :a unified graphics and computing a rchitecture,”
implementation of the cilk-5 multithreaded language,” SIGPLAN http://www.serc.iisc.ernet.in/ vss/courses/PPP.old/GPU/tesla-
Not., vol. 33, no. 5, pp. 212–223, May 1998. [Online]. Available: architecture-ieeemicro.pdf, accessed: 2017-07-23.
http://doi.acm.org/10.1145/277652.277725 [46] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung,
[21] Barbara Chapman; Gabriele Jost; Ruud van der Pas, Using J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C. C. Miao,
OpenMP:Portable Shared Memory Parallel Programming. MIT Press, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks,
2007. D. Khan, F. Montenegro, J. Stickney, and J. Zook, “Tile64 - processor:
[22] “The OpenMP API specification for parallel programming,” A 64-core soc with mesh interconnect,” in 2008 IEEE International
http://www.openmp.org/, accessed: 2017-07-22. Solid-State Circuits Conference - Digest of Technical Papers, Feb 2008,
[23] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel pp. 88–598.
Programming with the Message-passing Interface. Cambridge, MA, [47] D. C. Pham, T. A. rspach, D. Boerstler, M. Bolliger, R. Chaudhry,
USA: MIT Press, 1994. D. Cox, P. Harvey, P. M. Harvey, H. P. Hofstee, C. Johns, J. Kahle,
[24] Jason Sanders and Edward Kandrot, CUDA by Example: An Introduc- A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluszny,
tion to General-Purpose GPU Programming, July 2010. M. Riley, D. L. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock,
[25] “Cuda zone,” https://developer.nvidia.com/cuda-zone, accessed: 2017- S. Weitzel, D. Wendel, and K. Yazawa, “Overview of the architecture,
07-23. circuit design, and physical implementation of a first-generation cell
[26] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, processor,” IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp.
and S. Amarasinghe, “Petabricks: A language and compiler for 179–196, Jan 2006.
algorithmic choice,” in Proceedings of the 30th ACM SIGPLAN [48] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Green-
Conference on Programming Language Design and Implementation, wald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf,
ser. PLDI ’09. New York, NY, USA: ACM, 2009, pp. 38–49. M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe,
[Online]. Available: http://doi.acm.org/10.1145/1542476.1542481 and A. Agarwal, “The raw microprocessor: a computational fabric for
[27] “Petabricks,” http://projects.csail.mit.edu/petabricks/, accessed: 2017- software circuits and general-purpose programs,” IEEE Micro, vol. 22,
07-23. no. 2, pp. 25–35, Mar 2002.
[28] “Streamit,” http://groups.csail.mit.edu/cag/streamit/, accessed: 2017- [49] S. R. Vangal et al., “An 80-tile sub-100-w teraflops processor in 65-nm
07-23. cmos,” IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 29–41,
[29] “All Programmable SoC with Hardware and Software Jan 2008.
Programmability,” accessed: 2017-07-20. [Online]. Available: [50] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey,
https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html M. Mattina, C. C. Miao, J. F. B. III, and A. Agarwal, “On-chip
[30] “”worlds first 1,000-processor chip”,” interconnection architecture of the tile processor,” IEEE Micro, vol. 27,
https://www.ucdavis.edu/news/worlds-first-1000-processor-chip/, no. 5, pp. 15–31, Sept 2007.
accessed: 2017-07-20. [51] C. Ramey, “Tile-gx100 manycore processor: Acceleration interfaces
[31] V. Nollet, P. Avasare, H. Eeckhaut, D. Verkest, and H. Corporaal, and architecture,” in 2011 IEEE Hot Chips 23 Symposium (HCS), Aug
“Run-time management of a mpsoc containing fpga fabric tiles,” IEEE 2011, pp. 1–21.
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, [52] S.-J. Chen, Y.-C. Lan, W.-C. Tsai, and Y.-H. Hu, Communication
no. 1, pp. 24–33, Jan 2008. Centric Design. New York, NY: Springer New York, 2012, pp. 3–13.
[32] J. A. de Oliveira and H. van Antwerpen, The Philips Nexperia Digital
[53] R. Ho, K. W. Mai, and M. A. Horowitz, “The future of wires,”
Video Platform. Boston, MA: Springer US, 2003, pp. 67–96.
Proceedings of the IEEE, vol. 89, no. 4, pp. 490–504, Apr 2001.
[33] “Many-core fabricated chips information page,”
[54] W. J. Dally and B. Towles, “Route packets, not wires: On-
http://vcl.ece.ucdavis.edu/misc/many-core.html, accessed: 2017-07-19.
chip inteconnection networks,” in Proceedings of the 38th
[34] “Intel Core i9-7900X X-series Processor,” accessed: 2017-07-23.
Annual Design Automation Conference, ser. DAC ’01. New
[Online]. Available: http://ark.intel.com/products/123613/Intel-Core-
York, NY, USA: ACM, 2001, pp. 684–689. [Online]. Available:
i9-7900X-X-series-Processor-13-75M-Cache-up-to-4 30-GHz
http://doi.acm.org/10.1145/378239.379048
[35] B. Bohnenstiehl, A. Stillmaker, J. Pimentel, T. Andreas, B. Liu,
[55] Sheng Ma and others, Networks-on-Chip: From Implementations to
A. Tran, E. Adeagbo, and B. Baas, “A 5.8 pj/op 115 billion ops/sec,
Programming Paradigms, 1st ed. Springer, 2015.
to 1.78 trillion ops/sec 32nm 1000-processor array,” in 2016 IEEE
Symposium on VLSI Circuits (VLSI-Circuits), June 2016, pp. 1–2. [56] S. Damaraju, V. George, S. Jahagirdar, T. Khondker, R. Milstrey,
[36] “AMD Radeon R9 Series Gaming Graphics Cards with High- S. Sarkar, S. Siers, I. Stolero, and A. Subbiah, “A 22nm ia multi-
Bandwidth Memory,” accessed: 2017-07-23. [Online]. Available: cpu and gpu system-on-chip,” in 2012 IEEE International Solid-State
http://www.amd.com/en-us/products/graphics/desktop/ Circuits Conference, Feb 2012, pp. 56–57.
[37] “Intel Xeon Phi Processors,” accessed: 2017-07-23. [Online]. Available: [57] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer,
https://www.intel.com/content/www/us/en/support/processors.html and D. Shippy, “Introduction to the cell multiprocessor,” IBM J. Res.
[38] “TILE-Gx72 Processor Product Brief,” accessed: 2017- Dev., vol. 49, no. 4/5, pp. 589–604, Jul. 2005. [Online]. Available:
07-23. [Online]. Available: http://www.mellanox.com/related- http://dl.acm.org/citation.cfm?id=1148882.1148891
docs/prod multi core/PB TILE-Gx72.pdf [58] Intel Xeon Phi Coprocessor, Datasheet, Intel, 2013, rev. 002.
[39] “A clustered manycore processor architecture for embedded and [59] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivakumar, S. W.
accelerated applications,” accessed: 2017-07-23. [Online]. Available: Keckler, and D. Burger, “On-chip interconnection networks of the trips
http://ieee-hpec.org/2013/index htm files/44.pdf chip,” IEEE Micro, vol. 27, no. 5, pp. 41–50, Sept 2007.
[40] M. B. Healy et al., “Design and analysis of 3d-maps: A many-core 3d [60] S. Vangal et al., “An 80-tile 1.28tflops network-on-chip in 65nm cmos,”
processor with stacked memory,” in IEEE Custom Integrated Circuits in 2007 IEEE International Solid-State Circuits Conference. Digest of
Conference 2010, Sept 2010, pp. 1–4. Technical Papers, Feb 2007, pp. 98–589.
[41] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, “Power7: Ibm’s [61] D. Wentzlaff et al., “On-chip interconnection architecture of the tile
next-generation server processor,” IEEE Micro, vol. 30, no. 2, pp. 7– processor,” IEEE Micro, vol. 27, no. 5, pp. 15–31, Sept 2007.
15, March 2010. [62] A. K. Mishra, N. Vijaykrishnan, and C. R. Das, “A case for
[42] “QorIQ P4080 Communications Processor Prod- heterogeneous on-chip interconnects for cmps,” in Proceedings of the
uct Brief,” accessed: 2017-07-23. [Online]. Available: 38th Annual International Symposium on Computer Architecture, ser.
http://cache.freescale.com/files/32bit/doc/prod brief/P4080PB.pdf ISCA ’11. New York, NY, USA: ACM, 2011, pp. 389–400. [Online].
[43] M. Butts, “Synchronization through communication in a massively Available: http://doi.acm.org/10.1145/2000064.2000111
parallel processor array,” IEEE Micro, vol. 27, no. 5, pp. 32–40, Sept [63] William Dally and Brian Towles, Principles and Practices of Intercon-
2007. nection Networks , 1st ed. Elsevier, December 2003.
[64] M. H. Cho, M. Lis, K. S. Shim, M. Kinsy, T. Wen, and S. Devadas, [84] J. Duato, S. Yalamanchili, and N. Lionel, Interconnection Networks: An
“Oblivious routing in on-chip bandwidth-adaptive networks,” in 2009 Engineering Approach. San Francisco, CA, USA: Morgan Kaufmann
18th International Conference on Parallel Architectures and Compila- Publishers Inc., 2002.
tion Techniques, Sept 2009, pp. 181–190. [85] W. Dally and B. Towles, Principles and Practices of Interconnection
[65] L. Chen, R. Wang, and T. M. Pinkston, “Critical bubble scheme: An Networks. San Francisco, CA, USA: Morgan Kaufmann Publishers
efficient implementation of globally aware network flow control,” in Inc., 2003.
2011 IEEE International Parallel Distributed Processing Symposium, [86] P. Kermani, “Virtual cut-through: A new computer communication
May 2011, pp. 592–603. switching technique,” 1979.
[66] Y. Ajima, S. Sumimoto, and T. Shimizu, “Tofu: A 6d mesh/torus [87] W. J. Dally, B. F. Intel, A. N. Chips, and M. Plesiochronous, “The
interconnect for exascale computers,” Computer, vol. 42, no. 11, pp. torus routing chip,” Distributed Computing, pp. 187–196, 1986.
36–40, Nov 2009. [88] T. Moscibroda and O. Mutlu, “A case for bufferless routing in
[67] N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. on-chip networks,” in Proceedings of the 36th Annual International
Giampapa, P. Heidelberger, S. Singh, B. D. Steinmacher-Burow, Symposium on Computer Architecture, ser. ISCA ’09. New
T. Takken, M. Tsao, and P. Vranas, “Blue gene/l torus interconnection York, NY, USA: ACM, 2009, pp. 196–207. [Online]. Available:
network,” IBM J. Res. Dev., vol. 49, no. 2, pp. 265–276, Mar. 2005. http://doi.acm.org/10.1145/1555754.1555781
[Online]. Available: http://dx.doi.org/10.1147/rd.492.0265 [89] M. Hayenga, N. E. Jerger, and M. Lipasti, “Scarab: A single cycle adap-
[68] D. Chen et al., “The ibm blue gene/q interconnection fabric,” IEEE tive routing and bufferless network,” in 2009 42nd Annual IEEE/ACM
Micro, vol. 32, no. 1, pp. 32–43, Jan. 2012. [Online]. Available: International Symposium on Microarchitecture (MICRO), Dec 2009,
http://dx.doi.org/10.1109/MM.2011.96 pp. 244–254.
[69] S. L. Scott and et al., “The cray t3e network: Adaptive routing in a [90] S. A. R. Jafri, Y. J. Hong, M. Thottethodi, and T. N. Vijaykumar,
high performance 3d torus,” 1996. “Adaptive flow control for robust performance and energy,” in 2010
[70] “SpiNNaKer Chip,” accessed: 2017-07-31. [Online]. Available: 43rd Annual IEEE/ACM International Symposium on Microarchitec-
http://apt.cs.manchester.ac.uk/projects/SpiNNaker/SpiNNchip/ ture, Dec 2010, pp. 433–444.
[71] S. S. B. M. A. Gaikwad, “A comparative study of different topologies [91] V. Puente, R. Beivide, J. A. Gregorio, J. M. Prellezo, J. Duato, and
for network-on-chip architecture,” International Journal of Computer C. Izu, “Adaptive bubble router: a design to improve performance in
Applications, March 2013. torus networks,” in Proceedings of the 1999 International Conference
[72] J. Balfour and W. J. Dally, “Design tradeoffs for tiled on Parallel Processing, 1999, pp. 58–67.
cmp on-chip networks,” in Proceedings of the 20th Annual [92] S. Ma, Z. Wang, Z. Liu, and N. E. Jerger, “Leaving one slot empty: Flit
International Conference on Supercomputing, ser. ICS ’06. New bubble flow control for torus cache-coherent nocs,” IEEE Transactions
York, NY, USA: ACM, 2006, pp. 187–198. [Online]. Available: on Computers, vol. 64, no. 3, pp. 763–777, March 2015.
http://doi.acm.org/10.1145/1183401.1183430 [93] “Imec unveils tools to speed design of energy-efficient multi-
[73] P. Kumar, Y. Pan, J. Kim, G. Memik, and A. Choudhary, “Exploring processor soc platforms,” http://embedded-computing.com/news/imec-
concentration and channel slicing in on-chip network router,” in 2009 multi-processor-soc-platforms/, accessed: 2017-07-20.
3rd ACM/IEEE International Symposium on Networks-on-Chip, May [94] J. Ceng et al., “Maps: An integrated framework for mpsoc application
2009, pp. 276–285. parallelization,” in 2008 45th ACM/IEEE Design Automation Confer-
[74] A. K. Mishra, N. Vijaykrishnan, and C. R. Das, “A case for ence, June 2008, pp. 754–759.
heterogeneous on-chip interconnects for cmps,” in Proceedings of the [95] D. Cordes, O. Neugebauer, M. Engel, and P. Marwedel, “Automatic
38th Annual International Symposium on Computer Architecture, ser. extraction of task-level parallelism for heterogeneous mpsocs,” in 2013
ISCA ’11. New York, NY, USA: ACM, 2011, pp. 389–400. [Online]. 42nd International Conference on Parallel Processing, Oct 2013, pp.
Available: http://doi.acm.org/10.1145/2000064.2000111 950–959.
[75] B. Zafar, J. Draper, and T. M. Pinkston, “Cubic ring networks: A [96] R. P. Dick, D. L. Rhodes, and W. Wolf, “Tgff: task graphs for
polymorphic topology for network-on-chip,” in 2010 39th International free,” in Hardware/Software Codesign, 1998. (CODES/CASHE ’98)
Conference on Parallel Processing, Sept 2010, pp. 443–452. Proceedings of the Sixth International Workshop on, Mar 1998, pp.
[76] L. Chen and T. M. Pinkston, “Nord: Node-router decoupling for effec- 97–101.
tive power-gating of on-chip routers,” in 2012 45th Annual IEEE/ACM [97] S. Stuijk, M. Geilen, and T. Basten, “Sdf3: Sdf for free,” in Sixth
International Symposium on Microarchitecture, Dec 2012, pp. 270– International Conference on Application of Concurrency to System
281. Design (ACSD’06), June 2006, pp. 276–278.
[77] L. Huang, Z. Wang, and N. Xiao, “Vbon: Toward efficient on-chip [98] J. Hu and R. Marculescu, “Energy- and performance-aware mapping
networks via hierarchical virtual bus,” Microprocess. Microsyst., for regular noc architectures,” IEEE Transactions on Computer-Aided
vol. 37, no. 8, pp. 915–928, Nov. 2013. [Online]. Available: Design of Integrated Circuits and Systems, vol. 24, no. 4, pp. 551–562,
http://dx.doi.org/10.1016/j.micpro.2012.06.013 April 2005.
[78] W. J. Dally and C. L. Seitz, “Interconnection networks for [99] C. Marcon, A. Borin, A. Susin, L. Carro, and F. Wagner, “Time
high-performance parallel computers,” I. D. Scherson and A. S. and energy efficient mapping of embedded applications onto nocs,”
Youssef, Eds. Los Alamitos, CA, USA: IEEE Computer in Proceedings of the ASP-DAC 2005. Asia and South Pacific Design
Society Press, 1994, ch. Deadlock-free Message Routing in Automation Conference, 2005., vol. 1, Jan 2005, pp. 33–38 Vol. 1.
Multiprocessor Interconnection Networks, pp. 345–351. [Online]. [100] G. Jiang, Z. Li, F. Wang, and S. Wei, “Mapping of embedded
Available: http://dl.acm.org/citation.cfm?id=201173.201227 applications on hybrid networks-on-chip with multiple switching mech-
[79] J. Duato, “A new theory of deadlock-free adaptive routing in wormhole anisms,” IEEE Embedded Systems Letters, vol. 7, no. 2, pp. 59–62, June
networks,” IEEE Transactions on Parallel and Distributed Systems, 2015.
vol. 4, no. 12, pp. 1320–1331, Dec 1993. [101] T. Lei and S. Kumar, “Algorithms and tools for network on chip based
[80] G.-M. Chiu, “The odd-even turn model for adaptive routing,” IEEE system design,” in 16th Symposium on Integrated Circuits and Systems
Transactions on Parallel and Distributed Systems, vol. 11, no. 7, pp. Design, 2003. SBCCI 2003. Proceedings., Sept 2003, pp. 163–168.
729–738, 2000. [102] D. Wu, B. M. Al-Hashimi, and P. Eles, “Scheduling and mapping
[81] B. Fu, Y. Han, J. Ma, H. Li, and X. Li, “An abacus turn model of conditional task graph for the synthesis of low power embedded
for time/space-efficient reconfigurable routing,” in Proceedings of the systems,” IEE Proceedings - Computers and Digital Techniques, vol.
38th Annual International Symposium on Computer Architecture, ser. 150, no. 5, pp. 262–73–, Sept 2003.
ISCA ’11. New York, NY, USA: ACM, 2011, pp. 259–270. [Online]. [103] P. Zipf, G. Sassatelli, N. Utlu, N. Saint-Jean, P. Benoit,
Available: http://doi.acm.org/10.1145/2000064.2000096 and M. Glesner, “A decentralised task mapping approach for
[82] J. Duato, “A necessary and sufficient condition for deadlock-free homogeneous multiprocessor network-on-chips,” Int. J. Reconfig.
adaptive routing in wormhole networks,” IEEE Transactions on Parallel Comput., vol. 2009, pp. 3:1–3:14, Jan. 2009. [Online]. Available:
and Distributed Systems, vol. 6, no. 10, pp. 1055–1067, Oct 1995. http://dx.doi.org/10.1155/2009/453970
[83] S. Ma, N. D. E. Jerger, and Z. Wang, “Whole packet forwarding: [104] E. W. Brião, D. Barcelos, and F. R. Wagner, “Dynamic task allocation
Efficient design of fully adaptive routing algorithms for networks-on- strategies in mpsoc for soft real-time applications,” in Proceedings of
chip,” in HPCA, 2012. the Conference on Design, Automation and Test in Europe, ser. DATE
’08. New York, NY, USA: ACM, 2008, pp. 1386–1389. [Online]. [122] S. Murali, D. Atienza, L. Benini, and G. D. Micheli, “A multi-path
Available: http://doi.acm.org/10.1145/1403375.1403709 routing strategy with guaranteed in-order packet delivery and fault-
[105] W. Quan and A. D. Pimentel, “A hybrid task mapping algorithm tolerance for networks on chip,” in 2006 43rd ACM/IEEE Design
for heterogeneous mpsocs,” ACM Trans. Embed. Comput. Syst., Automation Conference, July 2006, pp. 845–848.
vol. 14, no. 1, pp. 14:1–14:25, Jan. 2015. [Online]. Available: [123] J. Raik, R. Ubar, and V. Govind, “Test configurations for diagnosing
http://doi.acm.org/10.1145/2680542 faulty links in noc switches,” in 12th IEEE European Test Symposium
[106] P. K. F. Hölzenspies, J. L. Hurink, J. Kuper, and G. J. M. Smit, “Run- (ETS’07), May 2007, pp. 29–34.
time spatial mapping of streaming applications to a heterogeneous [124] M. R. Kakoee, V. Bertacco, and L. Benini, “A distributed and topology-
multi-processor system-on-chip (mpsoc),” in Proceedings of the agnostic approach for on-line noc testing,” in Proceedings of the Fifth
Conference on Design, Automation and Test in Europe, ser. DATE ACM/IEEE International Symposium, May 2011, pp. 113–120.
’08. New York, NY, USA: ACM, 2008, pp. 212–217. [Online]. [125] C. Bobda, A. Ahmadinia, M. Majer, J. Teich, S. Fekete, and J. van der
Available: http://doi.acm.org/10.1145/1403375.1403427 Veen, “Dynoc: A dynamic infrastructure for communication in dynam-
[107] M. Mandelli, A. Amory, L. Ost, and F. G. Moraes, “Multi-task ically reconfugurable devices,” in International Conference on Field
dynamic mapping onto noc-based mpsocs,” in Proceedings of the 24th Programmable Logic and Applications, 2005., Aug 2005, pp. 153–158.
Symposium on Integrated Circuits and Systems Design, ser. SBCCI [126] T. F. Pereira, D. R. de Melo, E. A. Bezerra, and C. A. Zeferino,
’11. New York, NY, USA: ACM, 2011, pp. 191–196. [Online]. “Mechanisms to provide fault tolerance to a network-on-chip,” IEEE
Available: http://doi.acm.org/10.1145/2020876.2020920 Latin America Transactions, vol. 15, no. 6, pp. 1034–1042, June 2017.
[108] M. Mandelli, G. Castilhos, G. Sassatelli, L. Ost, and F. G. Moraes, “A [127] M. Ebrahimi, M. Daneshtalab, J. Plosila, and H. Tenhunen, “Mafa:
distributed energy-aware task mapping to achieve thermal balancing Adaptive fault-tolerant routing algorithm for networks-on-chip,” in
and improve reliability of many-core systems,” in Proceedings of 2012 15th Euromicro Conference on Digital System Design, Sept 2012,
the 28th Symposium on Integrated Circuits and Systems Design, ser. pp. 201–207.
SBCCI ’15. New York, NY, USA: ACM, 2015, pp. 13:1–13:7. [128] Z. Zhang, A. Greiner, and S. Taktak, “A reconfigurable routing al-
[Online]. Available: http://doi.acm.org/10.1145/2800986.2800992 gorithm for a fault-tolerant 2d-mesh network-on-chip,” in 2008 45th
[109] C. Constantinescu, “Trends and challenges in vlsi circuit reliability,” ACM/IEEE Design Automation Conference, June 2008, pp. 441–446.
IEEE Micro, vol. 23, no. 4, pp. 14–19, July 2003. [129] H. K. Hsin, E. J. Chang, C. A. Lin, and A. Y. . Wu, “Ant colony
[110] A. Ejlali, B. M. Al-Hashimi, P. Rosinger, and S. G. Miremadi, “Joint optimization-based fault-aware routing in mesh-based network-on-chip
consideration of fault-tolerance, energy-efficiency and performance systems,” IEEE Transactions on Computer-Aided Design of Integrated
in on-chip networks,” in 2007 Design, Automation Test in Europe Circuits and Systems, vol. 33, no. 11, pp. 1693–1705, Nov 2014.
Conference Exhibition, April 2007, pp. 1–6. [130] Alpha and O. Semiconductor, Power Semiconductor Reliability Hand-
[111] D. Bertozzi, L. Benini, and G. D. Micheli, “Error control schemes book, Sunnyvale, CA 94085, U.S.A., 2010.
for on-chip communication links: the energy-reliability tradeoff,” IEEE [131] M. Cuviello, S. Dey, X. Bai, and Y. Zhao, “Fault modeling and
Transactions on Computer-Aided Design of Integrated Circuits and simulation for crosstalk in system-on-chip interconnects,” in 1999
Systems, vol. 24, no. 6, pp. 818–831, June 2005. IEEE/ACM International Conference on Computer-Aided Design. Di-
[112] M. Radetzki, C. Feng, X. Zhao, and A. Jantsch, “Methods gest of Technical Papers (Cat. No.99CH37051), Nov 1999, pp. 297–
for fault tolerance in networks-on-chip,” ACM Comput. Surv., 303.
vol. 46, no. 1, pp. 8:1–8:38, Jul. 2013. [Online]. Available: [132] J. Keane and C. H. Kim, “An odomoeter for cpus,” IEEE Spectrum,
http://doi.acm.org/10.1145/2522968.2522976 vol. 48, no. 5, pp. 28–33, May 2011.
[113] S. Murali, T. Theocharides, N. Vijaykrishnan, M. J. Irwin, L. Benini,
and G. D. Micheli, “Analysis of error recovery schemes for networks
on chips,” IEEE Design Test of Computers, vol. 22, no. 5, pp. 434–442,
Sept 2005.
[114] J. M. Montanana, D. de Andres, and F. Tirado, “Fault tolerance on
nocs,” in 2013 27th International Conference on Advanced Information
Networking and Applications Workshops, March 2013, pp. 138–143.
[115] T. Lehtonen, P. Liljeberg, and J. Plosila, “Analysis of forward
error correction methods for nanoscale networks-on-chip,” in
Proceedings of the 2Nd International Conference on Nano-
Networks, ser. Nano-Net ’07. ICST, Brussels, Belgium, Belgium:
ICST (Institute for Computer Sciences, Social-Informatics and
Telecommunications Engineering), 2007, pp. 3:1–3:5. [Online].
Available: http://dl.acm.org/citation.cfm?id=1459290.1459295
[116] D. Rossi, P. Angelini, and C. Metra, “Configurable error control scheme
for noc signal integrity,” in 13th IEEE International On-Line Testing
Symposium (IOLTS 2007), July 2007, pp. 43–48.
[117] L. Fiorin, L. Micconi, and M. Sami, “Design of fault tolerant network
interfaces for nocs,” in 2011 14th Euromicro Conference on Digital
System Design, Aug 2011, pp. 393–400.
[118] V. Rantala, T. Lehtonen, P. Liljeberg, and J. Plosila, “Multi network
interface architectures for fault tolerant network-on-chip,” in 2009
International Symposium on Signals, Circuits and Systems, July 2009,
pp. 1–4.
[119] L. Fiorin and M. Sami, “Fault-tolerant network interfaces for networks-
on-chip,” IEEE Transactions on Dependable and Secure Computing,
vol. 11, no. 1, pp. 16–29, Jan 2014.
[120] S. Pasricha, Y. Zou, D. Connors, and H. J. Siegel, “Oe+ioe: A
novel turn model based fault tolerant routing scheme for networks-
on-chip,” in Proceedings of the Eighth IEEE/ACM/IFIP International
Conference on Hardware/Software Codesign and System Synthesis,
ser. CODES/ISSS ’10. New York, NY, USA: ACM, 2010, pp. 85–94.
[Online]. Available: http://doi.acm.org/10.1145/1878961.1878979
[121] A. Patooghy and S. G. Miremadi, “Xyx: A power x00026; performance
efficient fault-tolerant routing algorithm for network on chip,” in 2009
17th Euromicro International Conference on Parallel, Distributed and
Network-based Processing, Feb 2009, pp. 245–251.