
MIPRO 2011, May 23-27, 2011, Opatija, Croatia

Implementation Framework for Artificial Neural Networks on FPGA

P. Škoda*, T. Lipić*, Á. Srp**, B. Medved Rogina*, K. Skala* and F. Vajda**
* Ruđer Bošković Institute, Zagreb, Croatia
** Budapest University of Technology and Economics, Budapest, Hungary
{pskoda, tlipic, medved, skala}@irb.hr, {srp.agoston, vajda}@iit.bme.hu

Abstract - In an Artificial Neural Network (ANN) a large number of highly interconnected simple nonlinear processing units work in parallel to solve a specific problem. Parallelism, modularity and dynamic adaptation are three characteristics typically associated with ANNs. Field Programmable Gate Array (FPGA) based reconfigurable computing architectures are well suited to implement ANNs, as one can exploit concurrency and rapidly reconfigure the device to adapt the weights and topologies of an ANN. ANNs are suitable for, and widely used in, various real-life applications. A large portion of these applications are realized as embedded computer systems. With continuous advancements in VLSI technology, FPGAs have become more powerful and power efficient, enabling the FPGA implementation of ANNs in embedded systems. This paper proposes an FPGA ANN framework which facilitates implementation in embedded systems. A case study of an ANN implementation in an embedded fall detection system is presented to demonstrate the advantages of the proposed framework.

I. INTRODUCTION

Artificial neural networks (ANN) are computational models inspired by the biological neural networks of the brain [1]. Processing in the brain is mainly parallel and distributed: information is stored in connections, and thus distributed over the network, and processed in a large number of neurons in parallel. ANNs have found applications in many domains, e.g. signal processing, image analysis, speech recognition, and automation and control systems [2], [3], [4]. The roles of ANNs in the aforementioned applications fall broadly into two categories: pattern recognition and function approximation. In pattern recognition the task is to provide a meaningful categorization of input patterns. In function approximation the network finds a smooth function that approximates the actual mapping between input and output. The majority of ANNs are implemented in software on sequential machines. While this is not a severe limitation, there is much to be gained by implementing ANNs in hardware, especially if the implementation exploits the parallelism inherent in ANNs.

Field Programmable Gate Arrays (FPGA) are integrated circuits designed to be configured by the user after manufacturing [5]. An FPGA contains a matrix of configurable logic blocks and a hierarchy of reconfigurable interconnects that allow the blocks to be connected together. A configurable logic block contains look-up tables, multiplexers and flip-flops which, together with the interconnect, allow complex combinatorial and sequential functions to be performed and a wide variety of digital systems to be implemented. Modern FPGAs also contain specialized memory, arithmetic, and communication blocks which enable more efficient implementations of digital systems. By allowing the implementation of custom computational architectures, FPGAs provide opportunities for exploiting the parallel nature inherent to ANNs.

Implementation of ANNs on FPGAs is not without difficulties [6]. One difficulty stems from the nature of the ANN itself: ANNs are multiplication rich, and large networks can exceed the capacity of an FPGA device. Other difficulties, such as the choice of number representation, the selection of sufficient arithmetic precision, and the implementation of nonlinear functions, stem from the capacity limitations of FPGA devices. All these limitations have to be carefully considered in order to achieve the desired performance of an ANN implemented on an FPGA. A further significant obstacle to using FPGAs for implementing ANNs is the design process itself. It is still virtually impossible to design for an FPGA without knowledge of the low-level details of the design [7]. While high-level tools and languages for FPGA-based design are being developed [8], they are not yet mature enough; to take advantage of the optimization possibilities offered by FPGAs, the design must be optimized at the hardware level.

In this paper we propose a framework for the implementation of ANNs on FPGAs. The framework is intended to enable rapid deployment of pre-trained ANNs on FPGA platforms by relieving the user from dealing with low-level hardware details. It consists of a VHDL library which implements the basic elements of ANNs, an ANN generator written in VHDL, and a configuration generator which generates a configuration package for the ANN generator. The framework supports the multilayer perceptron (MLP) neural network type. Other approaches to implementation frameworks for ANNs on FPGA can be found in [9], [10].

This paper is organized as follows. Fundamentals of ANNs are given in Section II. In Section III, the major issues in hardware implementation of ANNs are reviewed, and the implementation architectures of a single neuron and of an MLP are defined. A case study of framework use is presented in Section IV. Conclusion and future work are given in Section V.
II. ARTIFICIAL NEURAL NETWORKS FUNDAMENTALS

The basic unit of ANNs is the artificial neuron. The fundamental elements of an artificial neuron are: (a) input nodes {x1, x2, ..., xn}, which receive the input signal or pattern, (b) synaptic links with associated weights {w1, w2, ..., wn}, which represent their strengths, and (c) an activation function Φ that relates total synaptic input to the output. An artificial neuron can also have a bias constant b, although it is usually incorporated into the weight vector. The basic structure of the artificial neuron with bias is illustrated in Figure 1.

Figure 1. The basic structure of an artificial neuron with bias

The total synaptic input u of the neuron is given by:

    u = \sum_{i=1}^{n} w_i x_i + b                               (1)

Inputs and weights are normally represented by vectors, making the sum in (1) an inner product. The output activation is given by:

    y = \Phi(u)                                                  (2)

where Φ denotes the activation function of the neuron. Some of the commonly used activation functions for neurons are:

    linear:       f(x) = x                                       (3)
    log-sigmoid:  f(x) = \frac{1}{1 + e^{-x}}                    (4)
    tan-sigmoid:  f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}   (5)
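As a quick numerical illustration of (1) and (2), with values chosen purely for this example, consider a two-input neuron with log-sigmoid activation, x = (0.5, -1.0), w = (0.8, 0.4) and b = 0.2:

    u = \sum_{i=1}^{2} w_i x_i + b = 0.8 \cdot 0.5 + 0.4 \cdot (-1.0) + 0.2 = 0.2

    y = \Phi(u) = \frac{1}{1 + e^{-0.2}} \approx 0.55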

Computation of the inner products for the total synaptic input is one of the most important arithmetic operations in the hardware implementation of an ANN. The inner product is essentially a series of multiplications and additions, and current FPGAs are well suited for such operations. The second most important operation is the computation of activation functions. The structure of an FPGA limits the ways in which these functions can be implemented at reasonable cost. However, high-speed implementations can be achieved if the right choices are made.

There are several types of neural networks, e.g. the multilayer perceptron, self-organizing feature maps and associative-memory networks [6]. The MLP is the network type supported by the implementation framework. An MLP ANN [11] is a feed-forward network consisting of an input layer of nodes, followed by two or more layers of neurons, with the last layer being the output layer. The layers between the input and output layers are referred to as hidden layers. Outputs of neurons in one layer are inputs for the next layer. There are no connections between non-adjacent layers, and no connections between neurons in the same layer. Connections between layers go in only one direction, i.e. there are no feedbacks. The architecture of a three-layer MLP is shown in Figure 2.

Figure 2. Architecture of a three-layer MLP network

III. FPGA IMPLEMENTATION

A. Major Issues in FPGA Implementation of ANN

The two major issues that need to be considered in hardware implementation are parallelism exploitation and computer arithmetic.

There are three kinds of parallelism in an ANN: (a) layer parallelism, which is present in multilayer networks, (b) node parallelism, which corresponds to individual neurons, and (c) weight parallelism, which is present in the computation of the total synaptic input [6]. These different types of parallelism can be traded off against each other, depending on the targeted cost/performance ratio. Since fully parallel implementations are generally not feasible, except for very small ANNs, some sequential processing will be necessary.

The implementation framework we propose exploits node parallelism and can exploit layer parallelism. Each neuron in a layer is placed in hardware, and all neurons in a layer are evaluated simultaneously. Network layers can be pipelined so that each layer can process its own set of inputs in parallel. Weight parallelism is not exploited and, as a consequence, the total synaptic input is computed sequentially.
There are several aspects of computer arithmetic that need to be considered in the process of implementing ANNs on FPGAs. These include data representation, inner product computation, implementation of activation functions, and storage of weights [12].

Data representation is very important, as it can have a major impact on FPGA resource usage. Data can be represented as floating-point or as fixed-point numbers. Floating-point arithmetic has the advantage that it can represent a wider range of numbers than fixed-point for the same number of bits used. However, it also requires significantly more FPGA resources to implement [13]. Since fixed-point arithmetic is sufficient to achieve good ANN performance [14], the hardware implementation uses fixed-point arithmetic.
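For illustration, suppose a 16-bit signed fixed-point format with 8 fractional bits (a Q8.8 format; the paper does not state where its binary points are placed). A stored integer X then encodes:

    x = X \cdot 2^{-8}, \qquad x \in \left[-2^{7},\; 2^{7} - 2^{-8}\right], \qquad \text{resolution} = 2^{-8} \approx 0.0039

For example, X = 320 encodes x = 320/256 = 1.25. A 16-bit floating-point representation would cover a far wider range, but at a much higher logic cost.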
The inner product is essential in the computation of the total synaptic input. It can be implemented fully in parallel, but at a huge cost, since it would require large amounts of FPGA resources. A fully parallel implementation also requires all the input data to be loaded before any computations can be made, which requires extra storage elements. The inner product is therefore computed sequentially, using a multiply-accumulate unit. The sequential implementation also enables carrying out the computations as the data arrives.

Nonlinear activation functions can be very difficult to implement efficiently. The usual methods of implementation are look-up tables (LUT) and piece-wise linear approximations. This framework supports LUT implementations.

Weights can be stored in external memory or in internal FPGA memory. This framework supports storing weights in the FPGA's internal distributed or block memory. Weight storage is implemented as a ROM.
B. FPGA Implementation of MLP

Implementing an ANN starts with the implementation of a neuron. The hardware implementation of a neuron, shown in Figure 3, is split into two parts. The first part is the basic functional unit, which implements the computation of the inner product, and the second part is the activation function look-up table (LUT). The basic functional unit consists of a ROM table which contains the neuron's synaptic weights, and a multiply-accumulate unit which computes the inner product. The computation is carried out sequentially and requires data to be input one element at a time. The weight ROM is addressed by a counter which counts the number of elements received, and in this way ensures that the received element is multiplied with the correct weight. The neuron's bias is included in the computation as a weight w0 which is multiplied with an inserted constant input x0 = 1.

Figure 3. Basic structure of hardware implementation of neuron
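The following VHDL is a minimal sketch of such a basic functional unit, assuming a simple start/valid handshake and the 16-bit input and 14-bit weight widths used later in the case study (whose 266-input first layer is taken as the generic default). The entity name, ports and placeholder weights are our own assumptions, not the framework's actual interface.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity basic_functional_unit is
      generic (
        N_INPUTS : positive := 266  -- inputs per neuron, excluding the bias
      );
      port (
        clk     : in  std_logic;
        start   : in  std_logic;            -- restart accumulation, preload bias
        x_in    : in  signed(15 downto 0);  -- one input element per clock cycle
        x_valid : in  std_logic;
        u_out   : out signed(47 downto 0);  -- total synaptic input u
        u_valid : out std_logic
      );
    end entity basic_functional_unit;

    architecture rtl of basic_functional_unit is
      -- Weight ROM; w(0) is the bias w0, multiplied by the constant input x0 = 1.
      type weight_rom_t is array (0 to N_INPUTS) of signed(13 downto 0);
      constant WEIGHTS : weight_rom_t := (others => to_signed(0, 14));  -- placeholders
      signal acc : signed(47 downto 0) := (others => '0');
      signal cnt : natural range 1 to N_INPUTS := 1;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          u_valid <= '0';
          if start = '1' then
            acc <= resize(WEIGHTS(0), 48);  -- bias insertion: w0 * 1
            cnt <= 1;
          elsif x_valid = '1' then
            -- the counter selects the weight matching the arriving element
            acc <= acc + resize(x_in * WEIGHTS(cnt), 48);
            if cnt = N_INPUTS then
              u_valid <= '1';  -- sum is complete after this clock edge
              cnt     <= 1;
            else
              cnt <= cnt + 1;
            end if;
          end if;
        end if;
      end process;
      u_out <= acc;
    end architecture rtl;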
The activation function LUT is implemented as a ROM which is addressed by the computed total synaptic input. The reason for separating the activation function from the rest of the neuron lies in another opportunity for conserving FPGA resources. This implementation of the neuron computes the inner product sequentially and takes input data as a sequence, one element at a time. As a consequence, the output needs to be serialized for the next layer. Since all neurons in a layer use the same activation function, the activation function LUT can be placed after the output serialization logic. In this way only one activation function LUT is used per layer, instead of one per neuron.
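A per-layer activation LUT of this kind can be sketched as a synchronous ROM. The table size, port names and zero-filled contents below are illustrative assumptions; the real table data would be sampled off-line from the chosen activation function, and the input is assumed already truncated to the table's address width.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity activation_lut is
      generic (
        ADDR_BITS : positive := 10;  -- 1024-entry table, an assumed size
        OUT_BITS  : positive := 16
      );
      port (
        clk   : in  std_logic;
        u_in  : in  signed(ADDR_BITS - 1 downto 0);  -- quantized synaptic input
        y_out : out signed(OUT_BITS - 1 downto 0)    -- activation output
      );
    end entity activation_lut;

    architecture rtl of activation_lut is
      type rom_t is array (0 to 2**ADDR_BITS - 1) of signed(OUT_BITS - 1 downto 0);
      -- Contents would hold samples of the activation function, e.g. log-sigmoid.
      constant ROM : rom_t := (others => (others => '0'));  -- placeholder data
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          -- reinterpret the signed input as an unsigned ROM address
          y_out <= ROM(to_integer(unsigned(std_logic_vector(u_in))));
        end if;
      end process;
    end architecture rtl;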
One layer of an MLP is assembled by using one basic functional unit per neuron, as illustrated in Figure 4, so that the computations of the total synaptic inputs are carried out for all neurons simultaneously. The computation results are loaded into an assembly of parallel-load shift registers and then shifted to the activation function LUT, which gives the neuron outputs. Once the total synaptic inputs are loaded into the shift register assembly, the basic functional units can start taking in a new set of inputs. Computations on the new set of inputs can be carried out simultaneously with the shifting of the old results through the activation function LUT. The neuron and layer architectures described here are similar to the one in [15].

Figure 4. Structure of one layer in MLP
Multilayer networks with this neuron and layer architecture can be built in two ways: with pipelined layers or with sequential layers. With pipelined layers, the network is built simply by cascading the basic layer structure as many times as there are layers in the network, resulting in a pipelined MLP, as sketched below. With sequential layers, the computation results are routed back as a new input. Sequential layers require more complex operation control, but are more compact in terms of FPGA resources.
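A minimal sketch of the pipelined variant follows, assuming a hypothetical mlp_layer entity and omitting control and valid strobes for brevity; the generate-based cascading of one layer structure per network layer is the point being shown.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity mlp_pipelined is
      port (
        clk        : in  std_logic;
        net_input  : in  signed(15 downto 0);
        net_output : out signed(15 downto 0)
      );
    end entity mlp_pipelined;

    architecture rtl of mlp_pipelined is
      constant N_LAYERS : positive := 3;
      type stage_array_t is array (0 to N_LAYERS) of signed(15 downto 0);
      signal stage : stage_array_t;  -- serialized data streams between layers
    begin
      stage(0) <= net_input;  -- the network input feeds the first layer

      -- cascade one basic layer structure per network layer
      gen_layers : for i in 1 to N_LAYERS generate
        layer_i : entity work.mlp_layer  -- assumed layer entity, not shown here
          generic map (LAYER_INDEX => i)
          port map (
            clk   => clk,
            x_in  => stage(i - 1),
            y_out => stage(i)
          );
      end generate gen_layers;

      net_output <= stage(N_LAYERS);  -- the last layer is the output layer
    end architecture rtl;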
The described hardware architectures are implemented in VHDL. The basic functional units and the activation function LUT are implemented as distinct entities and configured using generics. The relevant parameters for configuring the basic functional unit are: the number of inputs; the input, output and weight precisions; the bias; and the weights. The parameters for configuring the activation function LUT are the input and output precisions, and the table data. There are two MLP generators: one for pipelined and one for sequential layers. Both MLP generators are configured through a VHDL package. They use the basic functional units and the activation function LUT as components, and implement a Finite State Machine (FSM) for network operation control. The MLP structures are built by utilizing "generate" statements. Individual components are configured by passing the relevant data from the package to the components. The configuration package file is unique for every network, and independent of the multilayer implementation type. It contains all the information necessary to fully define an MLP: the number of inputs and layers; the number of neurons, data precision, and activation function data for each layer; and the weights for each neuron.
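Such a configuration package might look like the following sketch for a tiny, made-up 2-3-1 network. All names, the flattened weight layout and the integer-coded fixed-point weight values are assumptions of this illustration, not the framework's actual package format.

    package ann_config_pkg is
      constant N_INPUTS : positive := 2;
      constant N_LAYERS : positive := 2;

      type layer_sizes_t is array (1 to N_LAYERS) of positive;
      constant NEURONS_PER_LAYER : layer_sizes_t := (3, 1);

      -- data precisions, in bits
      constant INPUT_BITS  : positive := 16;
      constant WEIGHT_BITS : positive := 14;
      constant OUTPUT_BITS : positive := 16;

      -- flattened weights: one bias followed by the input weights for each
      -- neuron, listed layer by layer (integer-coded fixed-point values)
      type weight_table_t is array (natural range <>) of integer;
      constant WEIGHTS : weight_table_t := (
        -- layer 1: 3 neurons x (1 bias + 2 weights)
        100, -52, 371, -18, 240, -95, 7, -301, 64,
        -- layer 2: 1 neuron x (1 bias + 3 weights)
        12, 88, -130, 45
      );
    end package ann_config_pkg;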
IV. CASE STUDY

Use of the proposed framework is demonstrated through the implementation of an embedded system for assistive independent living for elderly people [16]. A major issue among the elderly is falling accidents. There are several approaches to addressing this problem, ranging from video monitoring to wearable personal alarm devices [17]. One promising new approach is the use of asynchronous temporal contrast (ATC) sensors [18]. ATC sensors sense dynamic scenes: they quantize local relative intensity changes in a scene and generate spike events which appear on the output of the sensor as a stream of digital pixel addresses. The fall detector takes this event stream as its input. It extracts features which are related to falls, and classifies them as a "fall" or "not fall". The classifier in this detector can be a neural network.

The classifying neural network is a multilayer perceptron with two hidden layers and one output layer. The structure of the network is given in Table I. The network receives a new set of features every 10 ms. Since the fall detector is a real-time system, the hardware implementation of its neural network has to meet this timing constraint. To assess the implementation possibilities, the first layer of this neural network has been synthesized, and placed and routed.

TABLE I. FALL DETECTOR NEURAL NETWORK

    Layer        Number of nodes   Activation function
    input        266               -
    1st          176               linear
    2nd          88                tan-sigmoid
    3rd/output   2                 log-sigmoid
The target FPGA for the neural network is a Xilinx Virtex-5 XC5VSX50T, speed grade -1. This device contains 32640 LUT and flip-flop pairs, and 288 DSP48E blocks [19]. LUTs and flip-flops are the basic building blocks for general logic. The DSP48E block is a specialized arithmetic block containing a 25×18-bit multiplier, a 48-bit accumulator and a number of registers for pipelining [20].

The arithmetic precisions for synthesis are: 16 bits for the input, 14 bits for the weights, and 16 bits for the output. With these parameters the multiply-accumulate unit fits into a single DSP48E block. The weight ROMs are implemented using LUTs. Implementation of the activation function LUT for the first layer was not required, since it uses the linear activation function.

The design has been successfully synthesized, and placed and routed, with an 85 MHz operating frequency. A summary of device utilization is given in Table II. The design consumes 176 DSP48E blocks, 2825 flip-flops, and 20197 LUTs. It is clear that, on the targeted device, this network cannot be implemented with pipelined layers. However, the remaining free resources are more than sufficient for the activation function LUTs for the 2nd and 3rd layers, and for the operation control FSM, for an implementation with sequential layers.

TABLE II. FPGA DEVICE UTILIZATION

    Resource    Available   Used    Utilization
    DSP48E      288         176     61%
    Flip-flop   32640       2825    9%
    LUT         32640       20197   62%

Full evaluation of the network with sequential layers requires one clock cycle per input in a layer, one cycle per layer for neuron bias insertion, and three clock cycles per layer for pipeline latency. In this network there are 266, 176 and 88 inputs in the 1st, 2nd and 3rd layers respectively. With 3 bias insertions and 9 extra clock cycles for pipeline latency, this gives a total of 542 clock cycles for one evaluation. At an 85 MHz clock rate this is 6.4 μs per evaluation, which is well under the allowed maximum of 10 ms dictated by the input data rate, meaning that the timing constraint has been met.

In the evaluation of this network a total of 62746 multiply-accumulate operations are executed, with a maximum of 176 operations in one clock cycle. A completely sequential implementation with only one multiply-accumulate unit would need up to 64000 clock cycles to complete one network evaluation. Assuming an 85 MHz clock rate, this would take up to 750 μs, which is still well under the allowed maximum of 10 ms.
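The operation count follows from one multiply-accumulate per input, plus one for the bias, per neuron; the paper's 64000-cycle and 750 μs figures round this up, presumably to allow for control overhead:

    176(266 + 1) + 88(176 + 1) + 2(88 + 1) = 46992 + 15576 + 178 = 62746

    t \approx \frac{62746}{85 \times 10^{6}\ \text{Hz}} \approx 738\ \mu\text{s}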
Implementation of this first layer required little effort from the user's point of view. The low-level hardware details are handled by the generator written in VHDL, and the configuration package file is generated automatically by a MATLAB script which uses MATLAB's neural network structure [21] as its input. However, at this stage of development the framework does not provide automatic generation of the contents of the activation function LUTs. As a consequence, the activation function LUTs must be provided by the user, and their generation may require significant effort.

V. CONCLUSION

In this paper a framework for implementing MLP ANNs on FPGAs has been introduced. The major issues in the implementation of ANNs on FPGAs have been reviewed, and the neuron and layer structures for ANN implementation have been defined. A single neuron is implemented using one multiply-accumulate unit and a weight ROM. The activation function is implemented using a LUT, and is shared by all neurons in a single layer. Two types of multilayer implementation have been defined: pipelined and sequential.

In a case study, use of the framework has been evaluated on an implementation of an ANN developed for a fall detection system. Based on the FPGA device utilization, it was concluded that the sequential type of ANN implementation is more compact in terms of FPGA resources, and is likely to be used more often than the pipelined type. The framework successfully relieves the user from dealing with the low-level details of the hardware implementation, and in this way facilitates the deployment of ANNs in hardware.

In future work, automatic generation of the activation function LUT will be added to the framework. Additionally, a fully sequential architecture option will be developed to enable use of the framework on smaller FPGA devices.

ACKNOWLEDGMENT

This work is supported by the bilateral scientific research project "Reconfigurable embedded systems based assistive applications for elderly people", funded by the Hungarian Ministry of Education and the Ministry of Science, Education and Sports of the Republic of Croatia. The project is in the framework of the joint Croatian-Hungarian cooperation in science and technology between the Budapest University of Technology and Economics (BME) and the Ruđer Bošković Institute (RBI).

REFERENCES

[1] T. M. Mitchell, Machine Learning, The McGraw-Hill Companies, Inc., 1997.
[2] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[3] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[4] C. M. Bishop, "Neural networks and their applications," Review of Scientific Instruments, vol. 65, no. 6, 1994, pp. 1803−1832.
[5] M. L. Chang, "Device Architecture," in S. Hauck and A. DeHon, Eds., Reconfigurable Computing, Morgan Kaufmann, 2008, pp. 3−27.
[6] A. R. Omondi, J. C. Rajapakse and M. Bajger, "FPGA neurocomputers," in A. R. Omondi and J. C. Rajapakse, Eds., FPGA Implementations of Neural Networks, Springer, 2006, pp. 1−36.
[7] S. Kilts, Advanced FPGA Design: Architecture, Implementation, and Optimization, Wiley-IEEE Press, 2007.
[8] B. Holland, M. Vacas, V. Aggarwal, R. DeVille, I. Troxel and A. D. George, "Survey of C-based application mapping tools for reconfigurable computing," Proc. 8th Annual Int. Conf. on Military and Aerospace Programmable Logic Devices (MAPLD 2005), September 2005.
[9] A. Rosado-Muñoz, E. Soria-Olivas, L. Gomez-Chova and J. V. Francés, "An IP core and GUI for implementing multilayer perceptron with a fuzzy activation function on configurable logic devices," Journal of Universal Computer Science, vol. 14, no. 10, 2008, pp. 1678−1694.
[10] D. Ferrer, R. Gonzalez, R. Fleitas, J. Pérez Acle and R. Canetti, "NeuroFPGA – Implementing artificial neural networks on programmable logic devices," in Proc. Design, Automation and Test in Europe Conference and Exhibition, 2004.
[11] S. Marsland, Machine Learning: An Algorithmic Perspective, Chapman & Hall/CRC, 2009, pp. 47−91.
[12] J. Zhu and P. Sutton, "FPGA implementations of neural networks – a survey of a decade of progress," in P. Y. K. Cheung, G. A. Constantinides and J. T. de Sousa, Eds., Field-Programmable Logic and Applications, Lecture Notes in Computer Science, vol. 2778, Springer, 2003, pp. 1062−1066.
[13] K. D. Underwood and K. S. Hemmert, "The implications of floating point for FPGAs," in S. Hauck and A. DeHon, Eds., Reconfigurable Computing, Morgan Kaufmann, 2008, pp. 671−695.
[14] S. Draghici, "On the capabilities of neural networks using limited precision weights," Neural Networks, vol. 15, 2002, pp. 395−414.
[15] A. Canas, E. M. Ortigosa, E. Ros and P. M. Ortigosa, "FPGA implementation of a fully and partially connected MLP," in A. R. Omondi and J. C. Rajapakse, Eds., FPGA Implementations of Neural Networks, Springer, 2006, pp. 271−296.
[16] M. A. Estudillo-Valderrama, L. M. Roa, J. Reina-Tosina and I. Roman-Martinez, "Ambient Assisted Living: A methodological approach," Proc. of Engineering in Medicine and Biology Society (EMBC), 2010, pp. 2155−2158.
[17] N. Noury, A. Fleury, P. Rumeau, A. K. Bourke, G. O. Laighin, V. Rialle and J. E. Lundy, "Fall detection – principles and methods," Proc. EMBS, 2007, pp. 1663−1666.
[18] Á. Srp and F. Vajda, "Possible techniques and issues in fall detection using asynchronous temporal-contrast sensors," Elektrotechnik und Informationstechnik, vol. 127, no. 7−8, 2010, pp. 223−229.
[19] Virtex-5 Family Overview, Xilinx Inc., Data sheet DS100, 2009.
[20] Virtex-5 FPGA XtremeDSP Design Considerations, Xilinx Inc., User guide UG193, 2010.
[21] M. Hudson Beale, M. T. Hagan and H. B. Demuth, Neural Network Toolbox™ 7 User's Guide, The MathWorks Inc., 2010, pp. 2-16 − 2-28. Available: http://www.mathworks.com/help/toolbox/nnet/nnet_product_page.html
