Image Convolution On FPGAs The Implementation of A multi-FPGA FIFO Structure
Abstract
In this paper, we present an implementation of a
real-time convolver, based on Field Programmable Gate
Arrays (FPGA's) to perform the convolution operations.
Main characteristics of the proposed approach are the
usage of external memory to implement a FIFO buffer
where incoming pixels are stored and the partitioning of
the convolution matrix among several FPGA's, in order to
allow data-parallel computation and to increase the size of
the convolution kernel.

Figure 1. Convolution of a raster-scan image (K=R=3, N=M=5)
1089-6503/98 $10.00 © 1998 IEEE
products.
This approach - quite obvious in dedicated devices - is
not well suited to an FPGA implementation. In fact,
commercially available FPGA devices contain a limited
number of memory elements, since they consist of a few
hundred Configurable Logic Blocks (CLBs), each provided
with one or two flip-flops and a configurable logic network.
Implementing a memory structure on FPGAs means using a lot
of devices, thus wasting their (re)configuration properties
(e.g., with rather large FPGA devices like the Xilinx
XC4010E, a 3x3 convolution matrix applied to a 512x512 image
using 8 bits per pixel needs (512x2+3)x8=1027x8=8216
flip-flops, requiring 11 FPGAs).
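The flip-flop estimate above can be reproduced with a short Python sketch. The function names are illustrative only, and the device capacity of 400 CLBs with 2 flip-flops each is an assumption (the text only speaks of a few hundred CLBs with one or two flip-flops each), chosen because it is consistent with the 11 devices quoted.

```python
import math


def shift_register_flipflops(n_cols, k_rows, r_cols, bits_per_pixel):
    """Flip-flops needed to keep the whole raster-scan shift register on chip.

    The register spans K-1 full image rows plus R pixels of the current row.
    """
    pixels = (k_rows - 1) * n_cols + r_cols
    return pixels * bits_per_pixel


def fpgas_needed(flipflops, clbs_per_fpga=400, ff_per_clb=2):
    # ASSUMPTION: 400 CLBs x 2 flip-flops per device; not stated in the text.
    return math.ceil(flipflops / (clbs_per_fpga * ff_per_clb))


ffs = shift_register_flipflops(n_cols=512, k_rows=3, r_cols=3, bits_per_pixel=8)
print(ffs, fpgas_needed(ffs))
```

With these assumptions the sketch gives 8216 flip-flops and 11 devices, matching the figures in the text.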
In fact, previous works (such as [5] and [6]) have presented
methods for synthesizing FIFO memories using the
flip-flops present inside the CLBs of the Xilinx XC4000
series devices: unfortunately, the number of available
CLBs permits only the implementation of FIFO memories far
too small for our needs. For this reason, the solution we
propose implements the shift registers of Fig. 2 using also
external RAM memory interfaced with the convolution
FPGAs. In particular, pixel values belonging to the
convolution window are stored in flip-flops inside internal
CLBs, while FIFO blocks storing pixels necessary for
subsequent convolution operations are stored in the
external RAM memory.
The implementation of such a FIFO and its possibilities in
terms of speed and kernel size are the main topics of this
paper.
In Sec. 2, we discuss the requirements of the FIFO memory
and the solution that we have implemented for interfacing
it with a single FPGA.
In Sec. 3, we extend FIFO capabilities to allow parallel
processing by several FPGAs, thus relaxing constraints on
the convolution matrix dimension.
In Sec. 4, an experimental prototype written in VHDL and
implemented on a GigaOps G800 Spectrum board containing
8 Xilinx XC4010E devices is discussed.
Finally, possible developments are outlined in Sec. 5.

Figure 3. Subsequent steps in the convolution of a
raster-scan image (K=R=3, N=M=5)

2. The FIFO memory interfaced with a single FPGA

To better understand the requirements of the external
FIFO memory, it is worth referring to Fig. 3.
In order to minimize the number of external memory
accesses necessary to obtain the required FIFO behavior,
we propose an approach based on the following
assumptions:
• each new pixel entering the convolution system (D23
  in Fig. 3-b) is routed both to the FPGA convolver and
  to the external FIFO memory, where it is stored in a
  single write cycle;
• the K-1 pixels belonging to the convolution "column"
  of the new pixel (D03 and D13 in Fig. 3-b) must be
  read in K-1 read cycles in order to be routed to the
  FPGA convolver;
• the remaining (R-1)xK pixels (D01, D02, D11, D12, D21
  and D22 in Fig. 3-b) necessary to perform the
  convolution are stored inside the FPGA.
Summarizing, for each new pixel coming from the video
signal source via the frame grabber, 1 memory write and
K-1 memory reads must be performed to correctly feed
the FPGA convolver.
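The access pattern summarized above (one write and K-1 reads per incoming pixel, with the remaining window columns held on chip) can be illustrated by a small software model. This is a sketch in Python, not the VHDL design: the function name `stream_convolve` is hypothetical, the external memory is modeled as a plain list of (K-1)xN locations for simplicity, K >= 2 is assumed, and outputs are produced only where the whole window lies inside the image.

```python
def stream_convolve(image, kernel):
    """Software model of the streaming convolver (illustrative only)."""
    m_rows, n_cols = len(image), len(image[0])
    k_rows, r_cols = len(kernel), len(kernel[0])
    # ASSUMPTION: external RAM sized to K-1 full rows, used as a circular buffer.
    ram = [0] * ((k_rows - 1) * n_cols)
    # On-chip window: K rows of R pixels, oldest row first, oldest column first.
    window = [[0] * r_cols for _ in range(k_rows)]
    reads = writes = 0
    out = []
    stream = (p for row in image for p in row)  # raster-scan order
    for t, pixel in enumerate(stream):
        y, x = divmod(t, n_cols)
        # K-1 reads: pixels of the same column in the K-1 previous rows.
        column = []
        for j in range(k_rows - 1, 0, -1):
            src = t - j * n_cols
            column.append(ram[src % len(ram)] if src >= 0 else 0)
            reads += 1
        column.append(pixel)
        # 1 write: the new pixel overwrites the oldest location (read just above).
        ram[t % len(ram)] = pixel
        writes += 1
        # Shift the new column into the on-chip window.
        for i in range(k_rows):
            window[i] = window[i][1:] + [column[i]]
        if y >= k_rows - 1 and x >= r_cols - 1:
            out.append(sum(kernel[i][j] * window[i][j]
                           for i in range(k_rows) for j in range(r_cols)))
    return out, reads, writes


ramp = [[5 * y + x for x in range(5)] for y in range(5)]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
result, n_reads, n_writes = stream_convolve(ramp, sobel_x)
print(result, n_reads, n_writes)
```

For a 5x5 image and a 3x3 kernel the model performs 25 writes and 50 reads, i.e. K=3 memory accesses per pixel.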
[Figure: external RAM (addresses 0000 to 0004) and shift registers SHIFT1, SHIFT2, SHIFT3]
This has been accomplished in the FPGA system by
implementing the FIFO memory as a circular buffer
stored in RAM, based on the following elements:
• The input/output clock signal, here called video
  clock, which synchronizes frame grabber and video
  encoder operations. During each cycle of this clock, a
  new pixel enters the convolver and a new complete
  convolution takes place.
• A convolution clock, whose frequency f_conv is K
  times higher than the frequency f_play of the play
  clock (i.e., the video clock). During each cycle of the
  convolution clock, a memory access (either read or
  write) to the external FIFO is performed.
• A write pointer counter_wr, stored inside the FPGA
  convolver, is used to address the first free location
  inside the external FIFO, where the incoming pixel
  has to be stored.
• K-1 read pointers counter_rd(i), also implemented
  inside the FPGA, are used to address the locations of
  the pixels to be read from the external FIFO to
  complete the computation of the convolution window.
Using the above elements, the external FIFO, which requires
a memory able to accommodate at least
(K-2)xN + (N-R+1) + 1 = (K-1)xN - R + 2 pixels, is obtained
as summarized in Fig. 4, where a 3x3 convolution
window and a 5x5 input image are assumed.
Before starting to fill the FIFO, R-1 pixels have to be
shifted into the SHIFT1 register (Fig. 4.(a)). The next
N-R+1 pixels are stored in the FIFO denoted FIFO1 in Fig.
2, filling it until counter_wr reaches location 0002 (Fig.
4.(b)). At this time counter_rd(1) is enabled and starts to
address data from FIFO1 and to shift them into SHIFT2.
After N additional pixels (i.e., a full image row has been
read) counter_rd(0) is enabled in turn and begins to store
data in SHIFT3. The situation is shown in Fig. 4.(c). After
N-R+1 additional pixels, the first convolution window is
available and the FPGA starts the computation using
counter_rd(0) and counter_rd(1) to fetch D02 and D12 from
the FIFO (Fig. 4.(d)).

Figure 5. Multi-FPGA architectural schema

It should be noticed that this approach poses the
following limits on the dimension of the convolution
matrix:
1) a first limit derives from the maximum frequency
   allowed for the convolution clock, used to access the
   external RAM memory to fetch the next "column" of
   pixels. This limit then affects the maximum number
   of rows (thus the value of K) in the convolution
   matrix: referring to the 8.33 MHz of the video clock
   (adopted by the European PAL video standard) and
   using 20 nsec access time RAM (i.e., 50 MHz
   maximum f_conv), this results in a convolution matrix
   of at most 6 rows;
2) a second limit derives from the capacity of the FPGA
   convolver, which, even in case of particularly favorable
   convolution matrices (containing only 0s, 1s and
   power-of-2 coefficients, thus requiring only logical
   operations and shifts - e.g. Sobel, Prewitt and Laplace
   matrices [2] - which can be implemented using few
   CLBs), has to store inside the internal flip-flops the
   KxR pixels to be handled. This limit then affects the
   overall dimension of the convolution matrix.
It is easily shown that the most critical parameter in terms
of capability of handling large convolution kernels is the
number K of rows; moreover, it can be noticed that, for a
given "area" of the convolution matrix, it is more
convenient to minimize K and maximize R in order to
relax clock constraints.
On the basis of the above considerations, we have
devised a multi-FPGA structure that is presented in the
next section.

3. Multi-FPGA solution

When several FPGA devices are available, the most
convenient way of partitioning computation among them
is represented in Fig. 5: n FPGAs are used to process in
parallel n different portions of the convolution window,
and an additional FPGA (the 0-th one in Figure 5) merges
intermediate results performing (usually) simple sum
operations. This would mean subdividing the KxR
convolution matrix into n sub-matrices of size kxr, where
the following relations hold:

$$K = p \times k, \qquad R = q \times r, \qquad n = p \times q. \tag{1}$$

A great deal of work exists on partitioning FPGA
designs (e.g., [7], [8] and [9]); however, given the
considerations expressed in the previous section, the best
choice for the convolution matrix subdivision in our case
is given by
p=n , q=1 ,
which means a set of horizontal "stripes".
It must be noticed however that the second level FIFO
blocks shown in Fig. 5 have to be included to ensure that
the results of each single convolution stripe reach FPGA0
aligned in time. These FIFOs introduce the need for two
additional accesses to external RAM: one to write the
input and one to read the output. In other words, all the
FPGAs but the last one must use two cycles of the
convolution clock to access the second level FIFOs,
which implies two fewer accesses to their first-level FIFO,
which contains the pixel values to be multiplied.
Since the number of allowed accesses to first-level
FIFOs is the upper limit to the number of rows in each
stripe, and since the last FPGA does not need a second
level FIFO, the best subdivision of the convolution matrix
can be obtained as follows.
Referring to Fig. 5, we see that the original K rows
must be partitioned into p stripes of R columns: one
stripe of $k_2$ rows and p-1 stripes of $k_1$ rows. For the
previous considerations, $k_2 = k_1 + 2$. Obviously, to
correctly partition the original K rows,
$k_1 \times (n-1) + k_2 = K$ must hold. From this
equation we can show that

$$k_1 = \frac{K-2}{n}, \qquad k_2 = \frac{K-2}{n} + 2. \tag{2}$$

Delay units in Fig. 5 allow each FPGA to obtain its
own stripe of data.
Besides the advantages already discussed, this
approach allows the convolution matrix to be optimally
mapped onto n FPGAs provided that:
(K-2) mod n = 0
without any other constraint upon R.
To estimate the effectiveness of such an approach, let us
revisit the practical example made at the end of the previous
section. For a 20 nsec access time RAM to be used in a
system working with the European PAL video standard,
we have:
• f_play = 8.33 MHz
• maximum f_conv = 50 MHz
• maximum number k of accesses to external RAM in a
  video clock cycle: ⌊f_conv/f_play⌋ = 6
• maximum K dimension of the convolution window:
  (n-1)x(6-2)+6 = 4n+2
For a system with n=8 FPGAs, using only square kernels,
this means a 34x34 upper limit to the kernel dimension.
If we remove the restriction to square matrices, it is
possible to further increase R up to the space limit given
by the number of available CLBs.

4. Prototypal implementation

In this section, we present a VHDL [10]
implementation of the multi-FPGA structure, discussed in
the previous section, on a multi-FPGA board designed for
rapid prototyping [11].
To this purpose, we first describe the main
characteristics of the prototyping board, with particular
reference to the ones limiting the degrees of freedom in
mapping computation and data paths onto the available
FPGAs. Then, we show our solution and we evaluate the
results.

4.1 Prototypal board

The prototyping board we used is the GigaOps G800
Spectrum board [12], schematically shown in Fig. 6.
In Fig. 6 it is possible to notice the main blocks of this
system.
• The actual computation is performed by pairs of
  Xilinx XC4010E FPGAs, connected in modules
  called XMODs: in Fig. 6, four modules (MOD0 thru
  MOD3) are shown. These FPGAs are named YPGA
  and XPGA (from the name of the bus they are
  connected with). Both these FPGAs have two
  memory ports: one connected only to a 2 MBytes
  DRAM and one connected both to a 2 MBytes
  DRAM and to a 128 KBytes SRAM device. XPGA
  and YPGA communicate through a bus switch on the
  first memory port. This switch works on two virtual
  busses: a 16-bit data bus and a 10-bit address bus. It
  is important to stress that only YPGAs are connected
  to YBUS, i.e. to the input/output data bus.
• A module called SCVIDMOD (S-VIDEO,
  COMPOSITE, VIDEO MODULE), that
  decodes/encodes video signals (PAL or NTSC). This
  module interfaces to YBUS for data input and output.
• An input FPGA (here called VLPGA) connected to
  the VESA local bus of the PC hosting the board. The
  VLPGA is interfaced with the HBUS and the YBUS. It
  contains all the registers needed for correct board
  operation (e.g., the CLKMODE register, that sets
  frequencies of the clocks distributed on the board).
• An output FPGA (here called VMC) connected to
  SCVIDMOD. This is an additional FPGA, directly
  interfaced with the video output and the BUS.
• Three main busses that allow connections among the
  various blocks of the board. These busses are:
  - YBUS, a 32-bit I/O bus connected with VLPGA,
    VMC and the YPGAs of the XMODs.
  - HBUS, a 16-bit bus used to configure and to load
    the FPGAs.
  - XBUS, a 64-bit bus normally used as four 16-bit
    data busses. Each of these busses is connected
    only to the XPGAs of the XMODs and to VMC.
The main data path of our application is the following:
pixels generated by the video decoder are passed through
the YBUS both to VLPGA and to the YPGAs of the
XMODs. These modules process data and pass the results
either to the VMC FPGA through YBUS or to the XPGAs
through the bus switches. In the latter case, the XPGAs
can perform a further computation or simply pass the
results to the VMC FPGA through the 64-bit XBUS. In
both cases, the VMC FPGA outputs the results of its
processing upon data coming from the XBUS or YBUS.

Figure 6. Block diagram of the GigaOps G800 board

4.2 The convolver prototype

A first characteristic that limits the degrees of freedom
of this system is the kind of access to input/output data. In
fact, XPGAs are not connected to the YBUS, therefore
they receive data to be processed only through the
YPGAs. This affects the effectiveness of the
implementation: in particular, it is more difficult to
handle a multi-FPGA architecture that also uses XPGAs
than an architecture using only YPGAs.
Another limit is constituted by the busses that
interconnect XPGAs with YPGAs. These busses are also
used to access the 2 MBytes DRAMs. It is then
extremely difficult to manage and to synchronize both
connections with XPGA and with DRAM.
It is easy to note that the above restrictions make it very
hard and not effective to use both XPGAs and YPGAs to
map computations.
On the contrary, XPGAs become useful to implement
the second level FIFOs of Figure 5: this approach allows,
in fact, the constraints on the maximum number of
accesses defined in the previous section to be relaxed.
Thus, we can perform on each YPGA a convolution with a
kxr kernel.
Referring to the above considerations we decided to
implement our prototype using only YPGAs to perform
convolution. In particular, since our own device has four
XMODs, our multi-FPGA convolver uses the 4 YPGAs to
map computation and the 4 XPGAs to simply implement
second level FIFOs.
Some additional considerations are needed to
completely understand our prototype. Since every
memory access requires first to present the memory
address and to assert control signals (like memory read
and memory write) then to deactivate these control signals
before the next memory access, it is necessary to identify
two timing events for starting and ending each memory
access.
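The sizing rules used so far (FIFO capacity from Sec. 2, access budget per video clock cycle, and the stripe partition of Eq. (2) in Sec. 3) can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and are not part of the VHDL prototype.

```python
import math


def fifo_capacity(k_rows, r_cols, n_cols):
    """Minimum external FIFO size in pixels: (K-1)xN - R + 2."""
    return (k_rows - 1) * n_cols - r_cols + 2


def max_accesses_per_pixel(f_conv_hz, f_play_hz):
    """Memory accesses that fit in one video clock cycle."""
    return math.floor(f_conv_hz / f_play_hz)


def max_kernel_rows(n_fpgas, k_accesses):
    """n-1 stripes of k-2 rows plus one stripe of k rows."""
    return (n_fpgas - 1) * (k_accesses - 2) + k_accesses


def stripe_rows(k_rows, n_fpgas):
    """Stripe heights (k1, k2) from Eq. (2); requires (K-2) mod n = 0."""
    if (k_rows - 2) % n_fpgas != 0:
        raise ValueError("(K-2) must be a multiple of n")
    k1 = (k_rows - 2) // n_fpgas
    return k1, k1 + 2


k = max_accesses_per_pixel(50e6, 8.33e6)   # 20 nsec RAM, PAL video clock
print(k, max_kernel_rows(8, k), stripe_rows(34, 8), fifo_capacity(3, 3, 5))
```

With the values quoted in the text (50 MHz maximum f_conv, 8.33 MHz f_play, n=8) this gives 6 accesses per pixel, a 34-row limit, and stripes of 4 and 6 rows; for K=R=3 and N=5 the FIFO needs 9 locations.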
Figure 7. Behavior of the external FIFO implemented on dual-port memory (K=R=3, N=M=5). Panels include: (b) after N-R+1=3 more pixels; (c) after N=5 more pixels.
Due to board and memory characteristics, this poses an
additional limitation: the maximum-frequency clock
available (33.33 MHz: four times the PAL video standard
f_play) generates an event (either a rising edge or a falling
edge) every 15 nsec. This does not match the 30 nsec
access time of EDO DRAM (present on recent
versions of the G800 board), nor the 60 nsec access time
of DRAM (present on our prototype board), thus forcing us
to reduce the performance by using only some edges of the
clock waveform.
To avoid this problem, strictly related to the type of
board and memory chips available, we exploited the
second RAM port present on the X210 modules to
perform two memory accesses at a time. Clearly, we must
properly alternate memory writes and reads on the two
ports. In Fig. 7 an example is reported.
With this approach, the FPGAs make two memory
accesses per cycle of video clock to each of the two RAM
banks available, allowing us to reach an upper limit of a
16x16 convolution matrix with 30 nsec access time EDO
DRAM and an upper limit of an 8x8 kernel with a 60 nsec
access time DRAM. In fact, in the latter case, FPGAs can
perform only one memory access per cycle of play clock
to each of the two banks.

5. Conclusion and future work

In this paper, we have presented the architecture of a
real-time convolver implemented by using FPGAs. In
particular we have focused on the constraints that limit
the maximum matrix dimension, which is directly related
to the effectiveness of the convolver. We have described a
possible solution to relax these constraints, based on data
parallelism.
In our prototype, we started testing the multi-FPGA
architecture with several 3x3 convolution matrices, e.g.,
Sobel, Prewitt, Kirsch, Laplace and Cantoni operators [2]
[3]. The limited dimension of these convolution matrices
allowed us to perform convolution row-by-row on three
different YPGAs, without requiring first-level FIFO
memories, thus avoiding the two external memory
accesses discussed in the previous section.
Second level FIFOs and final calculations are
performed by the XPGAs and the VMC, respectively.
This choice allowed us to meet the real-time constraints also
on our G800 prototypal board: in fact, even with a 60 nsec
access time DRAM, the dual port used by the XPGAs
allows second-level FIFOs to be implemented.
The next step will be the implementation of an 8x8
convolution matrix, spreading the 8 convolution rows over
4 YPGAs and using again the XPGAs for second
level FIFOs. The fact that each YPGA has to process two
convolution rows implies the implementation of first level
FIFOs, requiring two external memory accesses per
video clock as in the case of the XPGAs (thus still viable
on our G800 board).
Larger convolution matrices can be considered only
using new versions of the G800 board, featuring faster
DRAMs.
A more significant evolution of the present architecture
is the implementation of a series of convolutions. In fact,
the experience with the 3x3 matrices showed that both
YPGA and XPGA are underutilized (25% to 30% of CLBs
not used).
It seems then feasible to use both YPGA and XPGA for
computation and FIFO management, thus allowing us to
perform, on the same image, two subsequent
convolutions. This can be useful to have an initial image
filtering (to enhance signal-to-noise ratio) or to perform
edge detection and computation of the brightness gradient
(as required, for instance, in the Hough transform).

References

[1] D. Buell (editor), "Splash 2: FPGAs in a Custom
Computing Machine", IEEE Computer Society Press, 1996.
[2] Virginio Cantoni, Stefano Levialdi, "La visione delle
macchine - Tecniche e strumenti per l'elaborazione di
immagini e il riconoscimento di forme", Tecniche Nuove,
Milano, 1990.
[3] Vito Cappellini, "Elaborazione numerica delle
immagini", Boringhieri, Torino, 1985.
[4] The Programmable Logic Data Book. Xilinx, San
Jose, CA, 1996.
[5] Peter Alfke, "Synchronous and Asynchronous FIFO
Designs", Xilinx Application Note XAPP 051, September
1996, Version 2.0.
[6] Jazi Eko Istiyanto, "The Formation of Super-cliques
in the Behavioural Synthesis of FIFO RAMs", Tech
Report, Gadjah Mada University.
[7] P. Athanas and L. Abbott, "Addressing the
Computational Requirements of Image Processing with a
Custom Computing Machine: An Overview", in
Proceedings of the 2nd Workshop on Reconfigurable
Architecture, Santa Barbara, CA, April 1995.
[8] Stephen L. Wasson, "FPGA Design: Early
Implications In Partitioning", Integrated System Design
magazine, February 1997.
[9] S. Hauck, "Multi-FPGA Systems", PhD thesis,
Department of Computer Science, University of
Washington, Sep 1995.
[10] IEEE Standard VHDL Language Reference Manual.
IEEE Standards.
[11] FPGA Compiler User Guide v3.5. Synopsys,
Mountain View (USA), Sep 1996.
[12] Giga Operations Spectrum Documentation. Giga
Operations Corporation, Berkeley, CA, 1995.