GCD-FPGA-Based-Design Using HDL
Contents
Bibliography
List of Figures
Fig.1 Prime Factorization method for finding the GCD of two integers
Fig.2 Euclidean Algorithm
Fig.3 Simplified Euclidean GCD Algorithm
Fig.4 XC6SLX25 Floor-plan View in PlanAhead
Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead
Fig.6 The Design Strategy Window from Xilinx Project Navigator
Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm
Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP
Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model
Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model
Fig.11 From ASM GCD to Finite State Diagram
Fig.12 The Reduced Finite State Diagram with VHDL Code
Fig.13 The Behavioural Simulation of ASM2FSM Model
Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model
Fig.15 Block Diagram of the “Original” GCD Data-Path
Fig.16 Block Diagram of the Modified GCD Data-Path
Fig.17 The Control Unit (FSM) with VHDL Code
Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3)
Fig.18.b Primitives: CARRY4 Fast Carry-Chain
Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model
Fig.20 GCD with Sum of Absolute Difference (GCDSAD)
Fig.21 Carry-out Generation Functions for SAD
Fig.22 Results in a Chart (FOR_LOOP dominated)
Fig.23 The Area-Delay Product
List of Tables
The main idea of this project is to design a digital circuit that calculates the GCD of two
16-bit unsigned integers using the Euclidean Algorithm and to implement it on a Xilinx
Spartan-6 FPGA using different techniques and architectures. The first attempt was to see how
far the compiler goes with a behavioural loop that represents the Euclidean Algorithm. Because
the tools replicated the hardware inside the loop for every iteration, a massive area of the
FPGA was occupied and the number of iterations was limited. Thus, an RTL behavioural
architecture was implemented, in which only one iteration runs per clock cycle. The compiler
still has freedom over placement and routing with the aid of “Design and Goals Strategies”.
Then, the design was built structurally, by port-mapping all functions of the previous design
as components, to see how the compiler would utilize the FPGA differently from the
behavioural one. The structural model consists of two parts: a GCD data-path unit and a GCD
control unit (FSM). Another version of the structural design was created as an attempt to
adapt the idea of the “Sum of Absolute Difference (SAD)” in order to have only one
subtraction instead of two. Finally, Spartan-6 Primitives and Macros were utilized to reduce
the Area-Delay product of the design, and the optimized GCD with two subtractors proved to
give the minimum Area-Delay product among all the design architectures.
Introduction
Implementing mathematical calculations on hardware platforms such as FPGAs is considerably
more challenging than performing them in a software environment, where the hardware is
already equipped with the data-path and control unit for an almost unlimited number of
algorithms and arithmetic operations. The pain of hardware implementation is repaid in
performance: there is an opportunity to utilize a smaller area, obtain higher speed, consume
less power, or reach a reasonable combination of all of these.
Calculating the greatest common divisor (GCD) is one of the problems that needs a number of
steps in order to be solved correctly. These steps can be transformed into an iterative
algorithm such as the Euclidean algorithm, which makes the computation understandable and
traceable. This section is divided into three parts: a brief mathematical background on
GCD computation, an overview of the Xilinx Spartan-6 FPGA, and an outline of the project.
The greatest common divisor (GCD) of two positive integers is the largest integer that
divides both numbers without a remainder [2]. It is also known as the Greatest Common Factor
(GCF), Greatest Common Measure (GCM), Highest Common Divisor (HCD), or Highest
Common Factor (HCF) [1]. The GCD can be computed by determining the prime factors of both
numbers, then multiplying the common prime factors. In practice, this method is not feasible
for large numbers. (Fig. 1) shows an example of how the prime factorization method works.
Fig.1 Prime Factorization method for finding the GCD of two integers
An efficient method for solving GCD problems is the Euclidean algorithm, which is based
on the fact that the GCD of two numbers divides the remainder of the division between them:
gcd(a, b) = gcd(b, r)
where a = qb + r.
It is an iterative process (Fig. 2):
gcd(a, b) = gcd(b, r1)
gcd(b, r1) = gcd(r1, r2)
Fig.2 Euclidean Algorithm
As division is simply repeated subtraction, it was observed that the GCD of two numbers also
divides their difference [1], which makes the design and implementation of the circuit easier.
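As a worked illustration, the division form and the subtraction form can be traced side by side. The operands 48 and 18 are chosen here only as an example and do not come from the project; in the subtraction form the larger value is replaced by the difference until both values are equal.

```latex
\begin{aligned}
\text{division:}\quad
  & 48 = 2\cdot 18 + 12,\quad 18 = 1\cdot 12 + 6,\quad 12 = 2\cdot 6 + 0
  \;\Rightarrow\; \gcd(48,18) = 6 \\
\text{subtraction:}\quad
  & (48,18)\to(30,18)\to(12,18)\to(12,6)\to(6,6)
  \;\Rightarrow\; \gcd(48,18) = 6
\end{aligned}
```

The subtraction form takes more steps but needs only a comparator and a subtractor, which is what makes it attractive for a hardware implementation.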
From the previous part, the chosen FPGA should have properties that accommodate the design
units efficiently. For example, subtraction/addition may take advantage of dedicated
components in the FPGA slices, such as the ripple carry-chain or the DSP blocks. The Spartan-6
FPGA family from Xilinx provides designers with such components, which help a great deal in
designing the GCD circuit at different levels. “The thirteen-member family delivers
expanded densities ranging from 3,840 to 147,443 logic cells, with half the power consumption
of previous Spartan families, and faster, more comprehensive connectivity,” [4]. (TABLE 1)
shows a feature summary of some devices from this family: the smallest (XC6SLX4), the
largest (XC6SLX150T), and the choice of this project (XC6SLX25), which was the smallest
member of the family able to accommodate the first design, i.e., the reference one. More about this
TABLE 1 column headings: Device | Logic Cells | CLB: Slices, FFs, RAM (kb), LUT6 | DSP Slices | RAM Blocks | User I/O
The Floor-plan views in PlanAhead for the device XC6SLX25 help to illustrate the internal
construction of the chosen device. (Fig. 4) is a full-scale floor-plan view that shows the
device layout, indicating some important elements such as IOB cells,
Fig.4 XC6SLX25 Floor-plan View in PlanAhead (labelled elements: Memory Controller Block, Block RAM Column, Clock Management Tile Column, DSP Column, CLB Cell, IOB Cells)
In (Fig. 5), a closer view of the layout reveals the three different slice types inside the
Configurable Logic Blocks (CLBs) surrounding a DSP block. It is clear from the figure that each
CLB contains two slices: one is a SLICEX and the other is either a SLICEL or a SLICEM.
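Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead (labelled elements: LUT6, Storage, LUTs, DSP, Flip-Flops, Carry-Chain)
Feature              SLICEX   SLICEL   SLICEM
6-Input LUTs            √        √        √
8 Flip-flops            √        √        √
Wide Multiplexers                √        √
Carry Logic                      √        √
Distributed RAM                           √
Shift Registers                           √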
The objective of this project is to design a digital circuit that calculates the GCD of two
16-bit unsigned integers using the Euclidean Algorithm and to implement it on a Xilinx
Spartan-6 FPGA. In this project, the Euclidean GCD circuit was implemented using different
architectures in order to examine the tradeoff between area and speed, i.e., the Area-Delay
product, and to decide which design makes the best use of the dedicated configuration
components inside the FPGA. The first step was to implement a simple behavioural loop,
i.e., a direct interpretation of the Euclidean GCD Algorithm, using a FOR_LOOP to see how
the compiler would represent a large number of iterations. Taking this design as a reference,
the next step was to implement the Euclidean GCD circuit at the following levels:
A. RTL Behavioural level, where the design is simply a Finite State Machine (FSM) that
performs the GCD calculation sequentially, as a lower-level interpretation of the Algorithm.
In this case, the compiler was free to translate the operations into different units/
components, place all these components, and route all the connections with the aid of
B. Structural level, where the data flow of the GCD Algorithm is transformed into an
arithmetic circuit, i.e., a data-path unit (DP), and the iteration process is handled by a simple
control unit (CU), essentially an FSM. The design was built by transforming all
functions, such as comparison, subtraction, and data transfer, into components and port-
mapping them in a top-level entity, to see how the compiler would utilize the FPGA
Difference,” (SAD) circuit has been introduced in order to replace the two subtraction units
with one computation unit. Finally, some functions were designed utilizing Primitives and
Macros, e.g., the ADDSUB macro, with which the calculation unit has been optimized in terms
Description                                                                                | Done
Design the Simple Behavioural Loop and examine its aspects and limitations                 | 100
Design the Behavioural FSM Model and test its features and margins                         | 100
Design the direct Structural Model (DP+CU) and compare it with the behavioural             | 100
Design the Optimized Structural Model (SAD) and compare it with the direct structural      | 100
Get into Primitive Level and utilize the dedicated elements for faster computation         | 100
Report the Area/Delay Comparison between all the architectures and propose suggestions     | 100
In the next section, all of the above stages are presented and discussed in sequence. It
is helpful, at this point, to mention that the target is to obtain the minimum Area-Delay
product, which can be achieved by reducing the area and/or the time delay of the circuit. By
setting the optimization goal to area (Fig. 6), a smaller area of the FPGA is utilized in
order to reduce Area*Delay. At the same time, this might also lead to higher speed, on the
assumption that a smaller area needs fewer hops through the interconnections.
in order to see how the compiler understands loops and how it deals with a large number of
iterations. Then, the results of this design, taken as a reference, motivated trying
different architectures in order to obtain a smaller area and fewer hops through the
interconnections.
Starting with the direct WHILE_LOOP that represents the Euclidean GCD Algorithm
(Fig. 7), the result tells a lot about how the system treats loops. It was clear that the compiler
simply replicates the corresponding circuit for every iteration until the loop ends. It was not
too surprising that the compiler did not synthesize the While statement: the number of
iterations depends on the data and is not known at compile time, so the number of circuit
copies would be unbounded. In hardware, the number of copies must be finite.
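-- A and B are variables holding the two operands
While (A /= B) Loop
  If (A > B) Then
    A := A - B;   -- replace the larger operand by the difference
  Else
    B := B - A;
  End If;
End Loop;
GCD <= B;         -- A = B here, so either one is the result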
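Although such a loop is not synthesizable, it is still legal VHDL for simulation. The following is a minimal sketch of a self-checking testbench built around the same loop; the entity name `gcd_while_tb` and the helper function `gcd_sub` are hypothetical, and both operands are assumed to be non-zero (a zero operand would keep the loop running forever).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity gcd_while_tb is
end entity gcd_while_tb;

architecture sim of gcd_while_tb is
  -- Hypothetical helper: subtraction-based Euclid, for simulation only.
  -- Assumes x and y are non-zero.
  function gcd_sub(x, y : unsigned(15 downto 0)) return unsigned is
    variable a : unsigned(15 downto 0) := x;
    variable b : unsigned(15 downto 0) := y;
  begin
    while a /= b loop
      if a > b then
        a := a - b;
      else
        b := b - a;
      end if;
    end loop;
    return b;
  end function gcd_sub;
begin
  check : process
  begin
    assert gcd_sub(to_unsigned(48, 16), to_unsigned(18, 16)) = 6
      report "gcd(48, 18) should be 6" severity failure;
    assert gcd_sub(to_unsigned(511, 16), to_unsigned(2, 16)) = 1
      report "gcd(511, 2) should be 1" severity failure;
    wait;  -- stop the process after the checks
  end process check;
end architecture sim;
```

In a simulator the loop simply runs as many times as the data requires, which is exactly the property a synthesis tool cannot turn into a fixed amount of hardware.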
Thus, the transition to the finite FOR_LOOP (Fig. 8) was obvious, where the maximum
number of iterations must be defined from the beginning. In fact, determining the number of
iterations before even starting to compute the GCD creates a limitation to the design, with
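-- Loop header restored; the bound of 100 is the value stated for the FOR_LOOP Model
For I In 1 To 100 Loop
  If (A /= B) Then
    If (A > B) Then
      A := A - B;
    Else
      B := B - A;
    End If;
  Else
    GCD <= B;     -- operands are equal: the GCD has been found
  End If;
End Loop;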
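For context, the bounded loop above can be wrapped into a complete combinational entity. This is only a sketch under stated assumptions: the entity name `gcd_for_loop`, the port names, and the generic `MAX_ITER` are hypothetical, and the default output of zero mirrors the behaviour reported for inputs that need more iterations than the bound allows.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity gcd_for_loop is
  generic (MAX_ITER : positive := 100);  -- hypothetical name for the loop bound
  port (
    a_in : in  unsigned(15 downto 0);
    b_in : in  unsigned(15 downto 0);
    gcd  : out unsigned(15 downto 0)
  );
end entity gcd_for_loop;

architecture behavioural of gcd_for_loop is
begin
  process (a_in, b_in)
    variable a, b : unsigned(15 downto 0);
  begin
    a := a_in;
    b := b_in;
    gcd <= (others => '0');  -- stays zero if the operands never become equal
    -- Each pass of the loop is unrolled into its own comparator,
    -- subtractors, and multiplexers by the synthesis tool.
    for i in 1 to MAX_ITER loop
      if a /= b then
        if a > b then
          a := a - b;
        else
          b := b - a;
        end if;
      else
        gcd <= b;
      end if;
    end loop;
  end process;
end architecture behavioural;
```

Because the process contains no clock, the whole computation is one combinational path whose length grows with `MAX_ITER`.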
Before going through the design and implementation results of this model and
proceeding to the other different levels, it is essential to point out that the WHILE_LOOP
Model is a perfect transformation of the Euclidean GCD Algorithm. Therefore, all the
The number of iterations in the FOR_LOOP Model was defined as 100, which means
that for any two numbers that require more than one hundred iterations to compute their GCD
(e.g., 511 and 2), the result will be zero. The behavioural and Post-Route simulations of this
design are shown in (Fig. 9), with the delay for some input examples.
Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model (panels: Behavioural Simulation; Post-Route Simulation)
Furthermore, the complexity of the generated circuit was very high as the system has
converted the loop into a massive number of components. The system has just copied the
comparators, subtractors, and multiplexers a hundred times (i.e., no registers at all). (Fig. 10)
shows RTL, Technology Schematic, and Floor-plan (from PlanAhead) of FOR_LOOP Model.
(TABLE 4) highlights the huge number of units that are mapped to satisfy FOR_LOOP
design requirements, whereas (TABLE 5) summarizes the Synthesis report including Timing.
# DSP 38 0 0.00%
In this model, nothing further can be done except changing the maximum number of
iterations, which affects the performance (i.e., a higher maximum number of iterations gives
poorer latency). In principle the design should be fast (see the behavioural simulation), as it
is a purely parallel design. Yet, the huge circuitry forces signals to hop through many
interconnections. Finally, this model works faster in larger devices such as XC6SLX150.
Recalling the “Simplified Euclidean GCD Algorithm” in (Fig. 3), it can be
considered as an Algorithmic State Machine (ASM) that describes the behaviour of the GCD
circuit. The three-state FSM is then an RTL implementation of the ASM (Fig. 11).
Fig.11 From ASM GCD to Finite State Diagram
Using the basic “States Reduction” rule, S1 => S2. The new FSM with sample of the
(Fig. 13) highlights the Behavioural Simulation results of the ASM2FSM Model, while
(Fig. 14) shows RTL, Technology Schematic, and Floor-plan (from PlanAhead).
Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model
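To make the one-iteration-per-clock idea concrete, the following is a minimal sketch of a clocked GCD state machine. It is not the reduced FSM of Fig. 12: the entity name `gcd_fsm`, the port names, and the two states `IDLE` and `RUN` are assumptions made for illustration, and both operands are assumed to be non-zero.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity gcd_fsm is
  port (
    clk, rst, start : in  std_logic;
    a_in, b_in      : in  unsigned(15 downto 0);
    done            : out std_logic;
    gcd             : out unsigned(15 downto 0)
  );
end entity gcd_fsm;

architecture rtl of gcd_fsm is
  type state_t is (IDLE, RUN);  -- hypothetical state names
  signal state : state_t := IDLE;
  signal a, b  : unsigned(15 downto 0) := (others => '0');
begin
  process (clk, rst)
  begin
    if rst = '1' then
      state <= IDLE;
      done  <= '0';
    elsif rising_edge(clk) then
      case state is
        when IDLE =>
          if start = '1' then   -- load the operands and start iterating
            a     <= a_in;
            b     <= b_in;
            done  <= '0';
            state <= RUN;
          end if;
        when RUN =>
          if a = b then         -- finished: result is held in b
            done  <= '1';
            state <= IDLE;
          elsif a > b then      -- one subtraction per clock cycle
            a <= a - b;
          else
            b <= b - a;
          end if;
      end case;
    end if;
  end process;

  gcd <= b;
end architecture rtl;
```

The registers `a` and `b` reuse the same comparator and subtractors on every clock cycle, so the area no longer depends on the number of iterations; only the latency does.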
Again, (TABLE 6 & 7) highlight the Mapping and Synthesis reports including Timing
TABLE 7: SYNTHESIS AND TIMING REPORT - ASM2FSM GCD VS. FOR_LOOP GCD
There is no comparison between the results that were obtained with the ASM2FSM
model and the FOR_LOOP ones, considering the huge area saving and the ability to
The next step was to build a Data-Path for the computation unit, which could be as
use the CARRY_OUT signals of the subtractors in (Fig. 15) as AGB and ALB signals (Fig. 16).
The “FSM block” in (Fig. 16) refers to the Control Unit (Fig. 17) that drives the Control
signals of the GCD data-path (i.e., Registers’ Enable signals). It is important to note that the
MUXs’ select signals are driven by the signals AGB and AEB directly, whereas for the REGs’
enable signals, smaller MUXs (i.e., 1-bit) were built by the control unit.
In this model, the subtractors are the bottleneck of the design, as they combine
subtraction and comparison at the same time. They must be fast enough for their results to be
ready before the next clock edge. Therefore, the fast CARRY4 primitive (Fig. 18.b), which
utilizes the dedicated Carry-Chain in SliceL and SliceM inside the Spartan-6 FPGA, was adopted
in the design to perform faster subtraction. Furthermore, LUT2 and LUT3 Macros were used
to accommodate some logic functions such as AND, XOR, and multiplexer (Fig. 18.a).
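As an illustration of how a subtractor can be mapped onto the dedicated carry chain, the following sketch chains four CARRY4 primitives to compute A - B as A + not(B) + 1. It assumes the Xilinx UNISIM library is available; the entity name `sub16_carry4` and its port names are hypothetical, and the project's actual wiring (Fig. 18) may differ.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
library unisim;
use unisim.vcomponents.all;

entity sub16_carry4 is
  port (
    a, b   : in  std_logic_vector(15 downto 0);
    diff   : out std_logic_vector(15 downto 0);  -- a - b (modulo 2**16)
    a_ge_b : out std_logic                       -- final carry-out: '1' when a >= b
  );
end entity sub16_carry4;

architecture structural of sub16_carry4 is
  signal prop : std_logic_vector(15 downto 0);  -- propagate: a xor (not b)
  signal co   : std_logic_vector(15 downto 0);  -- carry-out of every bit
begin
  prop <= a xnor b;

  -- First block: the "+1" of the two's complement enters through CYINIT.
  blk0 : CARRY4
    port map (CO => co(3 downto 0), O => diff(3 downto 0),
              CI => '0', CYINIT => '1',
              DI => a(3 downto 0), S => prop(3 downto 0));

  -- Remaining blocks: the carry ripples through the dedicated chain.
  chain : for i in 1 to 3 generate
    blk : CARRY4
      port map (CO => co(4*i+3 downto 4*i), O => diff(4*i+3 downto 4*i),
                CI => co(4*i-1), CYINIT => '0',
                DI => a(4*i+3 downto 4*i), S => prop(4*i+3 downto 4*i));
  end generate chain;

  a_ge_b <= co(15);
end architecture structural;
```

The final carry-out doubles as the comparison result, which is why the subtractor and comparator can share the same hardware in this model.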
The Mapping and Synthesis reports (TABLE 9 & 10) confirm the expected results of the
design: it was clearly both faster and smaller.
TABLE 10: SYNTHESIS AND TIMING REPORT - OPTIMIZED VS. SIMPLE GCD2SUB
The comparison was between two versions of the GCD2SUB: the optimized version
using primitives and macros, and a simple version with high-level components (i.e., “-” for
subtraction, “Select” for the multiplexer, etc.; even the comparator was defined in this version).
It was clear that although the tool is capable of optimizing macros well, the designer can
utilize the dedicated Primitives and Macros for more efficient optimization.
The Sum of Absolute Difference (SAD) replaces the two subtractors using a Carry-Out
Generation Function (Fig. 20 & 21). It was expected to give a better result than GCD2SUB as it uses
Before implementing the primitive version of the GCDSAD circuit (Optimized GCDSAD), there
was an attempt to try a function called (ABS), which does the same job as GCDSAD, in order
to see how the compiler accommodates such a function at the hardware level. Also, the simple
GCDSAD was designed using high-level component definitions. ABS_GCD gave a
significant result in terms of speed, while the simple GCDSAD was a bit better in terms of
area. (TABLE 11 & 12) compare ABSGCD, Simple GCDSAD (SSAD), and Optimized GCDSAD (OSAD).
Macros Statistics        | ABS   | SSAD  | OSAD
# 16-bit Add/Sub/Acc     | 2     | 1     | 0
# Registers              | 33    | 33    | 33
# 2-1 MUX (1, 16-bit)    | 5     | 5     | 3
# XOR                    | 15    | 33    | 0
# DSP                    | 0     | 0     | 0

Time Element (ns)        | ABS   | SSAD  | OSAD
R to R Paths             | 5.40  | 10.92 | 8.40
In to R Paths            | 3.19  | 3.19  | 2.96
R to Out Paths           | 5.80  | 13.11 | 3.63
In to Out Paths          | 0     | 0     | 0
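The single-subtractor idea behind the SAD approach can be sketched behaviourally as follows: a comparison decides which operand is larger, and one subtractor then produces the absolute difference. This is an assumption-laden illustration, not the circuit of Fig. 20 & 21; the entity name `abs_diff16` and its ports are hypothetical, and the comparison is written with `>=` rather than an explicit carry-out generation function.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity abs_diff16 is
  port (
    a, b     : in  unsigned(15 downto 0);
    abs_diff : out unsigned(15 downto 0);  -- |a - b|
    min_val  : out unsigned(15 downto 0)   -- the smaller of a and b
  );
end entity abs_diff16;

architecture behavioural of abs_diff16 is
  signal a_ge_b     : std_logic;
  signal big, small : unsigned(15 downto 0);
begin
  -- Comparison stands in for the carry-out generation function.
  a_ge_b <= '1' when a >= b else '0';

  -- Order the operands so that a single subtractor is sufficient.
  big   <= a when a_ge_b = '1' else b;
  small <= b when a_ge_b = '1' else a;

  abs_diff <= big - small;
  min_val  <= small;
end architecture behavioural;
```

With such a unit, one GCD iteration can be expressed as replacing the pair (a, b) by (|a - b|, min(a, b)), which preserves the GCD in the same way as the two-subtractor version.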
Recalling all the design architectures and their area/time figures, this section reveals
the conclusion in numbers and charts (TABLE 13 & Fig. 22, & Fig. 23).
# Registers 0 33 33 33
(TABLE 14) summarizes the work that has been done and compares all the
versions of the Euclidean GCD design and its implementation on the Xilinx Spartan-6 FPGA.
The overall results show that the optimized GCD2SUB design has the lowest Area-Delay
product among the models in this project, which means it provides fast computation of
the Euclidean GCD Algorithm while saving a great deal of area. Apart from the slow, limited,
and area-consuming FOR_LOOP GCD model, the other architectures were not too far from the
GCD2SUB model, especially the Simple GCD2SUB and the Optimized GCDSAD. However,
GCDSAD could be better than Simple GCD2SUB because of the full control over placement,
which might make its Area-Delay product significantly better. Furthermore, some
components, such as FSM Flip-Flops and MUXs, could be implemented using primitives and
The Euclidean GCD Algorithm design journey has brought great experience: from the
infinite loop to loop limitations, then through the RTL behavioural architecture to the structural
architecture, and on to the optimized design, where Primitives and Macros were utilized to
reduce the Area-Delay product of the design. The pros and cons of the main architectures are:
Behavioural
- Apart from “Design Strategies & Goals,” there is no control at any level over the implemented
circuit or the placement and routing of the design.
✤ It is High Level Coding approach, which is easier to write and manage.
Structural
- The design could be much more complex than behavioural especially with Primitives.
✤ By utilizing Primitives & Macros efficiently, there is gain of full control over the placement.
✤ Having the data-path and control units separated, allows for better optimization.
It is important to note that utilizing primitives and macros efficiently helps to reduce the
hops through interconnections and to maintain a logical and consistent data flow in the design.
For instance, a 16-bit Carry-Look-Ahead Subtractor (CLASub) is assumed to be faster than the
ripple carry. However, utilizing the CARRY4 primitive to benefit from the dedicated Carry-chain,
with the help of the propagation function (i.e., Half-Adder SUM - XOR), gives an Area-Delay
product about 10 times better than using a full CLASub with primitives.
Finally, it is fair to mention that ASM2FSM, Simple GCD2SUB, and Simple GCDSAD were
all also implemented with the use of the DSP enforced as a primitive. The time
delay in all cases was not promising, and the occupied area inside the FPGA was greater than
when using the CLB slices. However, it seems possible to utilize the DSP itself in order
to benefit from its features to perform the whole computation of the Euclidean GCD
Algorithm. This might be a reasonable suggestion for future work related to GCD design on
FPGA, in addition to learning more about the tools and their helpful features.
Bibliography
s/254/ee254l_lab_manual/.
3. Lesson 93 - Example 63: GCD Algorithm - VHDL while Statement [A tutorial on datapaths and state
machines for computing the GCD / While Loops accompanies the book Digital Design Using Digilent FPGA
data_sheets/ds160.pdf.
5. “Spartan-6 FPGA Configurable Logic Block, UG384 (v1.1),” Xilinx, 2010. http://www.xilinx.com/support/documentation/user_guides/ug384.pdf.
6. “Spartan-6 Libraries Guide for HDL Designs, UG615 (v 14.1),” Xilinx, 2012. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.
7. “XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices, UG687 (v 13.4),” Xilinx, 2012. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/xst_v6s6.pdf.
sw_manuals/xilinx14_1/spartan6_hdl.pdf.
10. Devi, R., Singh, J. and Singh, M. (2011). VHDL Implementation of GCD Processor with Built in Self Test
11. C.P, N. and M. Ravi Kumar, K. (2014). Efficient Comparator based Sum of Absolute Differences Architecture
for Digital Image Processing Applications. International Journal of Computer Applications, 96(4), pp.17-24.
12. TechOnlineIndia, (2014). An introduction to FPGA timing analysis [online] Available at http://www.techonlineindia.com/techonline/news_and_analysis/170126/introduction-fpga-timing-analysis.