Implementation of Modulo 2^n-1 Multiplier Using Radix-8 Modified Booth Algorithm
INTRODUCTION
1.1
dynamic range is specific. Special moduli sets have been used extensively to reduce the
hardware complexity of converters and arithmetic operations. Among these, the triple
moduli set {2^n+1, 2^n, 2^n-1} offers some benefits. Since multiplication is of major
importance for almost all kinds of processors, an efficient implementation of
multiplication modulo 2^n-1 is important for applications of the RNS.
1.2
RNS DEFINITION
A residue number system is characterized by a base that is not a single radix but an
1.3 ADVANTAGES
The RNS provides a unique feature of parallelism that makes arithmetic operations such
as addition, subtraction and multiplication easy to handle and perform, increasing speed
and reducing chip area.
Carry free
High-Speed
Parallel Operation
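The carry-free, parallel nature of RNS arithmetic can be illustrated with a small behavioural sketch in Python. It assumes n = 4, so the triple moduli set is {17, 16, 15}; the function names (`to_rns`, `rns_mul`, `from_rns`) are hypothetical and chosen only for this illustration. Each residue channel is multiplied independently, and the result is converted back with the Chinese Remainder Theorem.

```python
from math import prod

def to_rns(x, moduli):
    """Forward conversion: one residue per modulus."""
    return tuple(x % m for m in moduli)

def rns_mul(a, b, moduli):
    """Channel-wise multiplication; no carries pass between channels."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(residues, moduli):
    """Reverse conversion using the Chinese Remainder Theorem."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return x % M

n = 4  # assumed word length for this example
moduli = (2**n + 1, 2**n, 2**n - 1)  # (17, 16, 15), dynamic range 4080
a, b = 45, 67
residues = rns_mul(to_rns(a, moduli), to_rns(b, moduli), moduli)
product = from_rns(residues, moduli)
print(residues, product)
```

The result is only valid while the true product stays below the dynamic range (the product of the moduli), which is why the dynamic range has to be chosen for the application.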
CHAPTER-2
LITERATURE SURVEY
CRYPTO SYSTEM:
There are two different meanings of the word cryptosystem. One is used by the
cryptographic community, while the other is the meaning understood by the public. In
this meaning, the term cryptosystem is used as shorthand for "cryptographic system". A
cryptographic system is any computer system that involves cryptography. Such systems
include for instance, a system for secure electronic mail which might include methods for
digital signatures, cryptographic hash functions, key management techniques, and so on.
Cryptographic systems are made up of cryptographic primitives, and are usually rather
complex.
Typically, a cryptosystem consists of three algorithms: one for key generation, one
for encryption, and one for decryption. The term cipher (sometimes cypher) is often used
to refer to a pair of algorithms, one for encryption and one for decryption. Therefore, the
term "cryptosystem" is most often used when the key generation algorithm is important.
For this reason, the term "cryptosystem" is commonly used to refer to public key
techniques; however both "cipher" and "cryptosystem" are used for symmetric key
techniques.
Public-key cryptography refers to a cryptographic system requiring two separate
keys, one of which is secret and one of which is public. Although different, the two parts
of the key pair are mathematically linked. One key locks or encrypts the plaintext, and
the other unlocks or decrypts the cipher text. Neither key can perform both functions by
itself. Public-key cryptography is a fundamental, important, and widely used
technology. It is an approach used by many cryptographic algorithms and cryptosystems.
It underpins such Internet standards as Transport Layer Security (TLS), PGP, and GPG.
There are three primary kinds of public key systems.
obtain the same results. Grouping starts from the LSB, and the first block only uses two
bits of the multiplier.
the one on its left. Thus adding two n-digit numbers has to take a time proportional to n,
even if the machinery we are using would otherwise be capable of performing many
calculations simultaneously.
The carry-save unit consists of n full adders, each of which computes a single sum
and carry bit based solely on the corresponding bits of the three input numbers. Given the
three n-bit numbers a, b, and c, it produces a partial sum ps and a shift-carry sc:
$ps_i = a_i \oplus b_i \oplus c_i$
$sc_i = (a_i \wedge b_i) \vee (a_i \wedge c_i) \vee (b_i \wedge c_i)$
The entire sum can then be computed by:
1. Shifting the carry sequence sc left by one place.
2. Appending a 0 to the front (most significant bit) of the partial sum sequence ps.
3. Using a ripple carry adder to add these two together and produce the resulting
(n + 1)-bit value.
When adding together three or more numbers, using a carry-save adder followed
by a ripple carry adder is faster than using two ripple carry adders. This is because a
ripple carry adder cannot compute a sum bit without waiting for the previous carry bit to
be produced, and thus has a delay equal to that of n full adders.
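The carry-save step described above can be sketched in Python on whole integers, where the bitwise operators act on all bit positions at once. The names `carry_save_add` and `add3` are hypothetical; the final `+` stands in for the ripple carry adder.

```python
def carry_save_add(a, b, c):
    """One carry-save level: every bit position works independently."""
    ps = a ^ b ^ c                      # partial sum bits
    sc = (a & b) | (a & c) | (b & c)    # shift-carry bits (majority function)
    return ps, sc

def add3(a, b, c):
    """Add three numbers: one carry-save level, then one carry-propagate add."""
    ps, sc = carry_save_add(a, b, c)
    return ps + (sc << 1)               # carries have the weight of the next bit

print(add3(0b1011, 0b0110, 0b1101))
```

Only the last addition propagates carries, which is the reason a carry-save adder followed by a single ripple carry adder is faster than two ripple carry adders in series.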
Advantages:
1. Produces all of its outputs in parallel, with the same delay as a single full adder.
2. Very little propagation delay: a carry-save adder followed by a ripple carry adder has a
delay of about n+1 full adders, whereas two ripple carry adders have a delay of about 2n.
3. Allows for high clock speeds.
Disadvantages:
1. We do not know whether the result is positive or negative.
2. This is a drawback when performing modulo multiplication, since we do not know
whether the intermediate result is greater than or less than the modulus.
CHAPTER-3
ADDERS & BINARY MULTIPLIERS
ADDER
In electronics, an adder is a digital circuit that performs addition of numbers. In
modern computers adders reside in the arithmetic logic unit (ALU) where other
operations are performed. Although adders can be constructed for many numerical
representations, such as Binary-coded decimal or excess-3, the most common adders
operate on binary numbers. In cases where two's complement is being used to represent
negative numbers, it is trivial to modify an adder into an adder-subtractor.
Types of adders
For single bit adders, there are two general types.
A half adder has two inputs, generally labelled A and B, and two outputs, the sum
S and carry C. S is the two-bit XOR of A and B, and C is the AND of A and B. Essentially
the output of a half adder is the sum of two one-bit numbers, with C being the most
significant of these two outputs.
The second type of single bit adder is the full adder. The full adder takes into
account a carry input such that multiple adders can be used to add larger numbers. To
remove ambiguity between the input and output carry lines, the carry in is labelled Ci or
Cin while the carry out is labelled Co or Cout.
Half adder
Inputs   Outputs
A  B     C  S
0  0     0  0
0  1     0  1
1  0     0  1
1  1     1  0
Full adder
Fig 3.2: Inputs: {A, B, Carry In} Outputs: {Sum, Carry Out}
Inputs       Outputs
A  B  Ci     Co  S
0  0  0      0   0
0  0  1      0   1
0  1  0      0   1
0  1  1      1   0
1  0  0      0   1
1  0  1      1   0
1  1  0      1   0
1  1  1      1   1
Note that the final OR gate before the carry-out output may be replaced by an
XOR gate without altering the resulting logic. This is because the only discrepancy
between OR and XOR gates occurs when both inputs are 1; for the adder shown here, one
can check this is never possible. Using only two types of gates is convenient if one
desires to implement the adder directly using common IC chips.
A full adder can be constructed from two half adders by connecting A and B to the inputs
of one half adder, connecting the sum from that to an input of the second half adder,
connecting Ci to the other input, and ORing the two carry outputs. Equivalently, S could be
made the three-bit XOR of A, B, and Ci, and Co could be made the three-bit majority
function of A, B, and Ci. The output of the full adder is the two-bit arithmetic sum of
three one-bit numbers.
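The construction of a full adder from two half adders can be checked with a short Python sketch. The function names are hypothetical, and bits are modelled as the integers 0 and 1.

```python
def half_adder(a, b):
    """Return (sum, carry) of two one-bit inputs."""
    return a ^ b, a & b

def full_adder(a, b, cin):
    """Full adder built from two half adders and an OR gate."""
    s1, c1 = half_adder(a, b)      # first half adder: A + B
    s, c2 = half_adder(s1, cin)    # second half adder: partial sum + Ci
    return s, c1 | c2              # OR the two carry outputs

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            print(a, b, cin, "->", cout, s)
```

The two half-adder carries are never 1 at the same time, which is why the final OR gate can be replaced by an XOR gate without changing the logic.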
BINARY MULTIPLIER
A Binary multiplier is an electronic hardware device used in digital electronics or
a computer or other electronic device to perform rapid multiplication of two numbers in
binary representation. It is built using binary adders.
The rules for binary multiplication can be stated as follows
1. If the multiplier digit is a 1, the multiplicand is simply copied down and
represents the product.
2. If the multiplier digit is a 0 the product is also 0.
To design a multiplier circuit, we need circuitry that can do the following four things:
1. It should be capable of identifying whether a bit is 0 or 1.
2. It should be capable of shifting the partial products left.
3. It should be able to add all the partial products to give the product as the sum of
the partial products.
4. It should examine the sign bits. If they are alike, the sign of the product will be
positive; if the sign bits are opposite, the product will be negative. The sign bit of the
product determined by this rule should be displayed along with the product.
From the above discussion we observe that it is not necessary to wait until all the partial
products have been formed before summing them. In fact the addition of partial product
can be carried out as soon as the partial product is formed.
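The observation that each partial product can be added as soon as it is formed leads to the usual shift-and-add loop. The Python sketch below covers unsigned operands only; the name `shift_add_multiply` and the parameter `n` (the multiplier word length) are assumptions made for this illustration.

```python
def shift_add_multiply(a, b, n):
    """Unsigned shift-and-add multiplication of a by an n-bit multiplier b."""
    product = 0
    for i in range(n):
        if (b >> i) & 1:          # multiplier bit is 1: copy the multiplicand
            product += a << i     # shift it left by i and accumulate at once
        # multiplier bit is 0: the partial product is 0, nothing to add
    return product

print(shift_add_multiply(0b1011, 0b0110, 4))
```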
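Notations:
a: multiplicand
b: multiplier
p: product
Binary multiplication (e.g. n = 4):
p = a · b
a = a_{n-1} a_{n-2} ... a_1 a_0
b = b_{n-1} b_{n-2} ... b_1 b_0
p = p_{2n-1} p_{2n-2} ... p_1 p_0

        x x x x      a
      × x x x x      b
      ---------
        x x x x      b_0 · a · 2^0
      x x x x        b_1 · a · 2^1
    x x x x          b_2 · a · 2^2
  x x x x            b_3 · a · 2^3
  ---------------
  x x x x x x x x    p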
BOOTH MULTIPLIER
The Radix-4 modified Booth algorithm is chosen over the Radix-2 Booth algorithm because,
in Radix-4, the number of partial products is reduced to n/2. Wallace tree multipliers
could also be used, but in that format the multiplier array becomes very large and
requires large numbers of logic gates and interconnecting wires, which makes the chip
design large and slows down the operating speed.
We have finished four cycles, so the answer is shown in the last rows of U and V, which
is 11111000 in binary.
Note: By the fourth cycle, the two algorithms have the same values in the Product
register.
X(i)  X(i-1)  X(i-2) | Partial product
0     0       0      | +0
0     0       1      | +y
0     1       0      | +y
0     1       1      | +2y
1     0       0      | -2y
1     0       1      | -y
1     1       0      | -y
1     1       1      | +0
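The radix-4 Booth recoding can be sketched in Python as follows. This is a behavioural model under stated assumptions: the multiplier x is an n-bit two's complement number with n even, each digit is formed from the overlapping bit triple (x_{2i+1}, x_{2i}, x_{2i-1}) with x_{-1} = 0, and the function names are hypothetical.

```python
def booth_radix4_digits(x, n):
    """Recode an n-bit two's complement multiplier (n even) into n/2 digits
    from the set {-2, -1, 0, +1, +2}."""
    bits = x & ((1 << n) - 1)               # two's complement bit pattern

    def bit(i):
        return 0 if i < 0 else (bits >> i) & 1

    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(n // 2)]

def booth_radix4_multiply(x, y, n):
    """Sum the n/2 partial products d_i * y * 4^i."""
    return sum(d * y * 4 ** i for i, d in enumerate(booth_radix4_digits(x, n)))

print(booth_radix4_digits(2, 4), booth_radix4_multiply(2, -4, 4))
```

With n = 4 only two partial products are needed instead of four, which is the n/2 reduction mentioned for the Radix-4 algorithm.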
CHAPTER-4
PROBLEM IDENTIFICATION
Multipliers are widely used in electronic applications such as digital signal
processing, where they implement algorithms like FIR and IIR filtering. Earlier, the major
challenge for the VLSI designer was to reduce chip area by using efficient optimization
techniques to keep pace with Moore's law. The next phase was to increase the speed of
operation: today's microprocessors execute millions of instructions per second. Speed of
operation is one of the major constraints in designing DSP processors and today's
general-purpose processors. However, area and speed are conflicting constraints, so
improving speed usually results in larger area. Moreover, most of today's commercial
electronic products, such as mobile phones and laptops, are portable and depend on battery
life, so a lot of research is going on to reduce power consumption. This work therefore
looks for a multiplier design that achieves low power consumption, small area and high
speed.
The basic principle of multiplication is to evaluate partial products and accumulate the
shifted partial products. This requires a number of successive additions, so one of the
major components of a multiplier is the adder. Adders can be Ripple Carry, Carry Look
Ahead, Carry Select, Carry Skip and Carry Save [1-3]. A lot of research work has been done
to analyze the performance of different fast adders. The effect of the RCA word-length on
the time complexity of each constituent component of the multiplier is analyzed
qualitatively, and the multiplier delay is shown to be almost linearly dependent on the
RCA word-length. Consequently, the delay of the multiplier can be directly controlled by
the word-length of the RCAs. By means of modulo arithmetic properties, we show that the
compensation constant that negates the effect of the bias introduced in this process can
be precomputed and implemented by direct hardwiring with no delay overhead for all
feasible combinations of the operand width n and the RCA word-length k, and it is shown
that the proposed multiplier lowers the power dissipation compared with the radix-4 Booth
encoded multiplier.
CHAPTER-5
IMPLEMENTATION DETAILS
5.1 MODULAR MULTIPLICATION
Modular arithmetic operations (i.e., inversion, multiplication and exponentiation)
are used in several cryptography applications, such as the decipherment operation of the
RSA algorithm, the Diffie-Hellman key exchange algorithm, elliptic curve cryptography, and
the Digital Signature Standard including the Elliptic Curve Digital Signature Algorithm.
Modular multiplication is the key operation of RSA and other public key
cryptosystems, and so provides an indication of the efficiency of the RNS
implementation. The majority of the currently established public-key cryptosystems
(RSA, Diffie-Hellman, Digital Signature Algorithm (DSA), Elliptic Curves (ECC), etc.)
require modular multiplication in finite fields as their core operation, which accounts
for up to 99% of the time spent on encryption and decryption.
Modular Multiplication in Public Key Cryptosystems
One of the cornerstones of public-key cryptography is modular arithmetic, on
which nearly all established schemes are based. An efficient software implementation of
modular arithmetic is therefore desirable. While modular additions and subtractions are
rather trivial cases, efficient modular multiplication remains an elusive target for
optimization.
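Multiplier: modulo $2^n - 1$
Partial products: reduction modulo $2^n - 1$
Let
$$A = \sum_{k=0}^{n-1} a_k 2^k \qquad (1)$$
Then
$$A \cdot 2^j = \sum_{k=0}^{n-1} a_k 2^{k+j} = \sum_{k=0}^{n-1-j} a_k 2^{k+j} + \sum_{k=0}^{j-1} a_{n-j+k} 2^k \pmod{2^n - 1}$$
Now let
$$B = \sum_{m=0}^{n-1} b_m 2^m \qquad (4)$$
$$AB = \sum_{m=0}^{n-1} b_m \cdot A \cdot 2^m \pmod{2^n - 1}$$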
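The identity for A·2^j says that multiplying by a power of two modulo 2^n-1 is a circular left shift of the n-bit word, so every partial product of the modulo multiplier is already n bits wide. A minimal Python check, assuming n = 8 and using the hypothetical helper names `cls` and `mod_mul_shift_add`:

```python
def cls(a, j, n):
    """Circular left shift of the n-bit word a by j positions."""
    j %= n
    mask = (1 << n) - 1
    return ((a << j) | (a >> (n - j))) & mask

def mod_mul_shift_add(a, b, n):
    """A*B mod (2^n - 1) as a sum of circularly shifted copies of A."""
    m = (1 << n) - 1
    total = sum(cls(a, i, n) for i in range(n) if (b >> i) & 1)
    return total % m

n = 8
m = (1 << n) - 1
# The all-ones word also represents zero, so results are compared modulo m.
ok = all(cls(a, j, n) % m == (a << j) % m
         for a in range(1 << n) for j in range(n))
print(ok, mod_mul_shift_add(200, 123, n), (200 * 123) % m)
```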
III. The third group of multipliers handles medium to large values of moduli but uses
mainstream arithmetic components that have been developed beforehand, thus facilitating
the job of the hardware designer by reducing the overall project lifespan. These
components could be regular binary multipliers, adders, subtractors, logic components
and small size ROM architectures.
X_{i+2}  X_{i+1}  X_i | Partial product
0        0        0   | 0Y
0        0        1   | +1Y
0        1        0   | +1Y
0        1        1   | +2Y
1        0        0   | -2Y
1        0        1   | -1Y
1        1        0   | -1Y
1        1        1   | 0Y
$$X = \sum_{i=0}^{(n/2)-1} D_i \cdot 4^i \qquad (6)$$
Quartet values | Signed digit values
0000 | 0
0001 | +1
0010 | +1
0011 | +2
0100 | +2
0101 | +3
0110 | +3
0111 | +4
1000 | -4
1001 | -3
1010 | -3
1011 | -2
1100 | -2
1101 | -1
1110 | -1
1111 | 0
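The quartet table can be reproduced from a single weighted sum of the four overlapping bits. The Python sketch below assumes an unsigned n-bit multiplier that is zero-extended above bit n-1, with bit -1 taken as 0; the function names are hypothetical.

```python
def quartet_to_digit(b3, b2, b1, b0):
    """Signed digit of one quartet (b3 is the most significant bit)."""
    return -4 * b3 + 2 * b2 + b1 + b0

def booth_radix8_digits(y, n):
    """Recode an unsigned n-bit multiplier into floor(n/3)+1 radix-8 digits."""
    def bit(i):
        return (y >> i) & 1 if 0 <= i < n else 0

    return [quartet_to_digit(bit(3 * i + 2), bit(3 * i + 1), bit(3 * i), bit(3 * i - 1))
            for i in range(n // 3 + 1)]

# Rebuild the table: quartet value -> signed digit value
for q in range(16):
    bits = [(q >> k) & 1 for k in (3, 2, 1, 0)]
    print(format(q, "04b"), quartet_to_digit(*bits))

digits = booth_radix8_digits(0b10110111, 8)
print(digits, sum(d * 8 ** i for i, d in enumerate(digits)))
```

The digits +3 and -3 are the ones that require the hard multiple 3Y, which cannot be obtained by shifting alone.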
Here we have an odd multiple of the multiplicand, 3Y, which is not immediately
available. To generate it we need to perform an addition beforehand: 2Y+Y=3Y. However, we
are designing a multiplier for a specific purpose, and the multiplicand therefore belongs
to a previously known set of numbers which are stored in a memory chip. We have tried to
take advantage of this fact to ease the bottleneck of the radix-8 architecture, that is,
the generation of 3Y. In this manner we try to attain a better overall multiplication
time, or at least one comparable to the time we could obtain using the radix-4
architecture (with the additional advantage of using fewer transistors). To generate 3Y
with 21-bit words we only have to add 2Y+Y, that is, to add the number to the same number
shifted one position to the left, obtaining in this way a new 23-bit word, as shown in
figure 4:
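Let $X = \sum_{i=0}^{n-1} x_i 2^i$ and $Y = \sum_{i=0}^{n-1} y_i 2^i$ represent the
multiplicand and the multiplier of the modulo $2^n-1$ multiplier, respectively. The
radix-8 Booth encoding algorithm can be viewed as a digit set conversion of four
consecutive overlapping multiplier bits $y_{3i+2}\,y_{3i+1}\,y_{3i}\,(y_{3i-1})$ to a
signed digit $d_i$, $d_i \in [-4, 4]$, for $i = 0, 1, \ldots, \lfloor n/3 \rfloor$:
$$d_i = y_{3i-1} + y_{3i} + 2y_{3i+1} - 4y_{3i+2} \qquad (1)$$
where $y_{-1} = y_n = y_{n+1} = y_{n+2} = 0$.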
TABLE 5: Modulo-Reduced Multiples for the Radix-8 Booth Encoding
d_i | $|d_i \cdot X|_{2^n-1}$
+0  | 0...0
+1  | $X$
+2  | CLS($X$, 1)
+3  | $|+3X|_{2^n-1}$
+4  | CLS($X$, 2)
-0  | 1...1
-1  | $\bar{X}$
-2  | CLS($\bar{X}$, 1)
-3  | $|-3X|_{2^n-1}$
-4  | CLS($\bar{X}$, 2)
Table 5 summarizes the modulo-reduced multiples of X for all possible values of the
radix-8 Booth encoded multiplier digit, $d_i$, where CLS(X, j) denotes a
circular-left-shift of X by j bit positions. Three unique properties of modulo $2^n-1$
arithmetic that will be used for simplifying the combinatorial logic circuit of the
proposed modulo multiplier design are reviewed here.
Of all possible two-operand adder implementations, the RCA has indubitably the
least area and dynamic power dissipation. The addends $|X|_{2^n-1}$ and $|2X|_{2^n-1}$ are
added with carry propagation through full adders (FAs), and the end-around-carry addition
is realized with carry propagation through half adders.
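The modulo-reduced multiples can be modelled in Python with three operations: circular left shift, bitwise complement and an end-around-carry addition for the hard multiple 3X. This is a behavioural sketch with hypothetical function names; negative zero is passed as a separate flag because an integer digit cannot carry it.

```python
def cls(x, j, n):
    """Circular left shift of the n-bit word x by j positions."""
    j %= n
    mask = (1 << n) - 1
    return ((x << j) | (x >> (n - j))) & mask

def add_eac(a, b, n):
    """n-bit addition with the carry-out added back in (end-around carry)."""
    t = a + b
    return (t & ((1 << n) - 1)) + (t >> n)

def reduced_multiple(magnitude, negative, x, n):
    """Modulo-reduced multiple for a digit of the given magnitude (0..4)."""
    mask = (1 << n) - 1
    positive = {
        0: 0,
        1: x,
        2: cls(x, 1, n),
        3: add_eac(x, cls(x, 1, n), n),   # hard multiple: X + 2X
        4: cls(x, 2, n),
    }[magnitude]
    # A negative multiple is the bitwise complement of the positive one.
    return positive ^ mask if negative else positive

n, x = 8, 0b10010110
for mag in range(5):
    print(mag, format(reduced_multiple(mag, False, x, n), "08b"),
          format(reduced_multiple(mag, True, x, n), "08b"))
```

Because the all-ones word is congruent to zero modulo 2^n-1, the complemented multiples are congruent to the negated ones without any correction term.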
stages of the modulo $2^n-1$ multiplier. Hence, this approach for hard multiple generation
can no longer categorically ensure that the multiplication in the modulo $2^n-1$ channel
still falls in the noncritical path of an RNS multiplier.
PROPOSED RADIX-8 BOOTH ENCODED MODULO 2^n-1 MULTIPLIER DESIGN
To ensure that the radix-8 Booth encoded modulo multiplier does not constitute
the system critical path of a high-DR moduli set based RNS multiplier, the carry
propagation length in the hard multiple generation should not exceed n bits. To this end,
the carry propagation through the HAs in Fig. 5 can be eliminated by making the
end-around-carry bit c7 a partial product bit to be accumulated in the CSA tree. This
technique reduces the carry propagation length to n bits by representing the hard multiple
as a sum and a redundant end-around-carry bit pair.
GENERATION OF PARTIALLY-REDUNDANT HARD MULTIPLE
Let $|X|_{2^n-1}$ and $|2X|_{2^n-1}$ be added by a group of M = n/k k-bit RCAs such that
there is no carry propagation between the adders. Fig. 6 shows this addition for n=8 and
k=4.
where the sum and carry-out bits from the RCA block are represented as
respectively. In Fig. 6, the carry-out of RCA 0 is not propagated to the carry input of
RCA 1 but preserved as one of the partial product bits to be accumulated in the CSA tree.
The binary weight of the carry-out of RCA 1 has, however, exceeded the maximum range of the
modulus and has to be modulo reduced before it can be accumulated by the CSA tree.
From Fig., the partially-redundant form of $|+3X|_{2^n-1}$ is given by the partial-sum
and partial-carry pair (S, C) where
(5)
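The block-wise addition can be modelled in Python as follows. The sketch assumes n is a multiple of k, leaves out the bias B, and uses hypothetical names; each k-bit block is added on its own, and the block carry-outs are collected in a separate word C instead of being propagated. The carry-out of the last block has weight 2^n, which is congruent to 2^0 modulo 2^n-1, so it wraps around to bit position 0.

```python
def cls(x, j, n):
    """Circular left shift of the n-bit word x by j positions."""
    j %= n
    mask = (1 << n) - 1
    return ((x << j) | (x >> (n - j))) & mask

def hard_multiple_partially_redundant(x, n, k):
    """Return (S, C) with S + C congruent to 3X modulo 2^n - 1."""
    assert n % k == 0, "sketch assumes n is a multiple of k"
    block_mask = (1 << k) - 1
    x2 = cls(x, 1, n)                      # 2X modulo 2^n - 1
    s = c = 0
    for j in range(n // k):                # one k-bit RCA per block
        t = ((x >> (k * j)) & block_mask) + ((x2 >> (k * j)) & block_mask)
        s |= (t & block_mask) << (k * j)   # k sum bits of RCA j
        c |= (t >> k) << ((k * (j + 1)) % n)  # carry-out, modulo reduced
    return s, c

s, c = hard_multiple_partially_redundant(0b11010110, 8, 4)
print(format(s, "08b"), format(c, "08b"), (s + c) % 255, (3 * 0b11010110) % 255)
```

The longest carry chain is k bits instead of n, at the price of n/k extra carry bits that the CSA tree has to accumulate.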
is
(6)
$$B = \sum_{j=0}^{M-1} 2^{kj} = \underbrace{0\cdots01}_{k\text{ bits}}\,\cdots\,\underbrace{0\cdots01}_{k\text{ bits}} \qquad (7)$$
The addends for the computation of the biased hard multiple, $|B+3X|_{2^n-1}$, in a
partially-redundant form are $|X|_{2^n-1}$, $|2X|_{2^n-1}$ and B, or equivalently S, C and
B. Since B is chosen to be a binary word that has logic ones at bit positions $2^{kj}$ and
logic zeros at other bit positions, $|B+3X|_{2^n-1}$ can be generated by simple XNOR and OR
operations on the bits of S and C at bit positions $2^{kj}$. Fig. 7 illustrates how these
bits in the sum and the carry outputs of RCA 0 and RCA 1 are modified. In general,
$|B+3X|_{2^n-1}$ is given by the partial-sum and partial-carry pair (BS, BC) such that
.. (8)
Where
(9)
And
... (10)
for j = 0, 1, ..., M-1.
Let
.. (11)
modulo 2n-1 is |
Fig. 9 illustrates the partial product matrix of $|X \cdot Y|_{2^8-1}$ with
$\lfloor n/3 \rfloor + 1$ partial products in partially-redundant representation. Each
$PP_i$ consists of an n-bit vector, $pp_{i7}, \ldots, pp_{i1}, pp_{i0}$, and a vector of
n/k = 2 redundant carry bits, $q_{i1}, q_{i0}$. Since $q_{i0}$ and $q_{i1}$ are the
carry-out bits of the RCAs, they are displaced by k bit positions for a given $PP_i$. The
bit $q_{ij}$ is displaced circularly to the left of $q_{(i-1)j}$ by 3 bits, i.e., $q_{20}$
and $q_{21}$ are displaced circularly to the left of $q_{10}$ and $q_{11}$ by 3 bits,
respectively, and $q_{10}$ and $q_{11}$ are in turn displaced to the left of $q_{00}$ and
$q_{01}$ by 3 bits, respectively. The last partial product in Fig. 9 is the Compensation
Constant (CC) for the bias introduced in the partially-redundant representation.
The generation of the modulo-reduced partial products, $PP_0$, $PP_1$, and $PP_2$, in a
partially-redundant representation using Booth Encoder (BE) and Booth Selector (BS)
blocks is illustrated in Fig. 10. The BE block produces a signed one-hot encoded digit
from adjacent overlapping multiplier bits as illustrated in Fig. 11(a). The signed one-hot
encoded digit is then used to select the correct multiple to generate $PP_i$. A bit-slice
of the radix-8 BS for the partial product bit, $pp_{ij}$, is shown in Fig.
As the bit positions of $q_{ij}$ do not overlap, as shown in Fig., they can be merged into
a single partial product for accumulation. The merged partial products, $PP_i$, and the
constant CC are accumulated using a CSA tree with end-around-carry addition at each CSA
level and a final two-operand modulo $2^n-1$ adder as shown in Fig.
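The overall data flow (Booth encoding, multiple selection, circular shift by 3i positions, accumulation with end-around carry) can be summarised in a behavioural Python model. It is a sketch under simplifying assumptions: the hard multiple is formed with a plain end-around-carry addition rather than the partially-redundant form, so no compensation constant is needed, and the accumulation is sequential instead of a CSA tree. All function names are hypothetical.

```python
def cls(x, j, n):
    """Circular left shift of the n-bit word x by j positions."""
    j %= n
    mask = (1 << n) - 1
    return ((x << j) | (x >> (n - j))) & mask

def add_eac(a, b, n):
    """n-bit addition with end-around carry."""
    t = a + b
    return (t & ((1 << n) - 1)) + (t >> n)

def booth_radix8_digits(y, n):
    """Radix-8 Booth digits of the unsigned n-bit multiplier y."""
    def bit(i):
        return (y >> i) & 1 if 0 <= i < n else 0
    return [bit(3 * i - 1) + bit(3 * i) + 2 * bit(3 * i + 1) - 4 * bit(3 * i + 2)
            for i in range(n // 3 + 1)]

def mod_mul_radix8(x, y, n):
    """X*Y modulo 2^n - 1 using floor(n/3)+1 modulo-reduced partial products."""
    mask = (1 << n) - 1
    acc = 0
    for i, d in enumerate(booth_radix8_digits(y, n)):
        multiple = {0: 0, 1: x, 2: cls(x, 1, n),
                    3: add_eac(x, cls(x, 1, n), n), 4: cls(x, 2, n)}[abs(d)]
        if d < 0:
            multiple ^= mask                 # negation is bitwise complement
        pp = cls(multiple, 3 * i, n)         # weight 8^i is a circular shift
        acc = add_eac(acc, pp, n)
    return 0 if acc == mask else acc         # fold the second zero pattern

print(mod_mul_radix8(200, 123, 8), (200 * 123) % 255)
```

For n = 8 the loop runs three times, matching the three partial products PP0, PP1 and PP2 of the example.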
MOTIVATIONS
To humans, decimal numbers are easy to comprehend and implement for
performing arithmetic. However, in digital systems, such as a microprocessor, DSP
(Digital Signal Processor) or ASIC (Application-Specific Integrated Circuit), binary
numbers are more pragmatic for a given computation.
doing modulo $2^n-1$ addition. The basic idea is to add the carry-out to the sum, in the
fashion of an end-around-carry add.
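MODULO (2^n-1) ADDITION
Modulo ($2^n-1$) addition or, which is the same, one's complement addition can be
formulated as
$$(A + B) \bmod (2^n - 1) = \begin{cases} (A + B + 1) \bmod 2^n & \text{if } A + B \ge 2^n - 1 \\ A + B & \text{otherwise} \end{cases} \qquad (5.3.1)$$
If the all-ones word is accepted as a second representation of zero, the condition can be
relaxed:
$$(A + B) \bmod (2^n - 1) = \begin{cases} (A + B + 1) \bmod 2^n & \text{if } A + B \ge 2^n \\ A + B & \text{otherwise} \end{cases} \qquad (5.3.2)$$
Since the new condition $A + B \ge 2^n$ is equivalent to $c_{out} = 1$, where $c_{out}$ is
the carry-out of the addition A + B, equation (5.3.2) can be rewritten as
$$(A + B) \bmod (2^n - 1) = (A + B + c_{out}) \bmod 2^n \qquad (16)$$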
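The end-around-carry formulation can be written in a few lines of Python. The function name `mod_add_eac` is hypothetical; the result may be the all-ones word, which is the second representation of zero.

```python
def mod_add_eac(a, b, n):
    """(A + B + cout) mod 2^n, where cout is the carry-out of A + B."""
    mask = (1 << n) - 1
    t = a + b
    cout = t >> n                 # 1 exactly when A + B >= 2^n
    return (t + cout) & mask

n = 4
print(mod_add_eac(9, 9, n))       # 18 mod 15
print(mod_add_eac(5, 10, n))      # all ones, i.e. zero in the double-zero code
```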
are computed. These adders have tree structures within a carry-computing stage similar to
the carry propagate adder. The process steps involved in parallel prefix addition are
depicted in Fig. 15.
A parallel prefix adder can be seen as a 3-stage process:
Pre-computation:
In pre-computation stage, each bit computes its carry generate (g)/propagate (p)
signals and a temporary sum as below. These two signals are said to describe how the
Carry-out signal will be handled.
gi=ai .bi
pi=ai xor bi
ci+1=gi+pi .ci
Prefix:
In the prefix stage, the group carry generate/propagate signals are computed to
form the carry chain and provide the carry-in for the adder below. Various signal
graphs/architectures can be used to calculate the carry-outs for the final sum. A few of
them are as follows.
Sklansky
Brent-kung
Ladner-Fischer
Post-computation:
In the post-computation stage, the sum and carry-out are finally produced. The
carry-out can be omitted if only a sum needs to be produced.
si=pi xor ci
In the prefix tree, group generate/propagate are the only signals used. The group
generate/propagate equations are based on single-bit generate/propagate, which are
computed in the pre-computation stage.
$$g_i = a_i \cdot b_i, \qquad p_i = a_i \oplus b_i \qquad (5.5.1)$$
where $0 \le i \le n$, $g_{-1} = c_{in}$ and $p_{-1} = 0$. Sometimes, $p_i$ can be computed
with OR logic instead of an XOR gate.
In the prefix tree, group generate/propagate signals are computed at each bit.
$$G_{i:k} = G_{i:j} + P_{i:j} \cdot G_{j-1:k}, \qquad P_{i:k} = P_{i:j} \cdot P_{j-1:k} \qquad (5.5.2)$$
More practically, Equation (5.5.2) can be expressed using the symbol "o" denoted by Brent
and Kung. Its function is exactly the same as that of a black cell. That is,
$$(G_{i:k}, P_{i:k}) = (G_{i:j}, P_{i:j}) \circ (G_{j-1:k}, P_{j-1:k}) \qquad (5.5.3)$$
or
$$G_{i:k} = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_k, p_k), \qquad P_{i:k} = p_i \cdot p_{i-1} \cdots p_k \qquad (5.5.4)$$
In the post-computation, the sum and carry-out are the final output.
$$s_i = p_i \oplus G_{i-1:-1}, \qquad c_{out} = G_{n:-1} \qquad (5.5)$$
Where -1 is the position of carry-input. The generate/propagate signals can be
grouped in different fashion to get the same correct carries. Based on different ways of
grouping the generate/propagate signals, different prefix architectures can be created.
Figure 17 shows the definitions of cells that are used in prefix structures, including black
cell and gray cell. Black/gray cells implement Equation (5.5.2) or (5.5.3), which will be
heavily used in the following discussion on prefix trees
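The three stages can be put together in a small Python model. The sketch below uses the Sklansky grouping as one possible prefix schedule and, as a simplification of the g_{-1} = c_in convention, folds the carry-in into the carries after the prefix stage; the function names are hypothetical.

```python
def prefix_op(hi, lo):
    """The 'o' operator (black cell): combine two (G, P) pairs."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo

def sklansky_add(a, b, n, cin=0):
    """n-bit parallel-prefix addition; returns (sum, carry_out)."""
    # Pre-computation: single-bit generate and propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    gp = list(zip(g, p))
    # Prefix stage: at each level, the upper half of every block takes in
    # the group signals of the top bit of the lower half.
    level = 0
    while (1 << level) < n:
        for i in range(n):
            if (i >> level) & 1:
                j = ((i >> level) << level) - 1
                gp[i] = prefix_op(gp[i], gp[j])
        level += 1
    # gp[i] now holds (G_{i:0}, P_{i:0}); carry into bit i+1
    carries = [cin] + [G | (P & cin) for G, P in gp]
    # Post-computation: sum bits
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, carries[n]

print(sklansky_add(0b10110111, 0b01101001, 8))
```

Other prefix trees (Kogge-Stone, Brent-Kung, and so on) differ only in which (G, P) pairs are combined at each level; the pre- and post-computation stages stay the same.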
EMPTY PREFIX TREE
Fig 5.4.5: 8-bit Empty Prefix Tree
Step 1:
Step 2 :
Step 3:
The way of building a prefix tree can be processed as the arrows indicate (i.e.
from LSB to MSB horizontally and then from top logic level down to bottom logic level
vertically).
The example shown in Figure 19.3 is an 8-bit Sklansky prefix tree.
The Sklansky prefix tree takes the least logic levels to compute the carries. In addition,
it uses fewer cells than the Knowles and Kogge-Stone structures, at the cost of higher
fan-out.
Figure 19.4 shows the 16-bit example of the Sklansky prefix tree with the critical path in
solid line. A few other prefix trees are given below.
Kogge-Stone prefix tree
Brent-kung
Ladner-Fischer
Han-Carlson Prefix Tree
Prefix tree       | Logic levels   | Area               | Fan-out | Wire tracks
Brent-Kung        | 2 log2 n - 1   | 2n - log2 n - 2    |         |
Kogge-Stone       | log2 n         | n log2 n - n + 1   |         | n/2
Ladner-Fischer    | log2 n + 1     |                    | n/4 + 1 |
Knowles [2,1,1,1] | log2 n         | n log2 n - n + 1   |         | n/4
Sklansky          | log2 n         | (n/2) log2 n       | n/2 + 1 |
Han-Carlson       | log2 n         | (n/2) log2 n       |         | n/4
Harris            | log2 n + 1     | (n/2) log2 n       |         | n/8
$$(c_{out}, S): \quad 2^n c_{out} + S = A + B + c_{in}$$
PRE-PROCESSING:
$$g_i = \begin{cases} a_0 b_0 + a_0 c_0 + b_0 c_0 & \text{if } i = 0 \\ a_i b_i & \text{otherwise} \end{cases}$$
$$p_i = a_i \oplus b_i$$
PREFIX COMPUTATION:
$$(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$$
POST-PROCESSING:
$$c_{i+1} = G^m_{i:0}$$
$$s_i = p_i \oplus c_i$$
The cell definitions for the above equations are depicted in Figure 19.
The prefix-structure size is only increased by n black nodes and the critical path
by one black node, which results in highly area and delay efficient end-around-carry
adders. Note that an n-bit end-around-carry parallel-prefix adder has the same delay but is
smaller compared to an ordinary 2n-bit parallel-prefix adder.
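The extra level of black nodes can be modelled in Python as one more combining step after the prefix stage: the carry-out of the whole word is fed back as the carry-in of every bit. In the sketch below the prefix stage is a plain serial scan for brevity (any prefix tree yields the same group signals), and the function name is hypothetical.

```python
def eac_prefix_add(a, b, n):
    """Modulo 2^n - 1 addition with an end-around-carry prefix structure."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    # Prefix stage (serial scan here): G[i] = G_{i:0}, P[i] = P_{i:0}
    G, P = [g[0]], [p[0]]
    for i in range(1, n):
        G.append(g[i] | (p[i] & G[i - 1]))
        P.append(p[i] & P[i - 1])
    # Additional level: the carry-out is fed back as the carry-in
    cout = G[n - 1]
    carries = [cout] + [G[i] | (P[i] & cout) for i in range(n - 1)]
    return sum((p[i] ^ carries[i]) << i for i in range(n))

print(eac_prefix_add(7, 9, 4))    # 16 mod 15
print(eac_prefix_add(15, 0, 4))   # all ones, the second representation of zero
```

Using only the group generate signal as the fed-back carry avoids a combinational loop; when every bit propagates, the result is the all-ones word.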
COMPARISON OF DIFFERENT PREFIX ADDERS
SKLANSKY ADDER
KOGGE-STONE ADDER
LADNER-FISCHER ADDER
BRENT-KUNG ADDER
CHAPTER-6
EXECUTION DETAILS
6.1SOFTWARE REQUIREMENTS
MODELSIM 6.4b
XILINX 14.6
It requires Xilinx ISE 10.1 version of software where Verilog source code can be
used for design implementation.
Introduction To Modelsim
In ModelSim, all designs are compiled into a library. You typically start a new
simulation in ModelSim by creating a working library called "work". "Work" is the
library name used by the compiler as the default destination for compiled design units.
Compiling Your Design: After creating the working library, you compile your design
units into it. The ModelSim library format is compatible across all supported
platforms. You can simulate your design on any platform without having to recompile
your design.
Loading the Simulator with Your Design and Running the Simulation: With the design
compiled, you load the simulator with your design by invoking the simulator on a
top-level module (Verilog) or a configuration or entity/architecture pair (VHDL).
Assuming the design loads successfully, the simulation time is set to zero, and you
enter a run command to begin simulation.
Debugging Your Results
If you don't get the results you expect, you can use ModelSim's robust debugging
environment to track down the cause of the problem.
Run simulation
Debug results
This tool can be used to create, implement, simulate, and synthesize Verilog designs for
implementation on FPGA chips.
ISE: Integrated Software Environment
Environment for the development and test of digital systems design targeted to
FPGA or CPLD
Integrated collection of tools accessible through a GUI
Based on a logical synthesis engine (XST: Xilinx Synthesis Technology)
XST supports different languages:
Verilog
VHDL
XST produces a netlist integrated with constraints
Supports all the steps required to complete the design:
Translate, map, place and route
Bit stream generation
Supports verification at different steps of the design
INTRODUCTION
FPGA stands for Field Programmable Gate Array. It consists of an array of logic
modules, I/O modules and routing tracks (programmable interconnect). An FPGA can be
configured by the end user to implement specific circuitry. Early devices ran at speeds of
up to 100 MHz, whereas present devices reach the GHz range.
FPGA DESIGN FLOW
FPGA contains a two dimensional arrays of logic blocks and interconnections
between logic blocks. Both the logic blocks and interconnects are programmable. Logic
blocks are programmed to implement a desired function and the interconnects are
programmed using the switch boxes to connect the logic blocks.
FPGAs, an alternative to custom ICs, can be used to implement an entire System
on Chip (SoC). The main advantage of an FPGA is the ability to reprogram it. The user can
reprogram an FPGA to implement a design, and this is done after the FPGA is
manufactured. This gives rise to the name Field Programmable.
SRAM is used to implement a LUT. A k-input logic function is implemented using an SRAM
of size 2^k x 1. The number of different possible functions for a k-input LUT is 2^(2^k).
The advantage of such an architecture is that it supports the implementation of so many
logic functions; the disadvantage is the unusually large number of memory cells required
to implement such a logic block when the number of inputs is large.
LUT based design provides for better logic block utilization. A k-input LUT
based logic block can be implemented in number of different ways with trade off between
performance and logic density. An n-LUT can be shown as a direct implementation of a
function truth-table. Each of the latch holds the value of the function corresponding to
one input combination. For Example: 2-LUT can be used to implement 16 types of
functions like AND , OR, A+not B .... etc.
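The view of a LUT as a stored truth table can be sketched in Python. The helper names `make_lut` and `lut_read` are hypothetical; the list plays the role of the 2^k x 1 SRAM, and the input bits form the read address.

```python
def make_lut(func, k):
    """Program a k-input LUT: store one output bit per input combination."""
    return [func(*[(addr >> i) & 1 for i in range(k)]) & 1
            for addr in range(1 << k)]

def lut_read(table, inputs):
    """Evaluate the LUT: the input bits form the memory address."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]

and2 = make_lut(lambda a, b: a & b, 2)
or2 = make_lut(lambda a, b: a | b, 2)
a_or_not_b = make_lut(lambda a, b: a | (1 - b), 2)
print(and2, or2, a_or_not_b)
print("distinct 2-input functions:", 2 ** (2 ** 2))
```

A 2-input LUT holds 4 bits, so 2^4 = 16 different contents are possible, one per logic function of two variables.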
FPGA DESIGN FLOW
In this part of the tutorial we give a short introduction to the FPGA design flow. A
simplified version of the design flow is given in the following diagram.
Design Entry
There are different techniques for design entry. Schematic based, Hardware
Synthesis
The process which translates VHDL or Verilog code into a device netlist format,
i.e. a complete circuit with logical elements (gates, flip-flops, etc.) for the design. If
the design contains more than one sub-design, e.g. to implement a processor we need a CPU
as one design element and RAM as another and so on, then the synthesis process.
Implementation
This process consists of a sequence of three steps
1. Translate
2. Map
3. Place and Route
Translate
Process combines all the input netlists and constraints to a logic design file. This
Map
Process divides the whole circuit with logical elements into sub blocks such that
they can be fit into the FPGA logic blocks. That means map process fits the logic defined
by the NGD file into the targeted FPGA elements (Combinational Logic Blocks (CLB),
Input Output Blocks (IOB)) and generates an NCD (Native Circuit Description) file
which physically represents the design mapped to the components of FPGA. MAP
program is used for this purpose.
blocks from the map process into logic blocks according to the constraints and connects
the logic blocks.
Device Programming
Now the design must be loaded on the FPGA. But the design must be converted to
a format so that the FPGA can accept it. BITGEN program deals with the conversion. The
routed NCD file is then given to the BITGEN program to generate a bit stream (a .BIT
file) which can be used to configure the target FPGA device. This can be done using a
cable. Selection of cable depends on the design.
Behavioral Simulation: This is the first of the simulation steps that are encountered
throughout the hierarchy of the design flow. This simulation is performed before the
synthesis process to verify the RTL (behavioral) code and to confirm that the design is
functioning as intended.
6.3 RESULT
6.3.1: SIMULATION RESULT
# Registers                        : 8
 Flip-Flops                        : 8
# Xors                             : 109
 1-bit xor2                        : 61
 1-bit xor3                        : 48
=====================================================================
Final Register Report
Macro Statistics
# Registers                        : 8
 Flip-Flops                        : 8
=====================================================================
*                          Final Report
=====================================================================
Final Results
RTL Top Level Output File Name     : TOP_NIS.ngr
Top Level Output File Name         : TOP_NIS
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO
Design Statistics
# IOs                              : 26
Cell Usage:
# BELS                             : 193
#      LUT2                        : 10
#      LUT3                        : 56
#      LUT4                        : 120
#      MUXF5                       : 7
# Flip Flops/Latches               : 8
#      FDR                         : 8
# Clock Buffers                    : 1
#      BUFGP                       : 1
# IO Buffers                       : 25
#      IBUF                        : 17
#      OBUF                        : 8
=====================================================================
Number of Slices
2%
1%
Number of IOs
: 26
: 26 out of
: 8
Number of GCLKs
: 1 out of
232
24
11%
4%
=====================================================================
Timing Summary:
---------------
Speed Grade: -4
Total : 24.020ns
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
RTL SCHEMATICS
CHAPTER-7
APPLICATIONS
The residue number system has been a very attractive solution to many researchers,
especially during the last decade. Extensive research has been devoted to the theory of
improving the RNS and applying it in application areas such as digital signal
processing, digital filters, fast Fourier transform (FFT), and image processing.
The RNS is inherently parallel, modular and fault tolerant. Performing operations
such as addition, subtraction, and multiplication is inherently carry-free, thus reducing a
great amount of circuit integration area where carry-detection circuitry had to be
implemented before.
RSA Algorithm
Digital Signal Processing
Digital Filtering
Image Processing
Error Detection and Correction
CHAPTER-8
CONCLUSION
A new approach for multiplication modulo (2^n-1) is proposed. In this design, a
Partial Product Generator, a Carry Save Adder and a Parallel Prefix Adder are used.
Similar to the binary multiplier, the generation of the partial products is accomplished
by AND gates. The radix-8 Partial Product Generator increases the speed by compressing
the number of partial product rows from N to (N/3)+1. The Carry Save Adder is used to add
the PPG output values. To completely utilize the unequal delay of a full adder, an
algorithm for delay optimization of the Wallace tree is developed. The proposed parallel
prefix adding approach exhibits superior performance, in terms of either speed or hardware
requirement, in comparison with a recent counterpart for the same purpose. In addition,
the proposed multiplier modulo (2^n-1) shows an extremely regular structure and is very
suitable for VLSI implementation.
I have used ModelSim 6.4b for simulation, Xilinx ISE 14.6 for synthesis, timing
analysis and power analysis, and a Spartan-3E FPGA kit for downloading and post-simulation
of the design. The total delay achieved is about 24 ns (24.020 ns in the timing summary)
and the total power is 0.0076 W.
CHAPTER-9
FUTURE SCOPE
Montgomery modular multiplication algorithm is a well-known method that is
employed in efficient modular multiplication architectures and therefore is widely used in
GF( p) elliptic curve applications.
The complexity of the Montgomery multiplier makes the testing process a big
challenge. A methodology for developing testing modules is introduced. Including a
self-testing block in the multiplier's system will be beneficial and will reduce the time
and effort for testing. A self-testing block will perform Montgomery multiplication of
hardwired numbers and compare the result with predefined values. A flag bit can be used
to indicate an error.
A power dissipation study of the design is also needed in the context of differential
power attacks. This type of attack on a cryptographic system tries to deduce
parameters of the system by observing the system's power dissipation. This study would
also show the adequacy of this design approach for low-power devices, such as
portable computers.
More study needs to be done to see the effect of applying the re-timing technique to
the radix-2 design, and how re-timing will affect the performance of the design. Some
investigation is needed to show how the radix-4 design presented in this text can
be extended to cover the unified architecture. The integration of multiplication
and exponentiation can be included as part of a hardware co-processor.
CHAPTER-10
BIBLIOGRAPHY
1)
Radix-8
Booth
Samir Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, IEEE 1364-2001
Compliant.
3)
4)
V. Miller, "Use of elliptic curves in cryptography," in Proc. Advances in Cryptology -
CRYPTO '85, Lecture Notes in Computer Science, 1986, vol. 218, pp. 417-426.
5)
6)
7)
8)
9)
10)
R. Muralidharan and C. H. Chang, "Fast hard multiple generators for radix-8 Booth
encoded modulo 2^n-1 and modulo 2^n+1 multipliers," in Proc. 2010 IEEE Int. Symp.
Circuits and Systems, Paris, France, Jun. 2010, pp. 717-720.