Final Documentation (Mee) 123 PDF
VLSI Architecture Implementation for CRC-9 Polynomial
Using Parallelism
A thesis report submitted in partial fulfillment of the requirements for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
Submitted By
N.TEJASWI RAMANI (15B01A04A2)
V.ROHITHA (15B01A04G7)
SHRI VISHNU ENGINEERING COLLEGE FOR WOMEN
(AUTONOMOUS)
(Approved by AICTE, New Delhi, Affiliated to J.N.T.University, Kakinada)
DEPARTMENT OF
ELECTRONICS AND COMMUNICATION ENGINEERING
CERTIFICATE
This is to certify that the thesis entitled “VLSI Architecture Implementation for CRC-9
polynomial Using Parallelism”, being submitted by N.TEJASWI RAMANI
(15B01A04A2), V.ROHITHA (15B01A04G7), P.JYOSYASRI ALEKHYA
(15B01A0473), and N.MADHURA MEENAKSHI (16B05A0417) in partial fulfillment of
the requirement for the award of Bachelor of Technology in Electronics and
Communication Engineering to Shri Vishnu Engineering College for Women
(Autonomous), Bhimavaram, is a record of bonafide work done by them under our guidance
and supervision.
External Examiner
DECLARATION
We, the students of Shri Vishnu Engineering College for Women (Autonomous),
hereby declare that this project work entitled “VLSI Architecture Implementation of
CRC-9 polynomial using parallelism”, being submitted to the Department of ECE,
SVECW(A), affiliated to JNTU, Kakinada, for the award of BACHELOR OF
TECHNOLOGY in Electronics and Communication Engineering, is a record of
bonafide work done by us and that it has not been submitted to any other Institute or
University for the award of any other degree or prize.
PROJECT ASSOCIATES
V.ROHITHA (15B01A04G7)
We take an opportunity to express our sincere gratitude to each and every one who
supported and guided us for the completion of our project.
We express our sincere gratitude to Dr. G. SRINIVASA RAO, Principal, SVECW
(A), Bhimavaram and Dr. P. SRINIVASA RAJU, Vice Principal, SVECW (A),
Bhimavaram, whose support from time to time helped us to complete the project
successfully.
We are very much thankful to Dr. G.R.V.L.N.SRINIVASA RAJU, Head of the
Department, Department of Electronics and Communication Engineering, for his
continuous support and guidance.
PROJECT ASSOCIATES
V.ROHITHA (15B01A04G7)
CONTENTS
DECLARATION
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION TO CRC
1.1 Introduction
1.2 CRC Implementation
1.3 Message Augmentation
1.4 Designing Polynomials
2 INTRODUCTION TO LFSR IN CRC
2.1 CRC Generation using LFSR
2.2 Cyclic redundancy check (CRC) applications
3 SERIAL LFSR ARCHITECTURE
3.1 IIR Filter Representation of LFSR
6 Simulation results
6.1 Serial CRC
6.1.1 Technology schematic
6.1.2 R.T.L Schematic
6.1.3 Simulation waveforms
6.2 One stage parallel CRC
6.2.1 Technology schematic
6.2.2 R.T.L Schematic
6.2.3 Simulation waveforms
6.3 Three stage parallel CRC
6.3.1 Technology schematic
6.3.2 R.T.L Schematic
6.3.3 Simulation waveforms
7 Conclusion and discussion
List of Tables
3.1 Data flow of Fig. 3.3 when the input message is 101011010
4.1 Data flow of Fig. when the input message is 101011010
List of Figures
1.1 CRC generator at sender side
1.2 CRC generator at sender side
2.1 CRC generation using LFSR
2.2 Serial LFSR architecture
2.3 M-Parallel architecture
3.1 General LFSR architecture
3.2 LFSR architecture for g(x) = 1 + x + x^8 + x^9
4.1 One stage parallel CRC circuit
4.2 Three stage parallel CRC circuit
5.1 Project properties
5.2 Module defining
Abstract
Error detection is important whenever there is a non-zero chance of data getting corrupted. A
Cyclic Redundancy Check (CRC) is the remainder, or residue, of binary division of a
potentially long message by a CRC polynomial. This technique is ubiquitously employed in
communication and storage applications due to its effectiveness at detecting accidental
errors.
The hardware implementation of a bit-wise CRC is a simple linear feedback shift register. Such
a circuit is very simple and can run at very high clock speeds, but it requires the stream to be
bit-serial. This means that ‘n’ clock cycles will be required to calculate the CRC values for an
n-bit data stream. This latency is intolerable in many high speed data networking applications
where data frames need to be processed at high speed and hence implementation of CRC
generation and checking on a parallel stream of data becomes desirable.
A parallel Cyclic Redundancy Check (CRC) based on the DSP techniques of pipelining,
retiming and unfolding is implemented in the Verilog hardware description language and
simulated using Xilinx ISE tools. The architectures are first pipelined to reduce the iteration
bound by using novel look-ahead techniques and then unfolded and retimed to design high
speed parallel circuits. The results also show that the parallel implementation uses fewer clock
cycles than the serial implementation of CRC-9, thereby increasing the speed of the
architecture.
Chapter 1
INTRODUCTION TO CRC
1.1 Introduction
A Cyclic Redundancy Check (CRC) is a verification method used to ensure that data being sent
is not corrupted during transfer. The use of CRCs is common in communication media that
transmit digital data, such as WiFi and Ethernet. There is a need to check for communication
errors in embedded systems, as technology drives them to be capable of creating and sending
larger data packets in a faster and more complex manner. This chapter discusses a
method for computing and verifying a CRC.
Cyclic Redundancy Codes are a type of consistency check that treats the message data as a
(long) dividend of a modulo-2 polynomial division. Modulo-2 arithmetic doesn't use
carries/borrows when combining numbers. A specific CRC defines a set number of bits to work
on at a time, where said number is also the degree of a fixed polynomial (with modulo-2
coefficients) used as a divisor.
Since ordering doesn't apply to modulo arithmetic, the check between the current high part of
the dividend and the trial partial product (of the divisor and the trial new quotient coefficient)
is done by seeing if the highest-degree coefficient of the dividend is one. (The highest-degree
coefficient of the divisor must be one by definition, since it's the only non-zero choice.) The
remainder after the division is finished is used as the basis of the CRC checksum.
For a given degree x for the modulo-2 polynomial divisor, the remainder will have at
most x terms (from degree x - 1 down to the constant term). The coefficients are modulo-2,
which means that they can be represented by 0's and 1's. So a remainder can be modeled by an
(unsigned) integer of at least x bits in width.
The divisor must have its x degree term be one, which means it is always known and can be
implied instead of having to explicitly include in representations. Its lower x terms must be
specified, so a divisor can be modeled the same way as remainders. With such a modeling, the
divisor representation could be said to be truncated since the uppermost term's value is implied
and not stored.
The remainder and (truncated) divisor polynomials are stored as basic computer integers. This
is in contrast to the dividend, which is modelled from the input stream of data bits, where
each new incoming bit is the next lower term of the dividend polynomial. Long division can
be processed in piecemeal, reading new upper terms as needed. This maps to reading the data
a byte (or bit) at a time, generating updated remainders just-in-time, without needing to read
(and/or store(!)) the entire data message at once.
Long division involves appending new dividend terms after the previous terms have been
processed into the (interim) remainder. So the remainder is the only thing that has to change
during each division step; a new input byte (or bit) is combined with the remainder to make the
interim dividend, and then combined with the partial product (based on the divisor and top
dividend bit(s)) to become a remainder again.
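The division step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `mod2_remainder`, its parameters, and the choice of the divisor x^9 + x^8 + x + 1 in truncated form are assumptions made for the example.

```python
def mod2_remainder(bits, truncated_divisor, degree):
    """Return the modulo-2 remainder of a bit sequence (highest term first).

    truncated_divisor holds the lower `degree` coefficients of the divisor;
    its leading coefficient is implied to be one and is not stored.
    """
    mask = (1 << degree) - 1
    remainder = 0
    for bit in bits:
        top = (remainder >> (degree - 1)) & 1        # highest coefficient of the interim dividend
        remainder = ((remainder << 1) | bit) & mask  # append the next lower dividend term
        if top:
            remainder ^= truncated_divisor           # modulo-2 subtraction of the divisor
    return remainder


# Assumed example divisor: x^9 + x^8 + x + 1, stored without its x^9 term.
CRC9_TRUNCATED = 0b100000011

# A dividend shorter than the divisor is its own remainder.
print(mod2_remainder([1, 0, 1], CRC9_TRUNCATED, 9))                        # prints 5
# The divisor divided by itself leaves no remainder.
print(mod2_remainder([1, 1, 0, 0, 0, 0, 0, 0, 1, 1], CRC9_TRUNCATED, 9))  # prints 0
```

Only the interim remainder is kept between steps, so the input can be consumed one bit at a time without storing the whole message.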
When all of the input data has been read during division, the last x bits are still stuck in the
interim remainder. They have not been pushed through the division steps; to do so, x zero-
valued extra bits must be passed into the system. This ensures all of the message's data bits get
processed. The post-processed remainder is the checksum. The system requires the message to
be augmented with x extra bits to get results.
Alternatively, if the post-division augmentation bits are the expected checksum instead, then
the remainder will "subtract" the checksum with itself, giving zero as the final remainder. The
remainder will end up non-zero if bit errors exist in either the data or checksum or both. This
option requires the checksum to be fed from highest-order bit first on down (i.e. big endian).
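Both uses of augmentation can be illustrated with a short Python sketch. The helper names, the divisor x^9 + x^8 + x + 1 and the 9-bit message 101011010 are assumptions chosen for the example; the divisor is stored here with its top term included.

```python
def mod2_remainder(bits, divisor, degree):
    """Modulo-2 remainder of a bit list (highest term first); divisor includes its top term."""
    remainder = 0
    for bit in bits:
        remainder = (remainder << 1) | bit
        if (remainder >> degree) & 1:
            remainder ^= divisor
    return remainder


def to_bits(value, width):
    """Big-endian list of `width` bits."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


DEGREE = 9
DIVISOR = 0b1100000011                     # x^9 + x^8 + x + 1 (assumed example)
message = [1, 0, 1, 0, 1, 1, 0, 1, 0]

# Sender: augment with DEGREE zero bits; the remainder is the checksum.
checksum = mod2_remainder(message + [0] * DEGREE, DIVISOR, DEGREE)
print(format(checksum, "09b"))             # prints 010110110

# Receiver: augment with the checksum instead, highest-order bit first.
received = message + to_bits(checksum, DEGREE)
print(mod2_remainder(received, DIVISOR, DEGREE))        # prints 0

# A single flipped bit leaves a non-zero remainder.
corrupted = list(received)
corrupted[3] ^= 1
print(mod2_remainder(corrupted, DIVISOR, DEGREE) != 0)  # prints True
```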
Exploiting the properties of how the division is carried out, the steps can be rearranged such
that the post-processing zero-valued bits are not needed; their effect is merged into the start of
the process. Such systems read unaugmented messages and expose the checksum directly from
the interim remainder afterwards.
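A sketch of this rearrangement, under the same assumed divisor x^9 + x^8 + x + 1: each input bit is combined with the top bit of the register before the shift, so no trailing zero bits are required. The function names are hypothetical, and `crc_augmented` is included only as a reference for comparison.

```python
def crc_direct(bits, truncated_divisor, degree):
    """CRC of an unaugmented bit list: each input bit is XORed with the register's top bit."""
    mask = (1 << degree) - 1
    register = 0
    for bit in bits:
        feedback = ((register >> (degree - 1)) & 1) ^ bit
        register = (register << 1) & mask
        if feedback:
            register ^= truncated_divisor
    return register


def crc_augmented(bits, truncated_divisor, degree):
    """Reference: conventional division of the message followed by `degree` zero bits."""
    mask = (1 << degree) - 1
    register = 0
    for bit in list(bits) + [0] * degree:
        top = (register >> (degree - 1)) & 1
        register = ((register << 1) | bit) & mask
        if top:
            register ^= truncated_divisor
    return register


message = [1, 0, 1, 0, 1, 1, 0, 1, 0]
TRUNCATED = 0b100000011                     # x^9 + x^8 + x + 1 without its top term
print(crc_direct(message, TRUNCATED, 9) == crc_augmented(message, TRUNCATED, 9))  # prints True
```

The direct form needs as many steps as there are message bits, whereas the augmented form needs that number plus the degree of the divisor.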
1.4 Designing Polynomials
The selection of the generator polynomial is the most important part of implementing the
CRC algorithm. The polynomial must be chosen to maximize the error-detecting capabilities.
The most important attribute of the polynomial is its length (the largest degree (exponent) of
any one term in the polynomial, plus one), because of its direct influence on the length of the
computed check value. Commonly used polynomial lengths are:
9 bits (CRC-8)
17 bits (CRC-16)
33 bits (CRC-32)
65 bits (CRC-64)
A CRC is called an n-bit CRC when its check value is n-bits. For a given n, multiple CRCs are
possible, each with a different polynomial. Such a polynomial has highest degree n, and
hence n + 1 terms (the polynomial has a length of n + 1). The remainder has length n. The CRC
has a name of the form CRC-n-XXX.
The design of the CRC polynomial depends on the maximum total length of the block to be
protected (data + CRC bits), the desired error protection features, and the type of resources for
implementing the CRC, as well as the desired performance. A common misconception is that
the "best" CRC polynomials are derived from either irreducible polynomials or irreducible
polynomials times the factor 1 + x, which adds to the code the ability to detect all errors
affecting an odd number of bits. In reality, all the factors described above should enter into the
selection of the polynomial and may lead to a reducible polynomial. However, choosing a
reducible polynomial will result in a certain proportion of missed errors, due to the quotient
ring having zero divisors.
A polynomial that admits other factorizations may be chosen then so as to balance the maximal
total block length with a desired error detection power. The BCH codes are a powerful class of
such polynomials. They subsume the two examples above. Regardless of the reducibility
properties of a generator polynomial of degree r, if it includes the "+1" term, the code will be
able to detect error patterns that are confined to a window of r contiguous bits. These patterns
are called "error bursts".
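The burst-detection property can be checked exhaustively for a small case. The sketch below assumes the degree-9 generator x^9 + x^8 + x + 1 and an 18-bit block, and uses hypothetical helper names. Because the code is linear, an error pattern goes undetected exactly when the pattern itself leaves a zero remainder, so it is enough to divide the error patterns alone.

```python
def mod2(value, divisor, degree):
    """Modulo-2 remainder of an integer-encoded polynomial; divisor includes its top term."""
    for shift in range(value.bit_length() - 1, degree - 1, -1):
        if (value >> shift) & 1:
            value ^= divisor << (shift - degree)
    return value


def undetected_bursts(divisor, degree, block_len):
    """Collect error patterns confined to `degree` contiguous bits that leave a zero remainder."""
    missed = []
    for pattern in range(1, 1 << degree):              # every non-zero burst shape
        for offset in range(block_len - degree + 1):   # every position inside the block
            error = pattern << offset
            if mod2(error, divisor, degree) == 0:
                missed.append(error)
    return missed


# Generator with the "+1" term: no burst of up to 9 bits is missed.
print(len(undetected_bursts(0b1100000011, 9, 18)))       # prints 0
# Hypothetical generator x^9 + x^8 + x without the "+1" term: some bursts slip through.
print(len(undetected_bursts(0b1100000010, 9, 18)) > 0)   # prints True
```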
Chapter 2
INTRODUCTION TO LFSR IN CRC
The shift register is driven by a clock. At every clock pulse, the input data is shifted into the
register in addition to transmitting the data. When all input bits have been processed, the shift
register contains the CRC bits, which are then shifted out on the data line.
Assume now that the check bits are stored in a register referred to as the CRC register. A
software implementation would be:
1) CRC <= 0
2) if the CRC left-most bit is equal to 1, shift in the next message bit, and XOR the CRC register
with the generator polynomial; otherwise, only shift in the next message bit
3) Repeat step 2 until all bits of the augmented message have been shifted in.
Faster implementations can be achieved by handling the data as larger units than bits, as long
as the size does not exceed the degree of the generator polynomial. However, the speed gain
corresponds to a memory increase, since precomputed values (lookup tables) will be used.
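The trade of memory for speed can be sketched as follows, assuming the degree-9 generator x^9 + x^8 + x + 1 and 8-bit units, which satisfies the condition that the unit size does not exceed the degree. All names are hypothetical. For brevity the sketch uses the form of the algorithm that needs no appended zero bits, and the bit-serial routine is included so that the two can be compared.

```python
DEGREE = 9
TRUNCATED = 0b100000011          # x^9 + x^8 + x + 1 without its top term (assumed example)
MASK = (1 << DEGREE) - 1


def crc_bitwise(data):
    """Bit-serial reference: one register update per message bit."""
    register = 0
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            feedback = ((register >> (DEGREE - 1)) & 1) ^ bit
            register = (register << 1) & MASK
            if feedback:
                register ^= TRUNCATED
    return register


# Precomputed contribution of every possible 8-bit unit (256 entries).
TABLE = [crc_bitwise(bytes([value])) for value in range(256)]


def crc_bytewise(data):
    """Table-driven version: one register update per message byte."""
    register = 0
    for byte in data:
        index = (register >> (DEGREE - 8)) ^ byte   # top 8 register bits combined with the byte
        register = ((register << 8) & MASK) ^ TABLE[index]
    return register


sample = b"123456789"
print(crc_bytewise(sample) == crc_bitwise(sample))   # prints True
```

The table costs 256 entries of 9 bits each, and in exchange the inner loop of eight shifts collapses into one lookup and two XOR operations per byte.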
A traditional application for LFSRs is in cyclic redundancy check (CRC) calculations, which
can be used to detect errors in data communications. The final CRC value stored in the LFSR
is known as a checksum, and is dependent on every bit in the data stream. After all of the data
bits have been transmitted, the transmitter sends its checksum value to the receiver. The
receiver contains an identical CRC calculator and generates its own checksum value from the
incoming data. Once all of the data bits have arrived, the receiver compares its internally
generated checksum value with the checksum sent by the transmitter to determine whether any
corruption occurred during the course of the transmission. This form of error detection is very
efficient in terms of the small number of bits that have to be transmitted in addition to the data.
A 4-bit CRC calculator would not be considered to provide sufficient confidence in
the integrity of the transmitted data. This is due to the fact that a 4-bit LFSR can only represent
16 unique values, which means that there is a significant probability that multiple errors in the
data stream could result in the two checksum values being identical. However, as the number
of bits in a CRC calculator increases, the probability that multiple errors will cause identical
checksum values approaches zero. For this reason, CRC calculators typically use a minimum
of 16 bits, providing 65,536 unique values.
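As a rough quantitative reading of this argument, assume that a corrupted message is equally likely to produce any of the 2^n possible checksum values. This is a simplifying assumption, not a property of any particular polynomial. The chance that the corrupted message reproduces the correct checksum is then:

```latex
P_{\text{undetected}} \approx 2^{-n}, \qquad
2^{-4} = \tfrac{1}{16} = 6.25\%, \qquad
2^{-16} = \tfrac{1}{65\,536} \approx 0.0015\%.
```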
There are a variety of standard communications protocols, each of which specifies the number
of bits employed in their CRC calculations and the taps to be used. The taps are selected such
that an error in a single data bit will cause the maximum possible disruption to the resulting
checksum value. Thus, in addition to being referred to as maximal-length, these LFSRs may
also be qualified as maximal-displacement.
In addition to checking data integrity in communications systems, CRCs find a wide variety of
other uses: for example, the detection of computer viruses. For the purposes of this discussion,
a computer virus may be defined as a self-replicating program released into a computer system
for a number of purposes. These purposes range from the simply mischievous (such as
displaying humorous or annoying messages) to the downright nefarious (such as corrupting
data or destroying – or subverting – the operating system).
One mechanism by which a computer virus may both hide and propagate itself is to attach itself
to an existing program. A cursory check of the system shows only the expected files to be
present, but whenever the infected program is executed it will first trigger the virus to replicate
itself. In order to combat this form of attack, a unique checksum can be generated for each file
on the system, where the value of each checksum is based on the binary instructions forming
the program with which it is associated. At some later date, an anti-virus program can be used
to recalculate the checksum values for each program, and to compare them to the original
values. A difference in the two values associated with a program may indicate that a virus has
attached itself to that program.
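The bookkeeping behind such a check can be sketched in Python. The sketch assumes CRC-32 as provided by the standard `zlib` module, a choice made here for convenience and not one stated in the text, and it uses in-memory byte strings in place of real program files. The function and file names are hypothetical.

```python
import zlib


def snapshot(files):
    """Map each file name to the CRC-32 of its contents."""
    return {name: zlib.crc32(content) for name, content in files.items()}


def changed_files(baseline, files):
    """Names whose current checksum differs from the recorded one."""
    current = snapshot(files)
    return sorted(name for name in baseline if current.get(name) != baseline[name])


programs = {"editor.exe": b"\x90\x90\xc3", "shell.exe": b"\x55\x89\xe5\xc3"}
baseline = snapshot(programs)            # recorded while the system is known to be clean

programs["shell.exe"] = b"\x55\x89\xe5\xcc"   # simulated modification of one program image
print(changed_files(baseline, programs))      # prints ['shell.exe']
```

A matching checksum does not prove that a file is unmodified, since a CRC is not designed to resist deliberate forgery; a mismatch, however, shows that the file has changed.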
Chapter 3
SERIAL LFSR ARCHITECTURE
Parallel architectures are developed using pipelining, retiming and look ahead techniques.
Further, combined parallel processing and pipelining techniques are presented to reduce the
critical path and eliminate the fan-out effect for long BCH codes.
CRC is an internationally standardized method used for error detection. It protects the
integrity of data by means of a checksum. CRC polynomials were standardized by the
CCITT (Comité Consultatif International Téléphonique et Télégraphique), which is now known
as ITU-T (the Telecommunication Standardization Sector of the International
Telecommunication Union). The CRC method performs nothing but simple binary division. A
sequence of redundant bits called the CRC remainder is appended to the end of the original
data stream prior to transmission. The data thus obtained becomes exactly divisible by a
predetermined CRC polynomial. When it reaches its destination, the received data bits are
divided by the same polynomial, and the receiver checks that the remainder is zero. A CRC
calculation can be mathematically described as a polynomial division which is performed over
the input data by a generator polynomial called G(x) [3]. G(x) is commonly called the CRC
polynomial or feedback polynomial. The remainder obtained after this division is then
appended to the data. This appended value is known as the CRC bits. The error detection
calculation is done at the receiver side, and if the CRC bits obtained by the sender and the
receiver are different, then an error is detected. For obtaining an eight bit CRC value a nine
bit polynomial is used. Let’s assume that the input to the LFSR is u(n), and the required
output, i.e., the remainder, is y(n). Then the LFSR can be described using the following
equations:
Substituting (1) into (2), we get
where,
In the equations above, + denotes the XOR operation. We can observe that the resulting equation
resembles an IIR filter with g0, g1, ..., gK−1 as coefficients.
Consider a generator polynomial g(x) = 1 + x + x^8 + x^9; the CRC architecture is shown in Fig.
3.2. By using the above formulation,
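One way to write the recursions referred to in this section, consistent with the taps of g(x) = 1 + x + x^8 + x^9, is sketched here. The intermediate signal w(n) (the XOR of the input with the fed-back output), the auxiliary term f(n) and the equation labels are assumptions introduced for this sketch; all additions are modulo 2.

```latex
\begin{align}
w(n) &= y(n) + u(n) \tag{1}\\
y(n) &= g_{K-1}\,w(n-1) + g_{K-2}\,w(n-2) + \dots + g_{0}\,w(n-K) \tag{2}\\
y(n) &= g_{K-1}\,y(n-1) + g_{K-2}\,y(n-2) + \dots + g_{0}\,y(n-K) + f(n) \tag{3}
\end{align}
% with the input-dependent term
\begin{equation*}
f(n) = g_{K-1}\,u(n-1) + g_{K-2}\,u(n-2) + \dots + g_{0}\,u(n-K).
\end{equation*}
% For g(x) = 1 + x + x^8 + x^9, i.e. K = 9 and g_0 = g_1 = g_8 = 1:
\begin{align*}
y(n) &= y(n-1) + y(n-8) + y(n-9) + f(n),\\
f(n) &= u(n-1) + u(n-8) + u(n-9).
\end{align*}
```

In this form the output depends on its own past values through fixed binary coefficients, which is the structure of an IIR filter over GF(2).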
The following example illustrates the correctness of the proposed method. The CRC
architecture for the above equation is shown in Fig. 3.3. Let the message sequence be
101011010; Table 3.1 shows the data flow at the marked points of this architecture at different
time slots. In Table 3.1, we can see that this architecture requires 18 clock cycles to compute
the output. Since this is a serial computation, we need to input 0’s for the next 9 clock cycles
after the message has been processed. In addition, the feedback has to be removed after the
message bits are processed for the correct functioning of the circuit. Since this operation is in
GF(2), guaranteeing the feedback to be 0 after 9 clock cycles will have the same effect.
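The 18-cycle behaviour can be reproduced with a small simulation. The sketch assumes the serial recursion y(n) = w(n−1) + w(n−8) + w(n−9) for g(x) = 1 + x + x^8 + x^9, where w(n) = y(n) + u(n) is a symbol introduced here for the feedback signal. The function name and parameters are hypothetical.

```python
def serial_lfsr_outputs(message_bits, degree=9, delays=(1, 8, 9)):
    """Simulate the serial LFSR output y(n) for len(message_bits) + degree clock cycles.

    While message bits are entering, w(n) = y(n) XOR u(n). Afterwards the input is
    zero and the feedback is removed, which is modelled by forcing w(n) = 0.
    """
    w, y = [], []
    for n in range(len(message_bits) + degree):
        y_n = 0
        for d in delays:                 # taps taken from 1 + x + x^8 + x^9
            if n - d >= 0:
                y_n ^= w[n - d]
        y.append(y_n)
        w.append(y_n ^ message_bits[n] if n < len(message_bits) else 0)
    return y


outputs = serial_lfsr_outputs([1, 0, 1, 0, 1, 1, 0, 1, 0])
print(len(outputs))     # prints 18, one output per clock cycle
print(outputs[9:])      # prints [0, 1, 0, 1, 1, 0, 1, 1, 0]
```

The last nine outputs are the remainder of the message, shifted out highest-order bit first during the nine cycles in which zeros are applied.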
The CRC value is calculated by using a predetermined polynomial which is known to both
the sender and the receiver. It is calculated using a linear feedback shift register (LFSR),
which enables the calculation of the CRC value for any polynomial. This helps
the system to work in a more efficient way because here the sender has to set the polynomial
according to the serial input data. The LFSR is an efficient way of generating the CRC value.
It is built by using D flip-flops and Exclusive-OR gates.
In this method the input data is inserted serially into the LFSR and, after the CRC value has
been calculated, the output is obtained in parallel, i.e., it is a serial-input parallel-output
device. First the LFSR represents the polynomial in the form of a binary sequence. For this the
coefficients of the polynomial are considered; each can be either 0 or 1. The system performs
the XOR function on the basis of the coefficients of the polynomial. If a coefficient of the
polynomial is zero, the bit is shifted without performing the XOR operation, and if the
coefficient is one, the bit is XORed with the feedback value from the D flip-flop before being
shifted. After the last input bit has been inserted, the values stored in the D flip-flops are
taken as the output CRC value. Here we have used the VHDL language for hardware
description.
Table 3.1: Data flow of Fig. 3.3 when the input message is 101011010
Chapter 4
There are different techniques for parallel CRC generation, given as follows.
Parallel architecture for a simple LFSR described in the previous section is discussed first.
Consider the design of a 3-parallel architecture for the LFSR in Fig. 3.2. In the parallel system,
each delay element is referred to as a block delay where the clock period of the block system
is 3 times the sample period.
Therefore, instead of (3), the loop update equation should update y(n) using inputs and y(n−3).
The loop update process for the 3-parallel system is shown in Fig. 3.4, where y(3k+3), y(3k+4)
and y(3k+5) are computed using y(3k), y(3k + 1) and y(3k + 2). By iterating the recursion or
by applying look-ahead technique, we get
where
The 3-parallel system is shown in Fig. 3.5. The input message 101011010 in the previous
example leads to the data flow shown in Table 4.1. We can see that 6 clock cycles are
needed to encode a 9-bit message. The proposed architecture requires (N + K)/L cycles, where
N is the length of the message, K is the length of the code and L is the parallelism level.
Similar to the serial architecture, the feedback path has to be broken after the message bits
have been processed. The critical path (CP) of the design is 6Txor, where Txor is the XOR
gate delay.
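The look-ahead step can be illustrated numerically. The sketch assumes the serial recursion y(n) = y(n−1) + y(n−8) + y(n−9) + f(n) with f(n) = u(n−1) + u(n−8) + u(n−9), all modulo 2. Iterating it twice gives y(n) = y(n−3) + y(n−8) + y(n−11) + f(n) + f(n−1) + f(n−2), which depends only on outputs at least three samples old. Both recursions are run with the feedback always connected, and the helper names are hypothetical.

```python
import math


def at(seq, n):
    """Sequence value at index n, with zero initial conditions for n < 0."""
    return seq[n] if n >= 0 else 0


def f(u, n):
    """Input-dependent term f(n) = u(n-1) + u(n-8) + u(n-9), modulo 2."""
    return at(u, n - 1) ^ at(u, n - 8) ^ at(u, n - 9)


def serial(u):
    """y(n) = y(n-1) + y(n-8) + y(n-9) + f(n)."""
    y = []
    for n in range(len(u)):
        y.append(at(y, n - 1) ^ at(y, n - 8) ^ at(y, n - 9) ^ f(u, n))
    return y


def look_ahead(u):
    """y(n) = y(n-3) + y(n-8) + y(n-11) + f(n) + f(n-1) + f(n-2)."""
    y = []
    for n in range(len(u)):
        y.append(at(y, n - 3) ^ at(y, n - 8) ^ at(y, n - 11)
                 ^ f(u, n) ^ f(u, n - 1) ^ f(u, n - 2))
    return y


def clock_cycles(n_message, k_code, parallelism):
    """Cycle count (N + K) / L, rounded up when L does not divide N + K."""
    return math.ceil((n_message + k_code) / parallelism)


u = [1, 0, 1, 0, 1, 1, 0, 1, 0] + [0] * 15
print(serial(u) == look_ahead(u))                     # prints True
print(clock_cycles(9, 9, 1), clock_cycles(9, 9, 3))   # prints 18 6
```

Because each output of the look-ahead form uses no output newer than y(n−3), three consecutive outputs can be computed in the same clock cycle, which is what the 3-parallel architecture exploits.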
Clock   u(3k)   u(3k+1)   u(3k+2)   y(3k)   y(3k+1)   y(3k+2)
1       1       0         1         0       1         1
2       0       1         1         0       0         1
3       0       1         0         0       0         1
4       0       0         0         0       1         0
5       0       0         0         1       1         0
6       0       0         0         1       1         0
7       0       0         0         1       1         0
Table 4.1: Data flow of Fig. when the input message is 101011010
The pipelined architecture in Figure has eight blocks to store input. They are used to read data
from the message in each iteration. The blocks are converted into CRC values using lookup
tables (LUTs), which contain the CRC values for the input. The rightmost block does not need
any lookup table, because this architecture assumes CRC-16, the most popular CRC, and byte
blocks. If the length of a binary string is smaller than the degree of the CRC generator, its CRC
value is the string itself. The rightmost block does not have any following zeros, and thus its
CRC is the block itself. The results are combined using XOR, and then combined with the
output of LUT n, the CRC of the value from the previous iteration. In order to shorten the
critical path, we introduce another stage called the temporary register stage. This makes the
algorithm more scalable because more blocks can be added without increasing the critical path
of the pipeline. With the pre-XOR stage, the critical path is the delay of a LUT and a two-input
XOR gate, and the throughput increases.
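The property that makes this block-wise scheme possible is the linearity of the remainder: the remainder of a whole message equals the XOR of the remainders of its blocks, each followed by the zeros that stand in for the blocks to its right. The sketch below checks this for byte blocks, assuming the CCITT generator x^16 + x^12 + x^5 + 1. It computes plain polynomial remainders on the fly where a hardware design would read them from lookup tables, and the function names are hypothetical.

```python
POLY = 0x11021      # x^16 + x^12 + x^5 + 1, top term included
DEGREE = 16


def mod2(value):
    """Modulo-2 remainder of an integer-encoded polynomial divided by POLY."""
    for shift in range(value.bit_length() - 1, DEGREE - 1, -1):
        if (value >> shift) & 1:
            value ^= POLY << (shift - DEGREE)
    return value


def combined_remainder(message):
    """XOR together the remainders of each byte block padded with its trailing zeros."""
    total = 0
    for position, byte in enumerate(message):
        trailing_zeros = 8 * (len(message) - 1 - position)
        total ^= mod2(byte << trailing_zeros)   # a lookup table would supply this value
    return total


message = bytes([0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38])
print(combined_remainder(message) == mod2(int.from_bytes(message, "big")))   # prints True
print(mod2(0x38) == 0x38)   # a block shorter than the generator is its own remainder: prints True
```

Since the per-block remainders do not depend on one another, they can be looked up simultaneously and merged with a tree of XOR gates.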
Since the CRC of the first block is the first block itself, it can be easily combined with the
following four blocks by appending zeros using LUT. To exploit this property, the first iteration
loads the first eight bits from the message. Small lookup tables are used to construct LUT k,
and the CRC for the blocks is calculated in the first iteration. The CRC is calculated according
to the CCITT CRC-16 standard. This standard defines the CRC generator polynomial by which
the generator circuit is constructed. Conventional circuits are synchronous and use buffered
synchronous pipelines. In these pipelines, "pipeline registers" are inserted in between pipeline
stages, and are clocked synchronously. The time between each clock signal is set to be greater
than the longest delay between pipeline stages, so that when the registers are clocked, the data
that is written to them is the final result of the previous stage.
The technique uses a buffer to store the input data in bytes, which allows the computation
to run at one cycle per byte (instead of one cycle per bit).
At the same time, the generated CRC is automatically processed with the data and stored in
the buffer.
The CRC of a message is calculated in parallel to achieve better throughput.
Chapter 5
IMPLEMENTATION
1. Software Implementation
2. Hardware Implementation
For a long time, programming languages such as FORTRAN, Pascal, and C were
being used to describe computer programs that were sequential in nature. Similarly, in the
digital design field, designers felt the need for a standard language to describe digital circuits.
Thus, Hardware Description Languages (HDLs) came into existence. HDLs allowed the
designers to model the concurrency of processes found in hardware elements. Hardware
description languages such as Verilog HDL and VHDL became popular. Verilog HDL
originated in 1983 at Gateway Design Automation. Later, VHDL was developed under contract
from DARPA. Both Verilog® and VHDL simulators to simulate large digital circuits quickly
gained acceptance from designers. Even though HDLs were popular for logic verification,
designers had to manually translate the HDL-based design into a schematic circuit with
interconnections between gates. The advent of logic synthesis in the late 1980s changed the
design methodology radically. Digital circuits could be described at a register transfer level
(RTL) by use of an HDL. Thus, the designer had to specify how the data flows between
registers and how the design processes the data. The details of gates and their interconnections
to implement the circuit were automatically extracted by logic synthesis tools from the RTL
description. Thus, logic synthesis pushed the HDLs into the forefront of digital design.
Designers no longer had to manually place gates to build digital circuits. They could describe
complex circuits at an abstract level in terms of functionality and data flow by designing those
circuits in HDLs. Logic synthesis tools would implement the specified functionality in terms
of gates and gate interconnections.
HDLs also began to be used for system-level design. HDLs were used for simulation of system
boards, interconnect buses, FPGAs (Field Programmable Gate Arrays), and PALs
(Programmable Array Logic). A common approach is to design each IC chip, using an HDL,
and then verify system functionality via simulation.
Verilog HDL allows different levels of abstraction to be mixed in the same model.
Thus, a designer can define a hardware model in terms of switches, gates, RTL, or
behavioural code. Also, a designer needs to learn only one language for stimulus and
hierarchical design.
Most popular logic synthesis tools support Verilog HDL. This makes it the language of
choice for designers.
All fabrication vendors provide Verilog HDL libraries for post logic synthesis
simulation. Thus, designing a chip in Verilog HDL allows the widest choice of vendors.
The Programming Language Interface (PLI) is a powerful feature that allows the user
to write custom C code to interact with the internal data structures of Verilog. Designers
can customize a Verilog HDL simulator to their needs with the PLI.
There are two basic types of digital design methodologies: a top-down design
methodology and a bottom-up design methodology. In a top-down design methodology, we
define the top-level block and identify the sub-blocks necessary to build the top-level block.
We further subdivide the sub-blocks until we come to leaf cells, which are the cells that cannot
further be divided.
In a bottom-up design methodology, we first identify the building blocks that are
available to us. We build bigger cells, using these building blocks. These cells are then used
for higher-level blocks until we build the top-level block in the design.
XILINX ISE provides the HDL and schematic editors, logic synthesizer, fitter, and bit stream
generator software.
The XSTOOLs from XESS provide utilities for downloading the bit stream into the FPGA on
the XSA Board.
Xilinx ISE. Implementing a logic design with an FPGA usually consists of the following steps:
1. You enter a description of your logic circuit using a hardware description language (HDL)
such as VHDL or Verilog. You can also draw your design using a schematic editor.
2. You use a logic synthesizer program to transform the HDL or schematic into a net list. The
net list is just a description of the various logic gates in your design and how they are
interconnected.
3. You use the implementation tools to map the logic gates and interconnections into the FPGA.
The FPGA consists of many configurable logic blocks, which can be further decomposed
into look-up tables that perform logic operations. The CLBs and LUTs are interwoven with
various routing resources. The mapping tool collects your net list gates into groups that fit
into the LUTs and then the place & route tool assigns the groups to specific CLBs while
opening or closing the switches in the routing matrices to connect them together.
4. Once the implementation phase is complete, a program extracts the state of the switches in
the routing matrices and generates a bit stream where the ones and zeroes correspond to
open or closed switches. (This is a bit of a simplification, but it will serve for the purposes
of this tutorial.)
5. The bit stream is downloaded into a physical FPGA chip (usually embedded in some larger
system). The electronic switches in the FPGA open or close in response to the binary bits in the
bit stream. Upon completion of the downloading, the FPGA will perform the operations
specified by your HDL code or schematic.
To create a new project:
1. Select File >New Project... The New Project Wizard appears.
2. Type tutorial in the Project Name field.
3. Enter or browse to a location (directory path) for the new project. A tutorial
subdirectory is created automatically.
4. Verify that HDL is selected from the Top-Level Source Type list.
5. Click Next to move to the device properties page.
6. Fill in the properties in the table as shown below:
♦ Product Category: All
♦ Family: Spartan3
♦ Device: XC3S200
♦ Package: FT256
♦ Speed Grade: -4
♦ Top-Level Source Type: HDL
♦ Synthesis Tool: XST (VHDL/Verilog)
♦ Simulator: ISE Simulator (VHDL/Verilog)
♦ Preferred Language: Verilog (or VHDL)
♦ Verify that Enable Enhanced Design Summary is selected.
Leave the default values in the remaining fields.
When the table is complete, your project properties will look like the following:
Figure 5.1: Project properties
7. Click Next to proceed to the Create New Source window in the New Project Wizard. At the
end of the next section, your new project will be complete.
7. Click Next, then Finish in the New Source Wizard - Summary dialog box to complete
the new source file template.
8. Click Next, then Next, then Finish.
When the source files are complete, check the syntax of the design to find errors and
typos.
Note: You must correct any errors found in your source files. You can check for errors in the
Console tab of the Transcript window. If you continue without valid syntax, you will not
be able to simulate or synthesize your design.
5. Close the HDL file.
4) Design Simulation
Create a test bench waveform containing input stimulus you can use to verify the
functionality of the counter module. The test bench waveform is a graphical view of a test
bench.
Create the test bench waveform as follows:
1. Select the example HDL file in the Sources window.
2. Create a new test bench source by selecting Project → New Source.
3. In the New Source Wizard, select Test Bench Waveform as the source type, and type
Counter_tbw in the File Name field.
4. Click Next.
5. The Associated Source page shows that you are associating the test bench waveform
with the source file counter. Click Next.
6. The Summary page shows that the source will be added to the project, and it displays
the source directory, type and name. Click Finish.
7. You need to set the clock frequency, setup time and output delay times in the Initialize
Timing dialog box.
8. Click Finish to complete the timing initialization.
9. The blue shaded areas that precede the rising edge of the CLOCK correspond to the
Input Setup Time in the Initialize Timing dialog box.
10. Save the waveform.
Chapter 6
SIMULATION RESULTS
6.1.2 Technology schematic
6.1.1 Simulation waveforms
6.2 One stage parallel CRC
6.2.2 Technology Schematic
6.2.3 Simulation results
6.3 Three stage parallel CRC
6.3.2 Technology schematic
6.3.3 Simulation Results
Chapter 7
CONCLUSION AND DISCUSSION
A three-stage parallel CRC was designed and implemented using the Xilinx ISE design tools.
A serial CRC was implemented first, using a linear feedback shift register (LFSR). In this
design the input is applied serially, one bit per clock cycle, and the output is likewise
generated serially. The serial design requires 18 clock cycles, which introduces latency in
the circuit. Parallelism is used to reduce this latency.
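The bit-serial behaviour described above can be modelled in software. The following Python sketch is only an illustration and is not the HDL used in this work: the generator polynomial x^9 + x^8 + x + 1, the all-zero initial state, and the names POLY, WIDTH and crc_serial are assumptions chosen for the example. It shifts one input bit into a 9-bit LFSR per clock cycle and counts the cycles.

```python
# Software model of a bit-serial CRC-9 LFSR (illustrative sketch).
# Assumption: generator G(x) = x^9 + x^8 + x + 1; the thesis polynomial may differ.
WIDTH = 9
POLY = 0x103                 # low 9 coefficients of G(x): x^8 + x + 1
MASK = (1 << WIDTH) - 1


def crc_serial(bits):
    """Shift the bits in MSB first; return (remainder, clock cycles used)."""
    state = 0                # assumed all-zero initial register contents
    cycles = 0
    for bit in bits:
        # Feedback is the bit leaving the register XORed with the input bit.
        feedback = ((state >> (WIDTH - 1)) & 1) ^ bit
        state = (state << 1) & MASK
        if feedback:
            state ^= POLY
        cycles += 1          # one input bit is consumed per clock cycle
    return state, cycles
```

Calling crc_serial on an 18-bit list returns the 9-bit remainder together with a cycle count of 18, which corresponds to one bit per clock cycle.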
Parallelism can be introduced using the look-ahead technique. A one-stage parallel CRC was
designed next. Although it is derived from the parallel formulation, the one-stage design still
takes a single input bit and produces a single output bit per clock cycle, so the latency is
unchanged. A 3-stage parallel CRC architecture is therefore adopted.
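The look-ahead idea can be written compactly in state-space form. The notation below is an assumption introduced for illustration: x(n) is the LFSR state vector, u(n) the input bit, A the state-transition matrix and b the input vector, with all arithmetic over GF(2).

```latex
\begin{align}
x(n+1) &= A\,x(n) + b\,u(n) \\
x(n+3) &= A^{3}\,x(n) + A^{2}b\,u(n) + A\,b\,u(n+1) + b\,u(n+2)
\end{align}
```

Applying the first equation three times in succession gives the second, so a circuit that evaluates the second equation consumes three input bits per clock cycle.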
The 3-stage parallel CRC can be implemented using the pipelining technique. In this design 3
bits are fed as input at the same time, so the latency decreases from 18 to 6 clock cycles. The
latency could be reduced further with a higher level of parallelism, but doing so increases the
critical path and the hardware cost. The proposed design reduces the critical path without
increasing the hardware cost, and it is applicable to any type of LFSR architecture.
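The reduction from 18 to 6 clock cycles can be checked with a small software model. The Python sketch below is an illustration under stated assumptions, not the architecture implemented in this work: the generator polynomial x^9 + x^8 + x + 1, the all-zero initial state, and the helper names step, build_lookahead and crc_parallel are hypothetical. It precomputes the XOR contributions of a 3-step look-ahead and then consumes 3 bits per modelled clock cycle; pipelining and retiming of the hardware are not modelled.

```python
# 3-bit look-ahead model of a CRC-9 LFSR (illustrative sketch).
# Assumption: generator G(x) = x^9 + x^8 + x + 1; the thesis polynomial may differ.
WIDTH = 9
POLY = 0x103                 # low 9 coefficients of G(x): x^8 + x + 1
MASK = (1 << WIDTH) - 1


def step(state, bit):
    """One serial LFSR clock: shift in a single bit."""
    feedback = ((state >> (WIDTH - 1)) & 1) ^ bit
    state = (state << 1) & MASK
    return state ^ POLY if feedback else state


def crc_serial(bits):
    """Reference model: one bit per clock cycle."""
    state = 0
    for bit in bits:
        state = step(state, bit)
    return state, len(bits)


def build_lookahead(p):
    """Precompute what each state bit and each input bit contributes after p steps."""
    state_cols = []
    for i in range(WIDTH):
        s = 1 << i
        for _ in range(p):
            s = step(s, 0)
        state_cols.append(s)
    input_cols = []
    for j in range(p):
        s = 0
        for k in range(p):
            s = step(s, 1 if k == j else 0)
        input_cols.append(s)
    return state_cols, input_cols


def crc_parallel(bits, p=3):
    """Consume p bits per clock cycle using the precomputed XOR masks."""
    if len(bits) % p:
        raise ValueError("input length must be a multiple of p")
    state_cols, input_cols = build_lookahead(p)
    state, cycles = 0, 0
    for n in range(0, len(bits), p):
        nxt = 0
        for i in range(WIDTH):
            if (state >> i) & 1:
                nxt ^= state_cols[i]      # contribution of the current state
        for j in range(p):
            if bits[n + j]:
                nxt ^= input_cols[j]      # contribution of the j-th bit in the group
        state = nxt
        cycles += 1
    return state, cycles
```

Because the LFSR update is linear over GF(2), XORing the precomputed contributions gives the same register contents as three serial steps. For an 18-bit input, crc_parallel returns the same remainder as crc_serial while counting 6 cycles instead of 18.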