Zhang Mastersthesis 2018
by
THESIS
Of the Requirements
For the Degree
MASTER OF SCIENCE
in Computer Engineering
DECEMBER, 2018
A SCALABLE IMAGE/VIDEO PROCESSING PLATFORM WITH
APPROXIMATE DESIGN
by
APPROVED BY
possible.
Acknowledgments
First and foremost, I am grateful to my major advisor, Dr. Xiaokun Yang, for
being friendly, caring, supportive, and helpful in numerous ways. Without his support,
I could not have done what I was able to do. He was very generous in sharing his
experience in electrical and computer engineering, academic life, and beyond. He is
not only my advisor but also a friend who will inspire me for the rest of my life.
Next, I would like to thank the members of my committee, Dr. Jiang Lu and Dr.
Lei Wu, for their support and suggestions in improving the quality of this dissertation.
I am truly honored to have such knowledgeable professors serving
as my committee members.
I would also like to thank all the lab mates and members at the Advance Digital
System Design (ADSD) Laboratory for creating an amazing working environment,
and thank my friends, Archit Gajjar and Cui Xue, for their assistance on work related
to my research.
Furthermore, I would also like to acknowledge the research support provided from
Finally, I want to thank my family for their unconditional love, faith, and
encouragement.
ABSTRACT
A SCALABLE IMAGE/VIDEO PROCESSING PLATFORM WITH
Language (HDL) and the verification environment including six Open Verification
Components (OVCs) are provided. Compared to prior works, our proposed work
achieves the least FPGA resource cost (753 Look Up Tables (LUTs) and 277 Reg-
mations of multipliers and two approximations of adders, along with the exact designs,
are presented and integrated as twelve benchmarks to implement RGB to grayscale
conversion as a case study. Experimental results show that the minimum slice-energy
cost, integrated with approximate#2 adder and approximate#3 multiplier, achieves
25.17% slice-energy saving compared with the exact design by sacrificing the quality
TABLE OF CONTENTS
Chapter
1. INTRODUCTION
1.1 Scalable Image/Video Processing Platform
1.2 Approximate Design on Combinational Circuits
1.3 Advance Approximate application on Sequential Circuit
1.4 Structure Of The Dissertation
3. Approximate Design
3.1 Hierarchical Synthesis of Approximate Design
3.2 Proposed Works
3.3 Implementation
3.4 Experimental Results
3.5 Conclusion
3.6 Advance Approximate application
3.7 Slice-Energy Saving on FPGA Platform with Approximate Computing
3.8 Static Evaluation
3.9 FPGA Implementation and Simulation
3.10 Summary
4. CONCLUSIONS AND FUTURE WORK
4.1 Summary
4.2 Future Work
VITA
LIST OF TABLES
Table Page
LIST OF FIGURES
Figure Page
3.8 Design Structures of RGB2Grayscale Converter
CHAPTER 1
INTRODUCTION
Computer vision applications such as object and facial recognition [25], [12] are
growing rapidly, bringing many challenges in computation speed and power consumption
to traditional software-based frameworks. As a result of the ad-
In prior works, such FPGA designs mostly relied on high-level synthesis
(HLS) and sometimes involved software and GPUs. For example, Ref. [16]
proposed a real-time image acquisition design using LabVIEW with GPU-based
acceleration, which is able to sustain the rate of data acquisition. Similarly, Ref. [28]
presented an implementation consisting of a camera with a LabVIEW frame grabber, a mask
generated by MATLAB, and an image processing design on LabVIEW FPGA. Such an
implementation can prove out a system easily by directly using the block-based
design, but it is hard to customize or improve the system because the block libraries
are usually not open source.
In this context, many researchers have proposed register-transfer
level (RTL) designs for image/video processing. For example, Mike Field’s OV7670-
FPGA-VGA project [8] has been widely reused and expanded into several prototypes
of new research ideas related to image/video processing systems. This open
source code is written in VHDL and runs on a ZedBoard FPGA. Moreover, a
reconfigurable platform performing edge detection, interfacing an OV7610 camera
as the image input and VGA as the result output, has been presented in Ref. [2]. Since
the FPGA resource cost and power consumption have not been provided in these two
board. It consists of 1,616 logic elements and 818 registers. The main concern with this
project is the lack of a verification environment, which makes the project difficult
to reuse and expand. To avoid the aforementioned problems, in this thesis we
on the Nexys-4 board, it is able to display the original and processed images on a
VGA-interfaced monitor. Our contributions from this thesis include:
Models (OVCs) such as Bus Function Models (BFMs) and scoreboards [39],
in order to make the design reusable and expandable. The implementation is
written in Verilog Hardware Description Language (HDL) and performed on
data between DUT and golden models, aiming to increase the re-usability and
reliability of the open source designs.
• We presented a 640 × 480 resolution VGA display that shows up to four 320
× 240 resolution images at the same time. In this way, the original images,
the in-process images, and the final results can be shown
in the same window.
resource cost: 753 slice look-up tables (LUTs) and 277 slice registers. The
power consumption is around 220 mW for displaying and processing a single
frame of color images.
data mining [15], [31], [11], [43], and [30]. For these kinds of applications, approximate
computing serves as an important technique to reduce the design area, power
consumption, and computation delay in digital systems. This is a tradeoff between
research groups [29], [5], [22], and [27] reported effective results on approximate
computing. However, most of them are limited to software applications. Due to
the advantages of pure hardware implementations, such as reconfigurability
and hardware parallelism, we believe that FPGAs will be adopted in future or
next-generation high-performance SoCs. For example, adders, as one of the important
components in circuit design, have commonly been
considered for approximate implementation [38]. Some approximate adders have been discussed in [19]
and [21]. These works focused on subcomponent designs; however, the impact
of the approximations on structural implementations has not been considered.
Ref. [18] used 2 × 2 approximate multiplier blocks to compute the final results, and
Ref. [14] introduced approximate speculative adders used in a multiplier. In this
thesis, we therefore propose a set of approximate FPGA design components in order to
find a better balance between accuracy and power cost, providing a wide range of
solutions for the different energy-quality tradeoffs of different applications.
One of the big challenges of this integrated system is hardware programmability
on FPGA. Basically, FPGA development needs to balance resource cost against
algorithm accuracy and requires extensive hand-coding in RTL. To fill the gap between software
programming and hardware design and to provide more RTL choices for different
project requirements, we presented a basic FPGA design library including adders and
multipliers with five accuracy levels: exact design (EX), approximate
design #1 (AP1), approximate design #2 (AP2), approximate design #3
(AP3), and approximate design #4 (AP4). More specifically, the main contributions
of this thesis are:
test cases.
• We evaluated the tradeoffs between design accuracy and resource cost, by which
In order to reduce the computational cost and improve the energy efficiency, approximate
design on FPGA platforms has been widely used in many application
domains, such as artificial intelligence [13], edge computing [10, 32, 36], and Internet-of-Things
(IoT) security [41], [37].
tiple approximations of computational components affects the energy cost and slice
utilization, making the tradeoff between accuracy and slice-energy reduction difficult.
Finding the tradeoff between quality of results and energy cost on FPGA is
one of the key points driving research on approximate computing. Previous
works in this field have mainly focused on combinational circuit design [17], and some
researchers concentrated on optimizing the flow chart to reduce the energy consumption
[42]. In this thesis we provided a design of a sequential circuit with twelve
overcome this issue, Ref. [44] proposed four approximations of addition models and
evaluated the quality of results on a histogram equalization algorithm. The idea was
simulated in MATLAB, so the hardware performance was not estimated. Therefore, our
work focuses on proposing several approximate multipliers and employing the approximate
adders as well; more importantly, the slice-energy savings are estimated and
demonstrated on an FPGA platform.
Against this background, this thesis proposes several approximations of adders
and multipliers, and further applies all the components to a sequential circuit design
of a color-to-grayscale converter. More specifically, the main contributions are:
tions of RTL design, and synthesized the DUT on a Nexys-4 Artix-7 FPGA. The
performance in terms of slice count and dynamic energy consumption was estimated
using the performance evaluation methodology [35].
• We evaluated the cost saving on weighted slice count and energy dissipation
using a slice-energy metric in our work. Experimental results show that the
slice-energy reduction can reach 25.17% compared with the exact
design.
The rest of this dissertation is organized as follows. Chapter 2 presents a scalable
image/video platform with approximate computing design on a Field-Programmable
Gate Array (FPGA). Chapter 3 presents a novel hierarchical synthesis for an approximate
FPGA component library. Section 3.6 discusses the advanced approximate
application, which provides several slice-energy cost solutions corresponding to
different application constraints. Finally, in Chapter 4, we conclude this dissertation
CHAPTER 2
in real time. This platform includes an OV7670 camera, a Nexys-4 FPGA board,
and a monitor with a VGA port. The FPGA platform is designed in Verilog HDL
and includes three submodules: I2C Controller, Image Capture, and VGA Master. The
I2C Controller is the control module of the OV7670 camera; it uses the I2C protocol
to set the functional registers. Once configured, the camera captures and
sends images pixel by pixel through the ‘VSYNC-HREF-DATA’ interface.
The Image Capture module receives the image data and stores it into four memory
blocks, named the ‘Frame Buffer’; each block can store one 320 × 240 resolution
image. Then, the VGA output module reads the data from the buffers and sends it
to a VGA-interfaced monitor. In this platform, the full-screen monitor (640 × 480
resolution) is split into four regions (320 × 240 resolution each) as shown in Fig. 2.1.
‘Region0’ is used to display the data from ‘FrameBuffer0’. ‘Region1’, ‘Region2’,
and ‘Region3’ are used to display the data processed by three different algorithms:
Figure 2.1: 640 × 480 Window with Four 320 × 240 Regions
‘Alg1’, ‘Alg2’, and ‘Alg3’. Notice that multiple clock cycles or memory blocks might
be needed based on the complexity of the algorithms.
The open source package shown in Fig. 2.3 includes a synthesizable design with
Figure 2.3: FPGA Design, Verification, and Synthesis
implementation, including a constraint file and an FPGA netlist. Notice that the synthesis
files can also be used to generate performance results in terms of slice
count and power cost. The most important contribution of this section is the verification
environment shown in the blue box. The verification environment contains a
Tcl script, a file list, and a testbench. The Tcl script is created to configure the design
and verification models to work in different modes.
The testbench provides three BFMs (Capture Master, I2C Slave, and VGA Slave) and
three scoreboards (SBs) (Capture SB, I2C SB, and VGA SB) as OVCs. For instance, the
DUT receives images from the capture master BFM. The input image data stored
in the ‘rgb565_input.txt’ file is in the RGB565 format: a 5-bit red component, a 6-bit
green component, and a 5-bit blue component per pixel. The BFM is able to transfer 8 bits
of data in each clock cycle. Therefore, a 16-bit pixel requires two clock cycles to finish
the transmission. Similarly, the I2C slave BFM is designed to receive and respond to commands from
the I2C master. The VGA slave is created to collect the image data from the VGA
interface.
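The two-cycle transfer of one RGB565 pixel can be illustrated with a minimal Python sketch. The function names are hypothetical, and the byte order (high byte first) is an assumption; the text only states that 8 bits are transferred per clock cycle.

```python
def pack_rgb565(r5, g6, b5):
    """Pack a 5-bit red, 6-bit green and 5-bit blue value into one 16-bit word."""
    assert 0 <= r5 < 32 and 0 <= g6 < 64 and 0 <= b5 < 32
    return (r5 << 11) | (g6 << 5) | b5


def pixel_to_bytes(pixel16):
    """Split a 16-bit pixel into the two bytes sent on consecutive clock cycles.

    Assumption: the high byte is sent first.
    """
    return (pixel16 >> 8) & 0xFF, pixel16 & 0xFF


def bytes_to_pixel(first, second):
    """Reassemble the 16-bit pixel on the receiving side."""
    return ((first & 0xFF) << 8) | (second & 0xFF)


if __name__ == "__main__":
    pixel = pack_rgb565(31, 0, 0)           # pure red
    first, second = pixel_to_bytes(pixel)   # two 8-bit transfers
    print(hex(pixel), hex(first), hex(second))
```

A BFM model built this way needs exactly two byte transfers per pixel, which matches the two clock cycles mentioned above.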
Each of the three BFMs is paired with a scoreboard. For example, the
VGA scoreboard compares the red, green, and blue pixels driven by the VGA master with
the golden data in the ‘golden_r.txt’, ‘golden_g.txt’, and ‘golden_b.txt’ files. Likewise, the
I2C scoreboard compares each command received by the I2C slave BFM with the
original register configuration. The capture scoreboard verifies the data stored into the
memory blocks against the golden data from the ‘golden_rgb444.txt’ file. All the DUT modules are
discussed in Section 2.4.
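The scoreboard idea, comparing what the DUT produced against golden data, can be sketched in a few lines of Python. This is only an illustration of the comparison step, not the Verilog scoreboards of the thesis; the function name and the return format are hypothetical.

```python
def scoreboard(dut_samples, golden_samples):
    """Compare DUT output with golden data sample by sample.

    Returns (passed, mismatches), where mismatches is a list of
    (index, dut_value, golden_value) tuples.
    """
    mismatches = []
    for idx, (dut, gold) in enumerate(zip(dut_samples, golden_samples)):
        if dut != gold:
            mismatches.append((idx, dut, gold))
    # A different number of samples is also treated as a failure.
    passed = not mismatches and len(dut_samples) == len(golden_samples)
    return passed, mismatches


if __name__ == "__main__":
    ok, errors = scoreboard([0xF00, 0x0F0, 0x00F], [0xF00, 0x0F0, 0x00F])
    print("PASS" if ok else "FAIL", errors)
```

In a setup like the one described, the golden samples would be read from the golden text files and the DUT samples would be collected by the corresponding BFM.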
All the design submodules and Intellectual Properties (IPs) are introduced in this
section, including the I2C Master, the Image Capture slave, the VGA Master, the
clock PLL, and Frame Buffers.
The design of the OV7670 Controller is shown in Fig. 2.5. The most important
modules of this controller are the OV7670 register and the I2C sender. All register
setting values for the OV7670 camera are generated in the module OV7670 register. These
register values are sent by the module I2C sender, which generates the clock line
(SIOC) and the data line (SIOD). These lines follow the I2C interface. When the
SIOD signal is pulled low while the SIOC signal remains high, the data transfer
is initiated. After the SIOC signal is pulled low, the SIOD signal begins to
send the first data bit. The camera receives the first data bit when SIOC is pulled
high again. This process repeats until the camera receives a stop signal. The stop
Figure 2.4: Signal ‘data sr’
signal is received when the SIOC signal is pulled high and is followed by the SIOD
signal being pulled high. According to the OV7670 datasheet [26], the I2C slave has an 8-bit ID
that specifies the write command as ‘0x42’ and the read command as ‘0x43’.
In the I2C sender, the multi-bit signal ‘data_sr’ is sequentially driven on SIOD. This
32-bit signal consists of the 3-bit ‘100’, the 8-bit hex ‘42’, a 1-bit ‘0’, the 8-bit register address,
a 1-bit ‘0’, the 8-bit register value, a 1-bit ‘0’, and the 2-bit ‘01’, as shown in Fig. 2.4. Another
signal called ‘busy_sr’ shows the status of the data write. For example, when data is
just stored into ‘data_sr’, ‘busy_sr’ should be the 32-bit hex ‘FFFFFFFF’. Then, after one
bit of ‘data_sr’ is sent, the least significant bit of ‘busy_sr’ becomes zero.
This process repeats until all bits of ‘busy_sr’ become zero. After ‘busy_sr’ becomes
zero, the ‘token’ signal becomes one and is sent to the OV7670 register module to ask for the
next register value. For the process to work correctly, SIOC must follow the I2C protocol
described in the previous paragraph. The design detail of
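The 32-bit framing described above can be checked with a short Python sketch. It assumes that the word is shifted out most significant bit first and that the busy register is shifted left with a zero entering at the least significant bit; neither detail is stated explicitly in the text. The function names are hypothetical.

```python
WRITE_ID = 0x42  # 8-bit write ID of the OV7670, as given in the text


def build_data_sr(reg_addr, reg_value, dev_id=WRITE_ID):
    """Build the 32-bit word: '100' | id | 0 | address | 0 | value | 0 | '01'."""
    assert 0 <= reg_addr <= 0xFF and 0 <= reg_value <= 0xFF
    word = 0b100                      # 3-bit leading field
    for byte in (dev_id, reg_addr, reg_value):
        word = (word << 8) | byte     # 8 payload bits
        word = word << 1              # 1-bit '0' following each byte
    return (word << 2) | 0b01         # 2-bit trailing field


def shift_out(data_sr):
    """Return the bits in sending order while busy_sr drains to zero.

    Assumption: MSB first; busy_sr shifts left and is filled with zeros.
    """
    busy_sr = 0xFFFFFFFF
    bits = []
    while busy_sr != 0:
        bits.append((data_sr >> 31) & 1)
        data_sr = (data_sr << 1) & 0xFFFFFFFF
        busy_sr = (busy_sr << 1) & 0xFFFFFFFF
    return bits


if __name__ == "__main__":
    word = build_data_sr(0x12, 0x14)
    print(format(word, "032b"), len(shift_out(word)))
```

The field widths add up to 3 + 3 × (8 + 1) + 2 = 32 bits, so ‘busy_sr’ reaches zero after exactly 32 shifts.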
With the register values given in the datasheet, the OV7670 camera did not work
as expected. Many of the register settings shown in the datasheet come
without a detailed description. Thankfully, Mike Field [8], with help from Chris Wilson,
worked out the necessary register values. Based on his design, some of the register values
were changed to fit this platform. For example, register COM7 at address 12 is written with the
binary value ‘00010100’ instead of ‘00000100’. The only change here is the fifth bit
Figure 2.5: OV7670 Controller
which sets the output frame size to QVGA (320 × 240 resolution). The registers
HSTART, HSTOP, VSTART, VSTOP, HREF, and VREF are also changed to produce
the right timing for the QVGA output.
The design of the OV7670 Capture module is shown in Fig. 2.8. Each set of data needs 4
clock cycles, as shown in Table 2.1. The address signal is named ‘addr’ and the next
address signal is named ‘addr_next’. The ‘wr_hold’ signal is a 2-bit signal that holds
the horizontal reference value, named ‘href’, from the previous clock cycle. The ‘d_latch’
signal holds the input data, and only on the third clock cycle does it contain all
the RGB565 format data. At the fourth clock cycle, the ‘dout’ signal consists of
‘d_latch’ [15:12], [10:7], and [4:1], and the write enable (we) signal is pulled high. The
signal ‘dout’ is not equal to ‘d_latch’ because the VGA output can only display the most
significant 4 bits of each RGB component. The vertical sync signal, named ‘vsync’, is initialized
low. When the ‘vsync’ signal is pulled high, all the signals are reset and the capture
process begins. The value of the horizontal reference signal (Href) is held in ‘wr_hold’. The
Figure 2.6: OV7670 I2C Sender
Figure 2.8: OV7670 Capture
‘wr_hold[0]’ is the current value of ‘Href’ and ‘wr_hold[1]’ is the previous value of
‘Href’.
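The bit selection that turns the 16-bit RGB565 word in ‘d_latch’ into the 12-bit RGB444 word in ‘dout’ can be sketched in Python as follows. The function name is hypothetical; the bit ranges [15:12], [10:7], and [4:1] are taken from the text.

```python
def rgb565_to_rgb444(d_latch):
    """Keep the 4 most significant bits of each color component."""
    r4 = (d_latch >> 12) & 0xF   # d_latch[15:12], top 4 of the 5 red bits
    g4 = (d_latch >> 7) & 0xF    # d_latch[10:7],  top 4 of the 6 green bits
    b4 = (d_latch >> 1) & 0xF    # d_latch[4:1],   top 4 of the 5 blue bits
    return (r4 << 8) | (g4 << 4) | b4


if __name__ == "__main__":
    for pixel in (0xFFFF, 0xF800, 0x07E0, 0x001F):
        print(hex(pixel), "->", hex(rgb565_to_rgb444(pixel)))
```

The dropped bits are the least significant bit of red and blue and the two least significant bits of green, so a value such as 0x0020 (only the lowest green bit set) maps to zero.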
In this platform, two types of frame buffers are designed, one for simulation and one for FPGA
implementation. The reason is that Xilinx Vivado uses logic cells on the FPGA
instead of RAM if the buffer is not created with the block memory generator, and the
number of logic cells in the FPGA is not enough to implement all the buffers. However,
if the frame buffer is generated by the block memory generator, ModelSim cannot
Figure 2.9: OV7670 Capture Timing
use the files to run the simulation. This conflict can be solved by separating the buffer
design into a simulation design and an implementation IP.
Fig. 2.10 and Table 2.2 describe the design of the VGA port.
Five signals are output to the monitor: red data, green data, blue data, horizontal
sync (Hsync), and vertical sync (Vsync). The display is enabled when receiving a
logic high on Hsync and Vsync. The VGA refresh rate can be between 50 Hz
and 120 Hz depending on the input sync signals. The sync signal timing required for a
640 × 480 resolution at a 60 Hz refresh rate is shown in Table 2.2. The Vsync signal
is counted in lines, and Hsync is counted in pixels
within each line. More specifically, one sync period of Hsync is made up of
several phases: the display time (DIS), pulse width (PW), front
porch (FP), and back porch (BP). Each phase has its own timing parameters that must
be followed.
Figure 2.10: Nexys-4 VGA timing
Fig. 2.10 displays the timing details of Hsync and Vsync. These two signals follow
the VGA protocol, which states that the monitor can receive pixel data from the
red, green, and blue channels only during the DIS phase. Following those details, the design of
the VGA Master needs two counters. One counter counts the number of clock cycles
within each horizontal line and the other counts the number of vertical lines.
Fig. 2.1 displays the final result of this platform. There are four different regions
of the output showing the frame data from different frame buffers. The 2-bit signal
called ‘Region’ is designed to control the delivery of frame data. In the VGA master
module, the horizontal and vertical counters are used to assign
the frame data to the designated region. When the frame data is ready to go, the Vsync
signal needs to wait for the back porch (BP) delay and then pulls high. To finish the first
frame, Vsync needs to wait for the front porch (FP) and then pulls low. Similar to the Hsync,
Symbol   VSYNC                            HSYNC
         Time (us)   Clock     Lines      Time (us)   Clock
SP       16,700      416,800   521        32          800
DIS      15,360      384,000   480        25.6        640
PW       64          1,600     2          3.84        96
FP       320         8,000     10         0.64        16
BP       928         23,200    29         1.92        48
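The entries of the timing table can be cross-checked with a small Python sketch. It assumes the 25 MHz VGA clock mentioned elsewhere in this chapter; the dictionary and function names are hypothetical.

```python
PIXEL_CLOCK_HZ = 25_000_000  # assumption: 25 MHz VGA clock

# Horizontal phases in pixel clocks and vertical phases in lines, from the table.
H = {"DIS": 640, "FP": 16, "PW": 96, "BP": 48}
V = {"DIS": 480, "FP": 10, "PW": 2, "BP": 29}


def vga_totals():
    """Return (clocks per line, lines per frame, line time in us,
    frame time in us, refresh rate in Hz)."""
    h_total = sum(H.values())
    v_total = sum(V.values())
    line_time_us = h_total * 1_000_000 / PIXEL_CLOCK_HZ
    frame_time_us = v_total * line_time_us
    refresh_hz = 1_000_000 / frame_time_us
    return h_total, v_total, line_time_us, frame_time_us, refresh_hz


if __name__ == "__main__":
    print(vga_totals())
```

Under these assumptions one line takes 800 clocks (32 us) and one frame takes 521 lines, which gives a frame time of about 16.7 ms and a refresh rate just below 60 Hz, in line with the SP row of the table.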
The 640 × 480 display window can simultaneously display four 320 × 240 images.
To place each image into the correct position, two counters are designed to track
the timing: the counter ‘hcnt’ for the horizontal sync timing and the counter ‘vcnt’
for the vertical sync timing. Fig. 2.11 shows that when ‘vcnt’ is between 0 and
239, Region0 and Region1 are selected, and when ‘vcnt’ is between 240 and
479, Region2 and Region3 are selected. Likewise, while ‘hcnt’ is between 0
and 319, Region0 and Region2 are selected, and while ‘hcnt’ is between 320 and 639,
Region1 and Region3 are selected.
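The region selection described above can be written as a short Python sketch. The function names are hypothetical, and the local read address (row-major inside each 320 × 240 buffer) is an assumption made for illustration; the text does not give the address formula.

```python
REGION_W, REGION_H = 320, 240


def select_region(hcnt, vcnt):
    """Return the region index 0..3 for a visible pixel, or None outside 640 x 480."""
    if not (0 <= hcnt < 2 * REGION_W and 0 <= vcnt < 2 * REGION_H):
        return None
    right = hcnt >= REGION_W     # hcnt in 320..639 selects Region1 or Region3
    bottom = vcnt >= REGION_H    # vcnt in 240..479 selects Region2 or Region3
    return (2 if bottom else 0) + (1 if right else 0)


def local_address(hcnt, vcnt):
    """Assumed row-major read address inside the selected frame buffer."""
    return (vcnt % REGION_H) * REGION_W + (hcnt % REGION_W)


if __name__ == "__main__":
    for point in ((0, 0), (320, 0), (0, 240), (639, 479)):
        print(point, select_region(*point), local_address(*point))
```

With this mapping the two counter comparisons form the 2-bit region index directly: the vertical comparison gives the upper bit and the horizontal comparison gives the lower bit.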
In the top module of this design, five IPs are generated for FPGA implementation
on Xilinx Vivado. These five IPs are one PLL and four Frame Buffers. Fig. 2.12(a)
Clock        Input                 From            Freq (MHz)
Clk_100MHz   Clock Generator       FPGA            100
Clk_50MHz    I2C Master            PLL output      50
Clk_25MHz    VGA Master            PLL output      25
PClk         Image Capture slave   OV7670 Camera   20
SIOC         Camera Register       I2C Master      0.2
shows the design configuration of both the PLL and the Frame Buffer. The PLL is
used to divide the original 100 MHz clock into 50 MHz and 25 MHz clocks. The
50 MHz clock is applied to the I2C master and the 25 MHz clock is applied
to the VGA master. Table 2.3 summarizes the resulting clocks.
Fig. 2.12(b) displays the process diagram of the frame buffer. The size of
the frame buffer is calculated as 320 × 240 × 12 bits = 76,800 × 12 bits. The
write clock is the pixel clock from the OV7670, called ‘PClk’, which is
usually around 20 MHz. The read clock has to be the same as that of the VGA Master,
The verification result and implementation result are discussed in this section.
Figure 2.13: Verification Overview
functional. Fig. 2.13 shows the three scoreboards that are designed to check that each
input and output follows the design protocol. The subsections below show the
verification details of each module.
To verify the I2C controller in ModelSim, a run script is designed to include all the
files and display the important signals in the waveform. The I2C scoreboard sends a 25
MHz clock and a ‘0’ rst signal to the I2C Master module, and concurrently monitors
the output signals SIOD and SIOC. After receiving the clock and the ‘0’ rst signal, the
I2C master begins to generate the SIOD and SIOC signals. The I2C slave is designed as a
memory block meant to receive and store the register data from the I2C master by
Figure 2.14: Controller Verification Result
the I2C protocol. For each received register data set, the I2C scoreboard prints the
count number of this data, the received time, the exact data, and the data displayed to
the ModelSim transcript. In this simulation case, 54 register data sets are received
correctly. The first is received at 774650 ns and the last at 45016850 ns, or
0.045 seconds.
Given the register settings of the OV7670 camera mentioned in Section 2.5, the capture
module needs to convert RGB565 data to RGB444. As shown in Fig. 2.13, the VSYNC-HREF
Master is required to send the RGB565 data, VSYNC, and HREF. All the
signals need to follow the timing given in the OV7670 camera datasheet, which
is shown in Fig. 2.9. A file named “RGB565_Golden.txt”, produced by MATLAB
from a regular RGB image, is used as the golden model of the input. The golden model of
the output is produced from the same image but in the RGB444 format, and is named
“RGB444_Golden.txt”. The size of this image is 320 × 240, which is the same
as in the register settings. The OV7670 capture scoreboard compares the output data
to the golden model when write enable is high. The results are printed in the ModelSim
Figure 2.15: Capture Verification Result
transcript, as displayed in Fig. 2.15. Receiving one frame requires 15999950 ns,
or about 0.016 seconds.
For the VGA Master module, the output signals vsync and hsync need to be verified.
These two signals follow the VGA protocol, which was discussed in Section 2.7. The
VGA Slave receives the data from the VGA Master, places the data according to the vsync and hsync
signals, and then saves the data into a .txt file. The VGA Scoreboard collects the input
and output signals of the VGA Master. The sync signal timing should follow
Table 2.2; if it does not, there can be a dislocation in the .txt file that is compared to
the golden model. The ModelSim transcript shows that the running time is 15999640
ns, or about 0.016 seconds per frame. For the 60 Hz VGA refresh rate, the time for each
frame should be 0.01667 seconds, which is very close to the
simulation result.
After all the modules included in this design pass verification, a system verification
is needed to ensure that the design works at the system level. In the verification of the System on
Chip (SoC), all the modules are combined and tested as a system. Therefore, a testbench module
Figure 2.16: SoC Verification Result
is designed to receive control data and give feedback data with the correct timing. For
the reasons mentioned previously, the RAM used in this system is designed for simulation
only. Fig. 2.13 displays an overview of the verification environment.
In the testbench module, we designed 4 memory blocks; one stores the input data and
the rest are used to store the output, which serves as the golden model. The
input data is a 320 × 240 resolution image in RGB565 format. The output golden
model is the same image but in RGB444 format, separated into
three channels: R, G, and B. The register values are sent first. After all 54 register
values are set into the camera, a signal called ‘finished’ goes high, and the testbench
proceeds to send the image data, VSYNC, and HREF. Following this, the ‘address’
signal, the ‘write-enable’ signal, and the image data in RGB444 format are sent to the RAM
called ‘framebuffer’. The VGA Master generates the read address from its
counters, and this address is used to read the data from the ‘framebuffer’.
We then compare the RGB444 data to the golden model stored in the
memory blocks. The results displayed in the ModelSim transcript are shown in Fig. 2.16.
As displayed, the results show that our platform functions correctly.
Name                Slice LUTs   Slice Registers   RAM    IO
blk_mem0            134          11                26.5   0
blk_mem1            134          11                26.5   0
blk_mem2            134          11                26.5   0
blk_mem3            134          11                26.5   0
image_capture       2            43                0      0
ov7670_controller   87           90                0      0
vga                 134          100               0      0
Top                 753          277               0      34
In this work, the simulation was performed with Mentor Graphics ModelSim 10.4d
and the synthesis/implementation with Xilinx Vivado. The test device was a
Nexys-4 FPGA. Finally, the power consumption results were analyzed with XPower
Analyzer [35].
The resource cost is determined by slice count, RAM utilization, and the number
of IOs, as shown in Table 2.4. In total, this work uses 753 slice
LUTs, 277 slice registers, 106 RAM units, and 34 IO ports, as summarized in
the last row.
TP (mW)   SP (mW)   DP (mW)
                    Clock   Signal   Logic   BRAM   PLL   I/O
220       102       2       4        1       10     97    4
Moreover, Table 2.5 compares the resource cost of this design with the existing
works [28], [1]. The third column shows the resource cost of image acquisition
and processing using LabVIEW FPGA [28]. Clearly, that design uses much more
hardware resources than the RTL designs in the second and fourth columns.
The implementation resource cost of color-to-grayscale conversion and edge detection
on the Altera DE-115 FPGA board [1] is shown in the second column. Compared
with Ref. [1], our proposed design shown in the fourth column uses 53.4% fewer LUTs
and 66.1% fewer registers. Ref. [1] also uses more than twice as many IOs as our platform.
Generally, a high number of logic elements and IOs increases the switching activities of
Table 2.6 shows that the total power consumption (TP) is 220 mW, the static power
consumption (SP) is 102 mW, and the dynamic power consumption (DP) is 118 mW.
Due to the reduced number of logic elements and IOs, the power measured from the toggle
rate of clocks, signals, logic, and IOs is only 11 mW, or 9.3% of DP, as shown in the
third, fourth, and fifth columns. The rest of the power consumption comes mainly
from the BRAM and the PLL, which total 107 mW, or 90.7% of DP. The
two previous works, [1] and [28], did not estimate the power cost. Therefore, a
comparison with our work is not possible.
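The percentages quoted above follow directly from the dynamic power breakdown. A minimal Python check is given below; the mapping of the six values to the column names is assumed from the column order of the power table, and the variable and function names are hypothetical.

```python
# Dynamic power breakdown in mW (assumed column order: Clock, Signal, Logic, BRAM, PLL, I/O).
DP_BREAKDOWN = {"clock": 2, "signal": 4, "logic": 1, "bram": 10, "pll": 97, "io": 4}
TOTAL_POWER_MW = 220
STATIC_POWER_MW = 102


def share_of_dp(keys):
    """Return (power in mW, percentage of dynamic power) for the given components."""
    dp_total = sum(DP_BREAKDOWN.values())
    power = sum(DP_BREAKDOWN[k] for k in keys)
    return power, round(100 * power / dp_total, 1)


if __name__ == "__main__":
    print("DP total:", sum(DP_BREAKDOWN.values()))
    print("toggle-related:", share_of_dp(["clock", "signal", "logic", "io"]))
    print("BRAM + PLL:", share_of_dp(["bram", "pll"]))
```

The six components add up to 118 mW, which equals the total power minus the static power (220 mW − 102 mW).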
Figure 2.17: FPGA Prototype
After the design netlist is programmed onto the Xilinx Nexys 4 FPGA with Xilinx Vivado,
Fig. 2.17 shows the demo displaying the original video and the enhanced images in
grayscale and binary. Notice that the region not utilized in this demo shows all black
pixels.
2.20 Summary
taining not only open source design code but also a verification environment. The
power consumption and slice count of our work are significantly reduced compared
to the prior works. Most importantly, the reusable and the
CHAPTER 3
APPROXIMATE DESIGN
This section presents a novel hierarchical synthesis for FPGA based adders and
multipliers. Our proposed work is able to implement the multiplier design with the
following contributions: 1) providing four types of single-bit approximate adders em-
the results. The novel hierarchical synthesis approach has been integrated into our
prior project, an FPGA-IoTmesh system in the field of fog computing for hardware
acceleration. Combining the merits of reconfigurability of FPGAs and long-distance
In this subsection, we start from the fundamental single-bit adders’ design. Then,
four approximate additions are adopted to implement the 4-bit multipliers as a case
study. The tradeoffs between accuracy and resource cost are further estimated.
The sum and carry bits, denoted as Sum and Cout, of the conventional single-bit
full adder can be expressed as
\[ \mathrm{Sum}(EX) = A'B'C + A'BC' + AB'C' + ABC \tag{3.1} \]
where A and B represent 2 single-bit inputs and C indicates the carry-in bit.
In what follows, we modify the K-map of the basic single-bit adder in order to
reduce the gate count. As an example, shown in Fig. 3.1(a), the Sum result is modified
from 1 to 0 and the Cout result is changed from 0 to 1 when A=1, B=0, and C=0.
After forming the maximum groups of 1’s on the map, the algebraic expressions can
be simplified as
Compared with the conventional full adder design, the AP1 design simplifies the
algebraic expressions so as to reduce the hardware cost and power consumption.
Likewise, Fig. 3.1(b), 3.1(c), and 3.1(d) show the modified K-maps of the other
three approximate adders. In the same way, the algebraic expressions can be
rewritten as
\[ \mathrm{Sum}(AP2) = A'C + BC + A'B \tag{3.5} \]
\[ \mathrm{Cout}(AP2) = A \tag{3.6} \]
\[ \mathrm{Sum}(AP3) = B + A'C \tag{3.7} \]
\[ \mathrm{Cout}(AP3) = A \tag{3.8} \]
(a) AP1 Adder’s K-map
\[ \mathrm{Sum}(AP4) = B \tag{3.9} \]
\[ \mathrm{Cout}(AP4) = A \tag{3.10} \]
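The expressions for AP2, AP3, and AP4 can be compared with the exact full adder by enumerating the truth table, as in the following Python sketch. The exact carry is written as the standard majority function, which is not quoted from the text, and AP1 is left out because its expressions are not reproduced here. All function names are hypothetical.

```python
from itertools import product


def exact(a, b, c):
    """Conventional full adder: (Sum, Cout). Cout is the standard majority function."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def ap2(a, b, c):
    """Sum = A'C + BC + A'B, Cout = A."""
    na = 1 - a
    return (na & c) | (b & c) | (na & b), a


def ap3(a, b, c):
    """Sum = B + A'C, Cout = A."""
    return b | ((1 - a) & c), a


def ap4(a, b, c):
    """Sum = B, Cout = A."""
    return b, a


def error_rows(adder):
    """List the input rows (A, B, C) where the adder differs from the exact design."""
    return [row for row in product((0, 1), repeat=3) if adder(*row) != exact(*row)]


if __name__ == "__main__":
    for name, adder in (("AP2", ap2), ("AP3", ap3), ("AP4", ap4)):
        print(name, error_rows(adder))
```

Under these definitions the number of erroneous truth-table rows grows from AP2 to AP4 as the expressions get simpler, which matches the accuracy-versus-cost trend discussed in this chapter.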
It can be observed that the AP1 expression requires the most gates and the AP4
expression the fewest among the four approximate designs. In theory, an
implementation with a higher resource cost is likely to achieve higher accuracy and
vice versa, which is verified in the following subsection.
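The relationship between gate count and accuracy can be checked by enumerating the eight input rows of each adder. The sketch below is an illustration rather than the thesis's own code: AP2, AP3, and AP4 follow Eq. (3.5) to (3.10), while AP1 is modeled, as an assumption, as the exact adder with the single K-map entry (A=1, B=0, C=0) changed in the way the text describes. The function names are hypothetical.

```python
from itertools import product

def exact(a, b, c):
    """Conventional full adder, returned as (Sum, Cout)."""
    total = a + b + c
    return total & 1, total >> 1

def ap1(a, b, c):
    # Assumption: exact adder with the one modified K-map entry from the text
    # (Sum 1 -> 0 and Cout 0 -> 1 when A=1, B=0, C=0).
    if (a, b, c) == (1, 0, 0):
        return 0, 1
    return exact(a, b, c)

def ap2(a, b, c):
    na = 1 - a
    return (na & c) | (b & c) | (na & b), a   # Eq. (3.5), (3.6)

def ap3(a, b, c):
    return b | ((1 - a) & c), a               # Eq. (3.7), (3.8)

def ap4(a, b, c):
    return b, a                               # Eq. (3.9), (3.10)

ADDERS = {"AP1": ap1, "AP2": ap2, "AP3": ap3, "AP4": ap4}

def wrong_rows(adder):
    """Number of truth-table rows where (Sum, Cout) differs from the exact adder."""
    return sum(adder(a, b, c) != exact(a, b, c)
               for a, b, c in product((0, 1), repeat=3))

error_counts = {name: wrong_rows(f) for name, f in ADDERS.items()}
print(error_counts)
```

Under these assumptions the number of erroneous rows grows from AP1 to AP4, which is consistent with the expectation that the cheaper expressions are less accurate.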
Based on the aforementioned adders’ design, the 4-bit unsigned multipliers can be
Significant Bit (LSB) to the Most Significant Bit (MSB), are calculated by exact
adders.
exact single-bit adders by approximate adders (AP1, AP2, AP3, and AP4) from LSB
to MSB. The Error Distance (ED), formulated as ED = |R − R*|, where
R represents the exact result and R* indicates the approximate result, is applied to
evaluate the accuracy of the multiplications. Since the 4-bit multiplication has
2^4 × 2^4 = 256 possible input combinations, the Average Error Distance (AED) can be
written as
$$\mathrm{AED} = \frac{\sum_{i=0}^{255} \mathrm{ED}_i}{256} \tag{3.11}$$
[Figure 3.3: (a) Average Error Distance, (b) Maximum Error Distance]
Experimental results in Fig. 3.3(a) confirm our expectation: the AP1-based
design achieves the minimum average error distance among all the approximate
implementations, although it requires the most gates. For example, when all the
8-bit additions are replaced, the average error distances are 16.25, 18.23, 18.09,
and 26.25 using AP1, AP2, AP3, and AP4, respectively.
The Maximum Error Distance (MED) is also applied to estimate the worst case
As shown in Fig. 3.3(b), the worst case for each approximate design happens when
all the 8-bit additions are modified. For example, when they are replaced with AP1
and AP2, the maximum error distances are 81 and 105, respectively.
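The two metrics can be computed by exhaustively enumerating the 256 input pairs. The sketch below only illustrates the definitions of ED, AED, and MED; `drop_lsb_mul` is a hypothetical stand-in for an approximate multiplier and is not one of the AP1 to AP4 based designs evaluated above.

```python
def error_metrics(approx_mul, bits=4):
    """Return (AED, MED) of approx_mul over all pairs of unsigned `bits`-bit inputs."""
    n = 1 << bits
    # ED = |R - R*| for every input combination
    eds = [abs(a * b - approx_mul(a, b)) for a in range(n) for b in range(n)]
    return sum(eds) / len(eds), max(eds)

def drop_lsb_mul(a, b):
    # Hypothetical approximate multiplier: ignores the partial product
    # generated by the least significant bit of b.
    return a * (b & ~1)

aed, med = error_metrics(drop_lsb_mul)
print(aed, med)
```

Any function that maps two 4-bit operands to a product can be passed as `approx_mul`, so the same routine could be used to compare different adder replacements.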
3.3 Implementation
that the histogram of the output image approximately matches a specified histogram.
The pseudocode for implementing the histogram equalization algorithm is depicted
in Algo. 1. It basically contains two procedures. In procedure#1 we count the number
of each grayscale pixel. Then, the histogram results are computed in procedure#2
as the division of the approximation of multiplications over the size of the image.
Finally the regulated results are distributed to each pixel in order to achieve a better
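The two procedures described above can be sketched as follows. This is a minimal illustration and not a transcription of Algo. 1: the function name, the `mul` parameter (a hook where an approximate multiplier could be plugged in), and the use of the cumulative count are assumptions.

```python
def equalize(pixels, levels=256, mul=lambda a, b: a * b):
    """Histogram equalization of a flat list of grayscale pixel values."""
    # Procedure 1: count the number of pixels at each gray level.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Procedure 2: scale the cumulative count by the top gray level
    # (the multiplication that may be approximated) and divide by the image size.
    n = len(pixels)
    mapping = []
    cumulative = 0
    for count in hist:
        cumulative += count
        mapping.append(mul(cumulative, levels - 1) // n)

    # Finally, distribute the new levels back to every pixel.
    return [mapping[p] for p in pixels]

print(equalize([50, 50, 100, 200]))
```

With the default exact `mul`, the brightest input level is always mapped to the top of the range, which is the contrast stretch that equalization aims for.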
Fig. 3.4 and 3.5 show the results of RGB and grayscale images, respectively. To
demonstrate the difference, we use the Peak Signal-to-Noise Ratio (PSNR), a term
for the ratio between the maximum possible power of a signal and the power of cor-
rupting noise that affects the fidelity of its representation, as one of the performance
[Figure 3.4: (a) Original RGB, (b) EX Result, (c) AP1, PSNR = 49.05, MSE = 0.81]
$$\mathrm{PSNR} = 10 \times \log_{10} \frac{255^2}{\mathrm{MSE}} \tag{3.13}$$
where MSE is the mean squared error. Typical values for the PSNR in lossy image
and video compression are between 30 and 50 dB, provided the bit depth is 8 bits,
where higher is better.
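As a quick check of Eq. (3.13), the sketch below computes the MSE of two pixel lists and converts an MSE value to PSNR for 8-bit data. The function names are illustrative; the two MSE values in the usage lines are the ones reported in the figure panels for the AP1 results.

```python
import math

def mse(reference, test):
    """Mean squared error between two equally sized pixel sequences."""
    return sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)

def psnr(mse_value, max_value=255):
    """PSNR in dB for a given MSE; max_value is 255 for 8-bit pixels."""
    return 10 * math.log10(max_value ** 2 / mse_value)

print(round(psnr(0.81), 2))   # MSE reported for the AP1 RGB result
print(round(psnr(11.79), 2))  # MSE reported for the AP1 grayscale result
```

Both values agree with the PSNR figures reported next to the corresponding MSE values to within rounding.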
First, we compare Fig. 3.4(a) and Fig. 3.4(b). It can be observed that the
RGB image equalized with the exact multipliers achieves higher contrast than the
original image. From Fig. 3.4(c) to Fig. 3.4(f), the quality of the results
degrades, but the contrast is still enhanced compared with Fig. 3.4(a). The
higher the degree of approximation of the multipliers, the worse the quality of the results.
[Figure 3.5: (a) Original Grayscale, (b) EX Result, (c) AP1, PSNR = 37.41, MSE = 11.79]
Similarly, Fig. 3.5 depicts the experimental results of the grayscale images. Note
that the higher approximation degrees of the multipliers are employed, the lower of
Using the four approximate multipliers in the histogram equalization, the
histograms of the RGB images and of the grayscale images are shown in Fig. 3.6
and Fig. 3.7, respectively. The goal of histogram equalization is to make
the images use the entire range of values available to them.
Basically, histogram equalization is a nonlinear normalization that stretches the
area of histogram with high abundance intensities and compresses the area with low
abundance intensities. As an example shown in Fig. 3.7, the equalized grayscale
normalized to the original image in Fig. 3.7(a). In some of the application domains
with a tolerance of errors, the imprecise results are acceptable within the quality
bound, and the resource cost and power consumption can be significantly reduced
3.5 Conclusion
tions. The experimental results show that our proposed work achieves similar results
to the exact design on a very common image processing algorithm, the histogram
equalization. As a tradeoff, the hardware resource cost can be significantly reduced
due to the imprecise computation on FPGA. Our future work will continue to develop
more approximate design components for our approximate library and focus on more
complicated algorithms in image processing and data mining.
[Figure 3.6: (a) Histogram of Original RGB Image, (b)-(f) Histograms of Equalized RGB Images]
[Figure 3.7: (a) Histogram of Original Grayscale Image, (b)-(f) Histograms of Equalized Grayscale Images]
will use both Matlab simulation and FPGA application to find out the accuracy and
energy consumption.
Generally, energy can be computed from running time and power as Energy = Power
× Time. Thus, to reduce the energy cost we can either increase the Maximum
Operating Frequency (MOF) or decrease the power consumption. However, decreasing
the clock frequency lowers the system speed and lengthens each cycle, which
increases the energy dissipation. Power reduction is therefore the main means of
saving energy considered in this work. Since the static power on FPGAs is dependent
on the specific designs, in this work we focus on dynamic power estimation, which
can be expressed as
$$P_{dyn} = \sum_{i=1}^{n} V^2 \times C_{eff-i} \times U_i \times f_i \tag{3.14}$$
where the total switching capacitance is the product of its effective capacitance C_{eff-i},
the number of instances in the design U_i, and the average switching frequency f_i across
all the instances, including the logic, signals, and IOs. The dynamic power of
switching all instances of resource i is the product of V^2 and its switching capacitance.
From Eq. (3.14), one of the most effective ways to lower the dynamic power is reducing
written as below
Fig. 3.8(a) shows the design structure with three floating-point multipliers and two
adders. Although the floating-point multiplier produces precise results, it costs many
more slices than the fixed-point designs. Therefore, this work presents a
fixed-point design, shown in Fig. 3.8(b), that multiplies the three floating-point
constants by 2^8, rounds the fractions up, and finally divides by 2^8 after the addition.
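The fixed-point scheme can be sketched as below. The three weights are an assumption: the conversion equation is not reproduced in this section, so the common luma coefficients 0.299, 0.587, and 0.114 are used for illustration. Because the scaled constants are rounded up, their sum can exceed 2^8, so this sketch clamps the result to 8 bits; the clamp is also an assumption and not taken from the design in Fig. 3.8(b).

```python
import math

WEIGHTS = (0.299, 0.587, 0.114)   # assumed R, G, B weights (not from the source)
SHIFT = 8                         # scale factor 2^8

# Multiply each constant by 2^8 and round the fraction up.
FIXED = tuple(math.ceil(w * (1 << SHIFT)) for w in WEIGHTS)

def gray_float(r, g, b):
    """Reference conversion with floating-point multipliers."""
    return WEIGHTS[0] * r + WEIGHTS[1] * g + WEIGHTS[2] * b

def gray_fixed(r, g, b):
    """Integer-only conversion: three multiplies, two adds, one right shift."""
    acc = FIXED[0] * r + FIXED[1] * g + FIXED[2] * b
    return min(acc >> SHIFT, 255)   # divide by 2^8, then clamp (assumption)

print(FIXED)
print(gray_fixed(120, 200, 30), gray_float(120, 200, 30))
```

The right shift replaces the division by 2^8, so the whole datapath needs only integer multipliers and adders.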
W/2 × W/2-bit designs. As the example in Fig. 3.9 shows, two multi-bit inputs A
and B can be represented as (AH AL) and (BH BL), where AH and BH are the most
significant halves and AL and BL are the least significant halves. The sum of the four
partial products, AL × BL, AH × BL, AL × BH, and AH × BH, each shifted to its bit
position, is then the final product of the W × W-bit multiplication. In what follows,
we propose three approximate 2 × 2-bit multipliers that offer different energy
dissipation under different quality constraints. Before discussing the approximate
designs, note that the exact 2 × 2-bit multiplication can be implemented using the
K-map shown in Fig. 3.10. A 2 × 2 multiplier has two 2-bit inputs, denoted as 'A'
and 'B', and one 4-bit output, called 'Mulout'.
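The decomposition into half-width partial products can be written recursively. The sketch below is a behavioral illustration under the assumption that W is a power of two; `base_mul` is a hypothetical hook for the 2 × 2-bit building block, so an approximate 2 × 2 multiplier could be substituted for the exact one used here.

```python
def mul_recursive(a, b, width, base_mul):
    """Multiply two unsigned `width`-bit values using only 2x2-bit multiplies."""
    if width == 2:
        return base_mul(a, b)
    half = width // 2
    mask = (1 << half) - 1
    ah, al = a >> half, a & mask      # A = (AH AL)
    bh, bl = b >> half, b & mask      # B = (BH BL)
    hh = mul_recursive(ah, bh, half, base_mul)
    hl = mul_recursive(ah, bl, half, base_mul)
    lh = mul_recursive(al, bh, half, base_mul)
    ll = mul_recursive(al, bl, half, base_mul)
    # Shift each partial product to its bit position and add.
    return (hh << width) + ((hl + lh) << half) + ll

def exact_2x2(a, b):
    return a * b

print(mul_recursive(200, 123, 8, exact_2x2))
```

With an exact base block the result equals the ordinary product, so any deviation observed with another `base_mul` comes from the 2 × 2 approximation alone.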
[Figure 3.8: (a) Floating-Point, (b) Fixed-Point]
Figure 3.10: Exact multiplier K-map
(3.17b)
$$\mathrm{Mulout[2](EX)} = A[1]\,B[1]\,B[0]' + A[1]\,A[0]'\,B[1] \tag{3.17c}$$
where Mul[3], Mul[2], Mul[1], and Mul[0] are the four bits of the product, from the
MSB to the LSB, and A[1:0] and B[1:0] are the 2-bit inputs of the multiplier.
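The bit-level description can be checked against ordinary multiplication for all sixteen input pairs. In the sketch below only the Mul[2] expression is taken from Eq. (3.17c); the expressions used for Mul[0], Mul[1], and Mul[3] are common textbook forms supplied as assumptions, and the XOR form of Mul[1] is not the AND/OR structure of Fig. 3.11.

```python
def mul2x2_exact(a, b):
    """Exact 2x2-bit multiplier built from per-bit Boolean expressions."""
    a1, a0 = (a >> 1) & 1, a & 1
    b1, b0 = (b >> 1) & 1, b & 1
    inv = lambda x: x ^ 1                     # single-bit NOT

    m0 = a0 & b0                              # textbook form (assumption)
    m1 = (a1 & b0) ^ (a0 & b1)                # textbook form (assumption)
    m2 = (a1 & b1 & inv(b0)) | (a1 & inv(a0) & b1)   # Eq. (3.17c)
    m3 = a1 & a0 & b1 & b0                    # textbook form (assumption)
    return (m3 << 3) | (m2 << 2) | (m1 << 1) | m0

mismatches = [(a, b) for a in range(4) for b in range(4)
              if mul2x2_exact(a, b) != a * b]
print(mismatches)
```

An approximate variant can be explored by editing one of the per-bit expressions and listing which input pairs then appear in `mismatches`.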
The corresponding design structure of the exact multiplier is shown in Fig. 3.11,
requiring sixteen AND gates and four OR gates. More specifically, the MSB
computation takes three AND gates and the LSB takes one AND gate. The middle
bit Mul[2] needs four AND gates and one OR gate, and the Mul[1] bit requires eight
AND gates and three OR gates. Since the LSB uses only one AND gate, it does
not need to be simplified. To reduce the gate count for Mul[3:1], we present some
approximations by modifying three bits from 0 to 1 in the K-map. For example,
as shown in Fig. 3.12, we change 0 to 1 for the case A[1:0]=11 and B[1:0]=01, leading
to a simple Boolean expression as
[Figure 3.11: Exact multiplier design structure: (a) Mul at bit 0, (b) Mul at bit 1]
Compared to the exact Mul[3] bit computation in Fig. 3.11(d), the approximate
design in Fig. 3.12 removes one AND gate from the critical path and one AND gate
from the total gate count, which theoretically would achieve a higher MOF and
lower power dissipation.
After the optimization, the total gate cost is reduced from five gates to one gate.
In other words, the approximation accepts a one-bit error to save 80% of the gates
Figure 3.12: Approximate Design for Mul[3]
Figure 3.14: Approximate Design for Mul[1]
Finally, Mul[1] is simplified to the Boolean expression below by changing 0
to 1 for the case of A[1:0]=11 and B[1:0]=11.
Compared to the exact design of Mul[1] shown in Fig. 3.11(b), the approximate
design shown in Fig. 3.14 significantly reduces the gate count, by 72.7%, with a
one-bit error tolerance.
It can be observed that the resource cost is reduced by 40%, 60%, and 65% for
AP#1, AP#2, and AP#3, respectively, compared with the exact design, leading to
significant slice and energy savings from inexact computing. Note that combinational
logic on an FPGA is mapped to LUTs rather than logic gates, so the results may
differ slightly, as shown in Section 3.9.
Designs Hardware cost
EX MUL 16 AND + 4 OR
AP#1 MUL 10 AND + 2 OR
AP#2 MUL 7 AND + 1 OR
AP#3 MUL 6 AND + 1 OR
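The savings quoted above follow directly from the gate counts in the table. The short sketch below reproduces them; it counts AND and OR gates equally, which is an assumption about how the percentages were obtained, and the names are illustrative.

```python
# (AND gates, OR gates) per multiplier design, from the table above
GATES = {
    "EX": (16, 4),
    "AP#1": (10, 2),
    "AP#2": (7, 1),
    "AP#3": (6, 1),
}

def saving_percent(design, baseline="EX"):
    """Gate-count reduction of `design` relative to `baseline`, in percent."""
    total = sum(GATES[design])
    base = sum(GATES[baseline])
    return 100 * (base - total) / base

for name in ("AP#1", "AP#2", "AP#3"):
    print(name, saving_percent(name))
```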
Generally, the multi-bit adders can be simply integrated with several single-bit
adders. The exact single-bit adder can be expressed as
and
$$\mathrm{AP1Cout} = a \tag{3.23a}$$
where a and b are the two inputs, and c is the carry-in bit. AP1cout and AP1sum
are the carry-out bit and sum bit of the first approximate adder, and AP2cout
and AP2sum are the corresponding bits of the second approximate adder. The static
analysis of the approximate adders is shown in Table 3.2. Compared to the exact adder design, it
Designs Hardware cost
EX Adder 7 AND + 2 OR
AP#1 Adder 4 AND + 2 OR
AP#2 Adder 3 AND + 1 OR
can be observed that the gate counts are saved by 77.8% and 44.4%, respectively, for
using AP#1 and AP#2.
This subsection evaluates the quality of the results of the exact design and the
approximate implementations. Fig. 3.15 depicts six grayscale images converted by
different approximations of the RGB2Grayscale design. Fig. 3.15(a) shows the
result of the exact design. Fig. 3.15(b), Fig. 3.15(c), and Fig. 3.15(d)
depict the results of the AP#1, AP#2, and AP#3 multipliers, respectively,
and Fig. 3.15(e) and Fig. 3.15(f) show the results of the AP#1 and AP#2 adders,
respectively. The differences between the images in Fig. 3.15 are hard to see with
the naked eye, so the error rate is plotted in Fig. 3.16. In Fig. 3.16(a), the
horizontal axis represents the different approximations of the multipliers, and the
vertical axis indicates the error rate of each implementation. It can be observed
that the error rates for converting the color image into a grayscale image are 0.999%,
3.243%, and 5.64% using the AP#1, AP#2, and AP#3 multipliers, respectively. In
Fig. 3.16(b), the error rate decreases as fewer addition bits are replaced, which
is represented by the horizontal axis from the third bit (3b) to the least bit (1b).
The higher bits clearly contribute more error than the lower bits. Therefore, to
keep the error within an acceptable tolerance (less than 4%), only the three least
significant bits are considered in this benchmark. When all three bits are replaced
with AP#1 and AP#2 adders, the error rates are 1.43% and 2.85%, respectively. The
error rates drop to around 1% when the two least significant bits are replaced, for
both AP#1 and AP#2 adders, and fall below 1% when only the least significant bit
is replaced.
In this subsection, the register transfer level (RTL) design and verification
with the Verilog hardware description language (HDL) are discussed first. Mentor
Graphics ModelSim is used as the simulator, and Xilinx Vivado 2018 with the target
device Nexys-4 is employed as the synthesis tool. In our work, the FPGA performance
[Figure 3.15: (a) Exact, (b) AP#1 Mul]
[Figure 3.16: (a) AP Mul, (b) AP Add]
After simulation, the toggle activities of signals, IOs, and logic are collected in
Value Change Dump (VCD) files. After synthesis, the practical results are
summarized in the fourth and fifth columns of Table 3. It can be observed that, with
the same adder design, the higher the approximation of the multiplier
implementation, the fewer slices are needed. Similarly, with the same multiplier,
a higher approximation of the adders consumes fewer FPGA slices. The MOF is
determined by the critical path delay. However, since combinational circuits are
mapped onto Look-Up Tables (LUTs) on the FPGA, the critical paths of all the
approximate implementations are the same, resulting in the same MOF of 271.326
MHz. Finally, we use XPower Analyzer to estimate the realistic power consumption.
Xilinx Power Analyzer evaluates the power with the Native Circuit Description
(NCD) file generated by ISE and the specific simulation VCD file. As the power
consumption in the sixth column of Table 3 shows, the dynamic power decreases as
the approximation of the adders or multipliers increases. Some of the power
consumptions are the same because the toggle rates of the slices are similar to each
other in this benchmark. The dynamic energy in the seventh column is computed by
multiplying the dynamic power by the reciprocal of the MOF. Since the MOF is the same
for all the approximate designs, the dynamic energy dissipation follows the same
trend as the power cost.
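The energy calculation described above amounts to one division per design. The sketch below evaluates it in joules for the three dynamic power values that occur in the table; the function name is illustrative. For 13 mW, 12 mW, and 10 mW at 271.326 MHz the results are about 4.79e-11 J, 4.42e-11 J, and 3.69e-11 J per clock period, which carry the same leading digits as the DE column.

```python
MOF_HZ = 271.326e6   # maximum operating frequency reported for all designs

def dynamic_energy_joules(power_mw, mof_hz=MOF_HZ):
    """Dynamic energy per clock period: power multiplied by 1 / MOF."""
    return power_mw * 1e-3 / mof_hz

for power_mw in (13, 12, 10):
    print(power_mw, "mW ->", dynamic_energy_joules(power_mw), "J")
```

Because the MOF is a common factor, the ratio between any two energies equals the ratio between the corresponding powers.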
In order to find the optimal cost saving in terms of slice number and energy
consumption corresponding to the specific quality bound, in this subsection we present
where ‘S’ and ‘E’ represent the FPGA cost in terms of slice count and energy dis-
sipation. ‘x’ is the weight of slice count, and ‘y’ is the weight of FPGA energy
consumption. The weights x and y are between 0 and 1, and the summation of the
DUT No.  AP Adder  AP Mul  Slice Registers  Slice LUTs  DP (mW)  DE (pJ)
1        Ex        Ex      700              879         13       4.79
2        Ex        Ap1     684              843         12       4.42
3        Ex        Ap2     652              799         12       4.42
4        Ex        Ap3     640              783         12       4.42
5        Ap1       Ex      698              874         13       4.79
6        Ap1       Ap1     682              838         12       4.42
7        Ap1       Ap2     650              794         12       4.42
8        Ap1       Ap3     638              778         12       4.42
9        Ap2       Ex      538              720         10       3.69
10       Ap2       Ap1     522              692         10       3.69
11       Ap2       Ap2     506              672         10       3.69
12       Ap2       Ap3     494              656         10       3.69
the small-size design, and y = 1 targets the low-energy optimization. Generally, the
slice-energy cost should be minimized in order to find the optimal design for
a given weight configuration. For example, when the two weights are both set to
1/2, the minimum slice-energy cost occurs at design No. 12, with the highest
approximation of the multiplier (AP#3) and the highest approximation of the adder
(AP#2), as shown in Fig. 3.17.
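The selection of the optimal design can be illustrated with the table data. The cost expression itself is not reproduced in this section, so the sketch below assumes a weighted sum of the slice count and the dynamic energy, each normalized to the exact design (No. 1), with the slice count taken as registers plus LUTs. All names are hypothetical.

```python
# DUT number -> (slice registers, slice LUTs, dynamic energy as listed in the table)
DESIGNS = {
    1: (700, 879, 4.79), 2: (684, 843, 4.42), 3: (652, 799, 4.42),
    4: (640, 783, 4.42), 5: (698, 874, 4.79), 6: (682, 838, 4.42),
    7: (650, 794, 4.42), 8: (638, 778, 4.42), 9: (538, 720, 3.69),
    10: (522, 692, 3.69), 11: (506, 672, 3.69), 12: (494, 656, 3.69),
}

def slice_energy_cost(no, x=0.5, y=0.5, baseline=1):
    """Assumed cost: x * S / S_exact + y * E / E_exact, with x + y = 1."""
    reg, lut, energy = DESIGNS[no]
    base_reg, base_lut, base_energy = DESIGNS[baseline]
    s = (reg + lut) / (base_reg + base_lut)
    e = energy / base_energy
    return x * s + y * e

best = min(DESIGNS, key=slice_energy_cost)
print(best, round(slice_energy_cost(best), 4))
```

Under this assumed cost, equal weights select design No. 12, in agreement with the text; other weight pairs can be tried by passing `x` and `y` explicitly.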
3.10 Summary
Figure 3.17: Approximate No.4 multiplier structure
multiplication and 2.85% for addition, the dynamic energy can be reduced to 77.04%
and the slice count to 72.83% of the exact design. Our future
work is to implement the face detection algorithm with many approximations,
in order to speed up the system and reduce the energy consumption.
CHAPTER 4
We then discuss the possible directions for our future research work.
4.1 Summary
taining not only open source design code but also a verification environment. The
power consumption and slice count of our work are significantly reduced compared
to the prior works. The most important point here is that the reusable and the
and the slice count can be saved to 72.83% compared to the exact design.
Considering the reusability and expandability of the platform proposed in this
dissertation, in the future we can extend our platform and approximate library to a
variety of algorithms and systems to explore the energy tradeoff, such as an
FPGA-IoTmesh system, an FPGA-video processing system, or an FPGA-deep learning
system. We expect that our work will lead to multiple designs and contribute to
research and education in this area.
53
BIBLIOGRAPHY
[1] C. Ababei and et al. Open source digital camera on field programmable gate
[2] M. Birla. Fpga based reconfigurable platform for complex image processing. 2006
IEEE Intl. Conf. on Electro/Information Technology, pages 204–209, May 2006.
51–66, 2014.
[7] M. Fan, Q. Han, and X. Yang. Energy minimization for on-line real-time schedul-
ing with reliability awareness. Elsevier Journal of Systems and Software (JSS),
127:168–176, May 2017. doi: 10.1016/j.jss.2017.02.004.
54
[9] A. Gajjar and et al. An fpga synthesis of face detection algorithm using haar
[10] A. Gajjar, Y. Zhang, and X. Yang. Demo abstract: A smart building system
integrated with an edge computing algorithm and iot mesh networks. The Second
[11] A. Gajjar, X. Yang, and et. al. Mesh-iot based system for large-scale environ-
ment. 5th Annual Conf. on Computational Science and Computational Intelli-
gence (CSCI2018), 2018.
[12] H. He and et al. Dual long short-term memory networks for sub-character rep-
[13] H. He, L. Wu, X. Yang, and et al. Dual long short-term memory networks for
sub-character representation learning. The 15th Intl. Conference on Information
[16] K. Jin and et al. High-speed fpga-gpu processing for 3d-oct imaging. Intl. Conf.
on Computer and Communications (ICCC), pages 2085–2088, March 2018.
55
[17] A. Kahng and S. Kang. Accuracy-configurable adder for approximate arithmetic
[18] P. Kulkarni, P. Gupta, and M. Ercegovac. Trading accuracy for power with an
underdesigned multiplier architecture. 24th IEEE Intl. Conf. on VLSI Design,
[19] J. Liang, J. Han, and F. Lombardi. New metrics for the reliability of approximate
and probabilistic adders. IEEE Transactions on Computers, 62(9):1760–1771,
2013.
67–73, 2004.
[22] S. Misailovic, M. Carbin, S. Achour, and Z. Qi. Chisel: Reliability- and accuracy-
aware optimization of approximate computational kernels. Proceedings of the
2014 ACM Intl Conference on Object Oriented Programming Systems Languages
[24] R. Nair. Big data needs approximate computing: Technical perspective. ACM
56
[25] L. Nwosu and et al. Deep convolutional neural network for facial expression
recognition using facial parts. 15th IEEE Intl Conf. on Dependable Autonomic
and Secure Computing, Feb 2018.
www.cs.cmu.edu/ gpekhime/Projects/15740/paper.pdf.
[28] S. Rahangdale and et al. MBSEM image acquisition and image processing in
LabView FPGA. 2016 Intl. Conf. on Systems, Signals and Image Processing
(IWSSIP), pages 1–4, July 2016.
[31] P. Vangali and X. Yang. A compression algorithm design and simulation for
processing large volumes of data from wireless sensor networks. Communications
on Applied Electronics (CAE), 7(4):1–5, June 2017.
[32] X.Yang and X.He. Demo abstract: Establishing a BLE mesh network with fabri-
57
[33] X. Yang and J. Andrian. A high performance on-chip bus (MSBUS) design and
verification. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. (TVLSI), 23(7):
1350–1354, July 2015.
[34] X. Yang and J. Andrian. An advanced bus architecture for aes-encrypted high-
performance embedded systems. US20170302438A1, 2017.
[35] X. Yang and et al. A novel bus transfer mode: Block transfer and a performance
evaluation methodology. Elsevier Integration the VLSI Journal, 53:23–33, Jan
2016.
[36] X. Yang and et. al. A vision of fog systems with integrating fpgas and ble mesh
network. Journal of Communications, 2018.
[37] X. Yang and W. Wen. Design of a pre-scheduled data bus (DBUS) for advanced
encryption standard (AES) encrypted system-on-chips (socs). The 22nd Asia
and South Pacific Design Automation Conference (ASP-DAC 2017), pages 1–6,
Feb 2017. doi: 10.1109/ASPDAC.2017.7858373.
[38] X. Yang and N. Wu. Design of a bio-feedback digital system (bfs) using 33-step
training table for cardio equipment. The 8th Intl. Conference on Applied Human
bus encoding method on the MBUS structure. Journal of VLSI Design, 2017:
1–7, May 2017. doi: 10.1155/2017/4914301.
58
[41] X. Yang, W. Wen, and M. Fan. Improving AES core performance via an advanced
[43] K. Zeng, N. Wu, X. Yang, and K. K. Yen. Fhcc: A soft hierarchical cluster-
ing approach for collaborative filtering recommendation. Intl. Journal of Data
5120/ijca2018916380.
59
VITA
Yunxiang Zhang
A. Gajjar, X. Yang, Y. Zhang, et al., “An FPGA Synthesis of Face Detection
Algorithm using HAAR Classifiers,” Intl. Conference on Algorithms, Computing and
Systems (ICACS 2018), Under Review, 2018.
Y. Zhang and X. Yang, “A Novel Fog Computing Acceleration Method: Approximate
FPGA Design on Computation Components,” 2017 Innovation/Automation
Dual Conference, Houston, TX, US, Oct 2017.