
White Paper
Vision | FPGA

Low-Precision Networks for Efficient Inference on FPGAs
Light retraining illuminates the way to meeting computer vision specifications

Authors

R. Abra, FPGA Deep Learning Retraining Lead, Intel Programmable Solutions Group
D. Denisenko, Deep Learning Software Engineer, Intel Programmable Solutions Group
R. Allen, Solutions Engineer, Intel Programmable Solutions Group
T. Vanderhoek, Engineering Software Manager, Intel Programmable Solutions Group
S. Wolstencroft, Royal Holloway University
M. Gibson, University of Reading

Table of Contents

Executive Summary
Quantizing Neural Networks for FPGAs
What is BFP?
Asymmetric BFP for Increased Savings
BFP vs Integer Quantization
BFP for High Inference Accuracy
BFP to Significantly Cut Resource Count
Mixed Precision – The Advantages of Very Low Precisions with the Accuracy of Higher Precisions
Conclusion
References

Executive Summary

Neural networks are highly compute intensive. As a result, downsizing any of the calculations leads to significant savings in cost, time, and power. One way to downsize calculations is to reduce the size of the parameters. Quantization compresses the parameters in a neural network by reducing the number of bits used to represent them. This in turn reduces both the size of each calculation and the time and resources needed to move the values around the chip.

Implementing a low precision network in hardware provides numerous advantages when it comes to meeting specification. The increased flexibility allows the optimization of:

• Throughput
• Overall power consumption
• Resource usage
• Device size
• TOPs/Watt
• Deterministic latency

These have important benefits to the user experience, particularly where scaling and efficiency are inherent requirements of the application.

There are many options for quantization: small integers (weights and activations quantized to scaled integer values, e.g. int8), minifloats (e.g. FP11, FP9), or application-optimized numeric formats. Our end-to-end solution uses Block Floating Point (BFP), which has the advantage of easily halving the hardware footprint while maintaining accuracy at low precisions. This is possible due to BFP's high dynamic range.

Only two scenarios require a little more intervention. Modern, compact networks such as MobileNet and EfficientNet hold little in the way of redundancy, so accuracy can be lost at low precision. Additionally, even the larger, older networks (ResNet, VGG-SSDs) lose accuracy at very low precision. These losses can be recovered within a few retraining epochs when the quantization is modelled in the training flow. Accurate modelling of hardware arithmetic during training allows hardware accuracy to closely match that achieved in software.

The main aim of retraining is therefore to facilitate the compression of FP32 models to comply with application specifications, be they at the Edge or in the Cloud [1]. A reduction in the number of bits inherent in a network enables far more flexibility than is available at the larger size.

Quantizing Neural Networks for FPGAs

FPGAs have a major advantage where specific numeric formats are required [2]. For a given deep learning model, a format can be chosen that optimally balances compute and storage availability on the system with the dynamic range and precision required by the model to easily maintain accuracy. Besides the integer and floating-point formats natively supported by Intel® FPGA digital signal processing (DSP) blocks¹, FPGA logic can be used to implement additional numeric formats that are particularly well suited to deep learning models.

BFP combines the extended dynamic range of floating-point format with the low-cost implementation of fixed point.

Figure 1. The effect of blocking and mantissa size on the number of bits in a neural network (storage size of BFP as a proportion of FP32 bits versus the number of mantissa bits, for block sizes 8, 16, and 32)
What is BFP?

In BFP, values are grouped into blocks — nominally of size 32 — within which each value takes the same exponent. To achieve this, grouped numbers are aligned to suit the largest exponent. The mantissas can now be treated as signed integers in, for example, dot-product computation, and combined with the shared exponent in the accumulation step to convert the results to single precision floating-point representation.

Low-precision BFP reduces the payload by eliminating bits from two locations. The shared exponent removes the need for individual assignment, while low-bit mantissas are vastly smaller than the standard 23-bit fraction in FP32 values. Figure 1 shows the bit savings from using different block and mantissa sizes.

int7bfp consists of seven integer bits (including the sign) and five bits for the exponent. int5bfp, shown in Figure 2, consists of four integer bits with one sign bit and five exponent bits. Without the BFP blocking, this would be FP9 format, which uses one sign bit, an implicit mantissa bit, three explicit mantissa bits, and five exponent bits.

Such small multiplication operations can be efficiently implemented on Intel® FPGAs. For example, a single DSP block in an Intel® Arria® 10 FPGA can implement two int7bfp multipliers or, with a few additional ALMs, four int5bfp multipliers. Since block size usually encompasses at least eight values, significant storage and compute savings result from utilizing BFP [7].
Figure 2. Blocking of four FP9 floating point values to int5bfp (each value's sign bit, five exponent bits, and mantissa bits are aligned to the block's maximum exponent). The "1" in the left-most mantissa positions is the implicit 1 in floating-point format made explicit prior to conversion. Sign + mantissa bits are now in sign + magnitude integer format
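To make the blocking concrete, the following NumPy sketch mirrors the conversion just described. It is an illustration only, not the hardware implementation; the helper names and the exact rounding behaviour are assumptions.

import numpy as np

def quantize_block_bfp(block, total_bits):
    """Quantize one block of FP32 values to a shared power-of-two scale
    (derived from the block's largest exponent) plus signed integer mantissas.
    total_bits includes the sign bit, e.g. 5 for int5bfp, 7 for int7bfp."""
    magnitude_bits = total_bits - 1
    max_exp = int(np.floor(np.log2(np.max(np.abs(block)) + 1e-38)))
    scale = 2.0 ** (max_exp - (magnitude_bits - 1))   # shared block scale
    limit = 2 ** magnitude_bits - 1                   # sign + magnitude range
    mantissas = np.clip(np.round(block / scale), -limit, limit).astype(np.int32)
    return scale, mantissas

def dequantize_block_bfp(scale, mantissas):
    """Recover approximate FP32 values from the shared scale and mantissas."""
    return mantissas.astype(np.float32) * scale

# Example: one block of 32 values at int5bfp (one sign bit, four magnitude bits).
block = np.random.randn(32).astype(np.float32)
scale, mantissas = quantize_block_bfp(block, total_bits=5)
approx = dequantize_block_bfp(scale, mantissas)

Each block then carries a single shared scale and 32 small signed mantissas, which is where the bit savings shown in Figure 1 come from.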

Asymmetric BFP for Increased Savings

Dot product computations need not be symmetrical in BFP. A further reduction in storage size can be achieved if weights and activations are represented with different precisions. Many Convolutional Neural Networks (CNNs) maintain their accuracy even if their weights are represented with fewer bits than the activations. For example, int5/4bfp format can be used to store activations in int5bfp and weights in int4bfp. Dot product engines implemented using an Intel® FPGA would then perform 5 bit x 4 bit integer multiplies, which achieves more efficient DSP block packing than using int5bfp for both weights and activations.

BFP vs Integer Quantization

BFP can be compared against integer quantization with favorable results. Each block of numbers in BFP gets its own scaling factor (2^max_exp), unlike integer quantization, where such factors are arbitrary floats on a per-layer basis. BFP provides an overall higher dynamic range, which can be further adjusted with the size of the block. Additionally, the BFP scaling factor is sized automatically as part of the computation, instead of having to run additional initialization steps.
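Continuing the hypothetical sketch above, an asymmetric int5/4bfp dot product over one block could be modelled as follows, with the integer multiply-accumulate rescaled once by the two shared block scales rather than per value. This is an illustration of the arithmetic, not the FPGA dot-product engine.

activations = np.random.randn(32).astype(np.float32)
weights = np.random.randn(32).astype(np.float32)

scale_a, m_a = quantize_block_bfp(activations, total_bits=5)  # int5bfp activations
scale_w, m_w = quantize_block_bfp(weights, total_bits=4)      # int4bfp weights

# 5 bit x 4 bit integer multiplies, accumulated as integers, then a single
# rescale by the two shared block scales (each a power of two).
dot = int(np.dot(m_a.astype(np.int64), m_w.astype(np.int64))) * scale_a * scale_w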
¹ Some examples of numeric formats natively supported by Intel® FPGAs: 18-bit integer and FP32 on Intel® Arria® 10 FPGAs [3][4]; int8, FP16, and bfloat16 on Intel® Agilex™ FPGAs [5]; and int4 and int8 tensor blocks on Intel® Stratix® 10 NX FPGAs [6].


BFP for High Inference Accuracy

Many of the high parameter networks – the ResNets, Inception, VGG-based SSDs – quantize well to int8bfp and even int7bfp without any additional intervention, as shown in Table 1, where green highlights indicate a minimal loss of accuracy from the original FP32 model.

As expected, the drop in accuracy from applying quantization is more perceptible at very low precisions. This effect is exaggerated in the more modern, compact networks such as MobileNet and EfficientNet, which experience some accuracy drop even at higher precisions.

Fortunately, this penalty can be reversed easily. As few as four epochs of retraining – or a dozen for the more challenging networks – can recover the model's accuracy. Where this is insufficient, it is possible for the FPGA to accommodate an increase in the activation bit width for specific layers. Algorithmic techniques are available to determine and retrain a network to account for these changes in precision (see the section on Mixed Precision below).

Why Block Floating Point?
• Small resource footprint
• Excellent compatibility with Intel® FPGA DSP blocks
• High dynamic range models weights and activations well at low precisions
• Simplicity in training: no parameter initialization
| Network | FP32 accuracy reference (%) | Int5/4bfp without retraining (%) | Int5/4bfp with retraining (%) | Int7bfp without retraining (%) | Int7bfp with retraining (%) | Int8bfp without retraining (%) | Int8bfp with retraining (%) |
| Classification (ImageNet) | | | | | | | |
| ResNet-18 | 69.76 | 55.69 | 69.13 | 69.67 | n/a | 69.60 | n/a |
| ResNet-34 | 73.31 | 65.09 | 72.81 | 72.94 | n/a | 73.09 | n/a |
| ResNet-50 | 76.13 | 60.32 | 75.60 | 75.75 | n/a | 75.95 | n/a |
| Inception v3 | 77.32 | 32.70 | 78.34 | 77.11 | n/a | 77.31 | n/a |
| EfficientNet_b0 | 75.86 | 0.34 | 71.96 | 64.37 | 75.45 | 70.48 | 75.47 |
| MobileNet v2 | 71.81 | 6.00 | 68.99 | 67.28 | 71.65 | 71.12 | n/a |
| SqueezeNet v1.1 | 58.18 | 33.09 | 54.90 | 57.73 | 58.15 | 58.10 | n/a |
| Object Detection (VOC 2007 & 2012) | | | | | | | |
| SSD300 | 78.12 | 73.64 | 77.92 | 78.09 | n/a | 78.08 | n/a |
| SSD512 | 80.26 | 74.72 | 80.00 | 80.19 | n/a | 80.08 | n/a |
| Object Detection (COCO 2017) | | | | | | | |
| TinyYOLO v3 | 35.7 | 26.90 | 31.40 | 35.50 | n/a | 35.60 | n/a |
| Semantic Segmentation (CamVid) | | | | | | | |
| UNet | 71.95 | 63.95 | 72.36 | 71.66 | n/a | 71.89 | n/a |
| ICNet | 67.89 | 59.66 | 67.09 | 67.88 | n/a | 67.87 | n/a |

Table 1. Indicative Top-1 accuracies for networks both with and without retraining, at int5/4bfp (int5bfp activations and int4bfp weights), int7bfp, and int8bfp – all at block size 32. n/a shows where retraining is not required
Blue: Full precision accuracy
Green: Achieves quantization accuracy within 1 percentage point of the full precision accuracy
Amber: Achieves quantization accuracy within around 5 percentage points of the full precision accuracy
Red: Achieves quantization accuracy significantly lower than full precision accuracy
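The recoveries in Table 1 rely on modelling the BFP rounding inside the training loop. A minimal PyTorch sketch of that idea is shown below, using a straight-through estimator so gradients flow through the rounding step. It is a simplified stand-in for the actual training flow, assumes the tensor size is a multiple of the block size, and uses a hypothetical class name.

import torch

class FakeBFPQuant(torch.autograd.Function):
    """Forward pass applies block floating point rounding; backward pass is a
    straight-through estimator so the network can be retrained through it."""

    @staticmethod
    def forward(ctx, x, block_size, total_bits):
        magnitude_bits = total_bits - 1
        flat = x.reshape(-1, block_size)            # assumes exact divisibility
        max_exp = torch.floor(
            torch.log2(flat.abs().amax(dim=1, keepdim=True) + 1e-38))
        scale = torch.pow(2.0, max_exp - (magnitude_bits - 1))  # shared per block
        limit = 2 ** magnitude_bits - 1
        q = torch.clamp(torch.round(flat / scale), -limit, limit)
        return (q * scale).reshape_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradients pass straight through the rounding step; block_size and
        # total_bits receive no gradient.
        return grad_output, None, None

# During retraining, weights (and optionally activations) are routed through the
# fake quantizer so the model adapts to the rounding it will see in hardware,
# e.g. int4bfp weights at block size 32:
#   quantized_weight = FakeBFPQuant.apply(conv.weight, 32, 4)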

BFP to Significantly Cut Resource Count

As already seen, the number of bits implicated in low precision quantization is much reduced from the original single precision implementation. This has a large knock-on effect on hardware resources. For instance, the 18 bit input hardened multipliers in an Intel® Arria® 10 FPGA (represented in Figure 3) can be used to implement a single 18 bit x 18 bit multiply, two 6 bit x 6 bit multiplies, or four 4 bit x 3 bit multiplies. By pushing the sign bit into external logic, it is therefore possible to halve the number of DSP blocks used at int12bfp (equivalent to blocked FP16) by quantizing to a 7 bit mantissa (int7bfp), and to halve it again using a 5 bit x 4 bit (int5/4bfp) configuration. These savings can be utilized to scale back the hardware footprint or increase throughput.

Figure 3. Packing 6 x 6 bit (int7bfp) or 4 x 3 bit (int5/4bfp) multipliers into an Intel Arria 10 FPGA 18 bit multiplier
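As a back-of-the-envelope model of the packing in Figure 3 (not the actual DSP configuration), two unsigned products that share one operand can be carried by a single wide multiply, because the partial products land in non-overlapping bit fields. The function below is a hypothetical Python illustration of that trick; sign handling stays outside the multiplier, matching the note above about pushing the sign bit into external logic.

def packed_multiply(a, b, c, bits=6):
    """Compute a*c and b*c with one wide multiplication by packing a and b
    into a single operand, mimicking two 6 bit x 6 bit multiplies sharing an
    18-bit hardened multiplier."""
    assert 0 <= a < 2 ** bits and 0 <= b < 2 ** bits and 0 <= c < 2 ** bits
    shift = 2 * bits                     # keeps b*c clear of a*c in the result
    packed = (a << shift) | b            # one 18-bit operand holding both inputs
    product = packed * c                 # the single wide multiply
    return product >> shift, product & ((1 << shift) - 1)   # (a*c, b*c)

assert packed_multiply(37, 21, 45) == (37 * 45, 21 * 45)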


Meaningful examples mirror real-life applications, which tend to employ variations of standard networks. As typical benchmarks, ResNet 50 and MobileNet v2 are used throughout this section to give an idea of the effects of quantization. The reference is int12bfp, which is a good proxy for single precision in this context owing to the negligible accuracy loss from downsizing.

Simply by reducing the precision, the associated numbers of Adaptive Logic Modules (ALMs) and RAM blocks roughly follow the pattern seen in multiplier usage. This reduction is reflected in the number of DSP blocks and amplified by decreasing the block size, which – while limiting the throughput – has additional benefits for the footprint.

Also worthy of note is that the choice of network makes a big difference to the frame rate, with the much smaller MobileNet v2 attaining around twice the frame rate of ResNet 50 for the same footprint.

In the following figures, the baseline footprint (indicated in the leftmost columns of Figure 4) at int12bfp and block size 32 is:

• 816 M20K RAM blocks
• 551 DSP blocks
• 39,315 ALMs

In Figure 4, reductions in footprint result directly from reduced precision. With a single bitstream for both ResNet 50 and MobileNet v2, it is interesting to note the difference in frame rate.

Figure 4. Resource count ratios (RAM blocks, DSP blocks, and ALMs relative to the FP16-equivalent baseline) and frame rate for ResNet 50 and MobileNet v2 inference at block size 32, for int12bfp (equivalent to FP16), int7bfp, and int5/4bfp

Optimizing further, halving the block size leads to even greater savings. In Figure 5, while frame rate is halved for MobileNet v2 and reduced by two thirds for ResNet 50, using block size 16 halves RAM utilization and reduces the DSP block usage by two thirds.

Figure 5. Resource count ratios and frame rate for ResNet 50 and MobileNet v2 inference at block size 16

An additional indication of available trade-offs is given in Figure 6. A realistic application may well use MobileNet v2 at a frame rate of 30 frames per second (fps). In this case, a block size of 8 is sufficient to fulfill the criteria, which results in the use of 103 M20K RAM blocks, 31 DSP blocks, and 12,635 ALMs at int5/4bfp.

Figure 6. Resource count and accuracy for MobileNet v2 inference at 30 fps with block size 8

As seen, simply by reducing the precision, the associated numbers of ALMs and RAM blocks roughly follow the pattern seen in DSP block usage. This is amplified by decreasing the block size, which – while limiting throughput – compounds the benefits in the footprint. It is worth noting that doubling the number of hardware instances doubles the throughput — potentially useful in the case of multiple input streams or requirements for redundancy in the system.


Mixed Precision – The Advantages of Very Low Precisions with the Accuracy of Higher Precisions

Finally, for those situations where low precision quantization and retraining provide an unacceptable loss in accuracy, certain modes of mixed precision exist that have a cumulatively positive effect. These include "layer type" precision changes to incorporate higher precision hardware kernels, say for depthwise convolutions, and per-layer tensor precision doubling.

Although distinguishing different layer types is straightforward, accuracy uplift — much like training the original network — is determined by exploratory testing. This can be mitigated by using algorithms such as Hessian Aware Quantization (HAWQ), which determine the sensitivity of each convolutional layer to quantization. Each layer can be identified and the bit width of the weights, the activations, or both can be doubled accordingly in retraining. In hardware, this augmentation can easily be effected by multiple passes through the PE array [8].

Figure 7 shows the results of training MobileNet v2 at BFP precisions determined by HAWQ. There is built-in flexibility to specify what proportion of the parameters in the convolution layers are doubled.

Figure 7. Accuracies achieved with mixed precision layers on MobileNet v2 with a base of int5/4bfp (Top-1 accuracy plotted against estimated slowdown). Percentage figures show the proportion of convolution layer parameters doubled
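The selection step of such a flow can be summarized with a small sketch: rank the convolution layers by a sensitivity score (for example a Hessian-based estimate in the spirit of HAWQ, computed elsewhere) and double the bit widths of the most sensitive fraction before retraining. The function, layer names, and scores below are purely illustrative, not part of the product flow.

def plan_mixed_precision(sensitivities, base_bits=(5, 4), fraction_doubled=0.25):
    """Return a per-layer precision plan: the most quantization-sensitive layers
    get their (activation, weight) bit widths doubled.

    sensitivities maps layer name -> sensitivity score; higher means the layer
    loses more accuracy when quantized."""
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    n_doubled = max(1, round(fraction_doubled * len(ranked)))
    doubled = set(ranked[:n_doubled])
    return {layer: tuple(2 * b for b in base_bits) if layer in doubled else base_bits
            for layer in ranked}

# Hypothetical per-layer sensitivity scores for a small network.
scores = {"conv1": 0.9, "block2.dw": 3.1, "block2.pw": 0.4, "head": 1.7}
plan = plan_mixed_precision(scores, base_bits=(5, 4), fraction_doubled=0.25)
# block2.dw is promoted to int10/8bfp; the other layers stay at int5/4bfp.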
Conclusion

Many AI applications have stringent requirements that are complicated by additional functions needing to be in-lined, such as I/O, clipping, scaling, and dewarp. A big advantage of FPGAs is that these can be included as intellectual property (IP) cores on the same chip and combined as building blocks. While the functions themselves may claim heavy resource usage, the flexibility provided to neural network engines by BFP quantization can reduce the IP footprint and help to meet other specifications such as throughput or performance.

BFP quantization works very well on FPGAs due to the ability to pack integers of certain sizes efficiently into the DSP blocks, which very easily allows the footprint reductions shown above. A 50% reduction in DSP block usage is achieved simply by reducing the precision to int7bfp, and this is replicated on a further precision reduction to int5/4bfp. Other logic elements follow a similar pattern of usage reduction. From here, changing the block size or repeating instantiations of the hardware enables tuning of the frame rate.

A further benefit of BFP quantization is its ability to store more network graphs in DDR. With a greater number of parameters available on-chip, the time and power to change from one graph to another reduces, enabling high-speed switching for different types of inference.

Where accuracy is concerned, BFP has a high dynamic range that is modifiable via the block size. This makes it very adept at retraining, even at very low precisions. A default block size of 32 is sufficient to allow older, larger networks to quantize at int7bfp without retraining (specifically ResNet, SqueezeNet, VGG-SSDs, TinyYOLO, UNet, and ICNet). On newer, leaner topologies such as the MobileNets and EfficientNet, the resultant drop from quantization can be overcome with a few epochs of retraining. At precisions lower than int7bfp, these leaner networks still achieve good accuracy by enabling precision doubling for select critical layers.

The software model has several benefits. In addition to providing a neural network training facility, it allows:

• Testing the effects on accuracy of quantization before implementation
• Retraining to recoup accuracy lost through low precision quantization
• The ability to trial mixed precision configurations and higher precision kernels before building in hardware

All three points save a significant amount of time in speculative bitstream compilation and hardware engineering. In summary, quantization is an easy way to significantly reduce hardware footprint while maintaining frame rate and keeping accuracy loss to a minimum.


References
For more information about Intel and low-precision inference on FPGAs, the following links are available:
[1] "A Configurable Cloud-Scale DNN Processor for Real-Time AI", https://www.microsoft.com/en-us/research/uploads/prod/2018/06/ISCA18-Brainwave-CameraReady.pdf
[2] "Harnessing Numerical Flexibility for Deep Learning on FPGAs", Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, https://dl.acm.org/doi/10.1145/3241793.3241794
[3] "Intel Arria 10 Native Fixed Point DSP IP Core User Guide", https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/ug/ug_nfp_dsp.pdf
[4] "Intel Arria 10 Native Floating-Point DSP Intel FPGA IP User Guide", https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/ug/ug-a10dsp.pdf
[5] "Intel Agilex Variable Precision DSP Blocks User Guide", https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/agilex/ug-ag-dsp.pdf
[6] "Intel® Stratix® 10 NX FPGA", https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/stratix-10-nx-technology-brief.pdf
[7] "Flexibility: FPGAs and CAD in Deep Learning Acceleration", Proceedings of the 2018 International Symposium on Physical Design, https://doi.org/10.1145/3177540.3177561
[8] US Patent application number 16/818889: "Floating-point Decomposition Circuitry with Dynamic Precision", https://uspto.report/patent/app/20200218508

The performance numbers presented herein are a mix of measured and estimated numbers generated using an Arria 10 PAC card at a batch size of 1, incorporating an A10-1150 speed grade 2
FPGA. The host is a Xeon E5-1650 v3 @ 3.5 GHz w/ 132 GB RAM. Some numbers were estimated based on the fmax of the compiled architecture.
All information provided here is subject to change without notice.
Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document.
You should visit the referenced web site and confirm whether referenced data are accurate.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation.
Performance varies depending on system configuration. No computer system can be absolutely secure.
Check with your system manufacturer or retailer or learn more at www.intel.com.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Please Recycle WP-01308-1.0


