White Paper | Low-Precision Networks for Efficient Inference on FPGAs
Figure 2. Blocking of four FP9 floating-point values to int5bfp. The "1" in the left-most mantissa positions is the implicit 1 in floating-point format, made explicit prior to conversion. Sign + mantissa bits are now in sign + magnitude integer format.
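The blocking step of Figure 2 can be made concrete with a short sketch. The following numpy code is illustrative only (the function names and the int5bfp/block-size-4 defaults are assumptions matching the figure, not the hardware implementation): each block shares the exponent of its largest element, and smaller elements are effectively right-shifted, losing low-order mantissa bits.

    import numpy as np

    def quantize_to_bfp(values, int_bits=5, block_size=4):
        # Quantize a 1-D float array to block floating point (e.g. int5bfp):
        # each block of `block_size` values shares a single exponent, and
        # each value keeps a sign + (int_bits - 1)-bit magnitude mantissa.
        values = np.asarray(values, dtype=np.float64)
        mag_bits = int_bits - 1                      # bits left after the sign
        mantissas = np.zeros(values.size, dtype=np.int32)
        exponents = []
        for start in range(0, values.size, block_size):
            block = values[start:start + block_size]
            max_mag = np.max(np.abs(block))
            # The shared exponent comes from the largest magnitude in the
            # block; smaller values lose low-order bits when right-shifted.
            shared_exp = int(np.floor(np.log2(max_mag))) if max_mag > 0 else 0
            scale = 2.0 ** (mag_bits - 1 - shared_exp)
            limit = 2 ** mag_bits - 1                # e.g. +/-15 for int5bfp
            mantissas[start:start + block_size] = np.clip(
                np.round(block * scale), -limit, limit)
            exponents.append(shared_exp)
        return mantissas, exponents

    def dequantize_bfp(mantissas, exponents, int_bits=5, block_size=4):
        # Reconstruct approximate floats from mantissas + shared exponents.
        mag_bits = int_bits - 1
        out = np.empty(len(mantissas), dtype=np.float64)
        for i, exp in enumerate(exponents):
            s = slice(i * block_size, (i + 1) * block_size)
            out[s] = mantissas[s] * 2.0 ** (exp - (mag_bits - 1))
        return out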
BFP for High Inference Accuracy

Many of the high-parameter networks – the ResNets, Inception, VGG-based SSDs – quantize well to int8bfp and even int7bfp without any additional intervention, as shown in Table 1, where green highlights indicate a minimal loss of accuracy from the original FP32 model.

As expected, the drop in accuracy from applying quantization is more perceptible at very low precisions. This effect is exaggerated in the more modern, compact networks such as MobileNet and EfficientNet, which experience some accuracy drop even at higher precisions.

Fortunately, this penalty can be reversed easily. As few as four epochs of retraining – or a dozen for the more challenging networks – can recover the model's accuracy (a sketch of one common retraining technique follows the list below). Where this is insufficient, it is possible for the FPGA to accommodate an increase in the activation bit width for specific layers. Algorithmic techniques are available to determine and retrain a network to account for these changes in precision (see the section on Mixed Precision below).

Why Block Floating Point?
• Small resource footprint
• Excellent compatibility with Intel® FPGA DSP blocks
• High dynamic range: models weights and activations well at low precisions
• Simplicity in training: no parameter initialization
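The white paper does not specify the retraining recipe. A common approach is quantization-aware retraining with "fake quantization" and a straight-through estimator (STE), sketched below in PyTorch; the class name and the int5bfp/block-size-32 defaults are illustrative assumptions, not the paper's toolflow.

    import torch

    class FakeBFPQuant(torch.autograd.Function):
        # Forward: quantize to BFP (shared per-block exponent, sign+magnitude
        # mantissas) and dequantize again, so the network trains against the
        # precision it will see at inference time.
        # Backward: the straight-through estimator passes gradients unchanged,
        # so the FP32 master weights keep learning.
        @staticmethod
        def forward(ctx, x, int_bits=5, block_size=32):
            mag_bits = int_bits - 1                  # bits left after the sign
            flat = x.reshape(-1, block_size)         # assumes numel % block_size == 0
            max_mag = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-30)
            shared_exp = torch.floor(torch.log2(max_mag))
            scale = 2.0 ** (mag_bits - 1 - shared_exp)
            limit = 2 ** mag_bits - 1                # e.g. +/-15 for int5bfp
            q = torch.clamp(torch.round(flat * scale), -limit, limit)
            return (q / scale).reshape(x.shape)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None, None

    # Usage inside a layer's forward pass, e.g.:
    #   w_q = FakeBFPQuant.apply(self.weight)
    #   y = torch.nn.functional.conv2d(x, w_q, ...)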
Table 1. Indicative Top-1 accuracies for networks both with and without retraining, at int5/4bfp (int5bfp activations and int4bfp weights), int7bfp and int8bfp – all at block size 32. "n/a" shows where retraining is not required.
• Blue: full-precision accuracy
• Green: achieves quantization accuracy within 1 percentage point of the full-precision accuracy
• Amber: achieves quantization accuracy within around 5 percentage points of the full-precision accuracy
• Red: achieves quantization accuracy significantly lower than the full-precision accuracy
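The color key amounts to a simple threshold rule on the percentage-point drop from the FP32 baseline. A minimal sketch (the function name and example accuracies are illustrative, not taken from Table 1):

    def accuracy_band(fp32_top1: float, quantized_top1: float) -> str:
        # Classify a quantized result per the Table 1 color key, based on
        # the percentage-point drop from the FP32 baseline.
        drop = fp32_top1 - quantized_top1
        if drop <= 1.0:
            return "green"   # within 1 percentage point
        if drop <= 5.0:
            return "amber"   # within around 5 percentage points
        return "red"         # significantly lower

    # e.g. accuracy_band(76.1, 75.6) returns "green"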
Meaningful examples mirror real-life applications, which tend to employ variations of standard networks. As typical benchmarks, ResNet 50 and MobileNet v2 are used throughout this section to give an idea of the effects of quantization. The reference is int12bfp, which is a good proxy for FP16.

In the following figures, the baseline footprint (indicated in the leftmost columns of Figure 4) at int12bfp and block size 32 is:
• 816 M20K RAM blocks
• 551 DSP blocks
• 39,315 ALMs

In Figure 4, reductions in footprint result directly from reduced precision. With a single bitstream for both ResNet 50 and MobileNet v2, it is interesting to note the difference in frame rate.
[Figure 4 chart: "Hardware Footprint on Reducing Network Precision – Block Size 32". Left axis: Ratio of FP16 Resource Count; right axis: Frame Rate (fps); x-axis: int12bfp (eq. to FP16), int7bfp, int5/4bfp; series: RAM Blocks, DSP Blocks, ALMs, ResNet50 fps, MobileNetv2 fps.]
Figure 4. Resource count ratios and frame rate for ResNet 50 and MobileNet v2 inference at block size 32

Optimizing further, halving the block size leads to even greater savings. In Figure 5, while frame rate is halved for MobileNet v2 and reduced by two thirds for ResNet 50, using block size 16 halves RAM utilization and reduces the DSP block usage by two thirds.

[Figure 5 chart: "Hardware Footprint on Reducing Network Precision – Block Size 16". Same axes and series as Figure 4.]
Figure 5. Resource count ratios and frame rate for ResNet 50 and MobileNet v2 inference at block size 16

An additional indication of available trade-offs is given in Figure 6. A realistic application may well use MobileNet v2 at a frame rate of 30 frames per second (fps). In this case, a block size of 8 is sufficient to fulfill the criteria, which results in the use of 103 M20K RAM blocks, 31 DSP blocks and 12,635 ALMs at int5/4bfp (a worked example of these ratios follows below).

[Figure 6 chart: resource counts (RAM Blocks, DSP Blocks, ALMs) for MobileNet v2 at 30 fps, block size 8.]
Figure 6. Resource count and accuracy for MobileNet v2 inference at 30 fps with block size 8

As seen, simply by reducing the precision, the associated numbers of ALMs and RAM blocks roughly follow the pattern seen in DSP block usage. This is amplified by decreasing the block size, which – while limiting throughput – compounds the benefits in the footprint. It is worth noting that doubling the number of hardware instances doubles the throughput – potentially useful in the case of multiple input streams or requirements for redundancy in the system.
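To put the Figure 6 numbers in context, the reductions relative to the int12bfp, block-size-32 baseline listed earlier can be computed directly. A worked example using only figures quoted in this section:

    # Baseline at int12bfp, block size 32 (from the list above)
    baseline = {"M20K RAM blocks": 816, "DSP blocks": 551, "ALMs": 39315}
    # MobileNet v2 at 30 fps, int5/4bfp, block size 8 (Figure 6)
    reduced = {"M20K RAM blocks": 103, "DSP blocks": 31, "ALMs": 12635}

    for resource, base in baseline.items():
        ratio = reduced[resource] / base
        print(f"{resource}: {reduced[resource]} / {base} = {ratio:.2f} "
              f"({(1 - ratio) * 100:.0f}% reduction)")
    # M20K RAM blocks: 103 / 816 = 0.13 (87% reduction)
    # DSP blocks: 31 / 551 = 0.06 (94% reduction)
    # ALMs: 12635 / 39315 = 0.32 (68% reduction)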
The performance numbers presented herein are a mix of measured and estimated numbers generated using an Arria 10 PAC card at a batch size of 1, incorporating an A10-1150 speed grade 2 FPGA. The host is a Xeon E5-1650 v3 @ 3.5 GHz with 132 GB RAM. Some numbers were estimated based on the fmax of the compiled architecture.