©2017, Amazon Web Services, Inc. or its affiliates. All rights reserved
FPGAs in the cloud?
Julien Simon, Principal Evangelist, AI/ML
@julsimon
Velocity Conference, NYC, October 4, 2017
Agenda
• The case for non-CPU architectures
• What is an FPGA?
• Using FPGAs on AWS
• Demo: running an FPGA image on AWS
• FPGAs and Deep Learning
• Resources
The case for non-CPU architectures
Source: Intel
Powering AWS instances: Intel Xeon E7 v4
• 7.1 billion transistors
– 456 mm2 (0.7 square inch)
• General-purpose architecture
– SISD with SIMD extension (AVX instruction set)
• Best single-core performance
• Low parallelism
– 24 cores, 48 hyperthreads
– Multi-threaded applications are hard to build
– OS and libraries need to be thread-friendly
• Thermal envelope: 168W
https://ark.intel.com/products/96900/Intel-Xeon-Processor-E7-8894-v4-60M-Cache-2_40-GHz
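To make the SISD/SIMD bullet concrete, here is a minimal sketch (not from the original deck) of the same loop written once as scalar C and once with AVX intrinsics from <immintrin.h>; the AVX version processes eight single-precision floats per instruction.

#include <immintrin.h>

/* Scalar (SISD): one addition per instruction. */
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD (AVX): eight single-precision additions per instruction. */
void add_avx(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)   /* scalar tail for the remaining elements */
        c[i] = a[i] + b[i];
}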
Case study: Clemson University
1.1 million vCPUs for Natural Language Processing
Optimized cost thanks to Spot Instances
https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/
Moore’s winter is (probably) coming
• « I guess I see Moore’s Law dying here in the next decade or so, but
that’s not surprising », Gordon Moore, 2015
• Technology limits: a Skylake transistor is around 100 atoms across
• New workloads require higher parallelism to achieve good
performance
– Genomics
– Financial computing
– Image and video processing
– Deep Learning
• The age of the GPU has come
http://www.economist.com/technology-quarterly/2016-03-12/after-moores-law
https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
State-of-the-art GPU: Nvidia V100
• 21.1 billion transistors
- 815 mm2 (1.36 square inch)
• Architecture optimized for floating point
– SIMT (Single Instruction, Multiple Threads)
• Massive parallelism
– 5120 CUDA cores, 640 Tensor cores
– CUDA programming model
– Large, high-bandwidth off-chip memory (DRAM)
• Thermal envelope: 250W
https://www.nvidia.com/en-us/data-center/tesla-v100/
https://devblogs.nvidia.com/parallelforall/inside-volta/
GPUs are not optimal for some applications
• Power consumption and efficiency (TOPS/Watt)
• Strict latency requirements
• Other requirements
– Custom data types, irregular parallelism, divergence
• Building your own ASIC may solve this, but:
– It’s a huge, costly and risky effort
– ASICs can’t be reconfigured
• Time for an FPGA renaissance?
What’s an FPGA?
The FPGA
• First commercial product by Xilinx in 1985
• Field Programmable Gate Array
• Not a CPU (although you could build one with it)
• « Lego » hardware: logic cells, lookup tables, DSP, I/O
• Small amount of very fast on-chip memory
• Build custom logic to accelerate your SW application
FPGA architecture
Sources:
https://www.embedded-vision.com/industry-analysis/technical-articles/fpgas-deep-learning-based-vision-processing
http://www.bober-optosensorik.de/fpga-entwicklung.html
Developing FPGA applications
• Languages
– VHDL, Verilog
– OpenCL (C++)
• Software tools
– Design
– Simulation
– Synthesis
– Routing
• Hardware tools
– Evaluation boards
– Prototypes
Expensive and hard to scale
Using FPGAs on AWS
Amazon EC2 F1 Instances
• Up to 8 Xilinx UltraScale Plus VU9P FPGAs
• Each FPGA includes:
– Local 64 GB DDR4 ECC-protected memory
– Dedicated PCIe x16 connections
– Up to 400 Gbps bidirectional ring connection for high-speed streaming
– Approximately 2.5 million logic elements and approximately 6,800 DSP engines
The FPGA Developer Amazon Machine Image (AMI)
• Xilinx SDx 2017.1
– Free license for F1 FPGA development
– Supports VHDL, Verilog, OpenCL
• AWS FPGA SDK
– Amazon FPGA Image (AFI) Management Tools
– Linux drivers
– Command line
• AWS FPGA HDK
– Design files and scripts required to build an AFI
– Shell: platform logic to handle external peripherals, PCIe, DRAM, and interrupts
• Run simulation, design, etc. on a C4 instance to save money!
https://aws.amazon.com/marketplace/pp/B06VVYBLZZ
https://github.com/aws/aws-fpga
FPGA Acceleration Using F1 instances (architecture diagram): an F1 instance combines a CPU, booted from an Amazon Machine Image (AMI), with FPGAs loaded from an Amazon FPGA Image (AFI) delivered through the AWS Marketplace; each FPGA talks to the CPU over PCIe, drives its own DDR-4 attached memory through on-chip DDR controllers, and connects to the other FPGAs over the FPGA link.
Case study: Edico Genome
Highly Efficient
• Algorithms Implemented in Hardware
• Gate-Level Circuit Design
• No Instruction Set Overhead
Massively Parallel
• Massively Parallel Circuits
• Multiple Compute Engines
• Rapid FPGA Reconfigurability
Speeds Analysis of Whole Human Genomes from Hours to Minutes
Unprecedented Low Cost for Compute and Compressed Storage
http://www.edicogenome.com/
https://aws.amazon.com/marketplace/pp/B075JR57J1
Case study: NGCodec
• Provider of UHD video compression technology
• Up to 50x faster vs. software H.265
• Higher quality video than x265 ‘veryslow’ preset
– Same bit rate
– 60+ frames per second
• Lower latency between live stream and end viewing
• Optimized cost
https://ngcodec.com/markets-cloud-transcoding/
https://aws.amazon.com/marketplace/pp/B074W1FPKR
Demo: OpenCL on F1 instance
Building the OpenCL application
git clone https://github.com/aws/aws-fpga.git
cd aws-fpga
source sdk_setup.sh
source hdk_setup.sh
source sdaccel_setup.sh
source $XILINX_SDX/settings64.sh
cd $SDACCEL_DIR/examples/xilinx/getting_started/host/helloworld_ocl/
make clean
make check TARGETS=sw_emu DEVICES=$AWS_PLATFORM all   # software emulation (runs the kernel on the CPU, fast)
make check TARGETS=hw_emu DEVICES=$AWS_PLATFORM all   # hardware emulation (RTL simulation of the kernel)
make check TARGETS=hw DEVICES=$AWS_PLATFORM all       # full synthesis and place & route (takes hours)
Creating Vivado project and starting FPGA synthesis
…
INFO: [XOCC 60-586] Created xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin
Total elapsed time: 2h 31m 7s
$SDACCEL_DIR/tools/create_sdaccel_afi.sh \
  -xclbin=xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin \
  -o=vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0 \
  -s3_bucket=jsimon-fpga -s3_logs_key=logs -s3_dcp_key=dcp
…
Generated manifest file '17_10_02-163912_manifest.txt'
upload: ./17_10_02-163912_Developer_SDAccel_Kernel.tar to s3://jsimon-fpga/dcp/17_10_02-163912_Developer_SDAccel_Kernel.tar
17_10_02-163912_agfi_id.txt
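For reference, the helloworld_ocl example synthesized above is a simple vector addition; a minimal OpenCL C kernel of that kind is sketched below (the actual kernel shipped in the aws-fpga repository may differ in details). SDAccel compiles this function into custom logic on the FPGA rather than into instructions for a CPU or GPU.

/* Element-wise vector addition, synthesized to FPGA logic by SDAccel. */
__kernel void vector_add(__global const int *a,
                         __global const int *b,
                         __global int *c,
                         const int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}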
Building the AFI
aws ec2 describe-fpga-images --fpga-image-id afi-056fb17ddb8cedf37
{ "FpgaImages": [{
"UpdateTime": "2017-10-02T16:39:17.000Z",
"Name": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin",
"FpgaImageGlobalId": "agfi-03a8031774fc4773f",
"Public": false,
"State": { "Code": "pending"},
"OwnerId": "6XXXXXXXXXXX",
"FpgaImageId": "afi-056fb17ddb8cedf37",
"CreateTime": "2017-10-02T16:39:17.000Z",
"Description": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin" }]
}
Loading the AFI and running the OpenCL
application
aws ec2 describe-fpga-images --fpga-image-id afi-056fb17ddb8cedf37
{ "FpgaImages": [{
"UpdateTime": "2017-10-02T16:39:17.000Z",
"Name": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin",
"FpgaImageGlobalId": "agfi-03a8031774fc4773f",
"Public": false,
"State": { "Code": "ready"},
"OwnerId": "6XXXXXXXXXXX",
"FpgaImageId": "afi-056fb17ddb8cedf37",
"CreateTime": "2017-10-02T16:39:17.000Z",
"Description": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin" }]
}
sudo fpga-load-local-image -S 0 -I agfi-03a8031774fc4773f   # load the AFI into FPGA slot 0
sudo fpga-describe-local-image -S 0                         # check that the AFI is loaded
sudo sh                                                     # open a root shell for the runtime
source /opt/Xilinx/SDx/2017.1.rte/setup.sh                  # set up the SDAccel runtime environment
./helloworld                                                # run the OpenCL host application
sudo fpga-clear-local-image -S 0                            # clear the FPGA slot when done
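Under the hood, ./helloworld is a standard OpenCL host program. The condensed sketch below (my own illustration, with assumed file and kernel names, and error handling omitted) shows the key step: instead of compiling OpenCL source at run time, the host loads the pre-built .xclbin binary with clCreateProgramWithBinary and then creates the kernel.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    /* 1. Pick the first accelerator device (the Xilinx platform on F1). */
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 2. Load the pre-built FPGA binary (.xclbin) produced by the build flow. */
    FILE *f = fopen("vector_addition.xclbin", "rb");
    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    rewind(f);
    unsigned char *bin = malloc(size);
    fread(bin, 1, size, f);
    fclose(f);

    cl_program program = clCreateProgramWithBinary(ctx, 1, &device, &size,
        (const unsigned char **)&bin, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "vector_add", NULL);

    /* 3. Set kernel arguments, enqueue the kernel on the command queue,
       and read the results back (omitted for brevity). */
    (void)queue; (void)kernel;
    return 0;
}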
FPGAs and Deep Learning
A chink in the GPU armor?
• GPUs are great for training,
but what about inference?
• Throughput and latency: pick one?
– Using batches increases latency
– Using single samples degrades throughput
• Power and memory requirements
– Floating-point operations are power-hungry
– Floating-point weights need more DRAM,
which is power-hungry too
• Neural networks can be implemented
on FPGA
(Image credit: © HBO)
Using custom logic to Multiply and Accumulate
Source: « FPGA Implementations of Neural Networks », Springer, 2006
Smaller weights → fewer gates and less data to load into the FPGA
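The multiply-accumulate that this custom logic implements is, in scalar form, just the loop below (an illustrative sketch, not code from the talk). On the FPGA, many such units run in parallel, and narrower weights mean each unit needs fewer gates and less memory bandwidth.

#include <stdint.h>

/* One neuron's pre-activation: dot product of 8-bit inputs and weights,
   accumulated in a wider register to avoid overflow. */
int32_t mac(const int8_t *x, const int8_t *w, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)w[i];   /* multiply-accumulate */
    return acc;
}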
Optimizing Deep Learning models for FPGAs
• Quantization: using integer weights
– 8/4/2-bit integers instead of 32-bit floats
– Reduces power consumption
– Simplifies the logic needed to implement the
model
– Reduces memory usage
• Pruning: removing useless connections
– Increases computation speed
– Reduces memory usage
• Compression: encoding weights
– Reduces model size
On-chip SRAM becomes a viable option:
– More power-efficient than DRAM
– Faster than off-chip DRAM
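As a concrete illustration of the quantization bullet above, here is a minimal sketch (a naive symmetric scheme of my own, not the method used in the cited papers) that maps 32-bit float weights to 8-bit integers with a single per-tensor scale factor:

#include <math.h>
#include <stdint.h>

/* Symmetric per-tensor quantization: float32 weights -> int8 plus one scale. */
float quantize_int8(const float *w, int8_t *q, int n) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(w[i] / scale);   /* 4x smaller than float32 */
    return scale;                              /* dequantize with w ~ q * scale */
}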
Published results
[Han, 2016] Optimizing CNNs on CPU and GPU
• AlexNet 35x smaller, VGG-16 49x smaller
• 3x to 4x speedup, 3x to 7x more energy-efficient
• No loss of accuracy
[Han, 2017] Optimizing LSTM on Xilinx FPGA
• FPGA vs CPU: 43x faster, 40x more energy-efficient
• FPGA vs GPU: 3x faster, 11.5x more energy-efficient
[Nurvitadhi, 2017] Optimizing CNNs on Intel FPGA
• FPGA vs GPU: 60% faster, 2.3x more energy-efficient
• <1% loss of accuracy
Nvidia Hardware for Deep Learning
• Open architecture for DL inference accelerators on IoT
devices
– Convolution Core – optimized high-performance convolution engine
– Single Data Processor – single-point lookup engine for activation functions
– Planar Data Processor – planar averaging engine for pooling
– Channel Data Processor – multi-channel averaging engine for normalization
functions
– Dedicated Memory and Data Reshape Engines – memory-to-memory
transformation acceleration for tensor reshape and copy operations.
• Verilog model + test suite
• F1 instances are supported
http://nvdla.org/
https://github.com/nvdla/
Conclusion
• CPU, GPU, FPGA: the battle rages on
• As always, pick the right tool for the job
– Application requirements: performance, power, cost, etc.
– Time to market
– Skills
– The AWS marketplace: the solution may be just a few clicks away!
• AWS offers you many options,
please explore them and give us feedback
Resources
https://aws.amazon.com/ec2/instance-types/f1
https://aws.amazon.com/ec2/instance-types/f1/partners/
https://github.com/aws/aws-fpga
[Han, 2016] « Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding »
https://arxiv.org/abs/1510.00149
[Han, 2017] « ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA », Best Paper at FPGA’17
https://arxiv.org/abs/1612.00694
« Deep Learning Tutorial and Recent Trends », FPGA’17
http://isfpga.org/slides/D1_S1_Tutorial.pdf
[Nurvitadhi, 2017] « Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? », FPGA’17
http://jaewoong.org/pubs/fpga17-next-generation-dnns.pdf
Thank you!
http://aws.amazon.com/evangelists/julien-simon
@julsimon
Editor's Notes
  • #2: OK
  • #3: OK
  • #4: OK
  • #5: Intel 4004: November 15, 1971, 4-bit architecture, 0.74 MHz, 2,300 transistors, 10 µm process
  • #6: XXX is this Skylake?
  • #10: OK
  • #11: OK
  • #15: Amazon EC2 instances: F1 family FPGA Developer AMI AWS SDK and HDK
  • #16: Available in the Northern Virginia, Oregon and Ireland regions
  • #18: An F1 instance can have any number of AFIs. An AFI can be loaded into the FPGA in less than 1 second.
  • #19: Edico Genome ported their genomics platform (DRAGEN) to F1, enabling real-time genomic analysis while saving cost and dramatically scaling its availability. This offering has the potential to be transformative for hospitals, academic institutions, drug developers and sequencing centers, as it enables them to analyze whole-genome data in under an hour, up to a 10x improvement over comparable state-of-the-art algorithms both on-premises and in the cloud.
  • #20: off-the-shelf images for customers; a revenue stream for FPGA developers