
Artificial Intelligence & Deep Learning Hardware Accelerators for
Smart Technology and Intelligent Society
Shiho Kim, Ashutosh Mishra,
Seamless Transportation Lab, Yonsei University
Hyunbin Park, Samsung Electronics

2021 IEEE International Symposium on Circuits and Systems


May 22-28, 2021 Virtual & Hybrid Conference

Seamless Transportation Lab


Outline of Contents 2

1. AI Accelerators (Training & Inference)


• Architecture, Challenging Issues of AI Accelerators for Training & Inference

2. On-device AI of Smartphone
• On-device AI for Smartphone: Hardware, Software, Model Optimization and Benchmarking

3. AI Accelerators for Autonomous Vehicles


• Overview, Survey and challenges, HW AI Accelerators for Autonomous vehicles
Speakers 3

7.1 AI Accelerators (Training & Inference)


Dr. Ashutosh Mishra received his Ph.D. degree in 2018 from the Department of Electronics Engineering, Indian
Institute of Technology (BHU) Varanasi, India. He has worked as an Assistant Professor in Electronics and Communication
Engineering at the National Institute of Technology Raipur, India. He is a recipient of the Korea Research Fellowship (KRF) 2019
provided by the National Research Foundation of Korea through the Ministry of Science & ICT, South Korea. Currently, he is
working as research faculty in the Seamless Transportation Lab, Yonsei University, South Korea. His research interests include
smart sensors, intelligent systems, autonomous vehicles, and artificial intelligence.

7.2 On-device AI of Smartphone


Dr. Hyunbin Park received his Ph.D. from the School of Integrated Technology, Yonsei University, South Korea, in 2019.
Currently, he is working as a Staff Engineer at Samsung Electronics, South Korea. His expertise includes the design of inference
accelerators for deep neural networks and deep learning processors inside cameras. His research interests are in NPU designs
for smartphones, autonomous vehicle processors, and hardware accelerators for artificial intelligence and deep learning.

7.3 AI Accelerators for Autonomous Vehicles


Dr. Shiho Kim is a Professor in the School of Integrated Technology, Yonsei University, South Korea. He has been
directing the Seamless Transportation Lab since 2011. His main research interests include the development of software and
hardware technologies for intelligent vehicles, and reinforcement learning for autonomous vehicles. He is a member of the
editorial board and a reviewer for various journals and international conferences. So far he has organized two international
conferences as Technical Chair/General Chair.
ISCAS 2021 Tutorial 7-1
AI Accelerators (Training & Inference)

Ashutosh Mishra, Yonsei University


Shiho Kim, Yonsei University
Hyunbin Park, Samsung Electronics

2021 IEEE International Symposium on Circuits and Systems


May 22-28, 2021 Virtual & Hybrid Conference

Seamless Transportation Lab


Outline of Tutorial Session 7-1 5

 Artificial Intelligence
• AI Algorithms & Hardware Requirements

 AI Hardware Accelerators
• Overview & Opportunities

 State-of-The-Art AI accelerators
• Recent AI Hardware Accelerators

 Summary
6
Artificial Intelligence (AI)
7
Artificial Intelligence (AI)
8
AI (Training and Inference)
9
AI (Training and Inference)
10
AI (Training and Inference)
11
AI (Inference)

SOURCE: Xilinx Adaptive Compute Acceleration Platform

Key Requirements of AI Inference Acceleration

 Lowest latency
 Accelerate whole application
 Match the speed of AI innovation
12
Software & Hardware Options of an ML Inference System

[Figure] Interchange of models between ML frameworks ("My ML Framework" and other ML frameworks) via the Neural Network Exchange Format (NNEF) and the Open Neural Network Exchange (ONNX).
13
AI Algorithms (Deep Learning)

Task Suitable Models

Image classification CNN

Image recognition CNN

Time series prediction RNN, LSTM

Text generation RNN, LSTM

Visualization SOM
14
AI Algorithms (Deep Learning)
Neural Networks

Convolutional Neural Networks (CNNs)


15
AI Algorithms (Deep Learning)

h = height of input, w = width of input, d = depth of input (= no. of channels in the filter/kernel)
s = stride (pixel skipping); m = l = filter size
p = width of output, q = height of output, k = depth of output

Output width = (Input width + 2 × padding − filter width) / stride + 1
Output height = (Input height + 2 × padding − filter height) / stride + 1
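As a quick sanity check, a minimal Python sketch of these output-size formulas (the layer dimensions below are illustrative assumptions, not taken from the slides):

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output spatial size of a convolution along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# Illustrative example: a 224x224 input, 7x7 filter, stride 2, padding 3
out_w = conv_output_size(224, 7, padding=3, stride=2)   # -> 112
out_h = conv_output_size(224, 7, padding=3, stride=2)   # -> 112
print(out_w, out_h)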
16
AI Algorithms (Deep Learning)
17
AI Algorithms (Deep Learning)
MAC operations in inference and training of a convolutional layer

Inter Layer Parallelism

SOURCE: [1]. Choi, S., Sim, J., Kang, M., Choi, Y., Kim, H. and Kim, L.S., 2020. An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on Smart Devices. IEEE Journal of Solid-State Circuits, 55(10), pp.2691-2702.
[2]. Song, L., Qian, X., Li, H. and Chen, Y., 2017, February. Pipelayer: A pipelined reram-based accelerator for deep learning. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 541-552). IEEE.
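To see why convolutional layers dominate the MAC count, here is a minimal sketch of the standard MAC count for one convolutional layer, using the symbols from the earlier slide (filter size m × l, input depth d, output size p × q, output depth k); the layer shape in the example is an illustrative assumption:

def conv_layer_macs(p, q, k, m, l, d):
    """MAC operations in one convolutional layer: each of the k output channels
    computes p*q output pixels, and each pixel needs an m*l*d dot product."""
    return p * q * k * m * l * d

# Illustrative example: 112x112x64 output, 7x7 filters over a 3-channel input
print(conv_layer_macs(p=112, q=112, k=64, m=7, l=7, d=3))  # ~118 million MACs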
18
AI Algorithms (Timeline for Computer Vision Models)

[Timeline figure] LeNet-5 (1998), AlexNet (2012), VGG-16 and Inception-v1 (2014), Inception-v3 and ResNets such as ResNet-50 (2015), Inception-v4 and Xception (2016), ResNeXt-50 (2017)
19
AI Algorithms (Deep Learning Models)
20
AI Algorithms

Model               Size (MB)   Top-1   Top-5   Parameters    Depth
VGG16               528         0.713   0.901   138,357,544   23
Inception-V3        92          0.779   0.937   23,851,784    159
ResNet50            98          0.749   0.921   25,636,712    -
Xception            88          0.790   0.945   22,910,480    126
InceptionResNetV2   215         0.803   0.953   55,873,736    572
ResNeXt50           96          0.777   0.938   25,097,128    -

Here, Top-1 & Top-5 accuracy indicate the model performance on the ImageNet validation dataset. Depth indicates the topological depth of the network, including convolutional layers, pooling layers, activation layers, batch normalization layers, etc.

• Top-1 accuracy is the conventional version of accuracy. It considers only the one class with the highest probability.
• Top-5 accuracy uses the top-5 classes instead of one class.

For example, if the actual image is a blueberry and the predictions are (with probability):
1. cherry: 0.35
2. raspberry: 0.25
3. blueberry: 0.2
4. strawberry: 0.1
5. apple: 0.06
6. orange: 0.04

• According to Top-1 accuracy, the prediction (cherry: 0.35) is wrong.
• According to Top-5 accuracy, the prediction is correct, since blueberry is still in the top-5 results.
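A minimal sketch of how Top-1 and Top-5 correctness can be computed for the blueberry example above (class names and scores taken from the slide):

# Predicted probabilities for the example image whose true label is "blueberry"
predictions = {"cherry": 0.35, "raspberry": 0.25, "blueberry": 0.2,
               "strawberry": 0.1, "apple": 0.06, "orange": 0.04}
true_label = "blueberry"

# Sort classes by descending probability
ranked = sorted(predictions, key=predictions.get, reverse=True)

top1_correct = (ranked[0] == true_label)    # False: the top-1 prediction is "cherry"
top5_correct = (true_label in ranked[:5])   # True: "blueberry" is ranked 3rd
print(top1_correct, top5_correct)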
21
AI Accelerators (Hardware Accelerators)

• Hardware Acceleration — It is the use of computer hardware specially made to perform some
functions more efficiently than is possible in software running on a general-purpose
processing unit alone.

• It combines the flexibility of general-purpose processors, such as CPUs, with fully
customized hardware, such as GPUs and ASICs, to increase efficiency by orders of
magnitude. For example, visualization processes may be offloaded onto a graphics card in order
to enable faster, higher-quality playback of videos and games.

• An AI accelerator is a class of specialized computer systems designed to accelerate artificial
intelligence applications, especially artificial neural networks, machine vision, and machine
learning.
22
AI Accelerators (Hardware Accelerators)

• Emerging Deep Neural Network (DNN) algorithms have enabled AI-assisted systems to witness tremendous growth in recent years in accomplishing a variety of cognitive tasks.
• However, the algorithmic superiority of DNNs comes with extremely high computation and memory costs that pose significant challenges to the hardware platforms executing them. Therefore, AI accelerators are gaining attention from designers and academicians in circuits and systems.
• Recently, approximate computing, in-memory computing, machine intelligence, and quantum computing are among the state-of-the-art computing approaches being explored for AI workloads.
23
Artificial Intelligence & Hardware Requirements

• The size of AI models is increasing exponentially.

• Therefore, the floating-point operations per second (FLOPS) required are doubling roughly
every 3.5 months, creating an insatiable demand for ever more performance.

• As a result, over 100 companies are developing new architectures to bring performance
up and the cost of computing down.

• Analyses show that 2X advances in architecture, silicon, interconnect, software,
and packaging are required to match the FLOPS requirement (~10X) every year.

SOURCE: Intel
SOURCE: Moor Insights and Strategy


24
AI Accelerators (Hardware Accelerators)

• General-purpose (GP) hardware (HW) uses arithmetic blocks for basic in-memory calculations
(i.e., serial processing), which is not suitable for high-performance deep learning techniques.

 Neural networks need many parallel and simple arithmetic operations.
 Even powerful GP chips cannot support a high number of simultaneous operations.
 AI-optimized HW includes numerous less powerful chips, which enables parallel processing.

• AI accelerators provide the following advantages over GP hardware:

 Faster computation: Artificial intelligence applications typically require parallel
computational capabilities in order to run sophisticated training models and algorithms. AI
hardware provides more parallel processing capability, estimated to deliver up to 10
times more computing power in ANN applications compared to traditional semiconductor
devices at a similar price.

 High-bandwidth memory: Specialized AI hardware is estimated to allocate 4-5 times more
bandwidth than traditional chips. This is necessary for parallel processing, as AI applications
require significantly more bandwidth between processors for efficient performance.
25
AI Accelerators Assessment Parameters
Processing speed Power requirements Device Size Total Cost

Processing Speed: AI hardware enables faster training and inference using neural networks.
• Faster training enables the ML experts to try different DL approaches.
• Optimize the structure of their neural network (hyper parameter optimization).
• Faster inferences (e.g. predictions) are critical for applications like autonomous driving.

Power Requirements: Lower power consumption increases the device's on-time.

Device Size: The device size is very important in IoT applications, mobile phones, or small devices.

Total Cost: The cost of the device is extremely crucial for any procurement decision.
26
AI Accelerators Assessment Parameters

Another important criterion in assessing AI hardware is the platform.

It is challenging, as the chip needs to be supported by hardware and software for developers to
build applications on it.

 Standalone platform

• Personal Computer
• On-board Devices
• Mobile Devices

 Server-based platform

 Cloud-based platform
27
Performance Metrics for AI Accelerators

Instructions Per Second (IPS): A measure of processor speed in instructions/cycle or
instructions/second (e.g., kilo instructions per second (KIPS), million instructions per second (MIPS),
billion instructions per second (GIPS), etc.).
• However, the instructions/cycle measurement depends on the instruction sequence, the data,
and external factors.

Floating Point Operations Per Second (FLOPS, flops or flop/s): A measure of computer
performance in floating-point calculations.
• Therefore, it is a more accurate measure than instructions per second.

Tera (trillion) Operations Per Second (TOPS): A measure of the maximum achievable throughput,
not a measure of actual throughput.
• Most operations are multiply-and-accumulate (MAC).
• Therefore, TOPS = (number of MAC units) × (frequency of MAC operations) × 2 (see the sketch below).
• Even TOPS alone is not enough information to judge real performance.
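A minimal sketch of the peak-TOPS formula above (the MAC count and clock frequency in the example are illustrative assumptions, not figures for any particular chip):

def peak_tops(num_mac_units, mac_frequency_hz):
    """Peak throughput in TOPS: each MAC counts as 2 operations (multiply + add)."""
    return num_mac_units * mac_frequency_hz * 2 / 1e12

# Illustrative: 4096 MAC units clocked at 1 GHz -> 8.192 peak TOPS
print(peak_tops(4096, 1e9))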
28
Performance Metrics for AI Accelerators

• We should know the throughput of our model.
• Further, Throughput/$ is the inference efficiency for a given model.
• All inference accelerators have four key components:
MACs; SRAM; DRAM; interconnect architecture.
• The interconnect architecture connects the compute and memory blocks along with the logic
that controls the execution of the neural network model.
• However, more MACs, SRAM, DRAM, and interconnect improve throughput but also increase
the cost.
• Therefore, to get maximum inference efficiency we should maximize throughput (for a
given model, image size, and batch size) with the least MACs, SRAM, DRAM, and
interconnect, which eventually maximizes Throughput/$.
• Note that cost and power are roughly correlated (the power dissipation comes from
MACs, SRAM, DRAM, and interconnect; more of each translates to more power).
29
AI Accelerators Design Approaches

AI Accelerator Architectures

General Purpose Processors

Scalar processing elements (Central Processing Units (CPUs)):
AMD Ryzen 9 5900X, AMD Ryzen 9 3950X, Intel® Core™ i5-10600K, 11th Gen Intel® Core™ vPro, etc.

Vector processing elements (Graphics Processing Units (GPUs)):
NVIDIA GeForce RTX 3070, RTX 3080, AMD Radeon RX 6900 XT, RX 6700 XT, etc.

Special Purpose Processors

Programmable logic (Field-Programmable Gate Arrays (FPGAs)):
Intel® Stratix® 10 GX, SX, TX, etc.

Fixed logic (Application-Specific Integrated Circuits (ASICs)):
Google TPU, Habana Gaudi, Intel Nervana, etc.

Mobile AI / Neural Processing Units (NPUs):
Exynos 2100, Qualcomm® Hexagon™ 780, MAX78000, etc.

Emerging Technologies

Processing in-memory (PIM), neuromorphic computing, quantum computing, AI wafer-scale chips, analog
memory-based technologies, etc.
30
AI Accelerators Design Approaches

• Central Processing Units (CPUs): General-purpose processors mostly used in standalone
personal computers (Intel Core, AMD Ryzen, etc.).

• Graphics Processing Units (GPUs): Originally designed for accelerating graphics
processing through parallel computing. The same approach is effective for training complex deep
learning algorithms.

• Wafer-Scale Chips: To increase packaging density, a whole silicon wafer containing trillions of transistors is used as a
single chip (e.g., Cerebras). Its ~72-square-inch wafer contains about 1.2 trillion transistors
and can support roughly 400,000 processing cores.

• Neural Processing Units (NPUs): The architecture provides parallel computing and pooling to increase
overall performance. It is specialized for Convolutional Neural Network (CNN) applications. The
architecture can be reconfigured to switch between models in real time, allowing the hardware to be
optimized for the needs of the application.
31
AI Accelerators Design Approaches

• Neuromorphic Architectures: These are an attempt to mimic brain cells using novel
approaches from adjacent fields such as materials science and neuroscience. These chips can
have an advantage in terms of speed and efficiency in training neural networks.

• Analog Memory-based Technologies: Digital systems built on 0's and 1's dominate today's
computing world. However, analog techniques use signals that are continuously variable and
have no specific ranges. An IBM research team demonstrated that large arrays of analog memory
devices can achieve levels of accuracy similar to GPUs in deep learning applications.

• Tensor Processing Unit (TPU): An application-specific integrated circuit (ASIC) based AI
accelerator developed by Google. Cloud TPU enables running machine learning workloads
on Google's TPU accelerator hardware using TensorFlow.
32
Processor Designs
A very simple processor (integer processing unit): only one integer operation in one clock cycle.
Superscalar / OoO (Out of Order): 2 integer operations in the same clock cycle; OoO will simply pause the thread execution during any blockage.
SMT (Simultaneous Multi-Threading): 2 instruction streams = more parallelism; latency when one of the SMT threads is blocked; one ALU is saturated while the other is used <1%.
Vector Processing / SIMD (single instruction, multiple data): SIMD/vector instructions share the same fundamental limitation.
SIMT (Single-Instruction Multiple Threads): the Warp Scheduler is from NVIDIA; AMD GCN (4x SIMD + 1 ALU).
SIMT + SIMD is how a modern GPU works.

FPU = Floating Point Unit; ALU = Arithmetic & Logic Unit

SOURCE: https://medium.com/@valarauca/wtf-is-a-simd-smt-simt-f9fb749f89f1
33
AI Accelerators Design Approaches

Generic DNN Accelerator Architecture

• Temporal architectures appear mostly in CPUs or GPUs and employ a variety of techniques to improve
parallelism, such as vectors (SIMD) or parallel threads (SIMT).
• Such temporal architectures use centralized control for a large number of arithmetic logic units (ALUs).
These ALUs can only fetch data from the memory hierarchy and cannot communicate directly with each other.
• Spatial architectures use dataflow processing; i.e., the ALUs form a processing chain so that they can
pass data from one to another directly.
• Each ALU has its own control logic and local memory, called a scratchpad or register file. An ALU with its
own local memory is referred to as a processing engine (PE).
• Spatial architectures are commonly used for deep neural networks in ASIC and FPGA-based designs.

SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
34
AI Accelerators Design Approaches

Generic dual core processor Intel dual-core dual-processor system


Multicore processors
SOURCE: Intel

• A multicore processor is a single integrated circuit with two or more separate processing
units, called cores, each of which reads and executes program instructions.

• The instructions are ordinary CPU instructions (such as add, move data, and branch) but the
single processor can run instructions on separate cores at the same time, increasing overall
speed for programs that support multithreading or other parallel computing techniques.
35
AI Accelerators Design Approaches

• GPU is a specialized processor originally designed to


accelerate graphics rendering.

• GPUs can process many pieces of data simultaneously,


making them useful for machine learning, video editing,
and gaming applications, etc.

SOURCE: https://steemit.com/gridcoin/@dutch/hardware-and-project-selection-part-1-cpu-vs-gpu

A generic modern GPU architecture

SOURCE: https://www.nextplatform.com/2019/07/10/a-decade-of-accelerated-computing-augurs-well-for-gpus/

SOURCE: Aamodt, T.M., Fung, W.W.L. and Rogers, T.G., 2018. General-purpose graphics
processor architectures. Synthesis Lectures on Computer Architecture, 13(2), pp.1-140.
36
AI Accelerators Design Approaches

Comparison of the number of CPU cores and GPUs


SOURCE: Capra, M., Bussolino, B., Marchisio, A., Shafique, M., Masera, G. and Martina, M., 2020. An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks. Future Internet, 12(7), p.113.
37
AI Accelerators Design Approaches

NVIDIA GPU Architecture: N multiprocessors (SMs), each with M cores (SPs).

SM = Streaming Multiprocessor; SP = Streaming Processor (core / CUDA core);
MP = Multiprocessor; SM (in the lower figure) = Shared Memory; SFU = Special Functions Unit; IU = Instruction Unit

SOURCE: NVIDIA CUDA™
38
AI Accelerators Design Approaches

SOURCE: https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
39
AI Accelerators Design Approaches

• GPUs are the current workhorses for DNN inference and especially training.

• Limitations:

 Bandwidth,
 Latency, and
 Branch prediction.
For chaotic code flow, the GPU becomes even slower than the CPU.

• A GPU needs a CPU to control it
(the latest GPUs are more independent and can give themselves new tasks after the first task is given by the CPU).
40
AI Accelerators Design Approaches
• Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs)
belong to the spatial architectures.

• The primary purpose of FPGAs is programmability, to implement any possible design.

• They are relatively cost-effective with short time-to-market, and the design flow is simple.

• However, FPGAs cannot be fully optimized for the various requirements of different
applications, are less energy-efficient, and have lower performance than ASICs.

• On the contrary, ASICs need to be designed and produced for a specific application and
cannot be changed over time.

• The design flow is consequently more complex, and the production cost is higher, but the
resulting chip is highly optimized and energy-efficient.
SOURCE: Mao, W., Xiao, Z., Xu, P., Ren, H., Liu, D., Zhao, S., An, F. and Yu, H., 2020, September. Energy-Efficient Machine Learning Accelerator for Binary Neural Networks. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (pp. 77-82).
41
AI Accelerators Design Approaches
AI Hardware accelerator for DNNs
A general FPGA architecture
(implemented on ASIC or FPGA)

Processing Elements

SOURCE: Capra, M., Bussolino, B., Marchisio, A., Shafique, M., Masera, G. and Martina, M., 2020.
An updated survey of efficient hardware architectures for accelerating deep convolutional neural
networks. Future Internet, 12(7), p.113.

SOURCE: Skliarova, I. and Sklyarov, V., 2019. FPGA-based Hardware Accelerators. Springer International Publishing.

• Challenges in ASIC & FPGA accelerators include:

 a significant amount of storage, external memory bandwidth, and
computational resources on the order of billions of operations per second.
42
AI Accelerators Design Approaches
Architecture of the FPGA-based accelerator

A general strategy for the design, implementation


& test of FPGA hardware accelerators

SOURCE: Skliarova, I. and Sklyarov, V., 2019. FPGA-based Hardware Accelerators. Springer International Publishing.

SOURCE: Mao, W., Xiao, Z., Xu, P., Ren, H., Liu, D., Zhao, S., An, F. and Yu, H., 2020, September. Energy-Efficient Machine Learning
Accelerator for Binary Neural Networks. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (pp. 77-82).
43
AI Accelerators Design Approaches

The ASIC-based Accelerator Architecture

SOURCE: Mao, W., Xiao, Z., Xu, P., Ren, H., Liu, D., Zhao, S., An, F. and Yu, H., 2020, September. Energy-Efficient Machine Learning Accelerator for Binary Neural Networks. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (pp. 77-82).
44
AI Accelerators Design Approaches
Architecture of the Google Tensor Processing Units (TPUs)

TPU Pods = Clusters of TPU

HBM = High-bandwidth Memory


MXU = Matrix Unit

TPU v2:
• 8 GB of HBM for each TPU core
• One MXU for each TPU core
• Up to 512 total TPU cores and 4 TB of total memory in a TPU Pod

TPU v3:
• 16 GB of HBM for each TPU core
• Two MXUs for each TPU core
• Up to 2048 total TPU cores and 32 TB of total memory in a TPU Pod
SOURCE: https://cloud.google.com/tpu/docs/system-architecture#device.
45
AI Accelerators Design Approaches

Edge Acceleration Platform Architecture

SOURCE: Karras, K., Pallis, E., Mastorakis, G. et al. A Hardware Acceleration Platform for AI-Based Inference at the Edge.
Circuits Syst Signal Process 39, 1059–1070 (2020). https://doi.org/10.1007/s00034-019-01226-7
46
AI Accelerators Design Approaches

Cloud Edge and Mobile Based Hardware Accelerators


47
AI Accelerators Design Approaches

Cloud and Edge Based AI Computing


48
AI Accelerators Design Approaches

Cloud and Edge Based AI Computing

• Nvidia (GPU)
• Google(TPU)
• Microsoft (BrainWave)
• Amazon (Inferentia)
• Facebook
• Alibaba, Baidu
49
AI Accelerators Design Approaches

Mobile/Edge based AI Inference


50
AI Accelerators Design Approaches

Mobile/Edge DNN Applications


51
AI Accelerators Design Approaches

Cloud vs Edge Summary


52
AI Accelerators Design Approaches

Multiply and Accumulate (MAC) Architecture

Conventional Systolic-Array Neural Computing Unit

SOURCE: Cho, K., Lee, I., Lim, H. and Kang, S., 2020. Efficient systolic-array redundancy architecture for offline/online repair. Electronics, 9(2), p.338.
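For intuition, a minimal, purely functional sketch of what a systolic MAC array computes (not a cycle-accurate model of any particular design): every output element is built up as a running partial sum, the way a chain of processing elements would accumulate it while weights and activations stream through.

def systolic_matmul(A, B):
    """Functional model of the result produced by a systolic MAC array:
    C[i][j] is accumulated step by step, one MAC per PE per cycle."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0                       # partial sum carried between PEs
            for t in range(k):
                acc += A[i][t] * B[t][j]  # one multiply-accumulate step
            C[i][j] = acc
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]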
53
AI Accelerators Design Approaches
Systolic Array-based DNN Accelerator

SOURCE: Zhang, J., Rangineni, K., Ghodsi, Z. and Garg, S., 2018, June. Thundervolt: enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators. In Proceedings of the 55th Annual Design Automation Conference (pp. 1-6).
54
AI Accelerators Design Approaches

Tensor Processing Unit (TPU) Block Diagram

The TPU includes the following computational


resources:

• Matrix Multiplier Unit (MXU): 65,536 8-bit


multiply-and-add units for matrix operations.
• Unified Buffer (UB): 24 MB of SRAM that
works as registers.
• Activation Unit (AU): Hardwired activation
functions.

SOURCE: Chen, Y., Xie, Y., Song, L., Chen, F. and Tang, T., 2020. A survey of accelerator architectures for deep neural networks. Engineering, 6(3), pp.264-274.
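Tying this to the TOPS metric from earlier, a hedged back-of-the-envelope sketch: 65,536 8-bit MAC units correspond to a 256×256 systolic MXU; the 700 MHz clock used below is the figure commonly quoted for the first-generation TPU and should be treated as an assumption here.

# 65,536 8-bit MAC units (a 256x256 systolic MXU), assumed 700 MHz clock
mac_units = 256 * 256
clock_hz = 700e6                            # assumed first-generation TPU clock
peak_tops = mac_units * clock_hz * 2 / 1e12 # each MAC = 2 ops (multiply + add)
print(peak_tops)                            # ~91.8 TOPS at 8-bit precision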
55
AI Accelerators Design Approaches
Von Neumann Bottleneck for AI / Increasing Memory Bandwidth

• The von Neumann architecture serially fetches data from storage.
• AI applications need to access a tremendous amount of data.
• How can we increase the bandwidth between processor and memory?
SOURCE: http://ictconference.kr/2020ict/sub/pdf/006.pdf.
56
AI Accelerators Design Approaches

Near Memory Processing

High Bandwidth Memory (HBM)

3D Stacked Memory

SOURCE: http://ictconference.kr/2020ict/sub/pdf/006.pdf.
57
AI Accelerators Design Approaches
Advantage of High Bandwidth Memory

SOURCE: http://ictconference.kr/2020ict/sub/pdf/006.pdf.
58
AI Accelerators Design Approaches

Towards into Memory


Emerging Non-Volatile Memories

SOURCE: White Paper on AI Chip Technologies (2018).

SOURCE: http://ictconference.kr/2020ict/sub/pdf/006.pdf.
59
AI Accelerators Design Approaches
Processing-In-Memory (PIM)

Von Neumann:
• The von Neumann architecture serially fetches data from storage.
• AI applications need to access a tremendous amount of data.

Non-von Neumann (PIM):
• Converged logic and memory (high bandwidth).
• Suitable for data-intensive workloads.
• Little data movement (energy efficient).

SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
60
AI Accelerators Design Approaches
Processing-In-Memory (PIM)

• Processing-in-memory accelerators offer many potential benefits including:


• Reduced data movement of weights
• Higher memory bandwidth by reading multiple weights in parallel
• Higher throughput by performing multiple computations in parallel
• Lower input activation delivery cost due to increased density of compute
• However, there are several key design challenges and decisions that need to be considered in
practice.
• Analog processing is typically required to bring the computation into the array of storage
elements or into its peripheral circuits.
• Therefore, major challenges for processing in memory (PIM) are its sensitivity to circuit and device
non-idealities (i.e., nonlinearity and process, voltage and temperature variations).

SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
61
AI Accelerators Design Approaches

Dataflow for PIM Accelerators

[Figure] Conventional processing vs. processing in memory (word line / bit line organization)

SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
62
AI Accelerators Design Approaches
[Figures] MAC operations in a resistive NVM device and in a floating-gate NVM device, with the I-V curve of each device.

NVM = non-volatile memory; LRS = low resistive state (RON); HRS = high resistive state (ROFF)
SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
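To make the in-array MAC concrete: in a resistive crossbar, each weight is stored as a conductance G, inputs are applied as word-line voltages V, and each bit line sums currents I = Σ G·V (Ohm's law plus Kirchhoff's current law). A minimal idealized sketch, ignoring the device non-idealities discussed above; all values are illustrative:

# Idealized resistive-crossbar MAC: weights as conductances (siemens),
# inputs as word-line voltages (volts), outputs as bit-line currents (amps).
conductances = [            # rows = word lines, columns = bit lines
    [1e-6, 2e-6],
    [3e-6, 4e-6],
]
voltages = [0.2, 0.1]       # one input voltage per word line

bitline_currents = [
    sum(voltages[row] * conductances[row][col] for row in range(len(voltages)))
    for col in range(len(conductances[0]))
]
print(bitline_currents)     # [5e-07, 8e-07] A: an analog dot product per bit line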
63
AI Accelerators Design Approaches
Neural Processing Unit (NPU)
Single PE Eight-PE NPU

SOURCE: Chen, Y., Xie, Y., Song, L., Chen, F. and Tang, T., 2020. A survey of accelerator architectures for deep neural networks. Engineering, 6(3), pp.264-274.
64
AI Accelerators Design Approaches
Neuromorphic Chip

• Perceptron based: no non-linear functions; binary output.
• Non-linear activation functions: continuous output; functional modeling of our brain; working real-life applications; we are here (FF, CNN, RNN, …).
• Spiking neuron: closely models a biological neuron's activity; incorporates the concept of time (integrate and fire, sketched below); computationally expensive; difficult to train.
SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
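To illustrate "integrate and fire", a minimal leaky integrate-and-fire neuron sketch; all constants (leak factor, threshold, input current) are illustrative assumptions:

def leaky_integrate_and_fire(input_current, steps=20, leak=0.9,
                             threshold=1.0, reset=0.0):
    """Toy LIF neuron: the membrane potential leaks each step, integrates the
    input current, and emits a spike (1) when it crosses the threshold."""
    v, spikes = 0.0, []
    for _ in range(steps):
        v = leak * v + input_current   # leak + integrate
        if v >= threshold:             # fire and reset
            spikes.append(1)
            v = reset
        else:
            spikes.append(0)
    return spikes

print(leaky_integrate_and_fire(0.3))   # spike train over 20 time steps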
65
AI Accelerators Design Approaches
Neuromorphic Chip

Address-Event-Representation

SOURCE: Schuller, I.K., Stevens, R., Pino, R. and Pechan, M., 2015. Neuromorphic computing–from materials research to systems architecture roundtable. USDOE Office of Science (SC)(United States).
66
AI Accelerators Design Approaches
Neuromorphic Chip

SOURCE: Schuller, I.K., Stevens, R., Pino, R. and Pechan, M., 2015. Neuromorphic computing–from materials research to systems architecture roundtable. USDOE Office of Science (SC)(United States).
67
AI Accelerators Design Approaches

Neuromorphic Chip with Emerging Device

SOURCE: M. Jerry., et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training.", IEEE IEDM 2017.
68
AI Accelerators Design Approaches

Neuromorphic Chip with Emerging Non-volatile RAM (ReRAM Memristor)

(b) Scanning Electron Micrograph (c) Transmission Electron Microscopy


of a single 1T1R cell. image of the drift memristor

Schematic illustration of a cross point diffusive memristor.

(a) Optical micrograph of the integrated memristive (d) Scanning Electron Micrograph of a (e) Transmission Electron Microscopy
neural network. single diffusive memristor junction. image of the diffused memristor.

SOURCE: M. Jerry., et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training.", IEEE IEDM 2017.
69
AI Accelerators Design Approaches
Domain-specific hardware accelerators
• Accelerators have been designed for various tasks such as: Graphics, Deep learning, Simulation,
Bioinformatics, Image processing, etc.
• A domain-specific accelerator is specialized for a particular domain of applications.

Accelerators exploit four main techniques for performance and efficiency gains:

Data specialization: Specialized operations on domain-specific data types can do in one cycle what
may take tens of cycles on a conventional computer.
Parallelism: High degrees of parallelism, often exploited at several levels, provide gains in performance.
Local and optimized memory: By storing key data structures in many small, local memories, very high
memory bandwidth can be achieved with low cost and energy.
Reduced overhead: Specializing hardware eliminates the overhead of program interpretation.

SOURCE: Dally, W.J., Turakhia, Y. and Han, S., 2020. Domain-specific hardware accelerators. Communications of the ACM, 63(7), pp.48-57.
70
AI Accelerators Design Approaches

Comparison of computation efficiency (in Tasks/s-Watt) for CPU, FPGA, GPU, and ASIC
for Deep learning and Genomics domains

SOURCE: Dally, W.J., Turakhia, Y. and Han, S., 2020. Domain-specific hardware accelerators. Communications of the ACM, 63(7), pp.48-57.
71
AI Accelerators Design Approaches

AI Emulators

• AI companies are looking to develop their own silicon because:
• They face increasing demands for performance.
• Market fragmentation demands application-specific algorithm-to-hardware solutions.
• They are under pressure to reduce power consumption even as the AI processing load increases.
• AI emulation is playing an important role in enabling this shift into silicon.
• AI-based emulators enable enhanced simulation speedups and accuracy.

[Figure] AI frameworks / Benchmarks to run in emulation

SOURCE: https://www.techdesignforums.com/practice/technique/emulation-for-ai-part-one/.
72
AI Accelerators Design Approaches

A block diagram for a four-CPU board (Wave Computing)


• The DPUs are interconnected directly
with each other over a fabric (used for
signaling "Fire" and "Done") and
through dual-ported Hybrid Memory
Cubes (HMC).
• The HMCs act both as fast memory and as
shared data buffers between the DPUs.
• This allows shared double buffering to
improve scalability by keeping the
critical data close to the processors.
• The bisection bandwidth of this
approach is impressive, which supports
[Wave's] scale-up and scale-out thesis,
delivering up to 7.25 TB/s.
• Wave shifted from an FPGA-based
development strategy to emulation
because of the scalability required by
that silicon size.

Hybrid Memory Cubes (HMC).
SOURCE: https://www.techdesignforums.com/practice/technique/emulation-for-ai-part-one/.
73
AI Accelerators Design Approaches

Intel OpenVINO

OpenVINO™ toolkit:
• Enables CNN-based deep learning inference on the edge.
• Supports heterogeneous execution across Intel® hardware (CPU, Integrated Graphics, Neural Compute Stick, and Vision Accelerator Design with Intel® Movidius™ Vision Processing Units (VPUs)).
• It speeds time-to-market via an easy-to-use library of computer vision functions and pre-optimized kernels.
• It includes optimized calls for computer vision standards, including OpenCV* and OpenCL™.

SOURCE: https://docs.openvinotoolkit.org/latest/index.html.
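A minimal sketch of running inference with the OpenVINO Python runtime, assuming the 2.x openvino.runtime API and a hypothetical IR model file "model.xml" with a 1x3x224x224 input; check the toolkit documentation for the exact API of the version you install:

import numpy as np
from openvino.runtime import Core  # OpenVINO 2.x Python API (assumed available)

core = Core()
model = core.read_model("model.xml")          # hypothetical IR model file
compiled = core.compile_model(model, "CPU")   # target device could also be GPU, etc.

input_tensor = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
result = compiled([input_tensor])             # run a single inference
output = result[compiled.output(0)]
print(output.shape)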
74
Some Leading AI Hardware Accelerators
• There is tremendous pressure on the dominant AI hardware companies to produce
efficient hardware because of the technical complexity of AI algorithms.

• According to Forbes, even Intel, with numerous world-class engineers and a strong
research background, needed 3 years of work to build the Nervana neural network
processor.

• Below is the list of some leading companies working on AI hardware:

 Advanced Micro Devices(AMD)  IBM


 Apple  Intel
 Arm (Advanced RISC Machine)  Samsung
 Baidu  Nvidia
 Google (Alphabet)  Texas instruments
 Graphcore  Qualcomm
 Huawei  Xilinx
75
Landscape for AI Hardware

SOURCE: https://basicmi.github.io/AI-Chip/
76
AI Algorithms
• NASNet-A-Large has the highest accuracy, at higher computational complexity.

• Top-1 accuracy is the conventional accuracy (i.e., the model answers with the highest-probability class).

• An NVIDIA Titan X Pascal GPU is used as the workstation.

• For computational complexity of less than 5 G-FLOPs, SE-ResNeXt-50 (32 × 4d) has the highest accuracy.

The ball size corresponds to the model complexity.

• SENet-154 needs ~3 times more operations than SE-ResNeXt-101, while having almost the same accuracy.

• VGG-13 has a much higher model complexity than ResNet-18 (with almost the same accuracy).

Bianco, S., Cadene, R., Celona, L. and Napoletano, P., 2018. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6, pp.64270-64277.
77
AI Stack
78
State-of-The-Art AI Accelerators

AWS Inferentia

• Amazon has its own solutions for both training (AWS


Trainium) and inference (AWS Inferentia).
• Each AWS Inferentia chip contains four NeuronCores.
• Each NeuronCore implements a high-performance systolic
array matrix multiply engine.
• NeuronCores are also equipped with a large on-chip
SOURCE: https://aws.amazon.com/machine-learning/inferentia/. cache.
AWS Trainium • AWS Trainium is the second custom machine learning chip
designed by AWS.
• It is targeted for training models in the cloud.

SOURCE: https://aws.amazon.com/machine-learning/trainium/.
79
State-of-The-Art AI Accelerators
Cerebras Wafer Scale Engine (WSE)
• The Cerebras Systems (CS)-1 wafer is an
MIMD, distributed-memory machine
with a 2D-mesh interconnection fabric.
• The repeated element of the architecture is called a tile.
• The tile contains one processor core, its memory, and the router that it connects to.
• The routers link to the routers of the four neighboring tiles.

(Figure labels: Data Structure Register; Fused Multiply-Accumulate)
SOURCE: arXiv:2010.03660v1.

• The wafer contains a 7×12 array of 84 identical “die.” A die holds thousands of tiles.
• It has 18 Gigabytes of On-chip Memory, all accessible within a single clock cycle, and provides 9 PB/s
memory bandwidth.
• It is a huge monster having ~1.2 trillion transistors (TSMC 16nm process)
(for comparison, NVIDIA’s A100 GPU contains 54 billion transistors).
80
State-of-The-Art AI Accelerators
Intel Nervana Neural Network Processor-T (NNP-T)
Intel NNP-T Block Diagram
Intel NNP-T Matrix Processing Units (MPU)
Intel NNP-T Tensor Processor Diagram

Intel NNP-T Floating Point Dot Product Design


SerDes = Serializer/ Deserializer; ICL = Inter-chip Link;
PCIe = Peripheral Component Interconnect express;
HBM = High Bandwidth Memory; MC = Memory Controller

• Intel's Nervana NNP-T is a standalone PCIe-based accelerator
for deep learning and artificial intelligence training acceleration.
• It is fabricated on TSMC's 16nm process; the chip utilizes a
single large 680 mm² die with over 27 billion transistors and
typical workload power ranging from 150-250 W.
SOURCE: B. Hickmann, J. Chen, M. Rotzin, A. Yang, M. Urbanski and S. Avancha, "Intel Nervana Neural Network Processor-T (NNP-T) Fused Floating Point Many-
Term Dot Product," 2020 IEEE 27th Symposium on Computer Arithmetic (ARITH), Portland, OR, USA, 2020, pp. 133-136, doi: 10.1109/ARITH48897.2020.00029.
81
State-of-The-Art AI Accelerators
GOYA Processor High-level Architecture

• Goya inference processor is based on the scalable


architecture of Habana’s Tensor-Processing Core (TPC).
• TPC core natively supports the following data types:
FP32, BF16, INT32, INT16, INT8, UINT32, UINT16 and
UINT8.
• It includes a cluster of eight programmable cores.
• TPC is Habana’s proprietary core designed to support
deep learning workloads.
• It is a very long instruction word (VLIW) single
instruction multiple data (SIMD) vector processor with
Instruction-Set-Architecture and hardware tailored to
serve deep learning workloads efficiently.

SOURCE: https://habana.ai/wp-content/uploads/pdf/2020/Habana%20GOYA%20Inference%20Performance%20Whitepaper%20Nov'20.pdf.
82
State-of-The-Art AI Accelerators
Gaudi Processor High-level Architecture

• Habana provides products both for training


(Gaudi) and inference (Goya).
General Matrix Multiply

• Gaudi is based on the scalable architecture of the


Tensor Processor Core (TPC™). It uses a cluster of
eight TPC 2.0 cores.
• Gaudi is the first AI Processor which integrates
on-chip RDMA* over Converged Ethernet (RoCE
v2) engines.
• These engines play a critical role in the inter-
processor communication needed during the
training process.
• SynapseAI® is Habana’s home-grown compiler
and runtime. Direct Memory Access

• It is built for seamless integration with existing


frameworks, that both define a Neural Network for
execution and manage the execution Runtime.

*Remote Direct Memory Access (RDMA)

SOURCE: https://habana.ai/wp-content/uploads/pdf/2020/Habana%20GAUDI%20Training%20Whitepaper%20v1.2.pdf.
83
State-of-The-Art AI Accelerators
Graphcore Intelligence Processing Unit (IPU)
(Colossus MK2 GC200 IPU)

• GC200 contains 59.4B transistors and is


built using the very latest TSMC 7nm
process.
• Each MK2 IPU has 1472 IPU-cores, running
8832 independent parallel program
threads.
• Each IPU holds 900MB In-Processor-
Memory with 47.5 TB/s bandwidth.
• It delivers up to 250 TFLOPS of AI compute
at FP16.16 and FP16.SR (stochastic
rounding).

SOURCE: https://www.graphcore.ai/products/ipu.
84
State-of-The-Art AI Accelerators
NVIDIA GA102 GPU with 84 SMs

• GA102 is composed of
Graphics Processing Clusters
(GPCs), Texture Processing
Clusters (TPCs), Streaming
Multiprocessors (SMs), Raster
Operators (ROPS), and
memory controllers.

• The full GA102 GPU contains


seven GPCs, 42 TPCs, and 84
SMs.

SOURCE: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf.
85
State-of-The-Art AI Accelerators
Huawei Ascend
Atlas 300I Inference Card Ascend 910 AI Processor
(Model: 3000/3010)

• Huawei has its own solutions for both training (Ascend 910) and inference (Ascend 310).
• The Atlas 300I inference card uses the Ascend 310 AI processor to unlock superior AI inference performance.
• It delivers 22 TOPS@INT8 and 11 TFLOPS@FP16 with just 8 W of power consumption.
• Ascend 910 is a highly integrated SoC AI processor suitable for AI training.
• It delivers 320 TFLOPS@FP16 and 640 TOPS@INT8 of compute performance with just 310 W of max
power consumption.

SOURCE: https://e.huawei.com/en/products/cloud-computing-dc/atlas.
86
State-of-The-Art AI Accelerators

NVIDIA A100 Tensor Core GPU

(Multi-Instance GPU)

• Sparsity in deep learning exploits the fact that the importance of individual weights evolves during the learning process.
• By the end of network training, only a subset of weights have acquired a meaningful purpose in determining the learned output; the remaining weights are no longer needed.

SOURCE: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf.
87
State-of-The-Art AI Accelerators

Sparsity

Coarse-Grained Sparsity Fine-Grained Sparsity

• Fine-grained sparsity: It explores zeroing out individual weights distributed across the neural network (see the sketch below).
• Coarse-grained sparsity: It explores zeroing out entire sub-networks of a neural network.

SOURCE: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf.
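The Ampere whitepaper cited above describes a 2:4 fine-grained structured sparsity pattern (at most two non-zero values in every group of four weights) that the A100's sparse Tensor Cores can exploit. A minimal, illustrative sketch of pruning a weight vector to that pattern (magnitude-based selection here is a common heuristic, not necessarily the exact procedure used by any given tool):

import numpy as np

def prune_2_of_4(weights):
    """Keep the 2 largest-magnitude weights in each group of 4 and zero the rest,
    producing the 2:4 fine-grained structured sparsity pattern."""
    w = np.asarray(weights, dtype=np.float32).copy()
    for start in range(0, w.size - w.size % 4, 4):
        group = w[start:start + 4]             # view into w, modified in place
        drop = np.argsort(np.abs(group))[:2]   # indices of the 2 smallest magnitudes
        group[drop] = 0.0
    return w

print(prune_2_of_4([0.9, -0.1, 0.05, 0.7, 0.2, -0.8, 0.3, 0.01]))
# -> [ 0.9  0.   0.   0.7  0.  -0.8  0.3  0. ]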
88
Opportunities

12

SOURCE: McKinsey & Company


89
Opportunities
90
Summary

 AI Shifting to Cloud and Edge Platforms

• Edge inference & learning will become more important due to privacy concerns, real-time operation,
and power constraints.
• Federated learning will leverage the cloud's big-data advantage on the edge.

 Domain-specific Architecture
• Application-specific hardware architectures are the clever choice.
• Programmable platforms are the easier solution.

 Throughput/$
• AI accelerators should be cost-effective to remain on the market.

 Emerging Technologies
• Newer technologies such as neuromorphic computing, processing-in-memory, and quantum
computing are the future of AI accelerators.
91
