2. On-device AI for Smartphones
• On-device AI for Smartphones: Hardware, Software, Model Optimization and Benchmarking
Artificial Intelligence
• AI Algorithms & Hardware Requirements
AI Hardware Accelerators
• Overview & Opportunities
State-of-The-Art AI Accelerators
• Recent AI Hardware Accelerators
Summary
Artificial Intelligence (AI)
AI (Training and Inference)
AI (Inference)
• Lowest latency
• Accelerate whole application
• Match the speed of AI innovation
Software & Hardware Options of an ML Inference System
(Figure: Visualization SOM)
AI Algorithms (Deep Learning)
Neural Networks
SOURCE: [1]. Choi, S., Sim, J., Kang, M., Choi, Y., Kim, H. and Kim, L.S., 2020. An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on Smart Devices. IEEE Journal of Solid-State Circuits, 55(10), pp.2691-2702.
[2]. Song, L., Qian, X., Li, H. and Chen, Y., 2017, February. Pipelayer: A pipelined reram-based accelerator for deep learning. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 541-552). IEEE.
AI Algorithms (Timeline for Computer Vision Models)
(Figure: timeline of computer vision models from 1998 to 2017, including Inception-v1 (2015) and Inception-v4 (2017))
• Top-1 accuracy is the conventional version of accuracy: it considers only the single class with the highest probability. According to top-1 accuracy, the prediction (cherry: 0.35) is wrong.
• Top-5 accuracy uses the top 5 classes instead of 1 class. According to top-5 accuracy, the prediction is correct, since blueberry is still in the top-5 results.
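A minimal Python sketch of how top-1 and top-5 accuracy are decided for a single prediction (the class list and probabilities are made-up values mirroring the cherry/blueberry example above):

```python
import numpy as np

def top_k_correct(probs: np.ndarray, true_idx: int, k: int) -> bool:
    """True if the ground-truth class is among the k highest-probability classes."""
    top_k = np.argsort(probs)[::-1][:k]
    return true_idx in top_k

classes = ["cherry", "strawberry", "blueberry", "raspberry", "currant", "grape"]
probs = np.array([0.35, 0.25, 0.20, 0.10, 0.06, 0.04])   # model output
true_idx = classes.index("blueberry")                     # ground truth

print(top_k_correct(probs, true_idx, k=1))  # False: top-1 prediction is cherry
print(top_k_correct(probs, true_idx, k=5))  # True: blueberry is in the top 5
```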
AI Accelerators (Hardware Accelerators)
• Hardware acceleration is the use of computer hardware specially made to perform some functions more efficiently than is possible in software running on a general-purpose processing unit alone.
• It combines the flexibility of general-purpose processors, such as CPUs, with the efficiency of specialized hardware, such as GPUs and ASICs, to increase efficiency by orders of magnitude. For example, visualization workloads may be offloaded onto a graphics card to enable faster, higher-quality playback of videos and games.
• General-purpose (GP) hardware (HW) uses arithmetic blocks for basic calculations executed serially, which is not suitable for high-performance deep learning techniques.
• High-bandwidth memory: Specialized AI hardware is estimated to provide 4–5 times more memory bandwidth than traditional chips. This bandwidth is necessary for parallel processing: AI applications require significantly more bandwidth between processors for efficient performance.
AI Accelerators Assessment Parameters
Processing Speed | Power Requirements | Device Size | Total Cost
Processing Speed: AI hardware enables faster training and inference using neural networks.
• Faster training enables ML experts to try different DL approaches.
• It also lets them optimize the structure of their neural networks (hyperparameter optimization).
• Faster inference (e.g., predictions) is critical for applications like autonomous driving.
Device Size: Device size is very important for IoT applications, mobile phones, and other small devices.
Total Cost: The cost of the device is extremely crucial for any procurement decision.
AI Accelerators Assessment Parameters
Choosing an accelerator is challenging, as the chip needs to be supported by hardware and software that let developers build applications on it.
• Standalone platform: personal computers, on-board devices, mobile devices
• Server-based platform
• Cloud-based platform
Performance Metrics for AI Accelerators
Floating Point Operations Per Second (FLOPS, flops, or flop/s): a measure of computer performance in floating-point calculations.
• For numerical workloads, it is therefore a more accurate measure than instructions per second.
Tera (trillions of) Operations Per Second (TOPS): a measure of the maximum achievable throughput, not a measure of actual throughput.
• Most operations are Multiply-and-Accumulate (MAC) operations.
• Therefore, TOPS = (number of MAC units) × (frequency of MAC operations) × 2 (see the sketch below).
• Even TOPS alone is not enough information to judge real performance.
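As a quick illustration of the peak-throughput formula above, a small Python sketch (the MAC count and clock frequency are hypothetical example values, not a specific product's specification):

```python
def peak_tops(num_mac_units: int, mac_frequency_hz: float) -> float:
    """Peak throughput in TOPS; each MAC counts as 2 operations (multiply + add)."""
    return num_mac_units * mac_frequency_hz * 2 / 1e12

# Hypothetical accelerator: 4096 MAC units clocked at 1 GHz
print(peak_tops(4096, 1e9))  # 8.192 peak TOPS; actual throughput will be lower
```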
Performance Metrics for AI Accelerators
GPUs: Nvidia GeForce RTX 3070, RTX 3080, AMD Radeon RX 6900 XT, RX 6700 XT, etc.
Emerging Technologies: Processing in-memory (PIM), neuromorphic computing, quantum computing, AI wafer chips, analog memory-based technologies, etc.
AI Accelerators Design Approaches
• Central Processing Units (CPUs): These are the general-purpose processors mostly used in standalone personal computers (Intel Core, AMD Ryzen, etc.).
• Graphics Processing Units (GPUs): They were originally designed to accelerate graphics processing through parallel computing. The same approach is effective for training complex deep learning algorithms.
• Wafer Chips: To increase packaging density, an entire silicon wafer containing trillions of transistors is used as a single chip (e.g., Cerebras). Its ~72 square inch silicon wafer holds ~1.2 trillion transistors and can therefore support ~400 thousand processing cores.
• Neural Processing Units (NPUs): The architecture provides parallel computing and pooling to increase overall performance. It is specialized for Convolutional Neural Network (CNN) applications. The architecture can be reconfigured to switch between models in real time, which allows creating optimized hardware depending on the needs of the application.
AI Accelerators Design Approaches
• Neuromorphic Architectures: These attempt to mimic brain cells using novel approaches from adjacent fields such as materials science and neuroscience. Such chips can have an advantage in speed and efficiency when training neural networks.
• Analog Memory-based Technologies: Digital systems built on 0's and 1's dominate today's computing world. Analog techniques, however, use signals that are continuously variable rather than restricted to discrete levels. An IBM research team demonstrated that large arrays of analog memory devices achieve accuracy similar to GPUs in deep learning applications.
SIMD/Vector Instructions
SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
AI Accelerators Design Approaches
• A multicore processor is a single integrated circuit with two or more separate processing
units, called cores, each of which reads and executes program instructions.
• The instructions are ordinary CPU instructions (such as add, move data, and branch) but the
single processor can run instructions on separate cores at the same time, increasing overall
speed for programs that support multithreading or other parallel computing techniques.
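A minimal sketch of this idea in Python (the workload and core count are arbitrary examples): independent chunks of a CPU-bound computation are mapped onto separate cores with a process pool, so the chunks execute at the same time.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(bounds):
    """CPU-bound work: sum of squares over a half-open range."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, workers = 10_000_000, 4
    step = n // workers
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    # Each chunk runs on a separate core; partial results are combined at the end.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```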
AI Accelerators Design Approaches
SOURCE: https://steemit.com/gridcoin/@dutch/hardware-and-project-selection-part-1-cpu-vs-gpu
SOURCE: https://www.nextplatform.com/2019/07/10/a-decade-of-accelerated-computing-augurs-well-for-gpus/
SOURCE: Aamodt, T.M., Fung, W.W.L. and Rogers, T.G., 2018. General-purpose graphics
processor architectures. Synthesis Lectures on Computer Architecture, 13(2), pp.1-140.
AI Accelerators Design Approaches
SOURCE: https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
AI Accelerators Design Approaches
• GPUs are the current workhorses for DNNs’ inference and especially training.
• Limitations: bandwidth, latency, and branch prediction. For chaotic control flow, a GPU can become even slower than a CPU.
• FPGAs are relatively cost-effective, with a short time-to-market, and the design flow is simple.
• However, FPGAs cannot be fully optimized for the varied requirements of different applications; they are less energy-efficient and deliver lower performance than ASICs.
• On the contrary, ASICs must be designed and produced for a specific application and cannot be changed over time.
• The design flow is consequently more complex and the production cost higher, but the resulting chip is highly optimized and energy-efficient.
SOURCE: Mao, W., Xiao, Z., Xu, P., Ren, H., Liu, D., Zhao, S., An, F. and Yu, H., 2020, September. Energy-Efficient Machine Learning Accelerator for Binary Neural Networks. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (pp. 77-82).
AI Accelerators Design Approaches
(Figures: an AI hardware accelerator for DNNs, implemented on ASIC or FPGA, with its processing elements; a general FPGA architecture)
SOURCE: Capra, M., Bussolino, B., Marchisio, A., Shafique, M., Masera, G. and Martina, M., 2020.
An updated survey of efficient hardware architectures for accelerating deep convolutional neural
networks. Future Internet, 12(7), p.113.
SOURCE: Skliarova, I. and Sklyarov, V., 2019. FPGA-based Hardware Accelerators. Springer International Publishing.
SOURCE: Mao, W., Xiao, Z., Xu, P., Ren, H., Liu, D., Zhao, S., An, F. and Yu, H., 2020, September. Energy-Efficient Machine Learning
Accelerator for Binary Neural Networks. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (pp. 77-82).
AI Accelerators Design Approaches
SOURCE: Mao, W., Xiao, Z., Xu, P., Ren, H., Liu, D., Zhao, S., An, F. and Yu, H., 2020, September. Energy-Efficient Machine Learning Accelerator for Binary Neural Networks. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (pp. 77-82).
AI Accelerators Design Approaches
Architecture of the Google Tensor Processing Units (TPUs)
SOURCE: Karras, K., Pallis, E., Mastorakis, G. et al. A Hardware Acceleration Platform for AI-Based Inference at the Edge.
Circuits Syst Signal Process 39, 1059–1070 (2020). https://doi.org/10.1007/s00034-019-01226-7
AI Accelerators Design Approaches
• Nvidia (GPU)
• Google (TPU)
• Microsoft (BrainWave)
• Amazon (Inferentia)
• Facebook
• Alibaba, Baidu
AI Accelerators Design Approaches
SOURCE: Cho, K., Lee, I., Lim, H. and Kang, S., 2020. Efficient systolic-array redundancy architecture for offline/online repair. Electronics, 9(2), p.338.
AI Accelerators Design Approaches
Systolic Array-based DNN Accelerator
SOURCE: Zhang, J., Rangineni, K., Ghodsi, Z. and Garg, S., 2018, June. Thundervolt: enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators. In Proceedings of the 55th Annual Design Automation Conference (pp. 1-6).
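To make the systolic dataflow concrete, here is a minimal cycle-level Python sketch (illustrative only, not the design from the cited paper) of an output-stationary systolic array: operands stream in skewed from the left and top, and each processing element accumulates one MAC per cycle.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Output-stationary systolic array: PE (i, j) accumulates A[i, :] . B[:, j].
    The skew t - i - j models operands arriving one hop (cycle) later per PE."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for t in range(n + m + k):          # enough cycles to drain the wavefront
        for i in range(n):
            for j in range(m):
                s = t - i - j           # operand pair reaching PE (i, j) now
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]
    return C

A, B = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```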
AI Accelerators Design Approaches
SOURCE: Chen, Y., Xie, Y., Song, L., Chen, F. and Tang, T., 2020. A survey of accelerator architectures for deep neural networks. Engineering, 6(3), pp.264-274.
AI Accelerators Design Approaches
(Figures: the von Neumann bottleneck for AI; increasing memory bandwidth via 3D stacked memory)
SOURCE: http://ictconference.kr/2020ict/sub/pdf/006.pdf.
AI Accelerators Design Approaches
Advantage of High Bandwidth Memory
SOURCE: http://ictconference.kr/2020ict/sub/pdf/006.pdf.
AI Accelerators Design Approaches
SOURCE: http://ictconference.kr/2020ict/sub/pdf/006.pdf.
AI Accelerators Design Approaches
Processing-In-Memory (PIM)
SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
AI Accelerators Design Approaches
Processing-In-Memory (PIM)
SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
AI Accelerators Design Approaches
(Figure: memory array organization with word lines and bit lines)
SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
AI Accelerators Design Approaches
MAC Operations in Resistive NVM Device MAC Operations in Floating Gate NVM Device
SOURCE: Sze, V., Chen, Y.H., Yang, T.J. and Emer, J.S., 2020. Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture, 15(2), pp.1-341.
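The analog MAC idea behind these figures can be sketched numerically: weights are stored as device conductances, inputs are applied as word-line voltages, and each bit line sums the resulting cell currents (Ohm's law plus Kirchhoff's current law). A minimal Python sketch with made-up conductance and voltage values:

```python
import numpy as np

# Weights stored as cell conductances G (siemens); one column per bit line.
G = np.array([[1.0e-6, 2.0e-6],
              [3.0e-6, 0.5e-6],
              [2.5e-6, 1.5e-6]])

# Inputs applied as word-line voltages V (volts), one per row.
V = np.array([0.2, 0.1, 0.3])

# Each cell contributes I = G * V; the bit line sums its column's currents,
# so the bit-line current vector is the analog MAC result V^T G.
I_bitline = V @ G
print(I_bitline)  # one multiply-accumulate result per bit line, in amperes
```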
AI Accelerators Design Approaches
Neural Processing Unit (NPU)
(Figures: single PE; eight-PE NPU)
SOURCE: Chen, Y., Xie, Y., Song, L., Chen, F. and Tang, T., 2020. A survey of accelerator architectures for deep neural networks. Engineering, 6(3), pp.264-274.
AI Accelerators Design Approaches
Neuromorphic Chip
Address-Event-Representation
SOURCE: Schuller, I.K., Stevens, R., Pino, R. and Pechan, M., 2015. Neuromorphic computing–from materials research to systems architecture roundtable. USDOE Office of Science (SC)(United States).
AI Accelerators Design Approaches
Neuromorphic Chip
SOURCE: Schuller, I.K., Stevens, R., Pino, R. and Pechan, M., 2015. Neuromorphic computing–from materials research to systems architecture roundtable. USDOE Office of Science (SC)(United States).
AI Accelerators Design Approaches
SOURCE: M. Jerry., et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training.", IEEE IEDM 2017.
AI Accelerators Design Approaches
(a) Optical micrograph of the integrated memristive neural network. (d) Scanning electron micrograph of a single diffusive memristor junction. (e) Transmission electron microscopy image of the diffused memristor.
SOURCE: M. Jerry., et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training.", IEEE IEDM 2017.
AI Accelerators Design Approaches
Domain-specific hardware accelerators
• Accelerators have been designed for various tasks, such as graphics, deep learning, simulation, bioinformatics, and image processing.
• A domain-specific accelerator is specialized for a particular domain of applications.
Accelerators exploit four main techniques for performance and efficiency gains:
Data specialization: Specialized operations on domain-specific data types can do in one cycle what may take tens of cycles on a conventional computer (see the sketch after this list).
Parallelism: High degrees of parallelism, often exploited at several levels, provide gains in performance.
Local and optimized memory: By storing key data structures in many small, local memories, very high memory bandwidth can be achieved at low cost and energy.
Reduced overhead: Specializing hardware eliminates the overhead of program interpretation.
SOURCE: Dally, W.J., Turakhia, Y. and Han, S., 2020. Domain-specific hardware accelerators. Communications of the ACM, 63(7), pp.48-57.
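A rough illustration of data specialization (my example, not from the cited article): many DL accelerators operate natively on int8, so activations and weights are quantized once, and the dot product then runs in cheap integer MACs while closely approximating the float result.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric linear quantization to int8; returns values and scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(64).astype(np.float32)
w = np.random.randn(64).astype(np.float32)
qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)

# int8 MACs accumulate in int32 on real hardware; rescale once at the end.
approx = np.dot(qx.astype(np.int32), qw.astype(np.int32)) * sx * sw
print(float(np.dot(x, w)), float(approx))  # the two values closely agree
```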
AI Accelerators Design Approaches
Comparison of computation efficiency (in Tasks/s-Watt) for CPU, FPGA, GPU, and ASIC
for Deep learning and Genomics domains
SOURCE: Dally, W.J., Turakhia, Y. and Han, S., 2020. Domain-specific hardware accelerators. Communications of the ACM, 63(7), pp.48-57.
AI Accelerators Design Approaches
AI Emulators
OpenVINO™ toolkit:
• Enables CNN-based deep learning inference on the edge.
• Supports heterogeneous execution across Intel® hardware: CPU, integrated graphics, Neural Compute Stick, and Vision Accelerator Design with Intel® Movidius™ Vision Processing Units (VPUs).
• It speeds time-to-market via an easy-to-use library of computer vision functions and pre-optimized kernels.
• It includes optimized calls for computer vision standards, including OpenCV* and OpenCL™.
SOURCE: https://docs.openvinotoolkit.org/latest/index.html.
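A minimal usage sketch of the toolkit (hedged: API names follow the 2022-era openvino.runtime Python API and vary between releases; the model files and input shape are hypothetical):

```python
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")          # IR files from the Model Optimizer
compiled = core.compile_model(model, "CPU")   # or "GPU", "MYRIAD", ...

request = compiled.create_infer_request()
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
results = request.infer({0: dummy_input})     # inputs keyed by port index
print(next(iter(results.values())).shape)
```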
Some Leading AI Hardware Accelerators
• There is tremendous pressure on the dominant AI hardware companies to produce efficient hardware because of the technical complexity of AI algorithms.
• According to Forbes, even Intel, with numerous world-class engineers and a strong research background, needed 3 years of work to build the Nervana neural network processor.
SOURCE: https://basicmi.github.io/AI-Chip/
AI Algorithms
• NASNet-A-Large has the highest accuracy, but with higher computational complexity.
AWS Inferentia
SOURCE: https://aws.amazon.com/machine-learning/trainium/.
State-of-The-Art AI Accelerators
Cerebras Wafer Scale Engine (WSE)
• The Cerebras Systems (CS)-1 wafer is an MIMD, distributed-memory machine with a 2D-mesh interconnection fabric.
• The repeated element of the architecture is called a tile.
• A tile contains one processor core, its memory, and the router that it connects to; each core includes data structure registers and fused multiply-accumulate units.
• The routers link to the routers of the four neighboring tiles.
SOURCE: arXiv:2010.03660v1.
• The wafer contains a 7×12 array of 84 identical “die.” A die holds thousands of tiles.
• It has 18 gigabytes of on-chip memory, all accessible within a single clock cycle, and provides 9 PB/s of memory bandwidth.
• It is a huge monster, with ~1.2 trillion transistors (TSMC 16 nm process); for comparison, NVIDIA's A100 GPU contains 54 billion transistors.
State-of-The-Art AI Accelerators
Intel Nervana Neural Network Processor-T (NNP-T)
Intel NNP-T Block Diagram
Intel NNP-T Matrix Processing Units (MPU)
Intel NNP-T Tensor Processor Diagram
SOURCE: https://habana.ai/wp-content/uploads/pdf/2020/Habana%20GOYA%20Inference%20Performance%20Whitepaper%20Nov'20.pdf.
State-of-The-Art AI Accelerators
Gaudi Processor High-level Architecture
SOURCE: https://habana.ai/wp-content/uploads/pdf/2020/Habana%20GAUDI%20Training%20Whitepaper%20v1.2.pdf.
State-of-The-Art AI Accelerators
Graphcore Intelligence Processing Unit (IPU)
(Colossus MK2 GC200 IPU)
SOURCE: https://www.graphcore.ai/products/ipu.
State-of-The-Art AI Accelerators
NVIDIA GA102 GPU with 84 SMs
• GA102 is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPs), and memory controllers.
SOURCE: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf.
State-of-The-Art AI Accelerators
Huawei Ascend
Atlas 300I Inference Card Ascend 910 AI Processor
(Model: 3000/3010)
• Huawei has its own solutions for both training (Ascend 910) and inference (Ascend 310).
• The Atlas 300I inference card uses the Ascend 310 AI processor to unlock superior AI inference performance.
• It delivers 22 TOPS@INT8 and 11 TFLOPS@FP16 with just 8 W of power consumption.
• Ascend 910 is a highly integrated SoC AI processor suitable for AI training.
• It delivers 320 TFLOPS@FP16 and 640 TOPS@INT8 of compute performance with a max power consumption of just 310 W.
SOURCE: https://e.huawei.com/en/products/cloud-computing-dc/atlas.
State-of-The-Art AI Accelerators
(Multi-Instance GPU)
• Sparsity in deep learning reflects how the importance of individual weights evolves during the learning process and by the end of network training.
• Only a subset of weights has acquired a meaningful purpose in determining the learned output; the remaining weights are no longer needed.
SOURCE: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf.
State-of-The-Art AI Accelerators
Sparsity
• Fine-grained sparsity: zeroing out specific individual weights distributed across the neural network.
• Coarse-grained sparsity: zeroing out entire sub-networks of a neural network.
(A sketch of fine-grained pruning follows below.)
SOURCE: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf.
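One concrete instance of fine-grained sparsity on Ampere GPUs is the 2:4 pattern (two non-zero values in every group of four weights), which the sparse Tensor Cores can exploit. A minimal magnitude-pruning sketch of that pattern (illustrative, not NVIDIA's implementation):

```python
import numpy as np

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude weights in each group of 4, zero the rest."""
    groups = w.copy().reshape(-1, 4)             # groups of 4 consecutive weights
    for group in groups:                         # rows are views; edits stick
        drop = np.argsort(np.abs(group))[:2]     # indices of the 2 smallest
        group[drop] = 0.0
    return groups.reshape(w.shape)

weights = np.random.randn(16).astype(np.float32)
print(prune_2_to_4(weights))  # exactly two non-zeros per group of four
```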
Opportunities
Domain-specific Architecture
• Application-specific hardware architectures are the clever choice.
• Programmable platforms are the easier solution.
Throughput/$
• AI accelerators should be cost-effective to remain on the market.
Emerging Technologies
• Newer technologies such as neuromorphic computing, processing-in-memory, and quantum computing are the future of AI accelerators.