Google TPU

The document discusses the Google Tensor Processing Unit (TPU), a custom application-specific integrated circuit (ASIC) designed by Google for neural network machine learning. It provides an overview of TPU versions and architecture, including its systolic array design and performance improvements over CPUs and GPUs. The TPU was designed to reduce the total cost of ownership for deep neural network inference by 10x compared to other hardware.


Department of Computer Science

Paderborn University

The Google TPU

July 9, 2020

Presented by Md. Ashraf Uddin


Overview

❖ TPU history and goals
❖ Deep neural networks
❖ TPU versions and architecture
❖ TPU software stack
❖ TPU performance
References:
❖ N. P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. International Symposium on Computer Architecture (ISCA '17), Toronto, Canada, pages 1–12, 2017.
❖ N. P. Jouppi, C. Young, N. Patil, and D. Patterson. A domain-specific architecture for deep neural networks. Communications of the ACM, 61(9):50–59, 2018.
TPU History

❖ Early discussion started in 2006
❖ Production project started in 2013
❖ The TPU was designed, verified, built, and deployed in datacenters in just 15 months.

GOAL:

❖ Design a custom ASIC for the inference phase of neural networks
❖ Custom hardware to reduce the Total Cost of Ownership (TCO) of the DNN inference phase by 10X
  ➢ Must run existing apps developed for CPUs and GPUs
Deep Neural Networks

❖ More data, more processing speed
❖ Breakthroughs of "deep" NNs:
  ➢ Decreased the error in
    ■ speech recognition by 30% over other approaches
    ■ image recognition from 26% to 3.5%
  ➢ Beat the human champion at Go

❖ Artificial NNs imitate the functionality of the brain.
❖ An artificial neural network is divided into layers:
  ➢ input layer, hidden layers, output layer
❖ "Deep" NN refers to many layers
❖ Two phases of an NN:
  ➢ Training: learning
  ➢ Inference: prediction
Popular NN Architectures

❖ Multi-Layer Perceptrons (MLP): fully connected; the output of each layer feeds into every neuron of the next layer (see the sketch after this list)
❖ Convolutional Neural Networks (CNN): each neuron gets as input the spatially nearby outputs of the previous layer
❖ Recurrent Neural Networks (RNN): each neuron gets as input some outputs of the previous layer and its own previous state
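
To make the inference phase concrete, here is a minimal NumPy sketch of an MLP forward pass (the layer sizes and random weights are illustrative, not taken from the slides): inference reduces to a chain of matrix multiplies plus nonlinearities, which is exactly the workload the TPU's matrix unit targets.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_inference(x, layers):
    """Fully connected forward pass: the output of each layer feeds
    every neuron of the next layer."""
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = relu(x @ W + b)        # one matrix multiply per hidden layer
    return x @ W_out + b_out       # output logits

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((784, 256)), np.zeros(256)),   # input  -> hidden
          (rng.standard_normal((256, 256)), np.zeros(256)),   # hidden -> hidden
          (rng.standard_normal((256, 10)),  np.zeros(10))]    # hidden -> output
print(mlp_inference(rng.standard_normal(784), layers).shape)  # (10,)
```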
TPUv1 Architecture

❖ Matrix Unit: 65,536 (256 × 256) 8-bit multiply-accumulate units
❖ 700 MHz clock rate
❖ Peak: 92T operations/second (see the check below)
  ➢ 65,536 × 2 × 700M
❖ >25X as many MACs as a GPU
❖ 4 MiB of on-chip Accumulator memory
❖ 24 MiB of on-chip Unified Buffer (activation memory)
❖ 3.5X as much on-chip memory as a GPU
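
The peak figure is simply the MAC count times two operations (a multiply and an add) per cycle, times the clock rate; a quick check:

```python
macs = 256 * 256                  # 65,536 8-bit multiply-accumulate units
peak_ops = macs * 2 * 700e6       # 2 ops per MAC per cycle at 700 MHz
print(peak_ops / 1e12)            # ~91.75, i.e. the quoted ~92T operations/second
```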
Problem: energy and time spent on repeated SRAM/register accesses during matrix multiply

Solution: "systolic execution", which computes data on the fly in buffers by pipelining control and data (a cycle-level sketch follows the steps below)

When PE P(i,j) receives a and b, it performs the following steps:
1. Calculates a*b
2. Adds the result to the previous C(i,j) and stores the new C(i,j)
3. Sends a to P(i,j+1) unless j=4
4. Sends b to P(i+1,j) unless i=4 (for this 4×4 example array)
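
Below is a cycle-level Python sketch of this scheme, offered as my own illustration (0-based indices, zero-padding at the array edges, and an output-stationary n × n array rather than the slide's 1-based 4×4 wording). Each PE multiplies the a and b it currently holds, accumulates into its local C, and forwards a to the right and b downward, mirroring steps 1–4.

```python
import numpy as np

def systolic_matmul(A, B, n):
    """Cycle-level sketch of an n x n output-stationary systolic array.
    A values stream in from the left, B values from the top; each PE
    multiply-accumulates and forwards its operands to its neighbours."""
    a_reg = np.zeros((n, n))   # 'a' value held by each PE this cycle
    b_reg = np.zeros((n, n))   # 'b' value held by each PE this cycle
    C = np.zeros((n, n))       # per-PE accumulators (results stay in place)

    for t in range(3 * n - 2):             # enough cycles to drain the array
        new_a = np.zeros((n, n))
        new_b = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # 'a' arrives from the left neighbour, or from memory (skewed by row)
                if j == 0:
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0.0
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                # 'b' arrives from the neighbour above, or from memory (skewed by column)
                if i == 0:
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0.0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
                C[i][j] += new_a[i][j] * new_b[i][j]   # multiply-accumulate
        a_reg, b_reg = new_a, new_b
    return C

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(systolic_matmul(A, B, 4), A @ B)
```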
TPUv1 die area:
❖ Data buffers: 37%
❖ Compute units: 30%
❖ I/O: 10%
❖ Control: 2%

Performance/Watt:
❖ 80X the perf/Watt of a Haswell CPU
❖ 30X the perf/Watt of a K80 GPU


Incremental Performance/Watt
Could TPUv1 have been improved with more time?

❖ Simulated a bigger MXU, a faster clock, and faster memory
❖ TPUv1 DRAM:
  ➢ Two DDR3-2133 channels
    ■ 34 GB/s
❖ What if the DDR3 DRAM were replaced with GDDR5 DRAM, as in the GPU?
  ➢ 180 GB/s
❖ Result: 200X the perf/Watt of a Haswell CPU
❖ 70X the perf/Watt of a K80 GPU

Incremental Performance/Watt
❖ 1 large 2D multiplier vs. many smaller 1D units
  ➢ Matrix multiply benefits from 2D hardware
❖ 8-bit integers vs. 32-bit floating point (see the quantization sketch after this list)
  ➢ More efficient computation and memory use
❖ Systolic array
  ➢ Fewer register accesses (about half the energy)
❖ TPUv1 drops CPU/GPU features (caches, branch prediction)
  ➢ Saves area and energy
  ➢ Reuses the transistors for domain-specific on-chip memory
❖ Software is easier for 1 TPUv1 core vs. 13 GPU cores or 18 CPU cores
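
The 8-bit point can be illustrated with a small quantization sketch (a generic symmetric int8 scheme for illustration, not necessarily TPUv1's exact quantization): weights and activations are stored as int8, products are accumulated in 32-bit integers as in the TPU's MACs, and the result is rescaled to floating point.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of a float32 tensor to int8 (a sketch)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# float32 reference result
W = np.random.randn(256, 256).astype(np.float32)
x = np.random.randn(256).astype(np.float32)
y_fp32 = W @ x

# int8 weights/activations, 32-bit integer accumulation, then rescale to float
Wq, w_scale = quantize_int8(W)
xq, x_scale = quantize_int8(x)
acc = Wq.astype(np.int32) @ xq.astype(np.int32)
y_int8 = acc * (w_scale * x_scale)

print(np.max(np.abs(y_fp32 - y_int8)))   # small quantization error
```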


❖ 2014: training was the harder target, inference the easier one
❖ 2017: TPUv2 deployed in data centers

Training needs, compared with inference:
❖ More computation
❖ More memory
❖ More programmability
❖ Bigger numerics
❖ Harder parallelization
TPUv2 TensorCore

[Block diagram: Host Queues (over PCIe), Core Sequencer, Vector Unit (Vregs + Vmem), MXU, Transpose Permute Unit, HBM Memory (8 GiB), Interconnect Router]

❖ A 128×128 systolic MXU performs N×128×128 matrix multiplications (peak: 32K ops/clock)
❖ Transpose Reduction Permute Unit (TRP) operates on 128×128 matrices
❖ Vector Processing Unit (VPU): 32 2D vector registers (Vregs) + 2D vector memory Vmem (16 MiB)
❖ Core Sequencer: fetches instructions from the instruction memory (Imem)
❖ Inter-Core Interconnect (ICI): sends messages between TensorCores
❖ High Bandwidth Memory (HBM): 2 HBM stacks per TensorCore
  ➢ 32 64-bit busses (20X TPUv1)
TPUv3

❖ 16 GiB of HBM for each TPU core
❖ Two MXUs for each TPU core
❖ Up to 2048 total TPU cores and 32 TiB of total memory in a TPU Pod (see the quick check below)
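
These pod-level numbers follow from the per-chip figures in the comparison table that comes next; a quick arithmetic check:

```python
chips_per_pod = 1024                # max TPUv3 chips per supercomputer (table below)
cores_per_chip = 2
hbm_per_core_gib = 16

cores_per_pod = chips_per_pod * cores_per_chip            # 2048 TPU cores
pod_memory_tib = cores_per_pod * hbm_per_core_gib / 1024  # 32 TiB of HBM
print(cores_per_pod, pod_memory_tib)
```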
Feature                          TPUv1      TPUv2       TPUv3       Volta GPU
-------------------------------  ---------  ----------  ----------  ----------
Peak 16b TeraFLOPS / chip        92 (8b)    46          123         125
Peak 32b TeraFLOPS / chip        --         5           14          16
Network links × Gbits/s / chip   --         4 × 496     4 × 656     6 × 200
Max chips / supercomputer        --         256         1024        varies
Clock (MHz)                      700        700         940         1530
TDP (Watts) / chip               75         280         450         450
Die size (mm^2)                  <331       <611        <648        815
Chip technology                  28 nm      >12 nm      >12 nm      12 nm
Memory size (on-/off-chip)       28MB/8GB   37MB/16GB   37MB/32GB   26MB/32GB
Memory GB/s / chip               34         700         900         900
Cores / chip                     1          2           2           80
Chips / CPU host                 4          4           8           8 or 16
TPU Software Stack

[Pipeline: Neural Network Model (TPU Estimator) → TensorFlow Client (Google Compute Engine VM) → computational graph (gRPC) → TensorFlow Server + XLA just-in-time compiler (Host) → TPU binary → Cloud TPU]

❖ TensorFlow generates a computation graph
❖ Compiles the computation graph just in time
❖ And sends the program binary to TPU devices for execution

XLA Compiler
❖ Just-in-time compiler
❖ Whole-program analysis and execution
❖ No cache
❖ Generates binary code to be run on the Cloud TPU
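
As a usage illustration, here is a hedged sketch of how this stack is typically driven from TensorFlow 2.x on a Cloud TPU. The API names come from recent TF releases and have shifted across versions; tpu="local" assumes a TPU VM, otherwise a TPU name or gRPC address would be passed, and the model itself is just a placeholder.

```python
import tensorflow as tf

# Locate and initialize the TPU system (assumes a reachable TPU worker).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Build the model under the TPU strategy; TensorFlow traces a computation
# graph, XLA compiles it just in time, and the binary runs on the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
# model.fit(train_dataset, epochs=1)   # train_dataset would be a tf.data pipeline
```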
Model            mm^2   nm   MHz    TDP/chip   TOPS/s (8b)   TOPS/s (FP)   GB/s   Chips/Server
---------------  -----  ---  -----  ---------  ------------  ------------  -----  ------------
Haswell CPU      662    22   2,300  145 W      2.6           1.3           51     2
Nvidia K80 GPU   561    28   560    150 W      --            2.8           160    8
TPU              <331   28   700    75 W       92            --            34     4


❖ The TPU succeeded because of:
  ➢ the large (but not too large) matrix multiply unit;
  ➢ the substantial software-controlled on-chip memory;
  ➢ the ability to run whole inference models, reducing dependence on the host CPU;
  ➢ a single-threaded, deterministic execution model;
  ➢ the omission of general-purpose features;
  ➢ TensorFlow, which made it easy to port models to the TPU.
