Google TPU

The document discusses the Google Tensor Processing Unit (TPU), a custom application-specific integrated circuit (ASIC) designed by Google for neural network machine learning. It provides an overview of TPU versions and architecture, including its systolic array design and performance improvements over CPUs and GPUs. The TPU was designed to reduce the total cost of ownership for deep neural network inference by 10x compared to other hardware.


Department of Computer Science

Paderborn University

The Google TPU

July 9, 2020

Presented by Md. Ashraf Uddin


Overview

❖ TPU history and goals
❖ Deep neural networks
❖ TPU versions and architecture
❖ TPU software stack
❖ TPU performance
References:
❖ N. P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. International Symposium on Computer Architecture (ISCA '17), Toronto, Canada, pages 1–12, 2017.
❖ N. P. Jouppi, C. Young, N. Patil, and D. Patterson. A domain-specific architecture for deep neural networks. Communications of the ACM, 61(9):50–59, 2018.
TPU History

❖ Early discussion started in 2006
❖ Production project started in 2013
❖ The TPU was designed, verified, built, and deployed in datacenters in just 15 months.

GOAL:

❖ Design a custom ASIC for the inference phase of neural networks
❖ Custom hardware to reduce the Total Cost of Ownership (TCO) of the DNN inference phase by 10X
  ➢ Must run existing apps developed for CPUs and GPUs
Deep Neural Networks

❖ More data, more processing speed
❖ Breakthroughs of "deep" NNs:
  ➢ Decreased the error in
    ■ speech recognition by 30% over other approaches
    ■ image recognition from 26% to 3.5%
  ➢ Beat the human champion at Go

❖ Artificial NNs imitate the functionality of the brain.
❖ An artificial neural network is divided into layers:
  ➢ input layer, hidden layers, output layer
❖ "Deep" NN refers to many layers
❖ Two phases of an NN:
  ➢ Training: learning
  ➢ Inference: prediction
Popular NN Architectures

❖ Multi-Layer Perceptrons (MLP): fully connected; the output of each layer feeds into every neuron of the next layer (see the sketch after this list)
❖ Convolutional Neural Networks (CNN): each neuron gets as input the spatially nearby outputs of the previous layer
❖ Recurrent Neural Networks (RNN): each neuron gets as input some outputs of the previous layer and its own previous state
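
To make the inference phase concrete, here is a minimal NumPy sketch of an MLP forward pass (the layer sizes and random weights are illustrative, not taken from the slides): inference reduces to a chain of matrix multiplies plus nonlinearities, which is exactly the workload the TPU's matrix unit targets.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_inference(x, layers):
    """Fully connected forward pass: the output of each layer feeds
    every neuron of the next layer."""
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = relu(x @ W + b)        # one matrix multiply per hidden layer
    return x @ W_out + b_out       # output logits

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((784, 256)), np.zeros(256)),   # input  -> hidden
          (rng.standard_normal((256, 256)), np.zeros(256)),   # hidden -> hidden
          (rng.standard_normal((256, 10)),  np.zeros(10))]    # hidden -> output
print(mlp_inference(rng.standard_normal(784), layers).shape)  # (10,)
```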
TPUv1 Architecture

❖ Matrix Unit: 65,536 (256 × 256) 8-bit multiply-accumulate units
❖ 700 MHz clock rate
❖ Peak: 92T operations/second (see the check below)
  ➢ 65,536 × 2 × 700M
❖ >25X as many MACs as a GPU
❖ 4 MiB of on-chip Accumulator memory
❖ 24 MiB of on-chip Unified Buffer (activation memory)
❖ 3.5X as much on-chip memory as a GPU
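
The peak figure is simply the MAC count times two operations (a multiply and an add) per cycle, times the clock rate; a quick check:

```python
macs = 256 * 256                  # 65,536 8-bit multiply-accumulate units
peak_ops = macs * 2 * 700e6       # 2 ops per MAC per cycle at 700 MHz
print(peak_ops / 1e12)            # ~91.75, i.e. the quoted ~92T operations/second
```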
Problem: energy and time spent on repeated SRAM/register accesses during matrix multiply

Solution: "systolic execution", which computes data on the fly in buffers by pipelining control and data (a cycle-level sketch follows the steps below)

When PE P(i,j) receives a and b, it performs the following steps:
1. Calculates a*b
2. Adds the result to the previous C(i,j) and stores the new C(i,j)
3. Sends a to P(i,j+1) unless j=4
4. Sends b to P(i+1,j) unless i=4 (for this 4×4 example array)
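
Below is a cycle-level Python sketch of this scheme, offered as my own illustration (0-based indices, zero-padding at the array edges, and an output-stationary n × n array rather than the slide's 1-based 4×4 wording). Each PE multiplies the a and b it currently holds, accumulates into its local C, and forwards a to the right and b downward, mirroring steps 1–4.

```python
import numpy as np

def systolic_matmul(A, B, n):
    """Cycle-level sketch of an n x n output-stationary systolic array.
    A values stream in from the left, B values from the top; each PE
    multiply-accumulates and forwards its operands to its neighbours."""
    a_reg = np.zeros((n, n))   # 'a' value held by each PE this cycle
    b_reg = np.zeros((n, n))   # 'b' value held by each PE this cycle
    C = np.zeros((n, n))       # per-PE accumulators (results stay in place)

    for t in range(3 * n - 2):             # enough cycles to drain the array
        new_a = np.zeros((n, n))
        new_b = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # 'a' arrives from the left neighbour, or from memory (skewed by row)
                if j == 0:
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0.0
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                # 'b' arrives from the neighbour above, or from memory (skewed by column)
                if i == 0:
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0.0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
                C[i][j] += new_a[i][j] * new_b[i][j]   # multiply-accumulate
        a_reg, b_reg = new_a, new_b
    return C

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(systolic_matmul(A, B, 4), A @ B)
```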
TPUv1 die area:
❖ Data buffers: 37%
❖ Compute units: 30%
❖ I/O: 10%
❖ Control: 2%

Performance/Watt:
❖ 80X the perf/Watt of a Haswell CPU
❖ 30X the perf/Watt of a K80 GPU


Incremental Performance/Watt
Could TPUv1 have been improved with more time?

❖ Simulated a bigger MXU, a faster clock, and faster memory
❖ TPUv1 DRAM:
  ➢ Two DDR3-2133 channels
    ■ 34 GB/s
❖ What if the DDR3 DRAM were replaced with GDDR5 DRAM, as in the GPU?
  ➢ 180 GB/s
❖ Result: 200X the perf/Watt of a Haswell CPU
❖ 70X the perf/Watt of a K80 GPU

Incremental Performance/Watt
❖ 1 large 2D multiplier vs. many smaller 1D units
  ➢ Matrix multiply benefits from 2D hardware
❖ 8-bit integers vs. 32-bit floating point (see the quantization sketch after this list)
  ➢ More efficient computation and memory use
❖ Systolic array
  ➢ Fewer register accesses (about half the energy)
❖ TPUv1 drops CPU/GPU features (caches, branch prediction)
  ➢ Saves area and energy
  ➢ Reuses the transistors for domain-specific on-chip memory
❖ Software is easier for 1 TPUv1 core vs. 13 GPU cores or 18 CPU cores
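
The 8-bit point can be illustrated with a small quantization sketch (a generic symmetric int8 scheme for illustration, not necessarily TPUv1's exact quantization): weights and activations are stored as int8, products are accumulated in 32-bit integers as in the TPU's MACs, and the result is rescaled to floating point.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of a float32 tensor to int8 (a sketch)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# float32 reference result
W = np.random.randn(256, 256).astype(np.float32)
x = np.random.randn(256).astype(np.float32)
y_fp32 = W @ x

# int8 weights/activations, 32-bit integer accumulation, then rescale to float
Wq, w_scale = quantize_int8(W)
xq, x_scale = quantize_int8(x)
acc = Wq.astype(np.int32) @ xq.astype(np.int32)
y_int8 = acc * (w_scale * x_scale)

print(np.max(np.abs(y_fp32 - y_int8)))   # small quantization error
```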


❖ 2014: training was the harder target, inference the easier one
❖ 2017: TPUv2 deployed in data centers

Training needs, compared with inference:
❖ More computation
❖ More memory
❖ More programmability
❖ Bigger numerics
❖ Harder parallelization
TPUv2 TensorCore

[Block diagram: Host Queues (over PCIe), Core Sequencer, Vector Unit (Vregs + Vmem), MXU, Transpose Permute Unit, HBM Memory (8 GiB), Interconnect Router]

❖ A 128×128 systolic MXU performs N×128×128 matrix multiplications (peak: 32K ops/clock)
❖ Transpose Reduction Permute Unit (TRP) operates on 128×128 matrices
❖ Vector Processing Unit (VPU): 32 2D vector registers (Vregs) + 2D vector memory Vmem (16 MiB)
❖ Core Sequencer: fetches instructions from the instruction memory (Imem)
❖ Inter-Core Interconnect (ICI): sends messages between TensorCores
❖ High Bandwidth Memory (HBM): 2 HBM stacks per TensorCore
  ➢ 32 64-bit busses (20X TPUv1)
TPUv3

❖ 16 GiB of HBM for each TPU core
❖ Two MXUs for each TPU core
❖ Up to 2048 total TPU cores and 32 TiB of total memory in a TPU Pod (see the quick check below)
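
These pod-level numbers follow from the per-chip figures in the comparison table that comes next; a quick arithmetic check:

```python
chips_per_pod = 1024                # max TPUv3 chips per supercomputer (table below)
cores_per_chip = 2
hbm_per_core_gib = 16

cores_per_pod = chips_per_pod * cores_per_chip            # 2048 TPU cores
pod_memory_tib = cores_per_pod * hbm_per_core_gib / 1024  # 32 TiB of HBM
print(cores_per_pod, pod_memory_tib)
```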
Feature                          TPUv1      TPUv2       TPUv3       Volta GPU
-------------------------------  ---------  ----------  ----------  ----------
Peak 16b TeraFLOPS / chip        92 (8b)    46          123         125
Peak 32b TeraFLOPS / chip        --         5           14          16
Network links × Gbits/s / chip   --         4 × 496     4 × 656     6 × 200
Max chips / supercomputer        --         256         1024        varies
Clock (MHz)                      700        700         940         1530
TDP (Watts) / chip               75         280         450         450
Die size (mm^2)                  <331       <611        <648        815
Chip technology                  28 nm      >12 nm      >12 nm      12 nm
Memory size (on-/off-chip)       28MB/8GB   37MB/16GB   37MB/32GB   26MB/32GB
Memory GB/s / chip               34         700         900         900
Cores / chip                     1          2           2           80
Chips / CPU host                 4          4           8           8 or 16
TPU Software Stack

[Pipeline: Neural Network Model (TPU Estimator) → TensorFlow Client (Google Compute Engine VM) → computational graph (gRPC) → TensorFlow Server + XLA just-in-time compiler (Host) → TPU binary → Cloud TPU]

❖ TensorFlow generates a computation graph
❖ Compiles the computation graph just in time
❖ And sends the program binary to TPU devices for execution

XLA Compiler
❖ Just-in-time compiler
❖ Whole-program analysis and execution
❖ No cache
❖ Generates binary code to be run on the Cloud TPU
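
As a usage illustration, here is a hedged sketch of how this stack is typically driven from TensorFlow 2.x on a Cloud TPU. The API names come from recent TF releases and have shifted across versions; tpu="local" assumes a TPU VM, otherwise a TPU name or gRPC address would be passed, and the model itself is just a placeholder.

```python
import tensorflow as tf

# Locate and initialize the TPU system (assumes a reachable TPU worker).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Build the model under the TPU strategy; TensorFlow traces a computation
# graph, XLA compiles it just in time, and the binary runs on the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
# model.fit(train_dataset, epochs=1)   # train_dataset would be a tf.data pipeline
```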
Model            mm^2   nm   MHz    TDP/chip   TOPS/s (8b)   TOPS/s (FP)   GB/s   Chips/Server
---------------  -----  ---  -----  ---------  ------------  ------------  -----  ------------
Haswell CPU      662    22   2,300  145 W      2.6           1.3           51     2
Nvidia K80 GPU   561    28   560    150 W      --            2.8           160    8
TPU              <331   28   700    75 W       92            --            34     4


❖ The TPU succeeded because of:
  ➢ the large (but not too large) matrix multiply unit;
  ➢ the substantial software-controlled on-chip memory;
  ➢ the ability to run whole inference models, reducing dependence on the host CPU;
  ➢ a single-threaded, deterministic execution model;
  ➢ the omission of general-purpose features;
  ➢ TensorFlow, which made it easy to port models to the TPU.
