Google TPU
Paderborn University
July 9, 2020
❖ Norman P. Jouppi, Cliff Young, Nishant Patil, and David Patterson. A domain-specific
architecture for deep neural networks. Communications of the ACM, 61(9):50–59, 2018
❖ Early discussion started in 2006
❖ Production project started in 2013
❖ TPU was designed, verified, built, and deployed in datacenters in just 15 months.
Incremental Performance/Watt
[Figure: TPUv1 performance/Watt relative to contemporary CPU and GPU, against the design goal]
Could TPUv1 have been improved with more time?
❖ TPU DRAM
➢ Two DDR3-2133 channels
■ 34 GB/s
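Sanity check on that figure: a DDR3 channel is 64 bits (8 bytes) wide, so two DDR3-2133 channels deliver 2 × 2133 MT/s × 8 B ≈ 34 GB/s.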
❖ One large 2D multiplier vs. many smaller 1D units
➢ Matrix multiply benefits from 2D hardware
❖ Systolic array (sketched below)
➢ Fewer register accesses (≈½ the energy)
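To make the register-access saving concrete, here is a minimal Python sketch (an illustration, not Google's implementation) of a weight-stationary systolic matrix multiply, the style the TPU's MXU uses: weights stay pinned in the processing-element grid and partial sums flow through it, so each operand is fetched once and reused across a whole row of PEs. The function name and shapes are ours.

```python
import numpy as np

def systolic_matmul(A, W):
    """Compute A @ W the way a weight-stationary 2D systolic array would:
    W[r, c] is pinned in PE (r, c); activation rows stream through, and
    partial sums accumulate as they flow down the PE columns."""
    n, k = A.shape
    k2, m = W.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=A.dtype)
    for i in range(n):                 # one activation row enters per step
        acc = np.zeros(m, dtype=A.dtype)
        for r in range(k):             # PE row r reads activation A[i, r] once...
            acc += A[i, r] * W[r, :]   # ...and reuses it across all m columns
        out[i] = acc
    return out

A = np.random.rand(4, 3).astype(np.float32)
W = np.random.rand(3, 5).astype(np.float32)
assert np.allclose(systolic_matmul(A, W), A @ W, atol=1e-5)
```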
❖ More Memory
❖ More Programmable
❖ Bigger Numerics (see the bfloat16 note below)
❖ Harder parallelization
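The slide doesn't name the format, but the "bigger numerics" relative to TPUv1's 8-bit integers are usually identified with bfloat16 (with float32 accumulation), which TPUv2 introduced. A small Python illustration using the ml_dtypes package, which provides bfloat16 outside TPU hardware: bfloat16 keeps float32's 8 exponent bits, trading mantissa precision for float32-like range.

```python
import numpy as np
from ml_dtypes import bfloat16  # bfloat16 for NumPy, no TPU required

x = np.array([3.14159], dtype=np.float32)
print(x.astype(bfloat16))        # ~3.140625: only 8 bits of mantissa precision
print(np.finfo(bfloat16).max)    # ~3.39e38: same order of magnitude as float32
```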
[Figure: TensorCore block diagram: Host Queue (over PCIe), Core Sequencer, Vector Unit, MXU, Transpose Permute Unit, Interconnect Router, HBM Memory (8 GiB)]
❖ A 128×128 systolic MXU performs N×128×128 matrix multiplications (peak: 32K ops/clock; see the arithmetic below)
❖ Transpose Reduction Permute Unit (TRP) operates on 128×128 matrices
❖ Vector Processing Unit (VPU): 32 2D vector registers (Vregs) + 2D vector memory (Vmem, 16 MiB); see the sketch below
❖ Core Sequencer: fetches instructions from instruction memory (Imem)
❖ Inter-Core Interconnect (ICI) sends messages between TensorCores (see the psum sketch below)
❖ High Bandwidth Memory (HBM): 2 HBM stacks per TensorCore
➢ 32 64-bit busses (20× TPUv1's bandwidth; check below)
❖ 16 GiB of HBM per chip (8 GiB for each TensorCore)
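Where the 32K peak comes from: the array holds 128 × 128 = 16,384 multiply-accumulate units, and a MAC counts as two operations (one multiply, one add), so 2 × 16,384 = 32,768 ≈ 32K ops per clock.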
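As a concrete way to exercise the MXU from software, a hedged JAX sketch (shapes and names are ours): on TPU, XLA lowers a jitted bfloat16 jnp.dot to MXU instructions, padding dimensions up to multiples of 128 to fill the 128×128 array and accumulating in float32.

```python
import jax
import jax.numpy as jnp

x = jnp.ones((256, 512), dtype=jnp.bfloat16)  # multiples of 128: no padding waste
w = jnp.ones((512, 384), dtype=jnp.bfloat16)

@jax.jit                                       # XLA compiles this; on TPU it targets the MXU
def matmul(a, b):
    # request float32 accumulation, matching the MXU's accumulator precision
    return jnp.dot(a, b, preferred_element_type=jnp.float32)

print(matmul(x, w).shape)                      # (256, 384)
```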
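For the VPU's 2D registers, the slide gives the count (32) and the Vmem size (16 MiB) but not the register shape; JAX's Pallas/Mosaic documentation describes TPU vregs as 8 sublanes × 128 lanes for 32-bit data, which this NumPy sketch assumes.

```python
import numpy as np

SUBLANES, LANES = 8, 128              # assumed vreg shape (8 x 128 for 32-bit data)
vregs = np.zeros((32, SUBLANES, LANES), dtype=np.float32)  # the 32 Vregs

# one VPU instruction applies the same operation across a whole 2D register
vregs[0] = vregs[1] * 2.0 + 1.0
print(vregs[0].size)                  # 1024 elements touched per instruction
```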
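And for the ICI, a hedged JAX sketch: a psum inside pmap compiles to an all-reduce, and when the participating devices are TensorCores the traffic travels over ICI. On a non-TPU machine this still runs, just over however many local devices exist.

```python
import jax
import jax.numpy as jnp

n = jax.local_device_count()          # e.g. 8 TensorCores on one TPUv2 board
xs = jnp.arange(float(n))             # one value per core

# lax.psum across the mapped axis becomes an all-reduce, carried by ICI on TPU
allreduce = jax.pmap(lambda x: jax.lax.psum(x, 'cores'), axis_name='cores')
print(allreduce(xs))                  # every core reports the same sum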
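Bandwidth check on the "20× TPUv1" claim: TPUv1's two DDR3 channels gave 34 GB/s, and 20 × 34 GB/s ≈ 680 GB/s, consistent with the roughly 700 GB/s of HBM bandwidth commonly quoted for a TPUv2 chip.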