Abstract—Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators, like graphical processing units (GPUs), being designed specifically to perform the multiply-accumulate (MAC) operations required in the matrix-matrix and matrix-vector multiplies extensively present throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow consisting of either input, output, or weight stationary architectures. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein consists of developing a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer during run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75× compared to conventional TPUs, with only minor area and power overheads.

Index Terms—Tensor processing unit (TPU), AI hardware accelerator, machine learning, systolic array, ML architecture.

I. INTRODUCTION

In 2015, Google launched its tensor processing unit (TPU) project, adopting the systolic array architecture, which dates back to as early as 1979 [1], to accelerate machine learning (ML) workloads [2], [3]. The first version of Google's TPU was primarily designed to accelerate ML workloads in large data centers, utilizing 8-bit integer (INT8) multiply-and-accumulate (MAC) units to offer a peak performance of 92 tera operations per second (TOPS) [3]. The most recent version of the TPU, the TPU v4, can accelerate training and inference using the TPU's internal 16-bit brain-float (BF16) and INT8 precisions to offer up to 275 teraflops of computational power [4]. In 2019, Google launched a smaller and low-power version of the TPU, called the Coral Edge TPU, that is suited to accelerate the inference of ML workloads at the edge [5]–[8]. The Edge TPU uses INT8 MAC core units [9] and realizes a peak performance of four TOPS.

In contrast to the graphical processing units (GPUs) already employed for this task, TPUs were specifically designed to accelerate the common matrix-matrix and matrix-vector multiplications dominant in ML workloads, with a particular focus on maximizing data reuse while minimizing data transfer. Since their advent in 2015, several variations of the TPU have been proposed [4], [10]–[14], which similarly adopt the systolic array architecture but focus on modifying the microarchitecture to achieve improvements in performance, power, energy, etc. For example, in 2022 the APTPU [11] was proposed, which leverages approximate multipliers and adders in the systolic array processing elements (PEs) to improve the performance, area, and power of the TPU design. Another example is UPTPU [12], which utilizes power-gating to reduce the energy consumption of the TPU.

The typical systolic array architecture consists of an N × N array of PEs, each of which implements a MAC operation using a single multiplier and adder along with some registers to store data for reuse. The dataflow in the systolic array is a mapping scheme that depends on the microarchitecture of the PEs and determines how the input data is fed to the array, along with how the partial results and outputs are generated and stored. Instead of loading and storing to and from memory for each computation, each PE in the systolic array typically employs one of the following dataflow paradigms (a minimal loop-nest sketch of these mappings follows the list):

• Input Stationary (IS): The inputs (or activations) remain fixed in the systolic array PEs while the weights are distributed horizontally.
• Output Stationary (OS): Outputs are attached to the MAC units as the inputs and weights are circulated among the units. As new inputs and weights are loaded and multiplied, they are accumulated directly into the stationary outputs.
• Weight Stationary (WS): Each weight is pre-loaded into a register attached to the MAC within each PE. During each cycle, the input activation data is multiplied by the fixed weights and broadcast across the systolic array's other processing elements.
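As a software analogy only (a loop-nest sketch we introduce here, not code from any of the cited TPU implementations), the snippet below shows how the same matrix product can be scheduled so that either the outputs or the weights stay resident in a PE while the other operands stream past; input stationary is the symmetric case in which the activations are pinned instead.

```python
import numpy as np

def matmul_output_stationary(A, W):
    """Output stationary (OS): each output element stays pinned in a PE and
    accumulates in place while inputs and weights stream past it."""
    M, K = A.shape
    _, N = W.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):              # C[i, j] is the stationary operand
            acc = 0.0
            for k in range(K):          # inputs and weights circulate
                acc += A[i, k] * W[k, j]
            C[i, j] = acc
    return C

def matmul_weight_stationary(A, W):
    """Weight stationary (WS): each weight stays pinned in a PE register while
    activations stream in and partial sums flow onward every cycle."""
    M, K = A.shape
    _, N = W.shape
    C = np.zeros((M, N))
    for k in range(K):
        for j in range(N):              # W[k, j] is the stationary operand
            w = W[k, j]
            for i in range(M):          # activations stream past the fixed weight
                C[i, j] += A[i, k] * w
    return C

# Input stationary (IS) is the mirror image of WS: pin A[i, k] in the PE and
# stream the weights past it instead.

A = np.random.rand(4, 3)
W = np.random.rand(3, 5)
assert np.allclose(matmul_output_stationary(A, W), A @ W)
assert np.allclose(matmul_weight_stationary(A, W), A @ W)
```

Only the loop order, and hence which operand stays resident in a PE, changes; the arithmetic and the result are identical, which is why the choice of dataflow primarily affects data movement and compute utilization rather than correctness.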
Even in 2024, most of the TPUs used in large data centers, and even edge TPUs like the Google Coral Edge TPU [15], have only been engineered with one of these three dataflow architectures in hardware. However, this single static dataflow architecture may not always provide the optimal performance, depending on the specific implementation of a deep neural network's (DNN) layers, leading to significant performance limitations. As shown in Fig. 1, our simulations of the ResNet-18 [16] convolutional neural network (CNN) show that in many cases the layers of a DNN perform better on a heterogeneous distribution of dataflows. For example, in Fig. 1, we see that ResNet-18's first five layers are fastest with the weight stationary dataflow, while the intermediate and final layers perform optimally with the output and input stationary dataflows, respectively. As later shown in Fig. 5, a majority of the TPU's area is consumed by the systolic array, which accounts for 77%-80% of the entire TPU's area and approximately 50%-89% of its overall power consumption. Thus, modifying the microarchitecture of the systolic array to support multiple different dataflows can lead to significant performance speedups for the TPU as a whole.

Fig. 1: Cycles required for executing each layer in the ResNet-18 model using static dataflow architectures: (a) input stationary, (b) output stationary, and (c) weight stationary. The layer-wise comparison shows that the optimal dataflow can differ in each layer of the network, emphasizing the potential benefits of a flexible TPU with a run-time reconfigurable dataflow.

In this paper, to increase the performance of the TPU, we propose a flexible runtime-reconfigurable dataflow TPU, called the Flex-TPU, in which the dataflow architecture of the systolic array can be reconfigured for each layer of the DNN according to the workload characteristics. Herein, our contributions consist of the following:

• A modified PE microarchitecture to support runtime-reconfigurable dataflows.
• The implementation of the modified processing elements into a functional TPU.
• Thorough experimentation which shows the validity and increased performance of our design.

The remainder of the paper is organized as follows. In Section II, we present our Flex-TPU architecture and discuss the specific changes we make to the PE microarchitecture. In Section III, we discuss the performance gains resulting from the flexibility in dataflow in the Flex-TPU and the marginal overheads incurred over conventional TPUs. Finally, we conclude the paper in Section IV.
II. PROPOSED FLEX-TPU
The systolic array is the core element of any TPU architecture. A systolic array consists of a two- or multi-dimensional array of processing elements (PEs). Each PE in the systolic array implements a multiply-and-accumulate (MAC) operation by multiplying the weights and inputs with a multiplier and then adding this product to any previously computed partial sums using an accumulator. The result of this summation is then either kept in the same PE or broadcast downstream to other PEs to be used in further computations. Regardless of the dataflow type, this MAC operation occurs in each PE of the systolic array to accomplish matrix-matrix or matrix-vector multiplication while maximizing data reuse without introducing additional data transfer overhead.
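As a behavioral illustration (a sketch we introduce here, not RTL from any of the cited designs), a single PE cycle can be modeled as follows:

```python
class PE:
    """Behavioral model of a single systolic-array PE (illustrative only)."""

    def __init__(self):
        self.weight = 0        # operand held locally for reuse
        self.partial_sum = 0   # last partial sum produced by this PE

    def step(self, activation, psum_in):
        # Multiply the locally held operand by the streamed activation and add
        # the partial sum arriving from the upstream neighbor.
        psum_out = psum_in + self.weight * activation
        # Depending on the dataflow, psum_out is either kept in this PE
        # (output stationary) or forwarded to the downstream PE.
        self.partial_sum = psum_out
        return psum_out
```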
The primary distinction between different TPU architectures is typically the dataflow of the systolic array and its PEs. Each dataflow has its own advantages and trade-offs for ML workloads regarding power, data transfer, and compute unit utilization efficiency, depending on the specific workload characteristics. The choice between the IS, OS, and WS dataflows largely depends on the objectives of the computation, such as maximizing data reuse, minimizing memory bandwidth, or reducing latency. As shown in Fig. 1, ResNet-18's optimal dataflow varies across the layers between all three dataflows. Hence, selecting the optimal dataflow at runtime for each layer can lead to significant performance gains.
Figure 2 shows the overall architecture of our proposed Flex-TPU, which is equipped with a runtime dataflow reconfigurability feature. Similar to a conventional TPU, our proposed Flex-TPU design consists of weight memory, input memory, output memory, and a systolic array of size S = N × N PEs surrounded by first-in-first-out (FIFO) buffers, as depicted in Fig. 2. Additionally, our Flex-TPU includes a Weight/IFMap Register File that stores the fixed or "stationary" weights or IFMaps, depending on the selected dataflow, with output ports distributed among the PEs in the systolic array. Moreover, the Dataflow Generator block generates the memory read/write addresses to store or retrieve the IFMaps, weights, and OFMaps according to the selected dataflow dictated by the Configuration Management Unit (CMU). The CMU selects the dataflow for each layer of the ML workload by informing the Dataflow Generator and by reconfiguring the PEs within the systolic array to work according to the pre-determined dataflow for each layer. It is worth mentioning that the dataflow of each layer of the ML model is determined after training and before deployment of the model on the Flex-TPU, reducing the complexity of the hardware. The Main Controller handles the data transfer between the memories/FIFOs and the systolic array, programs the CMU, and writes to the Weight/IFMap Register File.

Fig. 2: The proposed Flex-TPU architecture.

The proposed Flex-TPU's architecture differs from that of a conventional TPU in two primary ways: 1) the processing elements within the systolic array, and 2) the controller driving the dataflow selections.

Figure 3 shows the microarchitecture of a processing element in the Flex-TPU, which has one extra register and two multiplexers (MUXs) compared to the PE of a conventional TPU. The MUXs are controlled by the Configuration Management Unit and are utilized to select the optimal dataflow for each layer in the ML model during runtime. As investigated in Section III-B, adding these three extra components to each PE in the systolic array does moderately increase both area and power consumption, but the flexibility of the design provides a significant performance increase, as discussed in Section III-A.

Fig. 3: The proposed Flex-TPU processing element (PE) with runtime reconfigurable dataflow.
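To make the role of the added register and MUXs concrete, the following behavioral sketch (our illustration, with hypothetical names such as FlexPE, configure, and step; it is not the paper's RTL) models a single reconfigurable PE whose one-bit mux select is driven by the CMU, anticipating the three configurations described next.

```python
class FlexPE:
    """Behavioral sketch of the Flex-TPU PE: the baseline MAC plus one extra
    operand register and a 1-bit mux select driven by the CMU (illustrative)."""

    def __init__(self):
        self.stationary = 0    # added register: holds the pinned weight (WS) or IFMap (IS)
        self.acc = 0           # accumulator: holds the pinned partial sum in OS mode
        self.mux_select = 0

    def configure(self, mux_select, stationary_value=None):
        # mux_select = 0 -> IS/WS style: multiply the pinned operand by the
        #                   streamed operand and forward the partial sum.
        # mux_select = 1 -> OS style: multiply two streamed operands and
        #                   accumulate into the local, stationary output.
        self.mux_select = mux_select
        if stationary_value is not None:
            self.stationary = stationary_value   # pinned by the Main Controller
        self.acc = 0

    def step(self, streamed_a, streamed_b, psum_in):
        if self.mux_select == 0:
            # streamed_a is the moving operand (the weight in IS mode, the
            # IFMap in WS mode); the partial sum moves downstream.
            return psum_in + self.stationary * streamed_a
        # OS mode: both operands stream; the output stays put in this PE.
        self.acc += streamed_a * streamed_b
        return self.acc
```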
As shown in Fig. 4, there are three possible runtime configurations of the PE. In the IS configuration, shown in Fig. 4(a), the input feature map (IFMap) is fixed in a register in the PE. To accomplish the IS dataflow during runtime, the CMU sends both MUXs a "0" control signal and the Main Controller pins the IFMap in the register in the PE. The IS dataflow often excels for layers with small-stride convolutions or depthwise convolutions due to the high input reuse. By minimizing the movement of heavily reused input data, a significant amount of bandwidth can be saved, particularly in memory-bound operations or power-constrained environments.

Figure 4(b) shows the OS configuration mode of the PE. In this mode, the IFMaps and weights are multiplied and then moved through the PEs in the systolic array to be reused in further operations. The partial sums remain fixed in the PEs and keep accumulating to form the final output feature map (OFMap) of the layer. The OS dataflow mode is triggered by a "1" control signal being sent from the CMU to the MUXs of each PE, which leads to the output of the MAC being fixed inside the accumulator. The OS dataflow is typically advantageous in deeper layers of DNNs where a large number of partial sums are being accumulated. Thus, keeping the output fixed within the PE minimizes the need for frequent memory stores of intermediate results, benefiting layers with higher computational intensity per output.

Figure 4(c) shows the runtime configuration for the weight stationary (WS) dataflow. The WS dataflow mode is also activated with a "0" control signal sent from the CMU; however, instead of the IFMap being fixed in the added register in the PE, the weight is fixed for the duration of the computation. This can be advantageous in the first layers of DNNs where the ratio of inputs to weights is high. As a result, keeping the weights stationary yields efficient use of memory bandwidth and improves the overall computational throughput.
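Continuing the FlexPE sketch above, the three modes described in this section map onto the control values roughly as follows (purely illustrative; the operands are toy values).

```python
pe = FlexPE()
ifmap_value, weight_value, psum_from_upstream = 3, 2, 10   # toy operands

# IS mode: the CMU sends "0" and the Main Controller pins the IFMap in the register.
pe.configure(mux_select=0, stationary_value=ifmap_value)
print(pe.step(streamed_a=weight_value, streamed_b=0, psum_in=psum_from_upstream))  # 16

# WS mode: also "0", but the weight is pinned and the activations stream instead.
pe.configure(mux_select=0, stationary_value=weight_value)
print(pe.step(streamed_a=ifmap_value, streamed_b=0, psum_in=psum_from_upstream))   # 16

# OS mode: the CMU sends "1"; inputs and weights both stream and the output stays put.
pe.configure(mux_select=1)
print(pe.step(streamed_a=ifmap_value, streamed_b=weight_value, psum_in=0))          # 6
```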
Selecting the optimal dataflow strategy is dependent on multiple layer-specific characteristics such as IFMap dimensions, filter sizes, number of channels, and strides. To find the optimal dataflow strategy for each layer in the DNN, we run each trained model on the Flex-TPU three times, once for each dataflow, during the development phase. From these three runs, the dataflow that executes each layer's computation in the least number of clock cycles is selected as the optimal dataflow for that layer. Following this one-time pre-deployment optimization procedure, the optimal dataflow is programmed into the CMU by the Main Controller, and the CMU subsequently drives each processing element's dataflow configuration accordingly.
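A minimal sketch of this one-time pre-deployment search is shown below; simulate_layer_cycles is a placeholder we introduce for whatever cycle-accurate model or measurement supplies the per-layer cycle counts.

```python
DATAFLOWS = ("IS", "OS", "WS")

def select_layer_dataflows(model_layers, simulate_layer_cycles):
    """One-time pre-deployment search: run every layer under all three dataflows
    and keep, per layer, the dataflow that finishes in the fewest clock cycles.

    simulate_layer_cycles(layer, dataflow) is assumed to return the clock-cycle
    count of one layer executed under one dataflow."""
    schedule = {}
    for index, layer in enumerate(model_layers):
        cycles = {df: simulate_layer_cycles(layer, df) for df in DATAFLOWS}
        schedule[index] = min(cycles, key=cycles.get)   # fewest-cycle dataflow wins
    return schedule   # per-layer table that the Main Controller programs into the CMU
```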
Fig. 4: The three flexible PE dataflow configurations controlled by the two added MUXs: (a) IS, (b) OS, and (c) WS modes.
Fig. 7: The inference clock cycles per model for systolic array sizes of (a) S = 128 × 128 and (b) S = 256 × 256 under the IS, OS, WS, and Flex-TPU dataflows, demonstrating the scalability of our proposed Flex-TPU architecture.
Figure 7 reports the inference clock cycles per model for the larger systolic array sizes of S = 128 × 128 and 256 × 256. Similar to the smaller scale S = 32 × 32 systolic array, the Flex-TPU still provides a significant speed advantage compared to the IS and WS dataflows. However, compared to the TPU with the OS dataflow, the Flex-TPU achieves further performance gains at scale. In particular, the Flex-TPU with the 128 × 128 systolic array achieves an average speedup of 1.238×, and the 256 × 256 array achieves a 1.349× speedup, compared to the 1.090× speedup of the 32 × 32 systolic array. This demonstrates the Flex-TPU's effectiveness in further accelerating ML workloads in data centers at a larger scale.

IV. CONCLUSION

As AI and ML gain increasing traction in daily life, there is a rising demand for more performant systems and architectures for accelerating these workloads. While conventional TPUs have been instrumental in keeping up with this demand, their current static dataflow implementations potentially inhibit their full potential on some workloads. Thus, selecting an optimal dataflow specific to the workload can lead to significant performance gains. Herein, we proposed the Flex-TPU architecture, highlighting the potential to further optimize the TPU design's performance without the limitations caused by static dataflows. The Flex-TPU accomplishes a runtime-reconfigurable dataflow by adding two multiplexers and a single register to each processing element. The experiments and simulation results demonstrate performance increases across various ML workloads, with up to a 2.75× speedup, without incurring significant area and power overheads. Considering the popularity of current TPU accelerators in data centers and edge applications, the higher performance achieved by the Flex-TPU positions it as an appealing upgrade for future implementations of TPUs.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation (NSF) under grant number 2340249.

REFERENCES

[1] H. T. Kung and C. E. Leiserson, "Systolic arrays (for VLSI)," in Sparse Matrix Proceedings 1978, vol. 1. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1979, pp. 256–282.
[2] N. Jouppi, "Quantifying the performance of the TPU, our first machine learning chip," Google Cloud Platform Blog, Google, 2017.
[3] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. of the 44th Annual Int. Symp. on Comput. Architecture, 2017, pp. 1–12.
[4] N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan et al., "TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–14.
[5] B. C. Reidy, M. Mohammadi, M. E. Elbtity, and R. Zand, "Efficient deployment of transformer models on edge TPU accelerators: A real system evaluation," in Architecture and System Support for Transformer Models (ASSYST ISCA), 2023.
[6] ——, "Work in progress: Real-time transformer inference on edge AI accelerators," in 2023 IEEE 29th Real-Time and Embedded Technology and Applications Symposium (RTAS), 2023, pp. 341–344.
[7] H. Smith, J. Seekings, M. Mohammadi, and R. Zand, "Realtime facial expression recognition: Neuromorphic hardware vs. edge AI accelerators," in 2023 International Conference on Machine Learning and Applications (ICMLA), 2023, pp. 1547–1552.
[8] M. Mohammadi, H. Smith, L. Khan, and R. Zand, "Facial expression recognition at the edge: CPU vs GPU vs VPU vs TPU," in Proceedings of the Great Lakes Symposium on VLSI 2023, ser. GLSVLSI '23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 243–248.
[9] Google Coral AI, "TensorFlow models on the Edge TPU," 2020. [Online]. Available: https://coral.ai/docs/edgetpu/models-intro/#compatibility-overview
[10] Z. Wang, G. Wang, H. Jiang, N. Xu, and G. He, "COSA: Co-operative systolic arrays for multi-head attention mechanism in neural network using hybrid data reuse and fusion methodologies," in 2023 60th ACM/IEEE Design Automation Conference (DAC), 2023, pp. 1–6.
[11] M. E. Elbtity, P. S. Chandarana, B. Reidy, J. K. Eshraghian, and R. Zand, "APTPU: Approximate computing based tensor processing unit," IEEE Transactions on Circuits and Systems I: Regular Papers, 2022.
[12] P. Pandey, N. D. Gundi, K. Chakraborty, and S. Roy, "UPTPU: Improving energy efficiency of a tensor processing unit through underutilization based power-gating," in 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021, pp. 325–330.
[13] K.-C. Hsu and H.-W. Tseng, "Accelerating applications using edge tensor processing units," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
[14] M. E. Elbtity, B. Reidy, M. H. Amin, and R. Zand, "Heterogeneous integration of in-memory analog computing architectures with tensor processing units," in Proceedings of the Great Lakes Symposium on VLSI 2023, ser. GLSVLSI '23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 607–612.
[15] K. Seshadri, B. Akin, J. Laudon, R. Narayanaswami, and A. Yazdanbakhsh, "An evaluation of edge TPU accelerators for convolutional neural networks," in 2022 IEEE International Symposium on Workload Characterization (IISWC), 2022, pp. 79–91.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2015.
[17] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-Sim: Systolic CNN accelerator simulator," arXiv preprint arXiv:1811.02883, 2018.
[18] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina et al., "A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim," in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2020, pp. 58–68.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., vol. 25. Curran Associates, Inc., 2012.
[20] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," 2016.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., "Going deeper with convolutions," 2014.
[22] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015.
[24] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020.
[25] Google Coral AI, "Edge TPU inferencing overview," 2020. [Online]. Available: https://coral.ai/docs/edgetpu/inference/