Abstract—Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators, like graphical processing units (GPUs), being designed specifically to perform the multiply-accumulate (MAC) operations required in the matrix-matrix and matrix-vector multiplies extensively present throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow consisting of either input, output, or weight stationary architectures. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein consists of developing a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer during run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75× compared to conventional TPUs, with only minor area and power overheads.

Index Terms—Tensor processing unit (TPU), AI hardware accelerator, machine learning, systolic array, ML architecture.

I. INTRODUCTION

In 2015, Google launched its tensor processing unit (TPU) project, adopting the systolic array architecture, which dates back to as early as 1979 [1], to accelerate machine learning (ML) workloads [2], [3]. The first version of Google's TPU was primarily designed to accelerate ML workloads in large data centers, utilizing 8-bit integer (INT8) multiply-and-accumulate (MAC) units to offer a peak performance of 92 tera operations per second (TOPS) [3]. The most recent version of the TPU, the TPU v4, can accelerate training and inference using the TPU's internal 16-bit brain-float (BF16) and INT8 precisions to offer up to 275 teraflops of computational power [4]. In 2019, Google launched a smaller and low-power version of the TPU, called the Coral Edge TPU, that is suited to accelerate the inference of ML workloads at the edge [5]–[8]. The Edge TPU uses INT8 MAC core units [9] and realizes a peak performance of four TOPS.

In contrast to the graphical processing units (GPUs) already employed for this task, TPUs were specifically designed to accelerate the common matrix-matrix and matrix-vector multiplications dominant in ML workloads, with a particular focus on maximizing data reuse while minimizing data transfer. Since their advent in 2015, several variations of the TPU have been proposed [4], [10]–[14], which similarly adopt the systolic array architecture but focus on modifying the microarchitecture to achieve improvements in performance, power, energy, etc. For example, in 2022 the APTPU [11] was proposed, which leverages approximate multipliers and adders in the systolic array processing elements (PEs) to improve the performance, area, and power of the TPU design. Another example is UPTPU [12], which utilizes power-gating to reduce the energy consumption of the TPU.

The typical systolic array architecture consists of an N × N array of PEs, each of which implements a MAC operation using a single multiplier and adder along with some registers to store data for reuse. The dataflow in the systolic array is a mapping scheme that depends on the microarchitecture of the PEs and determines how the input data is fed to the array, along with how the partial results and outputs are generated and stored. Instead of loading and storing to and from memory for each computation, each PE in the systolic array typically employs one of the following dataflow paradigms (a minimal loop-nest sketch of these mappings follows the list):

• Input Stationary (IS): The inputs (or activations) remain fixed in the systolic array PEs while the weights are distributed horizontally.
• Output Stationary (OS): Outputs are attached to the MAC units as the inputs and weights are circulated among the units. As new inputs and weights are loaded and multiplied, they are accumulated directly into the stationary outputs.
• Weight Stationary (WS): Each weight is pre-loaded into a register attached to the MAC within each PE. During each cycle, the input activation data is multiplied by the fixed weights and broadcast across the systolic array's other processing elements.
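As a software analogy only (a loop-nest sketch we introduce here, not code from any of the cited TPU implementations), the snippet below shows how the same matrix product can be scheduled so that either the outputs or the weights stay resident in a PE while the other operands stream past; input stationary is the symmetric case in which the activations are pinned instead.

```python
import numpy as np

def matmul_output_stationary(A, W):
    """Output stationary (OS): each output element stays pinned in a PE and
    accumulates in place while inputs and weights stream past it."""
    M, K = A.shape
    _, N = W.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):              # C[i, j] is the stationary operand
            acc = 0.0
            for k in range(K):          # inputs and weights circulate
                acc += A[i, k] * W[k, j]
            C[i, j] = acc
    return C

def matmul_weight_stationary(A, W):
    """Weight stationary (WS): each weight stays pinned in a PE register while
    activations stream in and partial sums flow onward every cycle."""
    M, K = A.shape
    _, N = W.shape
    C = np.zeros((M, N))
    for k in range(K):
        for j in range(N):              # W[k, j] is the stationary operand
            w = W[k, j]
            for i in range(M):          # activations stream past the fixed weight
                C[i, j] += A[i, k] * w
    return C

# Input stationary (IS) is the mirror image of WS: pin A[i, k] in the PE and
# stream the weights past it instead.

A = np.random.rand(4, 3)
W = np.random.rand(3, 5)
assert np.allclose(matmul_output_stationary(A, W), A @ W)
assert np.allclose(matmul_weight_stationary(A, W), A @ W)
```

Only the loop order, and hence which operand stays resident in a PE, changes; the arithmetic and the result are identical, which is why the choice of dataflow primarily affects data movement and compute utilization rather than correctness.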
Even in 2024, most of the TPUs used in large data centers, and even edge TPUs like the Google Coral Edge TPU [15], have only been engineered with one of these three dataflow architectures in hardware. However, this single static dataflow architecture may not always provide the optimal performance, depending on the specific implementation of a deep neural network's (DNN) layers, leading to significant performance limitations. As shown in Fig. 1, our simulations of the ResNet-18 [16] convolutional neural network (CNN) show that in many cases the layers of a DNN perform better on a heterogeneous distribution of dataflows. For example, in Fig. 1, we see that ResNet-18's first five layers are fastest with the weight stationary dataflow, while the intermediate and final layers perform optimally with the output and input stationary dataflows, respectively. As later shown in Fig. 5, a majority of the TPU's area is consumed by the systolic array, which accounts for 77%-80% of the entire TPU's area and approximately 50%-89% of its overall power consumption. Thus, modifying the microarchitecture of the systolic array to support multiple different dataflows can lead to significant performance speedups for the TPU as a whole.

Fig. 1: Cycles required for executing each layer in the ResNet-18 model using static dataflow architectures: (a) input stationary, (b) output stationary, and (c) weight stationary. The layer-wise comparison shows that the optimal dataflow can differ in each layer of the network, emphasizing the potential benefits of a flexible TPU with a run-time reconfigurable dataflow.

In this paper, to increase the performance of the TPU, we propose a flexible runtime-reconfigurable dataflow TPU, called the Flex-TPU, in which the dataflow architecture of the systolic array can be reconfigured for each layer of the DNN according to the workload characteristics. Herein, our contributions consist of the following:

• A modified PE microarchitecture to support runtime-reconfigurable dataflows.
• The implementation of the modified processing elements into a functional TPU.
• Thorough experimentation which shows the validity and increased performance of our design.

The remainder of the paper is organized as follows. In Section II, we present our Flex-TPU architecture and discuss the specific changes we make to the PE microarchitecture. In Section III, we discuss the performance gains resulting from the flexibility in dataflow in the Flex-TPU and the marginal overheads incurred over conventional TPUs. Finally, we conclude the paper in Section IV.
II. PROPOSED FLEX-TPU
The systolic array is the core element of any TPU architecture. A systolic array consists of a two- or multi-dimensional array of processing elements (PEs). Each PE in the systolic array implements a multiply-and-accumulate (MAC) operation by multiplying the weights and inputs with a multiplier and then adding this product to any previously computed partial sums using an accumulator. The result of this summation is then either kept in the same PE or broadcast downstream to other PEs to be used in further computations. Regardless of the dataflow type, this MAC operation occurs in each PE of the systolic array to accomplish matrix-matrix or matrix-vector multiplication while maximizing data reuse without introducing additional data transfer overhead.
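As a behavioral illustration (a sketch we introduce here, not RTL from any of the cited designs), a single PE cycle can be modeled as follows:

```python
class PE:
    """Behavioral model of a single systolic-array PE (illustrative only)."""

    def __init__(self):
        self.weight = 0        # operand held locally for reuse
        self.partial_sum = 0   # last partial sum produced by this PE

    def step(self, activation, psum_in):
        # Multiply the locally held operand by the streamed activation and add
        # the partial sum arriving from the upstream neighbor.
        psum_out = psum_in + self.weight * activation
        # Depending on the dataflow, psum_out is either kept in this PE
        # (output stationary) or forwarded to the downstream PE.
        self.partial_sum = psum_out
        return psum_out
```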
The primary distinction between different TPU architectures is typically the dataflow of the systolic array and its PEs. Each dataflow has its own advantages and trade-offs for ML workloads regarding power, data transfer, and compute unit utilization efficiency, depending on the specific workload characteristics. The choice between the IS, OS, and WS dataflows largely depends on the objectives of the computation, such as maximizing data reuse, minimizing memory bandwidth, or reducing latency. As shown in Fig. 1, ResNet-18's optimal dataflow varies across the layers between all three dataflows. Hence, selecting the optimal dataflow at runtime for each layer can lead to significant performance gains.
Figure 2 shows the overall architecture of our proposed Flex-TPU, which is equipped with a runtime dataflow reconfigurability feature. Similar to a conventional TPU, our proposed Flex-TPU design consists of weight memory, input memory, output memory, and a systolic array of size S = N × N PEs surrounded by first-in-first-out (FIFO) buffers, as depicted in Fig. 2. Additionally, our Flex-TPU includes a Weight/IFMap Register File that stores the fixed or "stationary" weights or IFMaps, depending on the selected dataflow, with output ports distributed among the PEs in the systolic array. Moreover, the Dataflow Generator block generates the memory read/write addresses to store or retrieve the IFMaps, weights, and OFMaps according to the selected dataflow dictated by the Configuration Management Unit (CMU). The CMU selects the dataflow for each layer of the ML workload by informing the Dataflow Generator and by reconfiguring the PEs within the systolic array to work according to the pre-determined dataflow for each layer. It is worth mentioning that the dataflow of each layer of the ML model is determined after training and before deployment of the model on the Flex-TPU, reducing the complexity of the hardware. The Main Controller handles the data transfer between the memories/FIFOs and the systolic array, programs the CMU, and writes to the Weight/IFMap Register File.

Fig. 2: The proposed Flex-TPU architecture.

The proposed Flex-TPU's architecture differs from that of a conventional TPU in two primary ways: 1) the processing elements within the systolic array, and 2) the controller driving the dataflow selections.

Figure 3 shows the microarchitecture of a processing element in the Flex-TPU, which has one extra register and two multiplexers (MUXs) compared to the PE of a conventional TPU. The MUXs are controlled by the Configuration Management Unit and are utilized to select the optimal dataflow for each layer in the ML model during runtime. As investigated in Section III-B, adding these three extra components to each PE in the systolic array does moderately increase both area and power consumption, but the flexibility of the design provides a significant performance increase, as discussed in Section III-A.

Fig. 3: The proposed Flex-TPU processing element (PE) with runtime reconfigurable dataflow.
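To make the role of the added register and MUXs concrete, the following behavioral sketch (our illustration, with hypothetical names such as FlexPE, configure, and step; it is not the paper's RTL) models a single reconfigurable PE whose one-bit mux select is driven by the CMU, anticipating the three configurations described next.

```python
class FlexPE:
    """Behavioral sketch of the Flex-TPU PE: the baseline MAC plus one extra
    operand register and a 1-bit mux select driven by the CMU (illustrative)."""

    def __init__(self):
        self.stationary = 0    # added register: holds the pinned weight (WS) or IFMap (IS)
        self.acc = 0           # accumulator: holds the pinned partial sum in OS mode
        self.mux_select = 0

    def configure(self, mux_select, stationary_value=None):
        # mux_select = 0 -> IS/WS style: multiply the pinned operand by the
        #                   streamed operand and forward the partial sum.
        # mux_select = 1 -> OS style: multiply two streamed operands and
        #                   accumulate into the local, stationary output.
        self.mux_select = mux_select
        if stationary_value is not None:
            self.stationary = stationary_value   # pinned by the Main Controller
        self.acc = 0

    def step(self, streamed_a, streamed_b, psum_in):
        if self.mux_select == 0:
            # streamed_a is the moving operand (the weight in IS mode, the
            # IFMap in WS mode); the partial sum moves downstream.
            return psum_in + self.stationary * streamed_a
        # OS mode: both operands stream; the output stays put in this PE.
        self.acc += streamed_a * streamed_b
        return self.acc
```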
As shown in Fig. 4, there are three possible runtime configurations of the PE. In the IS configuration, shown in Fig. 4(a), the input feature map (IFMap) is fixed in a register in the PE. To accomplish the IS dataflow during runtime, the CMU sends both MUXs a "0" control signal and the Main Controller pins the IFMap in the register in the PE. The IS dataflow often excels for layers with small-stride convolutions or depthwise convolutions due to the high input reuse. By minimizing the movement of heavily reused input data, a significant amount of bandwidth can be saved, particularly in memory-bound operations or power-constrained environments.

Figure 4(b) shows the OS configuration mode of the PE. In this mode, the IFMaps and weights are multiplied and then moved through the PEs in the systolic array to be reused in further operations. The partial sums remain fixed in the PEs and keep accumulating to form the final output feature map (OFMap) of the layer. The OS dataflow mode is triggered by a "1" control signal being sent from the CMU to the MUXs of each PE, which leads to the output of the MAC being fixed inside the accumulator. The OS dataflow is typically advantageous in deeper layers of DNNs where a large number of partial sums are being accumulated. Thus, keeping the output fixed within the PE minimizes the need for frequent memory stores of intermediate results, benefiting layers with higher computational intensity per output.

Figure 4(c) shows the runtime configuration for the weight stationary (WS) dataflow. The WS dataflow mode is also activated with a "0" control signal sent from the CMU; however, instead of the IFMap being fixed in the added register in the PE, the weight is fixed for the duration of the computation. This can be advantageous in the first layers of DNNs where the ratio of inputs to weights is high. As a result, keeping the weights stationary yields efficient use of memory bandwidth and improves the overall computational throughput.
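Continuing the FlexPE sketch above, the three modes described in this section map onto the control values roughly as follows (purely illustrative; the operands are toy values).

```python
pe = FlexPE()
ifmap_value, weight_value, psum_from_upstream = 3, 2, 10   # toy operands

# IS mode: the CMU sends "0" and the Main Controller pins the IFMap in the register.
pe.configure(mux_select=0, stationary_value=ifmap_value)
print(pe.step(streamed_a=weight_value, streamed_b=0, psum_in=psum_from_upstream))  # 16

# WS mode: also "0", but the weight is pinned and the activations stream instead.
pe.configure(mux_select=0, stationary_value=weight_value)
print(pe.step(streamed_a=ifmap_value, streamed_b=0, psum_in=psum_from_upstream))   # 16

# OS mode: the CMU sends "1"; inputs and weights both stream and the output stays put.
pe.configure(mux_select=1)
print(pe.step(streamed_a=ifmap_value, streamed_b=weight_value, psum_in=0))          # 6
```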
Selecting the optimal dataflow strategy is dependent on multiple layer-specific characteristics such as IFMap dimensions, filter sizes, number of channels, and strides. To find the optimal dataflow strategy for each layer in the DNN, we run each trained model on the Flex-TPU three times, once for each dataflow, during the development phase. From these three runs, the dataflow that executes each layer's computation in the least number of clock cycles is selected as the optimal dataflow for that layer. Following this one-time pre-deployment optimization procedure, the optimal dataflow is programmed into the CMU by the Main Controller, and the CMU subsequently drives each processing element's dataflow configuration accordingly.
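A minimal sketch of this one-time pre-deployment search is shown below; simulate_layer_cycles is a placeholder we introduce for whatever cycle-accurate model or measurement supplies the per-layer cycle counts.

```python
DATAFLOWS = ("IS", "OS", "WS")

def select_layer_dataflows(model_layers, simulate_layer_cycles):
    """One-time pre-deployment search: run every layer under all three dataflows
    and keep, per layer, the dataflow that finishes in the fewest clock cycles.

    simulate_layer_cycles(layer, dataflow) is assumed to return the clock-cycle
    count of one layer executed under one dataflow."""
    schedule = {}
    for index, layer in enumerate(model_layers):
        cycles = {df: simulate_layer_cycles(layer, df) for df in DATAFLOWS}
        schedule[index] = min(cycles, key=cycles.get)   # fewest-cycle dataflow wins
    return schedule   # per-layer table that the Main Controller programs into the CMU
```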
Fig. 4: The three flexible PE dataflow configurations controlled by the two added MUXs: (a) IS, (b) OS, and (c) WS modes.
Fig. 7: The inference clock cycles per model for systolic array sizes of (a) S = 128 × 128 and (b) S = 256 × 256 under the IS, OS, WS, and Flex-TPU dataflows, demonstrating the scalability of our proposed Flex-TPU architecture.
Figure 7 reports the inference clock cycles per model for the larger systolic array sizes of S = 128 × 128 and 256 × 256. Similar to the smaller scale S = 32 × 32 systolic array, the Flex-TPU still provides a significant speed advantage compared to the IS and WS dataflows. However, compared to the TPU with the OS dataflow, the Flex-TPU achieves further performance gains at scale. In particular, the Flex-TPU with the 128 × 128 systolic array achieves an average speedup of 1.238×, and the 256 × 256 array achieves a 1.349× speedup, compared to the 1.090× speedup of the 32 × 32 systolic array. This demonstrates the Flex-TPU's effectiveness in further accelerating ML workloads in data centers at a larger scale.

IV. CONCLUSION

As AI and ML gain increasing traction in daily life, there is a rising demand for more performant systems and architectures for accelerating these workloads. While conventional TPUs have been instrumental in keeping up with this demand, their current static dataflow implementations potentially inhibit their full potential on some workloads. Thus, selecting an optimal dataflow specific to the workload can lead to significant performance gains. Herein, we proposed the Flex-TPU architecture, highlighting the potential to further optimize the TPU design's performance without the limitations caused by static dataflows. The Flex-TPU accomplishes a runtime-reconfigurable dataflow by adding two multiplexers and a single register to each processing element. The experiments and simulation results demonstrate performance increases across various ML workloads, with up to a 2.75× speedup, without incurring significant area and power overheads. Considering the popularity of current TPU accelerators in data centers and edge applications, the higher performance achieved by the Flex-TPU positions it as an appealing upgrade for future implementations of TPUs.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation (NSF) under grant number 2340249.

REFERENCES

[1] H. T. Kung and C. E. Leiserson, "Systolic arrays (for VLSI)," in Sparse Matrix Proceedings 1978, vol. 1. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1979, pp. 256–282.
[2] N. Jouppi, "Quantifying the performance of the TPU, our first machine learning chip," Google Cloud Platform Blog, Google, 2017.
[3] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. of the 44th Annual Int. Symp. on Comput. Architecture, 2017, pp. 1–12.
[4] N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan et al., "TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–14.
[5] B. C. Reidy, M. Mohammadi, M. E. Elbtity, and R. Zand, "Efficient deployment of transformer models on edge TPU accelerators: A real system evaluation," in Architecture and System Support for Transformer Models (ASSYST ISCA), 2023.
[6] ——, "Work in progress: Real-time transformer inference on edge AI accelerators," in 2023 IEEE 29th Real-Time and Embedded Technology and Applications Symposium (RTAS), 2023, pp. 341–344.
[7] H. Smith, J. Seekings, M. Mohammadi, and R. Zand, "Realtime facial expression recognition: Neuromorphic hardware vs. edge AI accelerators," in 2023 International Conference on Machine Learning and Applications (ICMLA), 2023, pp. 1547–1552.
[8] M. Mohammadi, H. Smith, L. Khan, and R. Zand, "Facial expression recognition at the edge: CPU vs GPU vs VPU vs TPU," in Proceedings of the Great Lakes Symposium on VLSI 2023, ser. GLSVLSI '23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 243–248.
[9] Google Coral AI, "TensorFlow models on the Edge TPU," 2020. [Online]. Available: https://coral.ai/docs/edgetpu/models-intro/#compatibility-overview
[10] Z. Wang, G. Wang, H. Jiang, N. Xu, and G. He, "COSA: Co-operative systolic arrays for multi-head attention mechanism in neural network using hybrid data reuse and fusion methodologies," in 2023 60th ACM/IEEE Design Automation Conference (DAC), 2023, pp. 1–6.
[11] M. E. Elbtity, P. S. Chandarana, B. Reidy, J. K. Eshraghian, and R. Zand, "APTPU: Approximate computing based tensor processing unit," IEEE Transactions on Circuits and Systems I: Regular Papers, 2022.
[12] P. Pandey, N. D. Gundi, K. Chakraborty, and S. Roy, "UPTPU: Improving energy efficiency of a tensor processing unit through underutilization based power-gating," in 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021, pp. 325–330.
[13] K.-C. Hsu and H.-W. Tseng, "Accelerating applications using edge tensor processing units," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
[14] M. E. Elbtity, B. Reidy, M. H. Amin, and R. Zand, "Heterogeneous integration of in-memory analog computing architectures with tensor processing units," in Proceedings of the Great Lakes Symposium on VLSI 2023, ser. GLSVLSI '23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 607–612.
[15] K. Seshadri, B. Akin, J. Laudon, R. Narayanaswami, and A. Yazdanbakhsh, "An evaluation of edge TPU accelerators for convolutional neural networks," in 2022 IEEE International Symposium on Workload Characterization (IISWC), 2022, pp. 79–91.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2015.
[17] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-Sim: Systolic CNN accelerator simulator," arXiv preprint arXiv:1811.02883, 2018.
[18] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina et al., "A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim," in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2020, pp. 58–68.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., vol. 25. Curran Associates, Inc., 2012.
[20] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," 2016.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., "Going deeper with convolutions," 2014.
[22] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015.
[24] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020.
[25] Google Coral AI, "Edge TPU inferencing overview," 2020. [Online]. Available: https://coral.ai/docs/edgetpu/inference/