Fast HBM Access With FPGAs: Analysis, Architectures, and Applications

Abstract—Over the past few decades, the gap between rapidly increasing computational power and almost stagnating memory bandwidth has steadily worsened. Recently, 3D die-stacking in the form of High Bandwidth Memory (HBM) enabled the first major jump in external memory throughput in years. In contrast to traditional DRAM, it compensates its lower clock frequency with wide busses and a high number of separate channels. However, this also requires data to be spread out over all channels to reach the full throughput. Previous research relied on manual HBM data partitioning schemes and handled each channel as an entirely independent entity. This paper, in contrast, also considers scalable hardware adaptions and approaches system design holistically. In this process we first analyze the problem with real-world measurements on a Xilinx HBM FPGA. Then we derive several architectural changes to improve throughput and ease accelerator design. Finally, a Roofline-based model to more accurately estimate the expected performance in advance is presented. With these measures we were able to increase the throughput by up to 3.78× with random and 40.6× with certain strided access patterns compared to Xilinx' state-of-the-art switch fabric.

Keywords—Field Programmable Gate Arrays (FPGA), High-Bandwidth Memory (HBM), Computational Modeling

I. INTRODUCTION

The ever increasing demand for computational power recently reached new heights with the widespread adoption of deep learning and big data techniques. However, these requirements are getting increasingly harder to meet due to the ongoing slowdown of Moore's Law and Dennard Scaling. For this reason, a shift to more application-specific circuits and heterogeneous hardware can be seen in both academia and industry [1]. However, their full potential is often limited by the available memory bandwidth since data often cannot be fetched fast enough [2]. This bottleneck is mainly caused by the much slower pace at which DRAM manufacturers are able to develop faster device generations. In the past, this effect was particularly pronounced for FPGAs since they had far slower external memory access than GPUs [3]. The resulting processor-memory gap has been a major challenge for more than two decades [4].

Recent advancements in 3D die-stacking and packaging technology made it possible to significantly increase the memory throughput [5]. This resulted in the creation of the High Bandwidth Memory (HBM) standard for this type of memory organization. Since then, it has been implemented in commercial GPUs and FPGAs that promise unprecedented performance. For this reason, it has recently been utilized by several application-specific accelerators for various use cases [6]–[8]. In contrast to traditional DRAM, HBM is usually lower clocked but uses wider busses and significantly more channels. However, this requires a much higher degree of memory access parallelization than before. To facilitate this use, crossbars for global addressing have been implemented on FPGAs. They considerably simplify system design and cooperation between cores with different access patterns. However, depending on their usage, these bus fabrics can also introduce severe limitations in throughput and latency [8]. Therefore, it cannot be assumed that designs scale linearly with the theoretical bandwidth when switching from traditional DRAM to HBM. This makes HBM devices in general more difficult to handle and requires architectural adjustments to accelerators.

Previous work that analyzes system behavior either targeted traditional DRAM [9], [10] or GPUs [11], [12]. These approaches did not consider the interaction challenges between FPGA accelerators and novel HBM memory interfaces. Several further papers implemented FPGA designs with HBM but only focused on workload-specific data partitionings [6], [8], [13]. As a consequence, they gave up the advantages of global addressing and did not fully explore the possibilities of this design space. In contrast, this paper presents a holistic analysis and evaluation based on the Roofline model [14] to guide designers through a transition to HBM. It eases accelerator design by more accurately estimating the expected performance in advance and offering design guidelines. Furthermore, we integrated several of these optimizations into an IP core. All in all, our main contributions can be summarized as:

• provide a comprehensive analysis of efficient HBM access
• derive architecture guidelines for HBM usage from this analysis
• provide a ready-to-use Memory Access Optimizer (MAO) IP core which eases implementing these guidelines
• prove our approach by applying the methodology to state-of-the-art accelerators

The paper is structured as follows: Section II explains HBM2 memory subsystem architectures using the example of Xilinx devices and their problems. Then, prior work is discussed in Section III. Section IV provides a more comprehensive analysis of the memory characteristics and derives necessary architecture considerations. Based on this, our methodology is evaluated with two example designs in Section V. Finally, Section VI summarizes the paper.
[Fig. 1: Structure of HBM interfaces on the example of Xilinx Virtex UltraScale+ FPGAs [15]: accelerator bus masters (BM) in the programmable logic, the memory access layer in question between them, the crossbar switches in the ASIC, memory controllers MC 0-15, and pseudo channels PCH 0-31 of the two HBM stacks, with fast and slow AXI/DDR paths marked.]

II. BACKGROUND AND PROBLEM DESCRIPTION

To understand general system-level implications of HBM usage, it is first necessary to analyze the underlying technology constraints. Fig. 1 shows the structure of such a system on the example of Xilinx Virtex UltraScale+ FPGAs [15]. To reach the promised high throughput, HBM by design uses a very high number of independent memory channels instead of increasing the frequency. These are provided by parallelizing access to one or more stacks of N DRAM chips (N-Hi) each. Hereby, every chip has two completely independent channels that are each further split into two 64 bit pseudo channels (PCH) with a common command path. Nevertheless, every PCH provides exclusive access to only its own associated memory subsection via through-silicon vias. These can then be directly used by an equal number of bus masters (BM) in an accelerator. However, this strict separation is generally not suitable for all applications. This is for example the case when the working set of a BM is bigger than its PCH capacity, as with graph algorithms where data anywhere in the memory might be accessed. Therefore, an interconnect structure is inevitable to provide more flexibility. This is called global addressing in the following. One drawback of such designs is the high complexity. Especially due to the high number of I/O pins that cover a great physical distance on the chip, extensive bus pipelining is needed to compensate for the high wire delays. These factors considerably complicate HBM interface design. For this reason, compromises between throughput, latency, die size, and flexibility are made. In fact, both Xilinx and Intel currently segment this structure into smaller crossbars [15], [16]. Due to this inherent challenge of HBM-based systems, our results obtained on the Xilinx devices presented in the following can be seen as indicative for a larger group of systems.

As depicted in Fig. 1, Xilinx devices use two 4-Hi stacks with 4 GB capacity each. This results in a total of 32 PCH which are presented to the programmable logic (PL) as just as many 256 bit AXI3 busses with half the frequency. Hereby, every two share a common memory controller (MC) to perform the AXI to DDR protocol conversion. The global addressing interconnect structure is implemented as a segmented switch network in which every four AXI slaves and two MCs are directly connected by a local crossbar switch. In contrast to Intel, these are part of the FPGA ASIC and contain lateral connections to route AXI transactions towards the destination when the address does not belong to the local switch. However, only two outgoing busses are available in every direction. Hereby, the memory capacity of every PCH is contiguously mapped into successive sections of the global address space.

Although this kind of segmentation enables the desired global addressing, it has several major drawbacks. First, it changes the latency of memory accesses since AXI transactions potentially need to be routed over several hops. This non-uniformity might be an issue for accelerators which expect data to arrive approximately at the same time. Second, the limited number of lateral connections effectively reduces the bandwidth if requests need to be routed over the same bus. Third, the used assignment of PCH address spaces makes interoperation with a CPU and common programming languages much more difficult. These usually assume that data lies, from their perspective, contiguously in memory. If this data is simply copied to HBM with such an address layout, it will be placed in the same PCH until its maximum capacity is reached. This potentially causes BMs to contend for the same PCH and therefore severely decreases latency and throughput. Overall, these issues can lead to worse results than with traditional DRAM due to the lower clock frequency.
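To make the third drawback concrete, the following minimal Python sketch (not part of the measured system; the 256 MiB per-PCH capacity follows from the 8 GB and 32 PCH stated above, while the 4 KiB interleaving block is a hypothetical choice) contrasts the default contiguous PCH mapping with a block-interleaved layout:

# Which PCH serves a given global HBM address?
PCH_COUNT = 32
PCH_CAPACITY = 8 * 2**30 // PCH_COUNT            # 256 MiB per pseudo channel

def pch_of_contiguous(addr):
    # Default layout: the PCH capacities are simply concatenated.
    return (addr // PCH_CAPACITY) % PCH_COUNT

def pch_of_interleaved(addr, block=4096):
    # Hypothetical block-interleaved layout spreading data over all PCHs.
    return (addr // block) % PCH_COUNT

# A 64 MiB buffer copied to HBM as-is lands entirely in PCH 0, so all BMs
# accessing it contend for a single channel ...
buf = range(0, 64 * 2**20, 4096)
print({pch_of_contiguous(a) for a in buf})        # {0}
# ... while a block-interleaved layout engages every channel.
print(len({pch_of_interleaved(a) for a in buf}))  # 32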
These problems need to be considered when architectures for HBM are drafted. Here, a model is required that factors in both memory throughput and compute capabilities to quickly estimate the overall system performance. One of the most illustrative ones for this purpose is the Roofline model [14]. It places algorithm implementations in a 2D graph that limits achievable performance by simple computational and bandwidth ceilings. However, these predictions can only be as good as the underlying assumptions. As the HBM throughput can widely vary, this behavior must also be accounted for. Otherwise the model would be misleading for system designers and lead to wrong architecture decisions. Therefore, a closer investigation of the performance impediments is necessary and presented in the following.

III. RELATED WORK

After the launch of HBM, its first productive use has taken place in GPUs. In the course of this, the Roofline model has also been used to predict application performance on these systems [11], [12]. However, due to their very homogeneous architecture and memory hierarchy, these models are not representative for FPGAs. To accommodate this design specialization, several papers explored Roofline modelling of application-specific accelerators. These approaches relate computational performance, memory bandwidth, and resource consumption for e.g. HLS-generated circuits [17] or CNN accelerators [10]. Although they consider external memory bandwidth an important limiting factor for design performance, they also assume that the theoretical maximum can simply always be reached. However, as shown in Section II this is not accurate enough with HBM. In contrast, Göbel et al. take this factor more into consideration by analyzing memory access traces of software that is used for HLS [9]. A similar approach has been taken by Siracusa et al. who use additional random access and gather-scatter bandwidth ceilings in the Roofline model [18]. Although an accelerator connected to HBM has also been shown, they did not further investigate it. Therefore, it is still unclear which consequences its usage has on different kinds of hardware.

Since then, HBM usage on FPGAs has also become the main focus of several publications [6]–[8], [13]. A first exploration of its characteristics on Xilinx devices has been presented by Wang et al. [13]. Although they benchmarked several basic latencies and throughputs for different MC configurations, they only investigated isolated transactions between one BM and one PCH each. This neglects the interference of concurrent bus accesses that designs can suffer from in real systems. A more accurate representation of the attainable throughput of HBM FPGAs has been presented by several application-specific accelerators [6]–[8]. Although a speedup over non-HBM systems could be shown, they also experienced the limits presented in Section II. For this reason, they manually partitioned and duplicated data such that every BM mainly accesses one PCH. However, a comprehensive exploration of the HBM subsystem characteristics that can guide developers when conceiving accelerators has not been shown.

The goal of this paper is to provide such a general methodology for HBM. We approach this in the following by first analyzing the parameters that affect the performance and deriving suitable architecture guidelines. Then we present a ready-to-use IP core which eases implementing designs adhering to these guidelines. Lastly, we prove our methodology by applying it to our own accelerators in a Roofline model.

IV. SYSTEM ANALYSIS AND DESIGN

Our HBM analysis in the following is based on measurements conducted on a Xilinx XCVU37P-2E FPGA. Due to the regularity of HBM stacks, we assume that our methodology is also applicable to other stack configurations. In general, we used the memory controller settings Wang et al. found to be the best [13]. However, in contrast to them we do not only measure point-to-point characteristics but the overall system performance under certain workloads. Since tasks are often decomposable into several more basic patterns, we used the extreme cases listed in Table I as a basis. First, accelerators can be differentiated by their access locality to memory channels. Single Channel (SC) restricts every BM to its directly attached PCH in a 1:1 port mapping. This eliminates interference from other BMs but requires data to be manually prepartitioned and possibly duplicated. Cross Channel (CC) in contrast assumes data lies globally contiguous in memory. Therefore, every BM works on every PCH in an N:M mapping. Second, the order of accesses needs to be separated. Stride (S) linearly increases the addresses by a certain length after every bus transaction. With CC it is assigned in a way such that every BM requests the globally subsequent data chunk in turn. Random Access (RA) in contrast exposes no definite pattern and scatters accesses over a larger area of the address space. However, each bus transaction works on a small (≤ 512 B) contiguous chunk of data. The four combinations of these characteristics listed in Table I are used as basic patterns in the following.

TABLE I: Used memory access patterns

          Single Channel   Cross Channel
Stride    SCS              CCS
Random    SCRA             CCRA

A. Performance Analysis

To evaluate the most efficient accelerator design, developers need to systematically analyze all parameters that affect HBM access. Therefore, we identify and present them in the following. Similar to traditional DRAM, the impact on a single channel is considered first. The chips used on the selected FPGA have a theoretical maximum bandwidth Th_sc,max of 14.4 GB/s per PCH, which leads to a total of 460 GB/s over all 32. However, as with all DRAM-based technologies, this limit cannot be reached due to the cell refresh cycles. Xilinx states the effective HBM throughput Th_sc,eff on their devices as 7-9% lower [15]. In practice it can be even lower depending on the following accelerator design choices.

The first factor is the design frequency f_acc itself. With an AXI bus width W of 256 bit, 450 MHz is needed to reach the theoretical limit. Similar to Kara et al. [8], we consider this clock challenging for many accelerators to reach timing closure with. In this case, a common alternative is to double W to provide the same throughput at half the frequency. As a consequence, more functional cores and thus more resources are used to be able to process more data in parallel. However, as stated before, the hardware is already always underutilized due to the DRAM-intrinsic throughput degradation. Furthermore, accelerators typically read and write concurrently over the independent AXI channels while the HBM DDR protocol provides only a common bidirectional bus. Therefore, the memory is usually not able to serve all AXI transactions at full clock speed anyway. For this reason, it is in many cases not efficient to parallelize processing. Instead, designs with lower frequencies are often sufficient if a proportional ratio RW_rat of concurrent read and write transactions compensates for the lower speed in one direction. Fig. 2 shows this effect on throughput for a more common 300 MHz clock. It can be seen that DRAM bus turnaround delays for concurrent reads and writes reduced the performance by only 2% compared to reported unidirectional 450 MHz references [13], [15]. This maximal value was already reached with the commonly encountered 2:1 ratio. Since the losses are small and timing closure problems occur often, we also restrict the clock to a more conservative 300 MHz in the following to further explore the effect of this trade-off. In general, it is effective to reduce the clock frequency of HBM accelerators if it is compensated by an appropriate ratio of concurrent reads and writes.

[Fig. 2: Achievable throughput when AXI reads and writes are issued with different ratios at 300 MHz.]
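As a rough sanity check of this trade-off, the per-PCH throughput can be estimated from the quantities above; the following sketch is our own simplification (bus turnaround and switch fabric effects are not modeled), not a vendor formula:

def pch_throughput_gbps(f_acc_mhz, width_bits=256, rw_ratio=(1, 0),
                        refresh_derate=0.08, th_sc_max=14.4):
    # AXI side: the saturated direction runs at f_acc * W; the opposite
    # direction adds traffic according to the read/write ratio RW_rat.
    rd, wr = rw_ratio
    axi = f_acc_mhz * 1e6 * (width_bits / 8) / 1e9 * (rd + wr) / max(rd, wr)
    # DRAM side: theoretical per-PCH limit minus the 7-9% refresh derating [15].
    mem = th_sc_max * (1 - refresh_derate)
    return min(axi, mem)

print(pch_throughput_gbps(450))                   # ~13.2 GB/s, unidirectional 450 MHz reference
print(pch_throughput_gbps(300, rw_ratio=(2, 1)))  # ~13.2 GB/s, 2:1 reads/writes at 300 MHz
# The measured ~2% turnaround penalty of the concurrent case is not modeled here.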
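For reference, the four basic patterns of Table I can be written as simple address generators; the sketch below is only a behavioral illustration with hypothetical parameters (64 B chunks, 256 MiB PCH capacity), not the traffic generators used for the measurements:

import random

def scs(m, n_txn, chunk=64, pch_capacity=256 * 2**20):
    # Single Channel Stride: BM m stays in its own PCH, addresses increase linearly.
    for i in range(n_txn):
        yield m * pch_capacity + i * chunk

def ccs(m, n_bm, n_txn, chunk=64):
    # Cross Channel Stride: globally contiguous data, BMs request subsequent chunks in turn.
    for i in range(n_txn):
        yield (i * n_bm + m) * chunk

def scra(m, n_txn, chunk=64, pch_capacity=256 * 2**20):
    # Single Channel Random Access: random small chunks within the own PCH.
    for _ in range(n_txn):
        yield m * pch_capacity + random.randrange(pch_capacity // chunk) * chunk

def ccra(n_txn, chunk=64, mem_size=8 * 2**30):
    # Cross Channel Random Access: random small chunks anywhere in the 8 GB address space.
    for _ in range(n_txn):
        yield random.randrange(mem_size // chunk) * chunk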
A further factor that impacts throughput is the used access pattern. As with all DRAM-based technologies, data must first be latched before it can be accessed. This is done at the granularity of a page of consecutive data that needs to be opened. Since accesses to open pages are faster, long continuous bursts of data requests perform better. Fig. 3 demonstrates this effect of the access pattern and burst length BL until its AXI3 limit of 16. It can be seen that length

[Figure residue: "(a) Total throughput"; panels (a) SCS and (b) CCS; rotation offsets 2, 4, 6, and 8; crossbar switches mapping masters 0-7 to channels 0-7.]
TABLE II: HBM latency comparison

                          CCS                                    CCRA
Traffic   Setup   Read (mean, σ)    Write (mean, σ)    Read (mean, σ)    Write (mean, σ)
Single    XLNX    71.8    19.8      46.3    24.6       66.5    17.7      29.1    7.9
          MAO     73.7    12.5      32.0    0.1        81.9    15.7      32.0    0.3
Burst     XLNX    3020.8  1478.8    585.4   522.9      651.8   353.5     197.3   122.2
          MAO     264.5   13.4      72.0    0.7        546.2   158.4     93.2    23.8

In the segmented switch networks described in Section II, requests are gradually transported over lateral connections. Therefore, the effective number of parallel busses that route data to the destination N_lat,eff also serves as an upper bound on throughput. Fig. 4a shows this limit by assigning every BM m through an offset i a unique PCH (m + i) mod N_ch,max. Since this is a rotation and the fabric architecture is also symmetric, offsets bigger than N_ch,max/2 were equal to a rotation in the other direction.
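The effect of such rotations can be approximated from the fabric facts of Section II (four AXI ports per local switch, two lateral busses per direction); the sketch below is our own crude estimate of how strongly the lateral busses throttle crossing traffic, not a cycle-accurate model, and multi-hop transit traffic through intermediate switches is ignored:

N_CH = 32      # pseudo channels / bus masters
GROUP = 4      # AXI ports per local crossbar switch
LATERAL = 2    # outgoing lateral busses per direction and switch

def lateral_pressure(offset):
    # BM m targets PCH (m + offset) mod N_CH; count requests leaving each switch.
    crossing = {}
    for m in range(N_CH):
        src, dst = m // GROUP, ((m + offset) % N_CH) // GROUP
        if src != dst:
            crossing[src] = crossing.get(src, 0) + 1
    # Fraction of the crossing requests a switch can forward per cycle.
    return {sw: min(n, LATERAL) / n for sw, n in crossing.items()}

for off in (0, 2, 4, 8):
    print(off, lateral_pressure(off))
# offset 0: no lateral traffic; offset 2: the two lateral busses suffice;
# offsets >= 4: every request crosses and the usable parallelism is roughly halved.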
TABLE IV: HBM throughput comparison [GB/s]

        CCS                                        CCRA
        XLNX          MAO            Speedup       XLNX           MAO            Speedup
RD      9.6 (2.1%)    307 (66.7%)    32.0×         36.0 (7.8%)    134 (29.1%)    3.72×
WR      9.6 (2.1%)    307 (66.7%)    32.0×         48.0 (10.4%)   144 (32.3%)    3.00×
Both    13.0 (2.8%)   414 (90.0%)    40.6×         70.4 (15.3%)   266 (57.8%)    3.78×
[Fig. 7: Roofline models (performance in GOPS/s) with memory bandwidth ceilings for the XLNX fabric and for MAO; port configurations 4, 8, 16, and 32 with and without MAO; speedups of 18.4× and 28.5× marked.]
TABLE V: Overview of Matrix Multiplication Accelerators

Parameter             Accelerator A                          Accelerator B
RW_rat                2:1                                    M_h : 1 (M_h ≥ 2)
f_acc                 300 MHz                                300 MHz
P                     4      8      16      32               4      8      16     32
OpI [OPs/B]           42     84     167     328              2      2      2      2
C_comp [GOPs/s]       2458   9831   39322   157286           68     137    274    547
Util    Core          14%    56%    223%    895%             3%     6%     12%    24%
        Core+MAO      36%    79%    245%    917%             25%    28%    34%    46%
SU      HBM           —      2×     3.9×    7.7×             —      1×     1×     1×
        HBM+MAO       4.6×   18.4×  73.8×   248.2×           3.6×   7.1×   14.3×  28.5×
Benefit HBM           +                                      ◦
        HBM+MAO       ++                                     +++

... resources. However, these must also be consistently provided with the necessary data. This is only possible due to the much higher memory bandwidth HBM offers by increasing the number of BMs P. Therefore, P directly corresponds to the degree of compute parallelization. Accelerator A scales with the PE array dimensions PA_w × PA_h. Accelerator B can be scaled in multiples of adder trees PA_h. Additionally, Table V also lists the operational intensity OpI, the computational ceiling C_comp, the speedup SU compared to the baseline P = 4, as well as the FPGA resource utilization Util. C_comp is calculated by summing all operations executed by the accelerator and dividing the sum by the runtime. OpI is calculated by dividing all operations by all bytes read and written by the accelerator. It can be seen that C_comp increases with P, but at the cost of higher Util. Therefore, this growth is ultimately limited by the FPGA capacity. Whether this additional computing power can be exploited further depends on the real achievable memory throughput. These factors are analyzed in the following.

As shown before, memory requests consisting of long bursts with many outstanding transactions are beneficial for a high throughput. Therefore, both cores immediately request as much data as possible to process all matrix values. This behavior fulfills both requirements as long as the matrices are big enough. In a heterogeneous system where accelerators interact with other cores, data can often not be partitioned in a way that the memory access from all of them is optimal. Therefore, we assume an allocation strategy where every matrix is contiguously stored in memory without gaps in the address space. While this CCS pattern simplifies design, resource contention is expected as analyzed in Section IV, since many ports access the same PCH in parallel. This bottleneck is later mitigated without additional design effort with our MAO IP core.

Fig. 7a shows the Roofline model of Accelerator A for four different configurations of P. For this implementation, the operational intensity not only depends on the problem size, but also on the accelerator implementation. With a bigger PE array more values can be kept in local storage. Therefore, data must be reloaded less often and more operations can be performed on the same number of bytes on average. To check the overall achievable memory throughput of the accelerator, we have to take a look at RW_rat as well as f_acc (see Table V). As derived in Section IV, we estimate the maximal achievable memory throughput to be about 13 GB/s for the access pattern of Accelerator A in a system without MAO. In contrast, with MAO we expect an increase to about the maximum HBM throughput of 416 GB/s. Then we measured the actual throughput to see if our estimation holds up. The results for P = 32 amount to 12.55 GB/s without and 403.75 GB/s with the MAO IP core. This is about 3% off from what we estimated for both cases. Therefore, our model is sufficiently accurate for this early design space exploration. Furthermore, it can be seen that without optimized memory access the accelerator is memory bound for all configurations. The measured maximal throughput of 12.55 GB/s also shows that without any optimization HBM does not bring any improvements here, as this could also be achieved with traditional non-stacked DRAM. Therefore, an optimization of the memory access pattern is crucial. When using our MAO IP core we were able to improve the accelerator performance by up to 32.2× compared to unoptimized access. Here, the system became compute bound again for P < 32 even with MAO. For P = 32 the accelerator is still memory bound, but as it utilizes the highest amount of available bandwidth (403.75 GB/s) it should be chosen for an implementation using HBM. However, when the resource usage is also considered, neither P = 16 nor P = 32 fit on our FPGA. Therefore, the P = 8 configuration is the best achievable, with a memory throughput of 116 GB/s and a SU of 18.4×. At this point the accelerator is compute bound and does not use the HBM benefits completely.
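The placements in Fig. 7a follow directly from the standard Roofline formula of [14], attainable performance = min(C_comp, OpI · BW); the sketch below recomputes them for Accelerator A from the Table V values, using the roughly 13 GB/s estimate without MAO and the 414 GB/s CCS maximum of Table IV as bandwidth ceilings:

def attainable_gops(c_comp, op_i, bw_gbps):
    # Roofline model [14]: performance is capped by compute or by memory bandwidth.
    return min(c_comp, op_i * bw_gbps)

acc_a = {4: (2458, 42), 8: (9831, 84), 16: (39322, 167), 32: (157286, 328)}
for p, (c_comp, op_i) in acc_a.items():
    print(p, attainable_gops(c_comp, op_i, 13),    # XLNX fabric: memory bound for every P
             attainable_gops(c_comp, op_i, 414))   # with MAO: compute bound for P < 32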
Nevertheless, it still achieves a high performance due to the grid structure that enables a high internal data reuse. Therefore, it is less susceptible to external memory bottlenecks. While this kind of accelerator seems very promising, the improvement gained by using HBM is not as high as expected, as the most beneficial configurations are not implementable on current FPGAs.

The Roofline model for Accelerator B is depicted in Fig. 7b. Here, OpI only depends on the matrix size and therefore does not change with P. Only C_comp differs, as it depends on P and f_acc. According to our analysis in Section IV, the maximum achievable bandwidth of Accelerator B is inherently limited by f_acc to roughly 2/3 of the maximum throughput, caused by the core running at 300 MHz while not having a 2:1 read-write ratio. In a system without MAO we estimate a maximum overall throughput of about 10 GB/s. In contrast, with MAO an increase to about 67% of the maximum throughput (277 GB/s) is to be expected again. Measuring the actual throughput of this accelerator yields 9.59 GB/s for unoptimized memory access, which is only 4% off from our estimation. With our MAO IP core it could be greatly increased to 273 GB/s, which is only about 2% off from our estimation. Our model was again sufficiently accurate. It can be seen that the maximal achievable performance for this accelerator is only about 6% of that of the P = 8 configuration of Accelerator A. In general, all configurations are memory bound with an unoptimized memory access. When using our MAO IP core, we could again measure an improvement of 28.5×. In this case all configurations became compute bound. However, the P = 32 configuration is less than 0.1% away from the memory ceiling. Therefore, this design is already very close to its overall optimum. When looking at its corresponding resource allocation, we see that it is less than half as big as the P = 8 configuration of Accelerator A. Therefore, a design using Accelerator B and the MAO IP core can be implemented on the FPGA. This shows that this type of accelerator can really benefit from HBM and an optimized memory access.

While Accelerator B profits the most from the transition to HBM with the least cost of hardware resources, for raw matrix multiplication performance Accelerator A should be implemented. A potential improvement for Accelerator B could be to rework the design to achieve a higher f_acc. Furthermore, future FPGAs with more HBM stacks and therefore a higher memory throughput would make it possible to increase C_comp even further. For Accelerator A the design could be optimized to better exploit the available throughput with a smaller design, for example by applying a local buffer structure to redistribute values and scale the PE array linearly.

VI. CONCLUSION AND FUTURE WORK

This paper analyzed the behavior of HBM and the influence of accelerator and bus fabric design decisions on it. It further provided a simple model to perform first performance estimations. With two example cores, we could show that our estimates and derived design guidelines were a benefit for system designers. In the future this data can be used to develop additional SystemC models aiming for higher accuracy.

ACKNOWLEDGMENT

This work has been supported by the Xilinx University Program (XUP) with the Virtex UltraScale+ HBM FPGA chip.

REFERENCES

[1] R. S. Williams, "What's Next? [The End of Moore's Law]," IEEE Comput. Sci. Eng. Mag., vol. 19, no. 2, pp. 7–13, Mar. 2017.
[2] N. S. Kim, D. Chen, J. Xiong, and W.-m. W. Hwu, "Heterogeneous Computing Meets Near-Memory Acceleration and High-Level Synthesis in the Post-Moore Era," IEEE Micro, vol. 37, no. 4, pp. 10–18, 2017.
[3] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, "Understanding Performance Differences of FPGAs and GPUs," in IEEE 26th Annu. Int. Symp. Field-Prog. Custom Comput. Mach. (FCCM), 2018, pp. 93–96.
[4] M. V. Wilkes, "The Memory Gap and the Future of High Performance Memories," SIGARCH Comput. Archit. News, vol. 29, no. 1, p. 27, Mar. 2001.
[5] H. Jun, J. Cho, K. Lee, H. Son, K. Kim, H. Jin, and K. Kim, "HBM (High Bandwidth Memory) DRAM Technology and Architecture," in IEEE Int. Memory Workshop (IMW), May 2017.
[6] G. Singh, D. Diamantopoulos, C. Hagleitner, J. Gomez-Luna, S. Stuijk, O. Mutlu, and H. Corporaal, "NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling," in 30th Int. Conf. Field-Prog. Log. and Applicat. (FPL), 2020, pp. 9–17.
[7] R. Ben Abdelhamid and Y. Yamaguchi, "A Block-Based Systolic Array on an HBM2 FPGA for DNA Sequence Alignment," in Appl. Reconfigurable Comput. Archit., Tools, and Applicat., F. Rincón, J. Barba, H. K. H. So, P. Diniz, and J. Caba, Eds. Cham: Springer International Publishing, 2020, pp. 298–313.
[8] K. Kara, C. Hagleitner, D. Diamantopoulos, D. Syrivelis, and G. Alonso, "High Bandwidth Memory on FPGAs: A Data Analytics Perspective," in 30th Int. Conf. Field-Prog. Log. and Applicat. (FPL), 2020.
[9] M. Göbel, A. Elhossini, and B. Juurlink, "A Methodology for Predicting Application-Specific Achievable Memory Bandwidth for HW/SW-Codesign," in Euromicro Conf. Digital Syst. Des. (DSD), Aug. 2017, pp. 533–537.
[10] C. Park, S. Park, and C. S. Park, "Roofline-Model-Based Design Space Exploration for Dataflow Techniques of CNN Accelerators," IEEE Access, vol. 8, 2020.
[11] M. Ujaldón, "HPC Accelerators with 3D Memory," in IEEE Int. Conf. Comput. Sci. and Eng. (CSE) and IEEE Int. Conf. Embedded and Ubiquitous Comput. (EUC) and 15th Int. Symp. Distrib. Comput. and Applicat. for Business Eng. (DCABES), Aug. 2016, pp. 320–328.
[12] N. Ding and S. Williams, "An Instruction Roofline Model for GPUs," in IEEE/ACM Perform. Model., Benchmarking and Simul. High Perform. Comput. Syst. (PMBS), 2019, pp. 7–18.
[13] Z. Wang, H. Huang, J. Zhang, and G. Alonso, "Shuhai: Benchmarking High Bandwidth Memory on FPGAs," in IEEE 28th Annu. Int. Symp. Field-Prog. Custom Comput. Mach. (FCCM), 2020, pp. 111–119.
[14] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[15] Xilinx, Inc., "AXI High Bandwidth Memory Controller," Jul. 2020. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/hbm/v1_0/pg276-axi-hbm.pdf
[16] Intel Corporation, "High Bandwidth Memory (HBM2) Interface Intel FPGA IP User Guide," Dec. 2020. [Online]. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-20031.pdf
[17] B. da Silva, A. Braeken, E. H. D'Hollander, and A. Touhafi, "Performance Modeling for FPGAs: Extending the Roofline Model with High-Level Synthesis Tools," Int. J. Reconfig. Comput., vol. 2013, Jan. 2013.
[18] M. Siracusa, M. Rabozzi, E. Del Sozzo, L. Di Tucci, S. Williams, and M. D. Santambrogio, "A CAD-based Methodology to Optimize HLS code via the Roofline Model," in IEEE/ACM Int. Conf. Computer Aided Des. (ICCAD), 2020.
[19] M. Wissolik, D. Zacher, A. Torza, and B. Day, "Virtex UltraScale+ HBM FPGA: A Revolutionary Increase in Memory Performance," Jul. 2019. [Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp485-hbm.pdf