PSDD Accelerator
1 INTRODUCTION
Probabilistic Sentential Decision Diagrams (PSDDs) were recently proposed to model distributions over structured
probability spaces that are defined by massive logical constraints [22, 36]. Traditionally, probability distributions
are modeled using graphical models such as Bayesian networks (BNs) [15, 23, 26, 28]. A BN employs a directed
acyclic graph (DAG) to capture dependencies among random variables. In the presence of massive logical
constraints, which naturally arise in many domains, the DAG can become too highly-connected to allow efficient
reasoning and learning in real world applications. PSDDs, on the other hand, provide a tractable representation
of probability distributions in this case because they are based on a sophisticated representation of logical
constraints known as Sentential Decision Diagrams (SDDs), which generalize and can be exponentially smaller
than Ordered Binary Decision Diagrams (OBDDs) [2, 16]. The effectiveness of PSDDs has been demonstrated in
numerous real-world applications with massive logical constraints. As an example, a classical Naive Bayesian
classifier of a board game trace requires 362,879 parameters, whereas a PSDD needs 1,793 parameters [10].
Other successful examples of PSDDs include learning user preferences [8], anomaly detection [10], and route
distribution modeling [9, 36]; see [17] for a survey.
PSDDs are normally synthesized from a combination of data and symbolic knowledge in the form of logical
constraints. For example, in [36], PSDDs represent a distribution over routes, which are modeled as paths in a
graph. A path is encoded as a binary instantiation over the edge variables: a variable is set to 1 if its edge is
used on the path and 0 otherwise. Some instantiations of the variables correspond to invalid paths in the graph
(disconnected edges), and this leads to logical constraints. To learn a probability distribution over the valid paths,
one can first construct a PSDD from the logical constraint among the edge variables. Then we can estimate the
weights of the PSDD from data consisting of the routes taken by travelers.
PSDDs can also be synthesized from graphical models such as Bayesian networks [35], positioning them as
an inference tool for such models. Since PSDDs are synthesized, they can be very large, and they exhibit some
strong properties in comparison to other tractable circuit representations such as Sum Product Networks [30]
and Cutset Networks [31] (which are normally handcrafted or learned from data). All these models are variants
of Arithmetic Circuit (AC) representations of probability distributions [14], which allow inference in time linear
in the circuit size; see [18] for a recent survey of these circuit representations.
A field-programmable gate array (FPGA) is a high-performance, energy-efficient reconfigurable platform that
has accelerated many probabilistic inference problems. Examples include: Sum Product Networks [33, 37, 38],
Bayesian Markov chain Monte Carlo model inference [45], Bayesian Computing Machines [25], Bayesian Neural
Networks [3], and Bayesian Inference with Arithmetic Circuits [19, 27, 43]. But to our knowledge, there is no
previous work that accelerates PSDDs.
We accelerate PSDDs on the Xilinx Alveo U250 FPGA Acceleration Card [41]. We implemented the two
commonly used PSDD queries: the probability of the most probable explanation (MPE_p) query and the marginal (MAR)
query. In order to reduce the development effort, we program in C/C++ and generate the bitstream using a high-level
synthesis (HLS) tool [13, 40]. But we found that accelerating the PSDD kernel in HLS presents a unique set
of challenges. Since we construct a PSDD graph from a BN, the sparseness of the node connections leads to each
node having only a small number of child nodes. This property increases the proportion of a loop's pipeline
epilogue/prologue overhead relative to the loop's length (details in Section 3). We could try the edge-centric processing
scheme [32, 44] to solve this problem, but a straightforward implementation in HLS causes new dependency
issues and worsens the initiation interval (II) of the pipelined loops (details in Section 3). Moreover, in order
to perform a query on a real-world application with massive logical constraints, we process a large graph that
may not fit on an FPGA. Thus, we have to make the processing elements (PEs) configurable for irregular tree
connection patterns, but this complicates the parallelization process.
In this paper, we present optimization techniques to solve these problems. We propose a novel HLS-based
edge-centric processing scheme that achieves an II of 1 for a PSDD query with dependency issues. The scheduling
for this scheme is lightweight and can be done in a few seconds. Moreover, we exploit multiple levels of parallelism
that can be applied even if the entire graph cannot fit on an FPGA. The proposed optimizations are fully compatible
with HLS, and the design was verified on-board. Compared to related works whose performance largely depends
on the size of the graph, our work better retains performance across datasets of various sizes.
2 BACKGROUND
2.1 Probabilistic Sentential Decision Diagrams
Figure 1a shows an example PSDD, which is based on a Boolean circuit known as a Sentential Decision Diagram
(SDD) [16] that is annotated with probabilistic parameters. The PSDD is composed of fragments shown in
Figure 2. In an internal fragment, the OR-gate can have an arbitrary number of inputs. Each child of the OR-gate
is associated with a parameter α_i. The AND-gates have precisely two inputs each. Each left child p_i is called a
prime, and each right child s_i is called a sub. Figure 1a highlights some PSDD fragments.
(a) PSDD
(b) Vtree
Fig. 1. Figure 1a shows an example PSDD. Each fragment is boxed: internal fragments are surrounded by an empty red box,
and leaf fragments are surrounded by shaded boxes. Further, some fragments are indexed for reference purposes, and the index
is indicated on the top left of the box. Figure 1b shows the vtree that the PSDD conforms to.
Fig. 2. Four types of PSDD Fragments: internal (2a) and boundary, which can be a positive literal (2b), a negative literal (2c)
or a simple OR-gate (2d). A PSDD can also have leaf nodes representing false, but we omit these as they are not necessary
for our acceleration.
Each PSDD conforms to a tree of variables, called a vtree [29]. A vtree is a binary tree whose leaves are the
circuit variables (Figure 1b). The conformity is roughly as follows. For each internal fragment with primes p_i and
subs s_i, there must exist a vtree node v where the variables of each prime p_i appear in the left child of v, and the
variables of each sub s_i appear in the right child of v. For example, the root fragment of the PSDD in Figure 1a
conforms to vtree node 3 in Figure 1b. Each prime of this PSDD fragment, fragments 2 and 4, conforms to vtree
node 1. Each sub, fragments 3 and 5, conforms to vtree node 5.
2.3.2 MAR. The marginal query (MAR) calculates the probability of an instantiation over a subset of variables
E ⊆ X and is one of the most common probabilistic queries. For example, in medical diagnosis, the model describes
a probability distribution that assigns a probability to every possible symptom and disease. The marginal query
can be used to compute the probability of a particular symptom or the probability of a disease given a symptom
(with Bayes conditioning).
For a MAR query, the internal fragment in Figure 2a evaluates to

    Σ_i α_i × value(p_i) × value(s_i).    (2)
The difference from the MPE query in Eq. 1 is that a summation operation (instead of the max operation)
replaces the OR-gate of the AC.
When evaluating the marginal probability at evidence e, leaf literals are set in a similar way to the MPE query.
Every leaf literal compatible with e is set to 1 and every leaf literal incompatible with e is set to 0. Consider again
the PSDD in Figure 1a. Given evidence B = 1, literal ¬B is set to 0 and all other literals are set to 1. Evaluating the
AC yields the following values for fragments:
Fragment ID | 7    | 6    | 5    | 4    | 3    | 2    | 1
Value       | 1.00 | 1.00 | 1.00 | 0.25 | 1.00 | 0.33 | 0.29
Thus the marginal probability of B = 1 is 0.29.
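As a worked check, the root (fragment 1) combines its two elements with the parameters 0.6 and 0.4 shown in Figure 1a: 0.6 × 0.33 × 1.00 + 0.4 × 0.25 × 1.00 ≈ 0.29. The following is a minimal host-side C++ sketch of this bottom-up evaluation (a reference implementation only, not the accelerator kernel; the PsddNode layout, the bottom-up node ordering, and the array names are our assumptions):

    #include <vector>

    struct PsddNode {
      std::vector<int>   prime, sub;  // child indices (empty for leaf literals)
      std::vector<float> alpha;       // parameter of each (prime, sub) element
    };

    // Nodes are assumed to be stored in bottom-up (reverse topological) order,
    // with the root last; leaf_value[n] holds 1.0/0.0 for each leaf literal.
    float evaluate_mar(const std::vector<PsddNode>& nodes,
                       const std::vector<float>& leaf_value) {
      std::vector<float> value(nodes.size(), 0.0f);
      for (size_t n = 0; n < nodes.size(); ++n) {
        if (nodes[n].alpha.empty()) {          // leaf literal: compatible with e?
          value[n] = leaf_value[n];
          continue;
        }
        float sum = 0.0f;                      // Eq. (2): sum over the elements
        for (size_t k = 0; k < nodes[n].alpha.size(); ++k)
          sum += nodes[n].alpha[k] * value[nodes[n].prime[k]]
                                   * value[nodes[n].sub[k]];
        value[n] = sum;                        // replace the sum by max() for MPE
      }
      return value.back();                     // root value = marginal probability
    }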
The array prob[] is accessed very frequently: we need to fetch the probability for primes and subs (line 5)
and update the result (line 8). Thus, prob[] is stored in the FPGA internal memory. Moreover, a considerable
amount of memory is required to process large networks. The natural choice is to assign prob[] to Ultra-RAM
(URAM) [42], which is the largest internal memory resource in Alveo U250 (45 MB of URAM versus 12 MB of Block-RAM (BRAM)).
The most commonly used optimization techniques for accelerating an HLS kernel are pipelining and unrolling
[40]. One could consider pipelining the loop in line 1 of Fig. 3 and unrolling the loop in line 3; but edge_num[]
is a variable, which makes it difficult for HLS compilers to determine the unrolling factor. Another option is to
pipeline the loop in line 3; the code after applying the pipeline compiler directive (#pragma HLS pipeline) is
shown in Fig. 3. We will refer to this code as the baseline implementation.
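Fig. 3 is not reproduced in this excerpt, so the following is a minimal sketch of what such a node-centric baseline looks like in HLS C++, loosely keyed to the line references above (the array names edge_num[], prime[], sub[], and alpha[], and the assumption that leaves occupy the front of the node array, are ours; float stands in for the 32-bit integer format used on-board):

    // Node-centric baseline: the inner loop restarts once per node.
    NODES: for (int n = num_leaves; n < num_nodes; n++) {      // line 1
      float result = 0.0f;
      EDGES: for (int e = 0; e < edge_num[n]; e++) {           // line 3
    #pragma HLS pipeline II=1                                  // line 4
        float p = prob[prime[n][e]];                           // line 5
        float s = prob[sub[n][e]];
        result += alpha[n][e] * p * s;   // MAR; use max() for MPE
      }
      prob[n] = result;                                        // line 8
    }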
To test the baseline implementation, we utilize a PSDD dataset compiled from the Mastermind network using
the method described in [35]. The Mastermind network is a commonly used Bayesian network that models the
Mastermind game. This network exhibits local structure [5] that can be exploited when compiling the PSDD.
Unlike general PSDD graphs, each node in the Mastermind network has at most two child nodes because it is
synthesized from a BN [35]. But we found that such sparsity in the network causes a severe negative effect on
the performance of an HLS-based implementation.
The problem we faced was the low processing rate of the pipelined loop. Even after adding the pipeline pragma
(line 4 of Fig. 3), the average processing rate on the Mastermind dataset turned out to be only 0.06 nodes per
cycle. This is because in the Mastermind dataset, each node has only 1.1 prime and sub child nodes on average.
Each time the innermost loop is invoked, it incurs 12 cycles of pipeline epilogue/prologue overhead. If
the average loop trip count is only 1.1, this overhead dominates the time spent in actual computation.
To solve this problem, we refactor the code to be edge-centric [32, 44]. That is, we flatten the outer loop in
line 1 of Fig. 3 with the inner loop in line 3, and we iterate on the index of the operations. The modified HLS
code is shown in Fig. 4. Now there is only a single loop that traverses all the edges in a bottom-up
fashion (line 1 of Fig. 4). Even if each parent node has only a few child nodes, the loop no longer suffers from the
repetitive pipeline epilogue/prologue overhead cycles.
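As with Fig. 3, the edge-centric code of Fig. 4 is not reproduced here; a minimal sketch of the flattened loop follows (the per-edge arrays parent[], prime[], sub[], and alpha[], and the first_edge[] flag, are our assumptions):

    // Edge-centric rewrite: one flattened loop over all edges, stored in
    // bottom-up order, with no per-node loop restart.
    EDGES: for (int e = 0; e < num_edges; e++) {               // line 1
    #pragma HLS pipeline II=1
      float p   = prob[prime[e]];                              // child reads
      float s   = prob[sub[e]];
      float acc = first_edge[e] ? 0.0f : prob[parent[e]];      // third read
      prob[parent[e]] = acc + alpha[e] * p * s;  // read-modify-write on parent
    }

Note that every iteration performs three reads and one write on prob[]; this becomes the port-limitation problem discussed below.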
Although the new processing scheme has the potential to achieve a higher processing rate, the naive HLS
implementation in Fig. 4 has several new problems. One of them is the overhead of storing more data. We need
the index array parent[] to keep track of the parent of each edge (lines 3 and 5 of Fig. 4). This is a minor issue, and it can
be easily solved with additional memory.
The next issue is the dependency problem. Reading a probability, performing the multiplication and the max/add
operations, and writing the probability back take several cycles of latency; for an integer variable, it took 12 cycles.
This is a true dependency, because a probability written to a parent node may be read as a child prime/sub node
probability in subsequent iterations. Forcing the HLS tool to ignore this dependency results in a read-after-write
Fig. 5. A possible dependency problem for Fig. 1a after adopting edge-centric processing in Fig. 4
(RAW) hazard. An example is illustrated in Fig. 5: for the PSDD in Fig. 1a, the output (node 2) after processing
nodes 6 and 7 will be written 12 cycles later. The RAW hazard occurs if the probability for node 3 is read before
the updated value is written. This challenge will be addressed in Section 4.1.
Moreover, the naive edge-centric processing scheme has a local memory port limitation problem. Even if the dependency
problem is somehow solved, the loop in Fig. 4 cannot be pipelined to an II of 1 because the probability array (colored red) is
read three times (from addresses prime[e], sub[e], and parent[e]) and written once (to address parent[e])
every iteration. We cannot solve this problem with the true dual-port mode [42] because the prime[e] and sub[e]
addresses may be different; true dual-port mode only supports two independent addresses. The array partitioning
technique [12, 40] also does not help because we cannot guarantee that the read addresses and the write address
will always be different. We will discuss the solution to this local memory port problem in Section 4.2.
The last issue is the lack of adequate parallelism. To increase parallelism, we could consider partial unrolling
[11] of the loop in line 1 of Fig. 4. But such an approach also causes a dependency problem because the primes
and subs of an iteration may be written in the next iteration. For a small graph, we can resolve this issue by
exploiting operation-level parallelism: that is, we could map the entire AC onto the FPGA similar to [37, 43].
But this is not feasible for Mastermind because we operate on a large graph with 42,558 nodes. It is possible to
implement a part of the graph on the FPGA, but the irregularity of the graph makes it difficult to supply the
operands to each OR/AND-gate without stalling. We will explain how to solve this problem in Section 5.
Fig. 6. Removing the inter-depth dependency problem in Fig. 5 with bubble insertion
Fig. 7. The HLS code after applying the depth-batched static scheduling (Section 4.1) and the common parent clustering
(Section 4.2) [blue variables are the decoded static schedule, red variables are the accesses to the probability array, and the green
directive allows the programmer to manage the dependency]
For the example in Fig. 1a, node 1 is assigned depth 0; nodes 2, 3, 4, and 5 are assigned depth 1; and nodes 6 and 7 are
assigned depth 2. Then we batch-process all nodes of the same depth. Since we take a bottom-up approach, the
nodes at the largest level (deepest depth) are processed first in a batch, then the nodes at the next level up
are batch-processed, and so on.
After this process, all the primes and subs writing to a common parent node will have the same depth and can
be easily controlled to avoid the dependency problem (more details in the common parent clustering approach in
Section 4.2). The only RAW hazard now remaining is the inter-depth RAW hazard; an example was illustrated in
Fig. 5, where node 3 (depth 1) is read before the probability write in depth 2 is completed. This issue is resolved
by adding bubbles (no-ops) into the computation slots that have the inter-depth dependency problems (Fig. 6).
PEs will wait until the conflicting write operation to node 3 is resolved. The computation pipeline is stalled as a
result, but because we batch-process each depth, the proportion of the stall is small compared to the number of
nodes processed in each depth. In the Mastermind dataset, the proportion of bubble cycles is only 0.11%.
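The schedule construction itself is lightweight. A minimal offline sketch follows, assuming edges have already been grouped by depth and that LATENCY is the pipeline write latency (12 cycles in our design); the Edge layout is our assumption, and a real implementation would insert only as many bubbles as each depth boundary actually requires rather than this conservative fixed pad:

    #include <vector>

    struct Edge { int prime, sub, parent; float alpha; bool bubble; };

    // Emit edges depth by depth (deepest first) and pad each depth boundary
    // with bubbles so that all writes of depth d land before depth d-1 reads.
    std::vector<Edge> build_schedule(const std::vector<std::vector<Edge>>& by_depth,
                                     int max_depth, int LATENCY) {
      std::vector<Edge> schedule;
      for (int d = max_depth; d >= 0; --d) {
        for (const Edge& e : by_depth[d]) schedule.push_back(e);
        if (d > 0)                         // no padding needed after the root
          for (int i = 0; i < LATENCY; ++i)
            schedule.push_back(Edge{0, 0, 0, 0.0f, /*bubble=*/true});
      }
      return schedule;
    }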
Fig. 8. II reduction to 1 after applying the probability array separation technique (Section 4.2)
Fig. 7 presents the HLS kernel code after applying static scheduling. We pack the bubble instruction, the node
indices of primes/subs, and the weights into the array edge_schedule[]. Then the array is read and decoded in the
PEs (lines 6-7 of Fig. 7). The bubble instruction stops the probability storage from being updated (lines 9-12 of
Fig. 7). Also, we add the dependence pragma (#pragma HLS dependence ... inter false) on array prob[] in line 4 of Fig. 7 to inform the HLS
tool that the dependency is now managed by the programmer. The static schedule is accessed once per edge per
query (the schedule is reused under the query-level parallelism of Section 5.2), so it is stored in the external DRAM and passed to the processing elements in a streaming fashion.
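Since Figs. 7 and 8 are not reproduced here, the following is a minimal sketch of the resulting PE pipeline under our assumptions: prob_p[] and prob_s[] are duplicated copies of the probability storage so that the prime read, the sub read, and the parent write each get a memory port of their own, and the common parent clustering keeps an element's running sum in a register (the edge_t field names are hypothetical):

    #include "hls_stream.h"
    #define MAX_NODES 65536   // capacity assumption

    struct edge_t {
      int prime, sub, parent; float alpha;
      bool bubble, first_of_parent, last_of_parent;
    };

    void pe(hls::stream<edge_t>& sched, int schedule_len,
            float prob_p[MAX_NODES], float prob_s[MAX_NODES]) {
    #pragma HLS dependence variable=prob_p inter false  // safety guaranteed offline
    #pragma HLS dependence variable=prob_s inter false
      float acc = 0.0f;
      SCHED: for (int i = 0; i < schedule_len; i++) {
    #pragma HLS pipeline II=1
        edge_t e = sched.read();                     // decoded static schedule
        float v = e.alpha * prob_p[e.prime] * prob_s[e.sub];
        acc = e.first_of_parent ? v : (acc + v);     // register-held running sum
        if (!e.bubble && e.last_of_parent) {         // bubbles skip the update
          prob_p[e.parent] = acc;                    // broadcast the write to
          prob_s[e.parent] = acc;                    // both duplicated copies
        }
      }
    }

With the parent's running sum held in a register and the two remaining reads served by separate copies, each physical memory performs at most one read and one write per cycle, which is what allows the loop to reach an II of 1 (Fig. 8).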
The order of processing is the same for the same dataset, so we can generate the static schedule offline and
reuse it for various different PSDD queries. Since it is determined offline, the proposed solution does not increase
the hardware cost. Also, whereas the ILP-based approach [24] typically takes tens of minutes to determine the
schedule, our approach can be finished in a matter of a few seconds because assigning a depth to all nodes in a
tree has a low complexity (more details to be presented in Sections 5.1 and 6.2).
In addition to resolving the dependency problem, the static scheduling scheme provides another benefit of
reducing the probability storage. Rather than allocating one node's probability to each address in the array
prob[], we time-share the array. This is possible because the address space for nodes that will no longer be
accessed can be reused to store other nodes' probabilities. For the example in Fig. 1a, the probability for node 3
only needs to be stored in a memory space from the clock cycle when nodes 6 and 7 are processed to the clock
cycle when node 1 is processed. The space allocated for node 3 can be used for other nodes in all other clock
cycles. This technique reduces the node storage for the Mastermind dataset by 81%.
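Because the schedule is static, the live range of every node is known offline, and the address assignment reduces to greedy interval reuse. A minimal sketch under our assumptions (birth[n] is the schedule slot that writes node n, last_use[n] is the slot of its final read, and by_birth lists nodes in increasing birth order):

    #include <functional>
    #include <queue>
    #include <stack>
    #include <vector>

    // Greedy interval reuse: an address is recycled once the last read of its
    // previous occupant has passed; the peak live-node count bounds the storage.
    std::vector<int> assign_addresses(const std::vector<int>& birth,
                                      const std::vector<int>& last_use,
                                      const std::vector<int>& by_birth) {
      std::vector<int> addr(birth.size(), -1);
      std::priority_queue<std::pair<int, int>,
                          std::vector<std::pair<int, int>>,
                          std::greater<>> live;    // (last_use, address), min-first
      std::stack<int> free_list;
      int fresh = 0;
      for (int n : by_birth) {
        while (!live.empty() && live.top().first < birth[n]) {
          free_list.push(live.top().second);       // occupant is dead: recycle
          live.pop();
        }
        int a;
        if (free_list.empty()) { a = fresh++; }
        else { a = free_list.top(); free_list.pop(); }
        addr[n] = a;
        live.push({last_use[n], a});
      }
      return addr;   // total addresses used = fresh (81% smaller for Mastermind)
    }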
If the MPE and MAR queries were implemented on separate PEs, we could insert fewer bubbles for an MPE query
because the max operation is simpler than addition (6 vs. 12 cycles of latency). But instead, we apply the same
static schedule to both types of queries because we utilize the same pipeline architecture. We wanted to
quickly switch between different types of queries without reprogramming the FPGA, which helps decrease the
computation latency when diverse types of queries arrive.
5 PARALLELIZATION
The work in [37, 43] achieves large operation-level parallelism by mapping the entire graph onto the FPGA. But
this approach has two limitations. First, the amount of available parallelism decreases when the graph size is small.
Second, it is difficult to parallelize a large graph that cannot fit on the FPGA. The Mastermind dataset suffers
from the second problem since it is composed of 42,558 nodes. Therefore, we need other types of parallelism to
increase the throughput. In this section, we describe two solutions to overcome this problem.
Next we need to determine how to assign the nodes to different PEs. Initially, we equally partitioned
the nodes in each level and assigned them to different PEs. However, we found that 68% of the instructions in
Mastermind's static schedule were either sending or receiving data from other PEs. Even though equal load
balancing was achieved, the parent nodes in a different level were not guaranteed to be in the same PE.
We solved this problem with a graph coarsening technique. The new PE assignment process is shown in
Algorithm 1. Given a PSDD graph, we first determine the tree depth of each node with topological sorting (line 3).
Then we traverse in a bottom-up fashion (lines 4 and 5) and try to coarsen the clusters of nodes that share the
same parent node (lines 6 to 22). For each cluster at depth d (line 7), we look up the clusters at depth d − 1 for
common parent nodes (lines 10 and 11). If found, the common parent node and the child nodes are grouped
together (line 13). This process can naturally be combined with the common parent clustering method described
in Section 4.2. The coarsening continues until the size of the cluster reaches the limit N/S (line 12), where N
is the number of all nodes and S is the subtree-level parallelism factor. If the size limit is reached, the cluster is
appended to the next depth cluster vector without further coarsening (line 22). If a cluster can find no other
clusters with a common parent node, the parent nodes are merged with the cluster and appended to the next
depth cluster vector (line 20).
After obtaining a list of coarsened clusters for depth 0, we perform multi-way partitioning. Similar to the
largest processing time (LPT) algorithm [20], we sort the clusters in descending order of size (line 23). The
nodes in the sorted clusters are inserted into whichever of the S bins has the smallest sum of nodes after insertion
(line 26). All nodes are assigned a PE after this process.
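A minimal sketch of this LPT-style multi-way partitioning (the function name and the representation of clusters as node-index lists are our assumptions):

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <vector>

    // Place clusters, largest first, into the currently lightest of S bins;
    // returns pe[n] = the PE index assigned to node n.
    std::vector<int> lpt_partition(std::vector<std::vector<int>> clusters,
                                   int num_nodes, int S) {
      std::sort(clusters.begin(), clusters.end(),
                [](const std::vector<int>& a, const std::vector<int>& b) {
                  return a.size() > b.size();       // descending cluster size
                });
      std::priority_queue<std::pair<size_t, int>,
                          std::vector<std::pair<size_t, int>>,
                          std::greater<>> bins;     // (load, bin id), min-first
      for (int s = 0; s < S; ++s) bins.push({0, s});
      std::vector<int> pe(num_nodes, -1);
      for (const auto& c : clusters) {
        std::pair<size_t, int> lightest = bins.top();
        bins.pop();
        for (int n : c) pe[n] = lightest.second;    // keep the cluster together
        bins.push({lightest.first + c.size(), lightest.second});
      }
      return pe;
    }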
If nodes with an edge connection are assigned to different PEs, inter-PE communication of probabilities is
needed. For the Mastermind dataset, we discovered that with the proposed algorithm most of the communication occurs near the top of
the tree, and the proportion of inter-PE communication instructions is reduced to
2%. But we also noticed that the proportion of inter-PE communication worsens to 30% for certain datasets (more
details in Section 6.2). It remains future work to further improve the node assignment strategy.
Table 1 shows the resource consumption for various configurations. The amount of computation resources
(LUTs and DSPs) grows almost proportionally as the subtree parallel factor increases. The URAM consumption, on
the other hand, stays approximately the same because the number of nodes processed by each PE decreases as the
subtree parallel factor increases. This can be confirmed in the table, which reveals that the URAM consumption
stays at 16 even as the parallel factor increases from 1 to 2 to 4. The BRAM consumption grows rapidly (3 to 9
to 33) with a larger subtree parallel factor because the number of connections increases quadratically with a full
crossbar structure.
Table 1. Resource consumption and performance after increasing subtree parallel factor
An important observation from Table 1 is the significant drop in the kernel clock frequency, down to 234 MHz,
when the subtree parallel factor is 4. In Alveo U250, the four DRAM channels are located in physically separated
Super Logic Regions (SLRs) [41], and the 16 (=4×4) inter-PE FIFOs' SLR crossings have a negative impact on the routing
process.
To solve this problem, we changed the inter-PE communication architecture to a 1D chain (thus reducing
the number of FIFOs crossing the SLRs), and we added signal relay modules in the FIFOs using AutoBridge [21].
The frequency improves to 300 MHz as a result. Even though the inter-PE FIFOs are now partially shared
among multiple PEs, the performance is not significantly degraded by data congestion because of the small
proportion of inter-PE communication in the Mastermind dataset. The cost of the improved frequency is the
larger LUT consumption (from 8.5K to 12K), due to the inter-PE chain modules and the signal relay
modules. We obtain a speedup of 3.7 (=3.2/0.86 GOPS) when using a subtree parallel factor of four.
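A minimal sketch of the chained forwarding logic in each PE, under our assumptions about the message layout (the relay stages inserted by AutoBridge are not shown):

    #include "hls_stream.h"
    #define MAX_NODES 65536   // capacity assumption

    struct msg_t { int dest_pe; int node; float prob; };

    // Each PE consumes messages addressed to it and forwards the rest to its
    // neighbor, so FIFOs connect only adjacent PEs (fewer FIFOs cross SLRs).
    void chain_step(int my_id, hls::stream<msg_t>& from_prev,
                    hls::stream<msg_t>& to_next, float prob_local[MAX_NODES]) {
      if (!from_prev.empty()) {
        msg_t m = from_prev.read();
        if (m.dest_pe == my_id)
          prob_local[m.node] = m.prob;   // deliver locally
        else
          to_next.write(m);              // pass along the 1D chain
      }
    }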
Fig. 10. Speedup after increasing query-level parallelism in the Mastermind dataset
6 EXPERIMENTAL RESULTS
6.1 Experimental Setup
For development, we used the Xilinx Vitis 2019.2 unified software platform [39]. The kernel was programmed in
C++ and compiled with Vivado HLS 2019.2 [40]. The tests were run on-board, using the Xilinx Alveo U250 [41]
platform. The probabilities are stored as 32-bit integers. The performance is measured in Giga Operations per Second
(GOPS).
In addition to the Mastermind dataset (which has been used for the optimization and parallelization process),
we also tested our accelerator with three more datasets from [6]: FS-04 (friends and smokers), Students (and
professors), and (random) Blockmap (shown in Table 2). Each dataset is a grounded relational Bayesian
network, which is a very challenging problem for classical inference methods. Like Mastermind, these datasets
exhibit a vast amount of local structure, which PSDDs can exploit to achieve exact, tractable
inference. The datasets contain 3,548–52,789 nodes, and we cannot fit all the operators for the entire graph on the
FPGA. The datasets have sparse connections, with 1.0 to 1.4 edges per node (excluding leaf literal nodes). We
also list the number of leaf literal nodes and edges in the datasets.
We have already analyzed the effect of increasing the subtree-level parallel factor and the query-level parallel
factor in Section 5.1 and Section 5.2, respectively. In this section, we present the cumulative effect of each
optimization, starting with the baseline HLS code in Fig. 3. After applying the edge-centric processing with the
static scheduling, the common parent clustering, and the probability array separation, the processing rate no
longer suffers heavily from the long pipeline epilogue/prologue overhead (12 cycles), even though there is only a
small number of child nodes (1.1 on average). Also, we achieve an II of 1. The performance is improved by 21X
(Table 4). The subtree-level parallelism of four improves the performance by 3.7X (refer to the explanation in
Section 5.1), and the query-level parallelism of 32 leads to a speedup of 28X, as observed in Fig. 10. After applying
all optimization steps, we achieved a cumulative speedup of 2,200X compared to the baseline code. The clock
frequency of the final design is 273 MHz.
Table 4. Performance and cumulative speedup with proposed optimizations (Mastermind dataset)
After applying all the proposed optimizations, we measured the performance on various datasets. The
result is presented in Table 5: the first row is calculated considering the FPGA execution time only, and
the second row is calculated considering both the FPGA execution time and the PCIe transfer time (of the
graph processing static schedule and the literal values of each query). Compared to other works that assume the
entire graph can fit on the FPGA, our design architecture is scalable and retains a relatively high performance for
datasets of various sizes (see Section 7 for a quantitative comparison). This is because the proposed design reads
the graph structure information from the DRAM, so the performance does not depend on the size of the graph
that is mapped to the FPGA.
The maximum FPGA-only performance is 89 GOPS (Mastermind), and the average FPGA-only performance is
59 GOPS. But there is some performance variance among the datasets. For the Blockmap dataset, the low performance
is due to the high proportion (34%) of leaf literal nodes (Table 2): much of the execution time is spent on
fetching and transposing the values of literal nodes rather than processing edges. Apart from this factor, the
performance difference is mostly due to the effectiveness of the subtree parallelization step. The amount of
inter-PE communication (Section 5.1) is 2% for Mastermind, 23% for Students, 27% for Blockmap, and 30% for
FS-04, and the data congestion due to the simple 1D chain architecture further degrades the performance.
Moreover, the coarsening step explained in Section 5.1 introduces an unbalanced workload, which accounts for
the rest of the performance difference. It remains future work to reduce these overheads without severely
complicating the inter-PE communication architecture.
The performance drops to an average of 50 GOPS if we consider the PCIe transfer time in addition to the FPGA
execution time. The static schedule is fetched from the DRAM and sent to the FPGA for each batch of queries,
but it only needs to be transferred once through PCIe because the same schedule is reused in the DRAM.
The literal values for each query, on the other hand, are transferred through both PCIe and the DRAM only
once; they are not reused in the DRAM. This leads to a larger performance drop for the Blockmap dataset (30→17)
compared to the FS-04 dataset (62→61): the Blockmap dataset has a larger proportion of literal nodes than
the FS-04 dataset (34% vs. 1%, Table 2), which causes more time to be spent on transferring the literal values
through PCIe.
Table 6 compares the performance between a multicore CPU implementation and the proposed FPGA
implementation. We use a two-socket Intel Xeon Gold 6244 server-class node, and we parallelize the loop that
processes the queries with 32 OpenMP threads (the parallel factor is the same as the FPGA query-level parallelism).
To avoid contention, prob[] is separated for each OpenMP thread (the memory allocation time is excluded
from the execution time). The CPU implementation has been optimized with the -O3 flag. The experimental result
shows that the average performance is 2.9 GOPS; so even though we use two CPUs, the performance is 20X slower
than the proposed FPGA implementation. The reason is mainly related to how fast the data can be supplied to
the computation units. The proposed FPGA static scheduling provides operands to the OR/AND-gates almost
every cycle. On a CPU, this is not guaranteed, since the data for thousands of nodes and multiple threads cannot
fit into the L1/L2 cache (notice that, unlike the FPGA performance, the CPU performance is generally higher on
smaller datasets). The superior FPGA performance is also attributed to the customized computation/memory pipeline.
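A minimal sketch of this CPU baseline under our assumptions (Query, NUM_NODES, and evaluate_query are hypothetical placeholders for the dataset structures and the bottom-up evaluation routine):

    #include <omp.h>
    #include <vector>

    constexpr int NUM_NODES = 42558;           // Mastermind size, for illustration
    struct Query { std::vector<float> leaf_values; };

    // Hypothetical bottom-up MAR/MPE evaluation over a thread-private buffer.
    void evaluate_query(const Query& q, float* prob);

    void run_queries(const std::vector<Query>& queries, int num_threads) {
      // One private probability buffer per thread avoids contention on prob[].
      std::vector<std::vector<float>> prob(num_threads,
                                           std::vector<float>(NUM_NODES));
    #pragma omp parallel for num_threads(num_threads) schedule(static)
      for (int q = 0; q < (int)queries.size(); ++q) {
        int t = omp_get_thread_num();
        evaluate_query(queries[q], prob[t].data());
      }
    }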
Table 6. Performance comparison between 32-thread multicore CPU and proposed FPGA implementation (in GOPS)
Among all the FPGA resources, the one with the highest utilization is BRAM. But even the highest utilization
ratio is relatively small (31%). This is because we wanted to ease the PnR process by keeping the consumption
under 50% for all resources. Thus, the subtree-level parallelism and the query-level parallelism were limited to 4
and 32, respectively.
7 RELATED WORKS
There are several recent papers that accelerate graph applications (e.g., sparse matrix-vector multiplication)
on an FPGA. HitGraph [44] is a high-performance edge-centric graph accelerator with several performance
optimization techniques such as node buffering and data layout optimization. ThunderGP [7] is an HLS-based graph
processing framework which automatically builds FPGA accelerators based on its high-level APIs. It supports
several efficient memory access patterns such as scatter/gather, coalescing, and prefetching. However, these
papers target general graph applications and cannot exploit the unique characteristics of graph models
such as BNs, SPNs, or PSDDs.
In the remainder of this section, we will review related FPGA acceleration works for the BN and the sum-product
network (SPN).
The Bayesian Computing Machine (BCM) has been presented in [25]. It supports the sum-product algorithm
and the max-sum algorithm. Their hardware is composed of a network of processors and memory connected
through a switching crossbar. They propose an optimal scheduling of computation and memory units to minimize
the execution time. This work has been implemented on the Berkeley Emulation Engine FPGA platform [1].
The work in [43] uses a high-level synthesis tool chain similar to our work. They accelerate the Arithmetic
Circuit on the Xilinx Zynq embedded platform. They provide an option to reconfigure the network parameters
with the data fetched through the AXI bus. The parallelism is achieved by compiling the entire network onto the
FPGA, but such an approach limits the network size to no more than 511 nodes. Also, their accelerator achieves a
performance of only about 0.01 GFLOPS, possibly due to the data transfer overhead.
The work in [37] accelerates the SPN inference problem on a Virtex-7 FPGA. They map the SPN tree to a
hardware datapath composed of pipelined functional units and shift registers. Similar to [43], the entire SPN
tree is implemented on-chip. In a separate work [24], they present an architecture where the operators are
time-shared, which makes it possible to process larger graphs. The number representation in an SPN graph can
be efficiently optimized based on the histogram of the variables; readers are referred to [38] for an automated
method that finds the best number representation.
The work in [34] converts a PSDD graph to an SPN graph by replacing AND-gates with products and OR-gates with sums.
The resulting SPN graph is accelerated with a tree of customized processors, each with private registers. Their
simulation result reports a peak performance of 11.6 operations/cycle.
Compared to these works, our work concentrates on developing an HLS-friendly optimization method to
improve the processing rate and parallelism in PSDD graphs. A quantitative comparison is shown in Table 8. Since
each work uses a different graph structure, number representation, and dataset, it is difficult to make a direct
comparison. Instead, the performance is provided in either GOPS or Giga Traversed Edges Per Second (GTEPS).
The table also presents the graph size, resource consumption, platform, and performance. The performance of
[37] is estimated by multiplying the number of add/mul operations by the reported throughput (FPGA time only).
Table 8. Graph size, platform, resource consumption, and performance comparison with related works implemented on
FPGA
Our work can outperform accelerators that do not assume a particular graph model (e.g., [44] or [7]) because
our processing engine can exploit PSDD-specific characteristics, such as each node having a small number of
child nodes or being connected to prime and sub nodes. Our work also outperforms the related BN [25] and SPN
work [24] by 3.0X and 9.0X on average, respectively. These works read the graph information from memory,
similar to our work. It is unclear if our work significantly outperforms [37]: it is possible that our higher frequency
(273 MHz vs. 200 MHz) may have been achieved due to more advanced FPGA technology (Alveo U250 vs. VC709),
and this may have led to the better performance (59 GOPS vs. 31 GOPS, on average). It is also difficult to make a
direct comparison on the LUT/DSP consumption, because our work operates on 32-bit integers. But it is still
worth noting that [37] can only process relatively small graphs, with the entire SPN tree being mapped to the
FPGA operators, whereas our work can process larger graphs efficiently with the proposed static scheduling
method. Also, our work has relatively less performance variance across graph sizes because of the
processing rate optimizations in Section 4.
8 CONCLUSION
We presented an HLS-friendly accelerator design for PSDDs. We found that changing the processing scheme from
node-centric to edge-centric helps maintain a high processing rate even if there are only a few children
per parent node. This led to a dependency problem, which was solved with static scheduling. The throughput of
the PSDD pipeline was improved to an II of 1 with the common parent clustering and array separation techniques. We
also proposed subtree-level and query-level parallelization methods that can be used to improve the computation
speed of a large tree that cannot fit on an FPGA. Experimental results show that the optimizations improve
the performance of the baseline implementation by 2,200X. The proposed architecture has a speedup of 20X
over the CPU implementation, and it outperforms the BN and SPN FPGA acceleration works that store the graph
information in the memory by 3.0X-9.0X. As future work, we plan to further reduce the performance variance
among different datasets with load balancing improvements and inter-PE communication reduction. This PSDD
acceleration project has been open-sourced at https://github.com/carlossantillana/psdd/tree/alveo250.
ACKNOWLEDGMENTS
This research is supported by Inha University Research Grant, National Research Foundation (NRF) Grant funded
by Korea Ministry of Science and ICT (MSIT) (2022R1F1A1074521), US NSF Grant on RTML: Large: Acceleration to
Graph-Based Machine Learning (CCF-1937599), and Xilinx Heterogeneous Accelerated Compute Cluster (HACC)
Program. We thank Yuze Chi, Vidushi Dadu, Licheng Guo, Jason Lau, Michael Lo, Tony Nowatzki, and Yizhou
Sun for the discussion and the help with the experiments. We also thank Marci Baun for proofreading this article.
REFERENCES
[1] Berkeley. 2008. BEE3 (Berkeley Emulation Engine). https://www.microsoft.com/en-us/research/project/bee3/
[2] Simone Bova. 2016. SDDs are Exponentially More Succinct than OBDDs. In AAAI. AAAI Press, 929–935.
[3] Ruizhe Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, and Yanzhi Wang. 2018. VIBNN: Hardware acceleration of Bayesian neural networks. ACM SIGPLAN Notices 53, 2 (2018), 476–488.
[4] Hei Chan and Adnan Darwiche. 2006. On the Robustness of Most Probable Explanations. In UAI '06, Proceedings of the 22nd Conference in Uncertainty in Artificial Intelligence. AUAI Press.
[5] Mark Chavira and Adnan Darwiche. 2008. On probabilistic inference by weighted model counting. Artificial Intelligence 172, 6 (2008), 772–799.
[6] Mark Chavira, Adnan Darwiche, and Manfred Jaeger. 2006. Compiling relational Bayesian networks for exact inference. International Journal of Approximate Reasoning 42, 1-2 (2006), 4–20.
[7] X. Chen, H. Tan, Y. Chen, B. He, W. Wong, and D. Chen. 2021. ThunderGP: HLS-based graph processing framework on FPGAs. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. 69–80.
[8] Arthur Choi, Guy Van den Broeck, and Adnan Darwiche. 2015. Tractable Learning for Structured Probability Spaces: A Case Study in Learning Preference Distributions. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI). AAAI Press, 2861–2868.
[9] Arthur Choi, Yujia Shen, and Adnan Darwiche. 2017. Tractability in Structured Probability Spaces. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems. 3477–3485.
[10] Arthur Choi, Nazgol Tavabi, and Adnan Darwiche. 2016. Structured Features in Naive Bayes Classification. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, 3233–3240.
[11] Young-kyu Choi and Jason Cong. 2018. HLS-based optimization and design space exploration for applications with variable loop bounds. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1–8.
[12] J. Cong, W. Jiang, B. Liu, and Y. Zou. 2011. Automatic memory partitioning and scheduling for throughput and power optimization. ACM Trans. Design Automation of Electronic Systems 16, 2 (2011), 1–25.
[13] J. Cong et al. 2011. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 30, 4 (Apr. 2011), 473–491.
[14] Adnan Darwiche. 2003. A differential approach to inference in Bayesian networks. J. ACM 50, 3 (2003), 280–305.
[15] Adnan Darwiche. 2009. Modeling and Reasoning with Bayesian Networks. Cambridge University Press.
[16] Adnan Darwiche. 2011. SDD: A New Canonical Representation of Propositional Knowledge Bases. In IJCAI. IJCAI/AAAI, 819–826.
[17] Adnan Darwiche. 2020. Three Modern Roles for Logic in AI. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS). ACM, 229–243.
[18] Adnan Darwiche. 2022. Tractable Boolean and Arithmetic Circuits. In Neuro-symbolic Artificial Intelligence: The State of the Art, Pascal Hitzler and Md Kamruzzaman Sarker (Eds.). Vol. 342. Frontiers in Artificial Intelligence and Applications. IOS Press, Chapter 6.
[19] Johannes Geist, Kristin Y. Rozier, and Johann Schumann. 2014. Runtime Observer Pairs and Bayesian Network Reasoners On-board FPGAs: Flight-Certifiable System Health Management for Embedded Systems. In International Conference on Runtime Verification (Lecture Notes in Computer Science, Vol. 8734). Springer, 215–230.
[20] R. L. Graham. 1966. Bounds for certain multiprocessing anomalies. Bell System Technical Journal 45, 9 (1966), 1563–1581.
[21] L. Guo, Y. Chi, J. Wang, J. Lau, W. Qiao, E. Ustun, Z. Zhang, and J. Cong. 2021. AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays.
[22] Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. 2014. Probabilistic Sentential Decision Diagrams. In Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference (KR). AAAI Press.
[23] Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models - Principles and Techniques. MIT Press.
[24] Hanna Kruppe, Lukas Sommer, Lukas Weber, Julian Oppermann, Cristian Axenie, and Andreas Koch. 2021. Efficient Operator Sharing Modulo Scheduling for Sum-Product Network Inference on FPGAs. https://www.esa.informatik.tu-darmstadt.de/assets/publications/materials/2021/2021_SAMOS_HK.pdf
[25] Mingjie Lin, Ilia Lebedev, and John Wawrzynek. 2010. High-throughput Bayesian computing machine with reconfigurable hardware. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 73–82.
[26] Kevin P. Murphy. 2012. Machine Learning - A Probabilistic Perspective. MIT Press.
[27] Xie Pan and Yu Jinsong. 2017. Diagnosis via arithmetic circuit compilation of Bayesian network and calculation on FPGA. In 2017 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI). 35–41.
[28] Judea Pearl. 1989. Probabilistic Reasoning in Intelligent Systems - Networks of Plausible Inference. Morgan Kaufmann.
[29] Knot Pipatsrisawat and Adnan Darwiche. 2008. New Compilation Languages Based on Structured Decomposability. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence. AAAI Press, 517–522.
[30] Hoifung Poon and Pedro M. Domingos. 2011. Sum-Product Networks: A New Deep Architecture. In UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. AUAI Press, 337–346.
[31] Tahrima Rahman, Prasanna Kothalkar, and Vibhav Gogate. 2014. Cutset Networks: A Simple, Tractable, and Scalable Approach for Improving the Accuracy of Chow-Liu Trees. In Machine Learning and Knowledge Discovery in Databases - European Conference (ECML PKDD) (Lecture Notes in Computer Science, Vol. 8725). Springer, 630–645.
[32] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 472–488.
[33] N. Shah, L. Olascoaga, W. Meert, and M. Verhelst. 2020. Acceleration of probabilistic reasoning through custom processor architecture. In Design, Automation & Test in Europe Conference & Exhibition (DATE). 322–325.
[34] Nimish Shah, Laura I. Galindez Olascoaga, Wannes Meert, and Marian Verhelst. 2020. Acceleration of probabilistic reasoning through custom processor architecture. In 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 322–325.
[35] Yujia Shen, Arthur Choi, and Adnan Darwiche. 2016. Tractable Operations for Arithmetic Circuits of Probabilistic Models. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems. 3936–3944.
[36] Yujia Shen, Arthur Choi, and Adnan Darwiche. 2018. Conditional PSDDs: Modeling and Learning With Modular Knowledge. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. AAAI Press, 6433–6442.
[37] Lukas Sommer, Julian Oppermann, Alejandro Molina, Carsten Binnig, Kristian Kersting, and Andreas Koch. 2018. Automatic mapping of the sum-product network inference problem to FPGA-based accelerators. In 2018 IEEE 36th International Conference on Computer Design (ICCD). 350–357.
[38] Lukas Sommer, Lukas Weber, Martin Kumm, and Andreas Koch. 2020. Comparison of Arithmetic Number Formats for Inference in Sum-Product Networks on FPGAs. In 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 75–83.
[39] Xilinx. 2020. Vitis Unified Software Platform. https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html
[40] Xilinx. 2020. Vivado High-Level Synthesis (UG902). https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_2/ug902-vivado-high-level-synthesis.pdf
[41] Xilinx. 2021. Alveo U250 Data Center Accelerator Card. https://www.xilinx.com/products/boards-and-kits/alveo/u250.html
[42] Xilinx. 2021. UltraScale Architecture Memory Resources. https://www.xilinx.com/support/documentation/user_guides/ug573-ultrascale-memory-resources.pdf
[43] S. Zermani, C. Dezan, H. Chenini, J. Diguet, and R. Euler. 2015. FPGA implementation of Bayesian network inference for an embedded diagnosis. In 2015 IEEE Conference on Prognostics and Health Management (PHM). 1–10.
[44] S. Zhou, R. Kannan, V. K. Prasanna, G. Seetharaman, and Q. Wu. 2019. HitGraph: High-throughput graph processing framework on FPGA. IEEE Transactions on Parallel and Distributed Systems 30, 10 (2019), 2249–2264.
[45] Stephanie Zierke and Jason D. Bakos. 2010. FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods. BMC Bioinformatics 11, 1 (2010), 1–12.