
FPGA Acceleration of Probabilistic Sentential Decision Diagrams

with High-Level Synthesis


YOUNG-KYU CHOI, Inha University, South Korea and University of California, Los Angeles, USA
CARLOS SANTILLANA, YUJIA SHEN, ADNAN DARWICHE, and JASON CONG, University of
California, Los Angeles, USA
Probabilistic Sentential Decision Diagrams (PSDDs) provide efficient methods for modeling and reasoning with probability
distributions in the presence of massive logical constraints. PSDDs can also be synthesized from graphical models such
as Bayesian networks (BNs), therefore offering a new set of tools for performing inference on these models (in time linear
in the PSDD size). Despite these favorable characteristics of PSDDs, we have found multiple challenges in PSDD's FPGA
acceleration. Problems include limited parallelism, data dependency, and small pipeline iterations. In this paper, we propose
several optimization techniques to solve these issues with novel pipeline scheduling and parallelization schemes. We designed
the PSDD kernel with a high-level synthesis (HLS) tool for ease of implementation and verified it on a Xilinx Alveo U250 board.
Experimental results show that our methods improve the baseline FPGA HLS implementation performance by 2,200X and
outperform the multicore CPU implementation by 20X. The proposed design also outperforms state-of-the-art BN and Sum
Product Network (SPN) accelerators that store the graph information in memory.
CCS Concepts: • Computer systems organization → High-level language architectures; Reconfigurable computing.
Additional Key Words and Phrases: PSDD, HLS, FPGA

1 INTRODUCTION
Probabilistic Sentential Decision Diagrams (PSDDs) were recently proposed to model distributions over structured
probability spaces that are defined by massive logical constraints [22, 36]. Traditionally, probability distributions
are modeled using graphical models such as Bayesian networks (BNs) [15, 23, 26, 28]. A BN employs a directed
acyclic graph (DAG) to capture dependencies among random variables. In the presence of massive logical
constraints, which naturally arise in many domains, the DAG can become too highly connected to allow efficient
reasoning and learning in real-world applications. PSDDs, on the other hand, provide a tractable representation
of probability distributions in this case because they are based on a sophisticated representation of logical
constraints known as Sentential Decision Diagrams (SDDs), which generalize and can be exponentially smaller
than Ordered Binary Decision Diagrams (OBDDs) [2, 16]. The effectiveness of PSDDs has been demonstrated in
numerous real-world applications with massive logical constraints. As an example, a classical Naive Bayesian
classifier of a board game trace requires 362,879 parameters, whereas a PSDD needs 1,793 parameters [10].
Other successful examples of PSDDs include learning user preferences [8], anomaly detection [10], and route
distribution modeling [9, 36]; see [17] for a survey.
Authors' addresses: Young-kyu Choi, [email protected], Inha University, 100 Inha-ro Hitech 1012, Incheon, South Korea, 22212 and University
of California, Los Angeles, 404 Westwood Plaza, Los Angeles, California, USA, 90095; Carlos Santillana, [email protected]; Yujia Shen,
[email protected]; Adnan Darwiche, [email protected]; Jason Cong, [email protected], University of California, Los Angeles, 404
Westwood Plaza, Los Angeles, California, USA, 90095.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected].
© 2022 Association for Computing Machinery.
1936-7406/2022/9-ART $15.00
https://doi.org/10.1145/3561514


PSDDs are normally synthesized from a combination of data and symbolic knowledge in the form of logical
constraints. For example, in [36], PSDDs represent a distribution over routes, which are modeled as paths in a
graph. A path is encoded as a binary instantiation over the edge variables. A variable is set to 1 if the edge is
used on the path and 0 otherwise. Some instantiations of the variables correspond to invalid paths in the graph
(disconnected edges), and this leads to logical constraints. To learn a probability distribution over the valid paths,
one can first construct a PSDD from the logical constraint among the edge variables, and then estimate the
weights on the PSDD from data consisting of the routes most frequently taken by travelers.
PSDDs can also be synthesized from graphical models such as Bayesian networks [35], positioning them as
an inference tool for such models. Since PSDDs are synthesized, they can be very large and exhibit some
strong properties in comparison to other tractable circuit representations such as Sum Product Networks [30]
and Cutset Networks [31] (which are normally handcrafted or learned from data). All these models are variants
of Arithmetic Circuit (AC) representations of probability distributions [14], which allow inference in time linear
in the circuit size; see [18] for a recent survey of these circuit representations.
A field-programmable gate array (FPGA) is a high-performance, energy-efficient reconfigurable platform that
has accelerated many probabilistic inference problems. Examples include: Sum Product Networks [33, 37, 38],
Bayesian Monte Carlo Markov Chain model inference [45], Bayesian Computing Machines [25], Bayesian Neural
Networks [3], and Bayesian inference with Arithmetic Circuits [19, 27, 43]. But to our knowledge, there is no
previous work that accelerates PSDDs.
We accelerate PSDD on the Xilinx Alveo U250 FPGA Acceleration Card [41]. We implemented the two
commonly used PSDD queries: the probability of the most probable explanation (MPEp) query and the marginal
(MAR) query. In order to reduce the development effort, we program in C and generate the bitstream using a high-level
synthesis (HLS) tool [13, 40]. But we found that accelerating the PSDD kernel in HLS presents a unique set
of challenges. As we construct a PSDD graph from a BN, the sparseness of the node connections leads to each
node having only a small number of child nodes. This property increases the proportion of a loop's pipeline
epilogue/prologue compared to the loop's length (details in Section 3). We could try the edge-centric processing
scheme [32, 44] to solve this problem, but a straightforward implementation in HLS causes new dependency
issues and worsens the initiation interval (II) of the pipelined loops (details in Section 3). Moreover, in order
to perform a query on a real-world application with massive logical constraints, we process a large graph that
may not fit on an FPGA. Thus, we have to make the processing elements (PEs) configurable for irregular tree
connection patterns, but this complicates the parallelization process.
In this paper, we present optimization techniques to solve these problems. We propose a novel HLS-based
edge-centric processing scheme that achieves an II of 1 for a PSDD query with dependency issues. The scheduling
for this scheme is lightweight and can be done in a few seconds. Moreover, we exploit multiple levels of parallelism
that can be applied even if the entire graph cannot fit on an FPGA. The proposed optimizations are fully compatible
with HLS, and the design was verified on-board. Compared to related works whose performance largely depends
on the size of the graph, our work better retains performance across datasets of various sizes.

2 BACKGROUND
2.1 Probabilistic Sentential Decision Diagrams
Figure 1a shows an example PSDD, which is based on a Boolean circuit known as a Sentential Decision Diagram
(SDD) [16] that is annotated with probabilistic parameters. The PSDD is composed of fragments shown in
Figure 2. In an internal fragment, the OR-gate can have an arbitrary number of inputs. Each child of the OR-gate
is associated with a parameter αi. The AND-gates have precisely two inputs each. Each left child pi is called a
prime, and each right child si is called a sub. Figure 1a highlights some PSDD fragments.


[Figure 1: (a) an example PSDD over variables A, B, C, D; (b) the vtree it conforms to]

Fig. 1. Figure 1a shows an example PSDD. Each fragment is boxed. Internal fragments are surrounded by an empty red box,
and leaf fragments are surrounded by shaded boxes. Further, some fragments are indexed for reference purposes, and the index
is indicated on the top left of the box. Figure 1b shows the vtree that the PSDD conforms to.


[Figure 2: the four PSDD fragment types, panels (a)-(d)]

Fig. 2. Four types of PSDD Fragments: internal (2a) and boundary, which can be a positive literal (2b), a negative literal (2c)
or a simple OR-gate (2d). A PSDD can also have leaf nodes representing false, but we omit these as they are not necessary
for our acceleration.

Each PSDD conforms to a tree of variables, called a vtree [29]. A vtree is a binary tree whose leaves are the
circuit variables (Figure 1b). The conformity is roughly as follows. For each internal fragment with primes pi and
subs si, there must exist a vtree node v where the variables of each prime pi appear in the left child of v, and the
variables of each sub si appear in the right child of v. For example, the root fragment of the PSDD in Figure 1a
conforms to vtree node 3 in Figure 1b. Each prime of this PSDD fragment, fragments 2 and 4, conforms to vtree
node 1. Each sub, fragments 3 and 5, conforms to vtree node 5.

2.2 PSDD Semantics and Queries


A PSDD represents a probability distribution over the binary variables X that appear in the vtree. We use a
bold letter X to represent a set of variables, and a normal letter X to represent a single variable. (We abuse the
notation X to also represent a positive literal of variable X.) The semantics of a PSDD can be defined by unfolding
it into an AC through replacing each AND-gate with a multiplication operation and each OR-gate with a weighted
sum. The weights are the parameters that annotate the children of the OR-gate. By properly setting the leaf
literals of this AC for a given instantiation x of variables X, the AC will evaluate to the probability of instantiation x.
To evaluate the AC at a given variable instantiation x, we set each leaf literal in the AC to 1 if it is compatible
with instantiation x, and to 0 otherwise. A positive literal X is compatible with an instantiation that sets X = 1, and it is
not compatible with an instantiation that sets X = 0. The opposite is true for the negative literal ¬X. To evaluate
the probability of an instantiation, e.g., A = 0, B = 0, C = 0, D = 0, we compute the value of each fragment in a
bottom-up order. The following table shows the values of the literals:
Literals   A   ¬A   B   ¬B   C   ¬C   D   ¬D
Values     0    1   0    1   0    1   0    1
The value of each fragment is computed using the corresponding arithmetic operations. For example, the value
of the internal fragment 2 is computed as:
0.33 × 0 × 0 + 0.67 × 1 × 1 = 0.67,

and the value of fragment 4 is:


0.75 × 0 × 1 + 0.25 × 1 × 0 = 0.0.
Given a variable instantiation, the value of each fragment represents the conditional probability Pr(x |
{y0, . . . , yn}), where the conditions {y0, . . . , yn} are called contexts [22]. The query x corresponds to the subset of
the input instantiation whose variables appear beneath the fragment. If the input instantiation is A = 0, B = 0, C =
0, D = 0, the query for fragment 4 corresponds to A = 0, B = 0. Roughly speaking, the context consists of a set of
input instantiations whose probability computation depends on the fragment (please refer to [22] for details). The
context for fragment 4 consists of instantiations A = 1, B = 0, C = 1, D = 0 and A = 0, B = 1, C = 1, D = 0. Then
the value of fragment 4 equals Pr(A = 0, B = 0 | {{A = 1, B = 0, C = 1, D = 0}, {A = 0, B = 1, C = 1, D = 0}}),
which would have a probability of 0.0 (matching the value of the corresponding arithmetic operation).
The values of the rest of the labeled fragments are listed in the following:
Fragment ID   7      6      5      4      3      2      1
Value         0.60   0.80   0.00   0.00   0.48   0.67   0.19
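This bottom-up evaluation is straightforward to express in software. The following minimal sketch computes fragment values on a CPU, assuming fragments are indexed in children-first (topological) order and leaf values are preset; the Fragment layout and all names are illustrative, not the accelerator's data structures.

#include <cstddef>
#include <vector>

// One OR-fragment: child i carries a parameter alpha_i and points to the
// fragments of its prime p_i and sub s_i. Leaves have no children.
struct Fragment {
    std::vector<double> alpha;
    std::vector<int>    prime;
    std::vector<int>    sub;
};

// Bottom-up AC evaluation: value[] must already hold the leaf literal
// values (1.0 if compatible with the instantiation, 0.0 otherwise), and
// fragments are assumed indexed so that children precede parents.
void evaluate_ac(const std::vector<Fragment>& frags, std::vector<double>& value) {
    for (std::size_t f = 0; f < frags.size(); ++f) {
        if (frags[f].alpha.empty()) continue;          // leaf: value preset
        double v = 0.0;
        for (std::size_t i = 0; i < frags[f].alpha.size(); ++i)
            v += frags[f].alpha[i] * value[frags[f].prime[i]] * value[frags[f].sub[i]];
        value[f] = v;
    }
}

Replacing the weighted sum with a running max turns the same traversal into the MPE circuit evaluation discussed next.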

2.3 Most Probable Explanation and Marginal Queries


We have implemented the two commonly used probabilistic queries: probability of the most probable explanation
(MPEp ) and marginal (MAR) queries. The queries are discussed in detail in this section.
2.3.1 MPEp. The most probable explanation (MPE) query finds the most likely instantiation x of variables X
that is compatible with given evidence. The evidence is described by an instantiation e over a subset of the
PSDD variables E ⊆ X. For example, in natural language processing, we can predict the most likely sentence
structure given an observed sentence. We could construct a model of the sentence that describes the probability
distribution over sentences and grammatical structures. The MPE query can be invoked to find the most likely
sentence structure that is compatible with the observed sentence.
The MPE query is computed by evaluating an MPE circuit, which is constructed similarly to the AC of a PSDD.
The difference is that the max operation replaces the OR-gate of the AC [4]. For example, the internal fragment in
Figure 2a evaluates to
max_i {αi × value(pi) × value(si)},   (1)
where value(pi) and value(si) are the values of the primes and subs for that fragment.
In this paper, we explain the methodology for performing an MPEp query, which computes the probability
of the most likely instantiation by evaluating the MPE circuit. This is the most compute-intensive part of the
MPE query. The most likely instantiation itself can be found by recording the child node that provides the maximal
probability for each node and performing a backward pass over the MPE circuit to recover the corresponding
instantiation (see Section 12.3.2 in [15] for details). For simplicity, we will refer to the MPEp query as the MPE query
from now on.
Suppose e is an instantiation of variables E ⊆ X. We can compute the probability of an MPE query by evaluating
the AC of a PSDD as discussed earlier. One difference is that both literals of a variable Y ∈ X \ E will be set to 1
since they are both compatible with the instantiation e. For example, given evidence B = 1, literal ¬B is set to 0
and all other literals are set to 1. Evaluating the AC in Figure 1a under this setting yields the following values for
the fragments:
Fragment ID   7      6      5      4      3      2      1
Value         0.60   0.80   1.00   0.25   0.48   0.33   0.1
Then, the probability of the most likely instantiation that is compatible with B = 1 is 0.1 (the value of Fragment 1).


Fig. 3. Baseline HLS implementation for PSDD acceleration

2.3.2 MAR. The marginal query (MAR) calculates the probability of an instantiation over a subset of variables
E ⊆ X and is one of the most common probabilistic queries. For example, in medical diagnosis, the model describes
a probability distribution that assigns a probability to every possible symptom and disease. The marginal query
can be used to compute the probability of a particular symptom or the probability of a disease given a symptom
(with Bayes conditioning).
For a MAR query, the internal fragment in Figure 2a evaluates to
Σ_i αi × value(pi) × value(si).   (2)
The difference from the MPE query in Eq. 1 is that the summation operation (instead of the max operation)
replaces the OR-gate of the AC.
When evaluating the marginal probability at evidence e, leaf literals are set in a similar way as the MPE query.
Every leaf literal compatible with e is set to 1 and every leaf literal incompatible with e is set to 0. Consider again
the PSDD in Figure 1a. Given evidence B = 1, literal ¬B is set to 0 and all other literals are set to 1. Evaluating the
AC yields the following values for the fragments:
Fragment ID   7      6      5      4      3      2      1
Value         1.00   1.00   1.00   0.25   1.00   0.33   0.29
Thus the marginal probability of B = 1 is 0.29.

3 BASELINE HLS IMPLEMENTATION AND CHALLENGES


In this section, we present the baseline MPE query implementation in HLS and identify the challenges in the
acceleration. We compute the output of each PSDD fragment (we will simply refer to this as a node from now on)
in the outermost loop (line 1) of Fig. 3. As explained in Eqs. 1 and 2, each node performs a maximum (MPE) or
addition (MAR) operation for OR gates, and a multiplication for AND gates. We will refer to these operations as
the edges for the rest of the paper. Each node processes its edges in the innermost loop shown in line 3 of Fig. 3.
For the MPE query, we use the maximum operation; for the MAR query, we use the addition operation (line 6 of
Fig. 3). Since the MPE and MAR queries have a very similar computation pattern (Eqs. 1 and 2), we employ the C
ternary operator (which implies a mux in HLS) to choose between the results of the two different queries. The
outermost loop in line 1 traverses through all the nodes in a bottom-up fashion.
We store α i in the array weight[], the value of the node probability for the prime value(pi ) and the sub
value(si ) in the array prob[], the number of pi and si for each node in the array edge_num[], and the node
index of pi and si in prime[] and sub[]. The node probability values are stored in an integer type.


Fig. 4. Naive edge-centric processing in HLS

The array prob[] is accessed very frequently: we need to fetch the probability for primes and subs (line 5)
and update the result (line 8). Thus, prob[] is stored in the FPGA internal memory. Moreover, a considerable
amount of memory is required to process large networks. The natural choice is to assign prob[] to Ultra-RAM
(URAM) [42], the largest internal memory resource in Alveo U250 (which has 45 MB of URAM and 12 MB of
Block-RAM (BRAM)).
The most commonly used optimization techniques for accelerating an HLS kernel are pipelining and unrolling
[40]. One could consider pipelining the loop in line 1 of Fig. 3 and unrolling the loop in line 3; but edge_num[]
is a variable, which makes it difficult for HLS compilers to determine the unrolling factor. Another option is to
pipeline the loop in line 3; the code after applying the pipeline compiler directive (#pragma HLS pipeline) is
shown in Fig. 3. We will refer to this code as the baseline implementation.
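For concreteness, a minimal sketch of this baseline loop structure follows. The array names match the text, while the function signature, the edge-offset array edge_off[], and the query-selector flag mar are illustrative assumptions; fixed-point scaling of the products is omitted, and prob[] is assumed pre-initialized with the leaf literal values.

// Baseline node-centric kernel (sketch of Fig. 3).
void psdd_baseline(int num_nodes, const int edge_num[], const int edge_off[],
                   const int prime[], const int sub[], const int weight[],
                   int prob[], bool mar) {
NODES:                                              // bottom-up over all nodes
    for (int n = 0; n < num_nodes; ++n) {
        if (edge_num[n] > 0) prob[n] = 0;           // internal node accumulator
EDGES:                                              // this node's prime/sub pairs
        for (int e = 0; e < edge_num[n]; ++e) {
#pragma HLS pipeline II=1
            int i = edge_off[n] + e;
            // alpha_i * value(p_i) * value(s_i)
            int v = weight[i] * prob[prime[i]] * prob[sub[i]];
            int old = prob[n];
            // ternary operator selects MAR (add) or MPE (max)
            prob[n] = mar ? (old + v) : (old > v ? old : v);
        }
    }
}

Because the pipeline restarts for every node, the epilogue/prologue overhead is paid once per (very short) inner loop, which is exactly the processing-rate problem quantified below.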
To test the baseline implementation, we utilize a PSDD dataset compiled from the Mastermind network using
the method described in [35]. The Mastermind network is a commonly used Bayesian network that models the
Mastermind game. This network exhibits local structure [5] that can be exploited when compiling the PSDD.
Unlike general PSDD graphs, each node in the Mastermind network has at most two child nodes because it is
synthesized from a BN [35]. But we found that such sparsity in the network causes a severe negative effect on
the performance of an HLS-based implementation.
The problem we faced was the low processing rate of the pipelined loop. Even after adding the pipeline pragma
(line 4 of Fig. 3), the average processing rate on the Mastermind dataset turned out to be only 0.06 nodes per
cycle. This is because in the Mastermind dataset, each node has only 1.1 prime and sub child nodes on average.
When the innermost loop is invoked, it requires 12 cycles of pipeline epilogue/prologue overhead. If the
average loop iteration count is only 1.1, this overhead dominates the time spent in actual computation.
To solve this problem, we refactor the code to be edge-centric [32, 44]. That is, we flatten the outer loop in
line 1 of Fig. 3 with the inner loop in line 3, and we iterate on the index of the operations. The modified HLS
code is shown in Fig. 4. Now there is only a single loop that traverses through all the edges in a bottom-up
fashion (line 1 of Fig. 4). Even if each parent node has only a few child nodes, the loop no longer suffers from the
repetitive pipeline epilogue/prologue overhead cycles.
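A minimal sketch of this flattened loop follows; as before, num_edges and the mar flag are illustrative assumptions. Note the three reads of prob[] and the single write in every iteration, which foreshadow the memory port problem discussed below.

// Naive edge-centric kernel (sketch of Fig. 4). parent[] maps each edge to
// its parent node; parent entries are assumed zero-initialized.
void psdd_edge_centric(int num_edges, const int prime[], const int sub[],
                       const int parent[], const int weight[],
                       int prob[], bool mar) {
EDGES:                                              // single loop over all edges
    for (int e = 0; e < num_edges; ++e) {
#pragma HLS pipeline II=1
        int v = weight[e] * prob[prime[e]] * prob[sub[e]];   // two reads
        int old = prob[parent[e]];                           // a third read
        prob[parent[e]] = mar ? (old + v) : (old > v ? old : v);  // one write
    }
}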
Although the new processing scheme has the potential to achieve a higher processing rate, the naive HLS
implementation in Fig. 4 has several new problems. One of them is the overhead of storing more data. We need
the index parent to keep track of the parent of each edge (lines 3 and 5 of Fig. 4). This is a minor issue, and it can
be easily solved with additional memory.
The next issue is the dependency problem. Reading the probability, performing the multiplication and the max/add
operations, and writing the probability take several cycles of latency; for an integer variable, it took 12 cycles. This
is a true dependency, because a probability written to a parent node may be read as a child prime/sub node
probability in subsequent iterations. Forcing the HLS tool to ignore this dependency results in a read-after-write
(RAW) hazard.


Fig. 5. A possible dependency problem for Fig. 1a after adopting the edge-centric processing in Fig. 4

An example is illustrated in Fig. 5: for the PSDD in Fig. 1a, the output (node 2) after processing
nodes 6 and 7 will be written 12 cycles later. The RAW hazard occurs if the probability for node 3 is read before
the updated value is written. This challenge will be addressed in Section 4.1.
Moreover, the naive edge processing scheme has a local memory port limitation problem. Even if the dependency
problem is somehow solved, the loop in Fig. 4 cannot be pipelined to an II of 1 because the probability (colored red) is
read three times (from addresses prime[e], sub[e], and parent[e]) and written once (to address parent[e])
every iteration. We cannot solve this problem with the true dual-port mode [42] because the prime[e] and sub[e]
addresses may be different; true dual-port mode only supports two independent addresses. The array partitioning
technique [12, 40] also does not help because we cannot guarantee that the read addresses and the write address
will always be different. We will discuss the solution to this local memory port problem in Section 4.2.
The last issue is the lack of adequate parallelism. To increase parallelism, we could consider partial unrolling
[11] of the loop in line 1 of Fig. 4. But such an approach also causes a dependency problem because the primes
and subs of an iteration may be written in the next iteration. For a small graph, we could resolve this issue by
exploiting operation-level parallelism; that is, we could map the entire AC onto the FPGA similar to [37, 43].
But this is not feasible for Mastermind because we operate on a large graph with 42,558 nodes. It is possible to
implement a part of the graph on the FPGA, but the irregularity of the graph makes it difficult to supply the
operands to each OR/AND-gate without stalling. We will explain how to solve this problem in Section 5.

4 PROCESSING RATE OPTIMIZATION IN HLS


In this section, we propose an HLS-based method to improve the processing rate and the initiation interval (II) of
the edge-centric processing scheme.

4.1 Depth-Batched Static Scheduling


We will describe how to solve the dependency issue (Section 3) that emerged from the edge-centric processing
scheme. There are two approaches to this problem. One is a dynamic solution that detects nodes with data hazards.
An example can be found in [44], which employs a complex mutex-based locking mechanism on the target node.
But this increases the hardware cost. Instead, we take a static approach. This is a hardware-friendly solution
exploiting the fact that the PSDD structure of a dataset is fixed, even for different queries. This characteristic
allows us to explicitly manage the scheduling in an offline compilation.
A related static approach was proposed for SPNs in [24], where they formulate an Integer Linear Programming
(ILP) problem for the modulo scheduling and the binding of arithmetic operators. In this paper, we take a more
lightweight approach to quickly determine the static schedule. We first perform a topological sorting of the
nodes starting from the head node of the PSDD tree, and we assign a tree level (depth of a tree) to each node.


Fig. 6. Removing the inter-depth dependency problem in Fig. 5 with bubble insertion

Fig. 7. The HLS code after applying the depth-batched static scheduling (Section 4.1) and the common parent clustering
(Section 4.2) [blue variables are the decoded static schedule, red variables are the accesses to the probability array, and the green
directive allows the programmer to manage the dependency]

For the example in Fig. 1a, node 1 is assigned depth 0, nodes 2, 3, 4, and 5 are assigned depth 1, and nodes 6 and 7 are
assigned depth 2. Then we batch-process all nodes at the same depth. Since we take a bottom-up approach, the
nodes with the largest level (deepest depth) are processed first in a batch, then the nodes in the upper level
are batch-processed, and so on.
After this process, all the primes and subs writing to a common parent node will have the same depth and can
be easily controlled to avoid the dependency problem (more details in the common parent clustering approach in
Section 4.2). The only RAW hazard now remaining is the inter-depth RAW hazard; an example was illustrated in
Fig. 5, where node 3 (depth 1) is read before the probability write in depth 2 is completed. This issue is resolved
by adding bubbles (no-ops) into the computation slots that have the inter-depth dependency problems (Fig. 6).
PEs will wait until the conflicting write operation to node 3 completes. The computation pipeline is stalled as a
result, but because we batch-process each depth, the proportion of the stall is small compared to the number of
nodes processed in each depth. In the Mastermind dataset, the proportion of bubble cycles is only 0.11%.


Fig. 8. II reduction to 1 after applying the probability array separation technique (Section 4.2)

Fig. 7 presents the HLS kernel code after applying static scheduling. We pack the bubble instruction, the node
index of primes/subs, and weights into the array edge_schedule[]. Then the array is read and decoded in the
PEs (lines 6-7 of Fig. 7). The bubble instruction stops the probability storage from being updated (lines 9-12 of
Fig. 7). Also, we add the "dependence inter false" pragma on array prob[] in line 4 of Fig. 7 to inform the HLS
tool that the dependency is now managed by the programmer. The static schedule is accessed once per edge per
query (and reused across the queries processed in parallel; see Section 5.2), so it is stored in the external DRAM
and passed to the processing elements in a streaming fashion.
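A minimal sketch of this statically scheduled loop follows; it also reflects the common parent clustering of Section 4.2, and the EdgeSlot struct stands in for the bit-packed schedule word (the exact field set and packing are illustrative assumptions).

// One decoded slot of edge_schedule[]; in hardware these fields arrive
// bit-packed in a wide word streamed from DRAM.
struct EdgeSlot {
    bool bubble;     // no-op slot: wait out an inter-depth RAW hazard
    bool node_end;   // last edge of the current parent node
    int  prime, sub; // child node indices
    int  parent;     // parent node index
    int  weight;     // alpha_i in fixed point
};

void psdd_scheduled(int num_slots, const EdgeSlot edge_schedule[],
                    int prob[], bool mar) {
    int node_prob = 0;
EDGES:
    for (int t = 0; t < num_slots; ++t) {
#pragma HLS pipeline II=1
#pragma HLS dependence variable=prob inter false   // hazards handled offline
        EdgeSlot s = edge_schedule[t];
        if (!s.bubble) {
            int v = s.weight * prob[s.prime] * prob[s.sub];
            // accumulate in a register: edges of a common parent are
            // scheduled back to back, so no memory read-modify-write occurs
            node_prob = mar ? (node_prob + v) : (node_prob > v ? node_prob : v);
            prob[s.parent] = node_prob;
        }
        if (s.node_end) node_prob = 0;             // re-initialize for next parent
    }
}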
The order of processing is the same for the same dataset, so we can generate the static schedule offline and
reuse it for various different PSDD queries. Since it is determined offline, the proposed solution does not increase
the hardware cost. Also, whereas the ILP-based approach [24] typically takes tens of minutes to determine the
schedule, our approach can be finished in a matter of a few seconds because assigning a depth to all nodes in a
tree has a low complexity (more details to be presented in Sections 5.1 and 6.2).
In addition to resolving the dependency problem, the static scheduling scheme provides another benefit of
reducing the probability storage. Rather than allocating one node's probability to each address space in the array
prob[], we time-share the array. This is possible because the address space for nodes that will no longer be
accessed can be reused to store other nodes' probabilities. For the example in Fig. 1a, the probability for node 3
only needs to be stored in a memory space from the clock cycle when nodes 6 and 7 are processed to the clock
cycle when node 1 is processed. The space allocated for node 3 can be used for other nodes in all other clock
cycles. This technique reduces the node storage for the Mastermind dataset by 81%.
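This storage reuse can be computed offline together with the schedule. A minimal sketch (a linear-scan-style slot allocation) follows, assuming nodes are indexed in production (schedule) order and that each node's production step and last-read step are known; all names are illustrative.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Assign each node a memory slot, recycling a slot once the schedule step
// of the node's last read has passed.
std::vector<int> assign_slots(const std::vector<int>& produced_at,
                              const std::vector<int>& last_use) {
    typedef std::pair<int, int> Live;               // (last-read step, slot)
    std::priority_queue<Live, std::vector<Live>, std::greater<Live> > live;
    std::vector<int> free_slots, slot(produced_at.size());
    int next_fresh = 0;
    for (std::size_t n = 0; n < produced_at.size(); ++n) {
        // release slots of nodes that will never be read again
        while (!live.empty() && live.top().first < produced_at[n]) {
            free_slots.push_back(live.top().second);
            live.pop();
        }
        int s;
        if (free_slots.empty()) {
            s = next_fresh++;                       // grow the storage
        } else {
            s = free_slots.back();                  // reuse a dead node's slot
            free_slots.pop_back();
        }
        slot[n] = s;
        live.push(Live(last_use[n], s));
    }
    return slot;                                    // next_fresh slots suffice
}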
If the MPE and MAR queries were implemented on separate PEs, we could insert fewer bubbles into an MPE query
because the max operation is simpler than addition (6 vs 12 cycles of latency). Instead, we apply the same
static schedule to both types of queries because we utilize the same pipeline architecture: we wanted to
quickly switch between different types of queries without reprogramming the FPGA. This helps decrease the
computation latency even if there are diverse types of incoming queries.


4.2 Resolving Memory Port Limitation in HLS


As demonstrated in Section 3 and Fig. 4, we cannot achieve an II of 1 in the naive edge processing scheme because
of the internal memory port limitation problem. We will detail how to solve this problem in HLS syntax.
The first part of the solution consists of clustering the edges of a common parent. If we adjust the static
schedule so that edges sharing the same parent are processed immediately after one another, the updated
probability value (the maximum value in MPE and the summation value in MAR) can be read from a temporary
register. This can be observed in line 10 of Fig. 7: a temporary register node_prob is read and written in the
same line (the value is also written to the probability URAM in line 11). There is no RAW hazard because the
addition to node_prob is serialized and can be done in a single cycle. When we process an edge that does not have
the same parent as the previous edge, we can initialize node_prob using a flag from the static schedule (line 13).
By adopting this technique, one probability array read is removed. This approach can naturally be combined with
the depth-batched processing in Section 4.1 because the children of a common parent have the same depth in the tree.
The second part of the solution is the probability array separation. We exploit the fact that a node is either a
prime node or a sub node; it cannot be both. We separate the node probability storage for primes and subs.
The revised HLS code is shown in Fig. 8. One value is read from the prime node probability storage (p_prob[]),
and one value is read from the sub node probability storage (s_prob[]). They are used for the computation in
line 8. If the parent node is a prime, the output of each iteration is written to p_prob[] (line 11). If it is a sub, the
output is written to s_prob[] (line 12). With this code modification, p_prob[] and s_prob[] each have one read and
one write per iteration, and we can reduce the II to 1. This approach does not double the URAM consumption
because there are approximately the same number of primes and subs, and p_prob[] and s_prob[]
are each about half the size of prob[].
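A minimal sketch of the loop after array separation follows, reusing the illustrative EdgeSlot type from Section 4.1; the is_parent_prime flag is assumed to be decoded from the static schedule as well.

// Separated-storage kernel (sketch of Fig. 8): each of p_prob[]/s_prob[]
// sees at most one read and one write per iteration, so II=1 is feasible.
void psdd_separated(int num_slots, const EdgeSlot edge_schedule[],
                    const bool is_parent_prime[],
                    int p_prob[], int s_prob[], bool mar) {
    int node_prob = 0;
EDGES:
    for (int t = 0; t < num_slots; ++t) {
#pragma HLS pipeline II=1
#pragma HLS dependence variable=p_prob inter false
#pragma HLS dependence variable=s_prob inter false
        EdgeSlot s = edge_schedule[t];
        if (!s.bubble) {
            int v = s.weight * p_prob[s.prime] * s_prob[s.sub]; // one read each
            node_prob = mar ? (node_prob + v) : (node_prob > v ? node_prob : v);
            if (is_parent_prime[t]) p_prob[s.parent] = node_prob; // one write,
            else                    s_prob[s.parent] = node_prob; // to one array
        }
        if (s.node_end) node_prob = 0;
    }
}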

5 PARALLELIZATION
The work in [37, 43] achieves a large operation-level parallelism by mapping the entire graph onto the FPGA. But
this approach has two limitations. First, the amount of available parallelism decreases when the graph size is small.
Second, it is difficult to parallelize a large graph that cannot fit on the FPGA. The Mastermind dataset suffers
from the second problem since it is composed of 42,558 nodes. Therefore, we need other types of parallelism to
increase the throughput. In this section, we describe two solutions to overcome this problem.

5.1 Subtree-level Parallelism


We increase the throughput of the PSDD accelerator by processing multiple subtrees in parallel. We split the
prime and the sub probability memory (p_prob and s_prob) into multiple subtrees and attach OR/AND-gate
operators to each memory. The architecture is illustrated in Fig. 9. The address of the memory (prime, sub, parent)
is read from the schedule decoder (the decoded control signals are colored in blue). The probability of the prime
and sub nodes can be directly fed into the operators if the data exists in the local memory. If not, it will be read
from the local memory of other PEs. Prime and sub node probabilities can be sent to other PEs as well. The data will
be selected based on the schedule-decoded signal use_local_mem, and the target IDs of the PEs are selected with the
signals TX_PE_ID and RX_PE_ID (these control signals were omitted in Fig. 8 for simplicity). After we compute the
AND gate (multiplication) with weights, we calculate a new temporary node probability node_prob with addition
or max, depending on the type of the query (selected with the control signal mar). Register node_prob may be
initialized by the signal node_end if the child nodes no longer share a common parent node (Section 4.2). Register
node_prob will be written to p_prob or s_prob depending on the control signals bubble and is_parent_prime.
They may be sent to other PEs if the result is read in other PEs.
Since each PE needs to read a separate schedule from DRAM, we attach a DRAM access module per PE. Alveo
U250 has four DRAM channels, and thus we employed a subtree parallel factor of four.


Fig. 9. PE0 architecture when the subtree-level parallel factor is four

Next we need to determine how to assign the nodes to different PEs. Initially, we equally partitioned
the nodes in each level and assigned them to different PEs. However, we found that 68% of the instructions in
Mastermind's static schedule were either sending or receiving data from other PEs. Even though equal load
balancing was achieved, the parent nodes in a different level were not guaranteed to be in the same PE.
We solved this problem with a graph coarsening technique. The new PE assignment process is shown in
Algorithm 1. Given a PSDD graph, we first determine the tree depth of each node with topological sorting (line 3).
Then we traverse in a bottom-up fashion (lines 4 and 5) and try to coarsen the clusters of nodes that share the
same parent node (lines 6 to 22). For each cluster in depth d (line 7), we look up the clusters in depth d − 1 for
common parent nodes (lines 10 and 11). If found, the common parent node and the child nodes are grouped
together (line 13). This process can naturally be combined with the common parent clustering method described
in Section 4.2. The coarsening continues until the size of the cluster reaches the limit N/S (line 12), where N
is the number of all nodes and S is the subtree-level parallelism factor. If the size limit is reached, the cluster is
appended to the next depth's cluster vector without further coarsening (line 22). If a cluster can find no other
clusters with a common parent node, the parent nodes are merged with the cluster and appended to the next
depth's cluster vector (line 20).
After obtaining a list of coarsened clusters for depth 0, we perform multi-way partitioning. Similar to the
largest processing time (LPT) algorithm [20], we sort the clusters in descending order of their sizes (line 23). The
nodes in the sorted clusters are inserted into whichever of the S bins has the smallest number of nodes after
insertion (line 25). All nodes are assigned a PE after this process.
If the nodes with an edge connection are assigned to different PEs, inter-PE communication of the probability is
needed. For the Mastermind dataset, we discovered that with the proposed algorithm most of the communication
occurs near the top part of the tree, and the proportion of inter-PE communication instructions is reduced to 2%.

Algorithm 1: PE assignment of nodes

1   Input: PSDD graph
2   Output: list of nodes assigned to each PE: PE_0, PE_1, ..., PE_S
3   Assign a depth to all nodes
4   C_dMAX ← all nodes with the largest depth (dMAX)
5   for d = dMAX, ..., 1 do
6       C_{d-1} ← ∅
7       for each cluster C_d^i ∈ C_d do
8           coarsened ← false
9           size_limited ← false
10          for each cluster C_{d-1}^j ∈ C_{d-1} do
11              if any node in Parent(C_d^i) exists in C_{d-1}^j then   // shares common parent nodes
12                  if |C_{d-1}^j ∪ C_d^i| < N/S then
13                      C_{d-1}^j ← C_{d-1}^j ∪ C_d^i
14                      coarsened ← true
15                  else   // cannot be coarsened since it would exceed the cluster size limit
16                      size_limited ← true
17                      break
18          if coarsened == false then
19              if size_limited == false then
20                  Append {C_d^i ∪ Parent(C_d^i)} to C_{d-1}
21              else
22                  Append C_d^i to C_{d-1}

    // The below is a multi-way partitioning based on LPT scheduling
23  sort(C_0)   // in descending order of cluster size
24  for C_0^0, C_0^1, ... ∈ C_0 do
25      Insert all the nodes in C_0^i into the one of the PE_0, PE_1, ..., PE_S bins where the
        number of nodes in the bin is the smallest after insertion
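A minimal sketch of the LPT-style multi-way partitioning in lines 23-25 of Algorithm 1 follows; representing each cluster as a list of node indices is an illustrative choice.

#include <algorithm>
#include <vector>

// Sort clusters by descending size, then greedily place each one into the
// PE bin that holds the fewest nodes.
std::vector<std::vector<int> >
partition_lpt(std::vector<std::vector<int> > clusters, int S) {
    std::sort(clusters.begin(), clusters.end(),
              [](const std::vector<int>& a, const std::vector<int>& b) {
                  return a.size() > b.size();       // descending cluster size
              });
    std::vector<std::vector<int> > pe(S);
    for (std::size_t c = 0; c < clusters.size(); ++c) {
        std::size_t best = 0;                       // bin with fewest nodes
        for (std::size_t b = 1; b < pe.size(); ++b)
            if (pe[b].size() < pe[best].size()) best = b;
        pe[best].insert(pe[best].end(), clusters[c].begin(), clusters[c].end());
    }
    return pe;
}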

For certain datasets, however, the proportion of inter-PE communication worsens to 30% (more details in
Section 6.2). It remains future work to further improve the node assignment strategy.
Table 1 shows the resource consumption for various configurations. The amount of computation resources
(LUTs and DSPs) grows almost proportionally as the subtree parallel factor increases. The URAM consumption, on
the other hand, stays approximately the same because the number of nodes processed by each PE decreases as the
subtree parallel factor increases. This can be confirmed in the table, which shows that the URAM consumption
stays at 16 even as the parallel factor increases from 1 to 2 to 4. The BRAM consumption grows rapidly (3 to 9
to 33) with a larger subtree parallel factor because the number of connections increases quadratically with a full
crossbar structure.

ACM Trans. Reconig. Technol. Syst.


14 • Choi, et al.

Table 1. Resource consumption and performance after increasing the subtree parallel factor

Par.  Intercon.       LUT / FF / DSP / BRAM / URAM   CLK (MHz)  GOPS
1     Full crossbar   2.3K / 4.4K / 3 / 3 / 16       300        0.86
2     Full crossbar   4.3K / 8.3K / 6 / 9 / 16       300        1.7
4     Full crossbar   8.5K / 16K / 12 / 33 / 16      234        2.6
4     Chain+[21]      12K / 22K / 12 / 33 / 16       300        3.2

An important observation from Table 1 is the significant drop in the kernel clock frequency down to 234 MHz
when the subtree parallel factor is 4. In Alveo U250, the four DRAM channels are located in physically separated
Super Logic Regions (SLRs) [41], and the 16 (= 4×4) inter-PE FIFOs' SLR crossings have a negative impact on the routing
process.
To solve this problem, we changed the inter-PE communication architecture to a 1D chain (thus reducing
the number of FIFOs crossing the SLRs), and we added signal relay modules in the FIFOs using AutoBridge [21].
The frequency is improved to 300 MHz as a result. Even though the inter-PE FIFOs are now partially shared
among multiple PEs, the performance is not significantly degraded by data congestion because of the small
proportion of inter-PE communication in the Mastermind dataset. The cost of the improved frequency is the
larger LUT consumption (from 8.5K to 12K). This is due to the inter-PE chain modules and the signal relay
modules. We obtain a speedup of 3.7X (= 3.2/0.86 GOPS) when using a subtree parallel factor of four.

5.2 Query-level Parallelism


The BRAM and the LUT consumption of the inter-PE interconnect for subtree-level parallelism increases rapidly
as we add more PEs. Also, each PE requires a separate DRAM access module to fetch the static scheduling
information.
In order to increase the parallelism even with a limited number of DRAM channels, we propose query-level
parallelism. As the name suggests, we compute multiple queries in parallel. The static schedule from DRAM
is shared among all the different queries. The probability storage, AND/OR gates, max operations, and node_prob
storage in Fig. 9 are kept separate for each query (these variables are duplicated in the HLS code). Therefore,
increasing the level of parallelism increases the LUT/FF/URAM usage of the PEs, not the complexity of the inter-PE
interconnect nor the DRAM channel usage.
Since the static schedule is shared among different queries, no modification is required for the schedule data.
But we need to change the ordering of the literal values in the DRAM so that they can initialize the probability
storage in parallel. The vector of input literal values for a query is originally placed next to the literal values of
another query. After reading the literal values for multiple queries (corresponding to the query-level parallel
factor) from the DRAM, the accelerator transposes the literal input data so that different queries' values for the
same literal are placed adjacent to each other. This allows the probability storage to be initialized in a SIMD style
(the initialization timing is the same and the value is different) at the beginning of the query computation.
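A minimal sketch of this transpose follows; Q, num_literals, and the flat array layouts are illustrative assumptions.

// Rearrange literal values from query-major DRAM order (one query's
// vector after another) to literal-major order, so the Q duplicated
// probability storages can be initialized in the same cycle.
void transpose_literals(const int lit_in[],  // [Q * num_literals], query-major
                        int lit_out[],       // [num_literals * Q], literal-major
                        int Q, int num_literals) {
    for (int q = 0; q < Q; ++q) {
        for (int l = 0; l < num_literals; ++l) {
#pragma HLS pipeline II=1
            lit_out[l * Q + q] = lit_in[q * num_literals + l];
        }
    }
}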
Fig. 10 presents the effect on the performance after increasing the query-level parallelism. It shows that the
performance improves from 3.2 GOPS to 89 GOPS when we increase the query-level parallel factor from 1 to 32.
The improvement is slightly sub-linear because it is more difficult to close the timing as the resource utilization
increases. The clock frequency dropped from 300 MHz to 273 MHz for the design with a query-level parallel
factor of 32. The LUT/FF/URAM consumption of the PEs and the BRAM consumption of the inter-PE chain increase
approximately linearly with the query-level parallelism factor. The storage for transposing the literal values
occupies about 16K LUTs and 32K FFs when the subtree-level parallel factor is 4 and the query-level parallel factor is
32 (the resource breakdown will be presented in Section 6.3).


Fig. 10. Speedup after increasing query-level parallelism in the Mastermind dataset

6 EXPERIMENTAL RESULTS
6.1 Experimental Setup
For development, we used the Xilinx Vitis 2019.2 unified software platform [39]. The kernel was programmed in
C++ and compiled with Vivado HLS 2019.2 [40]. The tests were run on-board, using the Xilinx Alveo U250 [41]
platform. The probability is stored in 32b integers. The performance is measured in Giga Operations per Second
(GOPS).
In addition to the Mastermind dataset (which was used for the optimization and parallelization process),
we also tested our accelerator with three more datasets from [6]: FS-04 (friends and smokers), Students (and
professors), and (random) Blockmap (shown in Table 2). Each dataset is a grounded relational Bayesian
network, which is a very challenging problem for classical inference methods. Like Mastermind, these datasets
exhibit a vast amount of local structure, and PSDD can exploit this characteristic to achieve exact, tractable
inference. The datasets contain 3,548-52,789 nodes, and we cannot fit all the operators for the entire graph on the
FPGA. The datasets have sparse connections, with 1.0 to 1.4 edges per node (excluding leaf literal nodes). We
also list the number of leaf literal nodes and edges in the datasets.

Table 2. Datasets used for experiment

Dataset FS-04 Mastermind Students Blockmap


# of nodes               52,789   42,558   6,921   3,548
# of leaf literal nodes  528      2,328    687     1,218
# of edges               73,048   45,272   7,582   2,334
# of edges per node      1.4      1.1      1.2     1.0

6.2 Acceleration Results


The time required to generate the schedule for depth-batched processing (Section 4.1), common parent clustering
(Section 4.2), and subtree-level parallelism (Section 5.1) is shown in Table 3. The scheduling can be performed in
seconds due to its low complexity. Also, the scheduling only needs to be done once per dataset, and it can be
reused across many different queries.


Table 3. Schedule generation time (in seconds)

Dataset FS-04 Mastermind Students Blockmap AVG


Time 2.5 0.60 0.088 0.036 0.81

We have already analyzed the effect of increasing the subtree-level parallel factor and the query-level parallel
factor in Section 5.1 and Section 5.2, respectively. In this section, we present the cumulative effect of each
optimization, starting with the baseline HLS code in Fig. 3. After applying the edge-centric processing with the
static scheduling, the common parent clustering, and the probability array separation, the processing rate no
longer suffers heavily from the long pipeline epilogue/prologue cycles (12 cycles) even though there are only a
small number of child nodes (average: 1.1). Also, we can achieve an II of 1. The performance is improved by 21X
(Table 4). The subtree-level parallelism of four improves the performance by 3.7X (refer to the explanation in
Section 5.1), and the query-level parallelism of 32 leads to a speedup of 28X, as observed in Fig. 10. After applying
all optimization steps, we achieved a cumulative speedup of 2,200X compared to the baseline code. The clock
frequency of the final design is 273 MHz.

Table 4. Performance and cumulative speedup with proposed optimizations (Mastermind dataset)

             Baseline  Proc. rate opt.  Subtree paral.  Query paral.
GOPS         0.041     0.86             3.2             89
Speedup      -         21X              3.7X            28X
Cum. SpdUp   1.0X      21X              78X             2,200X

After applying all the proposed optimizations, we measured the performance on the various datasets. The
result is presented in Table 5: the first row is calculated considering the FPGA execution time only, and
the second row is calculated considering both the FPGA execution time and the PCIE transfer time (of the
graph-processing static schedule and the literal values of each query). Compared to other works that assume the
entire graph can fit on the FPGA, our design architecture is scalable and retains a relatively high performance for
datasets of various sizes (see Section 7 for a quantitative comparison). This is because the proposed design reads
the graph structure information from the DRAM, and the performance does not depend on the size of the graph
that is mapped to the FPGA.

Table 5. Performance in various datasets (in GOPS)

Dataset FS-04 Mastermind Students Blockmap AVG


Performance (FPGA only) 62 89 56 30 59
Performance (FPGA+PCIE) 61 77 45 17 50

The maximum FPGA-only performance is 89 GOPS (Mastermind), and the average FPGA-only performance is
59 GOPS. But there is some performance variance among the datasets. For the Blockmap dataset, the low performance
is due to the high proportion (34%) of leaf literal nodes (Table 2): much of the execution time is spent on
fetching and transposing the values of literal nodes rather than processing edges. Apart from this factor, the
performance difference is mostly due to the effectiveness of the subtree parallelization step. The amount of
inter-PE communication (Section 5.1) is 2% for Mastermind, 23% for Students, 27% for Blockmap, and 30% for
FS-04, and the data congestion due to the simple 1D chain architecture further degrades the performance.


Moreover, the coarsening step explained in Section 5.1 introduces an unbalanced workload, which accounts for
the rest of the performance difference. It remains future work to reduce these overheads without severely
complicating the inter-PE communication architecture.
The performance drops to an average of 50 GOPS if we consider the PCIE transfer time in addition to the FPGA
execution time. The static schedule is fetched from the DRAM and sent to the FPGA for each batch of queries,
but it only needs to be transferred once through the PCIE because the same schedule is reused in the DRAM.
The literal values for each query, on the other hand, are transferred through both the PCIE and the DRAM only
once; they are not reused in the DRAM. This leads to a larger performance drop for the Blockmap dataset (30→17)
compared to the FS-04 dataset (62→61): the Blockmap dataset has a larger proportion of literal nodes than
the FS-04 dataset (34% vs 1%, Table 2), which causes more time to be spent on transferring the literal values
through the PCIE.
Table 6 compares the performance between a multicore CPU implementation and the proposed FPGA
implementation. We use a two-socket Intel Xeon Gold 6244 server-class node (Xeon 6244 and Alveo U250 are
based on comparable technology, 14 nm vs 16 nm, and have the same release year, 2019), and we parallelize the
loop that processes the queries with 32 OpenMP threads (the parallel factor is the same as the FPGA query-level
parallelism). To avoid contention, prob[] has been separated for each OpenMP thread (the memory allocation
time is excluded from the execution time). The CPU implementation has been optimized with the -O3 flag. The
experimental result shows that the average performance is 2.9 GOPS; so even though we use two CPUs, the
performance is 20X lower than that of the proposed FPGA implementation. The reason is mainly related to how
fast the data can be supplied to the computation units. The proposed FPGA static scheduling provides operands
to the OR/AND-gates almost every cycle. In a CPU, this is not guaranteed since the data for thousands of nodes
and multiple threads cannot fit into the L1/L2 cache (notice that, unlike the FPGA performance, the CPU
performance is generally higher on smaller datasets). The FPGA's superior performance is also attributed to its
customized computation/memory pipeline.
Table 6. Performance comparison between 32-thread multicore CPU and proposed FPGA implementation (in GOPS)

Dataset FS-04 Mastermind Students Blockmap AVG


CPU 1.9 2.2 4.2 3.3 2.9
FPGA 62 89 56 30 59
Speedup 32X 41X 13X 9X 20X

We will present a quantitative comparison with related works in Section 7.

6.3 Accuracy and Resource Consumption


As mentioned in the experimental setup, the bitwidth of the probability variables is 32b. After comparing with
the MPE and MAR results of the floating-point-based CPU implementation, we found that the signal-to-
quantization-noise ratio (SQNR) is 66 dB to 97 dB, with an average of 85 dB. We can conclude that there is almost
no accuracy loss in employing fixed-point arithmetic, because this SQNR value far exceeds the typically expected
fixed-point representation accuracy (around 30 dB or more).
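For reference, SQNR is conventionally defined as the ratio of signal power to quantization-noise power: with xq the floating-point result of query q and x̂q the corresponding fixed-point result,
SQNR = 10 log10 ( Σq xq² / Σq (xq − x̂q)² ) dB.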
Table 7 details the post-PnR resource consumption of the PSDD acceleration kernel. We exclude the resources
used in the static region and the DRAM controller. The table reveals that the PEs consume most of the DSPs
(96%) for the computation. The PEs also require a large amount of URAM (256) to store the node probabilities.
The inter-PE chain (introduced in Section 5.1 for the subtree-level parallelism) consumes many BRAMs (684)
because of the FIFO buffers among the PEs. It also consumes a large portion of the LUTs to implement its control
circuits. About half of the LUTs/FFs in the DRAM access modules are used for transposing the literal values for
the query-level parallelism (Section 5.2).

Table 7. Post-PnR resource consumption on Alveo U250

LUT FF DSP BRAM URAM


DRAM Access 40K 50K 12 21 0
PEs 42K 66K 256 0 256
Inter-PE Chain 33K 76K 0 684 0
Total Res. Cons. 115K 193K 268 705 256
Util. Ratio 7.7% 6.2% 2.2% 31% 20%

Among all the FPGA resources, the one with the highest utilization is BRAM. But even the highest utilization
ratio is relatively small (31%). This is because we wanted to ease the PnR process by keeping the consumption
under 50% for all resources. Thus the subtree-level parallelism and the query-level parallelism were limited to 4
and 32, respectively.

7 RELATED WORKS
There are several recent papers that accelerate graph applications (e.g., sparse matrix-vector multiplication)
on an FPGA. HitGraph [44] is a high-performance edge-centric graph accelerator with several performance
optimization techniques such as node buffering and data layout optimization. ThunderGP [7] is an HLS-based graph
processing framework which automatically builds FPGA accelerators based on its high-level APIs. It supports
several efficient memory access patterns such as scatter/gather, coalescing, and prefetching. However, these
papers target general graph applications and cannot exploit the unique characteristics that exist in graph models
such as BNs, SPNs, or PSDDs.
In the remainder of this section, we will review related FPGA acceleration works for the BN and the sum-product
network (SPN).
The Bayesian Computing Machine (BCM) has been presented in [25]. It supports the sum-product algorithm
and the max-sum algorithm. Their hardware is composed of a network of processors and memory connected
through a switching crossbar. They propose an optimal scheduling of computation and memory units to minimize
the execution time. This work has been implemented on the Berkeley Emulation Engine FPGA platform [1].
The work in [43] uses a high-level synthesis tool chain similar to ours. They accelerate the Arithmetic
Circuit on the Xilinx Zynq embedded platform. They provide an option to reconfigure the network parameters
with the data fetched through the AXI bus. The parallelism is achieved by compiling the entire network onto the
FPGA, but such an approach limits the network size to no more than 511 nodes. Also, their accelerator achieves a
performance of only about 0.01 GFLOPS, possibly due to the data transfer overhead.
The work in [37] accelerates the SPN inference problem on a Virtex-7 FPGA. They map the SPN tree to a
hardware datapath composed of pipelined functional units and shift registers. Similar to [43], the entire SPN
tree is implemented on-chip. In a separate work [24], they present an architecture where the operators are
time-shared, which makes it possible to process larger graphs. The number representation in an SPN graph can
be efficiently optimized based on the histogram of the variables; readers are referred to [38] for an automated
method that finds the best number representation.
The work in [34] converts a PSDD graph to an SPN graph by replacing AND with products and OR with sums.
The resulting SPN graph is accelerated with a tree of customized processors, each with private registers. Their
simulation result reports a peak performance of 11.6 operations/cycle.
Compared to these works, our work concentrates on developing an HLS-friendly optimization method to
improve the processing rate and parallelism in PSDD graphs. A quantitative comparison is shown in Table 8. Since
each work uses a different graph structure, number representation, and dataset, it is difficult to make a direct


comparison. Instead, the performance is provided in either GOPS or Giga Edges Traversed Per Second (GTEPS).
The table also presents the graph size, resource consumption, platform, and performance. The performance of
[37] is estimated by multiplying the number of add/mul operations by the reported throughput (FPGA time only).

Table 8. Graph size, platform, resource consumption, and performance comparison with related works implemented on
FPGA

            HitGraph[44]  ThunderGP[7]  BCM[25]   [37]      [24]      Our work
Model       General       General       BN        SPN       SPN       PSDD
# of nodes  0.7M-41M      45K-21M       N/A       11-287    63-2,699  3,548-52,789
Graph Info  DRAM          DRAM          DRAM      On-chip   N/A       DRAM
Num Repr.   N/A           32b int       float     double    [38]      32b int
Platform    VU5P          VCU1525       Virtex-5  VC709     Ultra96   U250
CLK (MHz)   200           250           N/A       200       205-350   273
LUT         380K          1,100K        N/A       78K-346K  27K-59K   115K
DSP         62            150           N/A       60-1.5K   54-342    268
GOPS        N/A           N/A           20        2.1-57    1.9-13    30-89
GTEPS       1.0-3.4       1.7-6.4       N/A       N/A       N/A       10-30

Our work can outperform accelerators that do not assume a particular graph model (e.g., [44] or [7]) because our processing engine can exploit PSDD-specific characteristics, such as each node having a small number of child nodes or being connected to prime and sub nodes. Our work also outperforms the related BN [25] and SPN [24] work by 3.0X and 9.0X on average, respectively; like ours, these works read the graph information from memory. It is unclear whether our work significantly outperforms [37]: the higher frequency (273 MHz vs 200 MHz) may have been achieved due to the more advanced FPGA technology (Alveo U250 vs VC709), and this may explain the better performance (59 GOPS vs 31 GOPS, on average). It is also difficult to make a direct comparison of the LUT/DSP consumption because our work operates on 32b integers. Still, it is worth noting that [37] can only process relatively small graphs, since the entire SPN tree is mapped to FPGA operators, whereas our work can process larger graphs efficiently with the proposed static scheduling method. Also, our work shows less performance variance across different graph sizes because of the processing rate optimizations in Section 4.
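To make this structural point concrete, the fragment below sketches the edge-centric accumulation such an engine performs. It is a simplified sketch under an assumed edge layout, not our exact kernel: each edge carries a parent (decision) node, a prime child, a sub child, and a parameter, so the loop streams one edge per iteration no matter how few children a parent has.

// Edge-centric PSDD evaluation sketch under an assumed edge layout (not the
// exact kernel of this work). Edges are assumed to be sorted so that child
// values are final before they are read by a parent.
struct Edge {
    int parent;   // decision node accumulating the contribution
    int prime;    // index of the prime child
    int sub;      // index of the sub child
    float theta;  // edge parameter (probability weight)
};

void evaluate_edges(const Edge* edges, int num_edges, float* value) {
    for (int e = 0; e < num_edges; ++e) {
#pragma HLS pipeline II=1
        // Naively, the read-modify-write on value[parent] creates a
        // loop-carried dependence; in the actual design it is resolved by
        // static scheduling with common parent clustering and array separation.
        const Edge ed = edges[e];
        value[ed.parent] += ed.theta * value[ed.prime] * value[ed.sub];
    }
}

Under this view, a graph with E edges takes roughly E pipeline iterations per query once an II of 1 is sustained, independent of the per-node fan-in.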

8 CONCLUSION
We presented an HLS-friendly accelerator design for PSDDs. We found that changing the processing scheme from node-centric to edge-centric helps maintain a high processing rate even when each parent node has only a few children. This introduced a dependency problem, which was solved with static scheduling. The throughput of the PSDD pipeline was improved to an II of 1 with the common parent clustering and array separation techniques. We also proposed subtree-level and query-level parallelization methods that improve the computation speed for a large tree that cannot fit on an FPGA. Experimental results show that the optimizations improve the performance of the baseline implementation by 2,200X. The proposed architecture has a speedup of 20X over the multicore CPU implementation, and it outperforms the BN and SPN FPGA accelerators that store the graph information in memory by 3.0X–9.0X. As future work, we plan to further reduce the performance variance among different datasets with load balancing improvements and inter-PE communication reduction. This PSDD acceleration project has been open-sourced at https://github.com/carlossantillana/psdd/tree/alveo250.


ACKNOWLEDGMENTS
This research is supported by an Inha University Research Grant, a National Research Foundation (NRF) grant funded by the Korea Ministry of Science and ICT (MSIT) (2022R1F1A1074521), the US NSF grant on RTML: Large: Acceleration to Graph-Based Machine Learning (CCF-1937599), and the Xilinx Heterogeneous Accelerated Compute Cluster (HACC) Program. We thank Yuze Chi, Vidushi Dadu, Licheng Guo, Jason Lau, Michael Lo, Tony Nowatzki, and Yizhou Sun for the discussions and their help with the experiments. We also thank Marci Baun for proofreading this article.

REFERENCES
[1] Berkeley. 2008. BEE3 (Berkeley Emulation Engine). https://www.microsoft.com/en-us/research/project/bee3/
[2] Simone Bova. 2016. SDDs are Exponentially More Succinct than OBDDs. In AAAI. AAAI Press, 929–935.
[3] Ruizhe Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, and Yanzhi Wang. 2018. VIBNN: Hardware acceleration of Bayesian neural networks. ACM SIGPLAN Notices 53, 2 (2018), 476–488.
[4] Hei Chan and Adnan Darwiche. 2006. On the Robustness of Most Probable Explanations. In UAI '06, Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence. AUAI Press.
[5] Mark Chavira and Adnan Darwiche. 2008. On probabilistic inference by weighted model counting. Artificial Intelligence 172, 6 (2008), 772–799.
[6] Mark Chavira, Adnan Darwiche, and Manfred Jaeger. 2006. Compiling relational Bayesian networks for exact inference. International Journal of Approximate Reasoning 42, 1-2 (2006), 4–20.
[7] X. Chen, H. Tan, Y. Chen, B. He, W. Wong, and D. Chen. 2021. ThunderGP: HLS-based graph processing framework on FPGAs. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. 69–80.
[8] Arthur Choi, Guy Van den Broeck, and Adnan Darwiche. 2015. Tractable Learning for Structured Probability Spaces: A Case Study in Learning Preference Distributions. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI). AAAI Press, 2861–2868.
[9] Arthur Choi, Yujia Shen, and Adnan Darwiche. 2017. Tractability in Structured Probability Spaces. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems. 3477–3485.
[10] Arthur Choi, Nazgol Tavabi, and Adnan Darwiche. 2016. Structured Features in Naive Bayes Classification. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, 3233–3240.
[11] Young-kyu Choi and Jason Cong. 2018. HLS-based optimization and design space exploration for applications with variable loop bounds. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1–8.
[12] J. Cong, W. Jiang, B. Liu, and Y. Zou. 2011. Automatic memory partitioning and scheduling for throughput and power optimization. ACM Trans. Design Automation of Electronic Systems 16, 2 (2011), 1–25.
[13] J. Cong et al. 2011. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 30, 4 (Apr. 2011), 473–491.
[14] Adnan Darwiche. 2003. A differential approach to inference in Bayesian networks. J. ACM 50, 3 (2003), 280–305.
[15] Adnan Darwiche. 2009. Modeling and Reasoning with Bayesian Networks. Cambridge University Press.
[16] Adnan Darwiche. 2011. SDD: A New Canonical Representation of Propositional Knowledge Bases. In IJCAI. IJCAI/AAAI, 819–826.
[17] Adnan Darwiche. 2020. Three Modern Roles for Logic in AI. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS). ACM, 229–243.
[18] Adnan Darwiche. 2022. Tractable Boolean and Arithmetic Circuits. In Neuro-symbolic Artificial Intelligence: The State of the Art, Pascal Hitzler and Md Kamruzzaman Sarker (Eds.). Vol. 342. Frontiers in Artificial Intelligence and Applications. IOS Press, Chapter 6.
[19] Johannes Geist, Kristin Y. Rozier, and Johann Schumann. 2014. Runtime Observer Pairs and Bayesian Network Reasoners On-board FPGAs: Flight-Certifiable System Health Management for Embedded Systems. In International Conference on Runtime Verification (Lecture Notes in Computer Science, Vol. 8734). Springer, 215–230.
[20] R. L. Graham. 1966. Bounds for certain multiprocessing anomalies. Bell System Technical Journal 45, 9 (1966), 1563–1581.
[21] L. Guo, Y. Chi, J. Wang, J. Lau, W. Qiao, E. Ustun, Z. Zhang, and J. Cong. 2021. AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays.
[22] Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. 2014. Probabilistic Sentential Decision Diagrams. In Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference (KR). AAAI Press.
[23] Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models - Principles and Techniques. MIT Press.
[24] Hanna Kruppe, Lukas Sommer, Lukas Weber, Julian Oppermann, Cristian Axenie, and Andreas Koch. 2021. Efficient Operator Sharing Modulo Scheduling for Sum-Product Network Inference on FPGAs. https://www.esa.informatik.tu-darmstadt.de/assets/publications/materials/2021/2021_SAMOS_HK.pdf
[25] Mingjie Lin, Ilia Lebedev, and John Wawrzynek. 2010. High-throughput Bayesian computing machine with reconfigurable hardware. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 73–82.
[26] Kevin P. Murphy. 2012. Machine Learning - A Probabilistic Perspective. MIT Press.
[27] Xie Pan and Yu Jinsong. 2017. Diagnosis via arithmetic circuit compilation of Bayesian network and calculation on FPGA. In 2017 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI). 35–41.
[28] Judea Pearl. 1989. Probabilistic Reasoning in Intelligent Systems - Networks of Plausible Inference. Morgan Kaufmann.
[29] Knot Pipatsrisawat and Adnan Darwiche. 2008. New Compilation Languages Based on Structured Decomposability. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence. AAAI Press, 517–522.
[30] Hoifung Poon and Pedro M. Domingos. 2011. Sum-Product Networks: A New Deep Architecture. In UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. AUAI Press, 337–346.
[31] Tahrima Rahman, Prasanna Kothalkar, and Vibhav Gogate. 2014. Cutset Networks: A Simple, Tractable, and Scalable Approach for Improving the Accuracy of Chow-Liu Trees. In Machine Learning and Knowledge Discovery in Databases - European Conference (ECML PKDD) (Lecture Notes in Computer Science, Vol. 8725). Springer, 630–645.
[32] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 472–488.
[33] N. Shah, L. Olascoaga, W. Meert, and M. Verhelst. 2020. Acceleration of probabilistic reasoning through custom processor architecture. In Design, Automation & Test in Europe Conference & Exhibition (DATE). 322–325.
[34] Nimish Shah, Laura I. Galindez Olascoaga, Wannes Meert, and Marian Verhelst. 2020. Acceleration of probabilistic reasoning through custom processor architecture. In 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 322–325.
[35] Yujia Shen, Arthur Choi, and Adnan Darwiche. 2016. Tractable Operations for Arithmetic Circuits of Probabilistic Models. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems. 3936–3944.
[36] Yujia Shen, Arthur Choi, and Adnan Darwiche. 2018. Conditional PSDDs: Modeling and Learning With Modular Knowledge. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. AAAI Press, 6433–6442.
[37] Lukas Sommer, Julian Oppermann, Alejandro Molina, Carsten Binnig, Kristian Kersting, and Andreas Koch. 2018. Automatic mapping of the sum-product network inference problem to FPGA-based accelerators. In 2018 IEEE 36th International Conference on Computer Design (ICCD). 350–357.
[38] Lukas Sommer, Lukas Weber, Martin Kumm, and Andreas Koch. 2020. Comparison of Arithmetic Number Formats for Inference in Sum-Product Networks on FPGAs. In 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 75–83.
[39] Xilinx. 2020. Vitis Unified Software Platform. https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html
[40] Xilinx. 2020. Vivado High-Level Synthesis (UG902). https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_2/ug902-vivado-high-level-synthesis.pdf
[41] Xilinx. 2021. Alveo U250 Data Center Accelerator Card. https://www.xilinx.com/products/boards-and-kits/alveo/u250.html
[42] Xilinx. 2021. UltraScale Architecture Memory Resources. https://www.xilinx.com/support/documentation/user_guides/ug573-ultrascale-memory-resources.pdf
[43] S. Zermani, C. Dezan, H. Chenini, J. Diguet, and R. Euler. 2015. FPGA implementation of Bayesian network inference for an embedded diagnosis. In 2015 IEEE Conference on Prognostics and Health Management (PHM). 1–10.
[44] S. Zhou, R. Kannan, V. K. Prasanna, G. Seetharaman, and Q. Wu. 2019. HitGraph: High-throughput graph processing framework on FPGA. IEEE Transactions on Parallel and Distributed Systems 30, 10 (2019), 2249–2264.
[45] Stephanie Zierke and Jason D. Bakos. 2010. FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods. BMC Bioinformatics 11, 1 (2010), 1–12.
