Performance Evaluation of a Two-Dimensional Lattice Boltzmann Solver Using CUDA and PGAS UPC
The Unified Parallel C (UPC) language from the Partitioned Global Address Space (PGAS) family unifies the
advantages of shared and local memory spaces and offers a relatively straightforward parallelisation of code
for Central Processing Units (CPU). In contrast, the Compute Unified Device Architecture (CUDA) development
kit provides the tools to exploit Graphics Processing Units (GPU). We provide a detailed comparison between
these novel techniques through the parallelisation of a two-dimensional lattice Boltzmann method based
fluid flow solver. Our comparison of the CUDA and UPC parallelisations takes into account the required
conceptual effort, the performance gain, and the limitations of the approaches from the application-oriented
developers’ point of view. We demonstrated that UPC led to competitive efficiency with the local memory
implementation. However, the performance of the shared memory code fell behind our expectations, and
we concluded that the investigated UPC compilers could not treat the shared memory space efficiently. The
CUDA implementation proved to be more complex than the UPC approach, mainly because of the
complicated memory structure of the graphics card, which at the same time is what makes GPUs well suited
to the parallelisation of the lattice Boltzmann method.
Additional Key Words and Phrases: Partitioned Global Address Space, PGAS, Unified Parallel C, UPC, nVidia,
Compute Unified Device Architecture, CUDA, Computational Fluid Dynamics, CFD, lattice Boltzmann method,
LBM
Authors’ addresses: M. Szőke, Faculty of Engineering, University of Bristol, UK; T. I. Józsa, School of Engineering, University
of Edinburgh, UK; Á. Koleszár, Soliton Systems Europe A/S, Taastrup, Denmark; I. Moulitsas and L. Könözsy, Computational
Engineering Sciences, Cranfield University, UK.
1 INTRODUCTION
In the world of High Performance Computing (HPC), effective implementation and parallelisation are vital
for novel scientific software. Computational Fluid Dynamics (CFD) targets fluid flow modelling, which is a
typical application field of HPC. While the Message Passing Interface (MPI) (Message Passing Interface
Forum 2012) has become the dominant technique in parallel computing, other approaches, such as the
Partitioned Global Address Space (PGAS) (PGAS 2015) and General-Purpose Computing on Graphics
Processing Units (GPGPU), have emerged in the last decade.
Co-Array Fortran (Numrich and Reid 1998), Chapel (Chamberlain et al. 2007), X-10 (Ebcioglu
et al. 2004), Titanium (Yelick et al. 1998) and Unified Parallel C (UPC) (Chauwvin et al. 2007) are
members of the PGAS model family. These languages attempt to offer an easier way than MPI for parallel
programming on systems based on multi-core Central Processing Units (CPU). This involves keeping the
code (a) portable: architecture-specific optimisation is left to the compiler; (b) readable and productive:
such languages were shown to be easier to code and easier to read (Cantonnet et al. 2004); and (c) well
performing: it was shown that such languages offer the same or even better performance than MPI
(Johnson 2005; Mallón et al. 2009).
Performance-centred investigations were carried out to compare commercial UPC compilers
(IBM, Cray, HP) with open source UPC compilers (GNU, Michigan, Berkeley). Husbands et al. (2003)
reported that Berkeley UPC (BUPC) is competitive with the commercial HP compiler. The former
achieved high performance in pointer-to-shared arithmetic because of its own compact pointer
representation.
Mallón et al. (2009) compared UPC to MPI. They found that UPC showed poor results in collective
performance against MPI (Taboada et al. 2009) due to high start-up communication latencies. In
other aspects, their UPC code performed better than MPI.
Zhang et al. (2011) implemented the Barnes-Hut algorithm in UPC. They reported that the
problem can be conveniently approached in UPC because the algorithm has dynamically changing
communication patterns that can be handled by the implicit communication management of UPC.
They reported poor performance because of the lack of efficient data management when their
code relied on shared memory. The problem was resolved with the help of additional language
extensions and optimisations ensuring that the data is cached from the global memory before it is
requested.
Most of the performance evaluations were done via synthetic benchmarks such as FFT calcula-
tions, N-Queens problem, NAS benchmarks from NASA (El-Ghazawi et al. 2006) etc., and only a
limited number of papers focused on the application of UPC for physical problems. One of these is
the work of Markidis and Lapenta (2010), where a particle-in-cell UPC code was implemented to
simulate plasma. They experienced performance degradation for a high number of CPUs, which they
attributed to a specific part of their solver.
Johnson (2005) used the Berkeley UPC compiler for an in-house code on a Cray supercomputer (Cray
Inc. 2012) to run CFD simulations. The code solved the incompressible Navier-Stokes Equations
(NSE) using the finite element method. The UPC code showed better performance than the MPI
version, and the performance difference grew with the number of CPUs: UPC performed better
than MPI especially above 64 threads. It was shown that MPI required more time than UPC to pass
small messages. Above a certain message size, UPC still performed better, but the difference was
negligible.
Although HPC centres are dominated mainly by multi-core CPUs, researchers discovered the
potential for scientific computing on Graphics Processing Units (GPU) in the early 2000s (McClana-
han 2010). The Compute Unified Device Architecture (CUDA) Software Development Kit (SDK)
was released to popularise GPGPU on nVidia graphical cards (Sanders and Kandrot 2010). The
CUDA libraries can be used from several languages, including Fortran, Java, MATLAB and Python, but
most of the time CUDA is used with C/C++. The massively parallel nature of graphics cards proved
to be useful in several fields, from pure mathematics (Manavski and Valle 2008; Zhang et al.
2010), through image processing (Stone et al. 2008), to physics (Anderson et al. 2008), including
fluid flow modelling (Chentanez and Müller 2011; Ren et al. 2014).
The lattice Boltzmann method (LBM) has become quite popular on parallel architectures because of
the local nature of its operations, which is discussed in Section 2. The first parallel solvers relied
on CPUs and were reported in the 1990s (Amati et al. 1997; Kandhai et al. 1998). Later on,
GPU architectures proved to be a good basis for parallelising the LBM. A two-dimensional CUDA
implementation was published by Tölke (2010), where a speedup of ≈20 was reported. Several
descriptions of three-dimensional solvers can be found, such as Ryoo et al. (2008), where a speedup
of 12.5 was measured. Rinaldi et al. (2012) reached a speedup of 131 using advanced strategies with
CUDA. The LBM also proved to be highly efficient in multi-GPU environments (Xian
and Takayuki 2011). As far as the authors know, only Valero-Lara and Jansson (2015) have considered
using UPC for the LBM.
The advantages of the LBM are its good scalability, explicit time-step formulation, applicability
to multiphase flows (Succi 2001) and applicability to flows with relatively high Knudsen numbers
(Mohamed 2011). The latter is one of the main limitations of the NSE. The LBM has been
developed for incompressible subsonic flows; it is second-order accurate and has relatively high
memory requirements due to the discretisation of the particle directions.
In this paper, we present and compare the performance gains achieved by the CUDA and UPC
parallelisation of an in-house LBM code. The solver handles two-dimensional fluid flow problems
using the LBM. A performance-oriented comparison of these two commonly applied architectures is
rarely discussed in the literature. Since HPC is often used by mathematicians, physicists, and
engineers with limited computer science knowledge, we also aim to inspect how user-friendly
CUDA and UPC are. We investigated the effect of the following factors:
(a) memory structure of UPC: shared and local variables;
(b) spatial resolution;
(c) hardware;
(d) collision models;
(e) data representation: single and double precision;
(f) required programming effort for the different codes.
The UPC implementation was evaluated on two different clusters, and the results were compared
to the CUDA parallelisation on two different architectures. To quantify the performance, the speedup
was defined as $SU = t_{serial}/t_{parallel}$, where $t_{serial}$ is the execution time of the serial code and
$t_{parallel}$ is the execution time of the parallel code.
by Chorin (1967)), thus the method can be used to describe incompressible fluid flows (He and Luo
1997).
$$\frac{\partial f}{\partial t} + (\vec{c} \cdot \nabla) f = \Omega \qquad (1)$$
We applied the D2Q9 model, which means that the resolved flow field was two-dimensional,
and particle propagation was allowed in nine discrete directions (Fig. 1). The time marching was
divided into four steps from the implementation point of view:
I. Collision The collision modelling treats the right hand side of Eq. (1). Two collision models
were analysed:
(a) In the BGKW model, the collision is described by Eq. (2), which was derived by Bhatnagar
et al. (1954) and by Welander (1954). Here $\tau$ is the relaxation factor, calculated
from the lattice viscosity of the fluid, and $f^{eq}$ is the so-called local equilibrium distribution
function described by the Maxwell-Boltzmann distribution (Maxwell 1860). It
is important to note that after the discretisation the collision term can be computed
independently for every direction as
$$\Omega = \frac{1}{\tau}\left(f^{eq} - f\right). \qquad (2)$$
(A code sketch of both collision models is given below, after the description of the main loop.)
(b) The Multi Relaxation Time (MRT) model was presented and reviewed by d’Humières
(1992) and d’Humières et al. (2002). In this case, instead of using a single constant ($\tau^{-1}$)
to describe the collision, the model applies a matrix which depends on the resolved
directions (D2Q9); in our case, this matrix has a dimension of 9×9. This approach
yields a matrix multiplication for each lattice. Although this model is computationally
more expensive than the BGKW, it is widely applied since the flow field can be resolved
more accurately.
II. Streaming The streaming process occurs when the directional distribution functions
“travel” to the neighbouring cells: second term on the left hand side of Eq. (1). This process
is presented in Fig. 1(b).
III. Boundary treatment These steps are followed by the handling of the boundaries. In the
current paper, the boundary description suggested by Zou and He (1997) was used for the
moving wall, and the so called bounce-back boundary condition (Succi 2001) was used to
handle the no-slip condition at the stationary walls.
IV. Update macroscopic Once the distribution functions were known at the end of the time
step, the macroscopic variables (density, x- and y-directional velocity components) had to
be computed from the distribution functions to recover the flow field.
The sequence of the listed steps is referred to from now on as the main loop. The grid generation and the
initialisation of the microscopic and macroscopic variables took place before the main loop.
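To make the collision step more concrete, the following minimal C sketch shows how Eq. (2) and the MRT matrix product could be evaluated at a single lattice site. The direction ordering, the lattice weights and the second-order equilibrium formula are the standard D2Q9 choices rather than values taken from the solver, and the 9×9 MRT collision matrix A is left as an input because its entries depend on the chosen relaxation rates.

/* Standard D2Q9 weights and discrete velocities (illustrative ordering,
   not necessarily the numbering of Fig. 1) */
static const double w[9]  = {4.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0,
                             1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0};
static const int    cx[9] = {0, 1, 0, -1,  0, 1, -1, -1,  1};
static const int    cy[9] = {0, 0, 1,  0, -1, 1,  1, -1, -1};

/* BGKW collision at one lattice site: f <- f + (f_eq - f)/tau, cf. Eq. (2) */
void collide_bgkw(double f[9], double rho, double ux, double uy, double tau)
{
    double usq = ux*ux + uy*uy;
    for (int k = 0; k < 9; k++) {
        double cu  = cx[k]*ux + cy[k]*uy;
        /* second-order expansion of the Maxwell-Boltzmann equilibrium */
        double feq = w[k]*rho*(1.0 + 3.0*cu + 4.5*cu*cu - 1.5*usq);
        f[k] += (feq - f[k]) / tau;
    }
}

/* MRT collision at one lattice site: f <- f + A (f_eq - f), where the 9x9
   matrix A replaces the single constant 1/tau of the BGKW model */
void collide_mrt(double f[9], const double feq[9], const double A[9][9])
{
    double df[9], corr[9];
    for (int k = 0; k < 9; k++) df[k] = feq[k] - f[k];
    for (int k = 0; k < 9; k++) {
        corr[k] = 0.0;
        for (int j = 0; j < 9; j++) corr[k] += A[k][j] * df[j];
    }
    for (int k = 0; k < 9; k++) f[k] += corr[k];
}

Because both routines operate on a single lattice independently of its neighbours, the collision step is embarrassingly parallel, which is what makes the LBM attractive for both GPU and PGAS parallelisation.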
In terms of the computational grid, the method is based on a uniform Cartesian lattice, which is
represented in Fig. 1(a), where the nine directions of the streaming are displayed as well. In order
to examine the effect of the spatial resolution on the performance gain, four different grids were
investigated. In the following, the lattices are referred to based on the names given in Table 1.
The simulated fluid flow was the well known lid-driven cavity, which is a common validation
case for CFD solvers (Ghia et al. 1982). The aspect ratio of the domain was unity. The lid on the
top moved with a defined positive x-directional velocity, while all the other boundaries were
stationary walls (see Fig. 1(a)). From these conditions, the Reynolds number in the domain was
defined based on the lattice quantities as $Re = n_x u_{lid}/\nu_l$, where $n_x = n$ is the number of lattices
along the x-direction, $u_{lid}$ is the lid velocity, and $\nu_l$ is the lattice viscosity, which was set to 0.1.
Fig. 1. (a) Lattice and the applied boundary conditions (moving lid on top, stationary walls elsewhere); (b) streaming and the nine discrete directions ($\vec{c}_i$)
The Reynolds number of the simulations was 1000. At the end of the computations, the coordinates
and the macroscopic variables were saved. A qualitative validation of the flow field can be found in
Appendix B. For a more comprehensive validation and verification, we refer to the work of Józsa
et al. (2016).
3 PARALLELISATION
The serial code was written in C and was structured as presented in Section 2. The
serial implementation was based on a one-dimensional Array of Structures (AoS), similarly to the
UPC version. The Cells structure included the macroscopic variables (velocity components as U and
V, density as Rho, etc.) as scalars, while the distribution function F was stored as a nine-dimensional
array within the Cells structure. Thus (Cells+i)->F[k] referred to the ith lattice in the domain
and the corresponding distribution function in the kth discrete direction. The parallelisation process
can be followed in Appendix A. The serial simulations were performed on Archer, the United
Kingdom National Supercomputing Service.
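Based on the description above and the code samples in Appendix A, the Cells structure corresponds roughly to the following sketch; the exact types and any fields beyond U, V, Rho, Fluid, StreamLattice, F and METAF are assumptions made for illustration.

typedef struct {
    double U, V;              /* macroscopic velocity components               */
    double Rho;               /* macroscopic density                           */
    int    Fluid;             /* 1 if the lattice lies inside the fluid domain */
    int    StreamLattice[9];  /* 1 if streaming is allowed in direction k      */
    double F[9];              /* distribution functions, one per direction     */
    double METAF[9];          /* post-collision values used during streaming   */
} CellData;                   /* the solver's instance is called Cells         */

/* (Cells + i)->F[k] is then the distribution function of the ith lattice in
   the kth discrete direction, exactly as it is used in Appendix A. */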
First, we parallelised the solver with UPC and ran the codes on Archer and Astral. The latter is the
HPC cluster of Cranfield University. The hyper-threading technology (Intel 2015) was switched off
on both architectures. The properties of the two clusters are listed in Table 2. A higher performance
can be expected on Archer since it offers several optimisation features such as hardware-supported
shared memory addressing. Note that the interconnection between the nodes is different in the
two investigated clusters. On Archer the commercial Cray C compiler was available, while the
open source BUPC compiler was installed on Astral.
Second, the CUDA parallelisation was tested. The performance gain was evaluated on two
different nVidia GPUs. The relevant properties of the graphical cards are listed in Table 3. While
the GeForce cards are cheaper devices, as they are primarily designed for computer games, the
Tesla cards are more expensive, directly designed for scientific computing. The code development
was carried out on a desktop using the GTX 550Ti card, while the Tesla GPU of the University of
Edinburgh’s Indy cluster was used for further investigations.
the shared address space and in the local memory at the same time. A schematic drawing of the local
and the shared memory spaces is shown in Fig. 2. Implementing UPC based codes therefore offers
the opportunity to combine the advantages of the conventional parallelisation methods, such as
OpenMP and MPI, which restrict the programmer to rely purely on either shared or local memory.
The UPC compiler is designed to manage and handle the data layout between the threads
and nodes. This keeps the architecture-dependent problems hidden from the user, i.e. optimisation is
expected to be performed by the compiler.
Fig. 2. Memory model of the local and shared memory spaces (Chauwvin et al. 2007)
The syntax of the local memory declarations is the same as in the standard C language. Data
exchange between the threads is performed via the upc_memput, upc_memget and upc_memcpy
functions. The first function copies local data to the shared space, and the second copies data from
the shared to the local memory. The last function performs data copy from shared to shared address
space. Note that the first and second functions are similar to the MPI_Send and MPI_Recv functions.
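As a minimal illustration (not taken from the solver), the two functions can be used to publish a locally updated halo row into the shared space and to fetch a neighbour's row from it; the array names and the row length NX are assumptions.

#include <upc.h>

#define NX 256                          /* lattices per halo row (assumed)   */

shared [NX] double halo[THREADS*NX];    /* one halo row owned by each thread */
double local_row[NX];                   /* private working copy of the row   */

/* copy the locally updated row into the shared space (local -> shared) */
void publish_row(void)
{
    upc_memput(&halo[MYTHREAD*NX], local_row, NX*sizeof(double));
}

/* fetch the row owned by the next thread into local memory (shared -> local) */
void fetch_neighbour_row(void)
{
    if (MYTHREAD + 1 < THREADS)
        upc_memget(local_row, &halo[(MYTHREAD+1)*NX], NX*sizeof(double));
}

Written this way, the halo swap is expressed as one-sided copies instead of the matched send/receive pairs required by MPI.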
The shared memory based data needs to be declared according to the UPC standards. In this
case, the programmer must declare a compile-time constant block size, which defines how many
elements of a vector belong to each thread. The data is distributed among the threads in a
round-robin fashion using the corresponding block size. This compile-time constant
restriction is the biggest drawback of the investigated language. If the programmer wishes to
store the whole mesh in the shared memory, as in our case, then the mesh size needs to be
known before compilation. In other words, different executable files are needed for
different mesh sizes.
To perform shared memory based operations, UPC offers the usage of upc_forall, which is an
extension of the standard C for loop. Each shared variable within a vector has an affinity term that
describes which thread the given element belongs to. Based on this information, the upc_forall
distributes the computational load between the threads. UPC also offers the usage of barriers, locks,
shared pointers, collectives etc. For further description we refer to Chauwvin et al. (2007).
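A minimal sketch of these two features, assuming a hypothetical density array Rho and a compile-time constant block size BLOCK, could look as follows; the affinity expression &Rho[i] ensures that each thread only processes the elements stored in its own part of the shared space.

#include <upc.h>

#define BLOCK 1024                         /* compile-time constant block size */

shared [BLOCK] double Rho[BLOCK*THREADS];  /* blocks assigned round-robin      */

void scale_density(double factor)
{
    int i;
    upc_forall (i = 0; i < BLOCK*THREADS; i++; &Rho[i])
        Rho[i] *= factor;                  /* executed by the owner thread     */
}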
In our case, two UPC codes were implemented: (a) one with shared memory implementation,
and (b) another code relying on the local memory. The streaming steps of these codes are given
as an example in Appendix A. In the former code, the data is laid down in the shared memory,
and in the latter one, the data is stored in the local memory. The shared memory based code
exploits the novelty of UPC, i.e. this code relies on the upc_forall function and shared pointer
declarations. This leads to a more easily readable code. The second approach, which exploits data
locality and follows the logic of MPI implementations, offers better speed and lower latency.
As a disadvantage, the upc_memput and upc_memget memory operations were required; therefore this
code is more complex and consists of more lines. The computational load was distributed equally
between the threads in both implementations, since the mesh size was divisible by the number of
threads during the simulations.
During the compilation of the UPC codes, similarly to the serial code, performance-affecting compiler
flags were avoided. The compile command of the UPC codes, for example using four threads, was
upcc -T=4 *.c -lm -o LBMSolver. The execution simplified to upcrun LBMSolver on both HPC
systems.
After the boundary treatment was identified as the bottleneck of the computations (see
Section 4, Fig. 6), the global search, which was performed based on a boolean mask at every
time step to find the boundary lattices, was replaced. In the last version the boundary lattices
were selected during the initialisation so that the boundary treatment kernel function “knew”
the location of the boundaries in advance.
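The idea can be sketched as follows for the bounce-back walls: the indices of the boundary lattices are gathered once during initialisation and the kernel simply loops over that list. The kernel name, the flat layout of f (nine values per cell) and the direction ordering are assumptions for illustration, not the solver's actual implementation.

/* assumed opposite-direction table for a conventional D2Q9 ordering */
__constant__ int d_opp[9] = {0, 3, 4, 1, 2, 7, 8, 5, 6};

__global__ void bounceBackKernel(double *f, const int *boundaryIdx, int nBoundary)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nBoundary) return;

    int cell = boundaryIdx[tid];        /* boundary lattice found at initialisation */
    const int pairs[4] = {1, 2, 5, 6};  /* swap each direction pair only once       */

    /* full bounce-back: reflect the distribution functions at the wall */
    for (int p = 0; p < 4; p++) {
        int k = pairs[p];
        double tmp           = f[9*cell + k];
        f[9*cell + k]        = f[9*cell + d_opp[k]];
        f[9*cell + d_opp[k]] = tmp;
    }
}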
In every case, one-dimensional grids and blocks were used; furthermore, 256 threads were
initialised within every block (block size). The number of blocks (grid size) varied automatically
as a function of the mesh size. This set-up proved to be the most efficient computationally, although
it resulted in a strong limitation on the maximum mesh size. The theoretical maximum
thread number of the devices is 65535×1024 (maximum grid size times maximum block size). In
the first two implementations a thread was assigned to every distribution function in every
lattice, which led to a maximum cell number of 65535×256/9 ≈ 1.86×10⁶. This limitation was overcome
in the third CUDA parallelisation step by assigning the threads to the cells, so the maximum
number of lattices became nine times higher.
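In code, the launch configuration described above reduces to a simple rounding-up division; the helper and kernel names below are illustrative, not the solver's actual API.

#include <cuda_runtime.h>

/* one-dimensional launch: 256 threads per block, one thread per lattice
   (third implementation); the grid size follows from the mesh size */
static inline dim3 gridFor(int nCells, int blockSize)
{
    return dim3((nCells + blockSize - 1) / blockSize);
}

/* usage (names are assumptions):
   collisionKernel<<<gridFor(nCells, 256), 256>>>(d_cells, nCells); */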
The last version of the CUDA code was compiled with the nvcc -arch=sm_20 -rdc=true
command. Here the first flag defines the virtual architecture of the device, while the second one
allows the user to compile the files separately and link them at the end. (The first and the second
version did not need the -rdc=true flag, since the kernel functions were in one file.) The authors
note that compiling the code with a more recent virtual architecture for the K20 GPU, for instance
-arch=sm_35, would probably result in an enhanced parallel performance. The -arch=sm_20 flag
was used because this was the most recent virtual architecture supported by both of the tested
GPUs. The detailed analysis of the code performance can be found in Section 4.
Collision model   Precision   Coarse    Medium    Fine     Ultra fine
BGKW              Single       1.204     4.974    45.01     177.6
BGKW              Double       1.264     9.664    59.77     240.5
MRT               Single       1.696     7.092    51.56     204.1
MRT               Double       1.956    11.946    65.75     263.1
code was slower than the serial. By crossing nodes, the speedup continued to increase indicating
appropriate communication between the nodes.
Fig. 3. Speedup of the main loop as a function of the parallelisation approach (Astral and Archer, BGKW and MRT collision models), fine mesh, double precision arithmetic
Fig. 3(a) indicates that none of the simulations exploit the maximum performance of the su-
percomputers. Despite the hardware support of the shared memory operations on Archer, the
simulations did not show better speedup results compared to Astral. We hypothesised that the
compilers could not handle the shared pointers and manage the data between the shared and local
memory spaces effectively. As a first step, performance analysis was conducted on Archer using
CrayPat (Cray 2015), which showed that the data was managed and “fed” to the CPUs properly.
As a next step, we tried to find other reasons for the problem. We reckon that the poor performance
might have been caused by one of the following factors, and so we took measures to overcome
them:
(a) usage of shared pointers: all of them were tested with static variables;
(b) inappropriate time measurement: we tested different approaches such as the clock() and
the MPI_Wtime() commands;
(c) inappropriate usage of the upc_forall command: different methods were examined to
distribute the computational load, for instance working threads were defined based on the
affinity of shared variables (&Cells[i]) or on modular division of integers (i % THREADS);
(d) usage of the Cells structure: the structure was eliminated (see the code samples in Appendix A);
(e) lack of optimisation flags: all of the available optimisation flags (-O1, -O2, -O3) were tested.
None of these modifications resulted in a better speedup in the case of shared variables, i.e. the
experienced performance of the shared code was the same for each listed factor. Therefore, we
concluded that the compiler cannot handle the data properly without additional modifications,
and for this particular application the compiler is not yet mature enough. As presented by
Zhang et al. (2011), the problem could be resolved with external libraries and user-implemented
machine-level data management. Adding these low-level modifications to the shared code would,
however, eliminate the main advantage of UPC, namely the quick and user-friendly parallelisation
environment. To overcome this problem, the data was instead transferred to the local memory and
another, MPI-like code was developed.
Fig. 4. (a) Local memory based speedup, single precision arithmetic; (b) comparison of double and single precision speedup, MRT collision model
Fig. 3(b) shows the local memory based speedup using the fine mesh and double precision. This
approach gave significantly better results. Here we can see that crossing a node did not introduce
significant latency either; the compilers were capable of managing the halo swap between the
nodes.
Fig. 4(a) shows the speedup as a function of the collision model. The two models had similar
parallel efficiency. We can see that the BGKW collision model (solid lines) enabled slightly better
speedup results than the MRT collision model. Fig. 4(b) gives us a basis for an explicit comparison
between the performance of the single and double precision executions. This graph shows the
speedup achieved on the fine mesh with the MRT collision model. We can see that the single
precision results (continuous lines) are better above 16 threads than the double precision ones
(dashed lines). Between 16 and 32 threads the first node was crossed on both architectures. The
difference between the single and double precision curves above 16 threads originates from the
communication costs. The double precision approach required more data handling, resulting in a
lower speedup.
We plotted the effect of mesh size on the speedup in Fig. 5(a) and 5(b) for Astral and Archer,
respectively. If we consider more than 32 threads, then we may conclude that better speedups were
achieved with increasing mesh size. Unexpectedly, this finding was not valid for the ultra fine
mesh, where performance degradation was experienced on both architectures. To find the “leakage”
in the performance, we measured the time spent on data transfer on the fine and the ultra
fine meshes using 128 threads. With this set-up, the halo included twice as much data on the ultra
fine mesh as on the fine mesh, so the data transfer should have taken roughly twice as long as
well. In contrast, the data transfer took six times longer on the ultra fine mesh compared to the
fine one. The performance degradation experienced on the ultra fine mesh was therefore caused by the
increased communication costs, which appears to be a relatively strong limitation. We note that the
ultra fine mesh consists of approximately one million cells, so the computational load of the
processors is still reasonably low when 128 threads are allocated.
Fig. 5. The effect of mesh size (coarse, medium, fine, ultra fine) on the main loop speedup, double precision number representation; (a) Astral, (b) Archer
Fig. 6. Profiling results (share of the collision, streaming, boundary treatment, and update of the macroscopic variables) based on single precision simulations with the fine mesh on the GTX 550Ti graphical card, and on Archer using 64 threads: (c) UPC–BGKW, (d) CUDA 1–BGKW, (e) CUDA 2–BGKW, (f) CUDA 3–BGKW, (g) UPC–MRT, (h) CUDA 1–MRT, (i) CUDA 2–MRT, (j) CUDA 3–MRT
happened within the blocks, so that it could be done more efficiently. While the first and the second
CUDA development steps were more favourable for the streaming, the third step was specifically
developed to decrease the cost of the collision.
The scalability of the codes as a function of the hardware is displayed in the bar charts in Fig. 7.
As we can see in Figs. 7(a) and 7(b), the speedups of the BGKW and the MRT models were bounded
around 50 on Archer, and around 80 on the Tesla K20 card. In the case of the K20 card, the performance
gap between double and single precision execution is clearly visible: while a maximum speedup of
around 80 was measured with single precision arithmetic (Fig. 7(a)), a speedup around 65 could
be achieved with double precision arithmetic in the case of the BGKW model (Fig. 7(c)). A similar
trend can be seen in the case of the MRT model as well, with a slightly wider gap between the single
precision and double precision arithmetic (Figs. 7(b) and 7(d)). Considering that the double precision
processing power of the K20 unit is approximately a third of its single precision processing power
(Table 3), it might seem surprising that the performance gap is only around 20%. If we also consider
that two of the main steps in the LBM, (streaming and boundary treatment) are essentially data
copying, then the relatively small gap makes more sense: the high memory bandwidth of the GPU
compensated for the lack of computing power.
Fig. 7. Speedup of the main loop as a function of the grid spacing (coarse, medium, fine, ultra fine) and the hardware (GTX550Ti, Tesla K20, Archer with 64 threads): (a) single precision, BGKW; (b) single precision, MRT; (c) double precision, BGKW; (d) double precision, MRT
Interestingly, the GTX550Ti device showed better parallel performance with the MRT model
than with the BGKW model (Figs. 7(a) and 7(b)). While this card gave a higher speedup with double
precision in the case of the BGKW model, using double precision led to a drastic performance drop
with the MRT model (compare Fig. 7(a) with 7(c) and Fig. 7(b) with 7(d)).
Based on the charts, the K20 card performed better than the GTX550Ti in almost every case. In
fact, the K20 card was slightly slower than the GTX550Ti only when the grid size was small. In our
case, the medium grid proved to be big enough to utilise the greater potential of the K20 GPU. As
the grid size increased, we measured an increasing speedup in the case of the K20 device, while
the GTX550Ti card reached its limits at the fine mesh. These results mirror the GPUs’ evolution,
and correlate well with the hardware parameters (e.g. CUDA cores) given in Table 3.
Ideally, in the case of the CPU parallelisation, when the number of threads is kept constant and
the grid size changes, a nearly constant speedup can be expected. After looking at Fig. 7, it becomes
clear how far this application is from the ideal situation: increasing the number of lattices up
to a certain point (fine mesh) resulted in an increasing speedup. This happened because, as the
problem size increased, the time spent on the halo swap decreased relative to the time spent on
computing the different operations. The speedup on Archer with 64 threads was close to
the ideal when single precision arithmetic was used on the fine mesh (Figs. 7(a) and 7(b)). However,
using double precision arithmetic not only means an increased computational load for each thread,
it also means increased communication between the threads. This is probably why the parallel
performance of the double precision execution on Archer was lower than that of the single precision
simulations on the fine mesh (Figs. 7(c) and 7(d)).

Fig. 8. Speedup of the different operations on the fine grid as a function of the hardware. (Coll–Collision;
Str–Streaming; BT–Boundary treatment; UM–Update macroscopic)
In order to gain a better understanding of the results, a deeper analysis of the code is required.
The speedups of the main parts of the code are shown in Fig. 8. It is important to recognise that,
theoretically, only the speedup of the collision step should change when we consider different
collision models. Indeed, the other operations show only a small deviation (compare Figs. 8(a), 8(c)
with 8(b), 8(d)). At first glance, we can see that the boundary treatment, which was identified as the
bottleneck of the performance (Fig. 6), was significantly improved in the final step of the CUDA code.
This high speedup was measured because the global search of the boundaries was replaced (see
Section 3.2). Furthermore, it is also visible that the other parts of the code had a relatively uniform
speedup in the case of the CUDA implementation: for the K20 card around 50 with single precision
and 40 with double precision. When compared with the K20, the GTX550Ti device showed a better
speedup of the boundary treatment but a worse speedup of the other operations. The unexpected
behaviour of the GTX550Ti card compared to the K20 may be caused by its architecture, which was
designed for gaming, or by the different CUDA release (see Table 3).
The speedup of the operations measured on Archer shows less deviation. However, it was found
that the measured speedup exceeded the theoretical limit (64) several times, especially for the
collision and the update macroscopic parts, while at other times the parallel code underperformed. This
behaviour is logical if we take into account that the collision and the update macroscopic
operations do not require any communication between the processes. Furthermore, the data of the
partitioned mesh fits into the cache of the nodes, while the same data exceeded the cache size of a
single node in the case of the serial execution.
Although the straightforward shared memory approach of UPC proved to be inefficient, the
classical MPI-like local parallelisation technique gave acceptable results. Thanks to its simple syntax,
it needed less programming effort compared to CUDA. The corresponding numbers of lines for the
different codes are listed in Table 5. We can see that the most efficient CUDA implementation took
≈40% more lines, while the longest UPC implementation needed only ≈15% more lines compared
to the serial code. It is arguable whether counting the number of lines is representative of the
programming effort, but in this case it correlated well with the required work. The efficient parallelisation
with CUDA required roughly twice as many working hours as the two UPC implementations.
Without a doubt, we can conclude that the highest amount of conceptual effort was required by
the CUDA approach, followed by the local UPC code and the shared UPC code.
It is important to see what kind of compromises application oriented developers need to make.
The situation can be described with the help of the triangle shown in Fig. 9. The edges of the triangle
contain the good properties of a high performance programming approach: low conceptual effort,
high performance and low hardware costs. For the lattice Boltzmann method, the first possible
scenario includes low-cost hardware (a Tesla K20 at $2300) and reasonably high performance,
but we have to pay the price in conceptual effort because of the GPU’s programming environment.
Another extreme scenario is when a higher budget is available (Intel E5 processors at $2900 each),
and we can work with a more flexible programming environment and probably end up with
an efficient code in a shorter period of time. This situation makes the local memory based UPC
programming environment a suitable candidate. In addition to these two cases, when the available
budget is limited, one can consider running single-core simulations on cheap hardware. This would
clearly require longer computations because of the low performance. The choice between the three
scenarios is usually determined by the time frame, the available hardware, and the skill set. Although the
shared memory approach of UPC aims to provide another low-effort, high-performance scenario,
our investigations highlighted that the compilers need further development to achieve this goal.
5 CONCLUSIONS
We parallelised an in-house, two-dimensional lattice Boltzmann solver using CPU and GPU paral-
lelisation approaches. We presented the UPC implementation of the lattice Boltzmann method and
compared the parallel efficiency of our CUDA and UPC codes using the two-dimensional physical
problem of the lid-driven cavity. The UPC codes were tested with two different compilers on two
different clusters, while the CUDA codes were run on two different GPUs. A detailed performance
analysis of the different implementations was performed to provide an insight into the parallel
capabilities of UPC and CUDA when it comes to the lattice Boltzmann method.
The parallelisation of the collision proved to be crucial since this is the part of the algorithm
where the majority of the computation happens. Based on our experience, the efficiency of this part
determines the globally experienced efficiency, and a favourable collision implementation typically
also means a favourable implementation of the update macroscopic operation. We would also like to
draw attention to the boundary treatment, since it can easily become the bottleneck of the parallel
code, although its execution time is essentially limited by the memory bandwidth, similarly to the streaming.
The UPC code using the shared memory approach showed surprisingly low performance com-
pared to the serial code. We found that the investigated compilers could not automatically manage
the data transfer between the threads efficiently. The further development of the compilers may
solve this issue and make the UPC approach more user-friendly and attractive for future scientific
programmers. Until then, we can enjoy the simple syntax of UPC for local memory based imple-
mentation, which was proven to be more efficient and suitable for the parallelisation of the lattice
Boltzmann method.
The CUDA development was presented through three different steps, which highlighted that the
data structures used (namely the AoS and SoA approaches) and the data distribution strategies
have a significant effect on the parallel performance. We can confirm that nVidia graphics cards,
especially the ones designed for scientific computing, are highly suitable for the parallelisation of
the lattice Boltzmann method. Based on our measurements, a single GPU might compete with 3-4
supercomputer nodes (around 80 threads) or more, for a significantly lower price. To reach this
high performance, developers need more specific skills and programming effort compared to the
local UPC implementation.
CODE AVAILABILITY
The developed codes are available as open source and can be downloaded from GitHub at
https://github.com/mate-szoke/ParallelLbmCranfield. The codes are available under the MIT license. The
folders also include all mesh and setup files used to perform the documented simulations.
ACKNOWLEDGMENTS
This work used the ARCHER UK National Supercomputing Service (http://www.archer.ac.uk). The
authors would like to thank the Edinburgh Parallel Computing Centre for providing access to the
INDY cluster. We are also grateful for the ASTRAL support given by the Cranfield University IT
team. Further thanks go to Tom-Robin Teschner and Anton Shterenlikht for useful discussions,
Kirsty Jean Grant for the proofreading, and Gennaro Abbruzzese for the original C++ version of
the code and the mesh generator which was used for our simulations.
A CODE SECTIONS
The following code sections cover the streaming:
• Serial code
for (i = 0; i < (*m) * (*n); i++) {                  // sweep through the domain
    if ((Cells + i)->Fluid == 1) {                   // if the lattice is in the fluid domain
        for (k = 0; k < 9; k++) {                    // sweep along the nine discrete directions
            // if streaming is allowed in the current direction
            if (((Cells + i)->StreamLattice[k]) == 1) {
                // the current distr. fct. travels to the corresponding neighbour
                (Cells + i)->F[k] = (Cells + i + c[k])->METAF[k];
            }
        }
    }
}
• Shared memory UPC code
upc_forall (i = 0; i < ((*nx) * (*ny)); i++; &Fluid[i]) {   // sweep through the whole domain
    if (Fluid[i] == 1) {                                    // if the lattice is in the fluid domain
        for (k = 0; k < 9; k++) {                           // sweep along the nine discrete directions
            // if streaming is allowed in the current direction
            if (StreamLattice[i][k] == 1) {
                // the current distr. fct. travels to the corresponding neighbour
                F[i][k] = METAF[i + c[k]][k];
            }
        }
    }
}
B QUALITATIVE VALIDATIONS
Qualitative validation of the computed velocity field at Re = 1000: (a) streamlines (Ghia et al. 1982); (b) streamlines and the velocity contour from the serial MRT simulation on the medium mesh, coloured by $u/u_{lid}$.
REFERENCES
G. Amati, S. Succi, and R. Piva. 1997. Massively parallel lattice-Boltzmann simulation of turbulent channel flow. International
Journal of Modern Physics C 8, 4 (1997), 869–877.
J. A. Anderson, C. D. Lorenz, and A. Travesset. 2008. General purpose molecular dynamics simulations fully implemented
on graphics processing units. Journal of Computational Physics 227, 10 (2008), 5342–5359.
P. L. Bhatnagar, E. P. Gross, and M. Krook. 1954. A model for collision processes in gases I: Small amplitude processes in
charged and neutral one-component systems. Physical Review 94, 3 (1954), 511.
F. Cantonnet, Y. Yao, M. Zahran, and T. El-Ghazawi. 2004. Productivity analysis of the UPC language. In Proceedings of the
18th International Parallel and Distributed Processing Symposium. IEEE, 254.
B.L. Chamberlain, D. Callahan, and H.P. Zima. 2007. Parallel Programmability and the Chapel Language. The International
Journal of High Performance Computing Applications 21, 3 (2007), 291–312.
S. Chauwvin, P. Saha, F. Cantonnet, S. Annareddy, and T. El-Ghazawi. 2007. UPC Manual. The George Washington University,
Washington, DC. Version 1.2.
N. Chentanez and M. Müller. 2011. Real-time Eulerian water simulation using a restricted tall cell grid. ACM Transactions on Graphics 30, 4 (2011), 82.
A. J. Chorin. 1967. A numerical method for solving incompressible viscous flow problems. Journal of Computational Physics
2, 1 (1967), 12–26.
Cray Inc. 2015. Performance Measurement and Analysis Tools (S-2376-63 ed.). Cray.
Cray Inc. 2012. Cray standard C and C++ reference manual. (2012).
D. d’Humières. 1992. Generalized lattice-Boltzmann equations. In Rarefied gas dynamics: Theory and simulations
(ed. B. D. Shizgal & D. P. Weaver) (1992).
D. d’Humières, I. Ginzburg, M. Krafczyk, P. Lallemand, and L. S. Luo. 2002. Multiple–relaxation–time lattice Boltzmann
models in three dimensions. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical
and Engineering Sciences 360, 1792 (2002), 437–451.
Kemal Ebcioglu, Vijay Saraswat, and Vivek Sarkar. 2004. X10: Programming for Hierarchical Parallelism and Non-Uniform
Data Access. In International Workshop on Language Runtimes, OOPSLA 2004.
T. A. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, and A. S. Mohamed. 2006. Benchmarking parallel compilers: A UPC
case study. Future Generation Computer Systems 22, 7 (2006), 764 – 775.
U. Ghia, K. N. Ghia, and C. T. Shin. 1982. High-Re solutions for incompressible flow using the Navier-Stokes equations and
a multigrid method. Journal of Computational Physics 48, 3 (1982), 387–411.
X. He and L.-S. Luo. 1997. Lattice Boltzmann model for the incompressible Navier-Stokes equation. Journal of Statistical
Physics 88, 3-4 (1997), 927–944.
P. Husbands, C. Iancu, and K. Yelick. 2003. A performance analysis of the Berkeley UPC compiler. In Proceedings of the 17th
Annual International Conference on Supercomputing. ACM, 63–73.
Intel. 2015. Automated Relational Knowledgebase (ARK). (2015). http://ark.intel.com/ Accessed 15/02/2015.
A. Johnson. 2005. Unified Parallel C within computational fluid dynamics applications on the Cray X1. In Proceedings of the
Cray User’s Group Conference. Albuquerque. 1–9.
T. I. Józsa, M. Szőke, T.-R. Teschner, L. Könözsy, and I. Moulitsas. 2016. Validation and Verification of a 2D lattice Boltzmann
solver for incompressible fluid flow. ECCOMAS Congress 2016 - Proceedings of the 7th European Congress on Computational
Methods in Applied Sciences and Engineering 1 (2016), 1046–1060.
D. Kandhai, A. Koponen, A.G. Hoekstra, M. Kataja, J. Timonen, and P.M.A. Sloot. 1998. Lattice-Boltzmann hydrodynamics
on parallel systems. Computer Physics Communications 111, 1–3 (1998), 14 – 26.
D. A. Mallón, A. Gómez, J. C. Mouriño, G. L. Taboada, C. Teijeiro, J. Touriño, B. B. Fraguela, R. Doallo, and B. Wibecan. 2009.
UPC performance evaluation on a multicore system. In Proceedings of the 3rd Conference on Partitioned Global Address
Space Programing Models. ACM, 9.
S. A. Manavski and G. Valle. 2008. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman
sequence alignment. BMC Bioinformatics 9, Suppl 2 (2008), S10.
S. Markidis and G. Lapenta. 2010. Development and performance analysis of a UPC Particle-in-Cell code. In Proceedings of
the 4th Conference on Partitioned Global Address Space Programming Model. ACM, 10.
J. C. Maxwell. 1860. Illustrations of the dynamical theory of gases. Philosophical Magazine Series 4 20, 130 (1860), 21–37.
C. McClanahan. 2010. History and evolution of GPU architecture. In A Paper Survey.
Message Passing Interface Forum. 2012. MPI: A Message-Passing Interface Standard. (September 2012).
A. A. Mohamed. 2011. Lattice Boltzmann method: Fundamentals and engineering applications with computer codes. Springer,
London.
J. Nickolls, I. Buck, M. Garland, and K. Skadron. 2008. Scalable parallel programming with CUDA. Queue 6, 2 (2008), 40–53.
Robert W. Numrich and John Reid. 1998. Co-array Fortran for Parallel Programming. SIGPLAN Fortran Forum 17, 2 (1998),
1–31.
PGAS. 2015. Partitioned Global Adress Space Consortium. (2015). http://www.pgas.org/ Accessed 15/02/2015.
B. Ren, C. Li, X. Yan, M. C. Lin, J. Bonet, and S.-M. Hu. 2014. Multiple-fluid SPH simulation using a mixture model. ACM
Transactions on Graphics 33, 5 (2014), 171.
P. R. Rinaldi, E. A. Dari, M. J. Vénere, and A. Clausse. 2012. A lattice-Boltzmann solver for 3D fluid simulation on GPU.
Simulation Modelling Practice and Theory 25 (2012), 163–171.
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-M. W. Hwu. 2008. Optimization principles and
application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN
Symposium on Principles and practice of parallel programming. ACM, 73–82.
J. Sanders and E. Kandrot. 2010. CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley
Professional.
S. S. Stone, J. P. Haldar, S. C. Tsao, W.-M. Hwu, B. P. Sutton, Z.-P. Liang, and others. 2008. Accelerating advanced MRI
reconstructions on GPUs. Journal of Parallel and Distributed Computing 68, 10 (2008), 1307–1318.
S. Succi. 2001. The lattice Boltzmann equation for fluid dynamics and beyond. Oxford.
G. L. Taboada, C. Teijeiro, J. Touriño, B. B. Fraguela, R. Doallo, J. C. Mouriño, and D. A. Mallón. 2009. Performance evaluation
of unified parallel C collective communications. In 11th IEEE International Conference on High Performance Computing
and Communications. IEEE, 69–78.
J. Tölke. 2010. Implementation of a lattice Boltzmann kernel using the Compute Unified Device Architecture developed by
nVIDIA. Computing and Visualization in Science 13, 1 (2010), 29–39.
P. Valero-Lara and J. Jansson. 2015. LBM-HPC-An Open-source tool for fluid simulations. Case study: Unified Parallel C
(UPC-PGAS). In IEEE International Conference on Cluster Computing. IEEE, 318–321.
P. Welander. 1954. On the temperature jump in a rarefied gas. Arkiv Fysik 7 (1954).
W. Xian and A. Takayuki. 2011. Multi-GPU performance of incompressible flow computation by lattice Boltzmann method
on GPU cluster. Parallel Computing 37, 9 (2011), 521–535.
Kathy Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan
Graham, David Gay, Phil Colella, and Alex Aiken. 1998. Titanium: a high-performance Java dialect. Concurrency: Practice
and Experience 10, 11-13 (1998), 825–836.
J. Zhang, B. Behzad, and M. Snir. 2011. Optimizing the Barnes-Hut Algorithm in UPC. In Proceedings of 2011 International
Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 75:1–75:11.
Y. Zhang, Jo. Cohen, and J. D. Owens. 2010. Fast tridiagonal solvers on the GPU. ACM Sigplan Notices 45, 5 (2010), 127–136.
Q. Zou and X. He. 1997. On pressure and velocity boundary conditions for the lattice Boltzmann BGK model. Physics of
Fluids 9, 6 (1997), 1591–1598.