Witherden 2015 - On the Development and Implementation of High-Order Flux Reconstruction Schemes for Computational Fluid Dynamics
by
September 2015
Abstract
High-order numerical methods for unstructured grids combine the superior accuracy
of high-order spectral or finite difference methods with the geometric flexibility of
low-order finite volume or finite element schemes. The Flux Reconstruction (FR)
approach unifies various high-order schemes for unstructured grids within a single
framework. Additionally, the FR approach exhibits a significant degree of element
locality, and is thus able to run efficiently on modern streaming architectures, such
as graphics processing units (GPUs). The aforementioned properties of FR mean it
offers a promising route to performing affordable, and hence industrially relevant, scale-
resolving simulations of hitherto intractable unsteady flows within the vicinity of real-
world engineering geometries. In this thesis a formulation of the FR approach that is
suitable for solving non-linear advection-diffusion type problems on mixed curvilinear
grids is developed. Issues around aliasing are explored in detail and techniques for
mitigation outlined. A methodology for identifying symmetric quadrature rules inside
of a variety of domains is also presented and used to find several rules that appear to
be an improvement over those in the literature. This methodology is also used to obtain
improved sets of solution points inside of triangular elements. PyFR, an open-source
Python-based framework for solving the compressible Navier–Stokes equations using
the FR approach, is also developed. It is designed to target a range of hardware platforms
via use of an in-built domain specific language based on the Mako templating engine.
PyFR is able to operate on mixed grids in both two and three dimensions and can
target NVIDIA GPUs, AMD GPUs, and Intel CPUs. Results are presented for various
benchmark flow problems, single-node performance is discussed, heterogeneous multi-
node capabilities are analysed, and scalability is demonstrated on up to 2 000 NVIDIA
K20X GPUs for a sustained performance of 1.3 PFLOP/s.
Acknowledgements
I would like to begin by thanking my supervisor Dr Peter Vincent for giving me the
opportunity to undertake a PhD in his research group. Despite his busy schedule Peter
has always been available to discuss technical matters and to support me wherever
possible. He also gave me the freedom to pursue my own research interests and for
this I am extremely grateful. I would also like to thank my co-supervisor Prof. Spencer
Sherwin and Prof. Paul Kelly from the Department of Computing. I also have greatly
enjoyed the company of my colleagues in E256: Jovan, Charles, Jeremy, Abeed, Carla,
Leon, Edward, Paola, Ilan, Giorgio, Jingxuan, and Xingsi. They have been a wonderful
distraction from work and I will always remember the good times we had together.
Finally, I would like to thank my parents for their encouragement and support; this
thesis is dedicated to them.
Declaration of Originality
The work hereby presented is based on research carried out by the author at the
Department of Aeronautics of Imperial College London, and it is all the author’s own
work except where otherwise acknowledged. No part of the present work has been
submitted elsewhere for another degree or qualification.
Copyright Declaration
The copyright of this thesis rests with the author and is made available under a Creative
Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to
copy, distribute or transmit the thesis on the condition that they attribute it, that they do
not use it for commercial purposes and that they do not alter, transform or build upon
it. For any reuse or redistribution, researchers must make clear to others the licence
terms of this work.
Contents
List of Figures
List of Tables
List of Algorithms
List of Publications
Nomenclature
1 Introduction
2 Flux Reconstruction
2.1 Formulation for Mixed Curvilinear Grids
2.2 Time Stepping
2.3 Correction Functions
2.4 Aliasing
2.5 Matrix Representation
2.6 Governing Systems
3 Quadrature Rules
3.1 Basis Polynomials
3.2 Symmetry Orbits
3.3 Reference Domains
3.4 Methodology
3.5 Implementation
3.6 Rules
5 Implementation
5.1 Definition of Operator Matrices
5.2 Specification of State Matrices
5.3 Matrix Multiplication Kernels
5.4 Pointwise Kernels
5.5 Distributed Memory Parallelism
6 Validation
6.1 Euler Vortex
6.2 Couette Flow
6.3 Cylinder Flow at Re = 3 900
6.4 Single-Node Performance
6.5 Multi-Node Heterogeneous Performance
6.6 Scalability
7 Conclusion
Bibliography
Colophon
List of Figures
1.1 Trends in the peak floating point performance and memory bandwidth
of Intel processors from 1994–2014. Data courtesy of Jan Treibig.
2.1 Solution points and flux points for a triangle and quadrangle in physical
space. For the top edge of the quadrangle normal vectors have been
plotted. Observe how the flux points at the interface between the two
elements are co-located.
2.2 Three four-point collocation projections of $f(x) \in \mathcal{P}^6$.
2.3 Packing methodologies for $N_V = 2$ and $|\Omega_e| = 9$.
6.1 $L^2$ energy error and orders of accuracy for the Couette flow problem on
four mixed meshes. The mesh spacing was approximated as $h \sim N_E^{-1/2}$
where $N_E$ is the total number of elements in the mesh.
6.2 $L^2$ energy errors and orders of accuracy for the Couette flow problem
on three extruded hexahedral meshes. On account of the extrusion
$h \sim N_E^{-1/2}$ where $N_E$ is the total number of elements in the mesh.
6.3 $L^2$ energy errors and orders of accuracy for the Couette flow problem
on three unstructured hexahedral meshes. Mesh spacing was taken as
$h \sim N_E^{-1/3}$ where $N_E$ is the total number of elements in the mesh.
6.4 Approximate memory requirements of PyFR for the two cylinder meshes.
6.5 Comparison of quantitative values with experimental and DNS results.
6.6 Baseline attributes of the three hardware platforms. For the NVIDIA
Tesla K40c GPU Boost was left disabled and ECC was enabled. The
Intel Xeon E5-2697 v2 was paired with four DDR3-1600 DIMMs with
Turbo Boost enabled.
6.7 Time to evaluate $\nabla \cdot \mathbf{f}$ normalised by the total number of DOFs.
6.8 Partition weights for the multi-node heterogeneous simulation.
6.9 Weak scalability of PyFR at $\wp = 4$.
6.10 Strong scalability of PyFR at $\wp = 4$.
List of Algorithms
List of Publications
Parts of the work presented in this thesis have been disseminated through a number of
written publications and oral communications; these are listed below, as of September
2015.
Journal Papers
1. FD Witherden and PE Vincent. An Analysis of Solution Point Coordinates for
Flux Reconstruction Schemes on Triangular Elements. Journal of Scientific
Computing, 61(2):398–423, 2014.
2. FD Witherden, AM Farrington, and PE Vincent. PyFR: An Open Source Framework
for Solving Advection-Diffusion Type Problems on Streaming Architectures
Using the Flux Reconstruction Approach. Computer Physics Communications,
185(11):3028–3040, 2014.
3. FD Witherden and PE Vincent. On the Identification of Symmetric Quadrature
Rules for Finite Element Methods. Computers & Mathematics with Applications,
69(10):1232–1241, 2015.
4. FD Witherden, BC Vermeire, and PE Vincent. Heterogeneous computing on
mixed unstructured grids with PyFR. Computers & Fluids, 120:173–186, 2015.
5. PE Vincent, AM Farrington, FD Witherden, and A Jameson. An extended range
of stable-symmetric-conservative flux reconstruction correction functions.
Computer Methods in Applied Mechanics and Engineering, 296:248–272, 2015.
Conference Papers
1. G Mengaldo, D De Grazia, FD Witherden, AM Farrington, PE Vincent, SJ
Sherwin, and J Peiro. A Guide to the Implementation of Boundary Conditions in
Book Chapters
1. J Enkovaara, M Klemm, and FD Witherden. High Performance Python Offloading.
High Performance Parallelism Pearls Volume 2 pp. 246–269, edited by J
Jeffers and J Reinders. Morgan Kaufmann, 2015.
Oral Presentations
1. FD Witherden, AM Farrington, and PE Vincent. PyFR: An Open Source Python
Framework for High-Order CFD on Many-Core Platforms. 4th International
Congress on Computational Engineering and Sciences, 19–24 May 2013, Las
Vegas, Nevada, USA.
2. FD Witherden and PE Vincent. PyFR: Technical Challenges of Bringing Next
Generation Computational Fluid Dynamics to GPU Platforms. NVIDIA GPU
Technology Conference, 24–27 March 2014, San Jose, California, USA.
3. FD Witherden, BC Vermeire, and PE Vincent. Heterogeneous Computing on
Mixed Unstructured Grids with PyFR. UK Many-Core Developer Conference
2014, 15 December 2014, Cambridge, UK.
Poster Presentations
1. FD Witherden, AM Farrington, and PE Vincent. PyFR: An Open Source Python
Framework for High-Order CFD on Many-Core Platforms. 4th International
Congress on Computational Engineering and Sciences, 19–24 May 2013, Las
Vegas, Nevada, USA.
2. FD Witherden, AM Farrington, and PE Vincent. PyFR: An Open Source Python
Framework for Solving Advection-Diffusion Type Problems on Streaming
Architectures. UK Manycore Developer Conference 2013, 16–17 December 2013,
Oxford, UK.
3. FD Witherden, BC Vermeire, and PE Vincent. PyFR: An Open Source Python
Framework for High-Order CFD on Heterogeneous Platforms. SC14, 16–21
November 2014, New Orleans, Louisiana, USA.
Nomenclature
Throughout this thesis a convention is adopted in which dummy indices on the right
hand side of an expression are summed. For example $C_{ijk} = A_{ijl} B_{ilk} \equiv \sum_l A_{ijl} B_{ilk}$
where the limits are implied from the surrounding context. Unless otherwise stated all
indices are assumed to be zero-based.
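As a concrete reading of this convention, the sketch below expands the example contraction into explicit loops. The nested-list storage and the helper name `contract` are illustrative assumptions and are not taken from the thesis.

```python
def contract(A, B):
    """Return C with C[i][j][k] = sum over l of A[i][j][l] * B[i][l][k]."""
    ni, nj, nl = len(A), len(A[0]), len(A[0][0])
    nk = len(B[0][0])
    C = [[[0.0] * nk for _ in range(nj)] for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                # l appears only on the right hand side, so it is summed;
                # i, j and k are free indices shared with the left hand side.
                C[i][j][k] = sum(A[i][j][l] * B[i][l][k] for l in range(nl))
    return C


A = [[[1.0, 2.0]]]      # shape (1, 1, 2), zero-based indices
B = [[[3.0], [4.0]]]    # shape (1, 2, 1)
C = contract(A, B)
```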
Functions.
$\delta_{ij}$  Kronecker delta
$\det \mathbf{A}$  Matrix determinant
$\dim \mathbf{A}$  Matrix dimensions
$\deg p$  Polynomial degree

Indices.
$e$  Element type
$n$  Element number
$\alpha$  Field variable number
$i, j, k$  Summation indices
$\rho, \sigma, \nu$  Summation indices

Domains.
$\Omega$  Solution domain
$\Omega_e$  All elements in $\Omega$ of type $e$
$\hat{\Omega}_e$  A standard element of type $e$
$\partial\hat{\Omega}_e$  Boundary of $\hat{\Omega}_e$
$\Omega_{en}$  Element $n$ of type $e$ in $\Omega$
$|\Omega_e|$  Number of elements of type $e$

Expansions.
$\wp$  Polynomial order
$N_D$  Number of spatial dimensions
$N_V$  Number of field variables
$\hat{P}_i$  Normalised Legendre polynomial $i$
$\hat{P}^{(\alpha,\beta)}_i$  Normalised Jacobi polynomial $i$
$\ell_{e\rho}$  Nodal basis polynomial $\rho$ for element type $e$
$\psi_{e\rho}$  Orthonormal basis polynomial $\rho$ for element type $e$
$x, y, z$  Physical coordinates
$\tilde{x}, \tilde{y}, \tilde{z}$  Transformed coordinates
$\mathcal{M}_{en}$  Transformed to physical mapping

Adornments and suffixes.
$\tilde{\ }$  A quantity in transformed space
$\hat{\ }$  A vector quantity of unit magnitude
$T$  Transpose
$(u)$  A quantity at a solution point
$(q)$  A quantity at a solution quadrature point
Introduction
FD FV ENO FE FR
Complex geometries ∅ F F F F
High-order accurate F ∅ F F F
Explicit semi-discrete form F F F ∅ F
Conservation laws F F F F
Elliptic problems F F
F = yes = yes, with modifications ∅ = no
FR approach. As such, several authors have adopted the moniker ‘corrections procedure
via reconstruction’ (CPR) as a means of referring to both FR and LCP. Furthermore,
Allaneau and Jameson [15] have shown that it is possible to cast some FR schemes as
filtered nodal DG schemes. On account of the large degree of numerical interoperability
between these schemes they are herein all referred to as ‘FR type’ schemes.
A comparison of the various schemes can be seen in Table 1.1. Given that the focus
of this work is on solving the compressible Navier–Stokes equations (a conservation
law) in the vicinity of complex geometries, and that high-order accuracy is desired,
it can be seen from the table that the most promising candidates are
ENO/WENO schemes and the FR approach.
Modern hardware. Over the past two decades improvements in the arithmetic
capabilities of processors have significantly outpaced advances in random access
memory. Algorithms which have traditionally been compute bound—such as dense
matrix-vector products—are now limited instead by the bandwidth to/from memory.
This is epitomised in Figure 1.1. Whereas the processors of two decades ago had a
FLOP/s-per-byte ratio of ∼0.2, more recent chips have ratios upwards of ∼4. This disparity is
not limited to just conventional CPUs. Massively parallel accelerators and co-processors
such as the NVIDIA K20X and Intel Xeon Phi 5110P have ratios of 5.24 and 3.16,
respectively.
[Plot: peak FLOP/s and peak bandwidth, in MFLOP/s and MiB/s, on a logarithmic axis.]
Figure 1.1. Trends in the peak floating point performance and memory bandwidth of Intel
processors from 1994–2014. Data courtesy of Jan Treibig.
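The ratios quoted for the two accelerators can be reproduced by dividing a peak double precision rate by a peak memory bandwidth. The sketch below does this using commonly quoted vendor peak figures, which are assumptions of this example rather than values stated in the text; the helper names are likewise hypothetical.

```python
# (peak double precision GFLOP/s, peak memory bandwidth GB/s).
# Assumed vendor-quoted peaks; they are not taken from the thesis.
ACCELERATORS = {
    "NVIDIA K20X": (1310.0, 250.0),
    "Intel Xeon Phi 5110P": (1011.0, 320.0),
}


def flops_per_byte(peak_gflops, peak_gbytes):
    """Machine balance: FLOPs available per byte moved to or from memory."""
    return peak_gflops / peak_gbytes


def is_bandwidth_bound(kernel_flops_per_byte, machine_balance):
    """A kernel with a lower arithmetic intensity than the machine balance
    is limited by memory bandwidth rather than by arithmetic throughput."""
    return kernel_flops_per_byte < machine_balance


balance = {name: flops_per_byte(*spec) for name, spec in ACCELERATORS.items()}
```

Under these assumptions a kernel performing, say, one FLOP for every four bytes moved would be classed as bandwidth bound on both devices.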
A concomitant of this disparity is that modern hardware architectures are highly
dependent on a combination of high speed caches and/or shared memory to maintain
throughput. However, for an algorithm to utilise these efficiently its memory access
pattern must exhibit a degree of either spatial or temporal locality. To a first-order
approximation the spatial locality of a method is inversely proportional to the amount
of memory indirection. On an unstructured grid indirection arises whenever there is
coupling between elements. This is potentially a problem for discretisations whose
stencil is not compact. Coupling also arises in the context of implicit time stepping
schemes. Implementations are therefore very often bound by memory bandwidth. A
secondary trend is that the manner in which FLOP/s are realised has also changed.
In the early 1990s commodity CPUs were predominantly scalar with a single core of
execution. However, in 2015 processors with fourteen or more cores are not uncommon.
Moreover, the cores on modern processors almost always contain vector processing
units. Vector lengths of up to 512 bits, which permit up to eight double precision values to
be operated on at once, are rapidly becoming commonplace. It is therefore imperative
that compute-bound algorithms are amenable to both multithreading and vectorisation.
Motivation. Heretofore the majority of the scholarly literature has been concerned
with the development of FR schemes that are linearly stable for advection and advection-
diffusion problems in a variety of domains. There has been comparatively less work on
the non-linear stability of FR schemes and their efficient implementation.
The objective of this work is therefore to help realise the promise of high-order
methods for unstructured grids within a real-world setting. A shortcoming of FR, as
oft presented in the literature, is that it is generally assumed that the elements are
all of the same type, are straight sided, and that the flux is linear. Furthermore, the
individual steps of the approach are given in an order which emphasises mathematical
clarity over computational efficiency. This can result in implementations which are
needlessly restricted in their functionality and perform sub-optimally. Moreover, the
majority of treatments forego any discussion of aliasing driven instabilities. However,
such instabilities have been found to be a major stumbling block that prevents FR
from being used effectively to model unsteady flow phenomena. Additionally, many of
the FR codes that have been presented in the literature lack sufficient verification and
validation, especially in three dimensions. It is not uncommon for the extension of a
piece of work into the third dimension or the efficient implementation thereof to be left
as an exercise for the reader.
All of these issues severely inhibit the adoption of these schemes by industry. The
resolution of these issues is hence the primary motivation for this thesis.
Flux Reconstruction
where q is an auxiliary variable. Here, as with ∇u, q has been taken in its unsubscripted
form to refer to the gradients of all of the field variables.
Take E to be the set of available element types in ND dimensions. Examples include
quadrilaterals and triangles in two dimensions and hexahedra, prisms, pyramids and
tetrahedra in three dimensions. Consider using these various element types to construct
a conformal mesh of the domain such that
\[
  \Omega = \bigcup_{e \in E} \Omega_e, \qquad
  \Omega_e = \bigcup_{n=0}^{|\Omega_e| - 1} \Omega_{en}, \qquad
  \bigcap_{e \in E} \bigcap_{n=0}^{|\Omega_e| - 1} \Omega_{en} = \emptyset,
\]
where $\Omega_e$ refers to all of the elements of type $e$ inside of the domain, $|\Omega_e|$ is the number
of elements of this type in the decomposition, and $n$ is an index running over these
elements with $0 \le n < |\Omega_e|$. Inside each element $\Omega_{en}$ it is required that
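\[
  \frac{\partial u_{en\alpha}}{\partial t} + \nabla \cdot \mathbf{f}_{en\alpha} = 0, \tag{2.3a}
\]
\[
  \mathbf{q}_{en\alpha} - \nabla u_{en\alpha} = 0. \tag{2.3b}
\]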
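\[
  \mathbf{J}_{en} = J_{enij} = \frac{\partial \mathcal{M}_{eni}}{\partial \tilde{x}_j}, \qquad
  J_{en} = \det \mathbf{J}_{en},
\]
\[
  \mathbf{J}^{-1}_{en} = J^{-1}_{enij} = \frac{\partial \mathcal{M}^{-1}_{eni}}{\partial x_j}, \qquad
  J^{-1}_{en} = \det \mathbf{J}^{-1}_{en} = \frac{1}{J_{en}}.
\]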
\[
  \frac{\partial u_{en\alpha}}{\partial t} + J^{-1}_{en} \tilde{\nabla} \cdot \tilde{\mathbf{f}}_{en\alpha} = 0, \tag{2.5a}
\]
\[
  \tilde{\mathbf{q}}_{en\alpha} - \tilde{\nabla} u_{en\alpha} = 0, \tag{2.5b}
\]
Figure 2.1. Solution points and flux points for a triangle and quadrangle in physical
space. For the top edge of the quadrangle normal vectors have been plotted.
Observe how the flux points at the interface between the two elements are co-
located.
as required. Observe here the decision to multiply the first equation through by a factor
of $J^{-1}_{en}$. Doing so has the effect of taking $\tilde{u}_{en} \mapsto u_{en}$ which allows us to work in terms
of the physical solution. This is more convenient from a computational standpoint.
The next step in the procedure is to associate a set of solution points with each
standard element. For each type $e \in E$ take $\{\tilde{\mathbf{x}}^{(u)}_{e\rho}\}$ to be the chosen set of points where
$0 \le \rho < N^{(u)}_e(\wp)$. These points can then be used to construct a nodal basis set $\{\ell^{(u)}_{e\rho}(\tilde{\mathbf{x}})\}$
with the property that $\ell^{(u)}_{e\rho}(\tilde{\mathbf{x}}^{(u)}_{e\sigma}) = \delta_{\rho\sigma}$. To obtain such a set first take $\{\psi_{e\sigma}(\tilde{\mathbf{x}})\}$ to be an
orthonormal basis which spans a selected order $\wp$ polynomial space defined inside $\hat{\Omega}_e$.
Next, compute the elements of the generalised Vandermonde matrix as $V_{e\rho\sigma} = \psi_{e\rho}(\tilde{\mathbf{x}}^{(u)}_{e\sigma})$.
With these a nodal basis set can be constructed as $\ell^{(u)}_{e\rho}(\tilde{\mathbf{x}}) = V^{-1}_{e\rho\sigma} \psi_{e\sigma}(\tilde{\mathbf{x}})$. Along with
the solution points inside of each element a set of flux points on $\partial\hat{\Omega}_e$ are also defined.
These are denoted for a particular element type as $\{\tilde{\mathbf{x}}^{(f)}_{e\rho}\}$ where $0 \le \rho < N^{(f)}_e(\wp)$. Let the
set of corresponding normalised outward-pointing normal vectors be given by $\{\hat{\tilde{\mathbf{n}}}^{(f)}_{e\rho}\}$.
It is critical that each flux point pair along an interface share the same coordinates in
physical space. For a pair of flux points $e\rho n$ and $e'\rho'n'$ at a non-periodic interface this
can be formalised as $\mathcal{M}_{en}(\tilde{\mathbf{x}}^{(f)}_{e\rho}) = \mathcal{M}_{e'n'}(\tilde{\mathbf{x}}^{(f)}_{e'\rho'})$. A pictorial illustration of this can be
seen in Figure 2.1.
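The construction of the nodal basis from the generalised Vandermonde matrix can be illustrated in one dimension. The sketch below is a minimal example under several assumptions: the standard element is the line segment [-1, 1], the orthonormal basis is the set of normalised Legendre polynomials, the solution points are the four Gauss-Legendre points, and all function names are hypothetical.

```python
import math


def legendre_orthonormal(n, x):
    """Return [P^_0(x), ..., P^_n(x)], orthonormal on [-1, 1]."""
    p = [1.0, x]
    for k in range(1, n):
        # Three-term recurrence for the classical Legendre polynomials
        p.append(((2*k + 1)*x*p[k] - k*p[k - 1]) / (k + 1))
    return [math.sqrt((2*i + 1) / 2)*p[i] for i in range(n + 1)]


def invert(M):
    """Invert a small dense matrix by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                fac = aug[r][c]
                aug[r] = [vr - fac*vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


order = 3
# Assumed solution points: the four-point Gauss-Legendre abscissae
pts = [-0.8611363115940526, -0.3399810435848563,
       0.3399810435848563, 0.8611363115940526]

# Generalised Vandermonde matrix, V[rho][sigma] = psi_rho(x_sigma)
psi_at_pts = [legendre_orthonormal(order, x) for x in pts]
V = [[psi_at_pts[s][r] for s in range(order + 1)] for r in range(order + 1)]
Vinv = invert(V)


def nodal_basis(x):
    """Return [l_0(x), ..., l_order(x)] with l_rho(x) = Vinv[rho][sigma] psi_sigma(x)."""
    psi = legendre_orthonormal(order, x)
    return [sum(Vinv[r][s]*psi[s] for s in range(order + 1))
            for r in range(order + 1)]
```

Evaluating `nodal_basis` at each entry of `pts` should recover the Kronecker delta property up to rounding error.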
The first step in the FR approach is to go from the discontinuous solution at the
where $u^{(u)}_{e\rho n\alpha}$ is an approximate solution of field variable $\alpha$ inside of the $n$th element of
type $e$ at solution point $\tilde{\mathbf{x}}^{(u)}_{e\rho}$. This can then be used to compute a common solution
where $\mathfrak{C}_\alpha(u_L, u_R)$ is a scalar function that given two values at a point returns a common
value. Here $\widetilde{e\rho n}$ has been taken to be the element type, flux point number and
element number of the adjoining point at the interface. Since grids in FR are permitted
to be unstructured the relationship between $e\rho n$ and $\widetilde{e\rho n}$ is indirect. This
necessitates the use of a lookup table. As the common solution function is permitted
to perform upwinding or downwinding of the solution it is in general the case that
$\mathfrak{C}_\alpha(u^{(f)}_{e\rho n\alpha}, u^{(f)}_{\widetilde{e\rho n}\alpha}) \ne \mathfrak{C}_\alpha(u^{(f)}_{\widetilde{e\rho n}\alpha}, u^{(f)}_{e\rho n\alpha})$. Hence, it is important that each flux point pair only
be visited once with the same common solution value assigned to both $\mathfrak{C}_\alpha u^{(f)}_{e\rho n\alpha}$ and
$\mathfrak{C}_\alpha u^{(f)}_{\widetilde{e\rho n}\alpha}$.
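The requirement that each flux point pair be visited exactly once can be sketched as follows. The flattened storage, the lookup table of index pairs, and the choice of simple upwinding as the common solution function are all assumptions made for illustration.

```python
def upwind(u_left, u_right, a=1.0):
    """Example common solution function: take the value from the upwind side
    for an assumed wave speed a. It is not symmetric in its arguments."""
    return u_left if a >= 0 else u_right


# Solution values at all flux points, flattened into one list (made-up data)
u_fpts = [1.0, 2.0, 3.0, 4.0]

# Lookup table: every interface appears once, as (left index, right index)
pairs = [(1, 2), (3, 0)]

common = [None] * len(u_fpts)
for left, right in pairs:
    c = upwind(u_fpts[left], u_fpts[right])
    # The same common value is assigned to both sides of the interface
    common[left] = common[right] = c
```

Storing each interface once, with a fixed orientation, avoids the two sides computing different values when the common solution function is not symmetric.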
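Further, associated with each flux point is a vector correction function $\mathbf{g}^{(f)}_{e\rho}(\tilde{\mathbf{x}})$
constrained such that
\[
  \hat{\tilde{\mathbf{n}}}^{(f)}_{e\sigma} \cdot \mathbf{g}^{(f)}_{e\rho}(\tilde{\mathbf{x}}^{(f)}_{e\sigma}) = \delta_{\rho\sigma}, \tag{2.8}
\]
with a divergence that sits in the same polynomial space as the solution. Using these
fields the solution to (2.5b) can be expressed as
\[
  \tilde{\mathbf{q}}^{(u)}_{e\sigma n\alpha} = \Big[ \hat{\tilde{\mathbf{n}}}^{(f)}_{e\rho} \tilde{\nabla} \cdot \mathbf{g}^{(f)}_{e\rho}(\tilde{\mathbf{x}}) \big\{ \mathfrak{C}_\alpha u^{(f)}_{e\rho n\alpha} - u^{(f)}_{e\rho n\alpha} \big\} + u^{(u)}_{e\nu n\alpha} \tilde{\nabla} \ell^{(u)}_{e\nu}(\tilde{\mathbf{x}}) \Big]_{\tilde{\mathbf{x}} = \tilde{\mathbf{x}}^{(u)}_{e\sigma}}, \tag{2.9}
\]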
where the term inside the curly brackets is the ‘jump’ at the interface and the final
term is an order ℘ − 1 approximation of the gradient obtained by differentiating the
discontinuous solution polynomial. Following the approaches of Kopriva [16] and Sun
et al. [9] the physical gradients can now be computed as
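\[
  \mathbf{q}^{(u)}_{e\sigma n\alpha} = \mathbf{J}^{-T\,(u)}_{e\sigma n} \tilde{\mathbf{q}}^{(u)}_{e\sigma n\alpha}, \tag{2.10}
\]
\[
  \mathbf{q}^{(f)}_{e\sigma n\alpha} = \ell^{(u)}_{e\rho}(\tilde{\mathbf{x}}^{(f)}_{e\sigma}) \mathbf{q}^{(u)}_{e\rho n\alpha}, \tag{2.11}
\]
where $\mathbf{J}^{-T\,(u)}_{e\sigma n} = \mathbf{J}^{-T}_{en}(\tilde{\mathbf{x}}^{(u)}_{e\sigma})$. Having solved the auxiliary equation it is now possible to
evaluate the transformed flux
\[
  \tilde{\mathbf{f}}^{(u)}_{e\rho n\alpha} = J^{(u)}_{e\rho n} \mathbf{J}^{-1\,(u)}_{e\rho n} \mathbf{f}_\alpha(u^{(u)}_{e\rho n}, \mathbf{q}^{(u)}_{e\rho n}), \tag{2.12}
\]
where $J^{(u)}_{e\rho n} = \det \mathbf{J}_{en}(\tilde{\mathbf{x}}^{(u)}_{e\rho})$. This can be seen to be a collocation projection of the flux.
With this the normal transformed flux at each of the flux points can be computed as
\[
  \tilde{f}^{(f_\perp)}_{e\sigma n\alpha} = \ell^{(u)}_{e\rho}(\tilde{\mathbf{x}}^{(f)}_{e\sigma}) \hat{\tilde{\mathbf{n}}}^{(f)}_{e\sigma} \cdot \tilde{\mathbf{f}}^{(u)}_{e\rho n\alpha}. \tag{2.13}
\]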
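The pointwise transformations in (2.10) and (2.12) amount to small matrix-vector products at each solution point. The sketch below shows them for a single point in two dimensions; the Jacobian, flux and gradient values are made up for illustration and the helper names are hypothetical.

```python
def det2(J):
    return J[0][0]*J[1][1] - J[0][1]*J[1][0]


def inv2(J):
    d = det2(J)
    return [[J[1][1]/d, -J[0][1]/d],
            [-J[1][0]/d, J[0][0]/d]]


def transpose2(M):
    return [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]


def matvec2(M, v):
    return [M[0][0]*v[0] + M[0][1]*v[1], M[1][0]*v[0] + M[1][1]*v[1]]


def physical_gradient(J, q_tilde):
    """q = J^{-T} q~, the transformation of (2.10)."""
    return matvec2(transpose2(inv2(J)), q_tilde)


def transformed_flux(J, f):
    """f~ = det(J) J^{-1} f, the transformation of (2.12)."""
    d = det2(J)
    return [d*c for c in matvec2(inv2(J), f)]


# Assumed affine mapping x = 2 x~, y = 3 y~, so the Jacobian is constant
J = [[2.0, 0.0], [0.0, 3.0]]
f_tilde = transformed_flux(J, [1.0, 1.0])
q = physical_gradient(J, [2.0, 3.0])
```

On a curved element the Jacobian varies from point to point, so these products would be evaluated with a different matrix at every solution point.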
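which can then be used to obtain a semi-discretised form of the governing system
\[
  \frac{\partial u^{(u)}_{e\rho n\alpha}}{\partial t} = -J^{-1\,(u)}_{e\rho n} (\tilde{\nabla} \cdot \tilde{\mathbf{f}})^{(u)}_{e\rho n\alpha}, \tag{2.19}
\]
where $J^{-1\,(u)}_{e\rho n} = \det \mathbf{J}^{-1}_{en}(\tilde{\mathbf{x}}^{(u)}_{e\rho}) = 1/J^{(u)}_{e\rho n}$.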
problems. When solving these ODEs the time step is more often than not restricted by
stability requirements as opposed to those of accuracy—the system exhibits a degree of
stiffness. Moreover, as there is an ODE associated with each spatial degree of freedom
the system can also become extremely large. A consequence of this is that retaining
the various intermediate ki stages in memory can become prohibitively expensive.
For a generic explicit RK scheme it is necessary to allocate storage, termed registers
in the literature, for y(t) and each of the s intermediate stages for a total register count of
s + 1. By exploiting the structure of the scheme it is often possible to reduce the register
count somewhat. For example, assuming it is possible to evaluate g(t, y) in-place, the
RK4 scheme of Table 2.2 can be implemented with just three registers of storage as
opposed to five. There exists a significant body of work related to the derivation of
low-storage RK schemes [19, 20]. With care it is possible to obtain schemes that require
just two registers of storage. Of the schemes the fourth order five stage RK45[2R+]
method of Kennedy et al. [20] is notable for its particularly large stability region.
Step size control. In comparison to linear multistep methods each RK time step
depends only on the solution at t. It is therefore trivial to change the ∆t between steps.
Hence, given an approximation of the truncation error ξ(t + ∆t) it is possible to adapt
the step size to both ensure stability and bound the local temporal error. Such control
can be used to automatically find the maximum stable step size—eliminating the need
for manual bisection.
The most common means of obtaining an approximation to the truncation error is
through an alternative set of $b_i$ coefficients $b^\star_i$ that give a $q - 1$ order approximation of the
solution. Using this the truncation error can be approximated as $\xi(t + \Delta t) \approx \Delta t (b_i - b^\star_i) k_i$.
To be meaningful it is first necessary to normalise the error with respect to a predefined
tolerance. Following Hairer et al. [17] an error can be defined as
\[
  \sigma_{\text{curr}} = \frac{\xi(t + \Delta t)}{\tau_a + \tau_r \max(|y(t)|, |y(t + \Delta t)|)}, \tag{2.22}
\]
where τa is an absolute error tolerance, and τr is a relative error tolerance. When
marching a system of equations this expression should be evaluated pointwise for each
equation in the system and the root mean square taken. A step should be rejected and
retaken with a smaller ∆t if σcurr > 1. Otherwise the step should be accepted. The
objective is to control ∆t such that the error incurred during the next step, σnext , is
approximately unity. Assuming that the solution is sufficiently smooth it is known
that modifying $\Delta t$ by a factor of $f$ will result in $\sigma_{\text{next}} \approx f^q \sigma_{\text{curr}}$. Hence, to keep the
error around unity the adjustment factor should be chosen to be $f \approx \sigma_{\text{curr}}^{-1/q}$. This is
known as an “I” type controller. For reasons of computational efficiency it is desirable
to minimise the number of rejected time steps. The incidence of such steps can be
reduced by firstly introducing a safety factor, fsafe ≈ 0.8, and secondly by restricting
the overall adjustment such that fmin ≤ fsafe f ≤ fmax where fmin ≈ 0.3 and fmax ≈ 2.5.
One problem with I type controllers is that they are prone to spurious oscillations.
This issue can be avoided through the use of a “PI” type controller which uses the
values of σ from both the current and the previous time steps in order to update ∆t. In
a PI controller the adjustment factor is calculated as [21, 22]
\[
  f \approx \sigma_{\text{curr}}^{-\alpha/q} \sigma_{\text{prev}}^{\beta/q}, \tag{2.23}
\]
where $\alpha \approx 0.7$ and $\beta \approx 0.4$. Complete pseudocode for a PI controller can be seen in
Algorithm 2.1.
The utility of step size control in combination with its ease of implementation has
made it ubiquitous within the ODE community. As a result the majority of RK schemes
tabulated in the literature come with embedded pairs, including the low storage schemes.
To evaluate the error it is necessary to have access to both the previous solution $y(t)$
and the error terms $\xi(t + \Delta t)$. For low storage schemes, which usually operate in-place
overwriting $y(t)$ with $g(t, y)$, this requires that two extra registers be allocated.
Algorithm 2.1. PI step-size control algorithm. Descriptions of f_max, f_min, f_safe, α, and β can
be found in the text.
procedure IntegrateWithPIControl(∆t_init, ∆t_min, t_end)
    t ← 0, f ← 1, σ_prev ← 0, ∆t ← ∆t_init
    while t < t_end do
        ∆t ← f ∆t                                  ▷ Adjust step size
        ∆t ← max(min(t_end − t, ∆t), ∆t_min)
        σ_curr ← Step(t, ∆t, . . .)                ▷ Take step and compute error
        f ← f_safe σ_curr^(−α/q) σ_prev^(β/q)      ▷ Compute new step adjustment factor
        f ← min(f_max, max(f_min, f))              ▷ Ensure f_min ≤ f ≤ f_max
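A runnable sketch of PI step size control is given below. It is not the implementation used in PyFR: the Heun-Euler embedded pair (q = 2) is an assumption chosen for brevity, the default tolerances and controller constants are assumptions of this sketch, `sigma_prev` is initialised to one so that the first adjustment reduces to an I type update, and a step taken at the minimum step size is always accepted to guarantee termination.

```python
import math


def step_heun_euler(g, t, y, dt, tau_a, tau_r):
    """Take one Heun-Euler step. Returns the candidate solution and the
    normalised error sigma, evaluated pointwise and combined as an RMS."""
    k1 = list(g(t, y))
    k2 = g(t + dt, [yi + dt*ki for yi, ki in zip(y, k1)])
    y_new = [yi + 0.5*dt*(a + b) for yi, a, b in zip(y, k1, k2)]
    # xi = dt*(b_i - b*_i)*k_i with b = (1/2, 1/2) and b* = (1, 0)
    xi = [0.5*dt*(b - a) for a, b in zip(k1, k2)]
    scaled = [e / (tau_a + tau_r*max(abs(yo), abs(yn)))
              for e, yo, yn in zip(xi, y, y_new)]
    return y_new, math.sqrt(sum(s*s for s in scaled) / len(scaled))


def integrate_pi(g, y, t_end, dt_init, dt_min=1e-12, tau_a=1e-6, tau_r=1e-6,
                 q=2, alpha=0.7, beta=0.4, f_safe=0.8, f_min=0.3, f_max=2.5):
    t, f, sigma_prev, dt = 0.0, 1.0, 1.0, dt_init
    accepted = rejected = 0
    while t < t_end:
        # Adjust the step size, without stepping past t_end
        dt = max(min(t_end - t, f*dt), dt_min)
        y_new, sigma = step_heun_euler(g, t, y, dt, tau_a, tau_r)
        sigma = max(sigma, 1e-16)  # guard against a zero error estimate
        f = f_safe * sigma**(-alpha/q) * sigma_prev**(beta/q)
        f = min(f_max, max(f_min, f))
        if sigma <= 1.0 or dt <= dt_min:
            t, y, sigma_prev = t + dt, y_new, sigma  # accept the step
            accepted += 1
        else:
            rejected += 1                            # retry with smaller dt
    return y, accepted, rejected


# Example: y' = -y with y(0) = 1, integrated to t = 1
y_end, n_acc, n_rej = integrate_pi(lambda t, y: [-yi for yi in y], [1.0], 1.0, 0.1)
```

Starting from a deliberately large initial step, the controller should reject the first few attempts and then settle on a step size for which the normalised error stays below unity.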
parameter family of correction functions that are provably stable for one dimensional
linear advection problems. These are commonly referred to as the Vincent-Castonguay-
Jameson-Huynh (VCJH) correction functions. The stability of the VCJH functions was
subsequently extended by Castonguay et al. [26] to linear advection-diffusion problems.
The FR approach can be extended to quadrilateral and hexahedral elements through
a tensor product construction [10]. However, beyond the case of recovering DG, it is
an open question if the resulting schemes are linearly stable or not. Further work by
Castonguay et al. [27] and Williams et al. [28] has led to the identification of VCJH-like
schemes inside of triangular elements. These schemes are observed to be distinct
from those identified by Huynh [29] in his extension of FR onto triangular elements.
Using a similar procedure to Castonguay et al., Williams and Jameson [30] were able to
identify a family of VCJH-like schemes inside of tetrahedra. More recently, Vincent
et al. [31] developed a procedure for obtaining an extended range of energy stable
one dimensional schemes. These schemes are found to be a super-set of the existing
one-parameter VCJH schemes.
Here a methodology is presented for obtaining the correction function corresponding
to a nodal DG scheme inside of an arbitrary domain. When considering the correction
function associated with a flux point $\mathbf{g}^{(f)}_{e\rho}(\tilde{\mathbf{x}})$ it is often more convenient to use a
face-local numbering scheme in which $\rho \leftrightarrow (ij)$ where $i$ denotes the face number and $j$
the local index on this face. Let $\{\tilde{\Gamma}_{ei}\}$ refer to the faces of the reference element $\hat{\Omega}_e$. With
these the divergences of the DG correction functions can be expressed as [30, 32]
\begin{align}
  \nabla \cdot \mathbf{g}^{(f)}_{e(ij)}(\tilde{\mathbf{x}})
    &= \psi_{ek}(\tilde{\mathbf{x}}) \int_{\tilde{\Gamma}_{ei}} \hat{\tilde{\mathbf{n}}} \cdot \mathbf{g}^{(f)}_{e(ij)}(\tilde{\mathbf{s}}) \psi_{ek}(\tilde{\mathbf{s}}) \, d\tilde{\mathbf{s}} \tag{2.24} \\
    &= \psi_{ek}(\tilde{\mathbf{x}}) \int_{\tilde{\Gamma}_{ei}} \ell_{eij}(\tilde{\mathbf{s}}) \psi_{ek}(\tilde{\mathbf{s}}) \, d\tilde{\mathbf{s}}, \tag{2.25}
\end{align}
where $\hat{\tilde{\mathbf{n}}}$ is the outward pointing unit normal vector, and $\ell_{eij}(\tilde{\mathbf{s}})$ is the nodal basis
function associated with point $j$ on face $i$ of the reference element $e$. In the second
step the fact that (2.8) fixes $\hat{\tilde{\mathbf{n}}} \cdot \mathbf{g}^{(f)}_{e(ij)}(\tilde{\mathbf{s}})$ at each of the flux points on the face has
been utilised to enable it to be substituted for an equivalent nodal basis function.
Heretofore this formulation has only been employed for simplex elements (triangles
and tetrahedra); however, it is valid for any element type.
Figure 2.2. Three four-point collocation projections of $f(x) \in \mathcal{P}^6$.
[Three panels, each plotting y against x for −1 ≤ x ≤ 1.]
2.4 Aliasing
In the FR approach it is necessary to obtain a suitable approximation of the transformed
flux at each of the solution points. The most direct means of accomplishing this is
to simply evaluate the transformed flux function at each of the solution points as
per (2.12). When f is non-linear or the grid is curved the transformed flux sits in a
different—perhaps even non-polynomial—space to the solution. A consequence of this
is that within (2.12) there is an implicit collocation projection.
and evaluating this expression for the aforementioned three polynomials results in
errors of 0.68, 0.60, and 0.40, respectively. These differences in the $L^2$ error highlight
the importance of the choice of solution points when using the FR approach. Having
quantified the error it is now possible to minimise it. Expand $p(x)$ as
\[
  p(x) = \gamma_i \Upsilon_i(x),
\]
where $\{\gamma_i\}$ are a set of expansion coefficients and $\{\Upsilon_i\}$ are a set of basis functions that
span $\mathcal{P}^3$. Differentiating (2.26) with respect to $\gamma_j$ and equating this to zero it is found
that
\begin{align}
  \partial_{\gamma_j} \sigma^2
    &= \int_{-1}^{1} 2 p(x) \partial_{\gamma_j} p(x) - 2 \partial_{\gamma_j} p(x) f(x) \, dx \notag \\
    &= \int_{-1}^{1} 2 \gamma_i \Upsilon_i(x) \Upsilon_j(x) - 2 \Upsilon_j(x) f(x) \, dx \notag \\
    &= \gamma_i \int_{-1}^{1} \Upsilon_i(x) \Upsilon_j(x) \, dx - \int_{-1}^{1} \Upsilon_j(x) f(x) \, dx \notag \\
    &= 0, \tag{2.27}
\end{align}
where in the second step the constant factor of two has been dropped. This can be seen
to be a linear system of the form $\mathbf{A}\mathbf{x} = \mathbf{b}$ which is, in general, expensive to solve.
However should the basis functions satisfy an orthonormality condition such that
\[
  \int_{-1}^{1} \Upsilon_i(x) \Upsilon_j(x) \, dx = \delta_{ij},
\]
then (2.27) reduces to
\[
  \gamma_j = \int_{-1}^{1} \Upsilon_j(x) f(x) \, dx,
\]
which is significantly easier to evaluate. In one dimension the set of orthonormal
polynomials are the normalised Legendre polynomials which are denoted by $\hat{P}_i(x)$.
The $L^2$ optimal polynomial $p^\star(x) \in \mathcal{P}^3$ of $f(x)$ is therefore given by
\[
  p^\star(x) = \left[ \int_{-1}^{1} \hat{P}_i(y) f(y) \, dy \right] \hat{P}_i(x), \tag{2.28}
\]
which has an $L^2$ error of 0.33. This represents a 17.5% decrease in error compared with
a collocation projection using Gauss-Legendre points and a 51.5% decrease compared
with a collocation projection using equispaced points.
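The projection in (2.28) can be evaluated numerically by replacing each integral with a quadrature sum. The sketch below does so with a five-point Gauss-Legendre rule, which is exact for polynomials up to degree nine. The test function $f(x) = x^6$ is an assumed stand-in for an element of $\mathcal{P}^6$, and the function names are hypothetical.

```python
import math

# Five-point Gauss-Legendre rule on [-1, 1]
GL5_X = [-0.9061798459386640, -0.5384693101056831, 0.0,
         0.5384693101056831, 0.9061798459386640]
GL5_W = [0.2369268850561891, 0.4786286704993665, 0.5688888888888889,
         0.4786286704993665, 0.2369268850561891]


def legendre_orthonormal(n, x):
    """Return [P^_0(x), ..., P^_n(x)], orthonormal on [-1, 1]."""
    p = [1.0, x]
    for k in range(1, n):
        p.append(((2*k + 1)*x*p[k] - k*p[k - 1]) / (k + 1))
    return [math.sqrt((2*i + 1) / 2)*p[i] for i in range(n + 1)]


def l2_project(f, order):
    """gamma_j = integral of P^_j(x) f(x) over [-1, 1], by quadrature."""
    gamma = [0.0]*(order + 1)
    for x, w in zip(GL5_X, GL5_W):
        fx = f(x)
        for j, pj in enumerate(legendre_orthonormal(order, x)):
            gamma[j] += w*pj*fx
    return gamma


def evaluate(gamma, x):
    """Evaluate p*(x) = gamma_i P^_i(x)."""
    basis = legendre_orthonormal(len(gamma) - 1, x)
    return sum(g*p for g, p in zip(gamma, basis))


f = lambda x: x**6          # assumed test function of degree six
gamma = l2_project(f, 3)    # coefficients of the L2 optimal cubic
```

Because the integrand $\hat{P}_j(x) f(x)$ has degree at most nine here, the quadrature introduces no error of its own; a weaker rule would itself become a source of aliasing.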
where ξ(x) contains the modes of f(x) that are not in the space of $p^\star(x)$. From the
numerical experiments performed above it is known that, if a collocation projection is
utilised, the resulting polynomial p(x) may be different from $p^\star(x)$. If the polynomials
are different then this implies that the modal expansion coefficients must also be
different. This suggests that when using anything but the L2 optimal expansion there
is the potential for modes which are not being resolved to impact, or alias, those
which are being resolved.
where α ≈ log , ηc < ηm is a cutoff parameter, s is the strength of the filter, and
ηm = maxσ deg ψeσ . Each entry indicates the amount of damping that should
be applied to the ψeσ mode of the solution. Damping is only applied to modes
whose degree is greater than the cutoff with higher modes receiving progressively
more damping. The rate at which this ramps up is controlled by s. Using the
definition of the Vandermonde matrix the exponentially filtered solution in nodal
space can be expressed as
$$ F_e u_{e\rho n\alpha} = V^{-1}_{e\rho\nu} \Lambda_{e\nu\mu} V_{e\mu\sigma} u_{e\sigma n\alpha}. \tag{2.30} $$
where f (x̃) is the transformed function being projected. These integrals are
usually evaluated using Gaussian quadrature in which
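$$ \gamma_{e\sigma} \approx \omega^{(q)}_{e\rho}\, \psi_{e\sigma}(\tilde{\mathbf{x}}^{(q)}_{e\rho})\, f(\tilde{\mathbf{x}}^{(q)}_{e\rho}), \tag{2.31} $$
where $\{\tilde{\mathbf{x}}^{(q)}_{e\rho}\}$ is a set of $N^{(q)}_e$ abscissa and $\{\omega^{(q)}_{e\rho}\}$ the set of associated weights. In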
order to be effective it is important that the chosen quadrature rule be of sufficient
strength to accurately approximate the integrals. Otherwise, the quadrature itself
and using these the transformed flux at each quadrature point can be computed
as
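$$ \tilde{\mathbf{f}}^{(q)}_{e\sigma n\alpha} = J^{(q)}_{e\sigma n}\, \mathbf{J}^{-1\,(q)}_{e\sigma n}\, \mathbf{f}_\alpha(u^{(q)}_{e\sigma n}, \mathbf{q}^{(q)}_{e\sigma n}), $$
where $J^{(q)}_{e\sigma n} = \det \mathbf{J}_{en}(\tilde{\mathbf{x}}^{(q)}_{e\sigma})$ and $\mathbf{J}^{-1\,(q)}_{e\sigma n} = \mathbf{J}^{-1}_{en}(\tilde{\mathbf{x}}^{(q)}_{e\sigma})$. With this the transformed flux
at the solution points can be obtained by evaluating
$$ \tilde{\mathbf{f}}^{(u)}_{e\rho n\alpha} = \psi_{e\nu}(\tilde{\mathbf{x}}^{(u)}_{e\rho})\, \omega^{(q)}_{e\sigma}\, \psi_{e\nu}(\tilde{\mathbf{x}}^{(q)}_{e\sigma})\, \tilde{\mathbf{f}}^{(q)}_{e\sigma n\alpha}. \tag{2.34} $$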
onto the flux points. Here, however, the projection is performed on a per-face
basis.
where c1,2 are constants, A is a constant operator matrix, and B and C are state matrices.
To accomplish this the following constant operator matrix is introduced
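$$ \big(\mathbf{M}^0_e\big)_{\sigma\rho} = \ell^{(u)}_{e\rho}(\tilde{\mathbf{x}}^{(f)}_{e\sigma}), \qquad \dim \mathbf{M}^0_e = N^{(f)}_e \times N^{(u)}_e, $$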
In specifying the solution matrices there is a degree of freedom regarding how the
field variables of the various elements are packed along a row. The packing of field
variables can be characterised by considering the distance, ∆ j, in columns between two
subsequent field variables for a given element. The case of ∆ j = 1 corresponds to the
array of structures (AoS) packing whereas the choice of ∆ j = |Ωe | leads to the structure
of arrays (SoA) packing. A hybrid approach wherein ∆ j = k with k being constant
results in the AoSoA(k) approach. Illustrations of these three approaches can be seen in
Figure 2.3. An implementation is free to choose between any of these counting patterns
so long as it is consistent. Using these matrices (2.6) can be reformulated as
$$ \mathbf{U}^{(f)}_e = \mathbf{M}^0_e \mathbf{U}^{(u)}_e. \tag{2.36} $$
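The three packing strategies described above differ only in how an (element, field variable) pair is mapped to a column. A minimal sketch of this mapping is given below; the function name `column_index` and its arguments are hypothetical and are not taken from PyFR, and the AoSoA branch assumes the number of elements is a multiple of `k`.

```python
def column_index(e, n, nvars, neles, packing="SoA", k=None):
    """Column holding field variable n of element e in a state matrix row.

    nvars is the number of field variables and neles the number of
    elements |Omega_e|. The stride between consecutive field variables
    of one element (Delta j) is 1, neles, or k depending on the packing.
    """
    if packing == "AoS":
        return e * nvars + n
    if packing == "SoA":
        return n * neles + e
    if packing == "AoSoA":
        # Elements are grouped into blocks of k; inside a block the
        # field variables of a given element sit k columns apart
        return (e // k) * k * nvars + n * k + (e % k)
    raise ValueError(f"unknown packing: {packing}")


# Stride between two subsequent field variables of element 3
for name, kw in [("AoS", {}), ("SoA", {}), ("AoSoA", {"k": 4})]:
    stride = (column_index(3, 1, 5, 8, name, **kw)
              - column_index(3, 0, 5, 8, name, **kw))
    print(name, stride)
```

Whichever pattern is chosen, every (element, variable) pair must map to a distinct column, which is what allows an implementation to pick freely so long as it is consistent.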
hence
$$ \mathbf{Q}^{(f)}_e = \mathbf{M}^5_e \mathbf{Q}^{(u)}_e, \tag{2.38} $$
where M5e can be seen to be block diagonal. This is a direct consequence of the above
42 Chapter 2 Flux Reconstruction
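and after substitution of (2.13) for $\tilde{f}^{(f_\perp)}_{e\sigma n\alpha}$ obtain
$$
\begin{aligned}
\tilde{\mathbf{R}}^{(u)}_e &= \mathbf{M}^3_e \big\{ \tilde{\mathbf{D}}^{(f)}_e - \mathbf{M}^2_e \tilde{\mathbf{F}}^{(u)}_e \big\} + \mathbf{M}^1_e \tilde{\mathbf{F}}^{(u)}_e \\
&= \mathbf{M}^3_e \tilde{\mathbf{D}}^{(f)}_e + \big\{ \mathbf{M}^1_e - \mathbf{M}^3_e \mathbf{M}^2_e \big\} \tilde{\mathbf{F}}^{(u)}_e.
\end{aligned}
\tag{2.39}
$$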
hence
$$ \mathbf{U}^{(q)}_e = \mathbf{M}^7_e \mathbf{U}^{(u)}_e. \tag{2.40} $$
Similarly, (2.33) can be rewritten by letting
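$$ \mathbf{M}^{10}_e = \operatorname{diag}(\mathbf{M}^7_e, \ldots, \mathbf{M}^7_e), \qquad \dim \mathbf{M}^{10}_e = N_D N^{(q)}_e \times N_D N^{(u)}_e, $$
$$ \big(\mathbf{Q}^{(q)}_e\big)_{\sigma(n\alpha)} = \mathbf{q}^{(q)}_{e\sigma n\alpha}, \qquad \dim \mathbf{Q}^{(q)}_e = N_D N^{(q)}_e \times N_V |\Omega_e|, $$
where $\mathbf{M}^{10}_e$ is observed to have a block diagonal structure similar to that of $\mathbf{M}^5_e$. Using
these it follows that
$$ \mathbf{Q}^{(q)}_e = \mathbf{M}^{10}_e \mathbf{Q}^{(u)}_e. \tag{2.41} $$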
2.5 Matrix Representation 43
The projection of the transformed flux in (2.34) can be brought into this framework by
defining
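$$ \big(\mathbf{M}^9_e\big)_{\rho\sigma} = \psi_{e\nu}(\tilde{\mathbf{x}}^{(u)}_{e\rho})\, \omega^{(q)}_{e\sigma}\, \psi_{e\nu}(\tilde{\mathbf{x}}^{(q)}_{e\sigma}), \qquad \dim \mathbf{M}^9_e = N^{(u)}_e \times N^{(q)}_e, $$
$$ \big(\tilde{\mathbf{F}}^{(q)}_e\big)_{\rho(n\alpha)} = \tilde{\mathbf{f}}^{(q)}_{e\rho n\alpha}, \qquad \dim \tilde{\mathbf{F}}^{(q)}_e = N_D N^{(q)}_e \times N_V |\Omega_e|, $$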
hence
$$ \mathbf{V}^{(u)}_e = \mathbf{M}^{11}_e \mathbf{U}^{(u)}_e. \tag{2.43} $$
with u and f together satisfying (2.1). In the above ρ is the mass density of the fluid,
v = (v x , vy , vz )T is the fluid velocity vector, E is the total energy per unit volume, and p
is the pressure. For a perfect gas the pressure and total energy can be related by the
ideal gas law
$$ E = \frac{p}{\gamma - 1} + \frac{1}{2} \rho \|\mathbf{v}\|^2, \tag{2.45} $$
where γ = c p /cv . Observe here the presence of terms of the form ρvi v j in f. Evaluating
these terms as a function of the conservative variables requires taking the quotient of
ρv j /ρ. However, in general, the quotient of two polynomials is not itself a polynomial.
Hence, the Euler flux function is not just non-linear but also non-polynomial.
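The role of the quotient can be made concrete by evaluating the pressure and one column of the inviscid flux from the conservative variables. This is a sketch under stated assumptions: the function names `pressure` and `inviscid_flux_x` are hypothetical, the default γ = 1.4 is an assumed value, and the flux shown is the standard x-direction Euler flux rather than a transcription of the thesis' definition.

```python
import numpy as np


def pressure(u, gamma=1.4):
    """Pressure from u = (rho, rho*vx, rho*vy, rho*vz, E) using (2.45)."""
    rho, rhov, E = u[0], u[1:-1], u[-1]
    # E = p/(gamma - 1) + rho*|v|^2/2  with  rho*|v|^2 = |rho*v|^2 / rho
    return (gamma - 1) * (E - 0.5 * np.sum(rhov**2) / rho)


def inviscid_flux_x(u, gamma=1.4):
    """x-direction Euler flux; note the division by rho in every term."""
    rho, rhov, E = u[0], u[1:-1], u[-1]
    p = pressure(u, gamma)
    vx = rhov[0] / rho                  # quotient of conserved variables
    f = np.empty_like(u)
    f[0] = rhov[0]
    f[1:-1] = rhov * vx                 # the rho*v_i*v_j terms
    f[1] += p
    f[-1] = vx * (E + p)
    return f


# Unit density, unit x-velocity, unit pressure
u = np.array([1.0, 1.0, 0.0, 0.0, 1.0 / 0.4 + 0.5])
print(pressure(u), inviscid_flux_x(u))
```

Since `rhov * vx` divides one polynomial by another, the flux generally leaves the polynomial space of the solution even when `u` itself is represented exactly.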
With the fluxes specified all that remains is to prescribe a method for computing the
common normal flux Fα at interfaces as defined in (2.15). This can be accomplished
using an approximate Riemann solver for the Euler equations. There exist a variety of
such solvers as detailed in [41] and Appendix A.
Presentation in two dimensions. The above prescriptions of the Euler and Navier–
Stokes equations are valid for the case of ND = 3. The two dimensional formulation
can be recovered by deleting the fourth rows in the definitions of u, f (inv) and f (vis)
along with the third columns of f (inv) and f (vis) . Vectors are now two dimensional with
the velocity being given by v = (v x , vy )T .
Chapter 3
Quadrature Rules
where f (x) is the function to be integrated, {xi } are a set of N p points, and {ωi } the set
of associated weights. The points and weights are said to define a quadrature rule. A
rule is said to be of strength φ if it is capable of exactly integrating any polynomial of
maximal degree φ over Ω. A degree φ polynomial p(x) with x ∈ Ω can be expressed as
a linear combination of basis polynomials
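$$ p(\mathbf{x}) = \sum_i^{|P^\phi|} \alpha_i P^\phi_i(\mathbf{x}), \qquad \alpha_i = \int_\Omega p(\mathbf{x}) P^\phi_i(\mathbf{x}) \,\mathrm{d}\mathbf{x}, \tag{3.2} $$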
can be expressed as
$$ P^\phi = \big\{ x^i y^j \mid 0 \le i \le \phi,\ 0 \le j \le \phi - i \big\}, \tag{3.3} $$
where φ is the maximal degree. Unfortunately, at higher degrees the monomials become
extremely sensitive to small perturbations in the inputs. This gives rise to polynomial
systems which are poorly conditioned and hence difficult to solve numerically [47, 52].
A solution to this is to switch to an orthonormal basis set defined in two dimensions as
$$ P^\phi = \big\{ \psi_{(ij)}(\mathbf{x}) \mid 0 \le i \le \phi,\ 0 \le j \le \phi - i \big\}, \tag{3.4} $$
where x = (x, y)T and ψ(i j) (x) satisfies the familiar orthonormality property ∀µ, ν
$$ \int_\Omega \psi_{(ij)}(\mathbf{x}) \psi_{(\mu\nu)}(\mathbf{x}) \,\mathrm{d}\mathbf{x} = \delta_{i\mu} \delta_{j\nu}. \tag{3.5} $$
In addition to being exceptionally well conditioned, orthonormal polynomial bases
have other useful properties. Taking the constant mode of the basis to be ψ(00) (x) = 1/c
it follows that
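$$ \int_\Omega \psi_{(ij)}(\mathbf{x}) \,\mathrm{d}\mathbf{x} = c \int_\Omega \psi_{(00)}(\mathbf{x}) \psi_{(ij)}(\mathbf{x}) \,\mathrm{d}\mathbf{x} = c\, \delta_{i0} \delta_{j0}, \tag{3.6} $$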
from which it can be deduced that all non-constant modes of the basis integrate up
to zero. Following Witherden and Vincent [37] this property is used to define the
truncation error associated with an N p point rule
$$ \xi^2(\phi) = \sum_{i,j} \Bigg\{ \sum_k^{N_p} \omega_k \psi_{(ij)}(\mathbf{x}_k) - c\, \delta_{i0} \delta_{j0} \Bigg\}^2, \tag{3.7} $$
This definition is convenient as it is free from both integrals and normalisation factors.
The task of constructing an N p point quadrature rule of strength φ is synonymous with
finding a set of points and weights that minimise ξ(φ).
Although the above discussion has been presented primarily in two dimensions all
of the ideas and relations carry over into three dimensions.
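The truncation error of (3.7) is straightforward to evaluate for a candidate rule. The sketch below does so on the square [−1, 1]², where an orthonormal basis can be assumed to be the tensor product of normalised Legendre polynomials (so ψ(00) = 1/2 and c = 2). The names `phat` and `truncation_error` are hypothetical and are not part of polyquad.

```python
import numpy as np
from numpy.polynomial import legendre as L


def phat(i, x):
    # Normalised Legendre polynomial sqrt((2i + 1)/2) * P_i(x)
    c = np.zeros(i + 1)
    c[i] = np.sqrt((2 * i + 1) / 2)
    return L.legval(x, c)


def truncation_error(pts, wts, phi):
    """xi(phi) of (3.7) on [-1, 1]^2 with psi_(ij) = P̂_i(x) P̂_j(y)."""
    c = 2.0  # psi_(00) = 1/2 = 1/c
    xi2 = 0.0
    for i in range(phi + 1):
        for j in range(phi - i + 1):
            psi = phat(i, pts[:, 0]) * phat(j, pts[:, 1])
            target = c if i == j == 0 else 0.0
            xi2 += (np.dot(wts, psi) - target)**2
    return np.sqrt(xi2)


# Tensor product of the two point Gauss-Legendre rule (four points)
x1, w1 = L.leggauss(2)
pts = np.array([(a, b) for a in x1 for b in x1])
wts = np.array([wa * wb for wa in w1 for wb in w1])

print(truncation_error(pts, wts, 3))  # rule is of strength 3
print(truncation_error(pts, wts, 4))  # but not of strength 4
```

A rule of strength φ drives ξ(φ) to zero up to rounding, whereas asking for a higher strength than the rule possesses leaves a residual of order one.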
defined inside of a square domain with vertices (−1, −1) and (1, 1). From these defini-
tions it is evident that p1 (x, y) = p2 (y, x). As this is a symmetry which is expressed by
the domain it is clear that any symmetric quadrature rule capable of integrating p1 is
also capable of integrating p2 . Further, when the index i is odd p1 (x, y) = −p1 (−x, y).
Similarly, when j is odd p1 (x, y) = −p1 (x, −y). In both cases it follows that the integral
of p1 is zero over the domain. More importantly, it also follows that any set of symmetric
points is also capable of obtaining this result. This is due to terms on the right
hand side of (3.1) pairing up and cancelling out. A consequence of this is that not all of
the equations in the system specified by (3.1) are independent. Having identified such
polynomials for a given domain it is legitimate to exclude them from the definition of
ξ(φ). Although this exclusion does change the value of ξ(φ) in the case of a non-zero
truncation error the effect is not significant. The set of basis polynomials which are
included is termed the objective basis, and is denoted by $\tilde{P}^\phi$.
Triangle. The reference triangle can be seen in Figure 3.1 and has an area given
by $\int_{-1}^{1} \int_{-1}^{-y} \mathrm{d}x \,\mathrm{d}y = 2$. A triangle has six symmetries: two rotations, three reflections,
and the identity transformation. A simple means of realising these symmetries is to
$$ \boldsymbol{\lambda} = (\lambda_1, \lambda_2, \lambda_3)^T, \qquad 0 \le \lambda_i \le 1, \quad \lambda_1 + \lambda_2 + \lambda_3 = 1, \tag{3.8} $$
where the columns of the matrix can be seen to be the vertices of the reference triangle.
The utility of barycentric coordinates is that the symmetric counterparts to a point λ
are given by its unique permutations. The number of unique permutations depends on
the number of distinct components of λ and leads us to the following three symmetry
orbits
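$$
\begin{aligned}
S_1 &= (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}), & |S_1| &= 1, \\
S_2(\alpha) &= \operatorname{Perm}(\alpha, \alpha, 1 - 2\alpha), & |S_2| &= 3, \\
S_3(\alpha, \beta) &= \operatorname{Perm}(\alpha, \beta, 1 - \alpha - \beta), & |S_3| &= 6,
\end{aligned}
$$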
where α and β are suitably constrained as to ensure that the resulting coordinates are
inside of the domain.
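The correspondence between symmetry orbits and unique permutations of a barycentric coordinate can be checked directly. In the sketch below the helper names `orbit` and `to_cartesian` are hypothetical, and the vertex ordering of the reference triangle is an assumption chosen to match the stated area of 2.

```python
from itertools import permutations


def orbit(*lam):
    """All unique permutations of a barycentric coordinate tuple."""
    return sorted(set(permutations(lam)))


def to_cartesian(lam, verts=((-1, -1), (1, -1), (-1, 1))):
    # Weighted sum of the (assumed) reference triangle vertices
    return tuple(sum(l * v[d] for l, v in zip(lam, verts)) for d in range(2))


s1 = orbit(1 / 3, 1 / 3, 1 / 3)   # centroid
s2 = orbit(0.2, 0.2, 0.6)         # two equal components
s3 = orbit(0.1, 0.3, 0.6)         # all components distinct

for s in (s1, s2, s3):
    print(len(s), [to_cartesian(lam) for lam in s])
```

The orbit sizes depend only on how many components of λ coincide, which is why three orbit types suffice for the triangle.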
It can be easily verified that the orthonormal polynomial basis inside of the reference
triangle is given by
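$$ \psi_{(ij)}(\mathbf{x}) = \sqrt{2}\, \hat{P}_i(a)\, \hat{P}^{(2i+1,0)}_j(b)\, (1 - b)^i, \tag{3.10} $$
where $a = 2(1 + x)/(1 - y) - 1$, and $b = y$ with the objective basis being given by
$$ \tilde{P}^\phi = \big\{ \psi_{(ij)}(\mathbf{x}) \mid 0 \le i \le \phi,\ i \le j \le \phi - i \big\}. \tag{3.11} $$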
In the asymptotic limit the cardinality of the objective basis is half that of the complete
basis. However, the modes of this objective basis are known not to be completely
independent. Several authors have investigated the derivation of an optimal quadrature
basis on the triangle. Details can be found in the papers of Lyness [42] and Dunavant
[43].
3.3 Reference Domains 53
Quadrilateral. The reference quadrilateral can be seen in Figure 3.1. The area
is simply $\int_{-1}^{1} \int_{-1}^{1} \mathrm{d}x \,\mathrm{d}y = 4$. A square has eight symmetries: three rotations, four
reflections and the identity transformation. Applying these symmetries to a point (α, β)
with 0 ≤ (α, β) ≤ 1 will yield a set χ(α, β) containing its counterparts. The cardinality
of χ depends on whether any of the symmetries give rise to identical points. This can be seen
to occur when either β = α or β = 0. Enumerating the possible combinations of the
above conditions gives rise to the following four symmetry orbits
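$$ \boldsymbol{\lambda} = (\lambda_1, \lambda_2, \lambda_3, \lambda_4)^T, \qquad 0 \le \lambda_i \le 1, \quad \lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1, \tag{3.14} $$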
where as with the triangle the columns of the matrix correspond to vertices of the
reference tetrahedron. Similarly the symmetric counterparts of λ are given by its unique
permutations. This leads us to the following five symmetry orbits
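$$
\begin{aligned}
S_1 &= (\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}), & |S_1| &= 1, \\
S_2(\alpha) &= \operatorname{Perm}(\alpha, \alpha, \alpha, 1 - 3\alpha), & |S_2| &= 4, \\
S_3(\alpha) &= \operatorname{Perm}(\alpha, \alpha, \tfrac{1}{2} - \alpha, \tfrac{1}{2} - \alpha), & |S_3| &= 6, \\
S_4(\alpha, \beta) &= \operatorname{Perm}(\alpha, \alpha, \beta, 1 - 2\alpha - \beta), & |S_4| &= 12, \\
S_5(\alpha, \beta, \gamma) &= \operatorname{Perm}(\alpha, \beta, \gamma, 1 - \alpha - \beta - \gamma), & |S_5| &= 24,
\end{aligned}
$$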
where α, β, and γ are constrained to ensure that all of the resulting coordinates are
inside of the domain.
With some manipulation it can be verified that the orthonormal polynomial basis
inside of the reference tetrahedron is given by
$$ \psi_{(ijk)}(\mathbf{x}) = \sqrt{8}\, \hat{P}_i(a)\, \hat{P}^{(2i+1,0)}_j(b)\, \hat{P}^{(2i+2j+2,0)}_k(c)\, (1 - b)^i (1 - c)^{i+j}, \tag{3.16} $$
where $a = -2(1 + x)/(y + z) - 1$, $b = 2(1 + y)/(1 - z) - 1$, and $c = z$. The objective basis is
given by
Prism. Extruding the reference triangle along the z-axis gives the reference prism
of Figure 3.2. It follows that the volume is $\int_{-1}^{1} \int_{-1}^{1} \int_{-1}^{-y} \mathrm{d}x \,\mathrm{d}y \,\mathrm{d}z = 4$. There are a total
of 12 symmetries. On account of the extrusion the most natural coordinate system
is a combination of barycentric and Cartesian coordinates: (λ1, λ2, λ3, z). Let Perm3
generate all of the unique permutations of its first three arguments. Using this the six
symmetry orbits of the prism can be expressed as
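$$
\begin{aligned}
S_1 &= (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}, 0), & |S_1| &= 1, \\
S_2(\gamma) &= (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}, \pm\gamma), & |S_2| &= 2, \\
S_3(\alpha) &= \operatorname{Perm}_3(\alpha, \alpha, 1 - 2\alpha, 0), & |S_3| &= 3, \\
S_4(\alpha, \gamma) &= \operatorname{Perm}_3(\alpha, \alpha, 1 - 2\alpha, \pm\gamma), & |S_4| &= 6, \\
S_5(\alpha, \beta) &= \operatorname{Perm}_3(\alpha, \beta, 1 - \alpha - \beta, 0), & |S_5| &= 6, \\
S_6(\alpha, \beta, \gamma) &= \operatorname{Perm}_3(\alpha, \beta, 1 - \alpha - \beta, \pm\gamma), & |S_6| &= 12,
\end{aligned}
$$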
where the constraints on α and β are identical to those in a triangle and 0 < γ ≤ 1.
Combining the orthonormal polynomial bases for a right-triangle and line segment
yields the orthonormal prism basis
$$ \psi_{(ijk)}(\mathbf{x}) = \sqrt{2}\, \hat{P}_i(a)\, \hat{P}^{(2i+1,0)}_j(b)\, \hat{P}_k(c)\, (1 - b)^i, \tag{3.18} $$
Pyramid. The reference pyramid can be seen in Figure 3.2 with a volume determined
by $\int_{-1}^{1} \int_{(z-1)/2}^{(1-z)/2} \int_{(z-1)/2}^{(1-z)/2} \mathrm{d}x \,\mathrm{d}y \,\mathrm{d}z = 8/3$. The symmetries are identical to those of a
quadrilateral. Extending the notation employed for the quadrilateral the following
symmetry orbits are obtained
Hexahedron. The chosen reference hexahedron can be seen in Figure 3.2. The volume
is, trivially, $\int_{-1}^{1} \int_{-1}^{1} \int_{-1}^{1} \mathrm{d}x \,\mathrm{d}y \,\mathrm{d}z = 8$. A hexahedron exhibits octahedral symmetry
with a symmetry number of 48. The procedure for determining the orbits is similar to
that used for the quadrilateral. Consider applying these symmetries to a point (α, β, γ)
with 0 ≤ (α, β, γ) ≤ 1 and let the resulting set of points be given by Ξ(α, β, γ). When α,
β, and γ are all distinct and greater than zero the set has a cardinality of 48, as expected.
However, when one or more parameters are either identical to one another or equal to
zero some symmetries give rise to equivalent points. This reduces the cardinality of the
set. Enumerating the various combinations, seven symmetry orbits are obtained
3.4 Methodology
The methodology employed for identifying symmetric quadrature rules is a refinement
of that described by Witherden and Vincent [37] for triangles. This method is, in turn,
a refinement of that of Zhang et al. [47].
To derive a quadrature rule, four input parameters are required: the reference
domain, Ω, the number of quadrature points N p , the target rule strength, φ, and a
desired runtime, t. The algorithm begins by computing all of the possible symmetric
decompositions of N p . The result is a set of vectors satisfying the relation ∀i
$$ N_p = \sum_{j=1}^{N_s} n_{ij} |S_j|, \tag{3.24} $$
where N s is the number of symmetry orbits associated with the domain Ω, and ni j is the
number of orbits of type j in the ith decomposition. Finding these involves solving the
constrained linear Diophantine equation outlined in §3.3. It is possible for this equation
to have no solutions. Consider for example the case of N p = 44 for a triangular domain.
From the symmetry orbits
subject to the constraint that n1 ∈ {0, 1}. This restricts N p to be either a multiple of three
or one greater. Since 44 is neither of these the equation is found to have no solutions.
Therefore, it is concluded that there can be no symmetric quadrature rules inside of a
triangle with 44 points.
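Enumerating the symmetric decompositions amounts to a bounded search over non-negative integer tuples. A minimal brute-force sketch follows; the function name `decompositions` and the use of `None` to mean "no upper limit" are assumptions made for this illustration and do not reflect how polyquad performs the search.

```python
from itertools import product


def decompositions(npts, orbit_sizes, max_counts):
    """Tuples n with sum_j n[j] * orbit_sizes[j] == npts.

    max_counts[j] caps how often orbit j may appear (None = unlimited).
    """
    ranges = []
    for size, cap in zip(orbit_sizes, max_counts):
        upper = npts // size
        if cap is not None:
            upper = min(upper, cap)
        ranges.append(range(upper + 1))

    return [n for n in product(*ranges)
            if sum(nj * size for nj, size in zip(n, orbit_sizes)) == npts]


# Triangle: |S1| = 1 (the centroid, used at most once), |S2| = 3, |S3| = 6
tri_sizes = (1, 3, 6)
tri_limits = (1, None, None)

print(decompositions(6, tri_sizes, tri_limits))
print(decompositions(44, tri_sizes, tri_limits))
```

For 44 points the search returns an empty list, in agreement with the argument above that the point count must be a multiple of three or one greater.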
Given a decomposition the goal is to find a set of orbital parameters and weights
that minimise the error associated with integrating the objective basis on Ω. This is
an example of a non-linear least squares problem. A suitable method for solving such
problems is the Levenberg-Marquardt algorithm (LMA). The LMA is an iterative
procedure for finding a set of parameters that correspond to a local minimum of a set
of functions. The minimisation process is not always successful and is dependent on
an initial guess of the parameters. Within the context of quadrature rule derivation
minimisation can be regarded as successful if ξ(φ) ∼ ε where ε represents machine
precision.
Let us denote the number of parameters associated with symmetry orbit $S_j$ as $[\![S_j]\!]$.
Using this the total number of degrees of freedom associated with decomposition $i$ can
be expressed as
$$ \sum_{j=1}^{N_s} \big\{ n_{ij} [\![S_j]\!] + n_{ij} \big\}, \tag{3.25} $$
with the second term accounting for the presence of one quadrature weight associated
with each symmetry orbit. From the list of orbits given in §3.3 the weights are ex-
pected to contribute approximately one third of the degrees of freedom. This is not an
insignificant fraction. One way of eliminating the weights is to treat them as depen-
dent variables. When the points are prescribed the right hand side of (3.1) becomes
linear with respect to the unknowns—the weights. In general, however, the number
of weights will be different from the number of polynomials in the objective basis.
It is therefore necessary to obtain a least squares solution to the system. Linear least
squares problems can be solved directly through a variety of techniques. Perhaps the
most robust numerical scheme is that of singular value decomposition (SVD). Thus,
at the cost of solving a small linear least squares problem at each LMA iteration it is
possible to reduce the number of free parameters to
$$ \sum_{j=1}^{N_s} n_{ij} [\![S_j]\!]. \tag{3.26} $$
Such a modification has been found to greatly reduce the number of iterations required
for convergence. This reduction more than offsets the marginally greater computational
cost associated with each iteration.
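Treating the weights as dependent variables can be sketched as a small linear least squares solve. The example below is a simplification under stated assumptions: it works on the square [−1, 1]² with a tensor product normalised Legendre basis, uses the complete basis rather than the objective basis, and assigns one weight per point rather than one per orbit. The names `phat` and `solve_weights` are hypothetical. NumPy's `lstsq` is used since it relies on an SVD-based LAPACK driver.

```python
import numpy as np
from numpy.polynomial import legendre as L


def phat(i, x):
    # Normalised Legendre polynomial sqrt((2i + 1)/2) * P_i(x)
    c = np.zeros(i + 1)
    c[i] = np.sqrt((2 * i + 1) / 2)
    return L.legval(x, c)


def solve_weights(pts, phi):
    """Least squares weights for fixed points, plus the residual norm."""
    rows, rhs = [], []
    for i in range(phi + 1):
        for j in range(phi - i + 1):
            rows.append(phat(i, pts[:, 0]) * phat(j, pts[:, 1]))
            rhs.append(2.0 if i == j == 0 else 0.0)   # c * delta_i0 delta_j0
    A, b = np.array(rows), np.array(rhs)

    # More equations than unknowns in general, hence a least squares solve
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w, np.linalg.norm(A @ w - b)


# Points of the 2 x 2 tensor product Gauss-Legendre rule
x1 = L.leggauss(2)[0]
pts = np.array([(a, b) for a in x1 for b in x1])

w, resid = solve_weights(pts, 3)
print(w, resid)
```

With the points fixed at a genuine rule the recovered weights reproduce it and the residual vanishes; inside an outer LMA iteration the same residual would serve as the quantity to be minimised over the orbital parameters.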
Previous works [47, 48, 54] have emphasised the importance of picking a ‘good’
initial guess to seed the LMA. To this end several methodologies for seeding orbital
parameters have been proposed. The degree of complexity associated with such strate-
gies is not insignificant. Further, it is necessary to devise a separate strategy for each
symmetry orbit. Experience, however, suggests that the choice of decomposition is
far more important than the initial guess in determining whether minimisation will
be successful. Orbits can therefore be seeded independently using uniform random
deviates. Let U1 be a deviate between [0, 1], U2 between [0, 1/2], U3 between [0, 1/3],
U4 between [0, 1/4], U11 between [−1, 1], and Uγ between [0, (1 − γ)/2]. Using these
the various orbital parameters can be seeded as follows.
Triangle.
S2 (α ∼ U2 )
S3 (α ∼ U2 , β ∼ U3 )
Quadrilateral.
S2 (α ∼ U1 )
S3 (α ∼ U1 )
S4 (α ∼ U1 , β ∼ U1 )
Tetrahedron.
S2 (α ∼ U3 )
S3 (α ∼ U2 )
S4 (α ∼ U3 , β ∼ U3 )
S5 (α ∼ U4 , β ∼ U4 , γ ∼ U4 )
Prism.
S2 (γ ∼ U1 )
S3 (α ∼ U2 )
S4 (α ∼ U2 , γ ∼ U1 )
S5 (α ∼ U2 , β ∼ U2 )
S6 (α ∼ U2 , β ∼ U2 , γ ∼ U1 )
Pyramid.
S1 (γ ∼ U11 )
S2 (α ∼ Uγ , γ ∼ U11 )
S3 (α ∼ Uγ , γ ∼ U11 )
S4 (α ∼ Uγ , β ∼ Uγ , γ ∼ U11 )
Hexahedron.
S2 (α ∼ U1 )
S3 (α ∼ U1 )
S4 (α ∼ U1 )
S5 (α ∼ U1 , β ∼ U1 )
S6 (α ∼ U1 , β ∼ U1 )
S7 (α ∼ U1 , β ∼ U1 , γ ∼ U1 )
3.5 Implementation
The algorithms outlined above have been implemented in a C++11 program called
polyquad. The program is built on top of the Eigen template library [60] and is
parallelised using MPI. It is capable of searching for quadrature rules on triangles,
quadrilaterals, tetrahedra, prisms, pyramids, and hexahedra. All rules are guaranteed
to be symmetric having all points inside of the domain. Polyquad can also, optionally,
filter out rules possessing negative weights. Further, functionality exists, courtesy of
MPFR [61], for refining rules to an arbitrary degree of numerical precision and for
evaluating the truncation error of a ruleset.
As a point of reference the case of using polyquad to identify φ = 10 rules with
81 points inside of a tetrahedron is considered. Running polyquad for one hour on a
quad-core Intel i7-4820K CPU a total of 50 distinct rules were found. These rules were
split across four distinct orbital decompositions.
The source code for polyquad is available under the terms of the GNU General
Public License v3.0 and can be downloaded from https://github.com/vincentlab/Polyquad.
3.6 Rules
Using polyquad a set of quadrature rules for each of the reference domains in §3.3
have been derived. All rules are completely symmetric, possess only positive weights,
and have all points inside of the domain. It is customary in the literature to refer to
quadratures with the last two attributes as being “PI” rules. As polyquad attempts to
Algorithm 3.1. Procedure for generating symmetric quadrature rules of strength φ with N_p
points inside of a domain.
procedure FindRules(N_p, φ, t)
    for all decompositions of N_p do
        t0 ← CurrentTime()
        repeat
            R ← SeedOrbits()              ▷ Initial guess of points
            ξ ← LMA(RuleResid, R)
            if ξ ∼ ε then                 ▷ If minimisation was successful
                save R
            end if
        until CurrentTime() − t0 > t
    end for
end procedure
Np
φ Tri Quad Tet Pri Pyr Hex
1 1 1 1 1 1 1
2 3 4 4 5 5 6
3 6 4 8 8 6 6
4 6 8 14 11 10 14
5 7 8 14 16 15 14
6 12 12 24 28 24 34
7 15 12 35 35 31 34
8 16 20 46 46 47 58
9 19 20 59 60 62 58
10 25 28 81 85 83 90
11 28 28
12 33 37
13 37 37
14 42 48
15 49 48
16 55 60
17 60 60
18 67 72
19 73 72
20 79 85
Chapter 4
Solution and Flux Points
4.1 Requirements
From the one dimensional numerical experiments of §2.4 it is apparent that the choice
of solution/flux points can have a significant impact on the quality of the resulting
interpolating polynomial. The first question that must be considered, however, is that
of existence: given a set of points inside of an element {x̃eρ } is it possible to construct a
nodal basis set?
In one dimension, closed form expressions exist for the nodal basis set with
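$$ \ell_\rho(\tilde{x}) = \prod_{\sigma \neq \rho} \frac{\tilde{x} - \tilde{x}_\sigma}{\tilde{x}_\rho - \tilde{x}_\sigma}, \tag{4.1} $$
where it can be readily verified that $\ell_\rho(\tilde{x}_\sigma) = \delta_{\rho\sigma}$. By inspection it is clear that the only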
requirement on the points is that they all be distinct. Beyond one dimension the set of
nodal basis functions is defined through the inverse of the generalised Vandermonde
matrix as
$$ \ell_{e\rho}(\tilde{\mathbf{x}}) = V^{-1}_{e\rho\sigma} \psi_{e\sigma}(\tilde{\mathbf{x}}), \tag{4.2} $$
where $V_{e\rho\sigma} = \psi_{e\rho}(\tilde{\mathbf{x}}_{e\sigma})$. To be able to build the nodal basis set it is therefore necessary
for $V_e$ to be invertible. This is equivalent to requiring that $\det V_e \neq 0$. It is known that in
two and three dimensions only certain sets of points fulfil this requirement which is
termed unisolvency [6, 62]. To illustrate this take $\{\tilde{\mathbf{x}}_{e\rho}\}$ to be a set of distinct points with
a non-singular Vandermonde matrix and consider arbitrarily relabelling a pair of points.
The effect of this relabelling is to interchange two columns in the Vandermonde matrix.
From the properties of the determinant this will cause the sign of the determinant to flip.
If this interchange is performed continuously, with the two points following different
non-intersecting paths as shown in Figure 4.1, it is evident from the intermediate value
theorem that there must be an intermediate arrangement where the determinant is zero.
Hence, while the points are all distinct they can not be used to construct a nodal basis
set.
Figure 4.1. Origins of non-unisolvency.
Figure 4.2. A set of points that are not unisolvent.
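The construction of a nodal basis through the inverse of the Vandermonde matrix, as in (4.2), can be illustrated in one dimension, where distinctness of the points is sufficient. This is a sketch only: the orthonormal basis is assumed to be the normalised Legendre polynomials, and the names `vandermonde` and `nodal_basis` are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre as L


def modal_basis(n, x):
    """psi[sigma, k] = normalised Legendre mode sigma evaluated at x[k]."""
    scale = np.sqrt((2 * np.arange(n) + 1) / 2)
    return (L.legvander(np.atleast_1d(x), n - 1) * scale).T


def vandermonde(pts):
    """V[rho, sigma] = psi_rho(x_sigma)."""
    return modal_basis(len(pts), np.asarray(pts))


def nodal_basis(pts, x):
    """l[rho, k] = V^{-1}[rho, sigma] psi_sigma(x[k])."""
    return np.linalg.solve(vandermonde(pts), modal_basis(len(pts), x))


pts = L.leggauss(4)[0]

# Evaluating the nodal basis at its own points gives the identity
print(np.round(nodal_basis(pts, pts), 12))

# A repeated point makes two columns of V coincide, so det V = 0
print(np.linalg.det(vandermonde([-0.5, 0.5, 0.5])))
```

In two and three dimensions the same determinant test applies, but as argued above it can fail even when all of the points are distinct.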
The next requirement is that of symmetry. This is essential for flux points as
otherwise there can exist topological configurations where pairs of flux points do
not align in physical space. Although there is no formal requirement for the solution
points to be symmetric it is nevertheless desirable for them to respect the underlying
symmetries of the element.
The quality of the interpolating polynomial p(x) arising from a collocation projec-
tion of a function f (x) can be measured by taking a norm of f (x) − p(x) over the region
of interest. Traditionally, the majority of nodal finite element codes have eschewed
collocation type projections in favour of full L2 projections using quadrature. Within this
framework the role of solution points is reduced to that of polynomial interpolation.
Hence, the main criterion used to assess the suitability of a set of points is the con-
ditioning of the resulting nodal basis set. This property is most naturally assessed by
considering the L∞ norm. Minimising this yields the so-called minmax polynomial;
that with the smallest maximum deviation. Denoting this polynomial as $p^\star(x)$ and
where
$$ \Lambda = \max_{x \in \Omega} \sum_i |\ell_i(x)|, \tag{4.4} $$
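The Lebesgue constant of (4.4) can be estimated numerically by sampling the sum of the absolute values of the nodal basis functions. The sketch below does this in one dimension on [−1, 1]; the function name `lebesgue_constant` and the sample count are assumptions, and since the maximum is taken over a finite sample the result is a lower bound on the true value.

```python
import numpy as np
from numpy.polynomial import legendre as L


def lebesgue_constant(pts, nsample=5001):
    """Sampled estimate of max_x sum_i |l_i(x)| on [-1, 1]."""
    pts = np.asarray(pts, dtype=float)
    x = np.linspace(-1, 1, nsample)
    total = np.zeros_like(x)

    for i, xi in enumerate(pts):
        others = np.delete(pts, i)
        # Lagrange polynomial l_i(x) = prod_{j != i} (x - x_j)/(x_i - x_j)
        li = np.prod((x[:, None] - others) / (xi - others), axis=1)
        total += np.abs(li)

    return total.max()


lam_eq = lebesgue_constant(np.linspace(-1, 1, 11))
lam_gl = lebesgue_constant(L.leggauss(11)[0])
print(lam_eq, lam_gl)
```

For eleven points the equispaced set yields a markedly larger constant than the Gauss-Legendre set, reflecting the poorer conditioning of its nodal basis.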
where
$$ \xi(x) = \frac{f^{(n+1)}(c)}{(n + 1)!} \prod_i (x - x_i), \tag{4.6} $$
where c is an unknown constant and n is the number of nodal basis functions. Using
this definition a least squares error can be introduced over the domain Ω as
$$ \sigma^2 = \int_{-1}^{1} \xi^2(x) \,\mathrm{d}x = A^2 \int_{-1}^{1} \prod_i (x - x_i)^2 \,\mathrm{d}x, \tag{4.7} $$
where A = f (n+1) (c)/(n + 1)!. Under the assumption that A has no dependence on the
choice of nodes this can be minimised as
$$ \partial_{x_k} \sigma^2 = -2 \int_{-1}^{1} \underbrace{\prod_i (x - x_i)}_{\text{degree } n} \; \underbrace{\prod_{j \neq k} (x - x_j)}_{\text{degree } n-1} \,\mathrm{d}x = 0. \tag{4.8} $$
This equation can be solved by requiring that the first term inside of the integral, of
degree n, be orthogonal to all polynomials of degree n − 1. The simplest means of
satisfying this requirement is to let the first term be a Legendre polynomial of degree n
$$ P_n(x) = \prod_i (x - x_i), \tag{4.9} $$
with the solution points being given as the roots of Pn (x). This is the very definition
of the abscissa of the n point Gauss-Legendre quadrature rule. Hence, in the absence
of any additional information about the form of f(x), it has been shown that, when
performing a collocation projection, the Gauss-Legendre points are optimal in
the least squares sense. This result is in good agreement with those of §2.4 where in
the one dimensional numerical experiments the Gauss-Legendre points were found
to outperform the Gauss-Legendre-Lobatto and equispaced points. Further, it is also
consistent with the theoretical arguments of Jameson et al. [33].
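The optimality of the Gauss-Legendre points can be checked numerically by evaluating the integral in (4.7), with the factor A² dropped, for several candidate point sets. In the sketch below the function name `nodal_poly_norm2` is hypothetical, and the four point Gauss-Legendre-Lobatto set (±1, ±1/√5) is supplied by hand.

```python
import numpy as np
from numpy.polynomial import legendre as L


def nodal_poly_norm2(pts):
    """Integral over [-1, 1] of prod_i (x - x_i)^2."""
    pts = np.asarray(pts, dtype=float)
    # An (n + 1) point Gauss rule is exact up to degree 2n + 1,
    # which covers the degree 2n integrand
    xq, wq = L.leggauss(len(pts) + 1)
    prod = np.prod(xq[:, None] - pts, axis=1)
    return np.sum(wq * prod**2)


n = 4
gl = L.leggauss(n)[0]
gll = np.array([-1.0, -1 / np.sqrt(5), 1 / np.sqrt(5), 1.0])
equi = np.linspace(-1, 1, n)

for name, pts in [("GL", gl), ("GLL", gll), ("equispaced", equi)]:
    print(name, nodal_poly_norm2(pts))
```

The Gauss-Legendre points give the smallest value, as expected from the fact that the monic Legendre polynomial has the least L2 norm among monic polynomials of the same degree.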
Through a tensor product construction this result can be extended to both quadri-
laterals and hexahedra. As the nodal basis functions inside these domains are simply
a product of one dimensional Lagrange nodal basis functions the unisolvency of the
basis follows immediately from the uniqueness of the Gauss-Legendre points. However,
from chapter 3 it is known there exist symmetric arrangements of points inside of
quadrilaterals and hexahedra that do not correspond to any tensor product construction.
Thus, the resulting point sets can not be considered globally optimal.
4.3 Triangles
Beyond tensor product elements it is not possible to obtain an analytic expression
for the truncation error. This precludes any direct numerical analysis or optimisation.
However, from the previous work of Castonguay et al. [34] and Williams et al. [35]
there is strong empirical evidence to suggest that solution points should be placed at
the abscissa of strong Gaussian quadrature rules. In this section further consideration
is given to this notion by analysing the impact of solution point placement inside of
triangles [37].
℘                  3    4    5    5    6    6    7
φ                  5    7    8    9   10   11   12
φ +                6    9   10   10   12   12   14
Nr                95   66  722  473  412   12  136
Nr (det V ≠ 0)    24   64  452    2  236    0  100
effects the analytic solution of the system at a time t is simply a translation of the initial
conditions.
Using the analytical solution an L2 error can be defined as
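$$ \sigma(t)^2 = \int_{-2}^{2} \int_{-2}^{2} \Big[ \rho^\delta(\mathbf{x} + \Delta_y(t)\hat{\mathbf{y}}, t) - \rho(\mathbf{x} + \Delta_y(t)\hat{\mathbf{y}}, t) \Big]^2 \mathrm{d}^2\mathbf{x}, \tag{4.13} $$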
where ρδ (x, t) is the numerical mass density, ρ(x, t) the analytic mass density, and
∆y (t) is the ordinate corresponding to the centre of the vortex at t and accounts for the
fact that the vortex is translating in a free stream of velocity unity in the y direction.
Restricting the region of consideration to a small box centred around the vortex serves
to further mitigate against the effects of vortices coupling together. The initial mass
density along with the [−2, −2] × [2, 2] region used to evaluate the error can be seen
in Figure 4.3. For an arbitrary triangular mesh the evaluation of (4.13) is somewhat
cumbersome. However, if the mesh is constructed such that there are times, tc , when
the region of integration does not straddle any mesh elements then the error can be
computed by simply integrating over each element and summing the residuals
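$$
\begin{aligned}
\sigma(t_c)^2 &= \iint_{\hat{\Omega}_e} \Big[ \rho^\delta_i(\tilde{\mathbf{x}}, t_c) - \rho(\mathcal{M}_i(\tilde{\mathbf{x}}), t_c) \Big]^2 J_i(\tilde{\mathbf{x}}) \,\mathrm{d}^2\tilde{\mathbf{x}} \\
&\approx \Big[ \rho^\delta_i(\tilde{\mathbf{x}}_j, t_c) - \rho(\mathcal{M}_i(\tilde{\mathbf{x}}_j), t_c) \Big]^2 J_i(\tilde{\mathbf{x}}_j)\, \omega_j,
\end{aligned}
\tag{4.14}
$$
where $\rho^\delta_i(\tilde{\mathbf{x}}, t_c)$ is the approximate mass density inside of the $i$th element in the box,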
and Ji (x̃) the associated Jacobian. In the second step the integral has been approximated
using a quadrature rule with abscissa {x̃ j } and weights {ω j }. The requirement that there
exist times when the grid and bounding box conform has been satisfied by using the
mesh of Figure 4.4.
To completely specify the proposed numerical experiment it is also necessary to
specify the time-marching algorithm/time-step, the approximate Riemann solver, and
the choice of flux points along each edge. In this study a classical fourth-order Runge–
Kutta (RK4) scheme is chosen with ∆t = 0.0005. For computing the numerical fluxes
at element interfaces a Rusanov type Riemann solver, as presented in [9] and detailed
in §A.1, is employed. Finally, at the edges of the triangles, the flux points are taken to
be at Gauss-Legendre points.
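For reference, a single step of the classical RK4 scheme can be sketched as follows. The function name `rk4_step` and the scalar test problem du/dt = −u are assumptions chosen for illustration; only the time step ∆t = 0.0005 is taken from the text.

```python
import math


def rk4_step(rhs, t, u, dt):
    """Advance u from t to t + dt with the classical four stage scheme."""
    k1 = rhs(t, u)
    k2 = rhs(t + dt / 2, u + dt / 2 * k1)
    k3 = rhs(t + dt / 2, u + dt / 2 * k2)
    k4 = rhs(t + dt, u + dt * k3)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


# Hypothetical test problem du/dt = -u with u(0) = 1, integrated to t = 1
dt, nsteps = 0.0005, 2000
u = 1.0
for step in range(nsteps):
    u = rk4_step(lambda t, y: -y, step * dt, u, dt)

print(u, math.exp(-1.0))
```

In an FR solver `rhs` would return the negative divergence of the flux for the whole state rather than a scalar, but the stage structure is unchanged.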
Figure 4.3. Initial density profile for the vortex in Ω, plotted against x and y with a
colour scale for ρ. The black box shows the area where the error is calculated at t = 0.
This box remains centred on the vortex as it progresses in the +y direction.
Figure 4.4. The mesh used for the vortex test case consisting of 800 triangles.
Results and discussion. For each order, all derived point sets were subjected to the
Euler vortex test case. Simulations were run until t = 100; equivalent to five passes
of the vortex through the domain. Measurements of the error were made every time
unit with the simulation being terminated should NaNs be encountered. For each rule
there are three direct metrics: the Lebesgue constant, Λ, the truncation error, ξ, and
the binary measure of whether the simulation made it to t = 100 or not. Further, for
those rules where the vortex does not break down it is possible to compute the L2 error
at t = 100, σ, and the average L2 error over the entire simulation, ⟨σ⟩. Denote the
rule with the smallest L2 error at t = 100 as being the σ-optimal point set and the rule
with the smallest average L2 error as being the ⟨σ⟩-optimal set. The range of Lebesgue
constants and truncation errors at each order can be seen in Figure 4.5. Plots of the L2
error against time are shown in Figure 4.6.
Starting with the Lebesgue-truncation plots, it is evident that for all orders except
℘ = 4 the σ- and ⟨σ⟩-optimal point sets, along with those of Williams et al. [36],
feature both low Lebesgue constants and low truncation errors. At higher orders
it is evident that points with either high Lebesgue constants or high truncation
errors are more likely to either become unstable before t = 100 or perform poorly.
A good example of this is the Λ-optimal points at orders ℘ = 3, 5, 6. At these orders
the Λ-optimal points all have truncation errors within the upper-quartile and exhibit
markedly worse performance than the σ-optimal or WS points. Conversely, at ℘ = 7
when the Λ-optimal point set has a truncation error which lies in the lower-quartile of
the distribution the performance of the set is extremely good. These three results all
serve to reaffirm the dual role that solution points play in FR schemes.
From the error-time plots it is observed that for all polynomial orders the perfor-
mance of the α-optimised points is significantly worse than those which are good
quadrature rules. This is in good agreement with Castonguay et al. [34]. It is also
observed from Table 4.2 that at orders ℘ = 4, 6, 7 the σ-optimal rule sets outperform
the WS points by 73%, 33%, and 34% respectively.
Figure 4.5. Semi-log plots of the Lebesgue constant Λ against truncation error ξ for all point
sets. Rules which do not make it to t = 100 are indicated by hollow markers.
Panels: (a) ℘ = 3, (b) ℘ = 4, (c) ℘ = 5, (d) ℘ = 6, (e) ℘ = 7.
4.3 Triangles 75
[Plot panels: five panels of σ against t for 0 ≤ t ≤ 100, with vertical axes scaled by 10^-2, 10^-3, 10^-4, 10^-5, and 10^-5.]
Figure 4.6. L2 error against time for the α-optimised, Λ-optimal, ξ-optimal, σ-optimal,
⟨σ⟩-optimal, and WS points.
                                   σ(t = 100)
Points    ℘ = 3          ℘ = 4          ℘ = 5          ℘ = 6          ℘ = 7
Λ-opt     2.58 × 10^-2   1.20 × 10^-3   2.64 × 10^-3   2.02 × 10^-4   5.95 × 10^-6
ξ-opt     8.20 × 10^-3   2.09 × 10^-3   1.36 × 10^-4   4.16 × 10^-5   1.10 × 10^-5
σ-opt     8.20 × 10^-3   6.59 × 10^-4   9.69 × 10^-5   2.38 × 10^-5   5.95 × 10^-6
⟨σ⟩-opt   8.61 × 10^-3   6.76 × 10^-4   9.96 × 10^-5   2.42 × 10^-5   5.95 × 10^-6
WS        8.27 × 10^-3   1.15 × 10^-3   6.92 × 10^-5   3.16 × 10^-5   8.00 × 10^-6
Chapter 5
Implementation
The implementation of the FR approach presented in this thesis is called PyFR. Written
in Python, PyFR is designed to be compact, efficient, scalable, and performance portable
across a range of platforms. Key functionality is summarised in Table 5.1.
As outlined in §2.5 the majority of operations within an FR step can be cast in
terms of matrix-matrix multiplications between a constant operator matrix and a state
matrix. All remaining operations, e.g. flux evaluations, are pointwise and concern
themselves with either a single solution point inside of an element or two collocating
flux points at an interface. Hence, in broad terms there are five salient aspects of an
FR implementation: (i) definition of the constant operator matrices, (ii) specification
of the state matrices, (iii) implementation of the matrix multiplication kernels, (iv)
implementation of the pointwise kernels, and finally (v) handling of distributed memory
parallelism and scheduling.
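As a rough illustration of the first three aspects, the sketch below applies a constant operator matrix to a state matrix. It is a toy in pure Python, not PyFR code: the helper `matmul`, the names `interp_to_flux` and `state`, and the choice of a two-point linear element with solution points at ±0.5 are all assumptions made for the example.

```python
# Sketch: one FR-style operation written as
#   (constant operator matrix) x (state matrix).
# All names and sizes here are illustrative assumptions, not PyFR identifiers.

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    n_inner = len(b)
    assert all(len(row) == n_inner for row in a)
    n_cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(n_inner))
             for j in range(n_cols)]
            for i in range(len(a))]

# Constant operator: interpolates a linear solution on [-1, 1] from
# solution points at -0.5 and +0.5 to flux points at -1 and +1.
# Row i holds the two Lagrange basis values at flux point i.
interp_to_flux = [[1.5, -0.5],
                  [-0.5, 1.5]]

# State matrix: one row per solution point, one column per element
# (a single field variable and three elements in this toy case).
state = [[1.0, 2.0, 0.0],
         [3.0, 2.0, 4.0]]

# Every element is processed by the same operator in a single product.
flux_point_values = matmul(interp_to_flux, state)
```

Because the operator is identical for every element of a given type, adding elements only widens the state matrix; the operator itself never changes.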
Table 5.1. Key functionality of PyFR.

Dimensions          2D, 3D
Elements            Triangles, Quadrilaterals, Hexahedra,
                    Tetrahedra, Prisms, Pyramids
Spatial orders      Arbitrary
Time steppers       Euler, RK4, RK45
Precisions          Single, Double
Backends            C/OpenMP, CUDA, OpenCL
Communication       MPI
File format         Parallel HDF5
Governing systems   Euler, compressible Navier–Stokes
vendors themselves, e.g. cuBLAS for NVIDIA GPUs. This approach greatly facilitates
development of efficient and platform portable code. However, it is important to
note that the matrix sizes encountered in PyFR are not necessarily optimal from a
GEMM perspective. Specifically, GEMM is optimised for the multiplication of large
square matrices, whereas the constant operator matrices in PyFR are of the block-by-
panel variety as illustrated in Figure 5.1. Moreover, the constant operator matrices
are known a priori, and do not change in time. This knowledge could, in theory, be
leveraged to design bespoke matrix multiply kernels that are more efficient than GEMM.
Development of such bespoke kernels will be a topic of future research.
As an example of a pointwise kernel consider the final evaluation of the
semi-discretised form of the system being solved

\[
  \frac{\partial u^{(u)}_{e\rho n\alpha}}{\partial t}
    = -J^{-1\,(u)}_{e\rho n} \, (\tilde{\nabla} \cdot \tilde{f})^{(u)}_{e\rho n\alpha}
      + S^{(u)}_{e\rho n\alpha},
\]

where $S^{(u)}_{e\rho n\alpha}$ is a source term that is permitted to vary in both space and time. Figure 5.2
shows how such a kernel can be expressed in the domain specific language of PyFR.
There are several points of note. Firstly, the kernel is purely scalar in nature; choices
such as how to vectorise a given operation or how to gather data from memory are all
delegated to the backend-specific templating engine. Secondly, it is possible to utilise
Python when generating the main body of kernels. This capability is used to loop over
each of the field variables to generate the body of the kernel. Since kernels are generated
at runtime it is trivial to support user-defined source terms. Expressions may be written
in the input configuration file and, after some validation, are substituted directly into
the kernel as it is being generated. The resulting kernels in the case of N_V = 4 with no
source terms for the C/OpenMP, CUDA, and OpenCL backends can be seen in Figures
5.3, 5.4, and 5.5 respectively. Observe here the somewhat unconventional structure
of the C/OpenMP kernel which is markedly different from the CUDA and OpenCL
kernels. This structure is necessary to ensure that the kernel is properly vectorised
across a range of compilers.
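The idea of using Python to generate the body of a kernel can be sketched without the real domain specific language. The following is a simplified stand-in that uses plain string formatting rather than Mako; the function name `generate_negdivconf_body`, the flat array indexing, and the way source expressions are passed in are assumptions made for illustration, and only the identifiers `rcpdjac` and `tdivtconf` are borrowed from the generated listings.

```python
# Sketch of runtime kernel generation: Python loops over the field variables
# and emits one scalar statement per variable. The output format and the
# handling of source terms are illustrative assumptions only.

def generate_negdivconf_body(nvars, source_exprs=None):
    """Return C-like source lines for a pointwise negdivconf-style update."""
    if source_exprs is None:
        source_exprs = ["0"] * nvars      # no source terms
    if len(source_exprs) != nvars:
        raise ValueError("need one source expression per field variable")
    lines = []
    for i, src in enumerate(source_exprs):
        # -J^{-1} * (div f) + S for field variable i
        lines.append(f"tdivtconf[{i}] = -rcpdjac*tdivtconf[{i}] + ({src});")
    return "\n".join(lines)

# Four field variables, no source terms
body = generate_negdivconf_body(4)
print(body)
```

A user-defined source term would enter through `source_exprs` as a string taken from the configuration file; a real implementation would need to validate such strings before substitution.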
5.4 Pointwise Kernels 81
#define PYFR_ALIGN_BYTES 32
#define PYFR_NOINLINE __attribute__ ((noinline))
void
negdivconf(int _ny, int _nx,
const fpdtype_t* __restrict__ rcpdjac_v, int lsdrcpdjac,
fpdtype_t* __restrict__ tdivtconf_v, int lsdtdivtconf)
{
#pragma omp parallel
{
int align = PYFR_ALIGN_BYTES / sizeof(fpdtype_t);
int rb, re, cb, ce;
loop_sched_2d(_ny, _nx, align, &rb, &re, &cb, &ce);
Figure 5.3. Generated OpenMP annotated C code for the negdivconf kernel.
__global__ void
negdivconf(int _ny, int _nx,
const fpdtype_t* __restrict__ rcpdjac_v, int lsdrcpdjac,
fpdtype_t* __restrict__ tdivtconf_v, int lsdtdivtconf)
{
int _x = blockIdx.x*blockDim.x + threadIdx.x;
__kernel void
negdivconf(int _ny, int _nx,
__global const fpdtype_t* restrict rcpdjac_v, int lsdrcpdjac,
__global fpdtype_t* restrict tdivtconf_v, int lsdtdivtconf)
{
int _x = get_global_id(0);
Figure 5.6. Flow diagram showing the stages required to compute ∇· f . Symbols correspond
to those of §2.5. For simplicity arguments referencing constant data have been
omitted. Memory indirection is indicated by red underlines. Synchronisation
points are signified by black horizontal lines.
Chapter 6
Validation
[Plot: contours of density ρ over −20 ≤ x ≤ 20 and −20 ≤ y ≤ 20; colour scale from 0.5 to 1.0.]
the four data points. The order is given by the gradient of the fit. A plot of order of
accuracy against time for the three schemes can be seen in Figure 6.2. Here the order
of accuracy is observed to change as a function of time. This is due to the fact that
the error is actually of the form σ(t) = σp + σso (t) where σp is a constant projection
error and σso is the time-dependent spatial operator error of (4.14). The projection
error arises as a consequence of the collocation projection of the initial conditions
onto the mesh. Over time the spatial operator error grows in magnitude and eventually
dominates. Only when σso(t) ≫ σp can the true order of the method be observed. The
results here can be seen to be in excellent agreement with those of [23].
[Plot: order of accuracy (3 to 6) against time (0 to 1800) for the DG, SD, and HU schemes.]
Figure 6.2. Spatial super accuracy observed for a ℘ = 3 simulation using DG, SD and HU
schemes. Results obtained using PyFR v0.1.0.
where φ = y/H and pc is a constant pressure. The total energy is given by the ideal gas
law of (2.45). On a finite domain the Couette flow problem can be modelled through the
imposition of periodic boundary conditions. For a two dimensional mesh periodicity
is enforced in x whereas for three dimensional meshes it is enforced in both x and
z. For the purposes of this experiment the initial conditions were taken as γ = 1.4,
Pr = 0.72, µ = 0.417, cp = 1005 J K^-1, H = 1 m, Tw = 300 K, pc = 1 × 10^5 Pa, and
vw = 69.445 m s^-1. These values correspond to a Mach number of 0.2 and a Reynolds
number of 200. The plates were modelled as no-slip isothermal walls as detailed in
§B.3 of Appendix B. A plot of the resulting energy profile can be seen in Figure 6.3.
Constant initial conditions are taken as ρ = ⟨ρ(φ)⟩, v = vw x̂, and p = pc. Using the
where Ω is the computational domain, E δ (x, t) is the numerical total energy, and E(x)
the analytic total energy. In the third step each integral has been approximated by using
a quadrature rule with abscissa {x̃e j } and weights {ωe j } inside of an element type e.
Couette flow is a steady state problem and so in the limit of t → ∞ the numerical total
energy should converge to a solution. Using PyFR v0.1.0 and starting from a constant
initial condition the L2 error was computed every 0.1 time units. The simulation was
said to have converged when σ(t)/σ(t + 0.1) ≤ 1.01 where σ is the L2 error. The time
at which this occurs is denoted as t∞ .
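The stopping criterion above is simple to state in code. The following is a minimal sketch, assuming the L2 error has already been sampled every 0.1 time units into a list; the function name `time_of_convergence` and the synthetic error history are illustrative and do not come from PyFR.

```python
def time_of_convergence(errors, dt=0.1, tol=1.01):
    """Return the first time t at which errors(t)/errors(t + dt) <= tol.

    errors[k] is the L2 error sampled at t = k*dt. Returns None if the
    criterion is never met within the samples provided.
    """
    for k in range(len(errors) - 1):
        if errors[k] / errors[k + 1] <= tol:
            return k * dt
    return None

# Made-up error history that decays towards a plateau of 1e-3
history = [1e-3 + 0.5 ** k for k in range(40)]
t_inf = time_of_convergence(history)
print(t_inf)
```

Note that the test compares successive samples only, so it detects a plateau in the error rather than proving that the steady state has been reached.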
Once the system has converged for a range of meshes it is possible to compute
the order of accuracy of the scheme. For a given ℘ this is the slope of a linear least
squares fit of log h ∼ log σ(t∞ ) where h is an approximation of the characteristic grid
spacing. The expected order of accuracy is ℘ + 1. In all simulations inviscid fluxes
were computed using the Rusanov approach and the LDG parameters were taken to be
β = 1/2 and τ = 0.1. All simulations were performed with DG correction functions and
at double precision. Inside tensor product elements Gauss-Legendre solution and flux
points were employed. Triangular elements utilised Williams-Shunn solution points
and Gauss-Legendre flux points.
Two dimensional unstructured mixed mesh. For the two dimensional test cases
the computational domain was taken to be [−1, 1]×[0, 1]. This domain was then meshed
with both triangles and quadrilaterals at four different refinement levels. The Couette
flow problem described above was then solved on each of these meshes. Experimental
L2 errors and orders of accuracy can be seen in Table 6.1. In all cases the expected
order of accuracy is obtained.
6.2 Couette Flow 89
[Plot: energy contours over −1 ≤ x ≤ 1 and 0 ≤ y ≤ 1; colour-bar ticks at 2.50, 2.51, and 2.52.]
Figure 6.3. Converged steady state energy profile for the two dimensional Couette flow
problem.
[Mesh panels: (a), (b), (c), (d).]
Figure 6.4. Unstructured mixed element meshes used for the two dimensional Couette flow
problem.
Table 6.1. L2 energy error and orders of accuracy for the Couette flow problem
on four mixed meshes. The mesh spacing was approximated as h ∼ NE^{-1/2}
where NE is the total number of elements in the mesh.

                                σ(t∞) / J m^-3
Tris   Quads   ℘ = 1          ℘ = 2          ℘ = 3          ℘ = 4
2      8       1.26 × 10^2    5.77 × 10^-1   5.54 × 10^-3   6.62 × 10^-5
6      22      3.56 × 10^1    1.40 × 10^-1   6.72 × 10^-4   3.91 × 10^-6
10     37      2.08 × 10^1    4.35 × 10^-2   2.54 × 10^-4   8.16 × 10^-7
16     56      1.46 × 10^1    3.52 × 10^-2   1.09 × 10^-4   4.62 × 10^-7
Order          2.21 ± 0.12    2.99 ± 0.32    3.97 ± 0.05    5.20 ± 0.38
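The order row of Table 6.1 can be checked with a few lines of Python. This sketch fits a straight line through (log h, log σ) using ordinary least squares with h ∼ NE^{-1/2}; the function name `fitted_order` is an assumption, and the uncertainty quoted in the table is not reproduced here.

```python
import math

def fitted_order(n_elements, errors, ndim=2):
    """Slope of the least squares line through (log h, log error).

    The grid spacing is approximated as h ~ NE**(-1/ndim).
    """
    xs = [math.log(n ** (-1.0 / ndim)) for n in n_elements]
    ys = [math.log(e) for e in errors]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

# Total element counts (triangles + quadrilaterals) from Table 6.1
n_elements = [2 + 8, 6 + 22, 10 + 37, 16 + 56]
errors_p1 = [1.26e2, 3.56e1, 2.08e1, 1.46e1]
errors_p3 = [5.54e-3, 6.72e-4, 2.54e-4, 1.09e-4]

order_p1 = fitted_order(n_elements, errors_p1)
order_p3 = fitted_order(n_elements, errors_p3)
print(round(order_p1, 2), round(order_p3, 2))
```

With these inputs the fitted slopes come out close to the tabulated 2.21 and 3.97, which suggests the table was produced with the same h ∼ NE^{-1/2} approximation.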
Three dimensional extruded hexahedral mesh. For this three dimensional case
the computational domain was taken to be [−1, 1] × [0, 1] × [0, 1]. Meshes were
constructed through first generating a series of unstructured quadrilateral meshes in the
x-y plane. A three layer extrusion was then performed on these meshes to yield a series
of hexahedral meshes. Experimental L2 errors and orders of accuracy for these meshes
can be seen in Table 6.2.
                      σ(t∞) / J m^-3
Hexes   ℘ = 1          ℘ = 2          ℘ = 3
78      3.35 × 10^1    5.91 × 10^-2   7.28 × 10^-4
195     1.23 × 10^1    1.87 × 10^-2   1.15 × 10^-4
405     6.15 × 10^0    5.49 × 10^-3   2.72 × 10^-5
Order   2.06 ± 0.08    2.87 ± 0.24    3.99 ± 0.03
Figure 6.5. Cutaway of the unstructured hexahedral mesh with 1 004 elements.
                      σ(t∞) / J m^-3
Hexes   ℘ = 1          ℘ = 2          ℘ = 3
96      1.91 × 10^1    4.32 × 10^-2   5.83 × 10^-4
536     8.20 × 10^0    9.11 × 10^-3   6.89 × 10^-5
1004    3.82 × 10^0    3.22 × 10^-3   2.04 × 10^-5
Order   1.93 ± 0.46    3.19 ± 0.48    4.16 ± 0.44
In the present study [85] flow over a circular cylinder at Re = 3 900 with an
effectively incompressible Mach number of 0.2 is considered. This case sits in the
shear-layer transition regime identified by Williamson [84], and contains several com-
plex flow features, including separated shear layers, turbulent transition, and a fully
turbulent wake. This test case has been the focus of a number of previous studies, both
experimental and numerical [76–80]. Recently, Lehmkuhl et al. [86] demonstrated that
the wake profile for this test case can be classified as one of two modes, a low-energy
mode (Mode-L) and a high-energy mode (Mode-H). Specifically, via analysis of a
very long period simulation of over 2 000 convective times, they showed that the wake
fluctuates between these two modes.
Domain. In the present study a computational domain with dimensions [−9D, 25D];
[−9D, 9D]; and [0, πD] in the stream-, cross-, and span-wise directions, respectively,
is used. The cylinder is centred at (0, 0, 0). The span-wise extent was chosen based
on the results of Norberg [76], who found no significant influence on statistical data
when the span-wise dimension was doubled from πD to 2πD. Indeed, a span of πD has
been used in the majority of previous numerical studies [77–79], including the recent
DNS study of Lehmkuhl et al. [86]. The stream-wise and cross-wise dimensions are
comparable to the experimental and numerical values used by Parnaudeau et al. [80],
whose results will be directly compared with those computed by PyFR. The overall
domain dimensions are also comparable to those used for DNS studies by Lehmkuhl et
al. [86]. The domain is periodic in the span-wise direction, with the no-slip isothermal
wall boundary condition of §B.3 applied at the surface of the cylinder and a Riemann
invariant boundary condition, as detailed in §B.4, applied at the far-field.
Meshes. The domain was meshed in two ways. The first mesh consisted of entirely
structured hexahedral elements, whilst the second was unstructured, consisting of
prismatic elements in the near wall boundary layer region, and tetrahedral elements in
the wake and far-field. Both meshes employed quadratically curved elements, and were
designed to fully resolve the near wall boundary layer region when ℘ = 4. Specifically,
the maximum skin friction coefficient was estimated a priori as $C_f \approx 0.075$ based
on the LES results of Breuer [78]. The height of the first element was then specified
such that when ℘ = 4 the first solution point from the wall sits at $y^+ \approx 1$, where
non-dimensional wall units are calculated in the usual fashion as $y^+ = u_\tau y / \nu$ with
$u_\tau = \sqrt{C_f / 2}\, u_\infty$.
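As a worked example of this estimate, the wall distance of the first solution point can be expressed in cylinder diameters. The sketch assumes that the Reynolds number is based on the diameter D and the free-stream velocity, so that y/D = y+ / (Re √(C_f/2)); the function name `first_point_height` is illustrative.

```python
import math

def first_point_height(reynolds, skin_friction, y_plus=1.0):
    """Wall distance y/D that gives the requested y+.

    Uses y+ = u_tau*y/nu with u_tau = sqrt(C_f/2)*u_inf and
    Re = u_inf*D/nu, so that y/D = y+ / (Re*sqrt(C_f/2)).
    """
    return y_plus / (reynolds * math.sqrt(skin_friction / 2.0))

# Values quoted in the text: Re = 3 900 and C_f of approximately 0.075
y_over_d = first_point_height(reynolds=3900, skin_friction=0.075)
print(f"{y_over_d:.3e}")
```

Under these assumptions the first solution point sits roughly 1.3 × 10^-3 diameters from the wall; the height of the first element then follows from where the chosen solution points lie inside it.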
The hexahedral mesh had 104 elements in the circumferential direction, and 16
elements in the span-wise direction, which when ℘ = 4 achieves span-wise resolution
comparable to that used in previous studies, as discussed by Breuer [78] and the
references contained therein. The prism/tetrahedral mesh had 116 elements in the
circumferential direction, and 20 elements in the span-wise direction. These numbers
were chosen to help reduce face aspect ratios at the edges of the prismatic layer, which
facilitates transition to the fully unstructured tetrahedral elements in the far-field. In
total the hexahedral mesh contained 119 776 elements, and the prism/tetrahedral mesh
contained 79 344 prismatic elements and 227 298 tetrahedral elements. Both meshes
are shown in Figure 6.6.
6.3 Cylinder Flow at Re = 3 900 95
[Plot: bars for the Hex and Pri/tet meshes, one per order ℘ = 1, 2, 3, 4.]
Figure 6.7. Computational effort required for the 119 776 element hexahedral mesh and the
mixed mesh with 79 344 prisms and 227 298 tetrahedra.
Instantaneous surfaces of iso-density are shown in Figure 6.8 for both simulations
at similar phases of the shedding cycle. Laminar flow is observed at the leading edge of
the cylinder for both test cases with a turbulent transition occurring near the separation
points, and finally fully turbulent flow is found in the wake region. These are the
characteristic features of the shear-layer transition regime, as described by Williamson
[84]. The wake is composed of large vortices, alternately shedding off from the upper
and lower surfaces of the cylinder, and smaller scale turbulent structures.
Plots of the averaged stream-wise wake profiles are shown in Figure 6.9 and
Figure 6.10 for Mode-H and Mode-L, respectively. Both meshes show excellent agree-
ment with the numerical results of Lehmkuhl et al. [86] for both modes and with the
experimental results of Parnaudeau et al. [80], which are available for Mode-L. The
Mode-H cases exhibit relatively shorter separation bubbles and the Mode-L cases have
characteristic inflection points in the wake profile near x/D ≈ 1.
Plots of the averaged pressure coefficient C p on the surface of the cylinder are
shown in Figure 6.11 and Figure 6.12 for both extracted modes and both meshes. The
Mode-H results are shown alongside the Mode-H numerical results of Lehmkuhl et
al. [86] and the results from Case I of Ma et al. [77]. The Mode-L results are shown
alongside the Mode-L numerical results of Lehmkuhl et al. [86] and the experimental
[Plot: u/u∞ against x/D for the PyFR pri/tet, PyFR hex, and Lehmkuhl et al. data sets.]
Figure 6.9. Averaged wake profiles for Mode-H compared with the numerical results of
Lehmkuhl et al. [86].
[Plot: u/u∞ against x/D for the PyFR prism/tet, PyFR structured, Lehmkuhl et al., and Parnaudeau et al. data sets.]
Figure 6.10. Averaged wake profiles for Mode-L compared with the numerical results of
Lehmkuhl et al. [86] and experimental results of Parnaudeau et al. [80].
[Plot: Cp against θ for the PyFR pri/tet, PyFR hex, Lehmkuhl et al., and Ma et al. data sets.]
Figure 6.11. Averaged pressure coefficient for Mode-H compared with the numerical results
of Ma et al. [77] and Lehmkuhl et al. [86].
results of Norberg et al. at a similar Re = 4 020 [76], which were extracted from
Kravchenko and Moin [79]. Both modes have similar pressure coefficient distributions
at the leading face of the cylinder, while the Mode-H case has stronger suction on the
trailing face adjacent to the separation bubble. Both modes extracted using both meshes
show excellent agreement with their corresponding reference data sets.
The averaged pressure coefficient at the base of the cylinder C_pb, and the averaged
separation angle θ_s measured from the leading stagnation point are tabulated in
Table 6.5 for both modes and meshes. These are shown along with measurements from
the experimental results of Norberg et al. [76], experimental data from Parnaudeau
et al. [80], and DNS data from Lehmkuhl et al. [86] for both modes. Both measured
quantities agree well with the reference data sets for both modes and meshes. The
difference in separation angle is less than ∼1◦ between the current and reference results.
The pressure coefficient at the base of the cylinder shows that the high-energy Mode-H
case has stronger recirculation in the wake, characterised by greater suction at the wall
adjacent to the recirculation bubble.
[Plot: Cp against θ for the PyFR pri/tet, PyFR hex, Lehmkuhl et al., and Norberg et al. data sets.]
Figure 6.12. Averaged pressure coefficient for Mode-L compared with the numerical results
of Lehmkuhl et al. [86] and experimental results of Norberg et al. [76].
                           Mode-H               Mode-L
                           −C_pb    θ_s / °     −C_pb    θ_s / °
PyFR hex                   0.987    88.28       0.880    87.71
PyFR pri/tet               0.974    87.13       0.882    86.90
Parnaudeau et al. [80]                                   88.00
Lehmkuhl et al. [86]       0.980    88.25       0.877    87.80
Norberg et al. [76, 79]                         0.880
6.4 Single-Node Performance 101
Plots of averaged stream-wise velocity at x/D = 1.06, 1.54, and 2.02 are shown
in Figure 6.13 and Figure 6.14 for the Mode-H and Mode-L simulations, respectively.
These results are shown alongside the experimental results of Parnaudeau et al. [80]
for Mode-L, the numerical results of Ma et al. [77] for Mode-H, and the DNS results
of Lehmkuhl et al. [86] for both modes. Both simulations show the V-shaped
velocity profile for Mode-H and the U-shaped profile for Mode-L at x/D = 1.06.
Both modes on both meshes agree well with their corresponding reference data
sets. Plots of averaged cross-wise velocity at x/D = 1.06, 1.54, and 2.02 are shown
in Figure 6.15 and Figure 6.16 for Mode-H and Mode-L, respectively. These cross-wise
velocity profiles also show excellent agreement with their corresponding reference data sets.
[Plot: u/u∞ against y/D for −2 ≤ y/D ≤ 2 in three panels, x/D = 1.06, 1.54, and 2.02.]
Figure 6.13. Time-span-average stream-wise velocity profiles for Mode-H compared with
the numerical results of Lehmkuhl et al. [86] and Ma et al. [77].
[Plot: u/u∞ against y/D for −2 ≤ y/D ≤ 2 in three panels, x/D = 1.06, 1.54, and 2.02.]
Figure 6.14. Time-span-average stream-wise velocity profiles for Mode-L compared with
the numerical results of Lehmkuhl et al. [86] and experimental results of Par-
naudeau et al. [80].
[Plot: v/u∞ against y/D for −2 ≤ y/D ≤ 2 in three panels, x/D = 1.06, 1.54, and 2.02.]
Figure 6.15. Time-span-average cross-stream velocity profiles for Mode-H compared with
the numerical results of Lehmkuhl et al. [86].
[Plot: v/u∞ against y/D for −2 ≤ y/D ≤ 2 in three panels, x/D = 1.06, 1.54, and 2.02.]
Figure 6.16. Time-span-average cross-stream velocity profiles for Mode-L compared with
the numerical results of Lehmkuhl et al. [86] and experimental results of Par-
naudeau et al. [80].
for the W9100 were obtained using DGEMM from clBLAS v2.0 with version 1411.4
of the AMD APP OpenCL runtime.
On the K40c ECC is implemented in software and hence when enabled error-
correction data is stored in global memory. A consequence of this is that when ECC is
enabled there is a reduction in available memory and memory bandwidth. This partially
accounts for the discrepancy observed between the theoretical and reference memory
bandwidths for the K40c. For both the K40c and the E5-2697, reference peaks for
double precision arithmetic are in excess of 80% of their theoretical values. However,
for the W9100 the reference peak for double precision arithmetic is only 34% of its
theoretical value. This value is not significantly improved via the auto-tuning utility
that ships with clBLAS. It is hoped that this figure will improve with future releases of
clBLAS.
In preparing Table 6.6 the decision has been made to deliberately omit the number
of ‘cores’ available on each platform. This is on account of the term being both ill-
defined and routinely subject to abuse in the literature. For example, the E5-2697 is
presented by Intel as having 12 cores, whereas the K40c is described by NVIDIA
as having 2880 ‘CUDA cores’. However, whereas the cores in the E5-2697 can be
considered linearly independent, those in the K40c cannot. The rough equivalent of
a CPU core in NVIDIA parlance is a ‘streaming multiprocessor’, or SMX, of which
the K40c has 15. Additionally, the E5-2697 has support for two-way simultaneous
multithreading—referred to by Intel as Hyper-Threading—permitting two threads to
execute on each core. At any one instant it is therefore possible to have up to 24
independent threads resident on a single E5-2697. The AMD equivalent of a CUDA
core is a ‘stream processor’ of which the W9100 has 2816. This is not to be confused
with the aforementioned streaming multiprocessor of NVIDIA; for which the AMD
equivalent is a ‘Compute Unit’. Practically, both CUDA cores and stream processors
are closer to the individual vector lanes of a traditional CPU core. Given this minefield
of confusing nomenclature the choice has instead been made to just state the peak
floating point capabilities of the hardware.
                               Platform
                               K40c     W9100    E5-2697
Arithmetic / GFLOP/s
  theoretical peak             1430     2620     280
  reference peak               1192     890      231
Memory bandwidth / GB/s
  theoretical peak             288      320      51.2
  reference peak               190      261      37.1
Thermal design power / W       235      275      130
Memory / GiB                   12       16
Clock / MHz                    745      930      3000
Transistors / Billion          7.1      6.2      4.3
Results and discussion. By measuring the wall clock time required for PyFR to take
500 RK45[2R+] time-steps, and utilising the operation counts per time-step detailed in
Figure 6.7, one can calculate the sustained performance of PyFR in GFLOP/s when
running with the meshes detailed in Figure 6.6 with ℘ = 1, 2, 3, 4.
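The calculation described here amounts to a single division. The sketch below uses placeholder inputs, not operation counts or timings from this work, and the function name `sustained_gflops` is an assumption.

```python
def sustained_gflops(flop_per_step, n_steps, wall_time_s):
    """Sustained rate in GFLOP/s from an operation count and a wall time."""
    return flop_per_step * n_steps / wall_time_s / 1e9

# Placeholder inputs chosen for round numbers; not measurements
rate = sustained_gflops(flop_per_step=2.0e11, n_steps=500, wall_time_s=200.0)
print(rate)
```

The operation count per step must cover every stage of the time-stepping scheme, since the wall time is measured over complete steps.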
Sustained performance of PyFR for the various hardware platforms is shown in
Figure 6.17. From the figure it is clear that the computational efficiency of PyFR
increases with the polynomial order. This is consistent with higher order simulations
having an increased compute intensity per degree of freedom. This additional intensity
results in larger operator matrices that are better suited to the tiling schemes employed
by BLAS libraries. The OpenCL implementation shipped by NVIDIA as part of CUDA
only supports the use of 32-bit memory pointers. As such a single context is limited to
[Plot: sustained GFLOP/s (0 to 600) against ℘ = 1, 2, 3, 4 for the Hex and Pri/tet meshes.]
Figure 6.17. Sustained performance of PyFR in GFLOP/s for the various pieces of hard-
ware. The backend used by PyFR is given in parentheses. For the OpenCL
backend the initial of the vendor is suffixed. As the NVIDIA OpenCL plat-
form is limited to 4 GiB of memory no results are available for ℘ = 3, 4.
4 GiB of memory, cf. Table 6.4. It was therefore not possible to perform the third and
fourth order simulations for either of the two meshes using the OpenCL backend with
the K40c.
The Intel and AMD implementations of OpenCL, when used in conjunction with
clBLAS, are only competitive with the C/OpenMP backend when ℘ = 1 for the
hexahedral mesh, and ℘ = 1, 2 for the prism/tetrahedral mesh. This is also the case
when comparing performance between the CUDA backend and the NVIDIA OpenCL
backend on the K40c. Prior analysis by Witherden et al. [89] suggests that at these orders
a reasonable proportion of the wall clock time will be spent in the bandwidth-bound
pointwise kernels as opposed to DGEMM. On account of being bandwidth-bound such
kernels do not extensively test the optimisation capabilities of the compiler. When
℘ = 4 both implementations of OpenCL on the E5-2697 are delivering between one
third and one quarter of the performance of the native backend. This highlights the
lack of performance portability associated with OpenCL in this context, confirming
the initial contention that, at the time of writing, performance portability can only
be achieved effectively via native paradigms. Further, it also justifies the approach to
multi-platform computing that has been adopted within PyFR.
Performance of the K40c culminates at 649 GFLOP/s for the ℘ = 4 hexahedral
mesh. This represents some 45% of the theoretical peak and 54% of the reference peak.
By comparison the E5-2697 obtains 132 GFLOP/s for the same simulation equating to
47% and 57% of the theoretical and reference peaks, respectively. Performance does
improve slightly to 140 GFLOP/s for the ℘ = 4 prism/tetrahedral mesh, however. On
this same mesh at ℘ = 4 the W9100 can be seen to sustain 657 GFLOP/s of throughput.
Although, in absolute terms, this observation represents the highest sustained rate of
throughput it corresponds to just 25% of the theoretical peak. However, working in
terms of realisable peaks, PyFR is found to obtain some 74% of the reference value.
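The percentages quoted above follow directly from the theoretical and reference peaks listed for each platform. A short check in Python, with the dictionary `PEAKS` and the function `percent_of_peak` introduced purely for illustration:

```python
# Peaks in GFLOP/s as (theoretical, reference), taken from the platform table
PEAKS = {
    "K40c": (1430, 1192),
    "W9100": (2620, 890),
    "E5-2697": (280, 231),
}

def percent_of_peak(platform, sustained):
    """Return sustained performance as whole percentages of both peaks."""
    theoretical, reference = PEAKS[platform]
    return (round(100 * sustained / theoretical),
            round(100 * sustained / reference))

k40c = percent_of_peak("K40c", 649)       # hexahedral mesh
e5 = percent_of_peak("E5-2697", 132)      # hexahedral mesh
w9100 = percent_of_peak("W9100", 657)     # prism/tetrahedral mesh
print(k40c, e5, w9100)
```

The gap between the two percentages for the W9100 reflects how far its reference DGEMM peak falls below its theoretical peak, rather than any inefficiency specific to PyFR.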
The wall clock time required per degree of freedom (DOF) to evaluate ∇ · f for
each simulation can be seen in Table 6.7. The DOF count is inclusive of the factor of
five arising from there being five distinct field variables at each solution point. This
quantity can be used to evaluate the efficiency of PyFR relative to other codes. With
the exception of OpenCL on the E5-2697 the time per DOF reaches a minimum for the
hexahedral mesh at ℘ = 3. This shows that as ℘ is raised from one to three the increasing
number of floating point operations required to update each DOF is being offset by the
improving efficiency of PyFR. The pattern is similar for the prism/tetrahedral mesh
except that for the E5-2697 (C/OpenMP) and the K40c (CUDA) the minimum is at ℘ = 4.
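To make the DOF count concrete, the sketch below counts solution points per element and multiplies by the five field variables. The point counts assume the usual full-order bases, namely (℘+1)^3 for hexahedra, (℘+1)^2(℘+2)/2 for prisms, and (℘+1)(℘+2)(℘+3)/6 for tetrahedra; this is an assumption of the example rather than something stated in this section, and all function names are illustrative.

```python
def points_per_element(etype, p):
    """Solution points in one element at polynomial order p (assumed bases)."""
    if etype == "hex":
        return (p + 1) ** 3
    if etype == "pri":
        return (p + 1) * (p + 1) * (p + 2) // 2
    if etype == "tet":
        return (p + 1) * (p + 2) * (p + 3) // 6
    raise ValueError(f"unknown element type: {etype}")

def dof_count(mesh, p, nvars=5):
    """Total DOFs for a mesh given as {element type: element count}."""
    return nvars * sum(n * points_per_element(t, p) for t, n in mesh.items())

def time_per_dof(wall_time_s, n_evals, mesh, p):
    """Wall time per DOF per evaluation of the right-hand side."""
    return wall_time_s / (n_evals * dof_count(mesh, p))

hex_mesh = {"hex": 119776}
mixed_mesh = {"pri": 79344, "tet": 227298}
print(dof_count(hex_mesh, 4), dof_count(mixed_mesh, 4))
```

Under these assumptions the two meshes carry a similar number of DOFs at ℘ = 4, which is what makes a per-DOF comparison between them meaningful.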
Mesh partitioning. In order to distribute a simulation across the nodes of the het-
erogeneous system it is first necessary to partition the mesh. High quality partitions
can be readily obtained using a graph partitioning package such as METIS [90] or
SCOTCH [91].
When partitioning a mixed element mesh for a homogeneous cluster it is necessary
to suitably weight each element type according to its computational cost. This cost
depends both upon the platform on which PyFR is running and the order at which the
simulation is being performed. In principle it is possible to measure this cost; however
in practice the following set of weights has been found to give satisfactory results
across most polynomial orders and platforms
where larger numbers indicate a greater computational cost. One subtlety that arises
here, is that from a graph partitioning standpoint there is no penalty associated with
placing a sole vertex (element) of a given weight inside of a partition. Computationally,
however, there is a very real penalty incurred from having just a single element of a
certain type inside of the partition. It is therefore desirable to avoid mesh partitions
where any one partition contains fewer than around a thousand elements of a given type.
An exception is when a partition contains no elements of such a type—in which case
zero overheads are incurred.
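A partitioning can be screened for this problem after the fact. The following sketch flags any partition holding a small but non-zero number of elements of some type; the threshold of 1000 follows the rule of thumb above, while the function name `undersized_groups` and the example counts are made up.

```python
def undersized_groups(partitions, threshold=1000):
    """List (partition index, element type, count) with 0 < count < threshold.

    Partitions with no elements of a type are not flagged, since they
    incur no overhead for that type.
    """
    flagged = []
    for idx, counts in enumerate(partitions):
        for etype, count in sorted(counts.items()):
            if 0 < count < threshold:
                flagged.append((idx, etype, count))
    return flagged

# Made-up element counts per partition for a prism/tetrahedral mesh
parts = [
    {"pri": 40000, "tet": 110000},
    {"pri": 39344, "tet": 117000},
    {"pri": 0, "tet": 298},
]
print(undersized_groups(parts))
```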
When partitioning a mesh with one type of element for a heterogeneous cluster it
is necessary to weight the partition sizes in line with the performance characteristics
of the hardware on each node. However, in the case of a mixed element mesh on a
heterogeneous cluster the weight of an element is no longer static but rather depends on
the partition that it is placed in—a significantly richer problem. Solving such a problem
is currently beyond the capabilities of most graph partitioning packages. Accordingly,
mixed element meshes that are partitioned for heterogeneous clusters often exhibit
poorer load balancing than those partitioned for homogeneous systems. Moreover, for
consistent performance it is necessary to dedicate a CPU core to each accelerator in
the system. The amount of useful computation that can be performed by the host CPU
is therefore reduced in accordance with this.
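For a mesh with a single element type, weighting the partition sizes reduces to a proportional split. The sketch below shows one way to turn relative node weights into target element counts; the weights are illustrative only and are not the values of Table 6.8, and the function name `partition_targets` is an assumption.

```python
def partition_targets(n_elements, node_weights):
    """Split n_elements across nodes in proportion to node_weights."""
    total = sum(node_weights.values())
    targets = {name: int(n_elements * w / total)
               for name, w in node_weights.items()}
    # Hand any rounding remainder to the most heavily weighted node
    heaviest = max(node_weights, key=node_weights.get)
    targets[heaviest] += n_elements - sum(targets.values())
    return targets

# Illustrative relative weights for one CPU and two accelerators
weights = {"cpu": 1.0, "gpu_a": 4.0, "gpu_b": 3.0}
targets = partition_targets(119776, weights)
print(targets)
```

In practice these targets would be handed to the graph partitioner as desired partition sizes, and the CPU weight would be lowered to account for the cores dedicated to driving the accelerators.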
Given the single-node performance numbers of Figure 6.17 it is natural to pair the
E5-2697 with the C/OpenMP backend, the K40c with the CUDA backend, and the
W9100 with the OpenCL backend, in order to achieve optimal performance. Employing
these results, in conjunction with some light experimentation, a set of partitioning
weights was obtained and is tabulated in Table 6.8.
[Plot: sustained GFLOP/s (0 to 1600) against ℘ = 1, 2, 3, 4 for the Hex and Pri/tet meshes; each bar is split into achieved and lost FLOP/s.]
Figure 6.18. Sustained performance of PyFR on the multi-node heterogeneous system for
each mesh with ℘ = 1, 2, 3, 4. Lost FLOP/s represent the difference between
the achieved FLOP/s and the sum of the E5-2697 (C/OpenMP), K40c (CUDA),
and W9100 (OpenCL A) bars in Figure 6.17.
6.6 Scalability
The scalability of PyFR v1.0.0 has been evaluated on the Piz Daint supercomputer [92].
Housed at the Swiss National Supercomputing Centre (CSCS) it is based around the
Cray XC30 platform and has 5 272 NVIDIA K20X GPUs. Each GPU has a theoretical
peak of 1 311 GFLOP/s for a total of 7.8 PFLOP/s. The raw memory capacity of each
GPU is 6 GiB; however, this decreases to ∼5.25 GiB when ECC is enabled.
When examining the scalability of a code there are two commonly used metrics.
The first of these is weak scalability in which the size of the target problem is increased
in proportion to the number of ranks N. A code is said to have perfect weak scalability
if the runtime remains unchanged as more ranks are added. The second metric is strong
scalability wherein the problem size is fixed and the speedup compared to a starting
number of ranks, N0, is assessed. Perfect strong scalability implies that the runtime
scales as N0 /N.
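Both metrics reduce to ratios of runtimes. A minimal sketch, with function names chosen for illustration and made-up runtimes:

```python
def weak_scaling_efficiency(runtime_base, runtime_n):
    """Perfect weak scaling gives 1.0 (runtime unchanged as ranks are added)."""
    return runtime_base / runtime_n

def strong_scaling_speedup(runtime_base, runtime_n):
    """Speedup relative to the run on the starting number of ranks N0."""
    return runtime_base / runtime_n

def strong_scaling_efficiency(speedup, n_base, n):
    """Perfect strong scaling gives a speedup of n/n_base, i.e. 1.0 here."""
    return speedup * n_base / n

# Made-up runtimes in seconds for N0 = 50 and N = 400 ranks
speedup = strong_scaling_speedup(runtime_base=100.0, runtime_n=16.0)
efficiency = strong_scaling_efficiency(speedup, n_base=50, n=400)
print(speedup, efficiency)
```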
To evaluate the scalability of PyFR a NACA 0021 aerofoil was meshed in two
dimensions using 51 632 unstructured quadrilateral elements. The grid was then
extruded to give a single-layer (NL = 1) hexahedral grid. When run in double precision
with the Navier–Stokes solver in PyFR at ℘ = 4 using full anti-aliasing (consisting of
a 216 point rule in the volume and a 36 point rule on each face) this results in a
working set of ∼6.4 GiB. For the purposes of performance evaluation the problem can
be scaled arbitrarily by increasing the number of layers NL in the extrusion. Before any
simulations can be run it is necessary to first partition the domain into N pieces. This is
accomplished using METIS [90]. An important consequence of this is that the metrics
being measured are a function of both the inherent scalability of PyFR and of the quality
of the domain decomposition.
Weak scalability. Starting with two K20X GPUs and a single layer, for a working
set of ∼3.2 GiB/GPU, the weak scalability of PyFR was evaluated up to N = 2 000.
The resulting runtimes, normalised to that of the N = 2 case, are tabulated in Table 6.9.
In the case of N = 2 000 the simulation is observed to consist of 32 × 10^9 degrees of
freedom with a total working set of ∼6.25 TiB. The resulting sustained performance of
1.3 PFLOP/s represents 50.8% of the theoretical peak FLOP/s. This is extremely impressive
for a high-order code running on an automatically partitioned unstructured grid.
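The problem size quoted for the N = 2 000 case can be checked with a little arithmetic. The sketch below takes the element count, working set, and per-GPU peak from the text. It assumes (℘ + 1)^3 solution points per hexahedron and five conserved variables, neither of which is stated in this section.

```python
# Figures quoted in the text.
n_quads = 51_632               # 2D quadrilaterals, one hexahedron each per layer
order = 4                      # polynomial order
n_gpus = 2_000
gpus_per_layer = 2             # the N = 2 baseline uses a single layer
working_set_per_gpu_gib = 3.2
peak_gflops_per_gpu = 1_311
sustained_pflops = 1.3         # rounded value from the text

# Assumptions for illustration, not taken from this section.
pts_per_hex = (order + 1) ** 3  # tensor-product solution points
n_vars = 5                      # density, three momenta, total energy

layers = n_gpus // gpus_per_layer
dofs = n_quads * layers * pts_per_hex * n_vars
working_set_tib = working_set_per_gpu_gib * n_gpus / 1024
peak_pflops = peak_gflops_per_gpu * n_gpus / 1e6
fraction_of_peak = sustained_pflops / peak_pflops

print(f"{dofs:.3g} DOFs, {working_set_tib:.2f} TiB, {fraction_of_peak:.1%} of peak")
```

Because the sustained rate is entered here as the rounded 1.3 PFLOP/s, the computed fraction of peak is only approximate and need not match the quoted percentage to the last digit.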
Strong scalability. Starting with 50 K20X GPUs and forty layers, for a net working
set of ∼256 GiB, the strong scalability of PyFR was evaluated up to N = 400. The
resulting speedups compared to the initial N = 50 case are tabulated in Table 6.10.
From the table it is observed that an eight-fold increase in GPUs results in a speed up
of 6.26. Although this is not perfect it is important to note that in this case the working
set of each GPU is just ∼640 MiB.
Chapter 7
Conclusion
A formulation of the FR approach has been developed for solving non-linear
advection-diffusion type problems on mixed curvilinear grids. It has also been demonstrated how
the majority of operations within this formulation can be cast as large matrix-matrix
multiplications. Furthermore, a methodology has been presented for automatically
determining the maximum stable step size for a simulation. Techniques for mitigating
and controlling the impact of aliasing driven instabilities have also been investigated.
As part of this a methodology for identifying symmetric quadrature rules on a
variety of domains in two and three dimensions was presented. Using this methodology
a set of rules tuned towards the requirements of finite element methods, including
anti-aliased FR, were presented. Many of these rules appear to be new and represent
an improvement over those tabulated in the literature. The impact of solution point
placement on the nonlinear stability of FR schemes has also been studied extensively.
Theoretical results confirming the optimal nature of Gauss-Legendre points were
presented. A new class of Lebesgue and truncation optimised solution points were also
derived for triangular elements and shown to represent an improvement over existing
point sets.
PyFR, an open source Python based framework for solving the Euler and compress-
ible Navier–Stokes equations on mixed unstructured grids, has also been presented.
The structure and ethos of PyFR have been explained, including the approaches taken to
support multiple hardware platforms. It is shown how runtime code generation can be
used to improve both the performance and portability of the code. Extensive validation
of PyFR has also been performed. Spatial super accuracy is demonstrated when solving
the two dimensional Euler equations along with the expected orders of accuracy for
the Couette flow problem on a range of grids in two and three dimensions. The long
time dynamics of flow over a cylinder at Re = 3 900 were also assessed with PyFR
successfully resolving both the L and H modes. Results demonstrating the performance
portability of PyFR across a range of hardware platforms were also presented. The
heterogeneous capabilities of PyFR were also demonstrated. The scalability of PyFR
has been demonstrated in the weak sense up to 2 000 NVIDIA K20X GPUs when
solving the three dimensional Navier–Stokes equations around an extruded NACA
0021 aerofoil. On an unstructured grid sustained performance in excess of 1.3 PFLOP/s
is observed.
Appendix A
In the following section $u_L$ and $u_R$ are taken to be the two discontinuous solution
states at an interface and $\hat{n}_L$ to be the normal vector associated with the first state.
For convenience $f_L^{(\mathrm{inv})} = f^{(\mathrm{inv})}(u_L)$ and $f_R^{(\mathrm{inv})} = f^{(\mathrm{inv})}(u_R)$, with inviscid fluxes being
prescribed by (2.44).
A.1 Rusanov
Also known as the local Lax-Friedrichs method, a Rusanov type Riemann solver
imposes inviscid numerical interface fluxes according to
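\[
  F^{(\mathrm{inv})} = \frac{\hat{n}_L}{2} \cdot \left\{ f_L^{(\mathrm{inv})} + f_R^{(\mathrm{inv})} \right\} + \frac{s}{2} (u_L - u_R), \tag{A.1}
\]
where $s$ is an estimate of the maximum wave speed
\[
  s = \sqrt{\frac{\gamma (p_L + p_R)}{\rho_L + \rho_R}} + \frac{1}{2} \left| \hat{n}_L \cdot (v_L + v_R) \right|. \tag{A.2}
\]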
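As an illustration of (A.1) and (A.2), the following NumPy sketch evaluates the Rusanov flux for a pair of conservative states. It is not PyFR's implementation. It assumes that (2.44) is the standard Euler flux, that states are stored as [ρ, ρv, E], that γ = 1.4, and that the modulus of the normal velocity term is taken so that s is non-negative. All function names are hypothetical.

```python
import numpy as np

GAMMA = 1.4  # ratio of specific heats; assumed value


def primitives(u):
    """Split a conservative state [rho, rho*v..., E] into rho, v and p."""
    rho, rhov, E = u[0], u[1:-1], u[-1]
    v = rhov / rho
    p = (GAMMA - 1.0) * (E - 0.5 * rho * np.dot(v, v))
    return rho, v, p


def inviscid_flux(u):
    """Euler flux as an (nvars, ndims) array; assumed to correspond to (2.44)."""
    rho, v, p = primitives(u)
    ndims = v.size
    f = np.empty((ndims + 2, ndims))
    f[0] = rho * v                                      # mass flux
    f[1:-1] = rho * np.outer(v, v) + p * np.eye(ndims)  # momentum flux
    f[-1] = (u[-1] + p) * v                             # energy flux
    return f


def rusanov_flux(ul, ur, nl):
    """Interface flux of (A.1) using the wave speed estimate of (A.2)."""
    rho_l, v_l, p_l = primitives(ul)
    rho_r, v_r, p_r = primitives(ur)
    s = (np.sqrt(GAMMA * (p_l + p_r) / (rho_l + rho_r))
         + 0.5 * abs(np.dot(nl, v_l + v_r)))
    f_avg = 0.5 * (inviscid_flux(ul) + inviscid_flux(ur))
    return f_avg @ nl + 0.5 * s * (ul - ur)


ul = np.array([1.0, 0.5, 0.0, 2.5])
ur = np.array([0.8, 0.2, 0.1, 2.0])
print(rusanov_flux(ul, ur, np.array([1.0, 0.0])))
```

When both states coincide the dissipative term vanishes and the expression reduces to the normal component of the physical flux, which is a convenient sanity check for any implementation.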
Appendix B
Boundary Conditions
ghost solution state and $B^{(b)} q_L$ is the ghost solution gradient. It is straightforward to
extend this prescription to allow for the provisioning of different ghost solution states
for $C_\alpha$ and $F_\alpha$ and to permit $B^{(b)} q_L$ to be a function of $u_L$ in addition to $q_L$.
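\[
  B^{(\mathrm{inv})} u_L = B^{(\mathrm{ldg})} u_L =
  \begin{Bmatrix} \rho_f \\ \rho_f v_f \\ p_f/(\gamma - 1) + \rho_f \|v_f\|^2/2 \end{Bmatrix}, \tag{B.1}
\]
\[
  B^{(\mathrm{ldg})} q_L = 0. \tag{B.2}
\]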
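\[
  B^{(\mathrm{inv})} u_L = B^{(\mathrm{ldg})} u_L =
  \begin{Bmatrix} \rho_L \\ \rho_L v_L \\ p_f/(\gamma - 1) + \rho_L \|v_L\|^2/2 \end{Bmatrix}, \tag{B.3}
\]
\[
  B^{(\mathrm{ldg})} q_L = 0, \tag{B.4}
\]
\[
  B^{(\mathrm{inv})} u_L = \rho_L
  \begin{Bmatrix} 1 \\ 2 v_w - v_L \\ c_p T_w/\gamma + \|2 v_w - v_L\|^2/2 \end{Bmatrix}, \tag{B.5}
\]
\[
  B^{(\mathrm{ldg})} u_L = \rho_L
  \begin{Bmatrix} 1 \\ v_w \\ c_p T_w/\gamma + \|v_w\|^2/2 \end{Bmatrix}, \tag{B.6}
\]
\[
  B^{(\mathrm{ldg})} q_L = q_L. \tag{B.7}
\]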
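The wall states (B.5) and (B.6) can be sketched in NumPy as follows. This is an illustrative reading of the equations rather than PyFR's kernel code. The function name, the storage order [ρ, ρv, E], and the numerical values of γ and c_p are assumptions.

```python
import numpy as np

GAMMA = 1.4   # ratio of specific heats; assumed value
CP = 1005.0   # specific heat at constant pressure; assumed value


def wall_ghost_states(ul, v_w, T_w):
    """Ghost states of (B.5) and (B.6) for an interior state ul = [rho, rho*v..., E]."""
    rho_l = ul[0]
    v_l = ul[1:-1] / rho_l
    v_inv = 2.0 * v_w - v_l     # reflected velocity used for the inviscid flux
    e_int = CP * T_w / GAMMA    # internal energy per unit mass at the wall temperature
    u_inv = rho_l * np.concatenate(([1.0], v_inv, [e_int + 0.5 * np.dot(v_inv, v_inv)]))
    u_ldg = rho_l * np.concatenate(([1.0], v_w, [e_int + 0.5 * np.dot(v_w, v_w)]))
    return u_inv, u_ldg


ul = np.array([1.2, 0.6, -0.3, 2.5e5])
u_inv, u_ldg = wall_ghost_states(ul, v_w=np.zeros(2), T_w=300.0)
print(u_inv, u_ldg)
```

With this choice the mean of the interior momentum and the inviscid ghost momentum equals ρ_L v_w, so a stationary wall yields zero mean momentum at the interface.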
Using these the density, velocity, and pressure at the boundary can be defined as
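\[
  \rho_b = \left[ \frac{(\gamma - 1)^2 (R_L - R_f)^2}{16\gamma}
  \begin{cases}
    \rho_f^{\gamma}/p_f & \text{if } v_L \cdot \hat{n}_L < 0 \\
    \rho_L^{\gamma}/p_L & \text{otherwise,}
  \end{cases}
  \right]^{\frac{1}{\gamma - 1}} \tag{B.10}
\]
\[
  v_b = \frac{\hat{n}_L}{2} (R_L + R_f) +
  \begin{cases}
    v_f - \hat{n}_L (v_f \cdot \hat{n}_L) & \text{if } v_L \cdot \hat{n}_L < 0 \\
    v_L - \hat{n}_L (v_L \cdot \hat{n}_L) & \text{otherwise,}
  \end{cases} \tag{B.11}
\]
\[
  p_b = \frac{(\gamma - 1)^2 (R_L - R_f)^2 \rho_b}{16\gamma}, \tag{B.12}
\]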
with the final boundary states being given by
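\[
  B^{(\mathrm{inv})} u_L = B^{(\mathrm{ldg})} u_L =
  \begin{Bmatrix} \rho_b \\ \rho_b v_b \\ p_b/(\gamma - 1) + \rho_b \|v_b\|^2/2 \end{Bmatrix}, \tag{B.13}
\]
\[
  B^{(\mathrm{ldg})} q_L = 0. \tag{B.14}
\]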
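A compact way to read (B.10) to (B.13) is as a function that maps the interior and free-stream primitives, together with the two Riemann invariants, to a conservative boundary state. The sketch below does exactly that. It takes R_L and R_f as inputs, since their definitions are not reproduced here, and the function name, argument layout, and γ = 1.4 are assumptions made for illustration.

```python
import numpy as np

GAMMA = 1.4  # ratio of specific heats; assumed value


def characteristic_state(prim_l, prim_f, R_l, R_f, nl):
    """Boundary state of (B.13) built from (B.10)-(B.12).

    prim_l and prim_f are (rho, v, p) tuples for the interior and free stream.
    """
    rho_l, v_l, p_l = prim_l
    rho_f, v_f, p_f = prim_f
    # Entropy and tangential velocity come from the free stream for inflow
    # (v_L . n_L < 0) and from the interior otherwise.
    if np.dot(v_l, nl) < 0:
        rho_u, v_u, p_u = rho_f, v_f, p_f
    else:
        rho_u, v_u, p_u = rho_l, v_l, p_l
    k = (GAMMA - 1.0) ** 2 * (R_l - R_f) ** 2 / (16.0 * GAMMA)
    rho_b = (k * rho_u ** GAMMA / p_u) ** (1.0 / (GAMMA - 1.0))   # (B.10)
    v_b = 0.5 * nl * (R_l + R_f) + v_u - nl * np.dot(v_u, nl)     # (B.11)
    p_b = k * rho_b                                               # (B.12)
    E_b = p_b / (GAMMA - 1.0) + 0.5 * rho_b * np.dot(v_b, v_b)
    return np.concatenate(([rho_b], rho_b * v_b, [E_b]))
```

Keeping the invariants as arguments means the same routine can be exercised with whichever definitions the surrounding solver adopts.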
Bibliography
[21] JC Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley
& Sons, Ltd, 2008. ISBN: 9780470753767.
[22] E Hairer and G Wanner. Solving Ordinary Differential Equations II. 2nd ed.
Springer-Verlag, 1996.
[23] PE Vincent, P Castonguay, and A Jameson. Insights from von Neumann analysis
of high-order flux reconstruction schemes. Journal of Computational Physics
230(22), 2011, pp. 8134–8154.
[24] PE Vincent, P Castonguay, and A Jameson. A new class of high-order energy
stable flux reconstruction schemes. Journal of Scientific Computing 47(1), 2011,
pp. 50–72.
[25] A Jameson. A proof of the stability of the spectral difference method for all
orders of accuracy. Journal of Scientific Computing 45(1-3), 2010, pp. 348–358.
[26] P Castonguay, DM Williams, PE Vincent, and A Jameson. Energy Stable Flux
Reconstruction Schemes for Advection-Diffusion Problems. Computer Methods
in Applied Mechanics and Engineering, 2013.
[27] P Castonguay, PE Vincent, and A Jameson. A new class of high-order energy
stable flux reconstruction schemes for triangular elements. Journal of Scientific
Computing 51(1), 2012, pp. 224–256.
[28] DM Williams, P Castonguay, PE Vincent, and A Jameson. Energy stable flux
reconstruction schemes for advection-diffusion problems on triangles. Journal
of Computational Physics, 2013.
[29] HT Huynh. High-order methods including discontinuous Galerkin by recon-
structions on triangular meshes. AIAA Paper 44, 2011.
[30] DM Williams and A Jameson. Energy Stable Flux Reconstruction Schemes for
Advection-Diffusion Problems on Tetrahedra. Journal of Scientific Computing,
2013, pp. 1–39.
[51] R Cools and A Haegemans. Another step forward in searching for cubature
formulae with a minimal number of knots for the square. Computing 40(2),
1988, pp. 139–146.
[52] L Shunn and F Ham. Symmetric quadrature rules for tetrahedra based on a
cubic close-packed lattice arrangement. Journal of Computational and Applied
Mathematics 236(17), 2012, pp. 4348–4364.
[53] P Keast. Moderate-degree tetrahedral quadrature formulas. Computer Methods
in Applied Mechanics and Engineering 55(3), 1986, pp. 339–348.
[54] EJ Kubatko, BA Yeager, and AL Maggi. New computationally efficient quadra-
ture formulas for triangular prism elements. Computers & Fluids 73, 2013,
pp. 187–201.
[55] EJ Kubatko, BA Yeager, and AL Maggi. New computationally efficient quadra-
ture formulas for pyramidal elements. Finite Elements in Analysis and Design
65, 2013, pp. 63–75.
[56] AH Stroud. Approximate calculation of multiple integrals. Prentice-Hall, 1971.
[57] DA Dunavant. Efficient symmetrical cubature rules for complete polynomials of
high degree over the unit cube. International journal for numerical methods in
engineering 23(3), 1986, pp. 397–407.
[58] R Cools and KJ Kim. Rotation invariant cubature formulas over the n-dimensional
unit cube. Journal of computational and applied mathematics 132(1), 2001,
pp. 15–32.
[59] FWJ Olver. NIST handbook of mathematical functions. Cambridge University
Press, 2010.
[60] G Guennebaud, B Jacob, et al. Eigen v3. 2010. http://eigen.tuxfamily.org.
[61] L Fousse, G Hanrot, V Lefèvre, P Pélissier, and P Zimmermann. MPFR: A
Multiple-Precision Binary Floating-Point Library with Correct Rounding. ACM
Transactions on Mathematical Software 33(2), 2007, 13:1–13:15.
[62] L Bos. On certain configurations of points in $\mathbb{R}^n$ which are unisolvent for
polynomial interpolation. Journal of approximation theory 64(3), 1991, pp. 271–280.
[63] TJ Rivlin. An introduction to the approximation of functions. Dover, 2003.
[64] T Warburton. An explicit construction of interpolation nodes on the simplex.
Journal of engineering mathematics 56(3), 2006, pp. 247–262.
[65] MA Taylor, BA Wingate, and RE Vincent. An algorithm for computing Fekete
points in the triangle. SIAM Journal on Numerical Analysis 38(5), 2000, pp. 1707–
1720.
[66] Q Chen and I Babuška. The optimal symmetrical points for polynomial in-
terpolation of real functions in the tetrahedron. Computer methods in applied
mechanics and engineering 137(1), 1996, pp. 89–94.
[67] H Luo and C Pozrikidis. A Lobatto interpolation grid in the tetrahedron. IMA
journal of applied mathematics, 2006.
[68] J Chan and T Warburton. A Comparison of High Order Interpolation Nodes for
the Pyramid. arXiv preprint arXiv:1412.4138, 2014.
[69] F Johansson et al. mpmath: a Python library for arbitrary-precision
floating-point arithmetic (version 0.18). 2013. http://mpmath.org/.
[70] M Bayer. Mako: Templates for Python. 2013. http://www.makotemplates.org/.
[71] A Klöckner, N Pinto, Y Lee, B Catanzaro, P Ivanov, and A Fasih. PyCUDA
and PyOpenCL: A scripting-based approach to GPU run-time code generation.
Parallel Comput. 38(3), 2012, pp. 157–174. ISSN: 0167-8191.
[72] L Dalcin. mpi4py: MPI for Python. 2013. https://bitbucket.org/mpi4py.
[73] A Collette. Python and HDF5. O’Reilly Media, 2013.
[74] BC Vermeire and S Nadarajah. Adaptive IMEX time-stepping for ILES using
the correction procedure via reconstruction scheme. AIAA paper 2687, 2013.
[75] BC Vermeire and S Nadarajah. Adaptive IMEX schemes for high-order unstruc-
tured methods. Journal of Computational Physics 280, 2015, pp. 261–286.
[76] C Norberg. LDV measurements in the near wake of a circular cylinder.
International Journal for Numerical Methods in Fluids 28(9), 1998, pp. 1281–1302.
[77] X Ma, GS Karamanos, and GE Karniadakis. Dynamics and low-dimensionality
of a turbulent near wake. Journal of Fluid Mechanics 410, 2000, pp. 29–65.
[78] M Breuer. Large eddy simulation of the subcritical flow past a circular cylinder.
International Journal for Numerical Methods in Fluids 28(9), 1998, pp. 1281–
1302.
[79] AG Kravchenko and P Moin. Numerical studies of flow over a circular cylinder
at ReD = 3 900. Physics of Fluids 12, 2000, pp. 403–417.
[80] P Parnaudeau, J Carlier, D Heitz, and E Lamballais. Experimental and numerical
studies of the flow over a circular cylinder at Reynolds number 3900. Physics of
Fluids 20(8), 2008.
[81] A Roshko. On the development of turbulent wakes from vortex streets. Technical
Report No. NACA TR 1191, California Institute of Technology, 1953.
[82] MS Bloor. The transition to turbulence in the wake of a circular cylinder. Journal
of Fluid Mechanics 19, 1964, pp. 290–304.
[83] CHK Williamson. The existence of two stages in the transition to three dimen-
sionality of a cylinder wake. Physics of Fluids 31, 1988, pp. 3165–3168.
[84] CHK Williamson. Vortex dynamics in the cylinder wake. Annual Review of
Fluid Mechanics 28, 1996, pp. 477–539.
[85] FD Witherden, BC Vermeire, and PE Vincent. Heterogeneous computing on
mixed unstructured grids with PyFR. Computers & Fluids 120, 2015, pp. 173–
186.
[86] O Lehmkuhl, I Rodriguez, R Borrell, and A Oliva. Low-frequency unsteadiness
in the vortex formation region of a circular cylinder. Physics of Fluids 25(8),
2013.
[87] BC Vermeire, JS Cagnone, and S Nadarajah. ILES using the correction procedure
via reconstruction scheme. AIAA paper 1001, 2013.
[88] BC Vermeire, S Nadarajah, and PG Tucker. Canonical test cases for high-order
unstructured implicit large eddy simulation. AIAA paper 0935, 2014.
[89] FD Witherden, AM Farrington, and PE Vincent. PyFR: An open source frame-
work for solving advection–diffusion type problems on streaming architectures
using the flux reconstruction approach. Computer Physics Communications
185(11), 2014, pp. 3028–3040.
[90] G Karypis and V Kumar. A fast and high quality multilevel scheme for parti-
tioning irregular graphs. SIAM Journal on Scientific Computing 20(1), 1998,
pp. 359–392.
[91] F Pellegrini and J Roman. Scotch: A software package for static mapping by dual
recursive bipartitioning of process and architecture graphs. High-Performance
Computing and Networking. Springer. 1996, pp. 493–498.
[92] PE Vincent, FD Witherden, AM Farrington, G Ntemos, BC Vermeire, JS Park,
and AS Iyer. PyFR: Next-Generation High-Order Computational Fluid Dynam-
ics on Many-Core Hardware. AIAA paper 3050, 2015.
[93] A Jameson and T Baker. Solution of the Euler equations for complex configura-
tions. AIAA Paper 1929, 1983.
Colophon
The original source for this document was typeset by the author in LaTeX 2ε using the
KOMA-Script bundle. Diagrams and illustrations were created using the TikZ package.
Graphs were generated in R using the ggplot2 package. All of the source was written
in GNU Emacs with the AUCTeX package. The final document was generated using
LuaTeX and optimised using pdfsizeopt.
The title and captioning font is URW-Garamond while the main body font is Times
at 11pt. Sans-serif elements are typeset in Helvetica.