ARTICLE INFO

The review of this paper was arranged by Prof. Andrew Hazel

Keywords:
High-order accuracy
Flux reconstruction
Computational fluid dynamics

ABSTRACT

PyFR is an open-source cross-platform computational fluid dynamics framework based on the high-order Flux Reconstruction approach, specifically designed for undertaking high-accuracy scale-resolving simulations in the vicinity of complex engineering geometries. Since the initial release of PyFR v0.1.0 in 2013, a range of new capabilities have been added to the framework, with a view to enabling industrial adoption. In this work, we provide details of these enhancements as released in PyFR v2.0.3, including improvements to cross-platform performance (new backends, extensions of the DSL, new matrix multiplication providers, improvements to the data layout, use of task graphs) and improvements to numerical stability (modal filtering, anti-aliasing, artificial viscosity, entropy filtering), as well as the addition of prismatic, tetrahedral and pyramid shaped elements, improved domain decomposition support for mixed element grids, improved handling of curved element meshes, the addition of an adaptive time-stepping capability, the addition of incompressible Euler and Navier–Stokes solvers, improvements to file formats and the development of a plugin architecture. We also explain efforts to grow an engaged developer and user community and provide a range of examples that show how our user base is applying PyFR to solve a wide range of fundamental, applied and industrial flow problems. Finally, we demonstrate the accuracy of PyFR v2.0.3 for a supersonic Taylor–Green vortex case, with shocks and turbulence, and provide the latest performance and scaling results on up to 1024 AMD Instinct MI250X accelerators of Frontier at ORNL (each with two GCDs) and up to 2048 NVIDIA GH200 GPUs of Alps at CSCS. We note that the absolute performance of PyFR, accounting for the totality of both hardware and software improvements, has, conservatively, increased by almost 50× over the last decade.

Program summary
Program Title: PyFR
CPC Library link to program files: https://doi.org/10.17632/vmgh4kfjk6.1
Developer's repository link: https://github.com/PyFR/PyFR
Licensing provisions: BSD 3-clause
Programming language: Python (generating C/OpenMP, CUDA, OpenCL, HIP, Metal)
* Corresponding author.
E-mail address: [email protected] (P.E. Vincent).
1. Introduction

Computational Fluid Dynamics (CFD) is used by high-value industries across the world to reduce costs and improve product performance. The majority of industrial CFD is undertaken using Reynolds-Averaged Navier–Stokes (RANS) simulations, which time-average unsteady phenomena, including turbulence, and replace the 'missing physics' with a model. However, it is well established that RANS approaches have limited applicability when flow is separated and unsteady. To overcome this limitation, higher-fidelity scale-resolving methods can be used, such as Large Eddy Simulations (LES), Implicit Large Eddy Simulations (ILES) and Direct Numerical Simulations (DNS), with DNS being the most accurate; fully resolving all physics of the governing Navier–Stokes equations. However, the cost of a scale-resolving simulation is typically orders-of-magnitude higher than that of a RANS simulation, and thus the use of scale-resolving methods in industry has until recently been considered intractable.

Our vision with PyFR has been to develop and deliver an open-source Python framework based on the high-order accurate Flux Reconstruction (FR) approach [1] that enables real-world scale-resolving simulations to be undertaken in tractable time, at scale, by both academic and industrial practitioners—helping advance industrial CFD capabilities from their current 'RANS plateau'. It is one of several international efforts to enable industrial adoption of scale-resolving simulations, including Nektar++ [2,3] and hpMusic [4], which employ high-order spectral element methods similar to PyFR; HiPSTAR [5,6], which employs a compact finite-difference scheme; and CharLES (now acquired and distributed for industrial usage as the Fidelity LES Solver by Cadence Inc.) [7,8], which employs a second-order accurate cell-centered finite volume scheme.

The first version of PyFR—v0.1.0—was released in late 2013 [9]. It supported solving the compressible Euler and Navier–Stokes equations on unstructured grids of hexahedral elements, and was able to target both conventional CPUs and NVIDIA GPUs via a novel domain specific language based on Mako. In the past decade, there have been over 30 subsequent releases of PyFR, adding a wide range of new capabilities with a view to enabling industrial adoption, culminating in the current release of PyFR v2.0.3, which is described in this paper.

2. Flux reconstruction

PyFR implements the Flux Reconstruction (FR) approach [1], which is a form of discontinuous spectral element method [10,11]. As a brief overview, consider the first-order hyperbolic conservation law

$$\frac{\partial u_\alpha}{\partial t} + \nabla \cdot \mathbf{f}_\alpha = 0, \tag{1}$$

where $u_\alpha$ is a conserved quantity and $\mathbf{f}_\alpha$ is its flux. The solution domain can be partitioned into non-overlapping, conforming elements $\Omega_n$, each of which can be mapped to and from a reference element via functions

$$\mathbf{x} = \mathcal{M}_n(\tilde{\mathbf{x}}), \qquad \tilde{\mathbf{x}} = \mathcal{M}_n^{-1}(\mathbf{x}), \tag{2}$$

where $\tilde{\mathbf{x}}$ are coordinates in the reference element, and geometric Jacobian matrices can be defined from the mapping functions as

$$\mathbf{J}_n = J_{nij} = \frac{\partial \mathcal{M}_{ni}}{\partial \tilde{x}_j}, \quad J_n = \det \mathbf{J}_n, \qquad
\mathbf{J}_n^{-1} = J_{nij}^{-1} = \frac{\partial \mathcal{M}_{ni}^{-1}}{\partial x_j}, \quad J_n^{-1} = \det \mathbf{J}_n^{-1} = \frac{1}{J_n}.$$

For each $\Omega_n$, these Jacobian matrices can be used to transform Eq. (1) into reference element space as

$$\frac{\partial u_{n\alpha}}{\partial t} + J_n^{-1} \tilde{\nabla} \cdot \tilde{\mathbf{f}}_{n\alpha} = 0
\quad\text{and}\quad
\tilde{\mathbf{f}}_{n\alpha} = \tilde{\mathbf{f}}_{n\alpha}(\tilde{\mathbf{x}}, t) = J_n(\tilde{\mathbf{x}})\,\mathbf{J}_n^{-1}(\mathcal{M}_n(\tilde{\mathbf{x}}))\,\mathbf{f}_{n\alpha}(\mathcal{M}_n(\tilde{\mathbf{x}}), t), \tag{3}$$

where $u_{n\alpha}$ and $\mathbf{f}_{n\alpha}$ are the solution and flux in $\Omega_n$, respectively, and $\tilde{\nabla} = \partial/\partial \tilde{x}_i$.

We can proceed to define a set of solution points $\tilde{\mathbf{x}}^{(u)}_\zeta$ in the reference element (see Fig. 1), where $\zeta$ is the solution point index which satisfies $0 \leqslant \zeta < N^{(u)}$ and $N^{(u)}$ is the number of solution points in the reference element. Now a nodal basis set $\ell^{(u)}_\zeta(\tilde{\mathbf{x}})$ can be defined in the reference element, where the nodal basis polynomials $\ell^{(u)}_\zeta$ satisfy $\ell^{(u)}_\zeta(\tilde{\mathbf{x}}^{(u)}_\sigma) = \delta_{\zeta\sigma}$, where $\delta_{ij}$ is the Kronecker delta. We can also define a set of flux points $\tilde{\mathbf{x}}^{(f)}_\zeta$ on the surface of the reference element (see Fig. 1), where $\zeta$ is now the flux point index which satisfies $0 \leqslant \zeta < N^{(f)}$ and $N^{(f)}$ is the number of flux points on the surface of the reference element. These flux points are constrained such that flux points from adjoining elements always conform at element interfaces.

For a given element, the first step in the FR approach is to obtain the discontinuous solution at each flux point $u^{(f)}_{\sigma n\alpha}$ from the solution at the solution points $u^{(u)}_{\zeta n\alpha}$ as

$$u^{(f)}_{\sigma n\alpha} = \sum_{\zeta=0}^{N^{(u)}-1} u^{(u)}_{\zeta n\alpha}\, \ell^{(u)}_\zeta(\tilde{\mathbf{x}}^{(f)}_\sigma). \tag{4}$$

The second step is to obtain a transformed common normal interface flux at each flux point $\tilde{f}^{C(f)}_{\sigma n\alpha\perp}$ from the discontinuous solution at the flux point $u^{(f)}_{\sigma n}$, the discontinuous solution at the conforming flux point in the relevant adjoining element $u'^{(f)}_{\sigma n}$, and the surface normal at the flux point $\hat{\mathbf{n}}^{(f)}_{\sigma n}$ as

$$\tilde{f}^{C(f)}_{\sigma n\alpha\perp} = \mathfrak{F}_\alpha\!\left(u^{(f)}_{\sigma n},\, u'^{(f)}_{\sigma n},\, \hat{\mathbf{n}}^{(f)}_{\sigma n}\right), \tag{5}$$

where $\mathfrak{F}_\alpha$ is e.g. an appropriate Riemann solver.

The third step is to calculate the transformed discontinuous flux at the solution points $\tilde{\mathbf{f}}^{(u)}_{\sigma n\alpha}$ from the solution at the solution points $u^{(u)}_{\sigma n\alpha}$ using the system flux function. These values can then be used to calculate the transformed normal discontinuous flux at the flux points $\tilde{f}^{(f)}_{\sigma n\perp\alpha}$ as

$$\tilde{f}^{(f)}_{\sigma n\perp\alpha} = \hat{\mathbf{n}}^{(f)}_{\sigma n} \cdot \sum_{\nu=0}^{N^{(u)}-1} \tilde{\mathbf{f}}^{(u)}_{\nu n\alpha}\, \ell^{(u)}_\nu(\tilde{\mathbf{x}}^{(f)}_\sigma). \tag{6}$$

The final step is to compute the divergence of the continuous flux at each solution point, using correction functions $\mathbf{g}^{(f)}_\sigma$ associated with each flux point, as

$$(\tilde{\nabla} \cdot \tilde{\mathbf{f}})^{(u)}_{\zeta n\alpha} = \sum_{\sigma=0}^{N^{(f)}-1} \left[\tilde{f}^{C(f)}_{\sigma n\alpha\perp} - \tilde{f}^{(f)}_{\sigma n\perp\alpha}\right] \tilde{\nabla} \cdot \mathbf{g}^{(f)}_\sigma(\tilde{\mathbf{x}}^{(u)}_\zeta) + \sum_{\nu=0}^{N^{(u)}-1} \tilde{\mathbf{f}}^{(u)}_{\nu n\alpha} \cdot \tilde{\nabla} \ell^{(u)}_\nu(\tilde{\mathbf{x}}^{(u)}_\zeta), \tag{7}$$

which can then be used to update the solution at the solution points $u^{(u)}_{\zeta n\alpha}$ via a suitable explicit time integration scheme. For further details on the FR approach and its implementation in PyFR see [9,12].
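To make the preceding steps concrete, the following NumPy sketch (a one-dimensional schematic of our own, not PyFR's implementation) evaluates Eq. (4) for a batch of elements as a single matrix product:

```python
import numpy as np

def lagrange_interp_matrix(xs, xf):
    """Operator L with L[s, z] = ell_z(xf[s]), where ell_z is the nodal
    Lagrange basis defined by the solution points xs."""
    L = np.ones((len(xf), len(xs)))
    for z, xz in enumerate(xs):
        for j, xj in enumerate(xs):
            if j != z:
                L[:, z] *= (xf - xj) / (xz - xj)
    return L

# p = 3 Gauss-Legendre solution points on [-1, 1]; the flux points of the
# reference line element are its two end points
xs = np.polynomial.legendre.leggauss(4)[0]
xf = np.array([-1.0, 1.0])

# Eq. (4) for a batch of 64 elements at once: a constant operator matrix
# applied to a state matrix of solution-point values
A = lagrange_interp_matrix(xs, xf)    # N_f x N_u operator
B = np.random.rand(4, 64)             # solution at the solution points
C = A @ B                             # discontinuous solution at flux points
```

Casting the method in this form is what allows PyFR to lean on optimised matrix multiplication providers, as discussed in Section 3.1.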
Fig. 1. (a) Example of an unstructured curved-element tetrahedral mesh around a sphere, (b) fourth-order 𝛼 -optimised flux points [11] for a tetrahedron, where
yellow and red indicate doubly and triply collocated points, respectively, and (c) fourth-order 𝛼 -optimised solution points [11] for a tetrahedron. (For interpretation
of the colours in the figure(s), the reader is referred to the web version of this article.)
3. New capabilities

3.1. Cross-platform performance

The cross-platform performance of PyFR is enabled through backends which can utilise various matrix multiplication kernels and a Mako-derived domain specific language (DSL) which achieves complete feature parity across all backends, as per Fig. 2.

Fig. 2. Overview of how PyFR achieves cross-platform performance. New functionality is marked in blue.

Backends PyFR v0.1.0 had a C/OpenMP backend for CPUs and a CUDA backend for NVIDIA GPUs. However, since 2013 additional vendors have entered the high-end GPU market including AMD, Intel, and even Apple. As such, PyFR v2.0.3 now contains an additional HIP backend for AMD GPUs, an OpenCL backend for all GPUs, and a Metal backend for Apple GPUs.

DSL The capabilities and performance of the DSL have been improved in PyFR v2.0.3. In particular, there is language-level support for performing reduction operations—for example min—across a data set. A kernel argument can now be annotated according to reduce(op) where op is a reduction operator. Whatever value the kernel assigns to this argument will then be automatically and safely reduced with its current value in memory using atomic operations. On the performance side, the DSL is now capable of detecting situations where read-only kernel arguments are likely to be subject to reuse. When running on GPU platforms, the DSL will automatically take care of loading these arguments into shared memory. This helps to reduce pressure on the L1 and L2 caches.

Matrix multiplications Many operations within an FR time-step can be cast in the form of

C ← AB + βC,

where A is a constant operator matrix, B is an input state matrix, and C is an output state matrix. PyFR v0.1.0 simply offloaded these operations to a platform-specific dense BLAS library such as cuBLAS or OpenBLAS. However, when operating on elements with a tensor-product structure, the operator matrices can exhibit a significant degree of sparsity. This can lead to suboptimal performance in cases where the arithmetic intensity of the operation is beyond that of the underlying hardware.

This issue has been addressed by incorporating additional matrix multiplication providers into PyFR v2.0.3. When running with the CUDA and HIP backends, PyFR v2.0.3 will use the GiMMiK [13] library to generate a suite of bespoke fully-unrolled kernels for each A. During the code generation process, GiMMiK automatically elides multiplications by zero, thus reducing the arithmetic intensity of the operation. The generated kernels are then competitively benchmarked against those provided by the dense BLAS library, with PyFR automatically selecting the fastest kernel for each operation. This auto-tuning is performed autonomously by PyFR at run-time and does not require any direction from the user. When running on CPUs, a similar result is accomplished through the use of libxsmm [14], which includes its own built-in support for automatically choosing between dense and sparse kernels.

Finally, in situations where the flux points are a strict subset of the solution points, additional logic has been incorporated into PyFR to avoid the need for multiplications entirely. This leads to further memory and memory bandwidth savings.
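The sketch below captures the flavour of this approach; it is our own illustration rather than GiMMiK's actual code generator. Given a known operator matrix, it emits a fully-unrolled C function in which every multiplication by an exact zero has been elided at generation time:

```python
import numpy as np

def unrolled_matvec_src(A, name='opmat'):
    """Emit C source for y = A @ x, eliding every multiplication by an
    exact zero of the operator at generation time."""
    m, n = A.shape
    lines = [f'void {name}(const double *x, double *y)', '{']
    for i in range(m):
        terms = [f'{float(A[i, j])!r}*x[{j}]'
                 for j in range(n) if A[i, j] != 0.0]
        lines.append(f'    y[{i}] = {" + ".join(terms) if terms else "0.0"};')
    lines.append('}')
    return '\n'.join(lines)

# Operators for tensor-product elements are block-structured and sparse
A = np.kron(np.eye(3), np.array([[0.5, 0.5], [-1.0, 1.0]]))
print(unrolled_matvec_src(A))   # 12 multiplies instead of 36
```

In PyFR proper this generation is performed per operator for the GPU backends, and the resulting kernel is only used if it wins the run-time benchmark against the vendor BLAS.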
Fig. 3. Data layout methodologies for packing a pair of field variables (denoted by red and blue, respectively) into the rows of a matrix.
Data layout The primary data structure in PyFR is an m by n row-major matrix where, up to padding, m is proportional to the number of solution/flux points and n is equal to the product of the number of elements and the number of field variables. It follows that there is a degree of freedom regarding how these field variables are packed along a row. This can be characterised by the stride Δj between two subsequent field variables. The choice of Δj = 1 results in an array of structures (AoS) arrangement, Δj = N_E where N_E is the number of elements results in a structure of arrays (SoA) arrangement, and Δj = k results in a hybrid array of structure of arrays (AoSoA) approach. An illustration of these arrangements can be seen in Fig. 3.

For simplicity, PyFR v0.1.0 used an SoA approach. However, although this structure is readily amenable to vectorisation, it has some limitations. Firstly, the large stride between field variables decreases the efficiency of caches since adjacent field variables are unlikely to reside in the same cache line. Secondly, it is not friendly to hardware pre-fetchers: an SoA structure with v field variables appears to a CPU's pre-fetcher like v separate arrays. Since CPUs are only capable of pre-fetching a finite number of data streams, this can lead to stalls. Finally, given a pointer to one field variable at one point, it is not possible to access the next field variable unless one also knows N_E. To avoid these issues, PyFR v2.0.3 employs the more sophisticated AoSoA packing. The value of k is chosen automatically by the backend based on the vector length of the underlying hardware.

Additionally, when running on CPUs, PyFR v2.0.3 incorporates an additional level of blocking. Rather than allocating a single row-major matrix with n columns, the C/OpenMP backend instead allocates q smaller matrices each with ∼n/q columns. The value of q is chosen to ensure that an entire block can easily remain resident in local caches and serves to further improve data locality.

Task graphs Strong scaling has also been improved by adding first-class support for task graphs. The idea is to exploit the fact that PyFR, as with many scientific codes, repeatedly calls the same sequence of kernels. By treating each kernel as a vertex in a graph and the dependencies between kernels as edges, it is possible—on an a priori basis—to form a task graph corresponding to a single right-hand side evaluation. This has two key advantages:

1. It presents the underlying runtime with extra opportunities for extracting parallelism by enabling it to safely identify kernels which can be run in parallel.
2. It enables a substantial reduction in interface overhead since, once constructed, task graphs can be launched with just a single function call as opposed to one function call per kernel.

An example of a task graph for a 2D Navier–Stokes simulation on a mixed-element grid can be seen in Fig. 4. Looking at the graph, we observe that there are four root nodes. As these root nodes are—by definition—independent, it is possible for all four of the kernels to be executed in parallel. Similarly, we observe that the three bcconu boundary condition kernels are also independent, such that these can also be executed in parallel. Indeed, careful inspection of the graph shows that there are always at least two kernels which may be executed at the same time. When exploited by a backend, this parallelism can improve GPU utilisation which, in turn, leads to improved strong scaling.

Fig. 4. A task graph generated by the NVIDIA cuGraphDebugDotPrint API for a 2D mixed-element Navier–Stokes test case. Blue shading indicates kernels for quadrilateral elements, red shading kernels for triangular elements, and green shading for interface kernels. The four root nodes of the graph are marked with red borders.

On the CUDA backend, PyFR task graphs map directly onto native CUDA graphs. Since the NVIDIA A100 generation, there is hardware support for task graph acceleration, enabling further reductions in overhead. While HIP does support task graphs, preliminary studies show their performance to be inferior to launching kernels directly. As such, task graphs on the HIP backend are emulated by submitting kernels to a stream in a serial fashion. A similar approach is used on Metal. On OpenCL, task graphs are emulated using out-of-order queues and events. This enables the runtime to identify and exploit inter-kernel parallelism but does not decrease API overhead.

To demonstrate the benefits of native task graphs, we consider the 2D Incompressible Cylinder Flow test case 2d-inc-cylinder available from the PyFR Test Case repository on GitHub. This small mixed-element case has a high API overhead, and without task graphs, the run-time on an NVIDIA V100 GPU is 256 s. However, with task graphs, this reduces to 122 s.
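The way a runtime extracts this parallelism can be illustrated with Python's standard graphlib; the dependency graph below is a toy stand-in for the graph of Fig. 4, and apart from bcconu the kernel names are purely illustrative:

```python
from graphlib import TopologicalSorter

# A toy right-hand side graph: each kernel maps to its dependencies
deps = {
    'disu_quad': [], 'disu_tri': [], 'bcconu_in': [], 'bcconu_out': [],
    'tdisf_quad': ['disu_quad'], 'tdisf_tri': ['disu_tri'],
    'comm_flux': ['disu_quad', 'disu_tri', 'bcconu_in', 'bcconu_out'],
    'negdivconf': ['comm_flux', 'tdisf_quad', 'tdisf_tri'],
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())     # all of these may run concurrently
    print('launch together:', ready)
    ts.done(*ready)
```

Here the four root kernels form the first concurrent level, mirroring the four root nodes visible in Fig. 4.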
Cache blocking A powerful means of reducing the memory bandwidth requirements of a code on conventional CPUs is cache blocking [15]. The idea is to improve data locality by changing the order in which kernels are called. An example of this can be seen in Fig. 5, which shows how a pair of array addition kernels can be rearranged to reduce bandwidth requirements. A key advantage of cache blocking compared with alternative approaches, such as kernel fusion, is that the kernels themselves do not require modification; all that changes is the arguments to the kernels.

Fig. 5. Example of how cache blocking can be applied to a pair of array addition kernels. In the blocked version when the second kernel updates a[i] it will hit in cache, thus saving a write to and read back from main memory.

Historically, cache blocking has not been viable for high-order codes due to the size of the intermediate arrays which are generated by kernels. For example, an Intel Ivy Bridge CPU core from 2013 only has 256 KiB of L2 cache which is shared between executable code and data. As a point of reference, for the Euler equations, storing the solution and flux for just eight ℘ = 4 hexahedra at double precision requires 160 kB. Since 2016, however, there has been a marked increase in the size of private caches, with Intel Golden Cove CPU cores having 2 MiB. The specifics involved in cache blocking FR are detailed in [15,16] and can improve performance by a factor of two. Within PyFR, cache blocking is accomplished by calling auxiliary methods on task graphs stating which kernels in the graph are suitable for blocking transformations. The interface also contains support for eliminating temporary arrays which can further improve performance.

Multi-node capabilities Distributed memory parallelism is accomplished via MPI using the mpi4py wrappers [17,18]. As the message format is standardised across all backends, it is possible for different ranks to employ different backends, thus enabling heterogeneous computing from a homogeneous codebase [19]. In order to improve scalability, the backend interface in PyFR v2.0.3 has been enhanced to allow backends to directly pass GPU device pointers to MPI routines. As such, PyFR v2.0.3 is fully capable of exploiting GPUDirect RDMA on NVIDIA platforms via CUDA-aware MPI, along with its analogue on AMD platforms via HIP-aware MPI. The impact of this technology depends on both the underlying hardware and the degree to which a simulation is strong scaled. In the most extreme cases, twofold performance improvements have been observed when running on clusters of NVIDIA A100 GPUs [20].
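As an illustrative sketch of this capability (not PyFR's internal halo-exchange code, and assuming an MPI build with CUDA support), device-resident CuPy buffers can be handed directly to mpi4py:

```python
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = rank ^ 1   # pair ranks (0, 1), (2, 3), ...

# Device-resident buffers; a CUDA-aware MPI receives the raw device
# pointers, avoiding any staging through host memory
sbuf = cp.full(1_000_000, rank, dtype=cp.float64)
rbuf = cp.empty_like(sbuf)

comm.Sendrecv(sbuf, dest=peer, recvbuf=rbuf, source=peer)
cp.cuda.runtime.deviceSynchronize()
```

Without GPU-aware support the same exchange would require an explicit device-to-host copy on each side, consuming PCIe bandwidth and adding latency.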
3.2. Numerical stability

PyFR is often used to conduct under-resolved DNS (uDNS), also referred to as ILES, of turbulent flow. On account of this under-resolution, the FR scheme is subject to aliasing-driven instabilities, which can cause the simulation to diverge [21]. Additionally, FR schemes exhibit instabilities when solutions contain discontinuities such as shocks. PyFR v0.1.0 had no specialised capabilities for handling either scenario, beyond simply increasing grid resolution. However, as of PyFR v2.0.3, there are now four separate stabilisation strategies available.

Modal filtering The simplest stabilisation technique in PyFR v2.0.3 is modal filtering, wherein high-order modes of the solution are periodically filtered as outlined in [11]. This approach is conservative and numerically inexpensive. However, it is an indiscriminate approach—with the filtering applied uniformly across the domain irrespective of whether it is required, and it exposes several free parameters including the filter strength, filter frequency and cut-off modes.

Anti-aliasing As noted in [21], the origin of aliasing-driven instabilities is the use of a collocation-type projection of the fluxes. PyFR v2.0.3 resolves this issue by using quadrature to perform a least-squares projection of the flux instead. To do this, PyFR employs a series of state-of-the-art quadrature rules generated using Polyquad [22], which out-perform those in the literature. Studies have shown the results with anti-aliasing to be markedly superior to those produced by modal filtering [23]. This does, however, come with an associated computational cost. For a three-dimensional simulation where the anti-aliasing degree is one higher than that of the solution, the run-time of the simulation is typically doubled.

Artificial viscosity Primarily intended for shock capturing, artificial viscosity is another stabilisation approach provided by PyFR v2.0.3, which dynamically adds extra viscosity into elements whose solutions are exhibiting Gibbs-type phenomena. Based around the widely adopted approach of [24], the method is functional but, as with modal filtering, requires a degree of parameterisation. Additionally, whilst the additional kernels are not particularly expensive—at least within the context of an advection-diffusion type problem such as the Navier–Stokes equations—the process of adding viscosity can have a negative impact on the maximum stable explicit time step. As such, the overall cost of the approach can be high.

Entropy filtering The final approach provided by PyFR v2.0.3 for stabilisation and shock capturing is entropy filtering [25–28]. This is based around selectively applying a modal filter to elements which violate positivity of density, positivity of pressure or a minimum entropy condition. The minimum entropy threshold is chosen as the minimum discrete entropy of the element and its Voronoi neighbours at the previous time step, i.e., the local domain of influence of the element. As such, the method does not explicitly require any problem-dependent parameterisation, as the positivity constraints and entropy thresholds are directly defined. In practice, a small fixed tolerance in the entropy constraint is allowed to handle finite precision effects, but this default tolerance is not varied between problems and does not affect the positivity-preserving properties of the shock capturing technique.

The utility of the approach is demonstrated in the 2D Double Mach Reflection test case 2d-double-mach-reflection and the 2D Viscous Shock Tube test case 2d-viscous-shock-tube available from the PyFR Test Case repository on GitHub.
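A schematic of the constraint check at the heart of this approach, for a perfect gas with specific entropy s = p/ρ^γ, is given below. This is a simplified NumPy sketch of ours, not the production implementation of [25–28]; in particular, to_nodal_rho_p is a hypothetical helper that transforms filtered modal coefficients into nodal density and pressure, and the crude doubling search stands in for the more careful root finding used in practice:

```python
import numpy as np

def satisfies_constraints(rho, p, s_min, gamma=1.4, s_tol=1e-4):
    """Positivity of density and pressure plus the minimum-entropy
    condition, checked at a set of points within one element."""
    s = p / rho**gamma
    return rho.min() > 0 and p.min() > 0 and s.min() >= s_min - s_tol

def entropy_filter(modes, to_nodal_rho_p, s_min):
    """Increase the strength zeta of an exponential modal filter until
    the filtered element satisfies the constraints."""
    orders = np.arange(len(modes)) / max(len(modes) - 1, 1)
    zeta = 0.0
    while zeta < 1e3:
        filt = np.exp(-zeta * orders**2)[:, None]
        rho, p = to_nodal_rho_p(filt * modes)
        if satisfies_constraints(rho, p, s_min):
            return zeta
        zeta = 2.0*zeta if zeta > 0 else 0.5   # crude search for illustration
    return zeta
```

Elements that already satisfy the constraints return zeta = 0 and are left untouched, which is what makes the approach selective.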
3.3. Mixed elements and domain decomposition

PyFR v0.1.0 only included support for three element types: quadrilaterals and triangles in two dimensions and hexahedra in three dimensions. Given the difficulties of all-hexahedral meshing around complex geometries, this represented a significant limitation. PyFR v2.0.3 addresses this limitation by adding in complete support for prisms and tetrahedra and partial support for pyramids. Specifically, the pyramid support requires that the quadrilateral base be affine. Given that a major application for pyramids is as a transition layer between a tetrahedral near-field and a hexahedral far-field, this restriction is relatively minor.

One practical complication which arises when running on mixed grids is domain decomposition. The relative performance of different element types is affected by around half a dozen simulation parameters including: the polynomial order, the location of solution and flux points and the use of anti-aliasing, to name but three. A consequence of this is that when partitioning a grid, it is not possible to employ a single set of element weighting factors. Employing incorrect weighting factors can lead to load imbalances which negatively impact strong scaling. Compared with v0.1.0, PyFR v2.0.3 contains two major improvements in this area.

Firstly, whereas v0.1.0 required grids to be partitioned by the mesh generation software, v2.0.3 includes built-in support for partitioning and re-partitioning both mesh and solution files. This is accomplished by having PyFR call out to the METIS [29] and SCOTCH [30] libraries. Using this functionality, it is relatively simple to experiment with different weightings and change them in concert with the simulation parameters; for example, when restarting a simulation at a higher polynomial order, this functionality can be used to appropriately re-weight the mesh. Moreover, to aid this process, PyFR also includes support for tracking MPI wait times. This information can be used to identify load imbalances between domains, which the user can then employ to derive more appropriate weights.

Secondly, there is also support for balanced partitioning wherein PyFR attempts to assign the same number of elements of each type to each domain. This ensures optimal load balancing irrespective of the relative performance differential between element types. However, as element types are not uniformly distributed throughout the domain—for example one might have a prismatic boundary layer and a tetrahedral wake—balanced partitioning can lead to partitions becoming non-contiguous when the number of partitions is large. Examples of weighted and balanced partitioning can be seen in Fig. 6 for the 2D Incompressible Cylinder Flow test case 2d-inc-cylinder available from the PyFR Test Case repository on GitHub.

Fig. 6. Partitionings of a mixed grid with a quadrilateral boundary layer and triangular far-field partitioned into eight parts. For the weighted strategy, each quadrilateral was assigned 3/2 the weight of a triangle.
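Deriving the weighting factors themselves is straightforward; one simple recipe (illustrative only, not a PyFR utility) converts measured per-element right-hand side costs into the small integer weights expected by graph partitioners such as METIS:

```python
from fractions import Fraction
from math import lcm

def integer_weights(cost_per_elem, max_den=16):
    """Turn relative per-element costs into small integer weights."""
    base = min(cost_per_elem.values())
    fracs = {k: Fraction(v / base).limit_denominator(max_den)
             for k, v in cost_per_elem.items()}
    mult = lcm(*(f.denominator for f in fracs.values()))
    return {k: int(f * mult) for k, f in fracs.items()}

# Measured right-hand side cost per element, normalised to triangles,
# reproduces the 3:2 quad-to-triangle weighting of Fig. 6
print(integer_weights({'quad': 1.5, 'tri': 1.0}))   # {'quad': 3, 'tri': 2}
```

The measured costs would themselves come from timing data such as the MPI wait-time tracking described above.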
3.4. Curved elements

To realise the benefits of high-order schemes, it is necessary to employ grids which, by finite volume standards, are relatively coarse. In order to still accurately represent the underlying geometry, it is therefore important for the elements themselves to be curved. This is accomplished by associating metric terms—which take the form of a 2 × 2 or 3 × 3 matrix—within each element.

PyFR v0.1.0 employed the so-called cross-product metric. However, with this approach, the polynomial order of the spatial metric is twice that of the shape function for curved grids—possibly exceeding the order of the solution basis. If this is the case, then the metric terms may not be discretised accurately due to truncation and aliasing errors. In particular, the divergence of the approximated metric terms may become non-zero, which results in a lack of free-stream preservation, i.e. the solver cannot maintain a uniform free-stream flow solution.

To overcome this issue, PyFR v2.0.3 instead employs the conservative metric [31,32], which preserves a uniform free-stream flow even when discretised. Specifically, this approach constructs the metric terms as the curl of a function, thus ensuring they are always divergence-free irrespective of any errors in the function approximation. The approach greatly increases the robustness of PyFR when running on curved grids. An example of its impact can be seen in Fig. 7.

Fig. 7. Simulation of free-stream flow using second-order solution polynomials on a cubically-curved tetrahedral mesh with cross-product metric as used in PyFR v0.1.0 (a), and conservative metric as used in PyFR v2.0.3 (b).

Furthermore, in many real-world grids, only elements in and around the boundary layer are actually curved. PyFR v2.0.3 takes advantage of this fact by identifying linear elements and, in lieu of computing metric terms for each solution point on an a priori basis, instead determines them on the fly based off the geometry of the element. This can lead to a substantial saving in memory bandwidth. For example, in a ℘ = 4 hexahedral element there are (℘ + 1)³ = 125 solution points and hence 3² × 125 = 1,125 metric terms. However, if the element is linear, then the metric terms are entirely determined by the corner vertices, which only involve 3 × 8 = 24 terms.
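To make the saving concrete, the following sketch (our own notation and vertex ordering) evaluates the Jacobian of a trilinear hexahedron at an arbitrary reference point directly from its eight corner vertices:

```python
import numpy as np

def trilinear_jacobian(verts, xi, eta, zeta):
    """Jacobian dx/dxi of a trilinear hexahedron at a reference point
    (xi, eta, zeta) in [-1, 1]^3, from its 8 corner vertices alone."""
    signs = np.array([[-1, -1, -1], [1, -1, -1], [1, 1, -1], [-1, 1, -1],
                      [-1, -1, 1], [1, -1, 1], [1, 1, 1], [-1, 1, 1]])
    J = np.zeros((3, 3))
    for v, s in zip(verts, signs):
        # Gradient of the shape function (1 + s_0*xi)(1 + s_1*eta)(1 + s_2*zeta)/8
        dn = 0.125 * np.array([s[0]*(1 + s[1]*eta)*(1 + s[2]*zeta),
                               s[1]*(1 + s[0]*xi)*(1 + s[2]*zeta),
                               s[2]*(1 + s[0]*xi)*(1 + s[1]*eta)])
        J += np.outer(v, dn)
    return J

# A 2 x 1 x 1 box: det J = physical volume / reference volume = 2 / 8
verts = np.array([[0, 0, 0], [2, 0, 0], [2, 1, 0], [0, 1, 0],
                  [0, 0, 1], [2, 0, 1], [2, 1, 1], [0, 1, 1]], dtype=float)
print(np.linalg.det(trilinear_jacobian(verts, 0.0, 0.0, 0.0)))   # 0.25
```

Since only the 24 vertex coordinates are read, recomputing such terms on the fly is far cheaper in bandwidth terms than streaming 1,125 pre-computed values per element.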
3.5. Adaptive time stepping

When using explicit time stepping, the run-time of a simulation is directly proportional to the time step size. However, for non-linear problems, it can be challenging to accurately estimate the maximum stable step size. PyFR v2.0.3 avoids these issues by including support for low-storage Runge–Kutta methods with embedded pairs. These make it possible to inexpensively obtain an estimate for the numerical error incurred when taking a time step [33]. Once suitably normalised, this error is then used to decide if a time step should be accepted or rejected. Moreover, it is also used to adapt the step size; increasing it for accepted steps and decreasing it for rejected steps.
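A minimal sketch of the accept/reject logic an embedded pair enables is given below; the controller constants are illustrative, with the standard theory given in [33]:

```python
def next_step(dt, err, order, fac=0.9, fac_min=0.3, fac_max=2.5):
    """Embedded-pair step-size controller in the style of [33]: err is
    the normalised error estimate and a step is accepted iff err <= 1."""
    accept = err <= 1.0
    scale = fac * err**(-1.0/(order + 1)) if err > 0.0 else fac_max
    return accept, dt * min(fac_max, max(fac_min, scale))

accept, dt = next_step(dt=1e-3, err=2.4, order=4)   # rejected; dt shrinks
```

Rejected steps are simply retried with the reduced step size, while accepted steps allow the step size to grow, subject to the clamping factors.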
3.6. Incompressible Euler and Navier–Stokes

PyFR v0.1.0 included a compressible Euler and Navier–Stokes solver. In PyFR v2.0.3, support has also been added for the incompressible Navier–Stokes equations. This is accomplished via a combination of the artificial compressibility method of [34] with the dual-time approach of [35]. The result is an iterative scheme which builds extensively upon the fast residual evaluation capability of PyFR. Convergence is accelerated through a combination of polynomial multigrid [36] and variable local time stepping [37].

3.7. File formats

The mesh and solution file formats for PyFR v0.1.0 were based around NumPy .npy and .npz files. Although simple to read and write from Python, they are difficult to access from other environments. Moreover, the formats themselves had no provisions for parallel I/O. For these reasons, PyFR v2.0.3 employs a new set of file formats based around the industry-standard HDF5 [38,39]. A key advantage of the HDF5 format is its hierarchical nature and the ability to attach arbitrary attributes to most data sets. Moreover, data arrays stored by PyFR use 64-bit rather than 32-bit integers. This enables a partition to have in excess of four billion elements and serves to further future-proof the format for at least the next decade.

To aid in reproducibility, PyFR v2.0.3 solution files embed all of the configuration files that have been employed in the simulation up until the current time. This makes it possible to account for the common situation wherein a simulation is started with one configuration file, run for a period of time, and then restarted with a different configuration.

Output format support has also been enhanced since PyFR v0.1.0. Specifically, PyFR v2.0.3 now supports exporting high-order VTU files. This enables the high-order nature of the solution to be preserved throughout more of the post-processing pipeline. Furthermore, there is also support for generating parallel VTU files which can be more efficient in multiprocessing environments.

3.8. Plugin architecture

The PyFR plugin infrastructure provides a lightweight means of adding new capabilities to the code base. Written in pure Python, plugins are capable of adding new command line arguments, periodically post-processing the solution, and adding source terms to the solver. Since they are written in pure Python, plugins are an accessible means for users to customise PyFR. In their most direct form plugins are given read-only access to the high-order solution along with—optionally—spatial and temporal gradients thereof. We note, however, that the plugin architecture makes no attempt to abstract away the mixed-element nature of grids or the fact the solution is distributed across multiple ranks. While this does represent an increase in the degree of sophistication required to author a plugin, it is necessary in order to ensure that plugins are able to scale to leadership-class simulations with billions of elements.

Examples of a selection of plugins provided with PyFR v2.0.3 are detailed below.

Point sampling The soln-plugin-sampler is capable of periodically sampling a set of points in the domain. At start-up, the plugin automatically determines which element each sample point is inside, and then performs a series of Newton iterations to invert the physical-to-reference space mapping.

Time averaging The soln-plugin-tavg computes the time-average of one or more arbitrary functions. The functions, which are specified in the configuration file, can be parameterised by both the primitive variables and gradients thereof. Beyond computing time-averages, the plugin also computes variances, which can be used for the purposes of uncertainty quantification. Command line support is also included for merging together multiple time-average files. This is particularly useful in environments where there is a limit on job run time.

Force calculation The soln-plugin-fluidforce can be used to compute the net force on boundaries which, in turn, can be used to obtain aerodynamic quantities such as lift and drag. The plugin breaks out separately the pressure and viscous components of the recorded forces.
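These plugins are enabled and parameterised through PyFR's INI-style configuration file. The fragment below is illustrative only; the option names are indicative of the format rather than an authoritative reference, for which see the PyFR documentation:

```ini
[soln-plugin-sampler]
nsteps = 10
samp-pts = [(1.0, 0.5), (2.0, 0.5)]
format = primitive
file = point-data.csv

[soln-plugin-tavg]
nsteps = 10
dt-out = 1.0
basedir = .
basename = tavg-{t:.2f}
avg-p = p
avg-rhou = rho*u
```

Note how the time-averaging plugin accepts arbitrary expressions of the primitive variables, per the description above.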
Turbulence generation The solver-plugin-turbulence implements the synthetic eddy method of [40,41], which allows turbulence to be injected into any portion of the domain. Specifically, isotropic eddies are injected via a source term formulation, where the turbulence intensity and length scale of the eddies can be specified. The implementation is designed to scale efficiently and minimise memory bandwidth requirements. Specifically, by pre-computing and caching element intersections of every injected eddy before the simulation starts, the cost of the implementation is able to scale as the number of eddies intersecting an element at a given time—which for sensible grid resolutions will remain small—as opposed to scaling with the overall number of injected eddies, which can be substantial for large domains and small turbulent length scales. Additionally, the implementation also innovates by passing single unsigned 32-bit integer seeds to define multiple characteristics of a given eddy. These are then unrolled by a device-side implementation of a PCG random number generator [42] to produce the actual random characteristics of the eddy. This saves memory bandwidth cf. pre-computing random eddy characteristics a priori and passing them as an array of (potentially 64-bit) floats.

In-situ visualisation The soln-plugin-ascent provides in-situ visualisation capabilities and is powered by the lightweight Ascent library [43]. Using the plugin, it is possible to produce complex renderings of the current simulation state without having to write any intermediate files to disk. This enables efficient visualisation of large-scale simulations, for which writing solutions to disk for after-the-fact post-processing is unfeasible.

4. Developer and user community

Over the past decade, an international community of developers and users has grown around PyFR, drawn from across academia and industry. Key to this growth has been the open-source nature of PyFR, which removes many international and inter-institutional barriers to collaborative code development practices. Development has also been supported at a technical level by hosting the code base in a Git repository on GitHub, which provides a wide range of tooling and helps define best-practice collaborative processes. The PyFR GitHub repository has currently been forked over 200 times. Also of importance has been maintaining comprehensive and up-to-date documentation using Sphinx, which is auto-deployed to Read the Docs on each release, as well as providing developer and user support via a forum hosted on Discourse. Finally, in 2020, we launched a virtual PyFR seminar series on Cassyni, which comprises invited talks and discussions on a range of topics related to the theory of high-order FR schemes, their implementation in PyFR and their application to industrially relevant flow problems. The PyFR seminar series currently has over 500 subscribers.

Our user base has successfully applied PyFR to a wide range of fundamental, applied, and industrial flow problems, including: DNS of flow over low pressure turbine (LPT) cascades [44,45] with MTU Aero Engines (see Fig. 8), which demonstrated how high-order accurate GPU-accelerated DNS could enable reliable virtual testing of new LPT designs; ILES of flow over high-rise buildings [46] with Arup (see Fig. 9), capturing for the first time experimentally-observed low-pressure surface suction peaks which have implications for the loading and design of cladding; DNS of flow over Martian rotorcraft aerofoils [47,48] with NASA (see Fig. 10); flow over supersonic re-entry capsules [49] led by NASA; flow over projectiles [50] led by the Agency for Defense Development in South Korea; flow over wind turbines [51,52]; and flow in thermoacoustic engines [53,54]; as well as studies of airfoil noise reduction [55,56], flow control [57,58], wall roughness [59,60], the Coanda effect [61], surrogate model development [62,63] and fundamental aspects of channel flow [58,64], identifying for the first time eigenmodes of averaged small-amplitude perturbations to a turbulent base flow. Finally, PyFR has recently been used to enable ILES-based optimisation of turbine cascades [65] and, for the first time, DNS-based optimisation of Martian rotorcraft aerofoils [66].

Fig. 8. Instantaneous snapshot of a Q-criterion iso-surface coloured by velocity magnitude above the suction side of an MTU-T161 low pressure turbine blade obtained using the compressible Navier–Stokes solver in PyFR. Image is from Fig. 12 of Iyer et al. [45]. Copyright Iyer et al. Reused with permission.

Fig. 9. Instantaneous snapshot of a Q-criterion iso-surface coloured by velocity magnitude around a model high-rise building obtained using the incompressible Navier–Stokes solver in PyFR. Image is from Fig. 9 of Giangaspero et al. [46]. Copyright Giangaspero et al. Reused with permission.

Fig. 10. Instantaneous snapshot of a Q-criterion iso-surface coloured by velocity magnitude around a triangular aerofoil for a Martian helicopter obtained using the compressible Navier–Stokes solver in PyFR. Image is from Fig. 10 of Caros et al. [67]. Copyright Caros et al. Reused with permission.

5. Accuracy, performance, and scaling

5.1. Accuracy

To demonstrate the accuracy of PyFR v2.0.3 for high-speed flows, we consider a supersonic Taylor–Green vortex test case at a Mach number of 1.25 and a Reynolds number of 1,600, which was studied in Lusher and Sandham [68] and used to benchmark the accuracy and shock-resolving capabilities of several solvers in Chapelier et al. [69]. Specifically, we solve the compressible Navier–Stokes equations in a domain −π ≤ x, y, z ≤ π, subject to the following initial conditions,

$$u(t = 0, \mathbf{x}) = \sin(x)\cos(y)\cos(z), \tag{8a}$$

$$v(t = 0, \mathbf{x}) = -\cos(x)\sin(y)\cos(z), \tag{8b}$$

$$w(t = 0, \mathbf{x}) = 0, \tag{8c}$$

$$p(t = 0, \mathbf{x}) = p_0 + \frac{1}{16}\left[\cos(2x) + \cos(2y)\right]\left[2 + \cos(2z)\right], \tag{8d}$$

$$\rho(t = 0, \mathbf{x}) = p(t = 0, \mathbf{x})/p_0, \tag{8e}$$

where p_0 and the reference dynamic viscosity were selected to achieve the desired Mach and Reynolds numbers based on length, velocity and density scales of unity, and a dynamic viscosity computed using Sutherland's law with a reference temperature of 273 K [70]. For comparison with the results in Chapelier et al. [69], the simulations were performed using computational meshes consisting of 16³, 32³, 64³ and 128³ hexahedral elements, with third-order polynomials approximating the solution within each element. These correspond to 64³, 128³, 256³ and 512³ degrees of freedom (DoFs), respectively. Gauss–Legendre–Lobatto flux and solution points were used, and entropy filtering was employed as a shock capturing approach.
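For reference, the initial condition is straightforward to transcribe; the sketch below assumes the standard Taylor–Green velocity field given in Eqs. (8a)–(8c):

```python
import numpy as np

def tgv_initial_condition(x, y, z, p0):
    """Primitive variables of Eq. (8) at t = 0."""
    u = np.sin(x)*np.cos(y)*np.cos(z)
    v = -np.cos(x)*np.sin(y)*np.cos(z)
    w = np.zeros_like(x)
    p = p0 + (np.cos(2*x) + np.cos(2*y))*(2 + np.cos(2*z))/16
    return p/p0, u, v, w, p    # rho, u, v, w, p
```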
Fig. 11 shows solenoidal dissipation (enstrophy), defined as

$$\varepsilon_s = \frac{\mu}{(2\pi)^3} \int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi} \boldsymbol{\omega} \cdot \boldsymbol{\omega}\, \mathrm{d}x\,\mathrm{d}y\,\mathrm{d}z, \tag{9}$$

as a function of time, as well as a Schlieren-type representation of the density gradient norm at t = 6 for a case computed with N = 128³ hexahedral elements (512³ DoFs). The predicted enstrophy profiles are found to be in excellent agreement with the reference data of Chapelier et al. [69], computed using a highly resolved (2048³ DoFs) high-order finite difference targeted ENO (TENO) scheme, indicating that PyFR can accurately resolve small-scale turbulent flow structures in supersonic flows. This test case allows for further comparison against the results of various solvers presented in Chapelier et al. [69]. In particular, we compare to similar discontinuous finite element-type schemes (e.g., Discontinuous Galerkin, spectral difference, etc.) with identical mesh resolution and approximation order, the details of which are summarized in Table 1, as well as the high-order finite difference TENO scheme which was used to compute the reference results.

Fig. 12 shows dilatational dissipation, defined as

$$\varepsilon_d = \frac{4\mu}{3(2\pi)^3} \int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi} (\boldsymbol{\nabla} \cdot \mathbf{u})^2\, \mathrm{d}x\,\mathrm{d}y\,\mathrm{d}z, \tag{10}$$
Fig. 11. Plot of enstrophy as a function of time (left) and a Schlieren-type representation of the density gradient norm at t = 6 (right) for the supersonic Taylor–Green vortex case computed with N = 128³ hexahedral elements (512³ DoFs), along with the reference data of Chapelier et al. [69].
Table 1
Summary of the solvers and numerical methods used for comparison.
as a function of time for cases computed using 64³, 128³, 256³ and 512³ DoFs in comparison to the results of the solvers in Chapelier et al. [69]. The simulations from PyFR generally show less shock dissipation, resulting in dilatational dissipation profiles which are closer to the reference data for a given resolution, indicating that the entropy filtering shock capturing approach does not introduce excessive numerical dissipation and can sharply resolve shock profiles.

5.2. Performance and scaling

PyFR has previously been used to undertake petascale simulations on a range of the world's largest GPU supercomputers, including Piz Daint at CSCS and Titan at ORNL. Overall performance and strong and weak scaling have been demonstrated previously in this context, and indeed simulations undertaken with PyFR were shortlisted for the Gordon Bell Prize in 2016 [75], achieving 13.7 DP-PFLOP/s (58% of theoretical peak) using 18,000 NVIDIA Tesla K20X GPUs on Titan.

To demonstrate the performance and scaling characteristics of PyFR v2.0.3, we consider a subsonic version of the Taylor–Green vortex test case described above, run with double precision arithmetic on Frontier at ORNL using AMD Instinct MI250X accelerators, and on Alps at CSCS using NVIDIA GH200 GPUs. For this case, the reference pressure was modified to achieve a Mach number of 0.08, and the dynamic viscosity was set to be constant. The computational mesh consisted of 13,891,500 tetrahedral elements, with seventh-order solution polynomials used to represent the solution within each element, and α-optimised [11] flux and solution points were employed. Fig. 13 plots enstrophy as a function of time for a case run on 512 AMD Instinct MI250X accelerators of Frontier (each with two GCDs). Results are found to be in excellent agreement with the reference data of van Rees et al. [76]. Performance is stated in terms of giga-degrees of freedom per second (GDoF/s), which takes the total number of DoFs in the simulation and divides it by the time required for a right-hand side evaluation.

Table 2 and Table 3 present strong scaling of the test case on Frontier and Alps, respectively, where we note that on Frontier PyFR was run without HIP-aware MPI, whereas on Alps, PyFR was run with CUDA-aware MPI. In terms of scaling, we observe that at 2048 ranks, both platforms deliver similar scalability numbers. However, on account of the superior baseline performance, the NVIDIA system is transferring ∼3.7 times more data over the interconnect. Given both systems make use of the Cray Slingshot interconnect, this suggests that the network is not the limiting factor on Frontier. Rather, it is more likely related to our inability to run with HIP-aware MPI on Frontier, and the inability of PyFR to employ native HIP graphs due to unresolved issues in the HIP runtime.

In terms of absolute performance, we note that a single NVIDIA GH200 is ∼3.7 times faster than one GCD of an AMD MI250X. A substantial portion of this can be explained by the differences in peak memory bandwidth. A GH200 uses HBM3 memory with a peak bandwidth of 4 TiB/s, whereas the MI250X uses HBM2e memory with a peak bandwidth per GCD of 1.6 TiB/s. This gives a ratio of 2.5. Moreover, micro-benchmarks on the MI250X indicate that peak bandwidth is only reliably achieved for kernels with a 1:1 read-to-write ratio. Outside of this regime, bandwidths closer to ∼1.2 TiB/s are more commonly observed. Such a discrepancy is not observed on NVIDIA hardware, however. Accounting for this gives us a revised performance ratio based on memory bandwidth of 4/1.2 ≈ 3.3, which is similar to what is actually observed. The remaining performance differences are likely due to the superior caching setup of the NVIDIA GPU, which has—in the absence of shared memory allocations—some 208 KiB available per SM, whereas AMD only provides 16 KiB per CU. Similarly, whereas NVIDIA provides 50 MiB of shared L2 cache, AMD only provides 8 MiB. These caches are important for the interface kernels, which have an irregular memory access pattern.
Fig. 12. Dilatational dissipation as a function of time for the supersonic Taylor–Green vortex case computed with varying mesh resolution in comparison to the results of the solvers in Chapelier et al. [69].

Table 2
Strong scalability of PyFR on Frontier for the Taylor–Green vortex test case using the compressible Navier–Stokes solver on a mesh with 12 × 105³ = 13,891,500 tetrahedral elements and seventh-order solution polynomials used to represent the solution within each element. The speedup is relative to 8 AMD MI250X accelerators, each with two GCDs. HIP-aware MPI was not employed.

Finally, we can make a comparison between absolute performance almost a decade ago, using PyFR v0.2.2 on an NVIDIA K40c GPU [19], and current absolute performance. Specifically, data from Table 6 of [19] for a tetrahedrally-dominated mesh with fourth-order solution polynomials in each element gives an absolute performance of 0.122 GDoF/s per K40c GPU, whereas the 16 rank case from Table 3 here gives an absolute performance of 6.004 GDoF/s per GH200 GPU. This leads to an absolute performance improvement ratio of 49.2, accounting for the totality of both hardware and software improvements over the period, and where we note the ratio is conservative since the use of seventh vs. fourth order solution polynomials necessitates substantially more FLOPs/DoF. This conservative estimate of an almost 50× performance increase over the last decade constitutes a substantial step towards the industrial adoption of scale-resolving simulations.
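The headline ratios above follow directly from the quoted figures:

```python
perf_k40c, perf_gh200 = 0.122, 6.004       # GDoF/s per GPU
print(round(perf_gh200 / perf_k40c, 1))    # 49.2x over the decade

bw_gh200, bw_mi250x = 4.0, 1.2             # TiB/s achievable (per GCD)
print(round(bw_gh200 / bw_mi250x, 1))      # 3.3, vs ~3.7 observed
```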
6. Conclusions and outlook

Since the initial release of PyFR v0.1.0 in 2013 [9], a range of new capabilities have been added to the framework, with a view to enabling industrial adoption. In this work, we have provided details of these enhancements as released in PyFR v2.0.3, including improvements to cross-platform performance (new backends, extensions of the DSL, new matrix multiplication providers, improvements to the data layout, use of task graphs) and improvements to numerical stability (modal filtering, anti-aliasing, artificial viscosity, entropy filtering), as well as the addition of prismatic, tetrahedral and pyramid shaped elements, improved domain decomposition support for mixed element grids, improved handling of curved element meshes, the addition of an adaptive time-stepping capability, the addition of incompressible Euler and Navier–Stokes solvers, improvements to file formats and the development of a plugin architecture. We have also explained efforts to grow an engaged developer and user community and provided a range of examples that show how our user base is applying PyFR to solve a wide range of fundamental, applied and industrial flow problems. Finally, we have demonstrated the accuracy of PyFR v2.0.3 for a supersonic Taylor–Green vortex case, with shocks and turbulence, and provided the latest performance and scaling results on up to 1024 AMD Instinct MI250X accelerators of Frontier at ORNL (each with two GCDs) and up to 2048 NVIDIA GH200 GPUs of Alps at CSCS. We note that the absolute performance of PyFR, accounting for the totality of both hardware and software improvements, has, conservatively, increased by almost 50× over the last decade.
Table 3
Strong scalability of PyFR on Alps for the Taylor–Green vortex test case using the compressible Navier–Stokes solver on a mesh with 12 × 105³ = 13,891,500 tetrahedral elements and seventh-order solution polynomials used to represent the solution within each element. The speedup is relative to 16 GH200 GPUs. CUDA-aware MPI was employed.
CRediT authorship contribution statement

Writing – review & editing. Tarik Dzanic: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Giorgio Giangaspero: Writing – review & editing. Arvind S. Iyer: Writing – review & editing. Antony Jameson: Writing – review & editing, Supervision, Funding acquisition. Marius Koch: Writing – review & editing. Niki Loppi: Writing – review & editing. Sambit Mishra: Writing – review & editing. Rishit Modi: Writing – review & editing. Gonzalo Sáez-Mischlich: Writing – review & editing. Jin Seok Park: Writing – review & editing. Brian C. Vermeire: Writing – review & editing. Lai Wang: Writing – review & editing.
Data availability

Data will be made available on request.

References

[1] H.T. Huynh, A flux reconstruction approach to high-order schemes including discontinuous Galerkin methods, in: 18th AIAA Computational Fluid Dynamics Conference, American Institute of Aeronautics and Astronautics, June 2007.
[2] C.D. Cantwell, D. Moxey, A. Comerford, A. Bolis, G. Rocco, G. Mengaldo, D. De Grazia, S. Yakovlev, J.-E. Lombard, D. Ekelschot, et al., Nektar++: an open-source spectral/hp element framework, Comput. Phys. Commun. 192 (2015) 205–219.
[3] D. Moxey, C.D. Cantwell, Y. Bao, A. Cassinelli, G. Castiglioni, S. Chun, E. Juda, E. Kazemi, K. Lackhove, J. Marcon, et al., Nektar++: enhancing the capability and application of high-fidelity spectral/hp element methods, Comput. Phys. Commun. 249 (2020) 107110.
[4] Z.J. Wang, Y. Li, F. Jia, G.M. Laskowski, J. Kopriva, U. Paliath, R. Bhaskaran, Towards industrial large eddy simulation using the FR/CPR method, Comput. Fluids 156 (2017) 579–589.
[5] R.D. Sandberg, Development of a new compressible Navier-Stokes solver for numerical simulations of flows in turbomachinery, in: Progress Report for HPC Europa++ Transnational Access Project 1264, 2008.
[6] Richard D. Sandberg, Vittorio Michelassi, Richard Pichler, Liwei Chen, Roderick Johnstone, Compressible direct numerical simulation of low-pressure turbines—part i: methodology, J. Turbomach. 137 (5) (2015) 051011, https://doi.org/10.1115/1.4028731.
[7] G.A. Bres, S.T. Bose, M. Emory, F.E. Ham, O.T. Schmidt, G. Rigas, T. Colonius, Large eddy simulations of co-annular turbulent jet using a voronoi-based mesh generation framework, in: 2018 AIAA/CEAS Aeroacoustics Conference, 2018, p. 3302.
[8] K.A. Goc, O. Lehmkuhl, G.I. Park, S.T. Bose, P. Moin, Large eddy simulation of aircraft at affordable cost: a milestone in computational fluid dynamics, Flow 1 (2021) E14.
[9] F.D. Witherden, A.M. Farrington, P.E. Vincent, PyFR: an open source framework for solving advection–diffusion type problems on streaming architectures using the flux reconstruction approach, Comput. Phys. Commun. 185 (11) (2014) 3028–3040.
[10] G. Karniadakis, S.J. Sherwin, Spectral/hp Element Methods for Computational Fluid Dynamics, Oxford University Press, USA, 2005.
[11] J.S. Hesthaven, T. Warburton, Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications, Springer Science & Business Media, 2007.
[12] Freddie David Witherden, On the development and implementation of high-order flux reconstruction schemes for computational fluid dynamics, PhD thesis, Imperial College London, 2015.
[13] B.D. Wozniak, F.D. Witherden, F.P. Russell, P.E. Vincent, P.H.J. Kelly, GiMMiK—generating bespoke matrix multiplication kernels for accelerators: application to high-order computational fluid dynamics, Comput. Phys. Commun. 202 (2016) 12–22.
[14] A. Heinecke, G. Henry, M. Hutchinson, H. Pabst, LIBXSMM: accelerating small matrix multiplications by runtime code generation, in: SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2016, pp. 981–991.
[15] S. Akkurt, F.D. Witherden, P.E. Vincent, Cache blocking strategies applied to flux reconstruction, Comput. Phys. Commun. 271 (2022) 108193.
[16] S. Akkurt, Cache blocking strategies applied to flux reconstruction, PyFR Semin. Ser. (2021), https://doi.org/10.52843/gpnpwx.
[17] L. Dalcín, R. Paz, M. Storti, MPI for Python, J. Parallel Distrib. Comput. 65 (9) (2005) 1108–1115.
[18] L. Dalcín, R. Paz, M. Storti, J. D'Elía, MPI for Python: performance improvements and MPI-2 extensions, J. Parallel Distrib. Comput. 68 (5) (2008) 655–662.
[19] F.D. Witherden, B.C. Vermeire, P.E. Vincent, Heterogeneous computing on mixed unstructured grids with PyFR, Comput. Fluids 120 (2015) 173–186.
[20] S. Mishra, F.D. Witherden, D. Chakravorty, L. Perez, F. Dang, Scaling study of flow simulations on composable cyberinfrastructure, in: Practice and Experience in Advanced Research Computing, 2023, pp. 221–225.
[21] A. Jameson, P.E. Vincent, P. Castonguay, On the non-linear stability of flux reconstruction schemes, J. Sci. Comput. 50 (2012) 434–445.
[22] F.D. Witherden, P.E. Vincent, On the identification of symmetric quadrature rules for finite element methods, Comput. Math. Appl. 69 (10) (2015) 1232–1241.
[23] J.S. Park, F.D. Witherden, P.E. Vincent, High-order implicit large-eddy simulations of flow over a NACA0021 aerofoil, AIAA J. 55 (7) (2017) 2186–2197.
[29] G. Karypis, V. Kumar, METIS: a Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, 1997.
[30] Cédric Chevalier, François Pellegrini, PT-Scotch: a tool for efficient parallel graph ordering, Parallel Comput. 34 (6–8) (2008) 318–331.
[31] D.A. Kopriva, Metric identities and the discontinuous spectral element method on curvilinear meshes, J. Sci. Comput. 26 (2006) 301–327.
[32] Y. Abe, T. Haga, T. Nonomura, K. Fujii, On the freestream preservation of high-order conservative flux-reconstruction schemes, J. Comput. Phys. 281 (2015) 28–54.
[33] E. Hairer, S.P. Nørsett, G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems, 2nd edition, Springer, Berlin, 1993.
[34] A.J. Chorin, A numerical method for solving incompressible viscous flow problems, J. Comput. Phys. 135 (2) (1997) 118–125.
[35] A. Jameson, Time dependent calculations using multigrid, with applications to unsteady flows past airfoils and wings, in: 10th Computational Fluid Dynamics Conference, 1991, p. 1596.
[36] N.A. Loppi, F.D. Witherden, A. Jameson, P.E. Vincent, A high-order cross-platform incompressible Navier–Stokes solver via artificial compressibility with application to a turbulent jet, Comput. Phys. Commun. 233 (2018) 193–205.
[37] N.A. Loppi, F.D. Witherden, A. Jameson, P.E. Vincent, Locally adaptive pseudo-time stepping for high-order flux reconstruction, J. Comput. Phys. 399 (2019) 108913.
[38] M. Folk, G. Heber, Q. Koziol, E. Pourmal, D. Robinson, An overview of the HDF5 technology suite and its applications, in: Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, 2011, pp. 36–47.
[39] A. Collette, Python and HDF5: Unlocking Scientific Data, O'Reilly Media, Inc., 2013.
[40] G. Giangaspero, F.D. Witherden, P.E. Vincent, Synthetic turbulence generation for high-order scale-resolving simulations on unstructured grids, AIAA J. 60 (2) (2022) 1032–1051.
[41] G. Giangaspero, Synthetic turbulence generation in PyFR, PyFR Semin. Ser. (2021), https://doi.org/10.52843/cassyni.249z01.
[42] Melissa E. O'Neill, PCG: a family of simple fast space-efficient statistically good algorithms for random number generation, https://api.semanticscholar.org/CorpusID:3489282, 2014.
[43] M. Larsen, E. Brugger, H. Childs, C. Harrison, Ascent: a flyweight in situ library for exascale simulations, in: In Situ Visualization for Computational Science, Springer, 2022, pp. 255–279.
[44] N.F. Afshar, High-order implicit large eddy simulation of flow over a low Reynolds turbine cascade, PyFR Semin. Ser. (2021), https://doi.org/10.52843/cassyni.pw9418.
[45] A.S. Iyer, Y. Abe, B.C. Vermeire, P. Bechlars, R.D. Baier, A. Jameson, F.D. Witherden, P.E. Vincent, High-order accurate direct numerical simulation of flow over a MTU T161 low pressure turbine blade, Comput. Fluids 226 (2021) 104989.
[46] Giorgio Giangaspero, Luca Amerio, Steven Downie, Alberto Zasso, Peter Vincent, High-order scale-resolving simulations of extreme wind loads on a model high-rise building, J. Wind Eng. Ind. Aerodyn. 230 (2022) 105169.
[47] L.C. Roca, DNS-based optimisation of airfoils for Martian helicopters using PyFR, PyFR Semin. Ser. (2023), https://doi.org/10.52843/cassyni.6r5ry1.
[48] L.C. Roca, Martian aerodynamics with PyFR, PyFR Semin. Ser. (2021), https://doi.org/10.52843/47ly7q.
[49] Rathakrishnan Bhaskaran, Eric C. Stern, Scale resolving simulations of Viking '75 reentry capsule wake flow, AIAA, 2024, https://arc.aiaa.org/doi/abs/10.2514/6.2024-4127.
[50] J.S. Park, High-order implicit large-eddy simulations of flow around a projectile using PyFR, PyFR Semin. Ser. (2021), https://doi.org/10.52843/cassyni.536kkl.
[51] T. Liang, Actuator line model for wind turbine wake prediction with PyFR, PyFR Semin. Ser. (2024), https://doi.org/10.52843/cassyni.p3fyks.
[52] Tianyang Liang, Changhong Hu, Numerical simulation of wind turbine wake characteristics by flux reconstruction method, Renew. Energy (2024) 121092, https://doi.org/10.1016/j.renene.2024.121092.
[53] N. Blanc, Simulating a thermoacoustic engine with PyFR, PyFR Semin. Ser. (2023), https://doi.org/10.52843/cassyni.dlsjz8.
[54] Nathan Blanc, Michael Laufer, Steven Frankel, Guy Z. Ramon, High-fidelity numerical simulations of a standing-wave thermoacoustic engine, Appl. Energy 360 (2024) 122817.
[55] Z. Wan, Noise reduction of serrated trailing edges with implicit large eddy simulation using PyFR, PyFR Semin. Ser. (2022), https://doi.org/10.52843/cassyni.br5jn1.
[56] Z. Yuan, Numerical simulations of aerofoil tonal noise reduction by roughness elements, PyFR Semin. Ser. (2023), https://doi.org/10.52843/cassyni.j59tff.
[57] M. Laufer, Implicit LES of NACA 0018 airfoil with active flow control, PyFR Semin. Ser. (2021), https://doi.org/10.52843/cassyni.nd09lk.
[24] P.-O. Persson, J. Peraire, Sub-cell shock capturing for discontinuous Galerkin meth [58] H. Foysi, Active control of compressible supersonic wall-bounded flow using direct
ods, in: 44th AIAA Aerospace Sciences Meeting and Exhibit, 2006, p. 112. numerical simulations with spanwise velocity modulation at the walls using PyFR,
[25] T. Dzanic, F.D. Witherden, Positivity-preserving entropy-based adaptive filtering for PyFR Semin. Ser. (2023), https://doi.org/10.52843/cassyni.2g7fx6.
discontinuous spectral element methods, J. Comput. Phys. 468 (2022) 111501. [59] K. Cengiz, Use of high-order curved elements for direct and large eddy simulation
[26] T. Dzanic, F.D. Witherden, Positivity-preserving entropy filtering for the ideal mag of flow over rough surfaces, PyFR Semin. Ser. (2023), https://doi.org/10.52843/
netohydrodynamics equations, Comput. Fluids 266 (2023) 106056. cassyni.hkqms2.
[27] T. Dzanic, Positivity-preserving entropy-based adaptive filtering for shock capturing, [60] Kenan Cengiz, Sebastian Kurth, Lars Wein, Joerg R. Seume, Use of high-order
PyFR Semin. Ser. (2022), https://doi.org/10.52843/cassyni.pvy6c0. curved elements for direct and large eddy simulation of flow over rough sur
[28] Will Trojak, Tarik Dzanic, Positivity-preserving discontinuous spectral element faces, Tech. Mech. - Eur. J. Eng. Mech. 43 (1) (Feb. 2023) 38--48, https://doi.org/
methods for compressible multi-species flows, Comput. Fluids 280 (August 2024) 10.24352/UB.OVGU-2023-043, https://journals.ub.ovgu.de/index.php/techmech/
106343, https://doi.org/10.1016/j.compfluid.2024.106343. article/view/2100.
[61] T. Regev, GPU-accelerated high-fidelity implicit LES of Coanda cylinder flow instabilities, PyFR Semin. Ser. (2023), https://doi.org/10.52843/cassyni.5yklnq.
[62] Ali Girayhan Özbay, Unsteady 2D flow reconstruction around arbitrary shapes via conformal mapping aided deep neural networks, PyFR Semin. Ser. (2022), https://doi.org/10.52843/cassyni.s1q5yf.
[63] Ali Girayhan Özbay, Sylvain Laizet, Deep learning fluid flow reconstruction around arbitrary two-dimensional objects from sparse sensors using conformal mappings, AIP Adv. 12 (4) (April 2022) 045126, https://doi.org/10.1063/5.0087488.
[64] A.S. Iyer, F.D. Witherden, S.I. Chernyshenko, P.E. Vincent, Identifying eigenmodes of averaged small-amplitude perturbations to turbulent channel flow, J. Fluid Mech. 875 (2019) 758–780, https://doi.org/10.1017/jfm.2019.520.
[65] A. Aubry, Gradient-free aerodynamic shape optimization using PyFR and MADS, PyFR Semin. Ser. (2021), https://doi.org/10.52843/cassyni.nqp2sp.
[66] Lidia Caros, Oliver Buxton, Peter Vincent, Optimization of triangular airfoils for Martian helicopters using direct numerical simulations, AIAA J. 61 (11) (2023) 4935–4945.
[67] Lidia Caros, Oliver Buxton, Tsuyoshi Shigeta, Takayuki Nagata, Taku Nonomura, Keisuke Asai, Peter Vincent, Direct numerical simulation of flow over a triangular airfoil under Martian conditions, AIAA J. 60 (7) (2022) 3961–3972.
[68] David J. Lusher, Neil D. Sandham, Assessment of low-dissipative shock-capturing schemes for the compressible Taylor–Green vortex, AIAA J. 59 (2) (February 2021) 533–545, https://doi.org/10.2514/1.j059672.
[69] Jean-Baptiste Chapelier, David J. Lusher, William Van Noordt, Christoph Wenzel, Tobias Gibis, Pascal Mossier, Andrea Beck, Guido Lodato, Christoph Brehm, Matteo Ruggeri, Carlo Scalo, Neil Sandham, Comparison of high-order numerical methodologies for the simulation of the supersonic Taylor–Green vortex flow, Phys. Fluids 36 (5) (May 2024), https://doi.org/10.1063/5.0206359.
[70] William Sutherland, LII. The viscosity of gases and molecular force, Lond. Edinb. Dublin Philos. Mag. J. Sci. 36 (223) (December 1893) 507–531, https://doi.org/10.1080/14786449308620508.
[71] Pedro Stefanin Volpiani, Jean-Baptiste Chapelier, Axel Schwöppe, Jens Jägersküpper, Steeve Champagneux, Aircraft simulations using the new CFD software from ONERA, DLR, and Airbus, J. Aircr. 61 (3) (May 2024) 857–869, https://doi.org/10.2514/1.c037506.
[72] Nico Krais, Andrea Beck, Thomas Bolemann, Hannes Frank, David Flad, Gregor Gassner, Florian Hindenlang, Malte Hoffmann, Thomas Kuhn, Matthias Sonntag, Claus-Dieter Munz, FLEXI: a high order discontinuous Galerkin framework for hyperbolic–parabolic conservation laws, Comput. Math. Appl. 81 (January 2021) 186–219, https://doi.org/10.1016/j.camwa.2020.05.004.
[73] J.-B. Chapelier, G. Lodato, A. Jameson, A study on the numerical dissipation of the spectral difference method for freely decaying and wall-bounded turbulence, Comput. Fluids 139 (November 2016) 261–280, https://doi.org/10.1016/j.compfluid.2016.03.006.
[74] David J. Lusher, Satya P. Jammy, Neil D. Sandham, OpenSBLI: automated code generation for heterogeneous computing architectures applied to compressible fluid dynamics on structured grids, Comput. Phys. Commun. 267 (October 2021) 108063, https://doi.org/10.1016/j.cpc.2021.108063.
[75] P.E. Vincent, F.D. Witherden, B. Vermeire, J.S. Park, A. Iyer, Towards green aviation with Python at petascale, in: SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2016, pp. 1–11.
[76] Wim M. van Rees, Anthony Leonard, D.I. Pullin, Petros Koumoutsakos, A comparison of vortex and pseudo-spectral methods for the simulation of periodic vortical flows at high Reynolds numbers, J. Comput. Phys. 230 (8) (2011) 2794–2805.