High Performance Cross Platform Architecture

Uploaded by

Cheng Yuan Chang

About Me

• 35-year career in video games and embedded software

• Started using C++ in 1995
• First cross-platform project in 1994
Cross-Platform Architecture Goals
• Take advantage of all platforms
• Focus on the compiler
• Minimize boilerplate and unnecessary code
• Minimize redundant code
• Minimize modifying existing code
• Minimize preprocessor macros
The Design
• Implement a family of quaternion classes, an illustrative example from a
larger project
• Project build issues
• Inclusion of platform-specific header files
• Concept hierarchies
• Class and Function Design
OCP: The Open–Closed Principle in C++
• Open for extension
• Closed for modification
• Implemented via delegated polymorphism

Since programs that conform to the open–closed principle are changed by
adding new code, rather than by changing existing code, they do not
experience the cascade of changes exhibited by non-conforming programs.
— Robert C. Martin (1996)
Example: Drawing Shapes — The C Way Before OCP
Shape Data
typedef enum {
    circle,
    square
} shape_type;

/* Common prefix shared by every shape; draw_shapes reads only the tag. */
typedef struct {
    shape_type type;
} shape;

typedef struct {
    shape_type type;
    int radius;
} circle_shape;

typedef struct {
    shape_type type;
    int side_length;
} square_shape;

Drawing Functions
void draw_circle(circle_shape *c);
void draw_square(square_shape *s);

void draw_shapes(shape** shapes, int n) {
    int i;
    for (i = 0; i < n; i++) {
        /* Every new shape type forces a new case here. */
        switch (shapes[i]->type) {
        case circle:
            draw_circle((circle_shape*)(shapes[i]));
            break;
        case square:
            draw_square((square_shape*)(shapes[i]));
            break;
        }
    }
}
Example: Drawing Shapes — The Polymorphic C++ Way
Shape Classes
// shape.h
class shape {
public:
    virtual ~shape() = default;
    virtual void draw() const = 0;
};

// circle.h
#include "shape.h"
class circle : public shape {
    int radius;
public:
    void draw() const override;
};

// square.h
#include "shape.h"
class square : public shape {
    int side_length;
public:
    void draw() const override;
};

Draw Function
#include <vector>
#include "shape.h"

// Unchanged when new shape classes are added.
void draw_shapes(const std::vector<shape*>& shapes) {
    for (auto const * s : shapes) {
        s->draw();
    }
}
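The open/closed behaviour of the polymorphic version can be sketched in a single file. This is a minimal sketch, not the talk's code: draw() appends to a string log instead of rendering, and the triangle class, demo_draw, and check_draw_order are hypothetical names added for illustration. The point is that triangle is added without touching draw_shapes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Abstract interface; draw() appends to a log instead of rendering (assumption).
class shape {
public:
    virtual ~shape() = default;
    virtual void draw(std::string& out) const = 0;
};

class circle : public shape {
public:
    void draw(std::string& out) const override { out += "circle;"; }
};

class square : public shape {
public:
    void draw(std::string& out) const override { out += "square;"; }
};

// Closed for modification: this function does not change when shapes are added.
inline std::string draw_shapes(const std::vector<std::unique_ptr<shape>>& shapes) {
    std::string out;
    for (const auto& s : shapes) {
        s->draw(out);
    }
    return out;
}

// Open for extension: a hypothetical shape added later, in new code only.
class triangle : public shape {
public:
    void draw(std::string& out) const override { out += "triangle;"; }
};

// Builds a mixed list and draws it through the unchanged draw_shapes.
inline std::string demo_draw() {
    std::vector<std::unique_ptr<shape>> shapes;
    shapes.push_back(std::make_unique<circle>());
    shapes.push_back(std::make_unique<square>());
    shapes.push_back(std::make_unique<triangle>());
    return draw_shapes(shapes);
}

// Self-check: shapes are drawn in insertion order, including the new one.
inline void check_draw_order() {
    assert(demo_draw() == "circle;square;triangle;");
}
```

Compare this with the C version, where the same addition would need a new enum value and a new case in the switch.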
OCP: The Original Open–Closed Principle
• If code has been made public, changes should not affect existing code.
• Re-use is through direct inheritance, not delegated polymorphism.
• Has been unrealistic in C++

The principles stated that a good module structure should be both closed
and open:
• Closed, because clients need the module’s services to proceed with
their own development, and once they have settled on a version of the
module should not be affected by the introduction of new services they
do not need.
• Open, because there is no guarantee that we will include right from the
start every service potentially useful to some client.

— Bertrand Meyer (1988)

Meyer OCP: Adapting a Module to New Clients (Before)

Meyer, B. (1997). Object-Oriented Software Construction (2nd ed.)

Meyer OCP: Adapting a Module to New Clients (After)

Meyer, B. (1997). Object-Oriented Software Construction (2nd ed.)

OCP in C++: Strong and Weak Principles
• No OCP
  • Previous code modified
  • Change in current behavior and/or interface
  • Clients must change in response
• Weak OCP
  • Previous code modified
  • No change in current behavior and interface
  • Recompilation occurs
• Strong OCP
  • No previous code is modified in any way
  • Nothing is recompiled
Architecture Guidelines
• Very close to Bertrand Meyer’s original vision
• Not OOP, data and functions separate
• Adding new platforms has no effect on previously-implemented platforms
• Adding new revisions to a feature has no effect on previously-implemented
revisions.
What is a Platform?
• A specific set of features
• A feature is an abstract unit of functionality requiring implementations that
differ depending upon the target machine architecture.
• Features may be hardware: CPU architecture, SIMD instruction set, DMA
controller, GPIO module, etc.
• Features may be software: OS, graphics API, etc.
• Features may not be totally orthogonal, e.g. x86/SSE, DirectX/Windows
Directory and File Structure
Flat
plt/simd
    Simd.h
    Neon32.h
    Sse.h
    Sse2.h
    …
plt/math
    Quat.h
    Quat_Common.h
    Quat_Neon32.h
    Quat_Sse.h
    Quat_Sse2.h
    …

Deep
plt/simd
    Simd.h
    Neon32.h
    Sse.h
    Sse2.h
    …
plt/math
    Quat.h
    Common/
        Quat_Common.h
        Vec_Common.h
        Mtx_Common.h
        …
    Neon32/
        Quat_Neon32.h
        Vec_Neon32.h
        Mtx_Neon32.h
        …
Including Headers
• Each feature has its own header file for the definitions common to all
implementations
• The header file is responsible for including the platform-specific code
• A series of preprocessor macros generates the name of the header file to
load
• Example, with PLT_SIMD defined as Sse2:
INCLUDE_SIMD(Quat)

becomes:
"Quat_Sse2.h"
Header Inclusion Macros
• #define INCLUDE_PLT(Feature, File) INCLUDE_BUILD_FILENAME(Feature, File)
• #define INCLUDE_PLT_FEATURE(Feature) INCLUDE_STRINGIZE(Feature.h)
• #define INCLUDE_BUILD_FILENAME(Feature, File) INCLUDE_STRINGIZE(File ## _ ## Feature.h)
• #define INCLUDE_STRINGIZE(String) #String

• #define INCLUDE_SIMD(File) INCLUDE_PLT(PLT_SIMD, File)
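Because the macros expand to an ordinary string literal, the expansion can be inspected at compile time. A minimal sketch, assuming PLT_SIMD is hard-coded to Sse2 here rather than supplied by the build system; quat_header and feature_header are hypothetical names used only for the check.

```cpp
#include <string_view>

// In the real build PLT_SIMD comes from the build system; it is hard-coded
// here (an assumption) so the expansion can be inspected in isolation.
#define PLT_SIMD Sse2

#define INCLUDE_STRINGIZE(String) #String
#define INCLUDE_BUILD_FILENAME(Feature, File) INCLUDE_STRINGIZE(File ## _ ## Feature.h)
// The extra level of indirection expands PLT_SIMD to its value before
// INCLUDE_BUILD_FILENAME pastes it; operands of ## are not expanded.
#define INCLUDE_PLT(Feature, File) INCLUDE_BUILD_FILENAME(Feature, File)
#define INCLUDE_PLT_FEATURE(Feature) INCLUDE_STRINGIZE(Feature.h)
#define INCLUDE_SIMD(File) INCLUDE_PLT(PLT_SIMD, File)

// The same expansions that would follow #include, captured as strings.
constexpr std::string_view quat_header = INCLUDE_SIMD(Quat);
constexpr std::string_view feature_header = INCLUDE_PLT_FEATURE(PLT_SIMD);

static_assert(quat_header == "Quat_Sse2.h");
static_assert(feature_header == "Sse2.h");
```

Calling INCLUDE_BUILD_FILENAME(PLT_SIMD, Quat) directly would paste the macro name itself and yield "Quat_PLT_SIMD.h", which is why INCLUDE_PLT sits in between.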

Creating a Feature
• Define the feature in the build system
• Create a header file for the feature
• Create one header file for each unique implementation
The SIMD Feature in the Build System
• Place appropriate feature macro definition in toolchain files:
set(PLT_SIMD Common)
• In the common project, add the features to the preprocessor definitions:
add_compile_definitions(
PLT_SIMD=${PLT_SIMD}
)
Simd.h: The SIMD Feature Header
#if !defined(PLT_SIMD)
#error You must define PLT_SIMD.
#endif

#define INCLUDE_SIMD(File) INCLUDE_PLT(PLT_SIMD, File)

#include INCLUDE_PLT_FEATURE(PLT_SIMD)
Simd_Common.h: The SIMD Feature Header
namespace plt::simd
{
struct Common {};
}
Quaternions, Mathematically Speaking (1/2)
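• Four-dimensional complex number:
$w + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$

• Where:
$\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1$
$\mathbf{i}\mathbf{j} = \mathbf{k} = -\mathbf{j}\mathbf{i}$
$\mathbf{j}\mathbf{k} = \mathbf{i} = -\mathbf{k}\mathbf{j}$
$\mathbf{k}\mathbf{i} = \mathbf{j} = -\mathbf{i}\mathbf{k}$
• Addition:
$a + b = (a_w + b_w) + (a_x + b_x)\mathbf{i} + (a_y + b_y)\mathbf{j} + (a_z + b_z)\mathbf{k}$

• Multiplication:
$ab = (a_w b_w - a_x b_x - a_y b_y - a_z b_z)$
$\quad + (a_w b_x + a_x b_w + a_y b_z - a_z b_y)\mathbf{i}$
$\quad + (a_w b_y - a_x b_z + a_y b_w + a_z b_x)\mathbf{j}$
$\quad + (a_w b_z + a_x b_y - a_y b_x + a_z b_w)\mathbf{k}$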
Quaternions, Mathematically Speaking (2/2)
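• Conjugate:
$q^* = w - x\mathbf{i} - y\mathbf{j} - z\mathbf{k}$

• Dot Product:
$a \cdot b = a_w b_w + a_x b_x + a_y b_y + a_z b_z$
• Norm:
$\|q\| = \sqrt{q \cdot q} = \sqrt{w^2 + x^2 + y^2 + z^2}$

• Multiplicative Inverse:
$q^{-1} = \frac{q^*}{\|q\|^2} = \frac{q^*}{q \cdot q}$

• Division:
$\frac{a}{b} = a b^{-1} = \frac{a b^*}{b \cdot b}$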
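The multiplication table and the identity $q q^* = q \cdot q$ can be checked mechanically. A minimal sketch, assuming a hypothetical IntQuat struct with integer components so that every comparison is exact; mul, conj, and dot are illustrative names, not the talk's API.

```cpp
// Hypothetical integer quaternion; integers keep every comparison exact.
struct IntQuat { int w, x, y, z; };

constexpr bool operator==(const IntQuat& a, const IntQuat& b) noexcept {
    return a.w == b.w && a.x == b.x && a.y == b.y && a.z == b.z;
}

// Hamilton product, term by term as in the multiplication formula.
constexpr IntQuat mul(const IntQuat& a, const IntQuat& b) noexcept {
    return {
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w
    };
}

constexpr IntQuat conj(const IntQuat& q) noexcept {
    return {q.w, -q.x, -q.y, -q.z};
}

constexpr int dot(const IntQuat& a, const IntQuat& b) noexcept {
    return a.w * b.w + a.x * b.x + a.y * b.y + a.z * b.z;
}

constexpr IntQuat qi{0, 1, 0, 0};
constexpr IntQuat qj{0, 0, 1, 0};
constexpr IntQuat qk{0, 0, 0, 1};
constexpr IntQuat minus_one{-1, 0, 0, 0};

// i^2 = ijk = -1
static_assert(mul(qi, qi) == minus_one);
static_assert(mul(mul(qi, qj), qk) == minus_one);

// ij = k = -ji: multiplication is not commutative.
static_assert(mul(qi, qj) == qk);
static_assert(mul(qj, qi) == IntQuat{0, 0, 0, -1});

// q q* has only a real part, equal to q . q, which is why the inverse
// can be written as q* / (q . q).
constexpr IntQuat sample{1, 2, 3, 4};
static_assert(mul(sample, conj(sample)) == IntQuat{dot(sample, sample), 0, 0, 0});
```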
Quaternion Concepts in Mathematics vs. C++ 20
• Mathematically, a quaternion is defined by its data and operations
• As a C++ concept, the quaternion is defined purely by its data
• The common implementation of the operations relies only upon the concepts
• Optimized implementations conform to the concepts and common
implementations
C++ Quaternion Concept
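template<typename Q>
concept Quaternion = requires(Q q)
{
    typename Q::Scalar;
    // "requires" makes this a check of the concept's value; without it the
    // line only tests that the expression is well-formed.
    requires Arithmetic<typename Q::Scalar>;

    // convertible_to rather than same_as: accessors that return
    // const Scalar & would not satisfy same_as<Scalar>.
    { q.w() } -> std::convertible_to<typename Q::Scalar>;
    { q.x() } -> std::convertible_to<typename Q::Scalar>;
    { q.y() } -> std::convertible_to<typename Q::Scalar>;
    { q.z() } -> std::convertible_to<typename Q::Scalar>;
};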
Supporting Concepts
Arithmetic
template<typename T>
concept Arithmetic = std::is_arithmetic_v<T>;

MutuallyArithmetic
template<typename T, typename U>
concept MutuallyArithmetic = requires (T t, U u)
{
    requires Arithmetic<T>;
    requires Arithmetic<U>;

    { t + u };
    { t - u };
    { t * u };
    { t / u };
};
“Standard” Quaternion Type: Declaration and Data
template<typename S, typename I = plt::simd::PLT_SIMD>
class Quat
{
public:
using Scalar = S;

private:
Scalar w_, x_, y_, z_;

// ... Constructors
// ... Accessors
};
“Standard” Quaternion Type: Constructors
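Quat() = default;

// const & rather than by-value: a by-value overload next to the && overload
// below would make calls with rvalue arguments ambiguous.
Quat(const Scalar & w, const Scalar & x, const Scalar & y, const Scalar & z)
    noexcept(std::is_nothrow_copy_constructible_v<Scalar>)
    : w_(w), x_(x), y_(y), z_(z)
{}

Quat(Scalar && w, Scalar && x, Scalar && y, Scalar && z)
    noexcept(std::is_nothrow_move_constructible_v<Scalar>)
    : w_(std::move(w)), x_(std::move(x)), y_(std::move(y)), z_(std::move(z))
{}

// static_cast permits narrowing conversions (e.g. double to float) that
// Scalar{...} would reject even though convertible_to accepts them.
template<Quaternion Q>
    requires std::convertible_to<typename Q::Scalar, Scalar>
Quat(const Q& rhs)
    noexcept(std::is_nothrow_convertible_v<typename Q::Scalar, Scalar>)
    : Quat(static_cast<Scalar>(rhs.w()), static_cast<Scalar>(rhs.x()),
           static_cast<Scalar>(rhs.y()), static_cast<Scalar>(rhs.z()))
{}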
“Standard” Quaternion Type: Assignment Operator
template<Quaternion Q>
    requires std::convertible_to<typename Q::Scalar, Scalar>
Quat& operator=(const Q& rhs)
    noexcept(std::is_nothrow_convertible_v<typename Q::Scalar, Scalar>)
{
    // static_cast permits narrowing conversions that Scalar{...} would reject.
    w_ = static_cast<Scalar>(rhs.w());
    x_ = static_cast<Scalar>(rhs.x());
    y_ = static_cast<Scalar>(rhs.y());
    z_ = static_cast<Scalar>(rhs.z());
    return *this;
}
“Standard” Quaternion Type: Accessors
const Scalar & w() const noexcept { return w_; }
const Scalar & x() const noexcept { return x_; }
const Scalar & y() const noexcept { return y_; }
const Scalar & z() const noexcept { return z_; }
General Operation Implementation
• Use expression trees
• Runs anywhere
• Works for any arithmetic scalar type
• Parameters defined using concepts
What is an Expression Tree? An Example
• Expression: q1+q2*q3
• Operators construct nodes in the tree
• Expression results in a tree:
Addition Example: Operator
template<Quaternion QL, Quaternion QR>
inline auto operator+(const QL & lhs, const QR & rhs) noexcept -> QuaternionAddition<QL, QR>
{
return QuaternionAddition(lhs, rhs);
}
Addition Example: Quaternion Binary Expression
template<Quaternion QL, Quaternion QR>
requires MutuallyArithmetic<typename QL::Scalar, typename QR::Scalar>
class QuaternionBinaryExpr : public QuaternionExpr
{
using SL = typename QL::Scalar;
using SR = typename QR::Scalar;

public:
using Scalar = typename std::common_type<SL, SR>::type;
};

template<typename QL, typename QR>
using QuaternionBinaryType = typename QuaternionBinaryExpr<QL, QR>::Scalar;
Addition Example: Expression Node
template<Quaternion QL, Quaternion QR>
class QuaternionAddition : public QuaternionBinaryExpr<QL, QR>
{
QL l_;
QR r_;

public:
using Scalar = QuaternionBinaryType<QL, QR>;

QuaternionAddition(const QL & lhs, const QR & rhs) noexcept

: l_(lhs), r_(rhs)
{}

Scalar w() const noexcept { return l_.w() + r_.w(); }

Scalar x() const noexcept { return l_.x() + r_.x(); }
Scalar y() const noexcept { return l_.y() + r_.y(); }
Scalar z() const noexcept { return l_.z() + r_.z(); }
};
Scalar-valued Functions
template<Quaternion QL, Quaternion QR>
inline auto Dot(const QL & lhs, const QR & rhs) noexcept -> QuaternionBinaryType<QL, QR>
{
return
lhs.w() * rhs.w() +
lhs.x() * rhs.x() +
lhs.y() * rhs.y() +
lhs.z() * rhs.z();
}
Usage of Common Implementation
Quat<float> eval(Quat<float> a, Quat<float> b, Quat<float> c) {
return (a * b + c) / (a - c);
}
Common Implementation Complete
• The common implementation is a set of interrelated concepts, classes and
functions.
• Quaternion-valued expressions will compile and run on any platform as long
as the scalar types are supported on the platform
• The code is as optimal as the compiler that optimizes it.
• But we can do better…
Optimizing for the ARM Neon
• Update the build system
• Add a SIMD tag
• Add a template specialization for the Neon data
• Add Neon-optimized functions
Updating the Build System for ARM Neon32
• Change the SIMD macro to Neon32
set(PLT_SIMD Neon32)
• May need to set a compiler flag to enable Neon
neon32.h: The ARM Neon32 Feature Header
#define SIMD_HAS_NEON32

#include <concepts>
#include <type_traits>
#include "Simd.h"

namespace plt::simd
{
struct Neon32 : Common {};

template<typename SIMD>
concept Neon32Family = std::derived_from<SIMD, Neon32>;
}
Quat_Neon32.h: Quat<float, Neon> Class Declaration and Data
#include <arm_neon.h>

template<>
class Quat<float, plt::simd::Neon32>
{
float32x4_t value_;

public:
using Scalar = float;

// ... Constructors
// ... Accessors
};
Quat_Neon32.h: Quat<float, Neon> Constructors
Quat() = default;

Quat(Scalar w, Scalar x, Scalar y, Scalar z) {

float vals[] = { w, x, y, z};
value_ = vld1q_f32(vals);
}

template<Quaternion Q>
Quat(const Q& rhs)
: Quat(static_cast<Scalar>(rhs.w()), static_cast<Scalar>(rhs.x()),
static_cast<Scalar>(rhs.y()),
static_cast<Scalar>(rhs.z()))
{}

Quat(float32x4_t value) : value_(value) {}

Quat_Neon32.h: Quat<float, Neon> Accessors
template<>
class Quat<float, plt::simd::Neon32>
{
// ...
Scalar w() const noexcept { return vgetq_lane_f32(NeonVal(), 0); }
Scalar x() const noexcept { return vgetq_lane_f32(NeonVal(), 1); }
Scalar y() const noexcept { return vgetq_lane_f32(NeonVal(), 2); }
Scalar z() const noexcept { return vgetq_lane_f32(NeonVal(), 3); }

float32x4_t NeonVal() const noexcept { return value_; }

};
Where are We Now?
• The class alone compiles, links, and passes tests
• Quaternions will now use Neon registers
• The algorithms are all common implementations
• Data is moved into general-purpose registers for computations
• Depending on the platform, you may already see a performance gain at this stage
Quat<float, Neon32> Functions
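// Conjugate: keep w (lane 0) and negate x, y, z. It is written as unary
// operator* so that (-q) below resolves to the separate negation operator;
// as operator- the function would call itself.
template<plt::simd::Neon32Family SIMD>
inline auto operator*(Quat<float, SIMD> q) -> Quat<float, SIMD>
{
    float32x4_t value = q.NeonVal();
    float32x4_t negation = (-q).NeonVal();
    float32x4_t result = vcopyq_laneq_f32(negation, 0, value, 0);
    return Quat<float, SIMD>(result);
}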
Usage of Platform-specific Implementation
Quat<float> eval(Quat<float> a, Quat<float> b, Quat<float> c) {
return (a * b + c) / (a - c);
}
Adding a Platform with SSE
• Update the build scripts
• Add a SIMD tag
• Add a template specialization for the SSE data
• Add SSE-optimized functions
• SSE has 128-bit registers that hold four floats; packed doubles (two per
register) arrive only with SSE2
Updating the Build System for SSE
• Create a new toolchain file
• Set the SIMD feature macro:
set(PLT_SIMD Sse)
• No change is required to the other toolchain file or the common build files
sse.h: The SSE Feature Header
#define SIMD_HAS_SSE

#include <concepts>
#include <type_traits>
#include "Simd.h"

namespace plt::simd
{
struct Sse : Common {};

template<typename SIMD>
concept SseFamily = std::derived_from<SIMD, Sse>;
}
Quat_Sse.h: Quat<float, Sse> Declaration and Constructors
#include <immintrin.h>

template<>
class Quat<float, plt::simd::Sse>
{
__m128 value_;

public:
using Scalar = float;

// ...

Quat(Scalar w, Scalar x, Scalar y, Scalar z) {

value_ = _mm_setr_ps(w, x, y, z);
}

Quat(__m128 value) : value_(value) {}

// ...
};
Quat_Sse.h: Quat<float, Sse> Accessors
template<>
class Quat<float, plt::simd::Sse>
{
    // ...
    // w sits in lane 0; the other lanes are shuffled into lane 0 before extraction.
    Scalar w() const noexcept { return _mm_cvtss_f32(SseVal()); }
    Scalar x() const noexcept { return _mm_cvtss_f32(_mm_shuffle_ps(SseVal(), SseVal(), _MM_SHUFFLE(1, 1, 1, 1))); }
    Scalar y() const noexcept { return _mm_cvtss_f32(_mm_shuffle_ps(SseVal(), SseVal(), _MM_SHUFFLE(2, 2, 2, 2))); }
    Scalar z() const noexcept { return _mm_cvtss_f32(_mm_shuffle_ps(SseVal(), SseVal(), _MM_SHUFFLE(3, 3, 3, 3))); }

    __m128 SseVal() const noexcept { return value_; }
};
Quat_Sse Functions
template<plt::simd::SseFamily SIMD>
inline auto Dot(Quat<float, SIMD> lhs, Quat<float, SIMD> rhs) -> float
{
__m128 squares = _mm_mul_ps(lhs.SseVal(), rhs.SseVal());
__m128 badc = _mm_shuffle_ps(squares, squares, _MM_SHUFFLE(2, 3, 0, 1));
__m128 pairs = _mm_add_ps(squares, badc);
__m128 bbaa = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(0, 1, 2, 3));
__m128 dp = _mm_add_ps(pairs, bbaa);
float result = _mm_cvtss_f32(dp);
return result;
}
Two Platforms Now
• For this platform, as with Neon, we now build using SSE registers
• We now have two platforms whose optimizations are completely
independent
• Except for include guards and checking PLT_SIMD is defined, there are no #if
or #ifdef macros anywhere
• Each header file unconditionally includes the appropriate intrinsic header
• No feature-specific code is ever fed into the compiler for a platform that does
not have that version of the feature
Handling Feature Revisions
• With data class and functions written, we have a program optimized for SSE.
• Now we need to support a platform with SSE2
• SSE2 adds support for a 128-bit register with two doubles
• Now we have two optimized types Quat<float, Sse2> and Quat<double,
Sse2>
• Same steps, different hoops
Sse2.h: The SSE2 Feature Header
#define SIMD_HAS_SSE2

#include "Sse.h"

namespace plt::simd
{
struct Sse2 : Sse {};

template<typename SIMD>
concept Sse2Family = SseFamily<SIMD> && std::derived_from<SIMD, Sse2>;
}
Quat_Sse2.h: Quat<float, Sse2> Class
#include "Quat_Sse.h"

template<>
class Quat<float, plt::simd::Sse2> : public Quat<float, plt::simd::Sse>
{
using Quat<float, plt::simd::Sse>::Quat;
};
Quat_Sse2.h: Quat<double, Sse2> Class Declaration and Data
template<>
class Quat<double, plt::simd::Sse2>
{
__m128d wx_;
__m128d yz_;

public:
using Scalar = double;

// ... Constructors
// ... Accessors
};
Quat_Sse2.h: Quat<double, Sse2> Class Constructors
Quat() = default;

Quat(Scalar w, Scalar x, Scalar y, Scalar z)

{
wx_ = _mm_set_pd(x, w);
yz_ = _mm_set_pd(z, y);
}

template<Quaternion Q>
Quat(const Q& rhs)
: Quat(static_cast<Scalar>(rhs.w()), static_cast<Scalar>(rhs.x()), static_cast<Scalar>(rhs.y()),
static_cast<Scalar>(rhs.z()))
{}

Quat(__m128d wx, __m128d yz)
    : wx_(wx), yz_(yz)
{}
Quat_Sse2.h: Quat<double, Sse2> Accessors
Scalar w() const { return _mm_cvtsd_f64(SseWx()); }
Scalar x() const { return _mm_cvtsd_f64(_mm_unpackhi_pd(SseWx(), SseWx())); }
Scalar y() const { return _mm_cvtsd_f64 (SseYz()); }
Scalar z() const { return _mm_cvtsd_f64(_mm_unpackhi_pd(SseYz(), SseYz())); }

__m128d SseWx() const { return wx_; }

__m128d SseYz() const { return yz_; }
Quat<double, Sse2> Functions
template<plt::simd::Sse2Family SIMD>
inline auto Dot(Quat<double, SIMD> lhs, Quat<double, SIMD> rhs) -> double
{
__m128d w2x2 = _mm_mul_pd(lhs.SseWx(), rhs.SseWx());
__m128d x2w2 = _mm_shuffle_pd(w2x2, w2x2, _MM_SHUFFLE2(0, 1));
__m128d wx2wx2 = _mm_add_pd(w2x2, x2w2);

__m128d y2z2 = _mm_mul_pd(lhs.SseYz(), rhs.SseYz());

__m128d z2y2 = _mm_shuffle_pd(y2z2, y2z2, _MM_SHUFFLE2(0, 1));
__m128d yz2yz2 = _mm_add_pd(y2z2, z2y2);

__m128d dp = _mm_add_pd(wx2wx2, yz2yz2);

double result = _mm_cvtsd_f64(dp);
return result;
}
On to SSE3
• Add another toolchain file for the new platform that has SSE3
• Add another SIMD tag for SSE3 inheriting from SSE2
• No new register formats, so the float and double Quat specializations get
pulled forward to Quat<float, Sse3> and Quat<double, Sse3>
• New instructions mean more-optimized implementations
Quat<float, Sse3> Functions
template<plt::simd::Sse3Family SIMD>
inline auto Dot(Quat<float, SIMD> lhs, Quat<float, SIMD> rhs) -> float
{
    __m128 squares = _mm_mul_ps(lhs.SseVal(), rhs.SseVal()); // w^2, x^2, y^2, z^2
    __m128 add1st = _mm_hadd_ps(squares, squares); // w^2+x^2, y^2+z^2, w^2+x^2, y^2+z^2
    __m128 add2nd = _mm_hadd_ps(add1st, add1st); // w^2+x^2+y^2+z^2, ...
    float result = _mm_cvtss_f32(add2nd);
    return result;
}

template<plt::simd::Sse3Family SIMD>
inline auto Dot(Quat<double, SIMD> lhs, Quat<double, SIMD> rhs) -> double
{
    __m128d w2x2 = _mm_mul_pd(lhs.SseWx(), rhs.SseWx()); // w^2, x^2
    __m128d y2z2 = _mm_mul_pd(lhs.SseYz(), rhs.SseYz()); // y^2, z^2
    __m128d add1 = _mm_hadd_pd(w2x2, y2z2); // w^2+x^2, y^2+z^2
    __m128d add2 = _mm_hadd_pd(add1, add1); // w^2+x^2+y^2+z^2, ...
    double result = _mm_cvtsd_f64(add2);
    return result;
}
Function Overload Resolution: The Outline
• Overload resolution is the determination of which function to call given a set
of functions with the same name
• The compiler generates a list of candidates, trims it down to viable ones, and
then picks the best match
• In the event there is no best match the compiler emits an ambiguous
overload error
• With concepts, what constitutes the best match?
Concepts, Overload Resolution, and Subsumption
• Like enable_if<>, concepts constrain viability
• Unlike enable_if<>, concepts also partake in determining the best match
• For classes, a derived class has a higher priority than its base class
• Concepts do not derive from other concepts; however, other concepts may
appear in the body of concept definition, analogous to the idea concepts may
be composed but not inherited
• For concepts, a concept that subsumes another concept has the higher
priority
• What is subsumption?
Subsumption in Code
• Sse2Family is defined as conforming to the SseFamily concept and the
additional constraint of SIMD deriving from Sse2. Thus, Sse2Family subsumes
SseFamily
• When Sse2Derived is true, SseFamily is also true (but not necessarily the
reverse); however, it does not subsume SseFamily

struct Sse : Common {};
struct Sse2 : Sse {};

template<typename SIMD>
concept SseFamily = std::derived_from<SIMD, Sse>;

template<typename SIMD>
concept Sse2Family = SseFamily<SIMD> &&
                     std::derived_from<SIMD, Sse2>;

template<typename SIMD>
concept Sse2Derived = std::derived_from<SIMD, Sse2>;
Feature Revision Without Subsumption
• The first multiplication function is defined in SSE
• A more-optimized one is implemented in SSE3
• Assume Sse3Family is defined only as a tag derived from Sse3
• This will fail to compile with an ambiguous overload error
• The valid parameter types of Sse3Family are a strict subset of those
accepted by SseFamily
• But the concepts are unrelated and thus ambiguous

template<typename SIMD>
concept Sse3Family = std::derived_from<SIMD, Sse3>;

template<plt::simd::SseFamily SIMD>
inline auto operator*(Quat<float, SIMD> lhs,
                      Quat<float, SIMD> rhs) -> Quat<float, SIMD>;

template<plt::simd::Sse3Family SIMD>
inline auto operator*(Quat<float, SIMD> lhs,
                      Quat<float, SIMD> rhs) -> Quat<float, SIMD>;
Feature Revision With Subsumption
• The Sse3Family concept refers to Sse2Family and thus subsumes it
• And the Sse2Family concept refers to SseFamily and thus subsumes it
• Subsumption is transitive, thus Sse3Family subsumes SseFamily
• Consequently, the multiply constrained by Sse3Family is a better match than
that constrained by SseFamily, and there is no ambiguity

template<typename SIMD>
concept Sse2Family = SseFamily<SIMD> &&
                     std::derived_from<SIMD, Sse2>;

template<typename SIMD>
concept Sse3Family = Sse2Family<SIMD> &&
                     std::derived_from<SIMD, Sse3>;

template<plt::simd::SseFamily SIMD>
inline auto operator*(Quat<float, SIMD> lhs,
                      Quat<float, SIMD> rhs) -> Quat<float, SIMD>;

template<plt::simd::Sse3Family SIMD>
inline auto operator*(Quat<float, SIMD> lhs,
                      Quat<float, SIMD> rhs) -> Quat<float, SIMD>;
The State of Affairs with SSE3
• The data definition of Quat<float> is written for SSE and then re-used for
SSE2 and SSE3 with minimal boilerplate.
• A full set of arithmetic operations is implemented in SSE
• SSE2 inherits the optimized SSE functions without a single line of code
• SSE3 implements two more-optimized functions and inherits the rest
• When a new revision is added, platforms still on an older revision do not
see any of the new code and build exactly as if no code had been added
Fast Forward to AVX
• AVX is the next Intel SIMD standard after SSE4
• SSE uses 128-bit registers
• AVX uses 256-bit registers
• Quat<float> cannot take advantage of the wider register format, so it is pulled
forward as usual
• Quat<double> can now fit the entire quaternion into a single register, so it
gets a new data type
Quat<double, Avx> Class Data
• It now is quite like the original Quat<float, Sse>, with “double” replacing
“float” and “__m256d” replacing “__m128”
• If your class and math functions are in the same header, compiling the full
test suite now will result in a mass of failures
• The Avx tag derives from the SSE4 tag, so all the functions that operate on
two __m128d members fail because we now have a single __m256d member
• What to do?

template<> class Quat<double, plt::simd::Avx>
{
    __m256d value_;
public:
    // ...
};
• What to do?
Dividing Code Into Separate Headers
• Separating the class and functions has been described as good for
recompiles
• Additionally, it can avoid the problem of incompatible data across revisions
• Thus, instead of merely having Quat.h, Quat_Sse.h, etc., the project should
also have QuatMath.h, QuatMath_Sse.h, etc.
• But this still would not avoid the compilation errors!
The Root Cause of the Errors
• The initial split is necessary but not sufficient to fix the errors
• The problem is that Quat<double, Avx> should not include the prior
definitions
• But if QuatMath_Avx.h does not include QuatMath_Sse4.h then the
Quat<float, Avx> functions won’t get included and all of a sudden
Quat<float> is no longer optimized
• Thus, separate them into their own headers
Separate Math Headers AVX Example
• The math header merely includes headers for each supported scalar type
• QuatFloatMath_Avx.h includes QuatFloatMath_Sse4.h to inherit operations
• QuatDoubleMath_Avx.h would not include a prior header because it is
incompatible with prior revisions

#include "QuatFloatMath_Avx.h"
#include "QuatDoubleMath_Avx.h"
How Could I Know the Structure From Day 1?
• You probably couldn’t
• Given how large code bases can be, I’d refactor when it became known
• Such a refactor would trigger a change in files seen by older platforms
• Thus, it’s a weak OCP that is nearly strong
Summary
• We can write optimized code for each platform
• We can isolate platforms
• We can minimize redundancy
• With C++20, concepts, and some discipline we can have it all at once

• https://github.com/noahstein/Ark
