
When A Microsecond Is An Eternity - Carl Cook - CppCon 2017

This document discusses techniques for developing high-performance trading systems in C++. It begins by introducing electronic market making and the technical challenges of low latency trading. It then covers various C++ programming techniques for optimizing performance, including removing unnecessary code from the hotpath, using templates for configuration, leveraging lambda functions, reusing memory efficiently, and preferring templates over branches for control flow. The document also cautions against overusing threads and shared memory due to synchronization costs, and emphasizes measuring performance to validate optimizations.


When a Microsecond Is an Eternity: High Performance Trading Systems in C++
Carl Cook, Ph.D., Optiver

1
Introduction

About me:
● Software developer for several trading/finance companies
● Member of ISO SG14 (gaming, low latency, trading), but not a C++ expert

Today’s talk:
● What is electronic market making?
● Technical challenges
● Techniques for low latency C++, and then some surprises
● Measurement of performance

Disclaimer: This is not a discussion covering every C++ optimization technique - it’s
a quick sampler into the life of developing high performance trading systems

2
The three-minute guide to electronic market making

“The elements of good trading are: (1) cutting losses, (2) cutting losses, and (3)
cutting losses. If you can follow these three rules, you may have a chance.”
– Ed Seykota

3
● Two main activities:
○ Provide (continually updating) prices to the market
○ Spot profitable opportunities when they arise
● Objectives:
○ Make small, profitable, trades regularly
○ Avoid making large bad trades

● Aside from accurate pricing, successful algorithms are the fastest to:
○ Buy low
○ Sell high

● Success means being [any unit of time] faster than the competition

4
Photo by Alvin Loke / CC BY 2.0

5
Safety first

If anything appears to be wrong:
● Pull all orders, then start asking questions - not the other way round
● A lot could happen in a few seconds in an uncontrolled system

The best approach is to automate detection of failures

6
Technical challenges

“If you’re not at all interested in performance, shouldn’t you be in the Python
room down the hall?”
– Scott Meyers

7
● The “hotpath” is only exercised 0.01% of the time - the rest of the time the
system is idle or doing administrative work
● Operating systems, networks and hardware are focused on throughput and
fairness
● Jitter is unacceptable - it means bad trades

8
The role of C++

From Bjarne Stroustrup:


“C++ enables zero-overhead abstraction to get us away from the hardware
without adding cost”

However, even though C++ is good at saying what will be done, there are still other
factors:
● Compiler (and version)
● Machine architecture
● 3rd party libraries
● Build and link flags
We need to check what C++ is doing in terms of machine instructions...

9
… luckily there’s an app for that:

10
Calling std::sort on a std::vector<int>

Same:
● Hardware
● Operating system
● Binary
● Background load

One server is tuned for production (no hyper-threading, etc), the other not

11
How fast is fast?

12
Burj Khalifa

Height:
828 meters
2,722 feet

Speed of light:
~ 1 foot per ns

Photo by Donald Tong / CC BY-SA 3.0


13
A very good minimum time (wire to wire) for a software-based trading system
is around 2.5us

That's less than the time it takes light to travel from the top of the spire
to the ground

Photo by Donald Tong / CC BY-SA 3.0


14
Low latency programming techniques

"When in doubt, use brute force."


– Ken Thompson

15
Slowpath removal

Avoid this:

    if (checkForErrorA())
      handleErrorA();
    else if (checkForErrorB())
      handleErrorB();
    else if (checkForErrorC())
      handleErrorC();
    else
      sendOrderToExchange();

Aim for this:

    int64_t errorFlags;
    ...
    if (!errorFlags)
      sendOrderToExchange();
    else
      HandleError(errorFlags);

Tip: ensure that error handling code will not be inlined

16
Template-based configuration

● It’s convenient to have some things controlled via configuration files
  ○ However virtual functions (and even simple branches) can be expensive
● One possible solution:
  ○ Use templates (often overlooked, even though everyone uses the STL)
  ○ This removes branches, eliminates code that won’t be executed, etc

17
    // 1st implementation
    struct OrderSenderA {
      void SendOrder() {
        ...
      }
    };

    // 2nd implementation
    struct OrderSenderB {
      void SendOrder() {
        ...
      }
    };

    template <typename T>
    struct OrderManager : public IOrderManager {
      void MainLoop() final {
        // ... and at some stage in the future...
        mOrderSender.SendOrder();
      }
      T mOrderSender;
    };

18
    std::unique_ptr<IOrderManager> Factory(const Config& config) {
      if (config.UseOrderSenderA())
        return std::make_unique<OrderManager<OrderSenderA>>();
      else
        return std::make_unique<OrderManager<OrderSenderB>>();
    }

    int main(int argc, char *argv[]) {
      auto manager = Factory(config);
      manager->MainLoop();
    }

19
Lambda functions are fast and convenient

If you know at compile time which function is to be executed, then prefer lambdas

    template <typename T>
    void SendMessage(T&& lambda) {
      Msg msg = PrepareMessage();
      lambda(msg);
      send(msg);
    }

    // Caller:
    SendMessage([&](auto& msg) {
      msg.instrument = x;
      msg.price = z;
      ...
    });

20
Memory allocation

● Allocations are costly:
  ○ Use a pool of preallocated objects
● Reuse objects instead of deallocating:
  ○ delete involves no system calls (memory is not given back to the OS)
    ■ But: glibc free has 400 lines of book-keeping code
  ○ Reusing objects helps avoid memory fragmentation as well
● If you must delete large objects, consider doing this from another thread

21
Exceptions in C++

● Don’t be afraid to use exceptions (if using gcc, clang, msvc):
  ○ I’ve measured this in quite some detail:
    ■ They are zero cost if they don’t throw
● Don’t use exceptions for control flow:
  ○ That will get expensive:
    ■ My benchmarking suggests an overhead of at least 1.5us
  ○ Your code will look terrible

22
Prefer templates to branches

Branching approach:

    enum class Side { Buy, Sell };

    void RunStrategy(Side side) {
      const float orderPrice = CalcPrice(side, fairValue, credit);
      CheckRiskLimits(side, orderPrice);
      SendOrder(side, orderPrice);
    }

    float CalcPrice(Side side, float value, float credit) {
      return side == Side::Buy ? value - credit : value + credit;
    }

23
Templated approach:

    template <Side T>
    void Strategy<T>::RunStrategy() {
      const float orderPrice = CalcPrice(fairValue, credit);
      CheckRiskLimits(orderPrice);
      SendOrder(orderPrice);
    }

    template <>
    float Strategy<Side::Buy>::CalcPrice(float value, float credit) {
      return value - credit;
    }

    template <>
    float Strategy<Side::Sell>::CalcPrice(float value, float credit) {
      return value + credit;
    }
24
Multi-threading

Multithreading is best avoided for latency-sensitive code:
● Synchronization of data via locking will get expensive
● Lock free code may still require locks at the hardware level
● Mind-bendingly complex to correctly implement parallelism
● Easy for the producer to accidentally saturate the consumer

25
If you must use multiple threads...

● Keep shared data to an absolute minimum
  ○ Multiple threads writing to the same cacheline will get expensive
● Consider passing copies of data rather than sharing data
  ○ e.g. a single writer, single reader lock free queue
● If you have to share data, consider not using synchronization, e.g.:
  ○ Maybe you can live with out-of-sequence updates

26
Data lookups

Software engineering textbooks typically suggest:

    struct Market {
      int32_t id;
      char shortName[4];
      int16_t quantityMultiplier;
      ...
    };

    struct Instrument {
      float price;
      ...
      int32_t marketId;
    };

    Message orderMessage;
    orderMessage.price = instrument.price;
    Market& market = Markets.FindMarket(instrument.marketId);
    orderMessage.qty = market.quantityMultiplier * qty;
    ...

27
Data lookups

Actually, denormalized data is not a sin. Why not just pull all the data you care
about in the same cacheline?

    struct Market {
      int32_t id;
      char shortName[4];
      int16_t quantityMultiplier;
      ...
    };

    struct Instrument {
      float price;
      int16_t quantityMultiplier;
      ...
      int32_t marketId;
    };

This is better than trampling your cache to “save memory”

28
Fast associative containers (std::unordered_map)

[Diagram: N hash buckets, each bucket a linked chain of std::pair<K, V> nodes]

Default max_load_factor: 1
Average case insert: O(1)
Average case find: O(1)
See: wg21.link/n1456
29
10K elements, keyed in the range std::uniform_int_distribution(0, 1e+12)

Complexity of find:
● Average case: O(1)
● Worst case: O(N)

30
Run on (32 X 2892.9 MHz CPU s), 2017-09-08 11:39:44
Benchmark Time
----------------------------------------------
FindBenchmark<unordered_map>/10 14 ns
FindBenchmark<unordered_map>/64 16 ns
FindBenchmark<unordered_map>/512 16 ns
FindBenchmark<unordered_map>/4k 20 ns
FindBenchmark<unordered_map>/10k 24 ns
----------------------------------------------

    # 56.54% frontend cycles idle
    # 21.61% backend cycles idle
    # 0.67 insns per cycle
    # 0.84 stalled cycles per insn
    branch-misses # 0.63% of all branches
    cache-misses # 0.153% of all cache refs
31
Alternatively, consider open addressing, e.g. Google’s dense_hash_map

[Diagram: key/value pairs stored directly in contiguous array slots 1..5]

✓ Key/Value pairs are in contiguous memory - no pointer following between nodes

✘ Complexity around collision management

32
A lesser-known approach: a hybrid of both chaining and open addressing

Goals:
● Predictable cache access patterns (no jumping all over the place)
● Prefetched candidate hash values

33
[Diagram: a contiguous array of {hash, pointer} slots. Key 73 hashes to
index 1, the prefetched hash matches (✓), and the slot's pointer leads to
its {key 73, value} node. Key 12 hashes to index 3; a colliding entry
(hash 98) is rejected by hash comparison (✘) and the matching slot's
pointer leads to the {key 12, value} node]

It’s possible to implement this as a drop-in substitute for std::unordered_map


34
Run on (32 X 2892.9 MHz CPU s), 2017-09-08 11:40:08
Benchmark Time
----------------------------------------------
FindBenchmark<array_map>/10 7 ns
FindBenchmark<array_map>/64 7 ns
FindBenchmark<array_map>/512 7 ns
FindBenchmark<array_map>/4k 9 ns
FindBenchmark<array_map>/10k 9 ns
----------------------------------------------

    # 38.26% frontend cycles idle
    # 6.77% backend cycles idle
    # 1.6 insns per cycle
    # 0.24 stalled cycles per insn
    branch-misses # 0.22% of all branches
    cache-misses # 0.067% of all cache refs

35
__attribute__((always_inline)) and __attribute__((noinline))

● The inline keyword is somewhat misunderstood


○ It mainly means: multiple definitions are permitted
● ((always_inline)) and ((noinline)) are a stronger hint to the compiler
○ But be careful: measure

An example: forcing methods to be not inlined

    CheckMarket();
    if (notGoingToSendAnOrder)
      ComplexLoggingFunction();
    else
      SendOrder();

    __attribute__((noinline))
    void ComplexLoggingFunction() {
      ...
    }

36
Keeping the cache hot

Remember, the full hotpath is only exercised very infrequently - your cache has
most likely been trampled by non-hotpath data and instructions

[Diagram: several parallel pipelines of market data decoder → strategy →
execution engine; most updates stop at the decoder or strategy stage, and
only rarely does a message traverse the full path to the execution engine]
37
A simple solution: run a very frequent dummy path through your entire system,
keeping both your data cache and instruction cache primed

[Diagram: with the dummy path enabled, every market data update flows
through decoder → strategy → execution engine, keeping the whole path hot]

Bonus: this also trains the hardware branch predictor correctly

38
Intel Xeon E5 processor

Source:
Intel Corporation
39
● Don’t share L3 - disable all but 1 core (or lock the cache)
● If you do have multiple cores enabled, choose your neighbours carefully:
○ Noisy neighbours should probably be moved to a different physical CPU

40
Surprises and war stories

"I have always wished for my computer to be as easy to use as my telephone;
my wish has come true because I can no longer figure out how to use my
telephone."
– Bjarne Stroustrup

41
Placement new can be slightly inefficient

Quick refresher:

    #include <new>
    Object* object = new (buffer) Object;

● However, if you use:
  ○ Any version of gcc without -std=c++17 or -std=c++1z
  ○ Any version of gcc below 7.1 (May 2017)
  ○ Any version of clang below 3.4 (January 2014)
● Placement new will perform a null pointer check on the memory passed in
  ○ And if null is passed in:
    ■ The returned object is also null
    ■ No calls to the constructor or destructor will take place

42
● Why do compilers check whether memory passed in might be null?
○ The C++ spec was ambiguous about what placement new must do
● Marc Glisse/Jonathan Wakely (wg21.link/cwg1748) clarified this in 2013:
○ Passing null to placement new is now Undefined Behaviour [5.3.4.15]
● For several trading systems written in gcc, this inefficiency had a considerable
negative performance effect:
○ More instructions mean fewer opportunities to inline and optimize
● There is a workaround: declare a throwing type-specific placement new
void* Object::operator new(size_t, void* mem) /* can throw */ {
return mem;
}

43
Small string optimization support
    std::unordered_map<std::string, Instrument> instruments;
    return instruments.find({"IBM"}) != instruments.end();

● This will avoid an allocation with:
  ○ gcc 5.1 or greater, and if the string is 15 characters or less
  ○ clang if the string is 22 characters or less

● However, if you are using gcc >= 5.1 and an ABI compatible linux distribution
such as Redhat/Centos/Ubuntu/Fedora, then you are probably still using the
old std::string implementation
○ Including C.O.W. semantics
○ First mentioned (as slow) by Herb Sutter in 1999!

44
Overhead of C++11 static local variable initialization

    struct Random {
      int get() {
        // threadsafe!
        static int i = rand();
        return i;
      }
    };

    int main() {
      Random r;
      return r.get();
    }

    Random::get():
      movzx eax, BYTE PTR guard var
      test al, al
      je .L13 // not yet initialized
      mov eax, DWORD PTR get()::i
      ret
    .L13:
      // acquire and set the guard var

5-10% overhead compared to non-static access, even if binary is single threaded

45
std::function may allocate

struct Point {
double dimensions[3];
};

int main() {
std::function<void()> no_op { [point = Point{}] {} };
}

main:
mov edi, 24
call operator new(unsigned long)

46
Consider inplace_function (D0419R0):
● http://github.com/WG21-SG14/SG14/blob/master/SG14/inplace_function.h
● Drop-in replacement for std::function
● Defaults to a 32 byte internal buffer

int main() {
inplace_function<void()> no_op { [point = Point{}] {} };
}

main:
xor eax, eax
ret

    inplace_function<void(), 16> no_op { [point = Point{}] {} };
    // error: static assertion failed: Closure larger than buffer

47
std::pow can be slow

std::pow is a transcendental function, meaning it goes into a second, slower
phase if the accuracy of the result isn’t acceptable after the first phase

    auto base = 1.00000000000001, exp1 = 1.4, exp2 = 1.5;

    std::pow(base, exp1) = 1.0000000000000140
    std::pow(base, exp2) = 1.0000000000000151

    Benchmark                        Time     Iterations
    -------------------------------------------------------
    pow(base, 1.4) [glibc 2.17]     53 ns      13142054
    pow(base, 1.4) [glibc 2.21]     53 ns      13142821
    pow(base, 1.5) [glibc 2.17] 478195 ns          1457
    pow(base, 1.5) [glibc 2.21]  63348 ns         11113

See http://entropymine.com/imageworsener/slowpow for a nice discussion


48
Measurement of low latency systems

“Bottlenecks occur in surprising places, so don't try to second guess and put in
a speed hack until you've proven that's where the bottleneck is.”
– Rob Pike

49
Measurement of low latency systems

● Two common approaches:
  ○ Profiling: examining what your code is doing (bottlenecks in particular)
  ○ Benchmarking: timing the speed of your system
● Caution: profiling is not necessarily benchmarking
  ○ Profiling is useful for catching unexpected things
  ○ Improvements in profiling results are not a 100% guarantee that your
    system is now faster

50
Measurement of low latency systems

✘ Sampling profilers (e.g. gprof)
  ○ They miss the key events
✘ Instrumentation profilers (e.g. callgrind)
  ○ They are too intrusive
  ○ They don’t catch I/O slowness/jitter
✘ Microbenchmarks (e.g. Google benchmark)
  ○ They are not representative of a realistic environment
  ○ Takes some effort to force the compiler to not optimize out the test
  ○ Heap fragmentation can have an impact on subsequent tests

They are all useful in some ways, but not for micro-optimization of code

51
Measurement of low latency systems

✓ Most useful: measure end-to-end time in a production-like setup

[Diagram: three servers connected to a switch with high precision
hardware-based timestamping (appended to each packet):
● a server which replays exchange market data and accepts orders
● the server under test - listens to market data and sends orders
● a server which captures and parses each network packet it sees, and
  calculates response time (accurate to a few nanoseconds)]

52
Summary

“A language that doesn't affect the way you think about programming is not
worth knowing.”
– Alan Perlis

53
● Know C++ well, including your compiler
● Understand the basics of machine architecture, and how it will impact your
code
● Aim for very simple runtime logic:
○ Compilers optimize simple code the best
○ Prefer approximations instead of perfect precision where appropriate
○ Do expensive work only when you have spare time
● Conduct accurate measurement - this is essential

54
Thanks for listening!
[email protected]

55
