When A Microsecond Is An Eternity - Carl Cook - CppCon 2017
1
Introduction
About me:
● Software developer for several trading/finance companies
● Member of ISO SG14 (gaming, low latency, trading), but not a C++ expert
Today’s talk:
● What is electronic market making?
● Technical challenges
● Techniques for low latency C++, and then some surprises
● Measurement of performance
Disclaimer: This is not a discussion covering every C++ optimization technique - it’s
a quick sampler into the life of developing high performance trading systems
2
The three minute guide to electronic
market making
“The elements of good trading are: (1) cutting losses, (2) cutting losses, and (3)
cutting losses. If you can follow these three rules, you may have a chance.”
– Ed Seykota
3
● Two main activities:
○ Provide (continually updating) prices to the market
○ Spot profitable opportunities when they arise
● Objectives:
○ Make small, profitable, trades regularly
○ Avoid making large bad trades
● Aside from accurate pricing, successful algorithms are the fastest to:
○ Buy low
○ Sell high
● Success means being [any unit of time] faster than the competition
4
Photo by Alvin Loke / CC BY 2.0
5
Safety first
6
Technical challenges
“If you’re not at all interested in performance, shouldn’t you be in the Python
room down the hall?”
– Scott Meyers
7
● The “hotpath” is only exercised 0.01% of the time - the rest of the time the
system is idle or doing administrative work
● Operating systems, networks and hardware are focused on throughput and
fairness
● Jitter is unacceptable - it means bad trades
8
The role of C++
Even though C++ is good at saying what will be done, there are still other
factors:
● Compiler (and version)
● Machine architecture
● 3rd party libraries
● Build and link flags
We need to check what C++ is doing in terms of machine instructions...
9
… luckily there’s an app for that:
10
Calling std::sort on a std::vector<int>
Same:
● Hardware
● Operating system
● Binary
● Background load
11
How fast is fast?
12
Burj Khalifa
Height:
828 meters
2,722 feet
Speed of light:
~ 1 foot per ns
15
Slowpath removal
16
Template-based configuration
17
// 1st implementation
struct OrderSenderA {
    void SendOrder() {
        ...
    }
};

// 2nd implementation
struct OrderSenderB {
    void SendOrder() {
        ...
    }
};
18
std::unique_ptr<IOrderManager> Factory(const Config& config) {
if (config.UseOrderSenderA())
return std::make_unique<OrderManager<OrderSenderA>>();
else
return std::make_unique<OrderManager<OrderSenderB>>();
}
19
Lambda functions are fast and convenient
If you know at compile time which function is to be executed, then prefer lambdas
20
Memory allocation
21
Exceptions in C++
22
Prefer templates to branches
Branching approach:
23
Templated approach:
template<Side T>
void Strategy<T>::RunStrategy() {
    const float orderPrice = CalcPrice(fairValue, credit);
    CheckRiskLimits(orderPrice);
    SendOrder(orderPrice);
}

template<>
float Strategy<Side::Buy>::CalcPrice(float value, float credit) {
    return value - credit;
}

template<>
float Strategy<Side::Sell>::CalcPrice(float value, float credit) {
    return value + credit;
}
24
Multi-threading
25
If you must use multiple threads...
26
Data lookups
Message orderMessage;
orderMessage.price = instrument.price;
Market& market = Markets.FindMarket(instrument.marketId);
orderMessage.qty = market.quantityMultiplier * qty;
...
27
Data lookups
Actually, denormalized data is not a sin. Why not just pull all the data you care
about in the same cacheline?
28
Fast associative containers (std::unordered_map)
[Diagram: hash buckets chaining std::pair<K, V> nodes, so a find may hop between cachelines.]
Default max_load_factor: 1
Average case insert: O(1)
Average case find: O(1)
See: wg21.link/n1456
29
10K elements, keyed in the range std::uniform_int_distribution(0, 1e+12)
Complexity of find:
30
Run on (32 X 2892.9 MHz CPU s), 2017-09-08 11:39:44
Benchmark Time
----------------------------------------------
FindBenchmark<unordered_map>/10 14 ns
FindBenchmark<unordered_map>/64 16 ns
FindBenchmark<unordered_map>/512 16 ns
FindBenchmark<unordered_map>/4k 20 ns
FindBenchmark<unordered_map>/10k 24 ns
----------------------------------------------
32
A lesser-known approach: a hybrid of both chaining and open addressing
Goals:
● Predictable cache access patterns (no jumping all over the place)
● Prefetched candidate hash values
33
[Diagram: a contiguous array of slots 1-5, each holding a hash and a pointer (here hashes 73, 98, 12). A lookup for key 73 or key 12 compares the stored hashes in place (✓/✘) and only follows the pointer to the out-of-line key/value entry on a hash match.]
35
__attribute__((always_inline)) and __attribute__((noinline))
CheckMarket();
if (notGoingToSendAnOrder)
    ComplexLoggingFunction();
else
    SendOrder();

__attribute__((noinline))
void ComplexLoggingFunction() {
    ...
}
36
Keeping the cache hot
Remember, the full hotpath is only exercised very infrequently - your cache has
most likely been trampled by non-hotpath data and instructions
[Diagram: multiple "market data decoder" components ahead of the trading logic.]
37
A simple solution: run a very frequent dummy path through your entire system,
keeping both your data cache and instruction cache primed
38
Intel Xeon E5 processor
Source:
Intel Corporation
39
● Don’t share L3 - disable all but 1 core (or lock the cache)
● If you do have multiple cores enabled, choose your neighbours carefully:
○ Noisy neighbours should probably be moved to a different physical CPU
40
Surprises and war stories
41
Placement new can be slightly inefficient
Quick refresher:
#include <new>
Object* object = new (buffer) Object;
42
● Why do compilers check whether memory passed in might be null?
○ The C++ spec was ambiguous about what placement new must do
● Marc Glisse/Jonathan Wakely (wg21.link/cwg1748) clarified this in 2013:
○ Passing null to placement new is now Undefined Behaviour [5.3.4.15]
● For several trading systems written in gcc, this inefficiency had a considerable
negative performance effect:
○ More instructions mean fewer opportunities to inline and optimize
● There is a workaround: declare a throwing type-specific placement new
void* Object::operator new(size_t, void* mem) /* can throw */ {
return mem;
}
43
Small string optimization support
std::unordered_map<std::string, Instrument> instruments;
return instruments.find({"IBM"}) != instruments.end();
● However, even if you are using gcc >= 5.1 on an ABI-compatible Linux
distribution such as Red Hat/CentOS/Ubuntu/Fedora, you are probably still
using the old std::string implementation
○ Including copy-on-write (C.O.W.) semantics
○ First mentioned (as slow) by Herb Sutter in 1999!
44
Overhead of C++11 static local variable initialization
45
std::function may allocate
struct Point {
double dimensions[3];
};
int main() {
std::function<void()> no_op { [point = Point{}] {} };
}
main:
mov edi, 24
call operator new(unsigned long)
46
Consider inplace_function (D0419R0):
● http://github.com/WG21-SG14/SG14/blob/master/SG14/inplace_function.h
● Drop-in replacement for std::function
● Defaults to a 32 byte internal buffer
int main() {
inplace_function<void()> no_op { [point = Point{}] {} };
}
main:
xor eax, eax
ret
47
std::pow can be slow
“Bottlenecks occur in surprising places, so don't try to second guess and put in
a speed hack until you've proven that's where the bottleneck is.”
– Rob Pike
49
Measurement of low latency systems
50
Measurement of low latency systems
They are all useful in some ways, but not for micro-optimization of code
51
Measurement of low latency systems
52
Summary
“A language that doesn't affect the way you think about programming is not
worth knowing.”
– Alan Perlis
53
● Know C++ well, including your compiler
● Understand the basics of machine architecture, and how it will impact your
code
● Aim for very simple runtime logic:
○ Compilers optimize simple code the best
○ Prefer approximations instead of perfect precision where appropriate
○ Do expensive work only when you have spare time
● Conduct accurate measurement - this is essential
54
Thanks for listening!
[email protected]
55