# Benchmarking `llvm-libc`'s memory functions

## Foreword

Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all the interactions of
complex systems, so other metrics can be equally important:

- **code size** (to reduce instruction cache pressure),
- **Profile Guided Optimization** friendliness,
- **hyperthreading / multithreading** friendliness.

## Rationale

The goal here is to satisfy the [Benchmarking
Principles](https://en.wikipedia.org/wiki/Benchmark_(computing)#Benchmarking_Principles):

1. **Relevance**: Benchmarks should measure relatively vital features.
2. **Representativeness**: Benchmark performance metrics should be broadly
   accepted by industry and academia.
3. **Equity**: All systems should be fairly compared.
4. **Repeatability**: Benchmark results can be verified.
5. **Cost-effectiveness**: Benchmark tests are economical.
6. **Scalability**: Benchmark tests should measure from single server to
   multiple servers.
7. **Transparency**: Benchmark metrics should be easy to understand.

Benchmarking is a [subtle
art](https://en.wikipedia.org/wiki/Benchmark_(computing)#Challenges) and
benchmarking memory functions is no exception. Here we'll dive into the
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.

## Challenges

As seen in [README.md](README.md#stochastic-mode), the microbenchmarking
facility should focus on measuring **low latency code**. If copying a few bytes
takes on the order of a few cycles, the benchmark should be able to **measure
accurately down to the cycle**.

### Measuring instruments

There are different sources of time in a computer (ordered from high to low
resolution):

- [Performance
  Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used to
  introspect the internals of the CPU,
- [High Precision Event
  Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
  trigger short lived actions,
- [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
  to keep track of the computer's time.
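
For reference, here is a minimal sketch of reading **raw time** from user space
with `std::chrono::steady_clock` (illustrative only; the actual harness may
read time differently). Note that merely reading such a clock typically costs
on the order of tens of nanoseconds, i.e. dozens of cycles, which is far from
cycle accurate:

```cpp
#include <chrono>
#include <cstdio>

int main() {
  // steady_clock is monotonic, which makes it suitable for measuring
  // intervals (system_clock can jump when the wall clock is adjusted).
  const auto Start = std::chrono::steady_clock::now();
  // ... the code under measurement would run here ...
  const auto Stop = std::chrono::steady_clock::now();
  const auto Ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(Stop - Start);
  std::printf("elapsed: %lld ns\n", static_cast<long long>(Ns.count()));
  return 0;
}
```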

In theory **Performance Counters** provide cycle-accurate measurements via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.

### Performance counters and modern processor architecture

Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor); as a
consequence, it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight**, while others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **on the same machine, measuring the same piece of code twice
will yield different results.**

### Performance counter semantic inconsistencies and availability

Although they have the same name, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare two
micro-architectures exposing the same performance counters.**

Each vendor decides which performance counters to implement and their exact
meaning. Although we want to benchmark `llvm-libc` memory functions for all
available [target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
is **no guarantee that the counter we're interested in is available.**

### Additional sources of imprecision

- Reading performance counters is done through kernel [system
  calls](https://en.wikipedia.org/wiki/System_call). The system call itself
  is costly (hundreds of cycles) and will perturb the counter's value.
- [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
  can occur during the measurement.
- If the system is already under monitoring (virtual machines or system-wide
  profiling) the kernel can decide to multiplex the performance counters,
  leading to lower precision or even a completely missing measurement.
- The kernel can decide to [migrate the
  process](https://en.wikipedia.org/wiki/Process_migration) to a different
  core.
- [Dynamic frequency
  scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
  in during the measurement and change the duration of a tick. **Ultimately we
  care about the amount of work over a period of time.** This removes some of
  the legitimacy of measuring cycles rather than **raw time**.

### Cycle accuracy conclusion

We have seen that performance counters are not universally available, are
semantically inconsistent across micro-architectures, and are imprecise on
modern CPUs for small snippets of code.

## Design decisions

To achieve the needed precision we have to resort to more widely available
counters and derive the time from a high number of runs: going from a single
deterministic measurement to a probabilistic one.

**To get a good signal-to-noise ratio we need the running time of the piece of
code to be orders of magnitude greater than the measurement precision.**

For instance, if the measurement precision is 10 cycles, the function runtime
needs to exceed 1000 cycles to achieve a 1%
[SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio) (10 / 1000 = 1%).

### Repeating code N times until precision is sufficient

The algorithm is as follows:

- We measure the time it takes to run the code _N_ times (initially _N_ is 10,
  for instance).
- We deduce an approximation of the runtime of one iteration (= _runtime_ /
  _N_).
- We increase _N_ by _X%_ and repeat the measurement (geometric progression).
- We keep track of the _one iteration runtime approximation_ and build a
  weighted mean of all the samples so far (the weight is proportional to _N_).
- We stop the process when the difference between the weighted mean and the
  last estimation is smaller than _ε_, or when other stopping conditions are
  met (total runtime, maximum iterations, or maximum sample count).

This method allows us to be as precise as needed provided that the measured
runtime is proportional to _N_. Longer run times also smooth out imprecision
related to _interrupts_ and _context switches_. A sketch of the sampling loop
is given below.
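
This is a minimal sketch of the loop; the initial _N_, the growth factor, and
the stopping conditions are illustrative values, not the exact ones used by the
harness:

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Illustrative only: measure F repeated N times, grow N geometrically, and
// stop once the last estimate agrees with the weighted mean within Epsilon.
template <typename Function> double estimateIterationTime(Function &&F) {
  using Clock = std::chrono::steady_clock;
  constexpr double Epsilon = 0.01;   // 1% relative precision target.
  constexpr size_t MaxSamples = 100; // Additional stopping condition.
  size_t N = 10;                     // Initial iteration count.
  double TotalTime = 0;              // Sum of all measured runtimes.
  double TotalIterations = 0;        // Sum of all N, i.e. the weights.
  double WeightedMean = 0;
  for (size_t Sample = 0; Sample < MaxSamples; ++Sample) {
    const auto Start = Clock::now();
    for (size_t I = 0; I < N; ++I)
      F();
    const std::chrono::duration<double> Elapsed = Clock::now() - Start;
    const double Estimate = Elapsed.count() / N; // One-iteration runtime.
    TotalTime += Elapsed.count();
    TotalIterations += N;
    WeightedMean = TotalTime / TotalIterations; // Weight proportional to N.
    if (std::abs(Estimate - WeightedMean) < Epsilon * WeightedMean)
      break; // Converged: last estimate and weighted mean agree.
    N += N / 5; // Geometric progression: increase N by 20%.
  }
  return WeightedMean;
}
```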

Note: When measuring longer runtimes (e.g. copying several megabytes of data)
the above assumption doesn't hold anymore and the _ε_ precision cannot be
reached by increasing iterations. The whole benchmarking process becomes
prohibitively slow. In this case the algorithm is limited to a single sample
and repeated several times to get a decent 95% confidence interval.

### Effect of branch prediction

When measuring code with branches, repeating the same call again and again
allows the processor to learn the branching patterns and perfectly predict all
the branches, leading to unrealistic results.

**Decision: When benchmarking small buffer sizes, the function parameters
should be randomized between calls to prevent perfect branch prediction.**
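
For instance, here is a hypothetical sketch of a randomized `memcpy`
microbenchmark (a real harness must also keep the compiler from optimizing the
calls away, which is omitted here):

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Illustrative only: draw the copy size at random for every call so the
// branch predictor cannot lock onto a single size-dependent path.
void benchmarkSmallCopies(size_t MaxSize, size_t Calls) {
  std::vector<char> Src(MaxSize), Dst(MaxSize);
  std::minstd_rand Rng(12345); // Fixed seed for repeatability.
  std::uniform_int_distribution<size_t> SizeDist(0, MaxSize);
  for (size_t I = 0; I < Calls; ++I)
    std::memcpy(Dst.data(), Src.data(), SizeDist(Rng));
}
```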

### Effect of the memory subsystem

The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
`L2` and `L3` data caches.

We may be tempted to randomize data accesses widely to exercise all the caching
layers down to RAM, but the [cost of accessing lower layers of
memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
completely dominates the runtime for small sizes.

So, to respect the **Equity** and **Repeatability** principles, we should make
sure we **do not** depend on the memory subsystem.

**Decision: When benchmarking small buffer sizes, the data accessed by the
function should stay in `L1`.**
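
On Linux with glibc, the `L1` data cache size can be queried at runtime; here
is a hypothetical sketch of capping buffer sizes accordingly (the fallback
value and the fraction of `L1` kept in reserve are arbitrary):

```cpp
#include <unistd.h>

#include <cstdio>

int main() {
  // _SC_LEVEL1_DCACHE_SIZE is a glibc extension; it may return -1 or 0 when
  // the information is unavailable.
  const long L1DataCacheSize = sysconf(_SC_LEVEL1_DCACHE_SIZE);
  // Use a fraction of L1 so the stack and other hot data also stay resident.
  const long MaxBufferSize = L1DataCacheSize > 0 ? L1DataCacheSize / 2 : 16384;
  std::printf("capping benchmark buffers at %ld bytes\n", MaxBufferSize);
  return 0;
}
```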

### Effect of prefetching

For small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in, but for large buffers it may introduce a bias.

**Decision: When benchmarking large buffer sizes, the data should be accessed
in a random fashion to lower the impact of prefetching between calls.**
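
Here is a hypothetical sketch of such an access pattern, copying from random
offsets inside buffers much larger than the last-level cache (the buffer size
is an arbitrary example):

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Illustrative only: each call touches a different part of the buffers so the
// hardware prefetcher cannot learn a streaming pattern across calls.
void benchmarkLargeCopies(size_t CopySize, size_t Calls) {
  const size_t BufferSize = 64 * 1024 * 1024; // Much larger than L3.
  std::vector<char> Src(BufferSize), Dst(BufferSize);
  std::minstd_rand Rng(12345); // Fixed seed for repeatability.
  std::uniform_int_distribution<size_t> Offset(0, BufferSize - CopySize);
  for (size_t I = 0; I < Calls; ++I)
    std::memcpy(Dst.data() + Offset(Rng), Src.data() + Offset(Rng), CopySize);
}
```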

### Effect of dynamic frequency scaling

Modern processors implement [dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
`performance` mode the CPU will increase its frequency and run faster than
usual within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost):
_"The increased clock rate is limited by the processor's power, current, and
thermal limits, the number of cores currently in use, and the maximum frequency
of the active cores."_

**Decision: When benchmarking we want to make sure the dynamic frequency
scaling is always set to `performance`. We also want to make sure that
time-based events are not impacted by frequency scaling.**

See [README.md](README.md) on how to set this up.

### Reserved and pinned cores

Some operating systems allow [core
reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
This removes a set of perturbation sources such as process migration, context
switches, and interrupts. When a core is hyperthreaded, both logical cores
should be reserved.
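
On Linux, the benchmark process can additionally pin itself to such a reserved
core. Here is a minimal sketch using `sched_setaffinity` (the core number is
hypothetical and would come from the machine's configuration):

```cpp
#include <sched.h>

#include <cstdio>

// Pin the calling thread to a single core (pid 0 means "the calling thread").
bool pinToCore(int Core) {
  cpu_set_t Set;
  CPU_ZERO(&Set);
  CPU_SET(Core, &Set);
  if (sched_setaffinity(0, sizeof(Set), &Set) != 0) {
    std::perror("sched_setaffinity");
    return false;
  }
  return true;
}
```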

## Microbenchmark limitations

As stated in the Foreword section, a number of effects do play a role in
production but are not directly measurable through microbenchmarks. The code
size of the benchmark is (much) smaller than the hot code of real applications
and so **doesn't exhibit as much instruction cache pressure**.

### iCache pressure

Fundamental functions that are called frequently will occupy the L1 iCache
([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the
memory functions should be as small as possible.

### iTLB pressure

The same reasoning applies to the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)): oversized
functions incur [TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).

## FAQ

1. Why don't you use Google Benchmark directly?

   We reuse some parts of Google Benchmark (detection of frequency scaling, CPU
   cache hierarchy information), but when it comes to measuring memory
   functions Google Benchmark has a few issues:

   - Google Benchmark privileges code-based configuration via macros and
     builders, which is typically done in a static manner. In our case the
     parameters we need to set up are a mix of what's usually controlled by
     the framework (number of trials, maximum number of iterations, size
     ranges) and parameters that are more tied to the function under test
     (randomization strategies, custom values). Achieving this with Google
     Benchmark is cumbersome as it involves templated benchmarks and
     duplicated code. In the end, the configuration would be spread across
     command line flags (via the framework's options or custom flags) and code
     constants.
   - Output of the measurements is done through a `BenchmarkReporter` class,
     which makes it hard to access the parameters discussed above.